Sorry for the late reply. I had to rerun a few evaluations while summarizing the comparison results. I switched to other work while waiting for the results, and then it completely slipped my mind.

The use case here is to use an LLM to evaluate whether the GenAI messages for flights delayed due to weather include accurate weather location information (departure/arrival airports or enroute) when that information is available in the context provided during message generation. The evaluator follows a three-step CoT:

  • Find the weather location info in the context, if it exists
  • Find the weather location info in the GenAI delay message
  • Compare the results of the previous two steps to see whether the weather location included in the GenAI message is accurate (a rough sketch of this three-step chain is included below)

In the attached CSV file, you can find the prompts, raw responses, and parsed evaluation results for all three steps in columns AC to AT. The corresponding manual evaluation results for these three steps are in columns X, V, and Z.
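To make the flow concrete, here is a minimal Python sketch of how the three steps chain together. The call_llm helper, the prompt wording, and the JSON field names are placeholders I made up for illustration; the actual prompts and raw responses are the ones in columns AC to AT.

    import json

    def call_llm(prompt: str) -> str:
        # Placeholder for the model call; the real client and model config are not shown here.
        raise NotImplementedError

    def evaluate_message(context: str, delay_message: str) -> dict:
        """Run the three-step CoT evaluation for one weather-delay message."""
        # Step 1: extract weather location info (departure/arrival airport or enroute) from the context, if any.
        step1_prompt = (
            "From the context below, extract any weather location information "
            "(departure airport, arrival airport, or enroute). Respond in JSON as "
            '{"location_in_context": "<location or null>"}.\n\nContext:\n' + context
        )
        step1 = json.loads(call_llm(step1_prompt))

        # Step 2: extract weather location info from the GenAI delay message.
        step2_prompt = (
            "From the delay message below, extract any weather location information. "
            'Respond in JSON as {"location_in_message": "<location or null>"}.\n\nMessage:\n' + delay_message
        )
        step2 = json.loads(call_llm(step2_prompt))

        # Step 3: compare the two extractions and judge whether the message's location is accurate.
        step3_prompt = (
            "Given the location found in the context and the location found in the message, "
            "decide whether the weather location in the message is accurate. Respond in JSON as "
            '{"accurate": true, "reason": "<short explanation>"}.\n\n'
            f"Context location: {step1}\nMessage location: {step2}"
        )
        step3 = json.loads(call_llm(step3_prompt))

        return {"step1": step1, "step2": step2, "step3": step3}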

Claude v2 was able to complete all three steps for all 100 messages, and 96/100 of its evaluations were consistent with the manual evaluation results (the notes in column AV show that only one of the four inconsistent evaluations is actually wrong; the other three are arguably due to incorrect manual evaluations). With the same prompts, the other model in the comparison failed to complete 14/100 evaluations because at least one of its responses in the three steps was not valid JSON, and 53 out of 100 messages had evaluation results inconsistent with the manual evaluations.
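For reference, "did not complete" here means that at least one of the three step responses could not be parsed as JSON. A small sketch of that check (the function names are my own, not what is in the pipeline):

    import json

    def parse_step_response(raw_response: str):
        # Try to parse one step's raw response as JSON; return None if it is not valid JSON.
        try:
            return json.loads(raw_response)
        except (json.JSONDecodeError, TypeError):
            return None

    def evaluation_completed(raw_responses: list[str]) -> bool:
        # An evaluation counts as complete only if all three step responses parse successfully.
        return all(parse_step_response(r) is not None for r in raw_responses)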