Hello, I'm Chaojie Wang. I'm a senior data scientist in Analytics and Innovation. In this project I mainly worked with Joel, Xing, and Musa on the message testing side. As Joel mentioned, in our initial rollout, even when GenAI messages pass the generic evaluation that Musa introduced, in other words when they outperform today's messages in terms of empathy, transparency, clarity, and appropriateness, they still need to go through human review before being sent out to our customers. Why? Because in the message generation step, when we push the LLM hard to share more details on the root delay reasons, the model tends to pick up arbitrary content from the prompt, or even make something up out of nowhere, and present it in the generated message as the details. This is called GenAI hallucination, and we definitely want to avoid hallucinations in messages sent to customers. So our human reviewers first check whether we actually have additional information about the delay to share. If we do, they verify that the message reports that information accurately; if we don't, they confirm that the message hasn't fabricated any details.
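
To make that review logic concrete, here is a minimal sketch in Python of the decision we ask reviewers to make. Everything here is an illustrative assumption rather than our actual tooling: `DelayContext`, `ReviewResult`, and the `unsupported_claims` checker are hypothetical names, and in the initial rollout this check is done by a person, not code.

```python
# Hedged sketch of the accuracy review described above; names and data model
# are illustrative assumptions, not production code.
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class DelayContext:
    delay_reason: str
    additional_info: Optional[str]   # verified delay details we may share, if any

@dataclass
class ReviewResult:
    approved: bool
    notes: str

def review_message(
    message: str,
    context: DelayContext,
    unsupported_claims: Callable[[str, str], List[str]],  # claims in the message not backed by the source text
) -> ReviewResult:
    """Approve only if every specific claim in the message is backed by the
    additional info we actually have (or the message stays generic)."""
    source = context.additional_info or ""
    extra_claims = unsupported_claims(message, source)
    if not extra_claims:
        return ReviewResult(True, "All details traceable to verified info.")
    if context.additional_info is None:
        return ReviewResult(False, f"Fabricated details with no source info: {extra_claims}")
    return ReviewResult(False, f"Details not found in verified info: {extra_claims}")
```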

But as we gradually roll out the GenAI message generation functionality to more delay reasons, the message volume would simply become unbearable for human reviewers to handle. Therefore, we have developed unit-testing evaluators for the different delay reasons, which also leverage GenAI's capability to do automated accuracy evaluations. These evaluators are specifically designed for each delay reason, so that at each step of the chain of thought their prompts consist of the most straightforward questions and the minimum relevant information, which keeps hallucinations to a minimum. The human evaluation results from the initial rollout will be used to fine-tune these unit evaluators so that they produce evaluations consistent with human reviews and can be deployed in AutoQC, eventually streamlining the accuracy testing of GenAI messages.
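
To give a feel for what a per-delay-reason unit-test evaluator could look like, here is a hedged sketch. The check questions, the weather-delay example, the `llm` callable, and the agreement metric are all illustrative assumptions; the real evaluators and their tuning against the initial-rollout human reviews live in our AutoQC pipeline.

```python
# Hedged sketch of a per-delay-reason unit-test evaluator; prompts and names
# are illustrative assumptions, not our deployed evaluators.
from typing import Callable, Dict, List

# Each chain-of-thought step is one straightforward question paired with only
# the minimum context it needs, to keep the evaluator itself from hallucinating.
# Hypothetical checks for a weather-delay message:
WEATHER_DELAY_CHECKS: List[Dict[str, str]] = [
    {"question": "Does the message state a reason for the delay?",
     "context": "message", "expected": "yes"},
    {"question": "Is every delay detail in the message also present in the verified info?",
     "context": "message+verified_info", "expected": "yes"},
    {"question": "Does the message mention any date, location, or cause that is missing "
                 "from the verified info?",
     "context": "message+verified_info", "expected": "no"},
]

def run_unit_evaluator(message: str, verified_info: str,
                       checks: List[Dict[str, str]],
                       llm: Callable[[str], str]) -> bool:
    """Run each minimal check through the LLM; the message passes accuracy
    testing only if every answer matches the expected one."""
    for check in checks:
        context = message if check["context"] == "message" \
            else f"{message}\n---\nVerified info:\n{verified_info}"
        prompt = ("Answer strictly yes or no.\n"
                  f"Question: {check['question']}\n"
                  f"Text:\n{context}")
        answer = llm(prompt).strip().lower()
        if not answer.startswith(check["expected"]):
            return False
    return True

def agreement_with_human_reviews(cases: List[Dict], llm: Callable[[str], str]) -> float:
    """Fraction of initial-rollout cases where the evaluator matches the human
    verdict; used to tune the checks before deployment in AutoQC."""
    matches = sum(
        run_unit_evaluator(c["message"], c["verified_info"], WEATHER_DELAY_CHECKS, llm)
        == c["human_approved"]
        for c in cases
    )
    return matches / len(cases)
```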