A week after the famous, or infamous, OpenAI Dev Day, we at Confident AI released JudgementalGPT, an LLM agent built using OpenAI's Assistants API and designed specifically to evaluate other LLM applications. What started off as an experimental idea quickly turned into a prototype we were eager to ship, as users told us that JudgementalGPT gave more accurate and reliable results than other state-of-the-art LLM-based evaluation approaches such as G-Eval.
Understandably, given that Confident AI is the world's first open-source evaluation infrastructure for LLMs, many users demanded more transparency into how JudgementalGPT was built after our initial public release:
I thought it's all open source, but it seems like JudgementalGPT, in particular, is a black box for users. It would be great if we had more knowledge on how this is built.
So here you go, dear anonymous internet stranger, this article is dedicated to you.
The authors of G-Eval state that:
Conventional reference-based metrics, such as BLEU and ROUGE, have been shown to have relatively low correlation with human judgments, especially for tasks that require creativity and diversity.
For those who don't already know, G-Eval is a framework that uses Large Language Models (LLMs) with chain-of-thought (CoT) prompting to evaluate the quality of generated text in a form-filling paradigm: the LLM first generates a list of evaluation steps from your criteria, then uses those steps to fill in a score. If you've ever tried implementing a version of your own, you'll quickly find that using LLMs for evaluation presents its own set of problems.
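To make that concrete, here's a rough sketch of the G-Eval-style flow using the OpenAI Python SDK (v1+). The prompt wording, model choice, and function names are illustrative assumptions on my part, not the exact prompts from the G-Eval paper or from DeepEval:

```python
# Illustrative sketch of a G-Eval-style evaluator: generate evaluation steps
# via chain-of-thought, then score the output in a form-filling fashion.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_evaluation_steps(criteria: str) -> str:
    # Step 1 (CoT): ask the model to break the criteria into evaluation steps
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": f"Given the evaluation criteria:\n{criteria}\n\n"
                       "Generate a concise, numbered list of evaluation steps.",
        }],
    )
    return response.choices[0].message.content

def evaluate(criteria: str, input_text: str, actual_output: str) -> str:
    steps = generate_evaluation_steps(criteria)
    # Step 2 (form-filling): ask the model to follow the steps and emit a score
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": f"Evaluation steps:\n{steps}\n\n"
                       f"Input:\n{input_text}\n\n"
                       f"Output to evaluate:\n{actual_output}\n\n"
                       "Follow the steps and reply with a score from 1-5 and a short reason.",
        }],
    )
    return response.choices[0].message.content
```

Even with this structure in place, the evaluator inherits the quirks of the underlying model.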
In fact, another paper that explored LLM-as-a-judge pointed out that using LLMs as evaluators is flawed in several ways. For example, GPT-4 gives preferential treatment to self-generated outputs, is not very good at math (but neither am I), and is prone to verbosity bias, meaning it favors longer, wordier responses over shorter, more accurate alternatives. (An initial study found that GPT-4 exhibits verbosity bias 8.75% of the time.)
Can you see how this becomes a problem if you're trying to evaluate a summarization task?
Here's a surprise: JudgementalGPT isn't composed of one evaluator built using the new OpenAI Assistants API, but multiple. That's right, behind the scenes, JudgementalGPT is a proxy for multiple assistants that perform different evaluations depending on the evaluation task at hand. Here are the problems JudgementalGPT was designed to solve:
Another insight we gained while integrating G-Eval into our open-source project DeepEval was that LLM-generated evaluation steps tend to be arbitrary and generally do not provide useful guidance for evaluation. Some of you might also wonder what happens when JudgementalGPT can't find a suitable evaluator for a particular evaluation task; for this edge case, we default back to G-Eval. Here's a quick architecture diagram of how JudgementalGPT works:
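To complement the diagram, here's a simplified sketch of the routing idea: a proxy that dispatches to a task-specific assistant when one fits, and falls back to G-Eval when it doesn't. The assistant IDs, task labels, and helper names below are hypothetical and not our actual implementation:

```python
# Hypothetical sketch of JudgementalGPT-style routing: task-specific assistants
# behind a single proxy, with a G-Eval-style metric as the fallback.
import time
from openai import OpenAI

client = OpenAI()

# Hypothetical mapping from evaluation task to a pre-created assistant ID
ASSISTANT_IDS = {
    "summarization": "asst_...",  # placeholder IDs
    "relevancy": "asst_...",
}

def evaluate(task: str, prompt: str) -> str:
    assistant_id = ASSISTANT_IDS.get(task)
    if assistant_id is None:
        # No suitable evaluator for this task: default back to G-Eval
        return g_eval_fallback(prompt)

    # Run the task-specific assistant via the Assistants API (beta)
    thread = client.beta.threads.create()
    client.beta.threads.messages.create(
        thread_id=thread.id, role="user", content=prompt
    )
    run = client.beta.threads.runs.create(
        thread_id=thread.id, assistant_id=assistant_id
    )
    while run.status not in ("completed", "failed", "expired", "cancelled"):
        time.sleep(1)
        run = client.beta.threads.runs.retrieve(thread_id=thread.id, run_id=run.id)

    messages = client.beta.threads.messages.list(thread_id=thread.id)
    return messages.data[0].content[0].text.value  # latest assistant message

def g_eval_fallback(prompt: str) -> str:
    # Placeholder for a G-Eval-style evaluation (see the earlier sketch)
    raise NotImplementedError
```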
As I was writing this article, I discovered a recent paper introducing Prometheus, "a fully open-source LLM that is on par with GPT-4's evaluation capabilities when the appropriate reference materials (reference answer, score rubric) are accompanied", which similarly requires evaluation steps to be explicitly defined rather than generated by the LLM.
One unresolved issue concerns the accuracy challenges stemming from a single digit dominating evaluation scores. This phenomenon, in theory, isn't exclusive to older models and is likely to affect newer versions like gpt-4-1106-preview as well. So I'm keeping an open mind about how this might affect JudgementalGPT, and we're really looking forward to more research that will either back up what we think or give us a whole new perspective; either way, I'm all ears.
Lastly, there are still intricacies involved in defining our own set of evaluators. For example, just as G-Eval isn't a one-size-fits-all solution, neither is summarization, or relevancy. Any metric that is open to interpretation is guaranteed to disappoint users who expect something different. For now, the best solution is to have users clearly define their evaluation criteria to rid LLMs of any evaluation ambiguity.
At the end of the day, there's no one-size-fits-all solution for LLM-based evaluations, which is why engineers and data scientists are frequently disappointed by non-human evaluation scores. However, by defining specific and concise evaluation steps for different use cases, LLMs can navigate ambiguity better, since they're given more guidance on what a human might expect for different evaluation criteria.
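For instance, with DeepEval's G-Eval metric you can spell out the evaluation steps yourself instead of leaving them to the LLM. This is only a sketch; parameter names and defaults may differ between DeepEval versions:

```python
# Sketch: defining explicit evaluation steps for a summarization use case with
# DeepEval's GEval metric (parameter names may vary across versions).
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

summarization_metric = GEval(
    name="Summarization",
    evaluation_steps=[
        "Check that the summary only contains facts present in the original text.",
        "Penalize summaries that are longer than necessary, even if fluent.",
        "Reward summaries that cover every major point of the original text.",
    ],
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
)

test_case = LLMTestCase(
    input="<original text to summarize>",
    actual_output="<your LLM application's summary>",
)

summarization_metric.measure(test_case)
print(summarization_metric.score, summarization_metric.reason)
```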
P.S. By now, those of you who read between the lines will probably know that the key to building a better evaluator is to tailor it to specific use cases, and OpenAI's new Assistants API, along with its code interpreter functionality, is merely the icing on the cake (and a good marketing strategy!).
So, dear anonymous internet stranger, I hope you're satisfied, and till next time.
Subscribe to our weekly newsletter to stay confident in the AI systems you build.