
Announcing Lens for LLMs: Combining Human and Automated LLM Evaluation

We’re excited to announce Lens for LLMs – our product for evaluating LLM applications with human and automated feedback. 

Improving LLM applications is hard because there often isn’t a clear ground truth. Without ground truth, human evaluators are the gold standard, but are expensive and slow. Automated evaluators are fast and fine-grained, but may not accurately reflect human preferences.

Lens combines the best of both approaches: it integrates a wide array of automated evaluators with a small set of human feedback to efficiently evaluate a single LLM, compare multiple LLMs, and monitor LLM performance over time. This builds on the capabilities of LangCheck, our open-source Python library for LLM evaluation with thousands of monthly downloads.
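
As a rough illustration of the kind of automated evaluation LangCheck provides, here's a minimal sketch that scores a couple of RAG outputs locally. It assumes the langcheck.metrics.toxicity() and factual_consistency() functions available in recent LangCheck releases; the outputs and sources are made up for illustration.

```python
# A minimal sketch of automated evaluation with LangCheck (the open-source
# library that Lens builds on). Assumes langcheck.metrics.toxicity() and
# langcheck.metrics.factual_consistency() from recent LangCheck releases;
# the outputs and sources below are made up for illustration.
import langcheck

# Answers produced by your RAG application
generated_outputs = [
    "The NIST AI RMF is a voluntary framework for managing AI risks.",
    "I'm sorry, I can't answer that question.",
]

# The source passages retrieved for each answer
sources = [
    "The NIST AI Risk Management Framework (AI RMF 1.0) is a voluntary "
    "framework for managing risks associated with AI systems.",
    "The NIST AI Risk Management Framework (AI RMF 1.0) is a voluntary "
    "framework for managing risks associated with AI systems.",
]

# Each metric returns per-example scores that you can inspect or threshold
print(langcheck.metrics.toxicity(generated_outputs))
print(langcheck.metrics.factual_consistency(generated_outputs, sources))
```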

Lens for LLMs is currently in private beta with a small group of users. If you’re interested in trying out Lens, sign up here!

Example LLM Application

To demonstrate Lens for LLMs, let’s say we’ve built a RAG application to answer questions about international AI regulations and standards. We want to evaluate whether this RAG application works better with Google’s Gemini 1.5 Pro or OpenAI’s GPT-4.¹

Using Automated Evaluators

First, let’s take a look at the automated evaluators in Lens. We can select a few evaluators that are relevant for this application, such as Factual Consistency, Toxicity, and Answer Relevance.

After you generate the report, this table shows the input questions and the output answers from Gemini 1.5 and GPT-4 (these can come from an evaluation dataset, application logs, etc.). The Metrics column shows how the automated evaluators rate each answer, helping you quickly identify good and bad examples.
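
If you want to reproduce this kind of report outside of Lens, it boils down to a per-question table of answers and scores, roughly like the sketch below. The score_answer_relevance() helper is a hypothetical placeholder for whichever Answer Relevance evaluator you use, and the question, answers, and scores are illustrative only.

```python
# A sketch of a side-by-side report table. score_answer_relevance() is a
# hypothetical stand-in for a real Answer Relevance evaluator, and the toy
# heuristic below is for illustration only.
import pandas as pd

questions = [
    "Can the use of the NIST AI RMF guarantee that an AI system is ethical and unbiased?",
]
outputs_a = [  # Gemini 1.5 (Model A)
    "I'm sorry, I can't help with that question.",
]
outputs_b = [  # GPT-4 (Model B)
    "No. The NIST AI RMF is a voluntary risk-management framework; following it "
    "cannot guarantee that an AI system is ethical or unbiased.",
]

def score_answer_relevance(questions, answers):
    """Hypothetical placeholder: returns a relevance score per answer."""
    return [0.0 if "can't" in answer else 1.0 for answer in answers]

report = pd.DataFrame({
    "question": questions,
    "output_a": outputs_a,
    "output_b": outputs_b,
    "answer_relevance_a": score_answer_relevance(questions, outputs_a),
    "answer_relevance_b": score_answer_relevance(questions, outputs_b),
})
print(report)
```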

For example, the first row shows the question “Can the use of the NIST AI RMF guarantee that an AI system is ethical and unbiased?”, with answers from both Gemini 1.5 (Output A) and GPT-4 (Output B). While GPT-4 answers the question appropriately, Gemini 1.5 refrains from answering, which is reflected by the difference in the Answer Relevance scores. 

In addition to individual examples, you can compare the two models on the entire dataset. According to the chart below, GPT-4 (Model B) does generate responses with higher Answer Relevance than Gemini 1.5 (Model A) across the full dataset.

Finally, you’ll usually want to evaluate specific use cases (“data segments”). For example, we can see that Gemini 1.5 performs particularly poorly on questions about the NIST AI RMF.
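
Under the hood, a data segment is just a labeled subset of the evaluation set, so both the dataset-level and segment-level comparisons reduce to grouped averages over per-question scores. The sketch below uses made-up scores and a keyword-based segment label purely for illustration.

```python
# A sketch of dataset-level and segment-level comparison. The scores are
# made up for illustration; in practice they would come from the automated
# evaluators above.
import pandas as pd

report = pd.DataFrame({
    "question": [
        "Can the use of the NIST AI RMF guarantee that an AI system is ethical and unbiased?",
        "What obligations does the EU AI Act place on high-risk AI systems?",
    ],
    "answer_relevance_a": [0.1, 0.9],  # Gemini 1.5 (Model A)
    "answer_relevance_b": [0.9, 0.9],  # GPT-4 (Model B)
})

# Overall comparison across the full dataset
print(report[["answer_relevance_a", "answer_relevance_b"]].mean())

# Segment-level comparison: tag each question with a data segment, then group
report["segment"] = report["question"].str.contains("NIST AI RMF").map(
    {True: "NIST AI RMF", False: "other"}
)
print(report.groupby("segment")[["answer_relevance_a", "answer_relevance_b"]].mean())
```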

Combining Human + AI Evaluation

As we saw above, automated evaluators are fast, fine-grained, and customizable. However, it’s important to check that these automated results align well with human judgment. Lens enables this by combining a small set of human feedback with the automated evaluations.

Lens includes several tools for collecting human feedback, designed to make the process as efficient as possible. For example, to compare Gemini 1.5 and GPT-4, you can use this built-in pairwise comparison tool:

After completing a small number of comparisons, Lens automatically combines the automated and human evaluations to determine whether Gemini 1.5 or GPT-4 works better in this use case.

The chart above shows that GPT-4 (Model B) slightly outperforms Gemini 1.5 (Model A). The “Combined” interval is the final result of merging the human and automated evaluations – it’s a confidence interval showing that Model B wins between 55% and 75% of the time.

The rest of the chart shows more detail. Interestingly, while the automated evaluator strongly prefers GPT-4, our manual evaluation showed a more neutral result. Lens merged both results into a statistically valid confidence interval showing that GPT-4 has a higher win rate.
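
The post doesn’t spell out the statistical model behind the “Combined” interval, but here’s one naive way to get an interval of this shape: pool a large set of automated pairwise preferences with a small set of human judgments and bootstrap the win rate. The preference counts below are made up, and this pooled bootstrap is only a stand-in for whatever method Lens actually uses.

```python
# A naive sketch of turning pairwise preferences into a win-rate confidence
# interval: pool automated and human judgments and bootstrap. The counts are
# made up, and this is a stand-in for Lens's actual statistical method.
import random

automated_prefs = ["B"] * 80 + ["A"] * 20   # automated evaluator strongly prefers Model B
human_prefs = ["B"] * 6 + ["A"] * 5         # a small, more neutral set of human judgments

def bootstrap_win_rate_ci(prefs, n_resamples=10_000, seed=0):
    """95% bootstrap confidence interval for Model B's win rate."""
    rng = random.Random(seed)
    win_rates = sorted(
        sum(rng.choice(prefs) == "B" for _ in prefs) / len(prefs)
        for _ in range(n_resamples)
    )
    return win_rates[int(0.025 * n_resamples)], win_rates[int(0.975 * n_resamples)]

low, high = bootstrap_win_rate_ci(automated_prefs + human_prefs)
print(f"Model B win rate: {low:.0%} to {high:.0%} (95% bootstrap CI)")
```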

This one comparison isn’t the end of the story – as you try out newer models, prompts, and retrieval pipelines, Lens enables you to easily monitor and improve your LLM application’s performance over time.

Multilingual Evaluation

As of the beta launch, Lens supports evaluating LLMs in English, Japanese, Chinese, and German, with more languages coming soon!
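
For reference, LangCheck (which Lens builds on) exposes language-specific metric subpackages. Here’s a minimal sketch, assuming the langcheck.metrics.ja subpackage available in recent LangCheck releases; check the LangCheck docs for the metrics available in each language.

```python
# A minimal multilingual sketch with LangCheck. Assumes the language-specific
# subpackages (e.g. langcheck.metrics.ja for Japanese) shipped in recent
# LangCheck releases.
import langcheck

ja_outputs = ["NIST AI RMFは、AIリスクを管理するための任意のフレームワークです。"]

# Japanese text is scored with the Japanese-specific metric implementations
print(langcheck.metrics.ja.toxicity(ja_outputs))
```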

We’ve also run a similar evaluation on a Japanese dataset, where Gemini 1.5 outperformed GPT-4. Click here to read more!

Beta Signup

Lens for LLMs is currently in private beta with a small group of users. If you’re interested in trying out Lens, sign up here!

¹ The evaluations in this article are for illustrative purposes only and do not constitute an official evaluation of GPT-4 or Gemini.

Get in Touch

Interested in a product demo or discussing how Citadel AI can improve your AI quality? Please reach out to us here or by email.
