AI Quality Control: The Unsexy Secret to Spectacular Results 💎

Hello folks!

In today's Technical Tuesday, I have some highly practical tips on improving your generative AI applications, courtesy of Hamel Husain and Hugo Bowne-Anderson!

Their secret sauce? "Looking at your data" (I know, I know - not exactly Netflix-worthy excitement 😅). But stick with me here...

Picture yourself as an AI detective 🔍 You're collecting evidence (LLM calls) and hunting for clues (errors). Sure, scrolling through rows of text isn't as thrilling as binge-watching your favorite show, but it's how you catch your AI being... less than brilliant. Whether you're building a CRM that converses with users in natural language or an article summarizer, you want to be sure your application delivers on its promises under a variety of conditions.

Convinced of the importance? Hamel has provided us with a 7-step process for error analysis, along with some suggestions for tools.

Here are the 7 steps:

  1. Start by collecting telemetry from your LLM calls. No real users yet? Hamel has a suggestion for you, which we'll cover later on.
  2. You want ground truth for what the output of each LLM call should be. If your use case has no single ground truth (e.g. summaries), then you need good taste.
  3. Inspect your data. Spreadsheets can be a good start!
  4. Create a column named 'notes' where you record what went wrong with that particular row (if anything).
  5. Once you've gone through enough rows, you can create error categories and assign one to each row.
  6. Next, create a pivot table to analyze how those errors are distributed (see the pandas sketch after this list).
  7. From here, you can decide your next steps: correcting certain types of errors, creating unit tests to stress-test those categories, or developing an LLM-as-judge (sketched right after the pivot example).
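
To make steps 4-6 concrete, here's a minimal pandas sketch of the notes → categories → pivot workflow. The file and column names here are made up for illustration, not part of Hamel's material:

```python
import pandas as pd

# Hypothetical export of your annotated traces: one row per LLM call,
# with a free-form 'notes' column and an assigned 'error_category'.
df = pd.read_csv("annotated_llm_traces.csv")

# Count how often each error category occurs, broken down by (say)
# which prompt version produced the call.
pivot = pd.pivot_table(
    df,
    index="error_category",    # one row per error type
    columns="prompt_version",  # e.g. compare prompt templates
    values="trace_id",         # any unique column works for counting
    aggfunc="count",
    fill_value=0,
)
print(pivot)
```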

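And if you end up going the LLM-as-judge route from step 7, the core of it can be as small as a single grading prompt. Here's a minimal sketch using OpenAI's Python client; the model name and grading prompt are my assumptions, not Hamel's exact recipe:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

JUDGE_PROMPT = """You are grading an AI assistant's answer.

Question: {question}
Reference answer: {reference}
Assistant's answer: {answer}

Reply with exactly one word: PASS if the assistant's answer is
consistent with the reference, FAIL otherwise."""

def judge(question: str, reference: str, answer: str) -> bool:
    """Grade one (input, reference, output) triple with an LLM judge."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any capable chat model works
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, reference=reference, answer=answer)}],
    )
    return response.choices[0].message.content.strip().upper().startswith("PASS")
```
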
Now that you're armed with this spicy recipe, here are some tools that Hamel recommended:

  • Arize Phoenix: an open-source AI observability platform designed for experimentation, evaluation, and troubleshooting
  • LangSmith: an AI observability platform, similar in scope to Arize Phoenix
  • Braintrust: an end-to-end platform for building AI apps, with observability features similar to those offered by Phoenix and LangSmith
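
These platforms capture traces for you, but if you're curious what step 1's telemetry boils down to, a hand-rolled version can be as simple as appending one JSON record per call. A bare-bones sketch (the field names are made up):

```python
import json
import time
import uuid

def log_llm_call(prompt: str, response: str, model: str,
                 path: str = "llm_traces.jsonl") -> None:
    """Append one LLM call as a JSON line - raw material for error analysis."""
    record = {
        "trace_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "model": model,
        "prompt": prompt,
        "response": response,
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```

A JSONL file like this loads straight into the spreadsheet or pandas workflow from steps 3-6.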

Lastly, what if you don't have any users or real data for this process? Hamel recommends using synthetic data: asking an LLM to simulate a user input and ideal response pair. For example, you might want a frustrated user's input along with the corresponding correct output.
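
Here's what that might look like in practice, again sketched with OpenAI's client; the persona, task, and JSON schema are illustrative assumptions:

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

def generate_synthetic_pair(persona: str, task: str) -> dict:
    """Ask an LLM to invent a realistic user input plus the ideal output."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: substitute your preferred model
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": (
            f"Simulate a {persona} user of an app that does: {task}. "
            'Return JSON with keys "user_input" (what they type) and '
            '"ideal_output" (the correct response the app should give).'
        )}],
    )
    return json.loads(response.choices[0].message.content)

pair = generate_synthetic_pair("frustrated", "summarizing news articles")
print(pair["user_input"])
print(pair["ideal_output"])
```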

Try these suggestions and let us know how they help your AI applications!