AI Benchmarks Are Useless for Your Business—Here’s What to Do Instead
Hello folks! Welcome to another issue of Technical Tuesday.
Usually, we go deep into the technical weeds of LLMs, but today, we’re switching things up. This one’s especially practical if you’re a business owner—because we’re talking about how to set up your personal AI eval set. 🛠️
I stole this idea from Shaan Puri. Let’s be real: those fancy benchmark scores from OpenAI, Anthropic, and DeepSeek don’t mean much for your business. What actually matters is how well the latest models handle the tasks you care about.
That’s why you need to build your own business eval set. Here’s how:
- 📝 List the tasks you want to automate. Think of the repetitive, boring, or expensive ones. Start with 3-5—keep it simple and actionable.
- 🔍 Find the best-performing model for each task. Every 1-3 months, check which model is leading in your area. For AI agents, you might try OpenAI o3 or DeepSeek R1. Need to generate a product video? ByteDance’s OmniHuman-1 might be the way to go.
- 🚀 Put them to the test. Run these models against your eval set. Track where AI succeeds and where it flops.
- 📊 Build intuition over time. This process helps you spot AI’s growing potential in your workflow. You’ll know exactly when a model is good enough to be deployed.
- ⚙️ Move to full automation. Once a model reliably handles a task, integrate it. Use no-code tools or programming, and don’t forget to track AI calls and outputs for quality (see my previous post on "Look at your data"!).
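If you're comfortable with a little Python, the steps above can be sketched in a few dozen lines. This is a rough illustration, not a finished tool: `EvalCase`, `run_eval`, `ready_to_deploy`, the two stub models, the sample tasks, and the 0.9 threshold are all made-up placeholders. In real use, each "model" would be a small function wrapping a call to whichever vendor you're testing, and each check would encode what "good enough" means for your task.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# A "model" here is any function that takes a prompt and returns text.
# In practice it would wrap a vendor API call; the stubs below are placeholders.
Model = Callable[[str], str]


@dataclass
class EvalCase:
    task: str                      # which business task this case belongs to
    prompt: str                    # the input you would give the model
    passes: Callable[[str], bool]  # your own check: is this output acceptable?


def run_eval(models: Dict[str, Model], cases: List[EvalCase]) -> Dict[str, Dict[str, float]]:
    """Return pass rates as scores[model_name][task], each between 0 and 1."""
    scores: Dict[str, Dict[str, float]] = {}
    for name, model in models.items():
        passed: Dict[str, int] = {}
        total: Dict[str, int] = {}
        for case in cases:
            total[case.task] = total.get(case.task, 0) + 1
            try:
                ok = case.passes(model(case.prompt))
            except Exception:
                ok = False  # a crash counts as a failed case
            passed[case.task] = passed.get(case.task, 0) + int(ok)
        scores[name] = {task: passed[task] / total[task] for task in total}
    return scores


def ready_to_deploy(scores: Dict[str, Dict[str, float]],
                    threshold: float = 0.9) -> List[Tuple[str, str]]:
    """List (model, task) pairs whose pass rate meets the (assumed) threshold."""
    return [(m, t) for m, tasks in scores.items()
            for t, rate in tasks.items() if rate >= threshold]


# Hypothetical stand-ins for real models, so the sketch runs without API keys.
def model_a(prompt: str) -> str:
    return prompt.upper()


def model_b(prompt: str) -> str:
    return ""


# A tiny, invented eval set: two tasks, three cases.
CASES = [
    EvalCase("subject_line", "spring sale", lambda out: "SPRING" in out),
    EvalCase("subject_line", "new arrivals", lambda out: len(out) > 0),
    EvalCase("tagging", "refund request", lambda out: "REFUND" in out),
]

if __name__ == "__main__":
    results = run_eval({"model_a": model_a, "model_b": model_b}, CASES)
    for model_name, by_task in results.items():
        print(model_name, by_task)
    print("Ready to deploy:", ready_to_deploy(results))
```

Rerunning the same script every 1-3 months with new models plugged in gives you a running record of where AI succeeds and where it flops on your own tasks.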
That’s it! Setting up your business eval set takes just a couple of hours, but the impact compounds over time. It’s not urgent, but it’s how you stay ahead at the intersection of AI and your industry. 🔥