Confident AI Introduction
Without LLM evaluation best practices in place, your testing results aren't really valid, and you might be iterating back and forth on the wrong things, which means your LLM application isn't nearly as performant as it should be.
Confident AI is the LLM evaluation platform for DeepEval. It is native to DeepEval, and was designed for teams building LLM applications to maximize their performance, and to safeguard against unsatisfactory LLM outputs. While DeepEval's open-source metrics are great for running evaluations, there is so much more to building a robust LLM evaluation workflow than collecting metric scores.
If you're serious about LLM evaluation, Confident AI is for you.
Apart from running the actual evaluations, you'll need a way to:
- Curate a robust testing dataset
- Perform LLM benchmark analysis
- Tailor evaluation metrics to your opinions
- Improve your testing dataset over time
Confident AI enables this by offering an opinionated, centralized platform to manage all of the above, which means more accurate, informative, and faster insights for you, and lets you identify performance gaps in your LLM system and figure out how to close them.
Why Confident AI?
If your team has ever tried building its own LLM evaluation pipeline, here is the list of problems your team has likely encountered (and it's a long list):
Dataset Curation Is Fragmented And Annoying
- Your team often juggles tools like Google Sheets or Notion to curate and update datasets, leading to constant back-and-forth between engineers and domain-expert annotators.
- There is no "source of truth", since datasets aren't kept in sync with the codebase that runs your evaluations (see the sketch below).
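With Confident AI as the source of truth, your evaluation code can instead pull the latest version of a centrally managed dataset at runtime. A minimal sketch using deepeval's EvaluationDataset; the alias "My Evals Dataset" is a placeholder for your own dataset's alias:

```python
from deepeval.dataset import EvaluationDataset

# Pull the latest version of a centrally curated dataset from Confident AI,
# so code and annotators always work off the same data
dataset = EvaluationDataset()
dataset.pull(alias="My Evals Dataset")

# Goldens curated by your domain experts are now available in code
print(f"Pulled {len(dataset.goldens)} goldens")
```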
Evaluation Results Are (Still) Vibe Checks Rather Than Experimentation
- You basically just look at failing test cases, but they don't provide actionable insights, and sharing them across your team is hard.
- It's impossible to compare benchmarks side-by-side to understand how changes impact performance for each unit test, making it more guesswork than experimentation (see the sketch below).
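For instance, deepeval's evaluate() function can log the hyperparameters behind each test run, so results on Confident AI can be compared side-by-side per configuration rather than eyeballed. A rough sketch; the metric choice, model name, and prompt version label are illustrative, and the exact signature may vary by deepeval version:

```python
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="What are your return policies?",
    # Replace with your LLM application's actual output
    actual_output="You can return any item within 30 days of purchase.",
)

# Logging hyperparameters alongside results lets benchmarks be compared
# across prompt/model versions instead of guessing from vibes
evaluate(
    test_cases=[test_case],
    metrics=[AnswerRelevancyMetric()],
    hyperparameters={"model": "gpt-4o", "prompt version": "v2"},
)
```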
Testing Data Are Static With No Easy Way To Keep Them Updated
- Your LLM application's needs and priorities evolve in production, but your datasets don't.
- Figuring out how to query and incorporate real-world interactions into evaluation datasets is tedious and error-prone (a sketch follows this list).
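One way to keep test data evolving is to push new goldens, for example ones curated from real production interactions, back to the same centralized dataset. A minimal sketch with placeholder values:

```python
from deepeval.dataset import EvaluationDataset, Golden

# A golden curated from a real production interaction (placeholder values)
golden = Golden(input="Do you ship internationally?")

# Push it to the centralized dataset on Confident AI so your
# testing data stays in step with production
dataset = EvaluationDataset(goldens=[golden])
dataset.push(alias="My Evals Dataset")
```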
Building A/B Testing Infrastructure Is Hard And Current Tools Don't Cut It
- Setting up A/B testing for prompts/models to route traffic between versions is easy, but figuring out which version performed better, and in which areas, is hard.
- Tools like PostHog or Mixpanel give user-level analytics, while other LLM observability tools focus too much on cost and latency; none of these tell you anything about end-output quality (see the sketch below).
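To make the comparison about output quality, deepeval can report each production response to Confident AI tagged with the hyperparameters that produced it. A hedged sketch; the event name, model, and version tag are placeholders, and the exact monitor() signature may vary by deepeval version:

```python
import deepeval

# Report a production response to Confident AI, tagged with the version
# that produced it, so A/B comparisons are grounded in output quality
response_id = deepeval.monitor(
    event_name="Chatbot",  # placeholder event name
    model="gpt-4o",
    input="Do you ship internationally?",
    response="Yes, we ship to over 50 countries.",
    hyperparameters={"prompt version": "v2"},  # tag for A/B comparison
)
```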
Human Feedback Doesn't Lead To Improvements
- Teams spend time collecting feedback from end users or internal reviewers, but there's no clear path to integrating it back into datasets.
- Making good use of feedback takes a lot of manual effort, which unfortunately wastes everyone's time (a sketch of one alternative follows).
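deepeval also exposes a way to attach human feedback to a monitored response so it lands on Confident AI rather than in a spreadsheet. A hedged sketch that reuses the response_id returned by deepeval.monitor() above; the rating scale shown is illustrative:

```python
import deepeval

# Attach end-user or reviewer feedback to a monitored response, so it can
# later be incorporated back into your evaluation dataset on Confident AI
deepeval.send_feedback(
    response_id=response_id,  # returned by deepeval.monitor() above
    rating=2,                 # e.g. on a 1-5 rating scale
    explanation="Answer was off-topic.",
)
```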
There's No End To Manual Human Intervention
- Teams rely on human reviewers to gatekeep LLM outputs before they reach users in production, but the process is ad hoc, unstructured, and never-ending.
- There's no automation to focus reviewers on high-risk areas or repetitive tasks.
Confident AI solves all of your LLM evaluation problems so you can stop going around in circles. Here's a diagram outlining how Confident AI works:
Login to Confident AI
Everything in deepeval is already automatically integrated with Confident AI, including any custom metrics you've built on deepeval. To start using Confident AI with deepeval, simply log in via the CLI:
deepeval login
Follow the instructions displayed in the CLI (create an account, copy your Confident API key, and paste it into the CLI), and you're good to go.
You can also login directly in Python if you already have a Confident API Key:
import deepeval

deepeval.login_with_confident_api_key("your-confident-api-key")
Or, via the CLI:
deepeval login --confident-api-key "your-confident-api-key"
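In non-interactive environments such as CI/CD, you may also be able to supply the key through an environment variable instead of logging in; the CONFIDENT_API_KEY variable name below is an assumption about how deepeval picks up credentials, so check your deepeval version's docs:

```python
import os

# Assumption: deepeval reads the Confident API key from this environment
# variable; in CI, set it via your pipeline's secrets rather than in code
os.environ["CONFIDENT_API_KEY"] = "your-confident-api-key"
```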