OpenAI Evals & Evals API: Complete Guide
- Leanware Editorial Team

You can't reliably improve what you don't measure. LLM applications need systematic testing to catch regressions when you modify prompts, switch models, or adjust parameters. OpenAI Evals provides a framework for running repeatable tests against your model outputs.
You can use the framework in two ways: through the dashboard for quick setup, or programmatically via API for integration with your development workflow. You define test cases with expected behavior, run evaluations against your prompts, and get metrics showing what passed or failed.
This guide covers how you can set up evals with the API, structure your test data, define evaluation criteria, run the evals, and inspect the results.
What Is an "Eval"?

An eval is a structured test that measures how well an LLM performs on a specific task. Each eval contains a data source with test cases, testing criteria that define correctness, and a prompt or model configuration to evaluate. Unlike unit tests that check exact outputs, evals measure quality through programmatic checks or model-based grading.
The framework separates test definition from execution. You create an eval once, then run it multiple times with different prompts or models to compare performance.
Why Evals Matter for LLM Workflows
Model outputs change when you modify prompts, switch models, or adjust parameters. Without systematic evaluation, you can't measure whether changes improve or degrade performance. Evals provide repeatable testing that catches regressions before they reach production.
Manual review doesn't scale. Applications handling thousands of requests need automated quality checks across representative samples. Evals let you detect issues like formatting errors, instruction-following failures, or hallucinations systematically.
OpenAI's Evals Ecosystem & Registry
OpenAI maintains an open-source Evals repository on GitHub containing the framework and a registry of community-contributed evaluations. The registry includes evals for tasks like factual accuracy, reasoning, and instruction following.
You can configure evals in the OpenAI dashboard or programmatically through the API. The open-source repository lets you contribute custom evals or use existing ones for your applications.
Evals API: How It Works
Core Objects & Data Model
The Evals API structures evaluations around four core concepts. An Eval defines the task and testing criteria. A Run executes the eval against a model with specific prompts. The data_source_config specifies the schema for test data. Testing criteria define how to grade outputs.
Each eval receives a unique ID when created. You reference this ID when launching runs or retrieving results. Runs also receive unique IDs for tracking execution status and fetching results.
API Endpoints & Reference
The Evals API provides endpoints for creating evaluations, launching runs, and retrieving results. The /v1/evals endpoint creates new eval configurations. The /v1/evals/{eval_id}/runs endpoint launches evaluation runs. The /v1/evals/{eval_id}/runs/{run_id} endpoint retrieves run status and results.
Requests use JSON payloads with the eval configuration. Responses include metadata about execution status, timing, and computed metrics. The API follows REST conventions with standard HTTP methods.
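As a concrete example of these conventions, the snippet below retrieves a single run with the third-party requests package. The eval and run IDs are placeholders for the values returned by the create calls shown later in this guide:
import os
import requests

BASE_URL = "https://api.openai.com/v1"
HEADERS = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}

EVAL_ID = "YOUR_EVAL_ID"  # returned when you create the eval
RUN_ID = "YOUR_RUN_ID"    # returned when you launch a run

# GET /v1/evals/{eval_id}/runs/{run_id} returns status, timing, and metrics
response = requests.get(f"{BASE_URL}/evals/{EVAL_ID}/runs/{RUN_ID}", headers=HEADERS)
response.raise_for_status()
run = response.json()
print(run["status"], run.get("result_counts"))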
Versioning & Backwards Compatibility Considerations
The Evals API maintains stability for core endpoints. New features may introduce additional fields in responses, but existing fields remain consistent. The open-source framework releases new versions through GitHub, which you can pin in your requirements to ensure consistent behavior.
Track repository releases to benefit from improvements while testing thoroughly before upgrading evaluation pipelines that run in production.
Setup & Prerequisites
Installing the Evals Library
Install the Evals package using pip:
pip install evals
The library requires Python 3.9 or later. For contributing new evals or modifying existing ones, clone the GitHub repository and install it in editable mode:
git clone https://github.com/openai/evals.git
cd evals
pip install -e .
The -e flag makes your changes available immediately without reinstalling.
Environment & API Key Configuration
Set your OpenAI API key as an environment variable:
export OPENAI_API_KEY='your-api-key-here'
The Evals library reads this automatically. Obtain API keys from the OpenAI platform dashboard under the API keys section. Keep your key secure and never commit it to version control.
Permissions, Rate Limits, and Quotas
Evals consume API credits when making model calls during runs. Standard API rate limits apply based on your account tier. Large evaluations with many parallel requests may hit rate limits without proper throttling.
Monitor usage through the OpenAI dashboard to track costs. Eval runs can accumulate charges quickly when testing against datasets with hundreds or thousands of samples.
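A rough estimate before launching a large run helps avoid surprises. The arithmetic below is a back-of-envelope sketch only; the sample count, token counts, and per-token price are illustrative placeholders, not current OpenAI pricing:
# Back-of-envelope cost estimate for an eval run.
# All numbers below are illustrative assumptions - substitute your own.
num_samples = 1000                 # test cases in the dataset
tokens_per_sample = 500            # prompt + completion tokens per case (assumed)
price_per_million_tokens = 5.00    # USD, placeholder - check current pricing

estimated_cost = num_samples * tokens_per_sample / 1_000_000 * price_per_million_tokens
print(f"Estimated cost: ${estimated_cost:.2f}")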
Building Your First Eval with the API
Designing the Data Schema
Define your test data schema when creating an eval. For an IT ticket categorization task, the schema includes the ticket text and the correct label:
{
  "type": "object",
  "properties": {
    "ticket_text": { "type": "string" },
    "correct_label": { "type": "string" }
  },
  "required": ["ticket_text", "correct_label"]
}
This schema validates that each test item contains the required fields. Setting include_sample_schema to true tells the eval to also expect model-generated output (the sample namespace referenced later in the grading templates).
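You can also check items against the same schema locally before uploading anything. A minimal sketch using the jsonschema package (an extra dependency, not part of the Evals API):
from jsonschema import validate, ValidationError

item_schema = {
    "type": "object",
    "properties": {
        "ticket_text": {"type": "string"},
        "correct_label": {"type": "string"},
    },
    "required": ["ticket_text", "correct_label"],
}

example_item = {"ticket_text": "My monitor won't turn on!", "correct_label": "Hardware"}

try:
    validate(instance=example_item, schema=item_schema)
    print("item matches the schema")
except ValidationError as err:
    print(f"invalid item: {err.message}")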
Implementing the Target Function
When using the Chat Completions API, define your prompt using developer and user messages:
from openai import OpenAI
client = OpenAI()
instructions = """
You are an expert in categorizing IT support tickets. Given the support
ticket below, categorize the request into one of "Hardware", "Software",
or "Other". Respond with only one of those words.
"""
ticket = "My monitor won't turn on - help!"
response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "developer", "content": instructions},
        {"role": "user", "content": ticket}
    ]
)
print(response.choices[0].message.content)
This prompt structure converts directly into the eval run configuration using templates.
Defining Evaluation Criteria
Testing criteria specify how to grade model outputs. A string check criterion compares output against expected values:
{
  "type": "string_check",
  "name": "Match output to human label",
  "input": "{{ sample.output_text }}",
  "operation": "eq",
  "reference": "{{ item.correct_label }}"
}
The double curly brace syntax templates values from the test data and model output. The eq operation checks for exact equality between the two rendered strings.
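To make the templating concrete, here is a rough local illustration of what the grader ends up comparing for one test case. The exact server-side processing isn't specified here; this is intuition only:
# Rough local illustration of the string_check comparison (not the API's implementation).
item = {"correct_label": "Hardware"}   # resolved from {{ item.correct_label }}
sample_output_text = "Hardware"        # resolved from {{ sample.output_text }}

passed = sample_output_text.strip() == item["correct_label"]
print("passed" if passed else "failed")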
Configuring the Data Source
Prepare test data as JSONL with each line containing one test case:
{ "item": { "ticket_text": "My monitor won't turn on!", "correct_label": "Hardware" } }
{ "item": { "ticket_text": "I'm in vim and I can't quit!", "correct_label": "Software" } }
{ "item": { "ticket_text": "Best restaurants in Cleveland?", "correct_label": "Other" } }This format includes both inputs and ground truth labels for comparison.
Creating the Eval via API
Create the eval configuration:
curl https://api.openai.com/v1/evals \
-H "Authorization: Bearer $OPENAI_API_KEY" \
-H "Content-Type: application/json" \
-d '{
  "name": "IT Ticket Categorization",
  "data_source_config": {
    "type": "custom",
    "item_schema": {
      "type": "object",
      "properties": {
        "ticket_text": { "type": "string" },
        "correct_label": { "type": "string" }
      },
      "required": ["ticket_text", "correct_label"]
    },
    "include_sample_schema": true
  },
  "testing_criteria": [
    {
      "type": "string_check",
      "name": "Match output to human label",
      "input": "{{ sample.output_text }}",
      "operation": "eq",
      "reference": "{{ item.correct_label }}"
    }
  ]
}'
The response includes a unique eval ID needed for launching runs.
Launching a Run
Upload your test data file first:
curl https://api.openai.com/v1/files \
-H "Authorization: Bearer $OPENAI_API_KEY" \
-F purpose="evals" \
-F file="@tickets.jsonl"
Note the file ID from the response. Then create an eval run:
curl https://api.openai.com/v1/evals/YOUR_EVAL_ID/runs \
-H "Authorization: Bearer $OPENAI_API_KEY" \
-H "Content-Type: application/json" \
-d '{
  "name": "Categorization test run",
  "data_source": {
    "type": "responses",
    "model": "gpt-4",
    "input_messages": {
      "type": "template",
      "template": [
        {"role": "developer", "content": "You are an expert in categorizing IT support tickets. Given the support ticket below, categorize the request into one of Hardware, Software, or Other. Respond with only one of those words."},
        {"role": "user", "content": "{{ item.ticket_text }}"}
      ]
    },
    "source": { "type": "file_id", "id": "YOUR_FILE_ID" }
  }
}'
The run executes asynchronously, processing each test case with the specified prompt.
Fetching & Interpreting Results
Check run status:
curl https://api.openai.com/v1/evals/YOUR_EVAL_ID/runs/YOUR_RUN_ID \
-H "Authorization: Bearer $OPENAI_API_KEY" \
-H "Content-Type: application/json"Completed runs include result counts and per-criteria results:
{
  "status": "completed",
  "result_counts": {
    "total": 3,
    "errored": 0,
    "failed": 0,
    "passed": 3
  },
  "per_testing_criteria_results": [
    {
      "testing_criteria": "Match output to human label",
      "passed": 3,
      "failed": 0
    }
  ]
}
The response also includes a report_url linking to the dashboard for visual analysis.
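Because runs execute asynchronously, a script can poll this endpoint until the run settles and then compute a pass rate. A minimal sketch with the requests package; the set of terminal statuses and the run IDs shown are assumptions/placeholders:
import os
import time
import requests

BASE_URL = "https://api.openai.com/v1"
HEADERS = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}
EVAL_ID = "YOUR_EVAL_ID"
RUN_ID = "YOUR_RUN_ID"

# Poll until the run stops changing; "completed", "failed", and "canceled"
# are assumed terminal states here.
while True:
    run = requests.get(
        f"{BASE_URL}/evals/{EVAL_ID}/runs/{RUN_ID}", headers=HEADERS
    ).json()
    if run["status"] in ("completed", "failed", "canceled"):
        break
    time.sleep(10)

counts = run["result_counts"]
pass_rate = counts["passed"] / counts["total"] if counts["total"] else 0.0
print(f"status={run['status']} pass_rate={pass_rate:.0%}")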
Common Use Cases & Examples
A/B Testing & Model Comparisons
Run the same eval with different models by creating separate runs:
# Test with GPT-4
curl https://api.openai.com/v1/evals/YOUR_EVAL_ID/runs \
-H "Authorization: Bearer $OPENAI_API_KEY" \
-H "Content-Type: application/json" \
-d '{ "data_source": { "model": "gpt-4", ... } }'
# Test with GPT-3.5
curl https://api.openai.com/v1/evals/YOUR_EVAL_ID/runs \
-H "Authorization: Bearer $OPENAI_API_KEY" \
-H "Content-Type: application/json" \
-d '{ "data_source": { "model": "gpt-3.5-turbo", ... } }'Compare result metrics to determine which model performs better for your task.
Regression Detection in Prompt Changes
Create runs with different prompt versions:
prompt_v1 = "Categorize this IT ticket: {{ item.ticket_text }}"
prompt_v2 = "You are an IT expert. Categorize: {{ item.ticket_text }}"
# Run with v1, check metrics
# Run with v2, compare results
Flag runs where metrics drop below baseline thresholds, indicating prompt regressions.
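One way to generate those runs programmatically is to loop over the prompt versions and launch a run per version against the same eval and data file, mirroring the run payload shown earlier. The eval ID and file ID are placeholders:
import os
import requests

BASE_URL = "https://api.openai.com/v1"
HEADERS = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}
EVAL_ID = "YOUR_EVAL_ID"
FILE_ID = "YOUR_FILE_ID"

prompts = {
    "v1": "Categorize this IT ticket: {{ item.ticket_text }}",
    "v2": "You are an IT expert. Categorize: {{ item.ticket_text }}",
}

# Launch one run per prompt version against the same eval and data file
for version, prompt in prompts.items():
    run = requests.post(f"{BASE_URL}/evals/{EVAL_ID}/runs", headers=HEADERS, json={
        "name": f"Prompt {version}",
        "data_source": {
            "type": "responses",
            "model": "gpt-4",
            "input_messages": {
                "type": "template",
                "template": [{"role": "user", "content": prompt}],
            },
            "source": {"type": "file_id", "id": FILE_ID},
        },
    }).json()
    print(version, run["id"])
Fetch each run's result_counts as shown earlier and flag any version whose pass rate falls below your baseline.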
Tool / Structured Output Evaluation
Test applications using function calling by validating structured outputs. Create testing criteria that check whether the model selected correct tools and provided valid parameters.
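How you express that in testing criteria depends on how your application surfaces tool calls. As a starting point, you can validate the structured arguments locally with a JSON Schema before encoding the same expectations in your grader. The tool name and schema below are hypothetical:
import json
from jsonschema import validate, ValidationError

# Hypothetical schema for the arguments of a "create_ticket" tool call
arguments_schema = {
    "type": "object",
    "properties": {
        "category": {"type": "string", "enum": ["Hardware", "Software", "Other"]},
        "summary": {"type": "string"},
    },
    "required": ["category", "summary"],
}

def grade_tool_call(tool_name: str, arguments_json: str) -> bool:
    """Return True if the model picked the expected tool with valid arguments."""
    if tool_name != "create_ticket":
        return False
    try:
        validate(instance=json.loads(arguments_json), schema=arguments_schema)
        return True
    except (ValidationError, json.JSONDecodeError):
        return False

print(grade_tool_call("create_ticket", '{"category": "Hardware", "summary": "Monitor dead"}'))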
Bulk Experimentation & Prompt Sweeps
Test multiple prompt variations systematically by creating separate runs for each variant. Compare metrics across all runs to identify the best-performing prompt formulation.
Best Practices & Design Guidelines
1. Choosing Good Test Cases & Edge Cases
Include diverse inputs covering common scenarios and edge cases. Test data should represent actual production patterns. Add samples where models typically struggle, like ambiguous instructions or multi-step reasoning tasks.
Balance coverage with practical dataset size. Start with 50-100 representative samples, then expand based on observed failure patterns.
2. Avoiding Overfitting to Eval Data
Don't optimize prompts solely to maximize eval scores. This creates overfitting where prompts perform well on test cases but poorly on new production data. Maintain separate eval sets for development and final validation.
Regularly refresh test data with real production samples to ensure evals remain representative of actual usage.
3. Metric Selection & Aggregation Strategy
Choose testing criteria matching your quality requirements. String checks work for categorical outputs with specific expected values. More complex criteria can use model-based grading for subjective quality assessment.
Aggregate results appropriately. Simple pass/fail counts suffice for many cases. Calculate percentages to track performance over time and compare across runs.
4. Graders: LLM vs Programmatic vs Hybrid
Programmatic graders like string checks provide deterministic scoring without additional API costs. Use them for outputs with clear correctness criteria like exact matches or format validation.
Model-based graders handle nuanced assessment but add cost and latency. The Evals framework supports custom grading logic for complex evaluation requirements.
5. Monitoring & Continuous Evaluation in Production
Run evals periodically on production traffic samples to detect quality drift. Sample requests with metadata, store them in eval format, then evaluate regularly against quality criteria.
Set up webhooks to receive notifications when eval runs complete. Subscribe to eval.run.succeeded, eval.run.failed, and eval.run.canceled events for automated monitoring.
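A minimal receiver for those events might look like the Flask sketch below. It only inspects the event type; in production you would also verify the webhook signature, which is omitted here:
from flask import Flask, request

app = Flask(__name__)

@app.route("/openai-webhooks", methods=["POST"])
def handle_eval_event():
    event = request.get_json()
    event_type = event.get("type", "")

    # React to the eval run lifecycle events mentioned above
    if event_type == "eval.run.succeeded":
        print("Eval run finished - fetch and archive its results.")
    elif event_type in ("eval.run.failed", "eval.run.canceled"):
        print(f"Eval run did not complete: {event_type}")

    return "", 200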
Integration with CI/CD & Pipelines
Add eval runs to deployment pipelines to gate releases on quality metrics. Create automated workflows that run evals and check results before allowing deployments to proceed.
Store eval results over time to track quality trends and catch gradual degradation that might not trigger single-run thresholds.
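In a CI job, the gate can be a short script that polls the run and exits non-zero when the pass rate falls below a threshold, which stops the deployment step in most CI systems. A sketch reusing the polling pattern from earlier; the 90% threshold and terminal statuses are assumptions:
import os
import sys
import time
import requests

BASE_URL = "https://api.openai.com/v1"
HEADERS = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}
EVAL_ID = os.environ["EVAL_ID"]
RUN_ID = os.environ["RUN_ID"]
THRESHOLD = 0.90  # minimum acceptable pass rate (example value)

while True:
    run = requests.get(f"{BASE_URL}/evals/{EVAL_ID}/runs/{RUN_ID}", headers=HEADERS).json()
    if run["status"] in ("completed", "failed", "canceled"):
        break
    time.sleep(10)

counts = run["result_counts"]
pass_rate = counts["passed"] / counts["total"] if counts["total"] else 0.0
print(f"pass rate: {pass_rate:.0%}")

# Non-zero exit code blocks the deployment step
sys.exit(0 if run["status"] == "completed" and pass_rate >= THRESHOLD else 1)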
Advanced Patterns & Scaling
1. Multi-Stage Eval Pipelines
Complex applications need sequential evaluation. First stage validates tool selection, second stage checks parameters, final stage assesses output quality. Structure separate evals for each stage and combine results for comprehensive assessment.
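One way to wire this together is a small driver that runs each stage's eval in order and stops early when a stage misses its bar. The eval IDs, thresholds, and the run_and_wait helper below are placeholders standing in for the run-creation and polling snippets shown earlier:
# Hypothetical multi-stage pipeline: each stage has its own eval and threshold.
stages = [
    {"name": "tool selection", "eval_id": "EVAL_ID_TOOLS", "min_pass_rate": 0.95},
    {"name": "parameter checks", "eval_id": "EVAL_ID_PARAMS", "min_pass_rate": 0.90},
    {"name": "output quality", "eval_id": "EVAL_ID_QUALITY", "min_pass_rate": 0.85},
]

for stage in stages:
    # run_and_wait is an assumed helper that creates a run and polls it, as shown earlier
    result = run_and_wait(stage["eval_id"])
    counts = result["result_counts"]
    pass_rate = counts["passed"] / counts["total"]
    print(f"{stage['name']}: {pass_rate:.0%}")
    if pass_rate < stage["min_pass_rate"]:
        print(f"Stopping: {stage['name']} is below its threshold.")
        break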
2. Dynamic & Adaptive Evaluations
Adjust evaluation behavior based on model outputs. If outputs indicate certain failure modes, apply different testing criteria. This handles cases where standard evaluation doesn't fit due to model behavior variations.
3. Parallelization & Distributed Eval Runs
The Evals API processes eval runs asynchronously by default. For very large datasets, consider splitting into multiple smaller eval runs that execute in parallel. This reduces total time while respecting rate limits.
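A simple way to split a large dataset is to shard the JSONL file, upload each shard as its own data file, and launch one run per shard. The chunk size below is arbitrary; tune it to your rate limits:
import json

CHUNK_SIZE = 500  # test cases per shard (example value)

with open("tickets.jsonl") as f:
    records = [json.loads(line) for line in f]

# Write shards like tickets_part0.jsonl, tickets_part1.jsonl, ...
for i in range(0, len(records), CHUNK_SIZE):
    shard_path = f"tickets_part{i // CHUNK_SIZE}.jsonl"
    with open(shard_path, "w") as out:
        for record in records[i:i + CHUNK_SIZE]:
            out.write(json.dumps(record) + "\n")
    print(f"wrote {shard_path}")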
Limitations, Challenges & Future Directions
The eval framework is still evolving. You can check the GitHub repository for open issues and feature requests, including richer scoring options, improved visualization, and better debugging for failed samples. Custom code evals aren’t accepted in the public registry yet, though you can implement them privately.
Cost and latency are important considerations. Running large datasets against GPT-4 can get expensive, so balance coverage with budget by sampling strategically or using cheaper models for grading. Not every evaluation requires the most capable model.
The framework currently focuses on text. Multimodal tasks, such as evaluating images or audio, require custom setups beyond the standard templates.
Your Next Step
Start with a small set of critical tests to establish a baseline, then expand to edge cases and real production failures to prevent repeats.
OpenAI offers resources for more advanced workflows: tracking prompt regressions, bulk testing across prompts and models, monitoring stored completions, fine-tuning models, and distilling capabilities to smaller, faster models. Gradually building your eval library helps you track quality without getting overwhelmed.
You can also connect with our ML and backend experts for help with evaluation workflows and automated model testing.
Frequently Asked Questions
How do I evaluate OpenAI Assistants API responses with custom knowledge bases using Evals?
Structure your eval to test both retrieval and generation quality. Create test cases with questions where you know which information should be retrieved. Define testing criteria that validate whether responses use appropriate knowledge and provide accurate answers. You can template assistant interactions into the eval data source and grade outputs against expected behavior.
Can I run Evals locally without making API calls for testing my evaluation logic?
Yes, but with limitations. You can test data loading and schema validation without API calls by using the open-source framework locally. However, actually generating model outputs for evaluation requires API calls. For development, work with small test datasets to minimize costs while iterating on eval configuration. The framework includes utilities for working with local data files before uploading them for production eval runs.




