
How to Benchmark & Test AI Models for Specific Use Cases: The Complete 2025 Guide

  • Writer: Carlos Martinez
  • Sep 5
  • 16 min read

Picking an AI model without testing it in your own setup usually leads to problems. Each model, whether it’s GPT, Claude, Gemini, DeepSeek, or another, behaves differently depending on the task. One might be fast but weak on reasoning, another solid for long context but expensive, another cheap but inconsistent. These trade-offs won’t be obvious from documentation or pricing tables.


The AI model ecosystem also changes quickly. New versions and updates appear every few months, so a model that worked well six months ago may no longer be the most efficient choice. Without benchmarking, you typically discover these issues only after deployment, when switching models costs far more than testing upfront would have.


In this guide, we’ll cover practical methods for benchmarking language models. You’ll get access to the full source code, real test results, and a clear process that you can apply directly to your own use case for making data-driven decisions. 


Scope and Objectives


We'll explore:


  • Complete Python implementation for automated benchmarking.

  • Comparative analysis of Gemini, ChatGPT, Claude, and DeepSeek using structured benchmarking.

  • Full source code with Python 3.12 and dependencies.

  • Real console output and JSON results from actual tests.

  • Manual testing approaches using OpenRouter's unified interface.

  • Real-world metrics interpretation and decision-making frameworks.


Technical Requirements

  • Python Version: 3.12

  • Main Dependency: openai==1.102.0

  • Additional Dependencies: environs for environment management

  • API Access: OpenRouter API key


The OpenRouter Advantage

OpenRouter serves as a unified gateway to multiple AI models, eliminating the complexity of managing different API integrations. This aggregator approach allows you to:


  • Send identical prompts to multiple models simultaneously.

  • Compare performance metrics in real-time.

  • Track costs across different providers.

  • Switch between models without changing your codebase.
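
As a quick sketch of what that looks like in practice (assuming an OPENROUTER_API_KEY environment variable is already set), the same OpenAI-compatible call works for every model on OpenRouter; only the model string changes:

import os
import openai

# One client for every provider: OpenRouter exposes an OpenAI-compatible endpoint.
client = openai.OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

prompt = "Summarize the benefits of unit testing in two sentences."

# Send the identical prompt to two different providers by swapping the model id.
for model in ["openai/gpt-4o-mini", "anthropic/claude-3.5-sonnet"]:
    completion = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    print(model, "->", completion.choices[0].message.content[:120])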


1. Setting Up Your Testing Environment


Getting Started with OpenRouter

  1. Create an OpenRouter Account:

    • Sign up at openrouter.ai.

    • Obtain your API key from the dashboard.

    • Set up billing to access premium models.


  2. Choose Your Testing Approach:

    • Manual Testing: Use OpenRouter's web interface for quick comparisons.

    • Programmatic Testing: Leverage the API for systematic benchmarking.


Essential Tools and Resources

For manual testing, you'll need:

  • OpenRouter account with credits.

  • Spreadsheet or document for recording results.

  • Standardized prompt library.

  • Evaluation criteria checklist.


For automated testing, additionally prepare:

  • Python environment with required libraries.

  • API key configuration.

  • Result storage solution (CSV, database, etc.).


2. Core Benchmarking Concepts

Use these principles to make fair comparisons. They apply whether you’re testing two models or twenty.


1. Consistency

  • Use identical prompts across all models.

  • Maintain same temperature and parameter settings.

  • Test under similar conditions (time of day, load).


2. Reproducibility

  • Document all test parameters.

  • Version your prompts and evaluation criteria.

  • Record timestamps and model versions.


3. Relevance

  • Design tests that reflect your actual use case.

  • Include edge cases and failure scenarios.

  • Consider domain-specific requirements.
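
One lightweight way to apply the consistency and reproducibility principles is to freeze each run's configuration in a small manifest saved next to the results. The field names below are only illustrative:

import json
from datetime import datetime, timezone

# Illustrative run manifest: identical parameters for every model, recorded so the run can be reproduced later.
run_config = {
    "run_id": datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ"),
    "prompt_library_version": "v1.0",  # version your prompts
    "models": ["openai/gpt-4o-mini", "anthropic/claude-3.5-sonnet"],
    "temperature": 0.7,
    "max_tokens": 1000,
    "notes": "baseline comparison, off-peak hours",
}

with open(f"run_config_{run_config['run_id']}.json", "w") as f:
    json.dump(run_config, f, indent=2)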


What to Measure

When benchmarking AI models, focus on these critical metrics:


  • Response Quality: Accuracy, relevance, coherence.

  • Performance Metrics: Latency, tokens per second, total duration.

  • Cost Efficiency: Price per request, cost per 1000 tokens.

  • Reliability: Error rates, consistency across runs.

  • Capability Limits: Context window, instruction following, reasoning depth.


3. Understanding Model Characteristics

Model Overview and Strengths:

| Model Family | Key Variants | Strengths | Context Window | Best For |
| --- | --- | --- | --- | --- |
| OpenAI | GPT-4o-mini | Cost-effective, balanced performance | 128K tokens | High-volume applications, code generation |
| Google | Gemini Flash 1.5 | Fast response times, high throughput | 128K+ tokens | Real-time applications, creative tasks |
| Anthropic | Claude 3.5 Sonnet | Strong reasoning, comprehensive responses | 200K tokens | Complex analysis, detailed documentation |
| DeepSeek | DeepSeek Chat | Budget-friendly, decent performance | 32K tokens | Bulk processing |

Performance Characteristics from Actual Tests

To make the comparisons concrete, we ran a set of benchmarks on OpenRouter: 12 tests across 3 categories (reasoning, creative writing, and code generation), using a fixed prompt library and the same standardized parameters as the benchmarking script in Section 6 (temperature = 0.7, max tokens = 1,000). Metrics collected included tokens per second, average latency, and cost per request.


OpenAI GPT-4o-mini:

  • Tokens per second: 52.2.

  • Average response time: 6.20s.

  • Cost efficiency: $0.0053 per request (lowest).

  • Excellent for: Code generation (3.49s), cost-sensitive applications.


Google Gemini Flash 1.5:

  • Tokens per second: 109.8 (highest).

  • Average response time: 4.87s (fastest).

  • Cost: $0.0168 per request.

  • Excellent for: Creative tasks (3.53s), high-speed requirements.


Anthropic Claude 3.5 Sonnet:

  • Tokens per second: 48.4.

  • Average response time: 8.70s.

  • Cost: $0.0210 per request (premium).

  • Excellent for: Quality-critical tasks, comprehensive analysis.


DeepSeek Chat:

  • Tokens per second: 32.1.

  • Average response time: 9.80s.

  • Cost: $0.0103 per request.

  • Excellent for: Budget-conscious bulk processing.


4. Designing Effective Benchmark Tests


Creating Your Test Suite

A solid benchmark includes prompt categories that match your use case. Here’s a framework you can use:


1. Reasoning and Logic Tests

Purpose: Evaluate the model's ability to think through problems rather than recall information.


Example Prompts:

  • Mathematical reasoning: "A farmer has 150 meters of fencing to build a rectangular pen. If one side is against a barn and doesn't need fencing, what dimensions maximize the area?"


  • Logical puzzles: "A man looking at a portrait says: 'Brothers and sisters I have none, but this man's father is my father's son.' Who is in the portrait?"


2. Instruction Following Complexity

Purpose: Test the model's ability to follow multi-step instructions with constraints.


Example Prompts:

  • "Summarize the following text in exactly three bullet points, extract all proper nouns alphabetically, then determine the overall sentiment (positive/negative/neutral)."


  • "Rewrite this sentence formally, using at least one word over 10 letters, without using the word 'productive'."


3. Creative Generation

Purpose: Assess fluency, originality, and stylistic quality.


Example Prompts:

  • "Write a four-stanza poem about a futuristic city in the rain."

  • "Generate 3 slogans for an organic coffee brand focused on sustainabilit"

  • "Create a brief dialogue between a robot detective and their human partner."


4. Information Extraction and Classification

Purpose: Measure analytical capabilities and data processing accuracy.


Example Prompts:

  • Sentiment analysis with specific extraction requirements.

  • Data parsing from unstructured text.

  • Entity recognition and categorization.


5. Code Generation and Technical Tasks

Purpose: Evaluate programming capabilities and technical explanation quality.


Example Prompts:

  • "Write a Python function that filters even numbers from a list."

  • "Explain this SQL query in simple terms: SELECT department, COUNT(*) FROM employees WHERE start_date > '2023-01-01' GROUP BY department"


6. Factual Knowledge and Accuracy

Purpose: Test information reliability and hallucination tendency.


Example Prompts:

  • Specific factual questions about history, science, or current events.

  • Concept explanations for different audience levels.

  • Technical definitions and comparisons.


5. Manual Testing with OpenRouter


Step-by-Step Manual Benchmarking Process:


Step 1: Prepare Your Test Environment

  1. Open OpenRouter in your browser.

  2. Have your prompt library ready.

  3. Create a results tracking spreadsheet.

  4. Open the OpenRouter chat interface: https://openrouter.ai/chat.
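
If you track results in a spreadsheet, a simple column layout keeps manual runs comparable; the headers below are only a suggested starting point:

import csv

# Suggested columns for a manual results-tracking sheet (adjust to taste).
with open("manual_benchmark_results.csv", "w", newline="") as f:
    csv.writer(f).writerow(
        ["timestamp", "model", "category", "prompt_id", "latency_s", "total_tokens", "cost_usd", "quality_1_to_5", "notes"]
    )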


Step 2: Configure Test Parameters

  • Select models to compare (e.g., Gemini 2.5 Pro, GPT-4o, Claude Sonnet).


  • Set consistent parameters (temperature, max tokens).

  • Note the test timestamp and conditions.


Step 3: Execute Tests

  1. Input your first test prompt.


  2. Submit to each model sequentially.

  3. Record the metrics shown in the metadata section (e.g., latency, token counts, and cost).


Step 4: Analyze Results

  • Compare response quality subjectively.

  • Calculate cost-per-quality ratios.

  • Identify performance patterns.


Manual Testing Best Practices

  1. Test at Different Times: Model performance can vary based on load.

  2. Use Diverse Prompts: Don't rely on a single type of task.

  3. Document Everything: Keep detailed notes about anomalies.

  4. Verify Consistency: Run the same prompt multiple times.

  5. Consider Context: Some models perform better with more context.


6. Automated Testing Approach with Python Implementation

Here is the complete Python script for running the benchmarks programmatically. This script is designed to be run with Python 3.12.


Dependencies

First, ensure you have the necessary libraries installed. Save the following content as requirements.txt:

openai==1.102.0
environs==9.5.0

Then, install them using pip:

pip install -r requirements.txt

You will also need to set up a .env file in your project root with your OpenRouter API key:

OPENROUTER_API_KEY="your-api-key-here"

Benchmark Script (main.py)

Here is the full source code for the benchmarking script. It connects to OpenRouter, runs a series of prompts against different models, and saves the results in a JSON file.

"""
OpenRouter AI Model Benchmarking Script
Tests multiple AI models using OpenRouter's unified API and stores results in JSON format.
"""

from environs import Env
import openai
import time
import json
from datetime import datetime
from typing import Dict, List, Any
import statistics


class OpenRouterBenchmark:
    """Main benchmarking class for testing AI models through OpenRouter."""

    def __init__(self, api_key: str):
        """
        Initialize the benchmark client.

        Args:
            api_key: Your OpenRouter API key
        """
        self.client = openai.OpenAI(
            base_url="https://openrouter.ai/api/v1",
            api_key=api_key,
        )
        self.results = []

    def load_test_prompts(self) -> Dict[str, List[str]]:
        """
        Load test prompts organized by category.

        Returns:
            Dictionary of prompt categories and their prompts
        """
        return {
            "reasoning": [
                "A farmer has 150 meters of fencing to build a rectangular pen. If one side is against a barn and doesn't need fencing, what dimensions maximize the area? Explain your reasoning step-by-step.",
                "If I have 5 apples and buy 3 boxes with 6 apples each, then eat 2 apples, how many apples do I have left?",
            ],
            "instruction_following": [
                "Read this text and perform three tasks: 1) Summarize in one sentence (max 20 words), 2) List the main verbs, 3) Rewrite pessimistically. Text: 'The satellite launch was a resounding success, opening possibilities for global telecommunications.'",
                "Rewrite this sentence formally, using at least one word over 10 letters, without using 'productive': 'The meeting was productive and achieved all objectives.'",
            ],
            "creativity": [
                "Write a four-stanza poem about a futuristic city in the rain.",
                "Generate 3 slogans for an organic coffee brand focused on sustainability.",
            ],
            "code_generation": [
                "Write a Python function that takes a list of numbers and returns only the even numbers.",
                "Explain what this SQL query does: SELECT department, COUNT(*) FROM employees WHERE start_date > '2023-01-01' GROUP BY department",
            ],
            "knowledge": [
                "Who was the main architect of the Sydney Opera House?",
                "Explain CRISPR-Cas9 gene-editing technology in simple terms for a high school student.",
            ],
            "extraction": [
                "Analyze this review and extract sentiment (positive/negative/neutral), product name, and main complaint: 'I bought the AudioMax Pro headphones last week. Sound quality is spectacular, battery lasts forever. Only issue is discomfort after 2+ hours. 4 out of 5.'",
                "Extract all dates, names, and locations from: 'John Smith will meet Sarah Johnson on January 15th, 2025 at the Chicago office at 2 PM.'",
            ],
        }

    def test_model(
        self, model: str, prompt: str, category: str, max_retries: int = 3
    ) -> Dict[str, Any]:
        """
        Test a single model with a given prompt.

        Args:
            model: Model identifier (e.g., "openai/gpt-4o")
            prompt: The prompt to test
            category: Category of the prompt
            max_retries: Maximum number of retry attempts

        Returns:
            Dictionary containing test results
        """
        for attempt in range(max_retries):
            try:
                # Record start time
                start_time = time.time()

                # Make API call
                completion = self.client.chat.completions.create(
                    model=model,
                    messages=[
                        {
                            "role": "user",
                            "content": prompt,
                        }
                    ],
                    temperature=0.7,  # Consistent temperature for all tests
                    max_tokens=1000,  # Reasonable limit for most responses
                )

                # Calculate metrics
                end_time = time.time()
                latency = end_time - start_time

                # Extract response
                response_content = completion.choices[0].message.content

                # Get token counts if available
                total_tokens = (
                    getattr(completion.usage, "total_tokens", None)
                    if hasattr(completion, "usage")
                    else None
                )
                prompt_tokens = (
                    getattr(completion.usage, "prompt_tokens", None)
                    if hasattr(completion, "usage")
                    else None
                )
                completion_tokens = (
                    getattr(completion.usage, "completion_tokens", None)
                    if hasattr(completion, "usage")
                    else None
                )

                # Calculate tokens per second
                tokens_per_second = (
                    completion_tokens / latency
                    if completion_tokens and latency > 0
                    else None
                )

                # Estimate cost (you may need to adjust these based on actual pricing)
                cost = self.estimate_cost(model, prompt_tokens, completion_tokens)

                return {
                    "model": model,
                    "category": category,
                    "prompt": prompt[:100] + "..." if len(prompt) > 100 else prompt,
                    "response": (
                        response_content[:500] + "..."
                        if len(response_content) > 500
                        else response_content
                    ),
                    "latency": round(latency, 2),
                    "total_tokens": total_tokens,
                    "prompt_tokens": prompt_tokens,
                    "completion_tokens": completion_tokens,
                    "tokens_per_second": (
                        round(tokens_per_second, 1) if tokens_per_second else None
                    ),
                    "estimated_cost": cost,
                    "timestamp": datetime.now().isoformat(),
                    "success": True,
                    "error": None,
                }

            except Exception as e:
                if attempt == max_retries - 1:
                    return {
                        "model": model,
                        "category": category,
                        "prompt": prompt[:100] + "..." if len(prompt) > 100 else prompt,
                        "response": None,
                        "latency": None,
                        "total_tokens": None,
                        "prompt_tokens": None,
                        "completion_tokens": None,
                        "tokens_per_second": None,
                        "estimated_cost": None,
                        "timestamp": datetime.now().isoformat(),
                        "success": False,
                        "error": str(e),
                    }
                time.sleep(2**attempt)  # Exponential backoff

    def estimate_cost(
        self, model: str, prompt_tokens: int, completion_tokens: int
    ) -> float:
        """
        Estimate cost based on model and token usage.

        Args:
            model: Model identifier
            prompt_tokens: Number of input tokens
            completion_tokens: Number of output tokens

        Returns:
            Estimated cost in USD
        """
        if not prompt_tokens or not completion_tokens:
            return None

        # Simplified pricing (adjust based on actual OpenRouter pricing)
        pricing = {
            "google/gemini-2.0-flash-exp": {"input": 0.01, "output": 0.03},
            "openai/gpt-4o": {"input": 0.01, "output": 0.03},
            "openai/gpt-4o-mini": {"input": 0.005, "output": 0.015},
            "anthropic/claude-3.5-sonnet": {"input": 0.015, "output": 0.045},
            "anthropic/claude-3-haiku": {"input": 0.005, "output": 0.015},
        }

        # Get pricing or use default
        model_pricing = pricing.get(model, {"input": 0.01, "output": 0.03})

        # Calculate cost (prices are per 1000 tokens)
        input_cost = (prompt_tokens / 1000) * model_pricing["input"]
        output_cost = (completion_tokens / 1000) * model_pricing["output"]

        return round(input_cost + output_cost, 6)

    def run_benchmark(
        self,
        models: List[str],
        categories: List[str] = None,
        prompts_per_category: int = None,
    ):
        """
        Run the complete benchmark across multiple models and prompts.

        Args:
            models: List of model identifiers to test
            categories: Specific categories to test (None for all)
            prompts_per_category: Number of prompts per category (None for all)
        """
        test_prompts = self.load_test_prompts()

        # Filter categories if specified
        if categories:
            test_prompts = {k: v for k, v in test_prompts.items() if k in categories}

        total_tests = sum(len(prompts) for prompts in test_prompts.values()) * len(
            models
        )
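        # Note: this total counts every prompt in the selected categories; when
        # prompts_per_category is set, fewer tests actually run than total_tests reports.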
        current_test = 0

        print(
            f"Starting benchmark with {len(models)} models across {len(test_prompts)} categories"
        )
        print(f"Total tests to run: {total_tests}")
        print("-" * 50)

        for model in models:
            print(f"\nTesting model: {model}")
            model_results = []

            for category, prompts in test_prompts.items():
                # Limit prompts if specified
                prompts_to_test = (
                    prompts[:prompts_per_category] if prompts_per_category else prompts
                )

                for prompt in prompts_to_test:
                    current_test += 1
                    print(
                        f"  [{current_test}/{total_tests}] Testing {category} prompt..."
                    )

                    result = self.test_model(model, prompt, category)
                    model_results.append(result)
                    self.results.append(result)

                    # Add small delay to avoid rate limiting
                    time.sleep(1)

            # Print model summary
            self.print_model_summary(model, model_results)

    def print_model_summary(self, model: str, results: List[Dict]):
        """
        Print summary statistics for a model.

        Args:
            model: Model identifier
            results: List of test results for this model
        """
        successful_results = [r for r in results if r["success"]]

        if not successful_results:
            print(f"  No successful results for {model}")
            return

        latencies = [r["latency"] for r in successful_results if r["latency"]]
        tokens_per_sec = [
            r["tokens_per_second"] for r in successful_results if r["tokens_per_second"]
        ]
        costs = [r["estimated_cost"] for r in successful_results if r["estimated_cost"]]

        print(f"\n  Summary for {model}:")
        print(
            f"    Success rate: {len(successful_results)}/{len(results)} ({100*len(successful_results)/len(results):.1f}%)"
        )

        if latencies:
            print(
                f"    Avg latency: {statistics.mean(latencies):.2f}s (min: {min(latencies):.2f}s, max: {max(latencies):.2f}s)"
            )

        if tokens_per_sec:
            print(f"    Avg tokens/sec: {statistics.mean(tokens_per_sec):.1f}")

        if costs:
            print(f"    Avg cost per request: ${statistics.mean(costs):.4f}")

    def save_results(self, filename: str = None):
        """
        Save benchmark results to a JSON file.

        Args:
            filename: Output filename (defaults to timestamped filename)
        """
        if not filename:
            timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
            filename = f"benchmark_results_{timestamp}.json"

        # Prepare summary statistics
        summary = self.generate_summary()

        output = {
            "metadata": {
                "timestamp": datetime.now().isoformat(),
                "total_tests": len(self.results),
                "models_tested": list(set(r["model"] for r in self.results)),
                "categories_tested": list(set(r["category"] for r in self.results)),
            },
            "summary": summary,
            "detailed_results": self.results,
        }

        with open(filename, "w", encoding="utf-8") as f:
            json.dump(output, f, indent=2, ensure_ascii=False)

        print(f"\nResults saved to: {filename}")
        return filename

    def generate_summary(self) -> Dict[str, Any]:
        """
        Generate summary statistics from all results.

        Returns:
            Dictionary containing summary statistics
        """
        summary = {}

        # Group results by model
        models = {}
        for result in self.results:
            model = result["model"]
            if model not in models:
                models[model] = []
            models[model].append(result)

        # Calculate statistics for each model
        for model, model_results in models.items():
            successful = [r for r in model_results if r["success"]]

            if successful:
                latencies = [r["latency"] for r in successful if r["latency"]]
                tokens_per_sec = [
                    r["tokens_per_second"] for r in successful if r["tokens_per_second"]
                ]
                costs = [r["estimated_cost"] for r in successful if r["estimated_cost"]]

                summary[model] = {
                    "total_tests": len(model_results),
                    "successful_tests": len(successful),
                    "success_rate": round(
                        100 * len(successful) / len(model_results), 1
                    ),
                    "avg_latency": (
                        round(statistics.mean(latencies), 2) if latencies else None
                    ),
                    "min_latency": round(min(latencies), 2) if latencies else None,
                    "max_latency": round(max(latencies), 2) if latencies else None,
                    "avg_tokens_per_second": (
                        round(statistics.mean(tokens_per_sec), 1)
                        if tokens_per_sec
                        else None
                    ),
                    "avg_cost": round(statistics.mean(costs), 5) if costs else None,
                    "total_cost": round(sum(costs), 4) if costs else None,
                }

                # Add category-specific performance
                category_stats = {}
                for category in set(r["category"] for r in successful):
                    cat_results = [r for r in successful if r["category"] == category]
                    cat_latencies = [r["latency"] for r in cat_results if r["latency"]]
                    category_stats[category] = {
                        "count": len(cat_results),
                        "avg_latency": (
                            round(statistics.mean(cat_latencies), 2)
                            if cat_latencies
                            else None
                        ),
                    }
                summary[model]["by_category"] = category_stats

        return summary

    def generate_comparison_report(self):
        """
        Generate a comparison report of all tested models.
        """
        summary = self.generate_summary()

        print("\n" + "=" * 60)
        print("BENCHMARK COMPARISON REPORT")
        print("=" * 60)

        # Create comparison table
        print("\nOverall Performance Comparison:")
        print("-" * 60)
        print(
            f"{'Model':<30} {'Success':<10} {'Avg Latency':<12} {'Tokens/s':<10} {'Avg Cost':<10}"
        )
        print("-" * 60)

        for model, stats in summary.items():
            model_name = model.split("/")[-1][:28]  # Truncate long names
            success_rate = f"{stats['success_rate']}%"
            avg_latency = f"{stats['avg_latency']}s" if stats["avg_latency"] else "N/A"
            tokens_sec = (
                f"{stats['avg_tokens_per_second']}"
                if stats["avg_tokens_per_second"]
                else "N/A"
            )
            avg_cost = f"${stats['avg_cost']:.4f}" if stats["avg_cost"] else "N/A"

            print(
                f"{model_name:<30} {success_rate:<10} {avg_latency:<12} {tokens_sec:<10} {avg_cost:<10}"
            )

        # Find best performers
        print("\n" + "=" * 60)
        print("BEST PERFORMERS BY METRIC")
        print("=" * 60)

        # Fastest model
        fastest_model = min(
            summary.items(), key=lambda x: x[1]["avg_latency"] or float("inf")
        )
        print(
            f"Fastest Response: {fastest_model[0]} ({fastest_model[1]['avg_latency']}s avg)"
        )

        # Most cost-effective
        cheapest_model = min(
            summary.items(), key=lambda x: x[1]["avg_cost"] or float("inf")
        )
        print(
            f"Most Cost-Effective: {cheapest_model[0]} (${cheapest_model[1]['avg_cost']:.4f} avg)"
        )

        # Highest throughput
        highest_throughput = max(
            summary.items(), key=lambda x: x[1]["avg_tokens_per_second"] or 0
        )
        print(
            f"Highest Throughput: {highest_throughput[0]} ({highest_throughput[1]['avg_tokens_per_second']} tokens/s)"
        )


def main():
    """Main execution function."""
    env = Env()
    env.read_env()
    # Configuration
    API_KEY = env("OPENROUTER_API_KEY")

    # Models to test (adjust based on your needs and available models)
    MODELS_TO_TEST = [
        "openai/gpt-4o-mini",  # OpenAI - Fast & cost-effective
        "google/gemini-flash-1.5",  # Google - Speed champion
        "anthropic/claude-3.5-sonnet",  # Anthropic - Quality & reasoning
        "deepseek/deepseek-chat",  # DeepSeek - Budget option
    ]

    # Categories to test (None for all)
    CATEGORIES = ["reasoning", "code_generation", "creativity"]

    # Initialize benchmark
    print("OpenRouter AI Model Benchmark Tool")
    print("=" * 50)

    if API_KEY == "your-api-key-here":
        print("ERROR: Please set your OpenRouter API key")
        print("Set the OPENROUTER_API_KEY environment variable or edit the script")
        return

    benchmark = OpenRouterBenchmark(API_KEY)

    # Run benchmark
    benchmark.run_benchmark(
        models=MODELS_TO_TEST,
        categories=CATEGORIES,
        prompts_per_category=1,  # Limit to 1 prompt per category for quick testing
    )

    # Generate and print comparison report
    benchmark.generate_comparison_report()

    # Save results
    output_file = benchmark.save_results()

    print("\n" + "=" * 50)
    print(f"Benchmark complete! Results saved to: {output_file}")
    print("=" * 50)


if __name__ == "__main__":
    main()

Example Console Output

Running the script produces a detailed log of the tests and a final summary report. This helps you monitor progress and quickly see the results.

OpenRouter AI Model Benchmark Tool
==================================================
Starting benchmark with 4 models across 3 categories
Total tests to run: 24
--------------------------------------------------

Testing model: openai/gpt-4o-mini
  [1/24] Testing reasoning prompt...
  [2/24] Testing creativity prompt...
  [3/24] Testing code_generation prompt...

  Summary for openai/gpt-4o-mini:
    Success rate: 3/3 (100.0%)
    Avg latency: 6.20s (min: 3.49s, max: 10.98s)
    Avg tokens/sec: 52.2
    Avg cost per request: $0.0053

Testing model: google/gemini-flash-1.5
  [4/24] Testing reasoning prompt...
  [5/24] Testing creativity prompt...
  [6/24] Testing code_generation prompt...

  Summary for google/gemini-flash-1.5:
    Success rate: 3/3 (100.0%)
    Avg latency: 4.87s (min: 3.53s, max: 6.65s)
    Avg tokens/sec: 109.8
    Avg cost per request: $0.0168

Testing model: anthropic/claude-3.5-sonnet
  [7/24] Testing reasoning prompt...
  [8/24] Testing creativity prompt...
  [9/24] Testing code_generation prompt...

  Summary for anthropic/claude-3.5-sonnet:
    Success rate: 3/3 (100.0%)
    Avg latency: 8.70s (min: 6.19s, max: 11.35s)
    Avg tokens/sec: 48.4
    Avg cost per request: $0.0210

Testing model: deepseek/deepseek-chat
  [10/24] Testing reasoning prompt...
  [11/24] Testing creativity prompt...
  [12/24] Testing code_generation prompt...

  Summary for deepseek/deepseek-chat:
    Success rate: 3/3 (100.0%)
    Avg latency: 9.80s (min: 6.16s, max: 14.06s)
    Avg tokens/sec: 32.1
    Avg cost per request: $0.0103

============================================================
BENCHMARK COMPARISON REPORT
============================================================

Overall Performance Comparison:
------------------------------------------------------------
Model                          Success    Avg Latency  Tokens/s   Avg Cost  
------------------------------------------------------------
gpt-4o-mini                    100.0%     6.2s         52.2       $0.0053   
gemini-flash-1.5               100.0%     4.87s        109.8      $0.0168   
claude-3.5-sonnet              100.0%     8.7s         48.4       $0.0210   
deepseek-chat                  100.0%     9.8s         32.1       $0.0103   

============================================================
BEST PERFORMERS BY METRIC
============================================================
Fastest Response: google/gemini-flash-1.5 (4.87s avg)
Most Cost-Effective: openai/gpt-4o-mini ($0.0053 avg)
Highest Throughput: google/gemini-flash-1.5 (109.8 tokens/s)

Results saved to: benchmark_results_20250828_152011.json

==================================================
Benchmark complete! Results saved to: benchmark_results_20250828_152011.json
==================================================

7. Analyzing and Interpreting Results

When benchmarks are complete, the next step is to interpret them in a way that’s useful for decision-making.


Comparison Methods

Performance Visualization Methods:


1. Speed vs. Cost Matrix

  • Plot tokens/second against cost per 1000 tokens.

  • Identify the sweet spot for your budget and speed requirements.


2. Quality-Adjusted Performance

  • Calculate quality score per dollar spent.

  • Factor in both subjective quality and objective accuracy.


3. Task-Specific Leaderboards

  • Rank models by performance on specific task types.

  • Identify specialists vs. generalists.
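
A short script can pull these views straight out of the JSON file produced by the benchmark in Section 6. The filename below is just the example from that run; point it at your own results:

import json
from statistics import mean

# Load a results file produced by OpenRouterBenchmark.save_results().
with open("benchmark_results_20250828_152011.json") as f:
    data = json.load(f)

# Group successful test records by model.
by_model: dict = {}
for r in data["detailed_results"]:
    if r["success"] and r["completion_tokens"] and r["estimated_cost"]:
        by_model.setdefault(r["model"], []).append(r)

# Speed vs. cost view: throughput against cost per 1000 output tokens.
print(f"{'Model':<32}{'Tokens/s':>10}{'$ per 1K out tokens':>22}")
for model, rows in by_model.items():
    tokens_per_sec = mean(r["tokens_per_second"] for r in rows)
    cost_per_1k_out = mean(r["estimated_cost"] / r["completion_tokens"] * 1000 for r in rows)
    print(f"{model:<32}{tokens_per_sec:>10.1f}{cost_per_1k_out:>22.4f}")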


Example Analysis Framework (using OpenRouter data)

| Use Case | Best Fit | Speed (tokens/s) | Latency (s) | Input Price ($/M) | Output Price ($/M) | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| High-Volume, Cost-Sensitive | Claude Sonnet 4 | 56.53 | 2.12 | 3.00 | 15.00 | Lowest per-response cost ($0.011). Pricing increases at high volumes. |
| Real-Time, Performance-Critical | Gemini 2.5 Pro | ~100.4* | 3.64 | 1.25 | 10.00+ | Fastest throughput. Output cost scales higher with >200K tokens. |
| Balanced Performance | GPT-5.0 | 46.03 | 6.25 | 1.25 | 10.00 | Moderate speed and pricing. Simpler cost model. |


8. Decision Framework for Model Selection

When choosing a model, three main areas usually determine the outcome:


  • Use case requirements: real-time vs. batch, required accuracy, and task complexity.

  • Budget constraints: per-request cost, monthly spend, and expected usage.

  • Technical constraints: latency, context window, and integration overhead.


To compare options, you can use a weighted decision matrix. Below is an example using measured speed (throughput + latency), cost, quality, and reliability. The 1-10 ratings are derived from benchmark results and pricing.

| Factor | Weight | Gemini 2.5 Pro | GPT-5.0 | Claude Sonnet 4 |
| --- | --- | --- | --- | --- |
| Speed | 30% | 10/10 (91.3 tps, 3.6s) | 6/10 (50.3 tps, 6.3s) | 8/10 (56.5 tps, 2.1s) |
| Cost | 25% | 7/10 (1.25-2.50 in / 10-15 out) | 9/10 (1.25 in / 10 out flat) | 4/10 (3-6 in / 15-22.5 out) |
| Quality | 30% | 7/10 (ranks #2-7 across domains) | 8/10 (generalist, new release) | 9/10 (ranks #1-3 in tech, programming, marketing, science) |
| Reliability | 15% | 9/10 (1M context, high throughput) | 6/10 (400K context, higher latency) | 9/10 (1M context, lowest latency) |
| Total | 100% | 8.6 | 7.5 | 7.9 |
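
Each total is simply the sum of weight × score. A minimal sketch of the calculation, with placeholder model names and ratings to replace with your own:

# Weights must sum to 1.0; scores are your own 1-10 ratings per model.
weights = {"speed": 0.30, "cost": 0.25, "quality": 0.30, "reliability": 0.15}

scores = {  # placeholder ratings for two hypothetical candidates
    "model-a": {"speed": 9, "cost": 7, "quality": 8, "reliability": 9},
    "model-b": {"speed": 6, "cost": 9, "quality": 8, "reliability": 7},
}

for model, rating in scores.items():
    total = sum(weights[factor] * rating[factor] for factor in weights)
    print(f"{model}: {total:.1f}")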

  • Gemini 2.5 Pro is the strongest all-around choice if speed and throughput are important, with competitive pricing at low-to-mid volumes. Costs climb for very large jobs (>200K tokens).


  • Claude Sonnet 4 offers the best quality across specialized domains but at a significantly higher price per token. A good fit for high-value, lower-volume workloads.


  • GPT-5.0 is the most predictable on pricing and balanced in quality, but its smaller context window and slower speed limit its use in large-scale or latency-sensitive systems.


Production Deployment Strategies


1. Single Model Deployment

Choose the best overall performer. This is the simplest approach, with minimal operational complexity. Suitable when one model meets all your latency, cost, and quality requirements.


2. Multi-Model Routing

Route different task types to specialized models. For example, use a lower-cost model for simple queries and a high-performance model for reasoning-heavy tasks. This improves efficiency but requires routing logic and orchestration.


3. Fallback Architecture

Use a primary model with backup options. If the primary fails or hits rate limits, requests are routed to a backup. Ensures high availability but adds operational complexity.
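
A fallback can be as simple as catching errors from the primary call and retrying with a backup. The sketch below reuses the OpenRouter client setup from the benchmark script; the model ids are only examples:

import os
import openai

client = openai.OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

def complete_with_fallback(prompt: str, models=("openai/gpt-4o-mini", "deepseek/deepseek-chat")) -> str:
    """Try each model in order; return the first successful response."""
    last_error = None
    for model in models:
        try:
            completion = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
            )
            return completion.choices[0].message.content
        except Exception as exc:  # rate limits, timeouts, provider outages
            last_error = exc
    raise RuntimeError(f"All models failed: {last_error}")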


Monitoring in Production

Track these metrics continuously:


  • Latency percentiles (p50, p95, p99).

  • Error rates and timeouts.

  • Cost per request or per user.

  • Model drift indicators (accuracy decay, unexpected behavior).
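
Percentiles are straightforward to compute from the latencies you already log; a minimal sketch:

import statistics

def latency_percentiles(latencies_s: list[float]) -> dict:
    """Return p50/p95/p99 from a list of recorded request latencies (seconds)."""
    # quantiles with n=100 yields the 1st..99th percentile cut points.
    q = statistics.quantiles(latencies_s, n=100)
    return {"p50": q[49], "p95": q[94], "p99": q[98]}

print(latency_percentiles([0.8, 1.1, 1.3, 2.0, 2.4, 3.1, 4.8, 6.2, 7.5, 12.0]))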


Cost Optimization Strategies

1. Prompt Optimization

Keep prompts concise without reducing output quality. Use templates and system prompts efficiently.


2. Task-Based Model Selection

Assign tasks to the cheapest model that meets quality requirements. Reserve premium models for complex operations.


3. Caching and Reuse

Cache frequent outputs and apply semantic similarity matching to reduce redundant API calls.
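
A cache keyed on the exact model and prompt is often enough to start, with semantic similarity matching layered on later. A minimal in-memory sketch (call_fn stands in for whatever function actually hits the API):

import hashlib

_cache: dict[str, str] = {}

def cached_completion(model: str, prompt: str, call_fn) -> str:
    """Return a cached response when this exact (model, prompt) pair was seen before."""
    key = hashlib.sha256(f"{model}:{prompt}".encode()).hexdigest()
    if key not in _cache:
        _cache[key] = call_fn(model, prompt)  # only pay for the API call on a cache miss
    return _cache[key]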


Example Cost Calculation:


  • Baseline: 10,000 requests/day × 30 days × $0.011 (Claude Sonnet 4 per response) = $3,300/month

  • With optimizations: ~30% savings from caching, ~20% from prompt optimization → $1,650/month


Common Challenges and Solutions


Challenge 1: Inconsistent Model Performance

Problem: The same prompt can produce varying results.

Solution: Test responses at different times, set the temperature to 0 for deterministic outputs, and implement retry logic with result validation.


Challenge 2: Cost Overruns

Problem: Unexpectedly high costs in production.

Solution: Use request throttling, set cost alerts, and apply tiered model selection to match task complexity with the right model.


Challenge 3: Latency Spikes

Problem: Occasional slow responses affect user experience.

Solution: Implement timeouts, route time-critical tasks to faster models, and consider edge deployment if low latency is essential.


Challenge 4: Prompt Fairness

Problem: Some models perform better with specific prompt styles.

Solution: Test multiple prompt formulations, adopt model-agnostic prompt formats, and document any model-specific optimizations.


Continuous Improvement

Establish a testing cadence:


  • Weekly: Monitor production metrics, review error logs, check cost trends.

  • Monthly: Re-run benchmark suite, evaluate new model releases, update routing rules.

  • Quarterly: Conduct a full performance review, update test scenarios, reassess model selection.


Also, stay current:

  • Monitor OpenRouter and other model update announcements.

  • Follow AI community forums for best practices.

  • Track industry benchmarks for changes in model performance and pricing.


AI Model Reference Table

| Model | Created | Total Context | Max Output | Input Price ($/M) | Output Price ($/M) |
| --- | --- | --- | --- | --- | --- |
| GPT-4o (extended) | May 13, 2024 | 128K | 64K | 6 | 18 |
| GPT-5 | Aug 7, 2025 | 400K | 128K | 1.25 | 10 |
| Claude Opus 4 | May 22, 2025 | 200K | 32K | 15 | 75 |
| Claude Opus 4.1 | Aug 5, 2025 | 200K | 32K | 15 | 75 |
| Gemini 2.5 Pro | Jun 17, 2025 | 1,048,576 | 65.5K | 1.25-2.5 | 10-15 |
| Grok Code Fast 1 | Aug 26, 2025 | 256K | 10K | 0.2 | 1.5 |
| DeepSeek: R1 | Jan 20, 2025 | 163.8K | 163.8K | 0.7 | 2.4 |


Your Next Move

Start by defining clear objectives for your use case. Identify the metrics that matter (latency, throughput, cost, and response quality) and design benchmarks around them. Prepare 10-20 representative test prompts per category and select the models you want to evaluate. Use both manual and automated tests to gather baseline performance data.


Document your prompts, test results, and configuration details so you can reproduce and verify results. Analyze patterns in speed, quality, and cost to guide your model selection. 


Once a model is chosen, implement monitoring in production, and schedule regular reviews to re-run benchmarks and adjust routing or configurations as needed.


Track new model releases, pricing updates, and best practices. Follow this process to keep your model selection data-driven and aligned with operational needs.


Good luck!

 
 