
Claude Sonnet 4.5: Features, Benchmarks & Pricing (2025)

  • Writer: Carlos Martinez
  • Sep 30
  • 9 min read

Updated: Oct 1

TLDR: Claude Sonnet 4.5 scores 77.2% on SWE-bench Verified (82.0% with parallel compute), 50.0% on Terminal-Bench, and 61.4% on OSWorld. It reaches 100% on AIME 2025 with Python tools and 83.4% on GPQA Diamond. Pricing is $3 per million input tokens and $15 per million output tokens, and you can use it on the web, iOS, Android, the Claude Developer Platform, Amazon Bedrock, and Google Cloud Vertex AI.


Anthropic released Claude Sonnet 4.5 on September 29, 2025, as the latest model in the Claude 4 family. It improves coding performance, supports long-running agent workflows, and handles computer-use tasks more reliably.


Let’s analyze its benchmarks, pricing, and how it compares with GPT-5 and Gemini 2.5 Pro in production use.



Core Features

| Category | Core Capability | Key Metric / Benchmark |
| --- | --- | --- |
| Code & Agents | Strong coding and agent workflows; sustains tasks for 30+ hours | SWE-bench Verified: 77.2% (82.0% parallel); long runs are observed behavior |
| Computer Use | Browser and desktop task handling | OSWorld: 61.4% |
| Reasoning & Math | High math performance and graduate-level reasoning | AIME 2025: 100% (tools), 87% (no tools); GPQA Diamond: 83.4% |
| Alignment & Safety | Fewer misaligned behaviors; stronger defenses | ASL-3 protections |
| Developer Tools | Code checkpoints, VS Code extension, Agent SDK | Product updates |
| Availability & Pricing | API, apps, Bedrock, Vertex AI, Chrome extension | $3 input / $15 output per million tokens |


Advanced Coding Capabilities

On SWE-bench Verified, Sonnet 4.5 scores 77.2% in standard runs and 82.0% with parallel compute. It can resolve real GitHub issues more reliably than earlier Claude models and slightly ahead of GPT-5 Codex at 74.5%.


It also reaches 50.0% on Terminal-Bench, which measures command-line work. It supports Python, JavaScript, TypeScript, Java, C++, C#, Go, Rust, PHP, Ruby, Swift, and Kotlin, with stronger results in Python and JavaScript.


Anthropic released related updates: checkpoints in Claude Code, a VS Code extension, and file creation in the Claude apps. Internal testing shows code editing error rates fell from 9% to 0%.


Long-Running Tasks and Agents

Sonnet 4.5 runs for more than 30 hours on complex tasks; Opus 4 managed about seven. Test runs included application builds, database setup, domain registration, and SOC 2 audit steps.


Anthropic reports a 65% reduction in shortcut behavior compared to Sonnet 3.7. The model requests clarification more often when the inputs are ambiguous. Context tracking reduces token waste during long sessions.


The release also includes the Claude Agent SDK. It provides tools for memory handling, permissions, and coordination of sub-agents.


Computer Use & Browser Automation

On OSWorld, Sonnet 4.5 scores 61.4%; Sonnet 4 scored 42.2%. OSWorld measures interactions like site navigation, spreadsheet entry, and desktop tasks. A Chrome extension brings these capabilities into the browser.
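In API terms, computer use is a tool loop: you give the model a virtual display, it returns actions, and your code executes them. A minimal sketch follows; the tool version and beta flag below come from the Claude 4-era computer-use docs and are assumptions for Sonnet 4.5, so verify them against current documentation.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.beta.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=2048,
    betas=["computer-use-2025-01-24"],  # assumption: Claude 4-era beta flag
    tools=[{
        "type": "computer_20250124",    # assumption: Claude 4-era tool version
        "name": "computer",
        "display_width_px": 1280,
        "display_height_px": 800,
    }],
    messages=[{"role": "user", "content": "Open sales.xlsx and total column B."}],
)

# The model replies with tool_use actions (screenshot, click, type, ...);
# your harness executes each one and returns the result as a tool_result.
for block in response.content:
    print(block.type)
```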


On domain-specific Tau-bench tests, it scores 86.2% in Retail, 70.0% in Airline, and 98.0% in Telecom.


Reasoning and Math

Sonnet 4.5 gives developers more control over reasoning depth. The API allows short responses or step-by-step reasoning.
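In practice this maps to the API's extended thinking parameter: enable it with a token budget for step-by-step reasoning, or omit it for short responses. A minimal sketch using the `anthropic` Python package:

```python
import anthropic

client = anthropic.Anthropic()

# budget_tokens caps how much the model may reason before answering;
# omit the thinking parameter entirely for a short, direct response.
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=16000,  # must exceed the thinking budget
    thinking={"type": "enabled", "budget_tokens": 8000},
    messages=[{"role": "user", "content": "Prove the sum of two odd integers is even."}],
)

for block in response.content:
    if block.type == "thinking":
        print("[reasoning]", block.thinking[:200], "...")
    elif block.type == "text":
        print(block.text)
```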


On AIME 2025, Sonnet 4.5 scores 100% with Python tools and 87% without them. On GPQA Diamond it reaches 83.4%. GPT-5 performs slightly better without tools, but the two are close when tools are enabled.


Other benchmarks show solid but not perfect results: 89.1% on MMMLU (multilingual Q&A), 77.8% on MMMU (visual reasoning), and 55.3% on Finance Agent.


Context and Memory

The model supports a 200K token context window, with a 1M option available. It can output up to 64K tokens, which is enough for long code generations or technical documents.
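The 1M-token window is gated behind a beta flag on the API. A hedged sketch: the header below is the one Anthropic published for Sonnet 4's long-context beta, so confirm it still applies to 4.5.

```python
import anthropic

client = anthropic.Anthropic()

long_doc = open("spec_dump.txt").read()  # hypothetical large input file

response = client.beta.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=64000,  # up to 64K output tokens
    betas=["context-1m-2025-08-07"],  # assumption: Sonnet 4's 1M-context beta header
    messages=[{"role": "user", "content": f"Summarize this specification:\n\n{long_doc}"}],
)
print(response.content[0].text)
```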


The new context editing feature helps manage token limits, cutting token use by 84% in a 100-turn web search evaluation. The memory tool allows persistence across sessions, which supports long-running agents.
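Both features are exposed as beta API parameters. A sketch under loud assumptions: the beta header, edit type, and memory tool identifier below follow the naming in Anthropic's announcement, but check the current API docs before relying on them.

```python
import anthropic

client = anthropic.Anthropic()

response = client.beta.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=4096,
    betas=["context-management-2025-06-27"],  # assumption: context-management beta header
    context_management={
        # assumption: strategy that clears stale tool results as the window fills
        "edits": [{"type": "clear_tool_uses_20250919"}]
    },
    tools=[{"type": "memory_20250818", "name": "memory"}],  # assumption: memory tool id
    messages=[{"role": "user", "content": "Pick up the migration plan where we left off."}],
)
```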


Safety and Alignment

Sonnet 4.5 shows lower rates of sycophancy, deception, and power-seeking than earlier Claude models. It resists prompt injection better.


It runs under AI Safety Level 3 (ASL-3). Safeguards include classifiers for harmful inputs and outputs, including CBRN-related prompts. UK AISI and US AISI evaluated these protections.


Deployment Options & Platform Availability

You can access Claude Sonnet 4.5 through the following (a minimal API call sketch follows the list):


  • Claude API (claude-sonnet-4-5).

  • Claude.ai web interface.

  • Claude iOS and Android apps.

  • Amazon Bedrock.

  • Google Cloud Vertex AI.

  • Native VS Code extension.
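Here is a minimal call using the model ID above, assuming the `anthropic` Python package and an `ANTHROPIC_API_KEY` in your environment:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Draft a PR description for a fix to the auth flow."}],
)
print(response.content[0].text)
```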

In Claude Code, the new checkpoints feature lets you save progress and roll back changes. The terminal interface also received a refresh.


Max subscribers get access to Imagine with Claude, a 5-day research preview (Sept 29–Oct 4, 2025) where the model generates software on the fly without prewritten code.


Observability & Developer Tools

The Claude Agent SDK gives developers the same infrastructure Anthropic uses for building Claude Code. 
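A short sketch assuming the Python `claude-agent-sdk` package: `query()` streams messages from the same agent loop that powers Claude Code. Option names may differ across SDK versions, so treat them as assumptions.

```python
import anyio
from claude_agent_sdk import query, ClaudeAgentOptions

async def main():
    options = ClaudeAgentOptions(
        permission_mode="acceptEdits",  # assumption: auto-approve file edits
        max_turns=20,                   # cap the agent loop
    )
    # Stream the agent's progress as it plans, edits files, and runs tools.
    async for message in query(prompt="Add unit tests for utils/dates.py", options=options):
        print(message)

anyio.run(main)
```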


New API features include context editing and a memory tool, which support long-running and multi-session workflows.


Benchmarks Against Other Models

Here’s a quick overview of Claude Sonnet 4.5’s benchmark results, compared with Claude Opus 4.1, Claude Sonnet 4, GPT-5, and Gemini 2.5 Pro. Scores are reported by Anthropic, covering coding, tool use, computer interaction, reasoning, and math.

| Benchmark | Claude Sonnet 4.5 | Claude Opus 4.1 | Claude Sonnet 4 | GPT-5 | Gemini 2.5 Pro |
| --- | --- | --- | --- | --- | --- |
| Agentic coding (SWE-bench Verified) | 77.2% (82.0% with parallel test-time compute) | 74.5% (79.4% parallel) | 72.7% (80.2% parallel) | 72.8% (74.5% GPT-5 Codex) | 67.2% |
| Agentic terminal coding (Terminal-Bench) | 50.0% | 46.5% | 36.4% | 43.8% | 25.3% |
| Agentic tool use (t2-bench) | Retail 86.2%, Airline 70.0%, Telecom 98.0% | Retail 86.8%, Airline 63.0%, Telecom 71.5% | Retail 83.8%, Airline 63.0%, Telecom 49.6% | Retail 81.1%, Airline 62.6%, Telecom 96.7% | - |
| Computer use (OSWorld) | 61.4% | 44.4% | 42.2% | - | - |
| High school math competition (AIME 2025) | 100% (Python), 87.0% (no tools) | 78.0% | 70.5% | 99.6% (Python), 94.6% (no tools) | 88.0% |
| Graduate-level reasoning (GPQA Diamond) | 83.4% | 81.0% | 76.1% | 85.7% | 86.4% |
| Multilingual Q&A (MMMLU) | 89.1% | 89.5% | 86.5% | 89.4% | - |
| Visual reasoning (MMMU, validation) | 77.8% | 77.1% | 74.4% | 84.2% | 82.0% |
| Financial analysis (Finance Agent) | 55.3% | 50.9% | 44.5% | 46.9% | 29.4% |

Coding Benchmarks

On SWE-bench Verified, Claude Sonnet 4.5 edges past Opus 4.1 and GPT-5, scoring 77.2% (82.0% with parallel compute). The gains over Sonnet 4 are incremental, but the model closes the gap with premium competitors while costing less than Opus.


In Terminal-Bench, Sonnet 4.5 reaches 50.0%, ahead of Opus 4.1 (46.5%) and GPT-5 (43.8%). That is solid for command-line automation, though scores remain lower than on coding benchmarks.


Tool Use

On t2-bench, Sonnet 4.5 performs competitively. It nearly matches Opus 4.1 in Retail (86.2% vs 86.8%) and leads significantly in Airline (70.0% vs 63.0%) and Telecom (98.0% vs 71.5%). 


The Telecom score stands out, showing that the model can handle structured tool workflows with a high degree of accuracy.


Computer Use

In OSWorld, Sonnet 4.5 achieves 61.4%, a clear lead over Opus 4.1 (44.4%) and Sonnet 4 (42.2%). 


Neither GPT-5 nor Gemini 2.5 Pro reports results here. This makes Claude the current reference point for desktop and browser interaction.


Reasoning and Math

On AIME 2025, Sonnet 4.5 achieves 100% with Python tools and 87% without. GPT-5 remains slightly stronger without tools (94.6%), but both models handle advanced competition math well.


For GPQA Diamond, Sonnet 4.5 reaches 83.4%, sitting between GPT-5 (85.7%) and Gemini 2.5 Pro (86.4%). It improves on Opus 4.1 (81.0%) and Sonnet 4 (76.1%).


On MMMLU (multilingual Q&A), Sonnet 4.5 scores 89.1%, effectively tied with Opus 4.1 and GPT-5. This suggests Anthropic maintained strong multilingual coverage without sacrificing performance in other areas.


Visual and Financial Tasks

For visual reasoning (MMMU validation), Sonnet 4.5 delivers 77.8%, close to Opus 4.1 but behind GPT-5 (84.2%) and Gemini 2.5 Pro (82.0%). Visual tasks remain a relative weakness compared to reasoning and coding.


In Finance Agent, Sonnet 4.5 scores 55.3%, improving on Anthropic’s older models and surpassing GPT-5 (46.9%) and Gemini 2.5 Pro (29.4%). While not high in absolute terms, it shows progress in specialized domains like financial analysis.


Claude Sonnet 4.5 API Pricing

Pricing is based on prompt size, with separate rates for input, output, and prompt caching. Rates remain the same as Sonnet 4.

| Type | ≤200K tokens | >200K tokens |
| --- | --- | --- |
| Input | $3 / 1M tokens | $6 / 1M tokens |
| Output | $15 / 1M tokens | $22.50 / 1M tokens |
| Prompt caching, write | $3.75 / 1M tokens | $7.50 / 1M tokens |
| Prompt caching, read | $0.30 / 1M tokens | $0.60 / 1M tokens |
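Prompt caching is opt-in per content block: mark a long, stable prefix with `cache_control`, and repeat calls read it at the cheaper cached rate. A minimal sketch (the file name is hypothetical):

```python
import anthropic

client = anthropic.Anthropic()

style_guide = open("style_guide.md").read()  # hypothetical large shared prefix

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": style_guide,
        # First call writes the cache at $3.75/MTok; later calls read it at $0.30/MTok.
        "cache_control": {"type": "ephemeral"},
    }],
    messages=[{"role": "user", "content": "Check this paragraph against the style guide: ..."}],
)
print(response.content[0].text)
```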

API Pricing Comparison

| Model | Input Price | Output Price | Caching / Context Pricing |
| --- | --- | --- | --- |
| Claude Sonnet 4.5 | ≤200K: $3 / MTok; >200K: $6 / MTok | ≤200K: $15 / MTok; >200K: $22.50 / MTok | Write: $3.75 (≤200K) / $7.50 (>200K) per MTok; Read: $0.30 (≤200K) / $0.60 (>200K) per MTok |
| GPT-5 | $1.25 / MTok | $10.00 / MTok | Cached input: $0.125 / MTok |
| Gemini 2.5 Pro (free tier) | Free | Free | Not available |
| Gemini 2.5 Pro (paid tier) | ≤200K: $1.25 / MTok; >200K: $2.50 / MTok | ≤200K: $10.00 / MTok; >200K: $15.00 / MTok | ≤200K: $0.31 / MTok; >200K: $0.625 / MTok; plus $4.50 / MTok per hour storage |
Claude Sonnet 4.5 keeps the same rates as Sonnet 4. Input pricing is mid-range, but output pricing is higher than the others, which could increase costs for large outputs. 


GPT-5 has lower input costs and lower caching costs, but a fixed output rate of $10/MTok. Gemini 2.5 Pro offers a free tier and lower rates for small inputs, but costs grow for larger workloads and include storage charges. The choice depends on workload size, caching needs, and output volume.


Sonnet 4.5 vs. GPT-5

When comparing per-token pricing, GPT-5 is generally cheaper for high-volume workloads. For prompts above 200K tokens:


  • Sonnet 4.5 input: $6 / million tokens

  • GPT-5 input: $1.25 / million tokens

  • Sonnet 4.5 output: $22.50 / million tokens

  • GPT-5 output: $10 / million tokens


Example: For a workload of 100M input tokens and 20M output tokens monthly:

| Model | Input Cost | Output Cost | Total Cost |
| --- | --- | --- | --- |
| Sonnet 4.5 | $600 | $450 | $1,050 |
| GPT-5 | $125 | $200 | $325 |

GPT-5 is significantly cheaper in this example. Sonnet 4.5’s value lies more in accuracy for coding and agent tasks rather than raw cost efficiency.
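The arithmetic behind the table, as a quick sanity check:

```python
# Rates in $ per million tokens; volumes in millions of tokens per month.
def monthly_cost(input_mtok: float, output_mtok: float,
                 in_rate: float, out_rate: float) -> float:
    return input_mtok * in_rate + output_mtok * out_rate

print(monthly_cost(100, 20, 6.00, 22.50))  # Sonnet 4.5 (>200K rates): 1050.0
print(monthly_cost(100, 20, 1.25, 10.00))  # GPT-5: 325.0
```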


Sonnet 4.5 vs. Gemini 2.5 Pro

Gemini 2.5 Pro offers a free tier for low‑volume use. Its paid tier costs are lower than Sonnet 4.5’s for both input and output in many cases, but rise for high‑volume workloads.

| Model | Input Cost (/M tokens) | Output Cost (/M tokens) |
| --- | --- | --- |
| Gemini 2.5 Pro | ≤200K: $1.25; >200K: $2.50 | ≤200K: $10.00; >200K: $15.00 |
| Sonnet 4.5 | ≤200K: $3.00; >200K: $6.00 | ≤200K: $15.00; >200K: $22.50 |

Gemini’s pricing is competitive for general use, but Sonnet 4.5 may justify its higher price if its accuracy reduces retries.


Who Should Choose Claude Sonnet 4.5?

Some of the best use cases are:

  1. Software Development: Strong performance on coding tasks (SWE-bench ~77.2%). Works well for pull request generation, bug fixing, and multi-step feature building. Integrates with tools like Cursor and GitHub Copilot.

  2. Agentic Applications: Suitable for long workflows with extended context (up to 1M tokens with special access). Can handle multi-step research and task automation.

  3. Automation & Browser Tasks: Competitive performance on browser automation (OSWorld ~61.4%). Useful for workflows requiring web interaction, such as competitive analysis or onboarding.

  4. Security & Compliance: Faster vulnerability processing and improved accuracy over older models. Applicable to SOC 2 audits and security reviews.

  5. Finance & Legal: Useful for market research, modeling, compliance, and document analysis. Effective where reasoning over long context is needed.

  6. Research: Supports large-scale literature review and data analysis with extended context.

  7. Safety-Critical Workflows: Includes alignment measures that reduce undesirable behavior, making it suited for production.

  8. Cost-Conscious Coding Workflows: Per-token rates are higher than GPT-5's, but higher accuracy can offset retry costs and lower the effective cost per completed task.


When Is GPT-5 Better?

  • Your stack is deeply integrated with OpenAI tools.

  • You need the memory features GPT-5 offers.

  • You want slightly higher reasoning benchmarks (GPQA ~85.7% vs ~83.4%).

  • You need a large standard context window (400K tokens).

  • You already run GPT fine-tuning infrastructure.

  • You depend on its plugin ecosystem.


Pricing: GPT-5 costs $1.25/M input tokens and $10/M output tokens; Sonnet 4.5 costs $3-6/M input and $15-22.50/M output. GPT-5 is often more cost-effective for high-volume workloads.


When Is Gemini 2.5 Pro Better?

  • Lower per‑token costs: input $1.25-2.50/M tokens, output $10-15/M tokens.

  • A 1M-token context window is standard, with no special access required.

  • Multimodal and video processing.

  • Easier integration with Google Cloud infrastructure.

  • General‑purpose tasks where coding performance is not critical.


Benchmark: Gemini 2.5 Pro scores ~67.2% on SWE-bench. Sonnet 4.5 offers higher coding performance, which can justify the higher cost in coding-heavy workflows.


Getting Started

If you need solid coding, autonomous operation, and general computer-use tasks, Claude Sonnet 4.5 handles them well. Its performance, safety measures, and deployment flexibility make it a strong choice for coding, agent workflows, and specialized tasks.


You can also connect with our experts for guidance on integrating and deploying Claude Sonnet 4.5 and optimizing performance for production workflows.


Good luck!


Frequently Asked Questions

What is Claude Sonnet 4.5 used for?

Claude Sonnet 4.5 handles software development, agent workflows, and computer-use tasks. It's used for pull request generation, bug fixing, multi-step feature building, browser automation, security audits, financial analysis, legal document review, and research with long context. It integrates with tools like Cursor and GitHub Copilot.

Is Claude Sonnet 4.5 better than GPT-5?

Claude Sonnet 4.5 scores higher on coding benchmarks (77.2% vs 72.8% on SWE-bench Verified). GPT-5 costs less ($1.25/$10 vs $3-6/$15-22.5 per million tokens) and performs slightly better on some reasoning tasks (85.7% vs 83.4% on GPQA Diamond). The better choice depends on whether you prioritize coding performance or cost efficiency.

How much does Claude Sonnet 4.5 cost?

For prompts ≤200K tokens: $3/M input, $15/M output. For prompts >200K tokens: $6/M input, $22.50/M output. Prompt caching offers additional savings (write: $3.75-7.50/M, read: $0.30-0.60/M). Same pricing as Sonnet 4.

Can Claude Sonnet 4.5 run autonomously?

Yes, it runs for over 30 hours on complex tasks, compared to 7 hours for Opus 4. Test runs showed it building applications, setting up databases, registering domains, and performing SOC 2 audit steps without human intervention.

What is Claude Sonnet 4.5's context window?

Standard 200K tokens, with 1M tokens available through special access. Supports up to 64K output tokens. Context editing reduces token consumption by 84% in long workflows. Long-context pricing applies above 200K tokens.

Is Claude Sonnet 4.5 safe for enterprise use?

Yes. It runs under AI Safety Level 3 protections with content filters and classifiers. Shows 65% reduction in shortcut behavior vs Sonnet 3.7, lower rates of sycophancy and deception, and better prompt injection resistance. Evaluated by UK AISI and US AISI.

How does Claude Sonnet 4.5 compare to Gemini 2.5 Pro?

Sonnet 4.5 leads in coding (77.2% vs 67.2% on SWE-bench) and computer use (61.4% on OSWorld). Gemini 2.5 Pro costs less ($1.25-2.50 input, $10-15 output), offers 1M standard context, and includes video processing. Gemini is better for budget-conscious general use; Sonnet for coding-heavy workflows.

Where can I access Claude Sonnet 4.5?

Claude API (claude-sonnet-4-5), Claude.ai web interface, iOS/Android apps, Amazon Bedrock, Google Cloud Vertex AI, VS Code extension, and integrated in Cursor, Windsurf, and GitHub Copilot. Free tier available on Claude.ai.

