Claude Capybara Benchmark: How Mythos Performs Against Opus and GPT-5

No official benchmark scores for Claude Capybara have been published. What we have are Anthropic’s own claims from leaked internal documents and the verified performance of Claude Opus 4.6 as a baseline. Since Anthropic states Capybara scores “dramatically higher” than Opus 4.6, we can project where Mythos sits relative to current frontier models. We cover this further in our Capybara vs Opus comparison article.


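Opus 4.6 already leads GPT-5.4 on several major benchmarks, including SWE-Bench Verified, GPQA Diamond, and ARC-AGI-2, though it trails on others such as Terminal-Bench 2.0. A model that dramatically exceeds those scores would represent an unusually large jump between consecutive frontier releases.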

What Anthropic Claims About Capybara Performance

The leaked draft blog post, confirmed by an Anthropic spokesperson, makes three specific performance claims about the Capybara tier.

The Three Core Claims

Coding: “Dramatically higher scores on tests of software coding” compared to Opus 4.6. This is notable because Opus 4.6 already achieves 80.8% on SWE-Bench Verified, leading GPT-5.4’s 77.2%.

Academic reasoning: “Dramatically higher scores on tests of academic reasoning” compared to Opus 4.6. Opus holds a 91.31% score on GPQA Diamond (graduate-level science), the highest among all public models.

Cybersecurity: The model is “currently far ahead of any other AI model in cyber capabilities.” This is the most emphatic of the three claims: not just better than Opus, but ahead of every competitor in the industry.

What “Dramatically Higher” Likely Means

Gains between successive model versions typically measure 2-5 percentage points on hard benchmarks. For Anthropic to call the improvement “dramatic,” the gains likely exceed 5 points and may reach 10 or more on some tests.

For context, the jump from Claude Opus 4.5 to Opus 4.6 on SWE-Bench was roughly 3 points. A “dramatic” improvement suggests Capybara could reach 85% or higher on the same benchmark.
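Percentage-point gains understate how much changes near the top of a benchmark. As a rough illustration, the sketch below converts a score change into the share of previously unresolved tasks that would now be resolved. The 85% and 90% targets are hypothetical values taken from the speculation above, not published results.

```python
def relative_error_reduction(old_score: float, new_score: float) -> float:
    """Fraction of previously failed tasks that the new score would resolve.

    Scores are percentages in [0, 100]; old_score must be below 100.
    """
    old_error = 100.0 - old_score
    new_error = 100.0 - new_score
    return (old_error - new_error) / old_error


# Opus 4.6's SWE-Bench Verified score, as quoted in this article.
OPUS_SWE_BENCH = 80.8

# Hypothetical Capybara scores; neither is a published figure.
for hypothetical in (85.0, 90.0):
    cut = relative_error_reduction(OPUS_SWE_BENCH, hypothetical)
    print(f"{OPUS_SWE_BENCH}% -> {hypothetical}%: {cut:.1%} fewer unresolved tasks")
```

Under these assumptions, reaching 85% would remove roughly a fifth of Opus 4.6's failures, and reaching 90% would remove nearly half.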

Current Opus 4.6 Benchmark Scores (The Baseline)

Since Capybara is positioned as dramatically better than Opus 4.6, understanding the baseline is essential. Here are Opus 4.6’s verified scores across major benchmarks:

Coding Benchmarks

Benchmark	Claude Opus 4.6	GPT-5.4	Gemini 3.1 Pro	Leader
SWE-Bench Verified	80.8%	77.2%	80.6%	Opus
Terminal-Bench 2.0	65.4%	81.8%	—	GPT-5.4
SWE-Bench Pro	~45.9%	57.7%	—	GPT-5.4

Opus leads on SWE-Bench Verified but trails GPT-5.4 on Terminal-Bench 2.0 and the harder SWE-Bench Pro. Capybara’s “dramatically higher” coding scores could close or reverse these gaps.
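To make the size of each gap explicit, the following sketch computes the leader and its margin over the runner-up for every row of the coding table. The scores are copied from the table; the dictionary layout and the function name are illustrative choices, not part of any benchmark tooling.

```python
# Scores copied from the coding table above (percent). None marks a missing entry.
# The SWE-Bench Pro figure for Opus is approximate (~45.9%) in the source table.
CODING_SCORES = {
    "SWE-Bench Verified": {"Claude Opus 4.6": 80.8, "GPT-5.4": 77.2, "Gemini 3.1 Pro": 80.6},
    "Terminal-Bench 2.0": {"Claude Opus 4.6": 65.4, "GPT-5.4": 81.8, "Gemini 3.1 Pro": None},
    "SWE-Bench Pro": {"Claude Opus 4.6": 45.9, "GPT-5.4": 57.7, "Gemini 3.1 Pro": None},
}


def leader_and_margin(scores: dict) -> tuple:
    """Return (leader, margin in points over the runner-up) among scored models."""
    ranked = sorted(
        ((score, model) for model, score in scores.items() if score is not None),
        reverse=True,
    )
    (top, leader), (second, _) = ranked[0], ranked[1]
    return leader, round(top - second, 1)


for benchmark, scores in CODING_SCORES.items():
    leader, margin = leader_and_margin(scores)
    print(f"{benchmark}: {leader} leads by {margin} points")
```

Read this way, the Opus lead on SWE-Bench Verified is only 0.2 points over Gemini 3.1 Pro, while GPT-5.4's leads on the other two benchmarks are 16.4 and 11.8 points.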

Reasoning Benchmarks

Benchmark	Claude Opus 4.6	GPT-5.4 (GPT-5.2 where noted)	Leader
GPQA Diamond	91.31%	~88% (est.)	Opus
ARC-AGI-2	68.8%	54.2% (GPT-5.2)	Opus
AIME 2025	99.79%	100% (GPT-5.2)	GPT-5.2

Opus leads on GPQA Diamond and ARC-AGI-2. Its 14.6-point margin over GPT-5.2 on ARC-AGI-2 is the largest Opus lead on any benchmark listed here, while AIME 2025 is effectively saturated for both models. If Capybara extends the ARC-AGI-2 lead, it would push scores into territory that until recently looked out of reach for current architectures.

Agentic and Real-World Tasks

Benchmark	Claude Opus 4.6	GPT-5.4	Leader
OSWorld (computer use)	72.7%	75%	GPT-5.4
Multi-turn dialogue Elo	+40 pts	Baseline	Opus
Context window	200K+ tokens	Shorter	Opus

Projected Capybara Performance

Based on the leaked claims and historical improvement patterns, here is where Capybara likely sits:

Coding Projections

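If “dramatically higher” means a 5-10 point improvement over Opus 4.6 on SWE-Bench Verified, and a larger gain of roughly 10-20 points on the two benchmarks where Opus currently trails:

SWE-Bench Verified: 85-90% (vs Opus 80.8%, GPT-5.4 77.2%)
Terminal-Bench 2.0: 75-85% (vs Opus 65.4%, potentially closing the gap with GPT-5.4’s 81.8%)
SWE-Bench Pro: 55-65% (vs Opus ~45.9%, approaching or passing GPT-5.4’s 57.7%)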

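These projections would make Capybara the leader on SWE-Bench Verified and at least competitive with GPT-5.4 on the other two benchmarks. A 90% SWE-Bench Verified score would mean the model resolves 9 out of 10 of the benchmark’s curated GitHub issues autonomously, which is not the same as solving 9 out of 10 issues in an arbitrary codebase.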
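The arithmetic behind these ranges can be checked with a short sketch. It applies a uniform 5-10 point gain to the Opus 4.6 coding baselines quoted in this article and caps the result at 100%. The function name and the uniform-gain assumption are illustrative only, since the leaked claims give no per-benchmark numbers.

```python
def project(baseline: float, low_gain: float = 5.0, high_gain: float = 10.0) -> tuple:
    """Add a gain range (in percentage points) to a baseline score, capped at 100%."""
    return (min(baseline + low_gain, 100.0), min(baseline + high_gain, 100.0))


# Opus 4.6 baselines from the coding table (percent); SWE-Bench Pro is approximate.
BASELINES = {
    "SWE-Bench Verified": 80.8,
    "Terminal-Bench 2.0": 65.4,
    "SWE-Bench Pro": 45.9,
}

for name, base in BASELINES.items():
    low, high = project(base)
    print(f"{name}: {low:.1f}-{high:.1f}% under a uniform 5-10 point gain")
```

A uniform 5-10 point gain lands close to the 85-90% range for SWE-Bench Verified, but it yields lower figures than the 75-85% and 55-65% ranges listed for Terminal-Bench 2.0 and SWE-Bench Pro. Those two ranges correspond to a gain of roughly 10-20 points.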

Reasoning Projections

GPQA Diamond: 94-97% (approaching the benchmark’s ceiling, with under 9 points of headroom above Opus’s 91.31%)
ARC-AGI-2: 78-85% (a gain of roughly 9-16 points, extending the lead over GPT-5.2’s 54.2%)
Humanity’s Last Exam: significant improvement (this frontier benchmark still challenges all models)
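The reasoning ranges above imply different gains for each benchmark, partly because GPQA Diamond leaves little room to improve. The sketch below derives the implied gain and the remaining headroom from the baselines and projected ranges given in this article. The helper names are hypothetical, and the projected ranges are this article's speculation rather than measured results.

```python
def headroom(baseline: float) -> float:
    """Points remaining before a percentage score reaches 100."""
    return round(100.0 - baseline, 2)


def implied_gain(baseline: float, projected_low: float, projected_high: float) -> tuple:
    """Gain in points that a projected range implies over a baseline."""
    return (round(projected_low - baseline, 2), round(projected_high - baseline, 2))


# (Opus 4.6 baseline, projected low, projected high), all in percent.
PROJECTIONS = {
    "GPQA Diamond": (91.31, 94.0, 97.0),
    "ARC-AGI-2": (68.8, 78.0, 85.0),
}

for name, (base, low, high) in PROJECTIONS.items():
    gain_low, gain_high = implied_gain(base, low, high)
    print(f"{name}: +{gain_low} to +{gain_high} points, headroom {headroom(base)}")
```

On these figures the GPQA Diamond projection implies a gain of about 3-6 points out of fewer than 9 available, while the ARC-AGI-2 projection implies about 9-16 points.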

Cybersecurity Projections

This is the area with the least comparable data. No standardized “AI cybersecurity benchmark” exists that all frontier models compete on. Anthropic’s claim of being “far ahead of any other AI model” suggests they have internal evaluation frameworks that show Capybara outperforming both Claude Opus and GPT-5 series models on vulnerability discovery, exploit analysis, and defensive code generation.

Why Benchmark Claims Need Independent Verification

Leaked internal benchmark claims carry important caveats. They come from Anthropic’s own testing environment, not from independent evaluation, and no actual scores have been disclosed.

The GPT-5 Lesson

OpenAI positioned GPT-5 as a breakthrough before its August 2025 launch. In practice, many users and reviewers considered it disappointing relative to promises. The gap between internal benchmarks and real-world user experience was significant enough to damage GPT-5’s reputation.

Anthropic could face the same challenge. Results on curated benchmarks often overstate performance on messy, real-world tasks. A model that scores 90% on SWE-Bench may still struggle with edge cases in programming languages or frameworks that are poorly represented in the test set.

What Independent Testing Would Tell Us

When Capybara eventually reaches independent evaluators, three things will become clear:

Consistency: Do the high scores hold across diverse problem types, or do they cluster around certain domains? Opus 4.6 shows relatively even performance; Capybara’s “dramatically higher” scores could be concentrated in specific areas.

Latency-quality trade-off: Higher capability often comes with slower response times. If Capybara takes significantly longer per query, its practical advantage shrinks for interactive use cases.

Cost-performance ratio: Even if Capybara leads all benchmarks, the question is whether the improvement justifies a higher price, which could plausibly be 2-5x that of Opus (no pricing has been announced). For many applications, Opus 4.6 may deliver most of the quality at a fraction of the cost.
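One way to frame the cost-performance question is cost per solved task rather than cost per attempt. The sketch below assumes every attempt is billed whether or not it succeeds. All prices and solve rates in it are hypothetical placeholders: no Capybara pricing or scores have been published, and the 3x price is just one point in the 2-5x range discussed above.

```python
def cost_per_solved_task(price_per_task: float, solve_rate: float) -> float:
    """Expected spend per successfully solved task when every attempt is billed."""
    if not 0 < solve_rate <= 1:
        raise ValueError("solve_rate must be in (0, 1]")
    return price_per_task / solve_rate


# Hypothetical inputs: a baseline model at 1.00 per attempt with an 80.8% solve rate,
# and a premium model at 3.00 per attempt with a 90% solve rate.
baseline = cost_per_solved_task(price_per_task=1.00, solve_rate=0.808)
premium = cost_per_solved_task(price_per_task=3.00, solve_rate=0.90)

print(f"baseline: {baseline:.2f} per solved task")
print(f"premium:  {premium:.2f} per solved task")
print(f"premium costs {premium / baseline:.2f}x as much per solved task")
```

Under these placeholder numbers the higher solve rate offsets only a small part of the price difference, so the premium model pays off mainly where a failed task is much more expensive than the extra spend.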

Cybersecurity Benchmarks: A New Frontier

Capybara’s cybersecurity claims are the hardest to verify because standardized AI cybersecurity benchmarks are still emerging.

What Exists Today

Current cybersecurity evaluation frameworks include CTF (Capture The Flag) challenge performance, vulnerability discovery rates in controlled environments, and exploit code generation quality assessments. None of these are as standardized or widely tracked as SWE-Bench or GPQA.

Anthropic’s claim that Capybara is “far ahead of any other AI model in cyber capabilities” implies they have internal benchmarks showing a decisive gap over both Claude Opus and competitors like OpenAI’s GPT-5.3-Codex (which OpenAI classified as having “high capability” for cybersecurity).

Why This Matters for Benchmarking

The cybersecurity dimension adds a new axis to model comparison. Previously, AI models were compared primarily on coding, reasoning, and language tasks. Capybara’s positioning treats cybersecurity as a first-class benchmark category, which could prompt new standardized evaluations.

Questions About Claude Capybara Benchmarks

Are Claude Capybara benchmark scores available?

No official scores have been published. Anthropic’s leaked documents claim “dramatically higher scores” than Opus 4.6 in coding, reasoning, and cybersecurity, but independent verification has not been possible since the model is in restricted testing.

How does Capybara compare to GPT-5 on benchmarks?

Direct comparison is not possible yet. However, Claude Opus 4.6 already leads GPT-5.4 on SWE-Bench Verified (80.8% vs 77.2%) and on GPQA Diamond (91.31% vs an estimated ~88%). Capybara is described as dramatically better than Opus, which would imply a significant gap over GPT-5 on those benchmarks.

What benchmark does Capybara score highest on?

Based on leaked claims, cybersecurity is where Capybara shows the largest advantage, described as “far ahead of any other AI model.” For coding and reasoning, the improvement over Opus is described as “dramatic” rather than the stronger “far ahead” language.

Will Capybara break 90% on SWE-Bench?

If “dramatically higher” means 5-10 points above Opus 4.6’s 80.8%, then 85-90% is plausible. A 90% score would be unprecedented and would mean the model resolves 9 out of 10 of the benchmark’s curated GitHub issues autonomously.

Can I trust leaked benchmark claims?

Leaked internal benchmarks should be treated as preliminary. Many users and reviewers felt OpenAI’s GPT-5 fell short of its pre-release promises. Until independent evaluators test Capybara, the leaked claims remain unverified. However, Anthropic has a track record of conservative claims relative to actual performance.

How does Capybara’s cybersecurity capability compare?

Anthropic claims it is “far ahead of any other AI model in cyber capabilities.” No direct comparison with the GPT-5 series is possible because no standardized cybersecurity benchmark exists that all frontier models compete on.