<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xsl" href="/feed.xsl"?>
<feed xmlns="http://www.w3.org/2005/Atom"><title>AI Benchmark Digest</title><subtitle>AI benchmark changes — new models, leader shifts, and trends</subtitle><link href="https://aibenchmarks.dev/data/feed.xml" rel="self" /><link href="https://aibenchmarks.dev/#/digest" rel="alternate" /><id>https://aibenchmarks.dev/feed</id><icon>https://aibenchmarks.dev/favicon.svg</icon><updated>2026-06-18T07:17:43.853003+00:00</updated><entry><title>AI Benchmark Digest — 2026-06-18</title><id>https://aibenchmarks.dev/digest/2026-06-18</id><updated>2026-06-18T07:17:43.853003+00:00</updated><link href="https://aibenchmarks.dev/#/digest" /><author><name>AI Benchmark Hub</name></author><content type="html">&lt;h2&gt;Daily&lt;/h2&gt;
&lt;h3&gt;New Benchmarks (9)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;AISI Cyber Cooling Tower 10M&lt;/strong&gt; (Avg Steps (/7)): leader Claude Opus 4.6 (0.1), 7 models&lt;br&gt;&lt;span&gt;AISI cyber range: &amp;quot;Cooling Tower&amp;quot; — a 7-step industrial-control-network attack simulation. Reports average steps completed at a 10M token budget.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AISI Cyber Cooling Tower 100M&lt;/strong&gt; (Avg Steps (/7)): leader Claude Opus 4.6 (1.4), 5 models&lt;br&gt;&lt;span&gt;AISI cyber range: &amp;quot;Cooling Tower&amp;quot; — a 7-step industrial-control-network attack simulation. Reports average steps completed at a 100M token budget.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenAI CTF (Professional)&lt;/strong&gt; (pass@12 (%)): leader GPT-5.5 (96.3), 3 models&lt;br&gt;&lt;span&gt;OpenAI system-card subset of professional capture-the-flag tasks, reporting pass@12 over offensive-security rollouts with a Linux tool harness.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;CVE-Bench&lt;/strong&gt; (pass@1 (%)): leader GPT-5.5 (93.1), 4 models&lt;br&gt;&lt;span&gt;Cybersecurity benchmark for autonomous web vulnerability exploitation across 40 critical CVEs in zero-day and one-day settings.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenAI Cyber Ranges&lt;/strong&gt; (Combined Pass Rate (%)): leader GPT-5.5 (93.33), 4 models&lt;br&gt;&lt;span&gt;OpenAI internal cyber-range suite measuring end-to-end cyber operations across realistic emulated networks.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;ExploitGym&lt;/strong&gt; (Successful Intended Exploits (#)): leader Claude Mythos Preview (157.0), 7 models&lt;br&gt;&lt;span&gt;Real-world cybersecurity agent benchmark measuring whether AI agents can turn known software vulnerabilities into working, intended exploits across userspace, V8, and Linux kernel targets.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;CyScenarioBench&lt;/strong&gt; (Average Success Rate 
(%)): leader Claude Mythos 5 (36.7), 9 models&lt;br&gt;&lt;span&gt;Irregular scenario-based offensive security benchmark measuring whether agents can plan and complete full multi-stage attack scenarios in realistic environments.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Lyptus Cyber Time Horizons - InterCode-CTF&lt;/strong&gt; (pass@1 at 2M tokens (%)): leader Claude Opus 4.6 (100.0), 3 models&lt;br&gt;&lt;span&gt;Lyptus Research offensive cyber time-horizon run of InterCode-CTF, measuring pass@1 on CTF tasks at a 2M token budget.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Lyptus Cyber Time Horizons - NL2Bash&lt;/strong&gt; (pass@1 at 2M tokens (%)): leader GPT-5.3 Codex (100.0), 3 models&lt;br&gt;&lt;span&gt;Lyptus Research offensive cyber time-horizon run of NL2Bash, measuring command-generation success at a 2M token budget.&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;Top-10 New Scores (2)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;GPT-5.4 Pro&lt;/strong&gt; on FrontierMath - Tier 4 (v2): 58.54 (#5)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-5.4 Pro&lt;/strong&gt; on FrontierMath - Tiers 1-3 (v2): 82.46 (#4)&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;New #1 Leaders (2)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Terminal-Bench 2.1 (Claude Code)&lt;/strong&gt;: Claude Fable 5 (83.1) beat Claude Opus 4.8 by 4.2&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Terminal-Bench 2.1 (Terminus 2)&lt;/strong&gt;: Claude Fable 5 (80.4) beat GPT-5.5 by 2.2&lt;/li&gt;&lt;/ul&gt;</content><summary>AI Benchmark Digest — 2026-06-18

=== DAILY ===
NEW BENCHMARKS (9)
  - AISI Cyber Cooling Tower 10M (Avg Steps (/7)): leader Claude Opus 4.6 (0.1), 7 models
      AISI cyber range: "Cooling Tower" — a 7-step industrial-control-network attack simulation. Reports average steps completed at a 10M token</summary></entry><entry><title>AI Benchmark Digest — 2026-06-17</title><id>https://aibenchmarks.dev/digest/2026-06-17</id><updated>2026-06-17T07:26:00.903157+00:00</updated><link href="https://aibenchmarks.dev/#/digest" /><author><name>AI Benchmark Hub</name></author><content type="html">&lt;h2&gt;Daily&lt;/h2&gt;
&lt;h3&gt;New Benchmarks (4)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;LLM Stats (Finance Agent v2)&lt;/strong&gt; (Score (%)): leader Gemini 3.5 Flash (57.86), 25 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLM Stats (FrontierSWE)&lt;/strong&gt; (Score (%)): leader Claude Fable 5 (90.0), 13 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLM Stats (Legal Agent Benchmark)&lt;/strong&gt; (Score (%)): leader Claude Fable 5 (13.3), 11 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLM Stats (SkillsBench)&lt;/strong&gt; (Score (%)): leader Qwen 3.7 Max (59.2), 5 models&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;Top-10 New Scores (12)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Claude Fable 5&lt;/strong&gt; on SWE-Marathon: 24.0 (#2)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GLM-5.2&lt;/strong&gt; on BenchLM: 94.0 (#3)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GLM-5.2&lt;/strong&gt; on LLM Stats (HMMT 2025): 94.4 (#9)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GLM-5.2&lt;/strong&gt; on LLM Stats (HMMT Feb 26): 92.5 (#6)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GLM-5.2&lt;/strong&gt; on LLM Stats (IMO-AnswerBench): 91.0 (#2)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GLM-5.2&lt;/strong&gt; on LLM Stats (MCP Atlas): 76.8 (#4)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GLM-5.2&lt;/strong&gt; on LLM Stats (Toolathlon): 48.2 (#8)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GLM-5.2&lt;/strong&gt; on PinchBench: 87.79 (#18)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GLM-5.2&lt;/strong&gt; on RuneBench: 3230.0 (#4)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GLM-5.2&lt;/strong&gt; on SWE-Marathon: 13.0 (#4)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GLM-5.2&lt;/strong&gt; on ZeroEval GPQA Diamond: 91.2 (#12)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Qwen 3.7 Max&lt;/strong&gt; on LLM Stats (GDPval-AA): 1308.0 (#12)&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;New #1 Leaders (15)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;LLM Stats (DeepPlanning)&lt;/strong&gt;: Qwen 3.7 Plus (62.3) beat Qwen 3.6 Plus by 20.8&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Coding Agent Leaderboard - swe-bench-pro--ansible&lt;/strong&gt;: Opus 4.8 + Claude Code (69.8) beat Sonnet 4.6 + Claude Code by 19.8&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLM Stats (MRCR v2)&lt;/strong&gt;: Qwen 3.7 Plus (91.7) beat U2 by 15.09&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Coding Agent Leaderboard&lt;/strong&gt;: Opus 4.8 + Claude Code (78.3) beat Sonnet 4.6 + Claude Code by 13.5&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Design Arena (Website)&lt;/strong&gt;: silo (1357.0) beat Claude Fable 5 by 12.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Coding Agent Leaderboard - swe-bench-verified&lt;/strong&gt;: Opus 4.8 + Claude Code (86.8) beat Sonnet 4.6 + Claude Code by 7.2&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLM Stats (ERQA)&lt;/strong&gt;: Qwen 3.7 Plus (69.8) beat Qwen 3.6 Plus by 4.1&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLM Stats (SimpleVQA)&lt;/strong&gt;: Qwen 3.7 Plus (81.7) beat GLM-5V Turbo by 3.5&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLM Stats (AIME 2026)&lt;/strong&gt;: GLM-5.2 (99.2) beat Kimi K2.6 by 2.8&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLM Stats (IMO-AnswerBench)&lt;/strong&gt;: Nemotron 3 Ultra (550B A55B) (92.3) beat Qwen 3.7 Max by 2.3&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLM Stats (NL2Repo)&lt;/strong&gt;: GLM-5.2 (48.9) beat Qwen 3.7 Max by 1.7&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLM Stats (RealWorldQA)&lt;/strong&gt;: Qwen 3.7 Plus (86.9) beat Qwen 3.6 Plus by 1.5&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLM Stats (LVBench)&lt;/strong&gt;: Qwen 3.7 Plus (76.2) beat Kimi K2.5 by 0.3&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLM Stats (Video-MME)&lt;/strong&gt;: Qwen 3.7 Plus (88.0) beat MiMo-V2.5 by 0.3&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLM Stats (MLVU)&lt;/strong&gt;: Qwen 3.7 Plus (87.4) beat Qwen 3.5 122B A10B by 0.1&lt;/li&gt;&lt;/ul&gt;</content><summary>AI Benchmark Digest — 2026-06-17

=== DAILY ===
NEW BENCHMARKS (4)
  - LLM Stats (Finance Agent v2) (Score (%)): leader Gemini 3.5 Flash (57.86), 25 models
  - LLM Stats (FrontierSWE) (Score (%)): leader Claude Fable 5 (90.0), 13 models
  - LLM Stats (Legal Agent Benchmark) (Score (%)): leader Claud</summary></entry><entry><title>AI Benchmark Digest — 2026-06-16</title><id>https://aibenchmarks.dev/digest/2026-06-16</id><updated>2026-06-16T08:27:51.523101+00:00</updated><link href="https://aibenchmarks.dev/#/digest" /><author><name>AI Benchmark Hub</name></author><content type="html">&lt;h2&gt;Daily&lt;/h2&gt;
&lt;h3&gt;New Benchmarks (7)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;SWE-Marathon&lt;/strong&gt; (Pass@1 (%)): leader Claude Opus 4.8 (26.0), 9 models&lt;br&gt;&lt;span&gt;Long-horizon software engineering benchmark where coding agents work on realistic repository tasks under marathon-scale time budgets, reporting pass@1 for end-to-end completed tasks.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;InferenceBench&lt;/strong&gt; (Speedup Score): leader Claude Fable 5 (Low) (8.74), 22 models&lt;br&gt;&lt;span&gt;Benchmark for coding agents optimizing inference workloads. Agents tune serving configurations and implementation choices across latency, throughput, and all-in-one scenarios.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AgenticVBench&lt;/strong&gt; (Average Success (%)): leader Claude Fable 5 (32.4), 9 models&lt;br&gt;&lt;span&gt;Agentic video benchmark where autonomous agents perform multi-step video repurposing, sequencing, repair, and assembly tasks, scored by average task success.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;TERMS-Bench&lt;/strong&gt; (Mean Utility): leader GLM 5.1 (11.7), 15 models&lt;br&gt;&lt;span&gt;Negotiation benchmark for LLM agents bargaining over terms under changing utility, urgency, and no-deal regimes, reporting mean utility and agreement metrics.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Structured Output Benchmark&lt;/strong&gt; (Overall (%)): leader GPT-5.4 (87.0), 28 models&lt;br&gt;&lt;span&gt;Structured-output benchmark measuring schema-constrained generation with value accuracy, faithfulness, JSON validity, path recall, type safety, and perfect-output rates.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;BenGER&lt;/strong&gt; (Aggregate Accuracy (%)): leader Gemini 3.1 Pro (77.0), 12 models&lt;br&gt;&lt;span&gt;German-law benchmark for subsumption-based legal reasoning, evaluating model answers across Benchathon, ZJS, and doctrinal-principles 
corpora.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;BenchLM&lt;/strong&gt; (Overall Score): leader Claude Mythos 5 (99.0), 123 models&lt;br&gt;&lt;span&gt;Composite LLM leaderboard aggregating current model performance across agentic, coding, reasoning, grounded multimodal, knowledge, multilingual, instruction-following, and math categories.&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;Top-10 New Scores (3)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Claude Fable 5&lt;/strong&gt; on Chatbot Arena (Search): 1237.0 (#3)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Fable 5&lt;/strong&gt; on Epoch AI - ECI: 160.87 (#3)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.8&lt;/strong&gt; on Chatbot Arena (Search): 1203.0 (#11)&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;New #1 Leaders (2)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;LLM Stats (MRCR v2)&lt;/strong&gt;: U2 (76.61) beat Gemma 4 31B by 10.21&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Epoch AI - ECI&lt;/strong&gt;: Claude Fable 5 (Max) (160.87) beat GPT-5.5 Pro (xHigh) by 1.97&lt;/li&gt;&lt;/ul&gt;</content><summary>AI Benchmark Digest — 2026-06-16

=== DAILY ===
NEW BENCHMARKS (7)
  - SWE-Marathon (Pass@1 (%)): leader Claude Opus 4.8 (26.0), 9 models
      Long-horizon software engineering benchmark where coding agents work on realistic repository tasks under marathon-scale time budgets, reporting pass@1 for e</summary></entry><entry><title>AI Benchmark Digest — 2026-06-15</title><id>https://aibenchmarks.dev/digest/2026-06-15</id><updated>2026-06-15T08:24:20.247016+00:00</updated><link href="https://aibenchmarks.dev/#/digest" /><author><name>AI Benchmark Hub</name></author><content type="html">&lt;h2&gt;Daily&lt;/h2&gt;
&lt;h3&gt;New Benchmarks (145)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Open LLM Leaderboard - IFEval&lt;/strong&gt; (Score): leader Llama-3.3-70B-Instruct (89.98), 4576 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Open LLM Leaderboard - BBH&lt;/strong&gt; (Score): leader Benchmaxx-Llama-3.2-1B-Instruct (76.7), 4576 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Open LLM Leaderboard - MATH Level 5&lt;/strong&gt; (Score): leader AceMath-72B-Instruct (71.45), 4576 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Open LLM Leaderboard - GPQA&lt;/strong&gt; (Score): leader L3.3-MS-Nevoria-70b (29.42), 4576 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Open LLM Leaderboard - MuSR&lt;/strong&gt; (Score): leader T3Q-Qwen2.5-14B-Instruct-1M-e3 (38.69), 4576 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Open LLM Leaderboard - MMLU-Pro&lt;/strong&gt; (Score): leader calme-3.2-instruct-78b (70.03), 4576 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AI for Education Pedagogy&lt;/strong&gt; (Accuracy (%)): leader GPT-5.5 (92.1), 216 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AI for Education Pedagogy - Maths&lt;/strong&gt; (Accuracy (%)): leader Gemini-3.1 Pro (94.44), 216 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AI for Education Pedagogy - Primary&lt;/strong&gt; (Accuracy (%)): leader GPT-5.5 (96.71), 216 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AI for Education Pedagogy - Science&lt;/strong&gt; (Accuracy (%)): leader Qwen3.5 Plus (95.08), 216 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AI for Education Pedagogy - Secondary&lt;/strong&gt; (Accuracy (%)): leader GPT-5.5 (91.04), 216 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AI for Education Pedagogy - Social studies&lt;/strong&gt; (Accuracy (%)): leader o3 (91.82), 216 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AI for Education Pedagogy - Technology&lt;/strong&gt; (Accuracy (%)): leader Kimi K2.5 (89.62), 216 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AI for Education SEND&lt;/strong&gt; (Accuracy (%)): leader GPT-5.5 (88.07), 208 
models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AI for Education Visual Maths&lt;/strong&gt; (Accuracy (%)): leader GPT-5.5 (89.87), 61 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AI for Education Visual Maths - Algebra&lt;/strong&gt; (Accuracy (%)): leader Gemini-2.5 Pro (100.0), 61 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AI for Education Visual Maths - Geometry&lt;/strong&gt; (Accuracy (%)): leader GPT-5.5 (88.46), 61 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AI for Education Visual Maths - Measurement&lt;/strong&gt; (Accuracy (%)): leader GPT-5.5 (97.3), 61 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AI for Education Visual Maths - Number and Operations&lt;/strong&gt; (Accuracy (%)): leader GPT-5.5 (83.78), 61 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AI for Education Visual Maths - Statistics and Probability&lt;/strong&gt; (Accuracy (%)): leader GPT-5.5 (85.71), 61 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AI for Education Visual Reasoning&lt;/strong&gt; (Accuracy (%)): leader Gemini-3.5 Flash (86.0), 63 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AI for Education Visual Reasoning - match (figure)&lt;/strong&gt; (Accuracy (%)): leader Gemini-3.5 Flash (85.2), 63 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AI for Education Visual Reasoning - match (process)&lt;/strong&gt; (Accuracy (%)): leader Gemini-3 Flash (77.8), 63 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AI for Education Visual Reasoning - odd one out&lt;/strong&gt; (Accuracy (%)): leader Gemini-3.5 Flash (80.5), 63 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AI for Education Visual Reasoning - pattern completion (2d)&lt;/strong&gt; (Accuracy (%)): leader Gemini-3.1 Pro (86.3), 63 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AI for Education Visual Reasoning - pattern completion (linear)&lt;/strong&gt; (Accuracy (%)): leader Gemini-3.5 Flash (91.5), 63 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AI for Education Visual Reasoning - reasoning by analogy&lt;/strong&gt; (Accuracy (%)): leader Gemini-3.5 Flash (88.8), 63 
models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;SWE-bench Verified (Opus 4.6 System Card)&lt;/strong&gt; (Resolved (%)): leader Claude Opus 4.5 (Thinking) (80.9), 5 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Terminal-Bench 2.0 (Opus 4.6 System Card)&lt;/strong&gt; (Pass Rate (%)): leader Claude Opus 4.6 (Thinking) (65.4), 5 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Tau2 Bench Retail (Opus 4.6 System Card)&lt;/strong&gt; (Score (%)): leader Claude Opus 4.6 (Thinking) (91.9), 5 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Tau2 Bench Telecom (Opus 4.6 System Card)&lt;/strong&gt; (Score (%)): leader Claude Opus 4.6 (Thinking) (99.3), 5 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;MCP-Atlas (Opus 4.6 System Card)&lt;/strong&gt; (Score (%)): leader Claude Opus 4.5 (Thinking) (62.3), 5 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;ARC-AGI-2 Verified (Opus 4.6 System Card)&lt;/strong&gt; (Score (%)): leader Claude Opus 4.6 (Thinking) (68.8), 5 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPQA Diamond (Opus 4.6 System Card)&lt;/strong&gt; (Accuracy (%)): leader GPT-5.2 (93.2), 5 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;MMMU-Pro No Tools (Opus 4.6 System Card)&lt;/strong&gt; (Score (%)): leader Gemini 3 Pro (81.0), 5 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;MMMLU (Opus 4.6 System Card)&lt;/strong&gt; (Accuracy (%)): leader Gemini 3 Pro (91.8), 5 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;SWE-bench Verified (Fable/Mythos)&lt;/strong&gt; (Resolved (%)): leader Claude Mythos 5 (95.5), 5 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Terminal-Bench 2.1 (Fable/Mythos)&lt;/strong&gt; (Mean Reward (%)): leader Claude Mythos 5 (88.0), 5 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;BrowseComp (Fable/Mythos Single-Agent)&lt;/strong&gt; (Score (%)): leader Claude Mythos 5 (88.0), 4 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;BrowseComp (Fable/Mythos Multi-Agent)&lt;/strong&gt; (Score (%)): leader Claude Fable 5 (93.3), 2 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Humanity's Last Exam (Fable/Mythos No Tools)&lt;/strong&gt; (Score 
(%)): leader Claude Mythos 5 (59.0), 5 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Humanity's Last Exam (Fable/Mythos Tools)&lt;/strong&gt; (Score (%)): leader Claude Mythos Preview (64.7), 5 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;CharXiv Reasoning (Fable/Mythos No Tools)&lt;/strong&gt; (Score (%)): leader Claude Mythos 5 (88.9), 3 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;CharXiv Reasoning (Fable/Mythos Tools)&lt;/strong&gt; (Score (%)): leader Claude Mythos 5 (93.5), 3 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;BioMysteryBench Human Solvable (Fable/Mythos)&lt;/strong&gt; (Score (%)): leader Claude Mythos 5 (83.9), 4 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;BioMysteryBench Human Difficult (Fable/Mythos)&lt;/strong&gt; (Score (%)): leader Claude Mythos 5 (46.1), 4 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OSWorld-Verified (Fable/Mythos)&lt;/strong&gt; (Score (%)): leader Claude Mythos Preview (85.4), 7 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;CritPt (Fable/Mythos)&lt;/strong&gt; (Score (%)): leader Claude Mythos 5 (28.6), 4 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;ArxivMath (Fable/Mythos)&lt;/strong&gt; (Score (%)): leader Claude Mythos 5 (78.5), 5 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;RiemannBench (Fable/Mythos)&lt;/strong&gt; (Score (%)): leader Claude Mythos 5 (55.0), 3 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GraphWalks BFS 256K (Fable/Mythos)&lt;/strong&gt; (Score (%)): leader Claude Mythos 5 (91.1), 4 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GraphWalks Parents 256K (Fable/Mythos)&lt;/strong&gt; (Score (%)): leader Claude Mythos 5 (99.96), 4 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;FrontierCode Diamond (Fable/Mythos)&lt;/strong&gt; (Score (%)): leader Claude Fable 5 (29.3), 3 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GDPval-AA (Fable/Mythos)&lt;/strong&gt; (Elo): leader Claude Fable 5 (1932.0), 4 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GDP.pdf (Fable/Mythos)&lt;/strong&gt; (Strict Pass Rate (%)): leader Claude Fable 5 (29.8), 4 
models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AutomationBench (Fable/Mythos)&lt;/strong&gt; (Score (%)): leader Claude Fable 5 (17.4), 5 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Blueprint-Bench 2 (Fable/Mythos)&lt;/strong&gt; (Score (%)): leader Claude Fable 5 (38.6), 5 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Legal Agent Benchmark Public Set (Fable/Mythos)&lt;/strong&gt; (All-Pass Rate (%)): leader Claude Mythos 5 (16.9), 3 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;HealthBench (Fable/Mythos)&lt;/strong&gt; (Score (%)): leader Claude Mythos 5 (62.7), 4 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;HealthBench Professional (Fable/Mythos)&lt;/strong&gt; (Score (%)): leader Claude Mythos 5 (66.0), 4 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenAI GPT-5.5 Launch - GDPval (wins or ties)&lt;/strong&gt; (Score (%)): leader GPT-5.5 (84.9), 6 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenAI GPT-5.5 Launch - FinanceAgent v1.1&lt;/strong&gt; (Score (%)): leader Claude Opus 4.7 (64.4), 5 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenAI GPT-5.5 Launch - Investment Banking Modeling Tasks&lt;/strong&gt; (Score (%)): leader GPT-5.5 Pro (88.6), 4 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenAI GPT-5.5 Launch - BrowseComp&lt;/strong&gt; (Score (%)): leader GPT-5.5 Pro (90.1), 6 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenAI GPT-5.5 Launch - GeneBench&lt;/strong&gt; (Score (%)): leader GPT-5.5 Pro (33.2), 4 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenAI GPT-5.5 Launch - FrontierMath Tier 1-3&lt;/strong&gt; (Score (%)): leader GPT-5.5 Pro (52.4), 6 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenAI GPT-5.5 Launch - FrontierMath Tier 4&lt;/strong&gt; (Score (%)): leader GPT-5.5 Pro (39.6), 6 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenAI GPT-5.5 Launch - GPQA Diamond&lt;/strong&gt; (Score (%)): leader GPT-5.4 Pro (94.4), 5 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenAI GPT-5.5 Launch - Humanity's Last Exam (no tools)&lt;/strong&gt; (Score (%)): leader Claude Opus 4.7 (46.9), 6 
models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenAI GPT-5.5 Launch - Humanity's Last Exam (with tools)&lt;/strong&gt; (Score (%)): leader GPT-5.4 Pro (58.7), 6 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenAI GPT-5.5 Launch - ARC-AGI-1 (Verified)&lt;/strong&gt; (Score (%)): leader Gemini 3.1 Pro (98.0), 5 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenAI GPT-5.5 Launch - ARC-AGI-2 (Verified)&lt;/strong&gt; (Score (%)): leader GPT-5.5 (85.0), 5 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenAI GPT-5.4 Launch - GDPval&lt;/strong&gt; (Score (%)): leader GPT-5.4 (83.0), 5 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenAI GPT-5.4 Launch - FinanceAgent v1.1&lt;/strong&gt; (Score (%)): leader GPT-5.4 Pro (61.5), 4 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenAI GPT-5.4 Launch - Investment Banking Modeling Tasks&lt;/strong&gt; (Score (%)): leader GPT-5.4 (87.3), 5 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenAI GPT-5.4 Launch - BrowseComp&lt;/strong&gt; (Score (%)): leader GPT-5.4 Pro (89.3), 5 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenAI GPT-5.4 Launch - Frontier Science Research&lt;/strong&gt; (Score (%)): leader GPT-5.4 Pro (36.7), 3 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenAI GPT-5.4 Launch - FrontierMath Tier 1-3&lt;/strong&gt; (Score (%)): leader GPT-5.4 Pro (50.0), 3 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenAI GPT-5.4 Launch - FrontierMath Tier 4&lt;/strong&gt; (Score (%)): leader GPT-5.4 Pro (38.0), 4 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenAI GPT-5.4 Launch - GPQA Diamond&lt;/strong&gt; (Score (%)): leader GPT-5.4 Pro (94.4), 5 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenAI GPT-5.4 Launch - Humanity's Last Exam (no tools)&lt;/strong&gt; (Score (%)): leader GPT-5.4 Pro (42.7), 4 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenAI GPT-5.4 Launch - Humanity's Last Exam (with tools)&lt;/strong&gt; (Score (%)): leader GPT-5.4 Pro (58.7), 4 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenAI GPT-5.4 Launch - ARC-AGI-1 (Verified)&lt;/strong&gt; (Score (%)): leader GPT-5.4 
Pro (94.5), 4 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenAI GPT-5.4 Launch - ARC-AGI-2 (Verified)&lt;/strong&gt; (Score (%)): leader GPT-5.4 Pro (83.3), 4 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenAI GPT-5.5 System Card - Tacit Knowledge and Troubleshooting&lt;/strong&gt; (Score (%)): leader GPT-5.5 Pro (81.67), 2 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenAI GPT-5.5 System Card - Biochemistry Knowledge Improvement&lt;/strong&gt; (reward@4 (%)): leader GPT-5.5 Pro (39.26), 3 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenAI GPT-5.5 System Card - Hard Negative Protein Binding Prediction&lt;/strong&gt; (pass@4 (%)): leader GPT-5.4 (Thinking) (3.46), 3 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenAI GPT-5.5 System Card - DNA Sequence Design for TF Binding&lt;/strong&gt; (pass@1 (%)): leader GPT-5.5 Pro (16.5), 3 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenAI GPT-Rosalind-5.5 System Card - ProtocolQA Open-Ended&lt;/strong&gt; (pass@1 (%)): leader GPT-5.5 (37.3), 3 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenAI GPT-Rosalind-5.5 System Card - TroubleshootingBench&lt;/strong&gt; (pass@1 (%)): leader GPT-Rosalind-5.5 (53.31), 3 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenAI GPT-Rosalind-5.5 System Card - Biorisk Knowledge&lt;/strong&gt; (cons@32 (%)): leader GPT-5.5 Pro (81.67), 3 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenAI GPT-Rosalind-5.5 System Card - Multi-select Virology Troubleshooting&lt;/strong&gt; (pass@1 (%)): leader GPT-5.5 Pro (55.34), 3 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenAI GPT-Rosalind-5.5 System Card - Hard Negative Protein Binding Prediction&lt;/strong&gt; (pass@4 (%)): leader GPT-Rosalind-5.5 (3.13), 3 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenAI GPT-Rosalind-5.5 System Card - DNA Sequence Design for TF Binding&lt;/strong&gt; (pass@1 (%)): leader GPT-5.5 Pro (16.5), 3 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Google Gemini 3 Deep Think - ARC-AGI-2&lt;/strong&gt; (Score (%)): leader Gemini 3 Deep Think (84.6), 4 
models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Google Gemini 3 Deep Think - Humanity's Last Exam (no tools)&lt;/strong&gt; (Score (%)): leader Gemini 3 Deep Think (48.4), 4 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Google Gemini 3 Deep Think - Humanity's Last Exam (search and code)&lt;/strong&gt; (Score (%)): leader Gemini 3 Deep Think (53.4), 4 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Google Gemini 3 Deep Think - MMMU-Pro&lt;/strong&gt; (Score (%)): leader Gemini 3 Deep Think (81.5), 4 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Google Gemini 3 Deep Think - International Math Olympiad 2025&lt;/strong&gt; (Score (%)): leader Gemini 3 Deep Think (81.5), 3 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Google Gemini 3 Deep Think - Codeforces&lt;/strong&gt; (Elo): leader Gemini 3 Deep Think (3455.0), 3 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Google Gemini 3 Deep Think - International Physics Olympiad 2025 (theory)&lt;/strong&gt; (Score (%)): leader Gemini 3 Deep Think (87.7), 4 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Google Gemini 3 Deep Think - CMT-Benchmark&lt;/strong&gt; (Pass@8 (%)): leader Gemini 3 Deep Think (50.5), 4 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Google Gemini 3 Deep Think - International Chemistry Olympiad 2025 (theory)&lt;/strong&gt; (Score (%)): leader Gemini 3 Deep Think (82.8), 3 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Qwen3.7 Launch - Terminal Bench 2.0-Terminus&lt;/strong&gt; (Score (%)): leader Qwen 3.7 Max (69.7), 6 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Qwen3.7 Launch - SWE-Verified&lt;/strong&gt; (Resolved (%)): leader Claude Opus 4.6 (Thinking) (80.8), 5 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Qwen3.7 Launch - SWE-Pro&lt;/strong&gt; (Resolved (%)): leader Qwen 3.7 Max (60.6), 6 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Qwen3.7 Launch - SWE-Multilingual&lt;/strong&gt; (Resolved (%)): leader Qwen 3.7 Max (78.3), 5 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Qwen3.7 Launch - NL2repo&lt;/strong&gt; (Score (%)): leader Claude Opus 4.6 (Thinking) (47.6), 6 
models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Qwen3.7 Launch - SciCode&lt;/strong&gt; (Score (%)): leader Qwen 3.7 Max (53.5), 5 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Qwen3.7 Launch - QwenWebDev&lt;/strong&gt; (Elo): leader Claude Opus 4.6 (Thinking) (1617.0), 5 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Qwen3.7 Launch - QwenSVG&lt;/strong&gt; (Elo): leader Qwen 3.7 Max (1608.0), 6 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Qwen3.7 Launch - Qwenclaw&lt;/strong&gt; (Score (%)): leader Claude Opus 4.6 (Thinking) (65.5), 6 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Qwen3.7 Launch - CoWorkBench&lt;/strong&gt; (Score (%)): leader Claude Opus 4.6 (Thinking) (68.2), 6 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Qwen3.7 Launch - ClawEval&lt;/strong&gt; (Score (%)): leader Claude Opus 4.6 (Thinking) (70.4), 6 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Qwen3.7 Launch - Skillsbench&lt;/strong&gt; (Score (%)): leader Qwen 3.7 Max (59.2), 5 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Qwen3.7 Launch - BFCL-V4&lt;/strong&gt; (Score (%)): leader Claude Opus 4.6 (Thinking) (76.7), 6 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Qwen3.7 Launch - MCP-Mark&lt;/strong&gt; (Score (%)): leader Qwen 3.7 Max (60.8), 6 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Qwen3.7 Launch - MCP-Atlas&lt;/strong&gt; (Score (%)): leader Qwen 3.7 Max (76.4), 6 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Qwen3.7 Launch - Vitabench&lt;/strong&gt; (Score (%)): leader DeepSeek V4 Pro (Reasoning, Max Effort) (51.9), 5 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Qwen3.7 Launch - SpreadSheetBench-v1&lt;/strong&gt; (Score (%)): leader Claude Opus 4.6 (Thinking) (89.3), 6 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Qwen3.7 Launch - Kernel Bench L3 - Median Speedup&lt;/strong&gt; (Median speedup (x)): leader Claude Opus 4.6 (Thinking) (2.63), 6 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Qwen3.7 Launch - Kernel Bench L3 - Win Rate&lt;/strong&gt; (Problems faster than torch.compile (%)): leader Claude Opus 4.6 (Thinking) (98.0), 6 
models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Qwen3.7 Launch - Humanity's Last Exam (with tools)&lt;/strong&gt; (Score (%)): leader Kimi K2.6 (Thinking) (54.0), 6 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Qwen3.7 Launch - QwenWorldBench&lt;/strong&gt; (Score (%)): leader Qwen 3.7 Max (57.3), 6 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Qwen3.7 Launch - GPQA Diamond&lt;/strong&gt; (Score (%)): leader Qwen 3.7 Max (92.4), 6 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Qwen3.7 Launch - Humanity's Last Exam&lt;/strong&gt; (Score (%)): leader Qwen 3.7 Max (41.4), 6 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Qwen3.7 Launch - LiveCodeBench&lt;/strong&gt; (Score (%)): leader DeepSeek V4 Pro (Reasoning, Max Effort) (93.5), 5 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Qwen3.7 Launch - HMMT 2026 Feb&lt;/strong&gt; (Score (%)): leader Qwen 3.7 Max (97.1), 6 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Qwen3.7 Launch - IMOAnswerBench&lt;/strong&gt; (Score (%)): leader Qwen 3.7 Max (90.0), 6 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Qwen3.7 Launch - CritPT&lt;/strong&gt; (Score (%)): leader DeepSeek V4 Pro (Reasoning, Max Effort) (12.9), 6 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Qwen3.7 Launch - Apex&lt;/strong&gt; (Score (%)): leader Qwen 3.7 Max (44.5), 6 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Qwen3.7 Launch - MMLU-Pro&lt;/strong&gt; (Score (%)): leader Claude Opus 4.6 (Thinking) (89.7), 6 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Qwen3.7 Launch - MMLU-Redux&lt;/strong&gt; (Score (%)): leader Kimi K2.6 (Thinking) (95.3), 6 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Qwen3.7 Launch - SuperGPQA&lt;/strong&gt; (Score (%)): leader Qwen 3.7 Max (73.6), 6 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Qwen3.7 Launch - IFEval&lt;/strong&gt; (Score (%)): leader Kimi K2.6 (Thinking) (94.5), 6 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Qwen3.7 Launch - IFBench&lt;/strong&gt; (Score (%)): leader Qwen 3.7 Max (79.1), 6 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Qwen3.7 Launch - MRCR-v2 128k&lt;/strong&gt; (Accuracy 
(%)): leader Qwen 3.7 Max (90.4), 6 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Qwen3.7 Launch - WMT24++&lt;/strong&gt; (Score (%)): leader Qwen 3.7 Max (85.8), 6 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Qwen3.7 Launch - MAXIFE&lt;/strong&gt; (Score (%)): leader Qwen 3.7 Max (89.2), 6 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Qwen3.7 Launch - MMMLU&lt;/strong&gt; (Score (%)): leader Claude Opus 4.6 (Thinking) (90.6), 6 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Qwen3.7 Launch - MMLU-ProX&lt;/strong&gt; (Score (%)): leader Qwen 3.7 Max (87.0), 6 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Qwen3.7 Launch - NOVA-63&lt;/strong&gt; (Score (%)): leader Claude Opus 4.6 (Thinking) (59.1), 6 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Qwen3.7 Launch - INCLUDE&lt;/strong&gt; (Score (%)): leader Claude Opus 4.6 (Thinking) (87.4), 6 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Qwen3.7 Launch - Global PIQA&lt;/strong&gt; (Score (%)): leader Qwen 3.7 Max (91.4), 6 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Qwen3.7 Launch - PolyMATH&lt;/strong&gt; (Score (%)): leader Qwen 3.7 Max (86.5), 6 models&lt;/li&gt;&lt;/ul&gt;</content><summary>AI Benchmark Digest — 2026-06-15

=== DAILY ===
NEW BENCHMARKS (145)
  - Open LLM Leaderboard - IFEval (Score): leader Llama-3.3-70B-Instruct (89.98), 4576 models
  - Open LLM Leaderboard - BBH (Score): leader Benchmaxx-Llama-3.2-1B-Instruct (76.7), 4576 models
  - Open LLM Leaderboard - MATH Level </summary></entry><entry><title>AI Benchmark Digest — 2026-06-14</title><id>https://aibenchmarks.dev/digest/2026-06-14</id><updated>2026-06-14T09:01:01.779177+00:00</updated><link href="https://aibenchmarks.dev/#/digest" /><author><name>AI Benchmark Hub</name></author><content type="html">&lt;h2&gt;Daily&lt;/h2&gt;
&lt;h3&gt;New Benchmarks (75)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Ramp SWE-Bench&lt;/strong&gt; (Resolved (%)): leader Claude Fable 5 (87.5), 14 models&lt;br&gt;&lt;span&gt;Ramp Labs benchmark for background coding agents on realistic financial software engineering work, scored by resolved tasks with the mini-SWE-agent harness.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;CADGenBench&lt;/strong&gt; (Aggregate CAD Score): leader Claude Fable 5 (0.4514), 11 models&lt;br&gt;&lt;span&gt;CAD generation and editing benchmark scoring generated CAD artifacts on aggregate geometric and validity metrics across validated submissions.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;FrontierMath - Tier 4 (v2)&lt;/strong&gt; (Accuracy (%, 41 private v2 problems)): leader Claude Fable 5 (max) (87.8), 27 models&lt;br&gt;&lt;span&gt;Current v2 private Tier 4 FrontierMath expansion set from Epoch AI, measuring accuracy on the hardest unpublished research-level mathematics problems.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;FrontierMath - Tiers 1-3 (v2)&lt;/strong&gt; (Accuracy (%, 285 private v2 problems)): leader GPT-5.5 Pro (xhigh) (87.72), 26 models&lt;br&gt;&lt;span&gt;Current v2 private FrontierMath base set from Epoch AI, covering original problems from undergraduate through early-postdoc difficulty across major areas of modern mathematics.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Benchmarks.bio - SpatialBench&lt;/strong&gt; (Pass Rate (%)): leader GPT-5.5 (69.57), 11 models&lt;br&gt;&lt;span&gt;LatchBio agentic benchmark on messy real-world spatial transcriptomics data, with models writing and running analysis workflows across assays, platforms, and task categories.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Benchmarks.bio - scBench&lt;/strong&gt; (Pass Rate (%)): leader Claude Mythos 5 (59.3), 13 models&lt;br&gt;&lt;span&gt;LatchBio agentic benchmark for single-cell RNA-seq analysis, requiring models to perform realistic data cleaning, clustering, 
cell typing, and differential-expression workflows.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Benchmarks.bio - SpatialBench-Long&lt;/strong&gt; (Pass Rate (%)): leader Gemini 3.5 Flash (11.11), 12 models&lt;br&gt;&lt;span&gt;Long-form Benchmarks.bio spatial transcriptomics tasks that require multi-step biological data analysis, tool use, and synthesis over larger assay contexts.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Benchmarks.bio - EpiBench&lt;/strong&gt; (Pass Rate (%)): leader GPT-5.5 (44.97), 11 models&lt;br&gt;&lt;span&gt;Benchmarks.bio epigenomics benchmark covering real assays such as chromatin accessibility, binding, and methylation analyses with deterministic graders.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Agent Arena&lt;/strong&gt; (Net Improvement (%)): leader Grok 4.3 xAI · Proprietary (18.3), 25 models&lt;br&gt;&lt;span&gt;Arena.ai agent leaderboard measuring net improvement on real-world tool orchestration sessions with success, steerability, recovery, and hallucination metrics.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Agent Arena - Confirmed Success&lt;/strong&gt; (Confirmed Success (%)): leader Claude Fable 5 (High) (17.21), 25 models&lt;br&gt;&lt;span&gt;Agent Arena submetric tracking confirmed successful completion rate on real-world agent sessions.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Agent Arena - Praise vs Complaint&lt;/strong&gt; (Praise vs Complaint (%)): leader Claude Fable 5 (High) (27.74), 25 models&lt;br&gt;&lt;span&gt;Agent Arena submetric comparing user praise against complaints across agent sessions.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Agent Arena - Steerability&lt;/strong&gt; (Steerability (%)): leader Nemotron 3 Ultra (23.87), 25 models&lt;br&gt;&lt;span&gt;Agent Arena submetric measuring how well models adapt to user steering during tool-use sessions.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Agent Arena - Bash Recovery&lt;/strong&gt; (Bash Recovery (%)): leader Grok 4.3 xAI · 
Proprietary (60.23), 25 models&lt;br&gt;&lt;span&gt;Agent Arena submetric measuring recovery from shell or command-line failures in agent sessions.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Agent Arena - Tool Hallucination&lt;/strong&gt; (Tool Hallucination (%)): leader Grok 4.3 xAI · Proprietary (0.26), 25 models&lt;br&gt;&lt;span&gt;Agent Arena submetric measuring tool hallucination rate; lower values indicate fewer invented or invalid tool uses.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Agents' Last Exam&lt;/strong&gt; (Pass Rate (%)): leader GPT-5.5 (24.0), 18 models&lt;br&gt;&lt;span&gt;Snorkel benchmark of long-horizon economically valuable agent tasks across many industries, reporting workflow pass rate and score.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;WolfBench&lt;/strong&gt; (Average Score (%)): leader GPT-5.5 (77.0), 27 models&lt;br&gt;&lt;span&gt;Agent benchmark based on Terminal-Bench 2.0 that compares harnesses and models across repeated terminal task runs using aggregate score statistics.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Appwrite Arena (With Skills)&lt;/strong&gt; (Overall Score (%)): leader GPT-5.5 (97.7), 16 models&lt;br&gt;&lt;span&gt;Appwrite Arena evaluation of model knowledge and reasoning about Appwrite development tasks when models can use Appwrite skills.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Appwrite Arena (Without Skills)&lt;/strong&gt; (Overall Score (%)): leader Claude Fable 5 (97.7), 16 models&lt;br&gt;&lt;span&gt;Appwrite Arena evaluation of model knowledge and reasoning about Appwrite development tasks without Appwrite skill assistance.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Terminal-Bench 2.1&lt;/strong&gt; (Accuracy (%)): leader GPT-5.5 (83.4), 6 models&lt;br&gt;&lt;span&gt;Official Terminal-Bench 2.1 leaderboard measuring agent success on realistic command-line tasks, using each model's best available harness row.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Terminal-Bench 2.1 (Claude 
Code)&lt;/strong&gt; (Accuracy (%)): leader Claude Opus 4.8 (78.9), 3 models&lt;br&gt;&lt;span&gt;Terminal-Bench 2.1 results for the Claude Code harness, measuring command-line task completion by model.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Terminal-Bench 2.1 (Terminus 2)&lt;/strong&gt; (Accuracy (%)): leader GPT-5.5 (78.2), 5 models&lt;br&gt;&lt;span&gt;Terminal-Bench 2.1 results for the Terminus 2 harness, measuring command-line task completion by model.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Vals AI Finance Agent v2&lt;/strong&gt; (Accuracy (%)): leader gemini-3.5-flash (57.86), 29 models&lt;br&gt;&lt;span&gt;Updated Vals AI financial-research agent benchmark over SEC filings and supporting documents, measuring completion accuracy on realistic analyst workflows.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Vals AI Public Benefits Bench&lt;/strong&gt; (Accuracy (%)): leader claude-fable-5 (71.65), 13 models&lt;br&gt;&lt;span&gt;SNAP public-benefits guidance benchmark measuring whether models answer benefits questions accurately while following eligibility and documentation rules.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Vals AI Terminal-Bench 2.1&lt;/strong&gt; (Accuracy (%)): leader claude-fable-5 (80.52), 30 models&lt;br&gt;&lt;span&gt;Updated Terminal-Bench 2.1 evaluation from Vals AI, measuring agentic command-line task completion in sandboxed software and systems environments.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Vals AI LiveCodeBench&lt;/strong&gt; (Accuracy (%)): leader claude-fable-5 (89.78), 121 models&lt;br&gt;&lt;span&gt;Vals AI run of LiveCodeBench coding problems, measuring pass rates on recent contest-style programming tasks intended to reduce contamination.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Vals AI GPQA&lt;/strong&gt; (Accuracy (%)): leader gemini-3.1-pro-preview (95.45), 115 models&lt;br&gt;&lt;span&gt;Vals AI run of GPQA graduate-level science questions, measuring difficult expert-domain reasoning 
accuracy.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Vals AI MMLU-Pro&lt;/strong&gt; (Accuracy (%)): leader claude-fable-5 (91.5), 114 models&lt;br&gt;&lt;span&gt;Vals AI run of MMLU-Pro multitask academic questions, using harder multi-choice problems across STEM, humanities, and professional domains.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Vals AI MMMU&lt;/strong&gt; (Accuracy (%)): leader claude-fable-5 (89.31), 76 models&lt;br&gt;&lt;span&gt;Vals AI run of MMMU multimodal college-level subject questions, measuring visual and textual academic reasoning.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Vals AI SWE-bench Verified&lt;/strong&gt; (Resolved (%)): leader claude-fable-5 (95.0), 57 models&lt;br&gt;&lt;span&gt;Vals AI SWE-bench Verified leaderboard, measuring the percentage of real GitHub issues resolved by coding agents.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GDP.pdf&lt;/strong&gt; (Strict Pass Rate (%)): leader Claude Fable 5 (30.0), 12 models&lt;br&gt;&lt;span&gt;Surge AI document-reasoning benchmark over 100 professional PDF workflows, scored by strict pass rate against expert-written rubrics.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Riemann-bench&lt;/strong&gt; (Score (%)): leader Claude Fable 5 (55.0), 15 models&lt;br&gt;&lt;span&gt;Surge AI frontier mathematics benchmark with advanced research-style problems sourced from mathematicians and scored by solution correctness.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;SWE-bench Pro (Anthropic Scaffold)&lt;/strong&gt; (Pass@1 (%)): leader Claude Mythos 5 (80.3), 6 models&lt;br&gt;&lt;span&gt;Anthropic system-card run of SWE-bench Pro, measuring pass@1 on production software engineering issues using Anthropic scaffold settings.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OfficeQA Pro&lt;/strong&gt; (Correctness (%)): leader Claude Fable 5 (57.9), 4 models&lt;br&gt;&lt;span&gt;Hard OfficeQA subset for frontier document agents, requiring grounded search and numerical reasoning 
over U.S. Treasury Bulletin documents.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Real-World Finance v2&lt;/strong&gt; (Elo): leader Claude Fable 5 (1374.0), 4 models&lt;br&gt;&lt;span&gt;Anthropic long-horizon finance workflow evaluation using pairwise preference grading and Elo ratings over realistic professional deliverables.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Real-World Finance v1&lt;/strong&gt; (Score (%)): leader Claude Mythos Preview (70.9), 4 models&lt;br&gt;&lt;span&gt;Anthropic curated finance benchmark of 53 tasks evaluated against reference answers with a model-based grader.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Legal Agent Benchmark (Harvey Held-Out)&lt;/strong&gt; (All-Pass Rate (%)): leader Claude Fable 5 (13.3), 5 models&lt;br&gt;&lt;span&gt;Harvey legal-agent held-out evaluation using closed-universe matter files and expert rubrics, scored by all-pass task success.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Toolathlon (Anthropic Internal Harness)&lt;/strong&gt; (Pass@1 (%)): leader Claude Fable 5 (61.7), 7 models&lt;br&gt;&lt;span&gt;Anthropic internal Toolathlon harness over 108 tool-use tasks, reporting pass@1 for agentic workflow completion.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;SWE-bench Verified (Anthropic Scaffold)&lt;/strong&gt; (Resolved (%)): leader Claude Opus 4.8 (88.6), 3 models&lt;br&gt;&lt;span&gt;Anthropic system-card run of SWE-bench Verified, measuring real GitHub issue resolution with Anthropic scaffold settings.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;SWE-bench Multilingual (Anthropic Scaffold)&lt;/strong&gt; (Resolved (%)): leader Claude Opus 4.8 (84.4), 2 models&lt;br&gt;&lt;span&gt;Anthropic system-card run of SWE-bench Multilingual, measuring multilingual software issue resolution with Anthropic scaffold settings.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;SWE-bench Multimodal (Anthropic Internal Harness)&lt;/strong&gt; (Resolved (%)): leader Claude Opus 4.8 (38.4), 2 
models&lt;br&gt;&lt;span&gt;Anthropic internal multimodal SWE-bench harness, measuring software issue resolution that requires visual or multimodal context.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Humanity's Last Exam (Anthropic No Tools)&lt;/strong&gt; (Accuracy (%)): leader Claude Opus 4.8 (49.8), 4 models&lt;br&gt;&lt;span&gt;Anthropic system-card run of Humanity's Last Exam without tools, covering expert-level academic reasoning across many domains.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Humanity's Last Exam (Anthropic Tools)&lt;/strong&gt; (Accuracy (%)): leader Claude Opus 4.8 (57.9), 4 models&lt;br&gt;&lt;span&gt;Anthropic system-card run of Humanity's Last Exam with tools, covering expert-level academic reasoning across many domains.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;ChartQAPro (Anthropic No Tools)&lt;/strong&gt; (Accuracy (%)): leader Claude Opus 4.8 (69.4), 2 models&lt;br&gt;&lt;span&gt;Anthropic no-tool run of ChartQAPro, testing chart understanding and quantitative visual reasoning.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;ChartQAPro (Anthropic Tools)&lt;/strong&gt; (Accuracy (%)): leader Claude Opus 4.8 (72.3), 2 models&lt;br&gt;&lt;span&gt;Anthropic tool-enabled run of ChartQAPro, testing chart understanding and quantitative visual reasoning.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;ScreenSpot-Pro (Anthropic No Tools)&lt;/strong&gt; (Accuracy (%)): leader Claude Opus 4.8 (82.3), 2 models&lt;br&gt;&lt;span&gt;Anthropic no-tool run of ScreenSpot-Pro, evaluating GUI grounding and screen element localization.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;ScreenSpot-Pro (Anthropic Tools)&lt;/strong&gt; (Accuracy (%)): leader Claude Opus 4.8 (87.9), 2 models&lt;br&gt;&lt;span&gt;Anthropic tool-enabled run of ScreenSpot-Pro, evaluating GUI grounding and screen element localization.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GraphWalks BFS 256K (Anthropic)&lt;/strong&gt; (F1 Score (%)): leader Claude Opus 4.8 (85.9), 
4 models&lt;br&gt;&lt;span&gt;Anthropic GraphWalks long-context graph traversal evaluation using breadth-first-search tasks at 256K context.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GraphWalks Parents 256K (Anthropic)&lt;/strong&gt; (F1 Score (%)): leader Claude Opus 4.8 (99.3), 4 models&lt;br&gt;&lt;span&gt;Anthropic GraphWalks long-context graph traversal evaluation using parent-pointer recovery tasks at 256K context.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;USAMO 2026 (Anthropic)&lt;/strong&gt; (Accuracy (%)): leader Claude Opus 4.8 (96.7), 2 models&lt;br&gt;&lt;span&gt;Anthropic system-card evaluation on 2026 USAMO-style olympiad math problems, scored by answer correctness.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;ArXivMath Mar-Apr 2026 (Anthropic)&lt;/strong&gt; (Accuracy (%)): leader Claude Opus 4.8 (71.82), 3 models&lt;br&gt;&lt;span&gt;Anthropic system-card evaluation on recent arXiv mathematics problems from March and April 2026.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OfficeQA (Anthropic Internal Harness)&lt;/strong&gt; (Exact Match (%)): leader Claude Opus 4.8 (77.6), 2 models&lt;br&gt;&lt;span&gt;Anthropic internal OfficeQA document-agent benchmark, requiring grounded search and numerical reasoning over office documents.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OfficeQA Pro (Anthropic Internal Harness)&lt;/strong&gt; (Exact Match (%)): leader Claude Opus 4.8 (66.2), 2 models&lt;br&gt;&lt;span&gt;Anthropic internal OfficeQA Pro hard subset, requiring grounded search and numerical reasoning over office documents.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;ChartMuseum (Anthropic No Tools)&lt;/strong&gt; (Accuracy (%)): leader Claude Opus 4.8 (75.8), 2 models&lt;br&gt;&lt;span&gt;Anthropic no-tool run of ChartMuseum, evaluating visual chart interpretation across diverse chart types.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;ChartMuseum (Anthropic Tools)&lt;/strong&gt; (Accuracy (%)): leader Claude Opus 4.8 (89.7), 2 
models&lt;br&gt;&lt;span&gt;Anthropic tool-enabled run of ChartMuseum, evaluating visual chart interpretation across diverse chart types.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LAB-Bench FigQA (Anthropic No Tools)&lt;/strong&gt; (Accuracy (%)): leader Claude Opus 4.8 (80.4), 2 models&lt;br&gt;&lt;span&gt;Anthropic no-tool run of LAB-Bench FigQA, testing scientific figure understanding and reasoning.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LAB-Bench FigQA (Anthropic Tools)&lt;/strong&gt; (Accuracy (%)): leader Claude Opus 4.8 (87.3), 2 models&lt;br&gt;&lt;span&gt;Anthropic tool-enabled run of LAB-Bench FigQA, testing scientific figure understanding and reasoning.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;CharXiv Reasoning (Anthropic No Tools)&lt;/strong&gt; (Accuracy (%)): leader Claude Opus 4.7 (81.3), 2 models&lt;br&gt;&lt;span&gt;Anthropic no-tool run of CharXiv Reasoning, evaluating reasoning over scientific charts from arXiv papers.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;CharXiv Reasoning (Anthropic Tools)&lt;/strong&gt; (Accuracy (%)): leader Claude Opus 4.7 (90.1), 2 models&lt;br&gt;&lt;span&gt;Anthropic tool-enabled run of CharXiv Reasoning, evaluating reasoning over scientific charts from arXiv papers.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;HealthBench Professional (Anthropic)&lt;/strong&gt; (Length-Adjusted Score (%)): leader Claude Opus 4.8 (55.8), 3 models&lt;br&gt;&lt;span&gt;Anthropic system-card run of HealthBench Professional, measuring clinical and healthcare reasoning with length-adjusted scoring.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GMMLU (Anthropic)&lt;/strong&gt; (Average Accuracy (%)): leader Gemini 3.1 Pro (92.2), 5 models&lt;br&gt;&lt;span&gt;Anthropic system-card run of Global MMLU, measuring multilingual academic and professional knowledge.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;BioPipelineBench Verified (Anthropic)&lt;/strong&gt; (Score (%)): leader Claude Mythos Preview (88.1), 4 
models&lt;br&gt;&lt;span&gt;Anthropic system-card run of BioPipelineBench Verified, measuring biological data-analysis workflow completion.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;BioMysteryBench Verified - Human Solvable (Anthropic)&lt;/strong&gt; (Score (%)): leader Claude Mythos Preview (82.6), 4 models&lt;br&gt;&lt;span&gt;Anthropic system-card run of BioMysteryBench Verified human-solvable tasks, testing biological mystery problem solving.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;BioMysteryBench Verified - Human Difficult (Anthropic)&lt;/strong&gt; (Score (%)): leader Claude Opus 4.8 (40.0), 4 models&lt;br&gt;&lt;span&gt;Anthropic system-card run of BioMysteryBench Verified human-difficult tasks, testing hard biological mystery problem solving.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LatchBio SpatialBench (Anthropic)&lt;/strong&gt; (Score (%)): leader Claude Mythos Preview (53.8), 4 models&lt;br&gt;&lt;span&gt;Anthropic system-card run of LatchBio SpatialBench, measuring spatial transcriptomics analysis workflows.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LatchBio SingleCellBench (Anthropic)&lt;/strong&gt; (Score (%)): leader Claude Opus 4.8 (58.2), 4 models&lt;br&gt;&lt;span&gt;Anthropic system-card run of LatchBio SingleCellBench, measuring single-cell RNA-seq analysis workflows.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Structural Biology (Anthropic)&lt;/strong&gt; (Score (%)): leader Claude Mythos Preview (81.6), 4 models&lt;br&gt;&lt;span&gt;Anthropic system-card structural biology evaluation, testing biomolecular structure reasoning and analysis.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;ProteinGym Hard (Anthropic)&lt;/strong&gt; (Rank Correlation (%)): leader Claude Mythos Preview (43.1), 4 models&lt;br&gt;&lt;span&gt;Anthropic system-card run of the hard ProteinGym subset, measuring protein variant effect prediction via rank correlation.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Organic Chemistry 
(Anthropic)&lt;/strong&gt; (Score (%)): leader Claude Mythos Preview (86.5), 4 models&lt;br&gt;&lt;span&gt;Anthropic system-card organic chemistry evaluation, testing reaction and molecule reasoning.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Protocol Troubleshooting (Anthropic)&lt;/strong&gt; (Score (%)): leader Claude Mythos Preview (69.6), 4 models&lt;br&gt;&lt;span&gt;Anthropic system-card protocol troubleshooting benchmark, testing diagnosis of laboratory protocol failures.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LABBench2 - Patent Questions (Anthropic)&lt;/strong&gt; (Score (%)): leader Claude Opus 4.8 (68.8), 3 models&lt;br&gt;&lt;span&gt;Anthropic system-card LABBench2 patent-question subset, testing life-science document reasoning over patent material.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LABBench2 - Clinical Trial Questions (Anthropic)&lt;/strong&gt; (Score (%)): leader Claude Mythos Preview (86.3), 3 models&lt;br&gt;&lt;span&gt;Anthropic system-card LABBench2 clinical-trial subset, testing life-science reasoning over trial documents.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LABBench2 - Table Reading (Anthropic)&lt;/strong&gt; (Score (%)): leader Claude Opus 4.8 (77.2), 2 models&lt;br&gt;&lt;span&gt;Anthropic system-card LABBench2 table-reading subset, testing scientific table comprehension.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LABBench2 - Supplementary Materials (Anthropic)&lt;/strong&gt; (Score (%)): leader Claude Opus 4.8 (58.9), 2 models&lt;br&gt;&lt;span&gt;Anthropic system-card LABBench2 supplementary-materials subset, testing reasoning over scientific supporting files.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Agent Security League - Functional Correctness&lt;/strong&gt; (Functional Correctness (%)): leader GPT-5.5 (84.9), 15 models&lt;br&gt;&lt;span&gt;Endor Labs coding-agent benchmark measuring whether agents functionally complete security-sensitive software 
tasks.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Agent Security League - Security Correctness&lt;/strong&gt; (Security Correctness (%)): leader GPT-5.5 (24.0), 15 models&lt;br&gt;&lt;span&gt;Endor Labs coding-agent benchmark measuring whether completed software tasks avoid introducing or preserving security vulnerabilities.&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;New #1 Leaders (1)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;OpenClawProBench&lt;/strong&gt;: GLM-5.2 (81.3) beat intern-s2-preview by 4.6&lt;/li&gt;&lt;/ul&gt;
&lt;hr/&gt;
&lt;h2&gt;Weekly&lt;/h2&gt;
&lt;h3&gt;New Benchmarks (86)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;FrontierCode Diamond&lt;/strong&gt; (Score (%)): leader Claude Opus 4.8 (13.4), 12 models&lt;br&gt;&lt;span&gt;Hardest 50 FrontierCode production-code tasks from Cognition, measuring whether maintainers would merge model PRs using blocker criteria and quality rubrics.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;FrontierCode Main&lt;/strong&gt; (Score (%)): leader Claude Opus 4.8 (34.3), 12 models&lt;br&gt;&lt;span&gt;100 hardest FrontierCode production-code tasks, including Diamond, scored by maintainer-style mergeability criteria across correctness, tests, scope, style, and codebase standards.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;FrontierCode Extended&lt;/strong&gt; (Score (%)): leader Claude Opus 4.8 (51.8), 12 models&lt;br&gt;&lt;span&gt;Full 150-task FrontierCode benchmark from Cognition, evaluating production-quality coding agents on maintainer-authored open source repository work.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Ramp SWE-Bench&lt;/strong&gt; (Resolved (%)): leader Claude Fable 5 (87.5), 14 models&lt;br&gt;&lt;span&gt;Ramp Labs benchmark for background coding agents on realistic financial software engineering work, scored by resolved tasks with the mini-SWE-agent harness.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;CADGenBench&lt;/strong&gt; (Aggregate CAD Score): leader Claude Fable 5 (0.4514), 11 models&lt;br&gt;&lt;span&gt;CAD generation and editing benchmark scoring generated CAD artifacts on aggregate geometric and validity metrics across validated submissions.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;FrontierMath - Tier 4 (v2)&lt;/strong&gt; (Accuracy (%, 41 private v2 problems)): leader Claude Fable 5 (max) (87.8), 27 models&lt;br&gt;&lt;span&gt;Current v2 private Tier 4 FrontierMath expansion set from Epoch AI, measuring accuracy on the hardest unpublished research-level mathematics 
problems.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;FrontierMath - Tiers 1-3 (v2)&lt;/strong&gt; (Accuracy (%, 285 private v2 problems)): leader GPT-5.5 Pro (xhigh) (87.72), 26 models&lt;br&gt;&lt;span&gt;Current v2 private FrontierMath base set from Epoch AI, covering original problems from undergraduate through early-postdoc difficulty across major areas of modern mathematics.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Benchmarks.bio - SpatialBench&lt;/strong&gt; (Pass Rate (%)): leader GPT-5.5 (69.57), 11 models&lt;br&gt;&lt;span&gt;LatchBio agentic benchmark on messy real-world spatial transcriptomics data, with models writing and running analysis workflows across assays, platforms, and task categories.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Benchmarks.bio - scBench&lt;/strong&gt; (Pass Rate (%)): leader Claude Mythos 5 (59.3), 13 models&lt;br&gt;&lt;span&gt;LatchBio agentic benchmark for single-cell RNA-seq analysis, requiring models to perform realistic data cleaning, clustering, cell typing, and differential-expression workflows.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Benchmarks.bio - SpatialBench-Long&lt;/strong&gt; (Pass Rate (%)): leader Gemini 3.5 Flash (11.11), 12 models&lt;br&gt;&lt;span&gt;Long-form Benchmarks.bio spatial transcriptomics tasks that require multi-step biological data analysis, tool use, and synthesis over larger assay contexts.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Benchmarks.bio - EpiBench&lt;/strong&gt; (Pass Rate (%)): leader GPT-5.5 (44.97), 11 models&lt;br&gt;&lt;span&gt;Benchmarks.bio epigenomics benchmark covering real assays such as chromatin accessibility, binding, and methylation analyses with deterministic graders.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Agent Arena&lt;/strong&gt; (Net Improvement (%)): leader Grok 4.3 xAI · Proprietary (18.3), 25 models&lt;br&gt;&lt;span&gt;Arena.ai agent leaderboard measuring net improvement on real-world tool orchestration sessions with success, 
steerability, recovery, and hallucination metrics.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Agent Arena - Confirmed Success&lt;/strong&gt; (Confirmed Success (%)): leader Claude Fable 5 (High) (17.21), 25 models&lt;br&gt;&lt;span&gt;Agent Arena submetric tracking confirmed successful completion rate on real-world agent sessions.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Agent Arena - Praise vs Complaint&lt;/strong&gt; (Praise vs Complaint (%)): leader Claude Fable 5 (High) (27.74), 25 models&lt;br&gt;&lt;span&gt;Agent Arena submetric comparing user praise against complaints across agent sessions.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Agent Arena - Steerability&lt;/strong&gt; (Steerability (%)): leader Nemotron 3 Ultra (23.87), 25 models&lt;br&gt;&lt;span&gt;Agent Arena submetric measuring how well models adapt to user steering during tool-use sessions.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Agent Arena - Bash Recovery&lt;/strong&gt; (Bash Recovery (%)): leader Grok 4.3 xAI · Proprietary (60.23), 25 models&lt;br&gt;&lt;span&gt;Agent Arena submetric measuring recovery from shell or command-line failures in agent sessions.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Agent Arena - Tool Hallucination&lt;/strong&gt; (Tool Hallucination (%)): leader Grok 4.3 xAI · Proprietary (0.26), 25 models&lt;br&gt;&lt;span&gt;Agent Arena submetric measuring tool hallucination rate; lower values indicate fewer invented or invalid tool uses.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Agents' Last Exam&lt;/strong&gt; (Pass Rate (%)): leader GPT-5.5 (24.0), 18 models&lt;br&gt;&lt;span&gt;Snorkel benchmark of long-horizon economically valuable agent tasks across many industries, reporting workflow pass rate and score.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;WolfBench&lt;/strong&gt; (Average Score (%)): leader GPT-5.5 (77.0), 27 models&lt;br&gt;&lt;span&gt;Agent benchmark based on Terminal-Bench 2.0 that compares harnesses and models across 
repeated terminal task runs using aggregate score statistics.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Appwrite Arena (With Skills)&lt;/strong&gt; (Overall Score (%)): leader GPT-5.5 (97.7), 16 models&lt;br&gt;&lt;span&gt;Appwrite Arena evaluation of model knowledge and reasoning about Appwrite development tasks when models can use Appwrite skills.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Appwrite Arena (Without Skills)&lt;/strong&gt; (Overall Score (%)): leader Claude Fable 5 (97.7), 16 models&lt;br&gt;&lt;span&gt;Appwrite Arena evaluation of model knowledge and reasoning about Appwrite development tasks without Appwrite skill assistance.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Terminal-Bench 2.1&lt;/strong&gt; (Accuracy (%)): leader GPT-5.5 (83.4), 6 models&lt;br&gt;&lt;span&gt;Official Terminal-Bench 2.1 leaderboard measuring agent success on realistic command-line tasks, using each model's best available harness row.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Terminal-Bench 2.1 (Claude Code)&lt;/strong&gt; (Accuracy (%)): leader Claude Opus 4.8 (78.9), 3 models&lt;br&gt;&lt;span&gt;Terminal-Bench 2.1 results for the Claude Code harness, measuring command-line task completion by model.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Terminal-Bench 2.1 (Terminus 2)&lt;/strong&gt; (Accuracy (%)): leader GPT-5.5 (78.2), 5 models&lt;br&gt;&lt;span&gt;Terminal-Bench 2.1 results for the Terminus 2 harness, measuring command-line task completion by model.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Vals AI Finance Agent v2&lt;/strong&gt; (Accuracy (%)): leader gemini-3.5-flash (57.86), 29 models&lt;br&gt;&lt;span&gt;Updated Vals AI financial-research agent benchmark over SEC filings and supporting documents, measuring completion accuracy on realistic analyst workflows.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Vals AI Public Benefits Bench&lt;/strong&gt; (Accuracy (%)): leader claude-fable-5 (71.65), 13 models&lt;br&gt;&lt;span&gt;SNAP 
public-benefits guidance benchmark measuring whether models answer benefits questions accurately while following eligibility and documentation rules.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Vals AI Terminal-Bench 2.1&lt;/strong&gt; (Accuracy (%)): leader claude-fable-5 (80.52), 30 models&lt;br&gt;&lt;span&gt;Updated Terminal-Bench 2.1 evaluation from Vals AI, measuring agentic command-line task completion in sandboxed software and systems environments.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Vals AI LiveCodeBench&lt;/strong&gt; (Accuracy (%)): leader claude-fable-5 (89.78), 121 models&lt;br&gt;&lt;span&gt;Vals AI run of LiveCodeBench coding problems, measuring pass rates on recent contest-style programming tasks intended to reduce contamination.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Vals AI GPQA&lt;/strong&gt; (Accuracy (%)): leader gemini-3.1-pro-preview (95.45), 115 models&lt;br&gt;&lt;span&gt;Vals AI run of GPQA graduate-level science questions, measuring difficult expert-domain reasoning accuracy.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Vals AI MMLU-Pro&lt;/strong&gt; (Accuracy (%)): leader claude-fable-5 (91.5), 114 models&lt;br&gt;&lt;span&gt;Vals AI run of MMLU-Pro multitask academic questions, using harder multi-choice problems across STEM, humanities, and professional domains.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Vals AI MMMU&lt;/strong&gt; (Accuracy (%)): leader claude-fable-5 (89.31), 76 models&lt;br&gt;&lt;span&gt;Vals AI run of MMMU multimodal college-level subject questions, measuring visual and textual academic reasoning.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Vals AI SWE-bench Verified&lt;/strong&gt; (Resolved (%)): leader claude-fable-5 (95.0), 57 models&lt;br&gt;&lt;span&gt;Vals AI SWE-bench Verified leaderboard, measuring the percentage of real GitHub issues resolved by coding agents.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Icelandic LLM Leaderboard - Average&lt;/strong&gt; (Average Score 
(%)): leader Gemini 3.1 Pro Preview (88.54), 86 models&lt;br&gt;&lt;span&gt;Icelandic LLM leaderboard aggregating WinoGrande-IS, GED, Inflection, Belebele-IS, ARC-Challenge-IS, and WikiQA-IS for Icelandic language understanding and reasoning.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Icelandic LLM - WinoGrande-IS&lt;/strong&gt; (Score (%)): leader Gemini 3.1 Pro Preview (96.14), 86 models&lt;br&gt;&lt;span&gt;Icelandic WinoGrande common-sense reasoning score.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Icelandic LLM - GED&lt;/strong&gt; (Score (%)): leader Claude Fable 5 (91.5), 86 models&lt;br&gt;&lt;span&gt;Icelandic grammatical error detection score.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Icelandic LLM - Inflection&lt;/strong&gt; (Score (%)): leader GPT-5.5 (97.96), 86 models&lt;br&gt;&lt;span&gt;Icelandic morphological inflection score.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Icelandic LLM - Belebele-IS&lt;/strong&gt; (Score (%)): leader Gemini 3.1 Pro Preview (95.0), 86 models&lt;br&gt;&lt;span&gt;Icelandic Belebele reading-comprehension score.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Icelandic LLM - ARC-Challenge-IS&lt;/strong&gt; (Score (%)): leader GPT-5.5 (95.22), 86 models&lt;br&gt;&lt;span&gt;Icelandic ARC-Challenge science and commonsense reasoning score.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Icelandic LLM - WikiQA-IS&lt;/strong&gt; (Score (%)): leader Claude Fable 5 (75.39), 86 models&lt;br&gt;&lt;span&gt;Icelandic WikiQA question-answering score.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GDP.pdf&lt;/strong&gt; (Strict Pass Rate (%)): leader Claude Fable 5 (30.0), 12 models&lt;br&gt;&lt;span&gt;Surge AI document-reasoning benchmark over 100 professional PDF workflows, scored by strict pass rate against expert-written rubrics.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Riemann-bench&lt;/strong&gt; (Score (%)): leader Claude Fable 5 (55.0), 15 models&lt;br&gt;&lt;span&gt;Surge AI frontier mathematics 
benchmark with advanced research-style problems sourced from mathematicians and scored by solution correctness.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;SWE-bench Pro (Anthropic Scaffold)&lt;/strong&gt; (Pass@1 (%)): leader Claude Mythos 5 (80.3), 6 models&lt;br&gt;&lt;span&gt;Anthropic system-card run of SWE-bench Pro, measuring pass@1 on production software engineering issues using Anthropic scaffold settings.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OfficeQA Pro&lt;/strong&gt; (Correctness (%)): leader Claude Fable 5 (57.9), 4 models&lt;br&gt;&lt;span&gt;Hard OfficeQA subset for frontier document agents, requiring grounded search and numerical reasoning over U.S. Treasury Bulletin documents.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Real-World Finance v2&lt;/strong&gt; (Elo): leader Claude Fable 5 (1374.0), 4 models&lt;br&gt;&lt;span&gt;Anthropic long-horizon finance workflow evaluation using pairwise preference grading and Elo ratings over realistic professional deliverables.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Real-World Finance v1&lt;/strong&gt; (Score (%)): leader Claude Mythos Preview (70.9), 4 models&lt;br&gt;&lt;span&gt;Anthropic curated finance benchmark of 53 tasks evaluated against reference answers with a model-based grader.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Legal Agent Benchmark (Harvey Held-Out)&lt;/strong&gt; (All-Pass Rate (%)): leader Claude Fable 5 (13.3), 5 models&lt;br&gt;&lt;span&gt;Harvey legal-agent held-out evaluation using closed-universe matter files and expert rubrics, scored by all-pass task success.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Toolathlon (Anthropic Internal Harness)&lt;/strong&gt; (Pass@1 (%)): leader Claude Fable 5 (61.7), 7 models&lt;br&gt;&lt;span&gt;Anthropic internal Toolathlon harness over 108 tool-use tasks, reporting pass@1 for agentic workflow completion.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;SWE-bench Verified (Anthropic Scaffold)&lt;/strong&gt; (Resolved 
(%)): leader Claude Opus 4.8 (88.6), 3 models&lt;br&gt;&lt;span&gt;Anthropic system-card run of SWE-bench Verified, measuring real GitHub issue resolution with Anthropic scaffold settings.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;SWE-bench Multilingual (Anthropic Scaffold)&lt;/strong&gt; (Resolved (%)): leader Claude Opus 4.8 (84.4), 2 models&lt;br&gt;&lt;span&gt;Anthropic system-card run of SWE-bench Multilingual, measuring multilingual software issue resolution with Anthropic scaffold settings.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;SWE-bench Multimodal (Anthropic Internal Harness)&lt;/strong&gt; (Resolved (%)): leader Claude Opus 4.8 (38.4), 2 models&lt;br&gt;&lt;span&gt;Anthropic internal multimodal SWE-bench harness, measuring software issue resolution that requires visual or multimodal context.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Humanity's Last Exam (Anthropic No Tools)&lt;/strong&gt; (Accuracy (%)): leader Claude Opus 4.8 (49.8), 4 models&lt;br&gt;&lt;span&gt;Anthropic system-card run of Humanity's Last Exam without tools, covering expert-level academic reasoning across many domains.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Humanity's Last Exam (Anthropic Tools)&lt;/strong&gt; (Accuracy (%)): leader Claude Opus 4.8 (57.9), 4 models&lt;br&gt;&lt;span&gt;Anthropic system-card run of Humanity's Last Exam with tools, covering expert-level academic reasoning across many domains.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;ChartQAPro (Anthropic No Tools)&lt;/strong&gt; (Accuracy (%)): leader Claude Opus 4.8 (69.4), 2 models&lt;br&gt;&lt;span&gt;Anthropic no-tool run of ChartQAPro, testing chart understanding and quantitative visual reasoning.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;ChartQAPro (Anthropic Tools)&lt;/strong&gt; (Accuracy (%)): leader Claude Opus 4.8 (72.3), 2 models&lt;br&gt;&lt;span&gt;Anthropic tool-enabled run of ChartQAPro, testing chart understanding and quantitative visual 
reasoning.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;ScreenSpot-Pro (Anthropic No Tools)&lt;/strong&gt; (Accuracy (%)): leader Claude Opus 4.8 (82.3), 2 models&lt;br&gt;&lt;span&gt;Anthropic no-tool run of ScreenSpot-Pro, evaluating GUI grounding and screen element localization.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;ScreenSpot-Pro (Anthropic Tools)&lt;/strong&gt; (Accuracy (%)): leader Claude Opus 4.8 (87.9), 2 models&lt;br&gt;&lt;span&gt;Anthropic tool-enabled run of ScreenSpot-Pro, evaluating GUI grounding and screen element localization.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GraphWalks BFS 256K (Anthropic)&lt;/strong&gt; (F1 Score (%)): leader Claude Opus 4.8 (85.9), 4 models&lt;br&gt;&lt;span&gt;Anthropic GraphWalks long-context graph traversal evaluation using breadth-first-search tasks at 256K context.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GraphWalks Parents 256K (Anthropic)&lt;/strong&gt; (F1 Score (%)): leader Claude Opus 4.8 (99.3), 4 models&lt;br&gt;&lt;span&gt;Anthropic GraphWalks long-context graph traversal evaluation using parent-pointer recovery tasks at 256K context.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;USAMO 2026 (Anthropic)&lt;/strong&gt; (Accuracy (%)): leader Claude Opus 4.8 (96.7), 2 models&lt;br&gt;&lt;span&gt;Anthropic system-card evaluation on 2026 USAMO-style olympiad math problems, scored by answer correctness.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;ArXivMath Mar-Apr 2026 (Anthropic)&lt;/strong&gt; (Accuracy (%)): leader Claude Opus 4.8 (71.82), 3 models&lt;br&gt;&lt;span&gt;Anthropic system-card evaluation on recent arXiv mathematics problems from March and April 2026.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OfficeQA (Anthropic Internal Harness)&lt;/strong&gt; (Exact Match (%)): leader Claude Opus 4.8 (77.6), 2 models&lt;br&gt;&lt;span&gt;Anthropic internal OfficeQA document-agent benchmark, requiring grounded search and numerical reasoning over office 
documents.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OfficeQA Pro (Anthropic Internal Harness)&lt;/strong&gt; (Exact Match (%)): leader Claude Opus 4.8 (66.2), 2 models&lt;br&gt;&lt;span&gt;Anthropic internal OfficeQA Pro hard subset, requiring grounded search and numerical reasoning over office documents.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;ChartMuseum (Anthropic No Tools)&lt;/strong&gt; (Accuracy (%)): leader Claude Opus 4.8 (75.8), 2 models&lt;br&gt;&lt;span&gt;Anthropic no-tool run of ChartMuseum, evaluating visual chart interpretation across diverse chart types.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;ChartMuseum (Anthropic Tools)&lt;/strong&gt; (Accuracy (%)): leader Claude Opus 4.8 (89.7), 2 models&lt;br&gt;&lt;span&gt;Anthropic tool-enabled run of ChartMuseum, evaluating visual chart interpretation across diverse chart types.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LAB-Bench FigQA (Anthropic No Tools)&lt;/strong&gt; (Accuracy (%)): leader Claude Opus 4.8 (80.4), 2 models&lt;br&gt;&lt;span&gt;Anthropic no-tool run of LAB-Bench FigQA, testing scientific figure understanding and reasoning.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LAB-Bench FigQA (Anthropic Tools)&lt;/strong&gt; (Accuracy (%)): leader Claude Opus 4.8 (87.3), 2 models&lt;br&gt;&lt;span&gt;Anthropic tool-enabled run of LAB-Bench FigQA, testing scientific figure understanding and reasoning.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;CharXiv Reasoning (Anthropic No Tools)&lt;/strong&gt; (Accuracy (%)): leader Claude Opus 4.7 (81.3), 2 models&lt;br&gt;&lt;span&gt;Anthropic no-tool run of CharXiv Reasoning, evaluating reasoning over scientific charts from arXiv papers.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;CharXiv Reasoning (Anthropic Tools)&lt;/strong&gt; (Accuracy (%)): leader Claude Opus 4.7 (90.1), 2 models&lt;br&gt;&lt;span&gt;Anthropic tool-enabled run of CharXiv Reasoning, evaluating reasoning over scientific charts from arXiv 
papers.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;HealthBench Professional (Anthropic)&lt;/strong&gt; (Length-Adjusted Score (%)): leader Claude Opus 4.8 (55.8), 3 models&lt;br&gt;&lt;span&gt;Anthropic system-card run of HealthBench Professional, measuring clinical and healthcare reasoning with length-adjusted scoring.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GMMLU (Anthropic)&lt;/strong&gt; (Average Accuracy (%)): leader Gemini 3.1 Pro (92.2), 5 models&lt;br&gt;&lt;span&gt;Anthropic system-card run of Global MMLU, measuring multilingual academic and professional knowledge.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;BioPipelineBench Verified (Anthropic)&lt;/strong&gt; (Score (%)): leader Claude Mythos Preview (88.1), 4 models&lt;br&gt;&lt;span&gt;Anthropic system-card run of BioPipelineBench Verified, measuring biological data-analysis workflow completion.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;BioMysteryBench Verified - Human Solvable (Anthropic)&lt;/strong&gt; (Score (%)): leader Claude Mythos Preview (82.6), 4 models&lt;br&gt;&lt;span&gt;Anthropic system-card run of BioMysteryBench Verified human-solvable tasks, testing biological mystery problem solving.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;BioMysteryBench Verified - Human Difficult (Anthropic)&lt;/strong&gt; (Score (%)): leader Claude Opus 4.8 (40.0), 4 models&lt;br&gt;&lt;span&gt;Anthropic system-card run of BioMysteryBench Verified human-difficult tasks, testing hard biological mystery problem solving.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LatchBio SpatialBench (Anthropic)&lt;/strong&gt; (Score (%)): leader Claude Mythos Preview (53.8), 4 models&lt;br&gt;&lt;span&gt;Anthropic system-card run of LatchBio SpatialBench, measuring spatial transcriptomics analysis workflows.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LatchBio SingleCellBench (Anthropic)&lt;/strong&gt; (Score (%)): leader Claude Opus 4.8 (58.2), 4 models&lt;br&gt;&lt;span&gt;Anthropic system-card run 
of LatchBio SingleCellBench, measuring single-cell RNA-seq analysis workflows.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Structural Biology (Anthropic)&lt;/strong&gt; (Score (%)): leader Claude Mythos Preview (81.6), 4 models&lt;br&gt;&lt;span&gt;Anthropic system-card structural biology evaluation, testing biomolecular structure reasoning and analysis.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;ProteinGym Hard (Anthropic)&lt;/strong&gt; (Rank Correlation (%)): leader Claude Mythos Preview (43.1), 4 models&lt;br&gt;&lt;span&gt;Anthropic system-card run of the hard ProteinGym subset, measuring protein variant effect prediction via rank correlation.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Organic Chemistry (Anthropic)&lt;/strong&gt; (Score (%)): leader Claude Mythos Preview (86.5), 4 models&lt;br&gt;&lt;span&gt;Anthropic system-card organic chemistry evaluation, testing reaction and molecule reasoning.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Protocol Troubleshooting (Anthropic)&lt;/strong&gt; (Score (%)): leader Claude Mythos Preview (69.6), 4 models&lt;br&gt;&lt;span&gt;Anthropic system-card protocol troubleshooting benchmark, testing diagnosis of laboratory protocol failures.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LABBench2 - Patent Questions (Anthropic)&lt;/strong&gt; (Score (%)): leader Claude Opus 4.8 (68.8), 3 models&lt;br&gt;&lt;span&gt;Anthropic system-card LABBench2 patent-question subset, testing life-science document reasoning over patent material.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LABBench2 - Clinical Trial Questions (Anthropic)&lt;/strong&gt; (Score (%)): leader Claude Mythos Preview (86.3), 3 models&lt;br&gt;&lt;span&gt;Anthropic system-card LABBench2 clinical-trial subset, testing life-science reasoning over trial documents.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LABBench2 - Table Reading (Anthropic)&lt;/strong&gt; (Score (%)): leader Claude Opus 4.8 (77.2), 2 models&lt;br&gt;&lt;span&gt;Anthropic 
system-card LABBench2 table-reading subset, testing scientific table comprehension.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LABBench2 - Supplementary Materials (Anthropic)&lt;/strong&gt; (Score (%)): leader Claude Opus 4.8 (58.9), 2 models&lt;br&gt;&lt;span&gt;Anthropic system-card LABBench2 supplementary-materials subset, testing reasoning over scientific supporting files.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;BoxPwnr CTF Bench&lt;/strong&gt; (Average Platform Completion (%)): leader z-ai/glm-5.1 (54.37), 15 models&lt;br&gt;&lt;span&gt;Aggregated BoxPwnr trace leaderboard over public CTF and security-lab platforms including CyBench, Hack The Box, picoCTF, PortSwigger, TryHackMe, Argus, and XBOW.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Agent Security League - Functional Correctness&lt;/strong&gt; (Functional Correctness (%)): leader GPT-5.5 (84.9), 15 models&lt;br&gt;&lt;span&gt;Endor Labs coding-agent benchmark measuring whether agents functionally complete security-sensitive software tasks.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Agent Security League - Security Correctness&lt;/strong&gt; (Security Correctness (%)): leader GPT-5.5 (24.0), 15 models&lt;br&gt;&lt;span&gt;Endor Labs coding-agent benchmark measuring whether completed software tasks avoid introducing or preserving security vulnerabilities.&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;New Models (67)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Claude Fable 5&lt;/strong&gt; — ELO 2697, #4&lt;ul&gt;&lt;li&gt;Lynchmark: 100.0 (#1/13)&lt;/li&gt;&lt;li&gt;Design Arena (Website): 1345.0 (#1/143)&lt;/li&gt;&lt;li&gt;Design Arena (Game Dev): 1382.0 (#1/129)&lt;/li&gt;&lt;li&gt;Design Arena (UI Components): 1417.0 (#1/123)&lt;/li&gt;&lt;li&gt;Design Arena (Data Viz): 1381.0 (#1/125)&lt;/li&gt;&lt;li&gt;Design Arena (3D): 1370.0 (#1/117)&lt;/li&gt;&lt;li&gt;Design Arena (SVG): 1370.0 (#1/94)&lt;/li&gt;&lt;li&gt;Chatbot Arena (Text): 1510.0 (#1/366)&lt;/li&gt;&lt;li&gt;Chatbot Arena (Code): 1665.0 (#1/86)&lt;/li&gt;&lt;li&gt;Blueprint-Bench 2: 0.386 (#1/14)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.8&lt;/strong&gt; — ELO 2449, #6&lt;ul&gt;&lt;li&gt;Evals for Every Language - MGSM: 96.62 (#1/70)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language ar: 71.58 (#1/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language be: 69.43 (#1/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language ak: 60.02 (#2/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language bem: 60.25 (#2/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language bm: 59.47 (#2/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language chm: 63.17 (#2/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language ckb: 71.59 (#2/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language crh: 69.2 (#2/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language en: 86.15 (#2/71)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5&lt;/strong&gt; — ELO 2384, #7&lt;ul&gt;&lt;li&gt;Blueprint-Bench 2: 0.362 (#2/14)&lt;/li&gt;&lt;li&gt;GRAB-Lite: 71.8 (#2/38)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language ary: 47.34 (#2/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language doi: 71.32 (#2/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language et: 72.25 (#3/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - ARC: 97.82 
(#4/69)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language ay: 59.02 (#4/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language az: 65.39 (#4/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language bho: 67.61 (#4/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language bm: 54.72 (#4/71)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Qwen 3.7 Max&lt;/strong&gt; — ELO 2370, #8&lt;ul&gt;&lt;li&gt;Position Bias (Lechmazur): 34.8 (#10/36)&lt;/li&gt;&lt;li&gt;RuneBench: 2222.0 (#11/23)&lt;/li&gt;&lt;li&gt;Wolfram LLM Benchmarking Project: 67.5 (#14/483)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.7&lt;/strong&gt; — ELO 2325, #10&lt;ul&gt;&lt;li&gt;Evals for Every Language - Language chm: 63.6 (#1/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language cs: 74.38 (#1/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language doi: 71.84 (#1/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language: 66.95 (#2/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - MGSM: 95.57 (#2/70)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language am: 67.86 (#2/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language ar: 70.69 (#2/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language arz: 52.06 (#2/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language as: 68.11 (#2/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language awa: 68.23 (#2/71)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Nemotron 3 Ultra&lt;/strong&gt; — ELO 2288, #13&lt;ul&gt;&lt;li&gt;YC-Bench: 326.9 (#18/26)&lt;/li&gt;&lt;li&gt;SimpleBench: 41.7 (#37/74)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.6&lt;/strong&gt; — ELO 2253, #15&lt;ul&gt;&lt;li&gt;Android Bench: 66.6 (#5/23)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-5.4&lt;/strong&gt; — ELO 2242, #16&lt;ul&gt;&lt;li&gt;Blueprint-Bench 2: 0.271 (#4/14)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash&lt;/strong&gt; — ELO 2219, 
#18&lt;ul&gt;&lt;li&gt;ZeroBench: 19.0 (#4/60)&lt;/li&gt;&lt;li&gt;GRAB-Lite: 63.0 (#4/38)&lt;/li&gt;&lt;li&gt;Position Bias (Lechmazur): 29.8 (#5/36)&lt;/li&gt;&lt;li&gt;Android Bench: 63.7 (#6/23)&lt;/li&gt;&lt;li&gt;YC-Bench: 987.0 (#12/26)&lt;/li&gt;&lt;li&gt;SWE-rebench: 49.45 (#30/85)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-5 Pro&lt;/strong&gt; — ELO 2217, #19&lt;ul&gt;&lt;li&gt;Epoch AI - ECI: 149.85 (#69/374)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Qwen Max&lt;/strong&gt; — ELO 2148, #23&lt;ul&gt;&lt;li&gt;SimpleQA Verified: 58.52 (#10/55)&lt;/li&gt;&lt;li&gt;OTIS Mock AIME 2024-25: 95.0 (#13/145)&lt;/li&gt;&lt;li&gt;Chess Puzzles (Epoch AI): 22.0 (#22/46)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;DeepSeek V4 Pro&lt;/strong&gt; — ELO 2097, #30&lt;ul&gt;&lt;li&gt;RuneBench: 2939.0 (#6/23)&lt;/li&gt;&lt;li&gt;ProphetArena: 0.9061 (#15/46)&lt;/li&gt;&lt;li&gt;Position Bias (Lechmazur): 43.6 (#19/36)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Qwen 3.6 Plus&lt;/strong&gt; — ELO 2092, #31&lt;ul&gt;&lt;li&gt;ProphetArena: 0.9289 (#3/46)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language as: 66.13 (#6/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language bm: 50.12 (#6/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language chm: 59.82 (#6/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language ckb: 68.26 (#6/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language ace: 65.27 (#7/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language cv: 61.59 (#7/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language be: 65.74 (#8/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language bjn: 45.34 (#8/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language ban: 62.52 (#9/71)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;MiMo-V2.5-Pro&lt;/strong&gt; — ELO 2059, #36&lt;ul&gt;&lt;li&gt;LLM Stats (CMMLU): 90.2 (#1/6)&lt;/li&gt;&lt;li&gt;LLM Stats (DROP): 86.3 
(#3/29)&lt;/li&gt;&lt;li&gt;LLM Stats (TriviaQA): 81.3 (#3/18)&lt;/li&gt;&lt;li&gt;LLM Stats (C-Eval): 91.5 (#5/18)&lt;/li&gt;&lt;li&gt;LLM Stats (Claw-Eval): 64.0 (#5/11)&lt;/li&gt;&lt;li&gt;LLM Stats (GDPval-AA): 1581.0 (#6/13)&lt;/li&gt;&lt;li&gt;Vals AI ProofBench: 24.0 (#13/42)&lt;/li&gt;&lt;li&gt;LLM Stats (MMLU-Redux): 92.8 (#14/47)&lt;/li&gt;&lt;li&gt;Vals AI MedScribe: 83.73 (#14/64)&lt;/li&gt;&lt;li&gt;Vals AI (Vals Index): 50.74 (#16/29)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;MiniMax-M3&lt;/strong&gt; — ELO 2054, #37&lt;ul&gt;&lt;li&gt;OSWorld: 75.19 (#6/61)&lt;/li&gt;&lt;li&gt;WebDev Arena: 1527.75 (#9/70)&lt;/li&gt;&lt;li&gt;YC-Bench: 999.5 (#11/26)&lt;/li&gt;&lt;li&gt;Position Bias (Lechmazur): 34.9 (#11/36)&lt;/li&gt;&lt;li&gt;Sycophancy (Lechmazur): 3.5 (#12/32)&lt;/li&gt;&lt;li&gt;Design Arena (SVG): 1255.0 (#18/94)&lt;/li&gt;&lt;li&gt;Design Arena (Game Dev): 1273.0 (#27/129)&lt;/li&gt;&lt;li&gt;SWE-rebench: 45.64 (#38/85)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;O3&lt;/strong&gt; — ELO 2049, #39&lt;ul&gt;&lt;li&gt;GRAB-Lite: 40.8 (#21/38)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Kimi K2.6&lt;/strong&gt; — ELO 2048, #41&lt;ul&gt;&lt;li&gt;RuneBench: 1256.0 (#16/23)&lt;/li&gt;&lt;li&gt;Position Bias (Lechmazur): 47.3 (#24/36)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-5.1&lt;/strong&gt; — ELO 2045, #42&lt;ul&gt;&lt;li&gt;GRAB-Lite: 44.4 (#17/38)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;kimi-k2.7-code&lt;/strong&gt; — ELO 2040, #45&lt;ul&gt;&lt;li&gt;LiveBench Python: 90.0 (#2/125)&lt;/li&gt;&lt;li&gt;LiveBench TypeScript: 65.0 (#3/124)&lt;/li&gt;&lt;li&gt;OTIS Mock AIME 2024-25: 96.39 (#6/145)&lt;/li&gt;&lt;li&gt;Design Arena (Website): 1322.0 (#7/143)&lt;/li&gt;&lt;li&gt;Design Arena (3D): 1328.0 (#11/117)&lt;/li&gt;&lt;li&gt;LiveBench Logic With Navigation: 74.0 (#14/125)&lt;/li&gt;&lt;li&gt;LiveBench Zebra Puzzle: 96.0 (#15/124)&lt;/li&gt;&lt;li&gt;LiveBench Olympiad: 
90.3 (#17/125)&lt;/li&gt;&lt;li&gt;Vals AI Vibe Code Bench: 47.21 (#18/62)&lt;/li&gt;&lt;li&gt;LiveBench JavaScript: 55.0 (#19/125)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Sonnet 4.6&lt;/strong&gt; — ELO 2023, #50&lt;ul&gt;&lt;li&gt;ZeroBench: 11.0 (#11/60)&lt;/li&gt;&lt;li&gt;SWE-rebench: 54.49 (#18/85)&lt;/li&gt;&lt;li&gt;Terminal-Bench 2.0: 53.4 (#21/58)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GLM-5.1&lt;/strong&gt; — ELO 2004, #55&lt;ul&gt;&lt;li&gt;ProphetArena: 0.9253 (#4/46)&lt;/li&gt;&lt;li&gt;FrontierSWE: 32.0 (#9/13)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Grok 4.3&lt;/strong&gt; — ELO 1973, #64&lt;ul&gt;&lt;li&gt;ProphetArena: 0.9188 (#6/46)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Step 3.7 Flash&lt;/strong&gt; — ELO 1962, #71&lt;ul&gt;&lt;li&gt;Design Arena (Game Dev): 1216.0 (#54/129)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Qwen 3.7 Plus&lt;/strong&gt; — ELO 1960, #72&lt;ul&gt;&lt;li&gt;Sycophancy (Lechmazur): 5.0 (#18/32)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Qwen 3.5 Plus&lt;/strong&gt; — ELO 1951, #77&lt;ul&gt;&lt;li&gt;Epoch AI - Apex Agents: 13.6 (#29/46)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Grok 4.20&lt;/strong&gt; — ELO 1936, #82&lt;ul&gt;&lt;li&gt;Evals for Every Language - Language fa: 70.2 (#5/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - MGSM: 87.39 (#7/70)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language ak: 56.11 (#7/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language cy: 77.85 (#7/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language en: 84.29 (#7/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language am: 64.62 (#8/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language ba: 66.75 (#8/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language ceb: 74.99 (#8/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language es: 72.75 (#8/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - 
Language ar: 67.76 (#9/71)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-5.4 Mini&lt;/strong&gt; — ELO 1912, #91&lt;ul&gt;&lt;li&gt;ZeroBench: 10.0 (#13/60)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Sonnet 4 (20250514)&lt;/strong&gt; — ELO 1909, #95&lt;ul&gt;&lt;li&gt;Epoch AI - Apex Agents: 9.3 (#33/46)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.1 Flash Lite&lt;/strong&gt; — ELO 1905, #97&lt;ul&gt;&lt;li&gt;Evals for Every Language - Language am: 68.6 (#1/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language ca: 76.29 (#1/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language ceb: 78.06 (#1/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language cy: 82.03 (#1/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language el: 73.81 (#1/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language en: 87.28 (#1/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language es: 76.16 (#1/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language aeb: 53.18 (#2/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language az: 67.76 (#2/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language eo: 76.43 (#2/71)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Qwen 3.5 122B A10B&lt;/strong&gt; — ELO 1903, #101&lt;ul&gt;&lt;li&gt;LIBRA - ruSciPassageCount *: 21.38 (#3/13)&lt;/li&gt;&lt;li&gt;LIBRA - ruBABILongQA1: 66.8 (#3/13)&lt;/li&gt;&lt;li&gt;LIBRA - ruBABILongQA2: 53.71 (#3/13)&lt;/li&gt;&lt;li&gt;LIBRA - ruBABILongQA3 *: 31.85 (#3/13)&lt;/li&gt;&lt;li&gt;LIBRA - MatreshkaNames *: 67.39 (#4/13)&lt;/li&gt;&lt;li&gt;LIBRA - LibrusecHistory: 79.77 (#4/13)&lt;/li&gt;&lt;li&gt;LIBRA - ru2WikiMultihopQA *: 55.3 (#4/13)&lt;/li&gt;&lt;li&gt;LIBRA - ruSciFi: 50.29 (#4/13)&lt;/li&gt;&lt;li&gt;LIBRA - LibrusecMHQA *: 42.32 (#4/13)&lt;/li&gt;&lt;li&gt;LIBRA - ruBABILongQA4: 58.91 (#4/13)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;MiMo-V2.5&lt;/strong&gt; — ELO 1903, #102&lt;ul&gt;&lt;li&gt;LLM 
Stats (Video-MME): 87.7 (#1/14)&lt;/li&gt;&lt;li&gt;LLM Stats (Claw-Eval): 63.2 (#6/11)&lt;/li&gt;&lt;li&gt;LLM Stats (CharXiv-R): 81.0 (#12/38)&lt;/li&gt;&lt;li&gt;Vals AI Multimodal Index: 52.77 (#12/21)&lt;/li&gt;&lt;li&gt;Vals AI (Vals Index): 51.57 (#15/29)&lt;/li&gt;&lt;li&gt;Vals AI Vibe Code Bench: 42.17 (#21/62)&lt;/li&gt;&lt;li&gt;Vals AI ProofBench: 16.0 (#22/42)&lt;/li&gt;&lt;li&gt;Vals AI SAGE: 43.27 (#26/61)&lt;/li&gt;&lt;li&gt;Vals AI MortgageTax: 59.26 (#49/80)&lt;/li&gt;&lt;li&gt;Vals AI MedScribe: 72.15 (#50/64)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;qwen3.6-flash&lt;/strong&gt; — ELO 1872, #116&lt;ul&gt;&lt;li&gt;Evals for Every Language - Language chm: 55.74 (#12/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language am: 57.31 (#19/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language ban: 58.91 (#19/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - ARC: 91.99 (#20/69)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language ckb: 62.38 (#20/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language dz: 45.14 (#20/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language en: 79.89 (#20/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language ace: 57.48 (#21/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language cv: 53.08 (#21/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language ee: 41.46 (#21/71)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;MiniMax-M2.7&lt;/strong&gt; — ELO 1853, #124&lt;ul&gt;&lt;li&gt;ProphetArena: 0.9215 (#5/46)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;O3 Mini&lt;/strong&gt; — ELO 1850, #127&lt;ul&gt;&lt;li&gt;FinBen - FNS: 16.95 (#4/21)&lt;/li&gt;&lt;li&gt;FinBen - FinNum: 20.98 (#5/21)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;nemotron-3-ultra-550B-a55B&lt;/strong&gt; — ELO 1778, #168&lt;ul&gt;&lt;li&gt;Vals AI ProofBench: 2.0 (#40/42)&lt;/li&gt;&lt;li&gt;Vals AI Vibe Code Bench: 7.64 (#49/62)&lt;/li&gt;&lt;li&gt;WeirdML: 43.45 
(#63/131)&lt;/li&gt;&lt;li&gt;Design Arena (Website): 1144.0 (#97/143)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;DeepSeek V3.1&lt;/strong&gt; — ELO 1763, #176&lt;ul&gt;&lt;li&gt;Evals for Every Language - Language da: 76.78 (#2/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language ban: 65.1 (#4/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - ARC: 97.4 (#5/69)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language ay: 58.91 (#5/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language ar: 68.89 (#6/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language ca: 73.25 (#6/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language bem: 54.51 (#7/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - MMLU: 97.67 (#8/69)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language el: 70.9 (#10/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language as: 64.63 (#11/71)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-4o&lt;/strong&gt; — ELO 1712, #208&lt;ul&gt;&lt;li&gt;FinBen (Financial LLM): 46.01 (#1/20)&lt;/li&gt;&lt;li&gt;FinBen - QA: 78.22 (#1/20)&lt;/li&gt;&lt;li&gt;FinBen - FNS: 25.5 (#3/21)&lt;/li&gt;&lt;li&gt;FinBen - MultiFin: 59.26 (#4/20)&lt;/li&gt;&lt;li&gt;FinBen - FinNum: 9.18 (#6/21)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Mistral Medium 3.5&lt;/strong&gt; — ELO 1712, #209&lt;ul&gt;&lt;li&gt;Position Bias (Lechmazur): 72.5 (#36/36)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Mistral-Small-3.2-24B-Instruct-2506&lt;/strong&gt; — ELO 1708, #211&lt;ul&gt;&lt;li&gt;Evals for Every Language - Classification: 89.59 (#24/70)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language en: 76.19 (#29/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language ars: 46.69 (#31/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language awa: 61.09 (#31/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language ca: 69.44 (#31/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language be: 62.6 
(#32/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language cs: 66.53 (#32/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language doi: 55.87 (#36/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language eu: 59.54 (#37/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language az: 58.06 (#39/71)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Qwen 3.5 35B A3B&lt;/strong&gt; — ELO 1707, #213&lt;ul&gt;&lt;li&gt;LIBRA - MatreshkaNames *: 68.97 (#2/13)&lt;/li&gt;&lt;li&gt;LIBRA - ruSciPassageCount *: 21.89 (#2/13)&lt;/li&gt;&lt;li&gt;LIBRA - ruSciFi: 51.47 (#2/13)&lt;/li&gt;&lt;li&gt;LIBRA - ruBABILongQA1: 68.38 (#2/13)&lt;/li&gt;&lt;li&gt;LIBRA - ruBABILongQA2: 54.97 (#2/13)&lt;/li&gt;&lt;li&gt;LIBRA - ruBABILongQA3 *: 32.6 (#2/13)&lt;/li&gt;&lt;li&gt;LIBRA - LibrusecHistory: 81.65 (#3/13)&lt;/li&gt;&lt;li&gt;LIBRA - ru2WikiMultihopQA *: 56.6 (#3/13)&lt;/li&gt;&lt;li&gt;LIBRA - LibrusecMHQA *: 43.32 (#3/13)&lt;/li&gt;&lt;li&gt;LIBRA - ruBABILongQA4: 60.29 (#3/13)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;DeepSeek V3&lt;/strong&gt; — ELO 1706, #215&lt;ul&gt;&lt;li&gt;FinBen - FNS: 37.72 (#1/21)&lt;/li&gt;&lt;li&gt;FinBen - MultiFin: 61.11 (#3/20)&lt;/li&gt;&lt;li&gt;FinBen - FinNum: 7.43 (#7/21)&lt;/li&gt;&lt;li&gt;FinBen - QA: 50.0 (#7/20)&lt;/li&gt;&lt;li&gt;FinBen (Financial LLM): 10.2 (#13/20)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-4.1 Mini&lt;/strong&gt; — ELO 1705, #216&lt;ul&gt;&lt;li&gt;GRAB-Lite: 18.6 (#32/38)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-4o (2024-11-20)&lt;/strong&gt; — ELO 1696, #224&lt;ul&gt;&lt;li&gt;Epoch AI - Apex Agents: 1.1 (#46/46)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GLM 4.5 Air&lt;/strong&gt; — ELO 1684, #230&lt;ul&gt;&lt;li&gt;Evals for Every Language - Language chm: 47.52 (#21/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language et: 66.43 (#21/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language ckb: 60.22 
(#22/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language as: 60.21 (#23/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language az: 62.15 (#24/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language ak: 41.24 (#26/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language es: 70.3 (#26/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language ca: 70.13 (#27/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language bho: 62.66 (#28/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language ace: 51.96 (#29/71)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Hermes 4 70B&lt;/strong&gt; — ELO 1674, #239&lt;ul&gt;&lt;li&gt;Evals for Every Language - MGSM: 77.91 (#24/70)&lt;/li&gt;&lt;li&gt;Evals for Every Language - MMLU: 88.52 (#26/69)&lt;/li&gt;&lt;li&gt;Evals for Every Language - ARC: 83.16 (#38/69)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language chm: 34.57 (#40/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language dz: 28.93 (#40/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language cv: 31.63 (#44/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language am: 31.51 (#49/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language ckb: 41.14 (#49/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language as: 40.36 (#57/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language ba: 41.82 (#58/71)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;jamba-large-1.7&lt;/strong&gt; — ELO 1663, #245&lt;ul&gt;&lt;li&gt;Evals for Every Language - Classification: 91.29 (#18/70)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language af: 71.78 (#24/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language fa: 65.89 (#24/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language bg: 70.55 (#25/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language ee: 30.99 (#26/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language ar: 63.0 (#27/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language be: 62.93 
(#28/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language de: 70.42 (#30/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language aeb: 42.65 (#32/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language doi: 56.88 (#33/71)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Llama 3.1 70B Instruct&lt;/strong&gt; — ELO 1658, #251&lt;ul&gt;&lt;li&gt;FinBen - FinNum: 46.34 (#3/21)&lt;/li&gt;&lt;li&gt;FinBen - QA: 64.44 (#3/20)&lt;/li&gt;&lt;li&gt;FinBen - FNS: 13.61 (#7/21)&lt;/li&gt;&lt;li&gt;FinBen - MultiFin: 50.0 (#7/20)&lt;/li&gt;&lt;li&gt;FinBen (Financial LLM): 14.07 (#8/20)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Ministral 3 8B (2512)&lt;/strong&gt; — ELO 1640, #263&lt;ul&gt;&lt;li&gt;Evals for Every Language - Language bm: 29.35 (#28/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Classification: 84.43 (#39/70)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language cs: 65.2 (#41/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language en: 73.09 (#42/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language bn: 62.11 (#43/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language es: 67.86 (#44/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language el: 63.45 (#45/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language be: 59.13 (#46/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language ace: 43.57 (#47/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language chm: 31.67 (#47/71)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemma 3 27B (IT)&lt;/strong&gt; — ELO 1639, #266&lt;ul&gt;&lt;li&gt;Evals for Every Language - Language el: 72.48 (#3/71)&lt;/li&gt;&lt;li&gt;FinBen (Financial LLM): 15.74 (#7/20)&lt;/li&gt;&lt;li&gt;FinBen - FinNum: 0.0 (#10/21)&lt;/li&gt;&lt;li&gt;FinBen - MultiFin: 38.89 (#10/20)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language eo: 73.18 (#10/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Classification: 95.41 (#11/70)&lt;/li&gt;&lt;li&gt;Evals for Every 
Language - Language bg: 73.9 (#11/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language es: 72.28 (#11/71)&lt;/li&gt;&lt;li&gt;FinBen - QA: 22.67 (#13/20)&lt;/li&gt;&lt;li&gt;FinBen - FNS: 0.21 (#14/21)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;nova-2-lite-v1&lt;/strong&gt; — ELO 1635, #268&lt;ul&gt;&lt;li&gt;Evals for Every Language - MMLU: 95.33 (#12/69)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language en: 81.54 (#12/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language be: 64.22 (#17/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language chm: 52.92 (#18/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - MGSM: 80.9 (#19/70)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language bn: 68.34 (#19/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language ak: 46.55 (#20/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language cv: 53.69 (#20/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language da: 71.55 (#21/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language bm: 36.11 (#22/71)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Qwen 3.5 9B&lt;/strong&gt; — ELO 1628, #272&lt;ul&gt;&lt;li&gt;LIBRA - ruSciPassageCount *: 20.77 (#4/13)&lt;/li&gt;&lt;li&gt;LIBRA - ruBABILongQA1: 64.88 (#4/13)&lt;/li&gt;&lt;li&gt;LIBRA - ruBABILongQA2: 52.16 (#4/13)&lt;/li&gt;&lt;li&gt;LIBRA - ruBABILongQA3 *: 30.94 (#4/13)&lt;/li&gt;&lt;li&gt;LIBRA - MatreshkaNames *: 65.44 (#5/13)&lt;/li&gt;&lt;li&gt;LIBRA - LibrusecHistory: 77.47 (#5/13)&lt;/li&gt;&lt;li&gt;LIBRA - ru2WikiMultihopQA *: 53.7 (#5/13)&lt;/li&gt;&lt;li&gt;LIBRA - ruSciFi: 48.84 (#5/13)&lt;/li&gt;&lt;li&gt;LIBRA - LibrusecMHQA *: 41.1 (#5/13)&lt;/li&gt;&lt;li&gt;LIBRA - ruBABILongQA4: 57.21 (#5/13)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Qwen 3 30B A3B 2507 Instruct&lt;/strong&gt; — ELO 1615, #280&lt;ul&gt;&lt;li&gt;Evals for Every Language - Language ars: 50.46 (#10/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language aeb: 45.53 
(#18/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language en: 78.55 (#23/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language bs: 68.53 (#30/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language bg: 69.16 (#32/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language arz: 42.54 (#33/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language dz: 32.48 (#33/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language am: 42.2 (#35/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language bn: 63.67 (#37/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language ace: 47.99 (#38/71)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Hunyuan A13B-Instruct&lt;/strong&gt; — ELO 1579, #307&lt;ul&gt;&lt;li&gt;Evals for Every Language - Language ars: 42.49 (#55/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language aeb: 35.99 (#56/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language apc: 40.5 (#56/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Translation From: 22.42 (#57/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language ary: 33.19 (#57/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language ak: 25.6 (#58/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language cv: 24.99 (#58/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language arz: 36.21 (#59/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Translation To: 18.34 (#60/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language bjn: 29.79 (#61/71)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-4o Mini&lt;/strong&gt; — ELO 1543, #345&lt;ul&gt;&lt;li&gt;GRAB-Lite: 11.4 (#38/38)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Ministral 3 14B (2512)&lt;/strong&gt; — ELO 1532, #356&lt;ul&gt;&lt;li&gt;Evals for Every Language - Classification: 88.17 (#31/70)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language be: 62.67 (#31/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language el: 66.63 (#31/71)&lt;/li&gt;&lt;li&gt;Evals for Every 
Language - Language bn: 64.34 (#33/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language az: 59.29 (#34/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language af: 69.81 (#35/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language es: 68.9 (#37/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language en: 73.16 (#41/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language arz: 40.92 (#42/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language bg: 67.27 (#42/71)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-OSS-20B&lt;/strong&gt; — ELO 1515, #371&lt;ul&gt;&lt;li&gt;Evals for Every Language - Language en: 77.2 (#26/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language es: 69.42 (#30/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language awa: 60.89 (#34/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language bs: 67.0 (#37/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language da: 69.16 (#37/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language dz: 30.01 (#37/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language as: 54.72 (#38/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language bem: 33.09 (#39/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language ak: 34.16 (#40/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language cs: 65.28 (#40/71)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Llama 4 Scout Instruct&lt;/strong&gt; — ELO 1498, #384&lt;ul&gt;&lt;li&gt;FinBen - FinNum: 49.12 (#2/21)&lt;/li&gt;&lt;li&gt;FinBen - QA: 74.22 (#2/20)&lt;/li&gt;&lt;li&gt;FinBen (Financial LLM): 20.89 (#3/20)&lt;/li&gt;&lt;li&gt;FinBen - FNS: 16.9 (#5/21)&lt;/li&gt;&lt;li&gt;FinBen - MultiFin: 55.56 (#5/20)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Laguna M.1&lt;/strong&gt; — ELO 1491, #391&lt;ul&gt;&lt;li&gt;Vals AI (Vals Index): 35.27 (#27/29)&lt;/li&gt;&lt;li&gt;Vals AI ProofBench: 0.0 (#42/42)&lt;/li&gt;&lt;li&gt;Vals AI Terminal-Bench 2.0: 31.46 (#43/68)&lt;/li&gt;&lt;li&gt;Vals 
AI Vibe Code Bench: 10.94 (#48/62)&lt;/li&gt;&lt;li&gt;Vals AI MedCode: 25.24 (#64/67)&lt;/li&gt;&lt;li&gt;Vals AI CorpFin v2: 58.16 (#68/115)&lt;/li&gt;&lt;li&gt;Vals AI LegalBench: 75.14 (#86/118)&lt;/li&gt;&lt;li&gt;Vals AI TaxEval v2: 1.64 (#121/121)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;granite-4.0-h-micro&lt;/strong&gt; — ELO 1486, #399&lt;ul&gt;&lt;li&gt;Evals for Every Language - Classification: 86.11 (#36/70)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language ar: 60.19 (#45/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language cv: 27.5 (#52/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language ay: 29.15 (#54/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language bn: 50.72 (#55/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language ary: 33.31 (#56/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language bg: 58.7 (#57/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language eo: 59.1 (#57/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language ak: 25.2 (#59/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language da: 56.25 (#60/71)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Laguna XS.2&lt;/strong&gt; — ELO 1486, #401&lt;ul&gt;&lt;li&gt;Vals AI (Vals Index): 29.15 (#28/29)&lt;/li&gt;&lt;li&gt;Vals AI ProofBench: 1.0 (#41/42)&lt;/li&gt;&lt;li&gt;Vals AI Terminal-Bench 2.0: 28.09 (#47/68)&lt;/li&gt;&lt;li&gt;Vals AI Vibe Code Bench: 3.84 (#53/62)&lt;/li&gt;&lt;li&gt;Vals AI MedCode: 20.7 (#66/67)&lt;/li&gt;&lt;li&gt;Vals AI CorpFin v2: 56.33 (#72/115)&lt;/li&gt;&lt;li&gt;Vals AI LegalBench: 71.03 (#91/118)&lt;/li&gt;&lt;li&gt;Vals AI TaxEval v2: 59.98 (#107/121)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemma 3 4B (IT)&lt;/strong&gt; — ELO 1463, #424&lt;ul&gt;&lt;li&gt;FinBen (Financial LLM): 12.74 (#9/20)&lt;/li&gt;&lt;li&gt;FinBen - FinNum: 0.0 (#9/21)&lt;/li&gt;&lt;li&gt;FinBen - MultiFin: 38.89 (#9/20)&lt;/li&gt;&lt;li&gt;FinBen - QA: 22.67 
(#12/20)&lt;/li&gt;&lt;li&gt;FinBen - FNS: 0.24 (#13/21)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Phi-4 Mini Instruct&lt;/strong&gt; — ELO 1451, #434&lt;ul&gt;&lt;li&gt;Evals for Every Language - Classification: 79.23 (#54/70)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language ckb: 31.9 (#56/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language aeb: 33.95 (#61/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language ee: 21.7 (#62/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language en: 62.96 (#62/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language es: 51.93 (#65/71)&lt;/li&gt;&lt;li&gt;Evals for Every Language - MGSM: 16.66 (#66/70)&lt;/li&gt;&lt;li&gt;Evals for Every Language - MMLU: 43.8 (#66/69)&lt;/li&gt;&lt;li&gt;Evals for Every Language - ARC: 41.91 (#67/69)&lt;/li&gt;&lt;li&gt;Evals for Every Language - Language doi: 30.09 (#67/71)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Qwen 3.5 4B&lt;/strong&gt; — ELO 1430, #455&lt;ul&gt;&lt;li&gt;LIBRA - ruSciPassageCount *: 19.57 (#5/13)&lt;/li&gt;&lt;li&gt;LIBRA - ruBABILongQA1: 61.13 (#5/13)&lt;/li&gt;&lt;li&gt;LIBRA - ruBABILongQA2: 49.14 (#5/13)&lt;/li&gt;&lt;li&gt;LIBRA - ruBABILongQA3 *: 29.15 (#5/13)&lt;/li&gt;&lt;li&gt;LIBRA - MatreshkaNames *: 61.66 (#6/13)&lt;/li&gt;&lt;li&gt;LIBRA - LibrusecHistory: 72.99 (#6/13)&lt;/li&gt;&lt;li&gt;LIBRA - ru2WikiMultihopQA *: 50.6 (#6/13)&lt;/li&gt;&lt;li&gt;LIBRA - ruSciFi: 46.02 (#6/13)&lt;/li&gt;&lt;li&gt;LIBRA - LibrusecMHQA *: 38.73 (#6/13)&lt;/li&gt;&lt;li&gt;LIBRA - ruBABILongQA4: 53.9 (#6/13)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Qwen3.5 0.8B&lt;/strong&gt; — ELO 1370, #528&lt;ul&gt;&lt;li&gt;LIBRA - ruSciPassageCount *: 17.79 (#7/13)&lt;/li&gt;&lt;li&gt;LIBRA - ruBABILongQA2: 44.67 (#7/13)&lt;/li&gt;&lt;li&gt;LIBRA - MatreshkaNames *: 56.05 (#8/13)&lt;/li&gt;&lt;li&gt;LIBRA - ru2WikiMultihopQA *: 46.0 (#8/13)&lt;/li&gt;&lt;li&gt;LIBRA - LibrusecMHQA *: 35.21 
(#8/13)&lt;/li&gt;&lt;li&gt;LIBRA - ruBABILongQA1: 55.57 (#8/13)&lt;/li&gt;&lt;li&gt;LIBRA - ruBABILongQA4: 49.0 (#8/13)&lt;/li&gt;&lt;li&gt;LIBRA - ruSciAbstractRetrieval: 56.26 (#9/13)&lt;/li&gt;&lt;li&gt;LIBRA - ruSciFi: 41.83 (#9/13)&lt;/li&gt;&lt;li&gt;LIBRA - ruBABILongQA3 *: 26.5 (#9/13)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Qwen 3.5 2B&lt;/strong&gt; — ELO 1247, #653&lt;ul&gt;&lt;li&gt;LIBRA - ruSciPassageCount *: 18.72 (#6/13)&lt;/li&gt;&lt;li&gt;LIBRA - ruBABILongQA2: 47.01 (#6/13)&lt;/li&gt;&lt;li&gt;LIBRA - ruBABILongQA3 *: 27.88 (#6/13)&lt;/li&gt;&lt;li&gt;LIBRA - MatreshkaNames *: 58.98 (#7/13)&lt;/li&gt;&lt;li&gt;LIBRA - ru2WikiMultihopQA *: 48.4 (#7/13)&lt;/li&gt;&lt;li&gt;LIBRA - ruSciFi: 44.02 (#7/13)&lt;/li&gt;&lt;li&gt;LIBRA - LibrusecMHQA *: 37.05 (#7/13)&lt;/li&gt;&lt;li&gt;LIBRA - ruBABILongQA1: 58.48 (#7/13)&lt;/li&gt;&lt;li&gt;LIBRA - ruBABILongQA4: 51.56 (#7/13)&lt;/li&gt;&lt;li&gt;LIBRA - LibrusecHistory: 69.83 (#8/13)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Qwen2.5-Omni-7B&lt;/strong&gt; — ELO 1227, #667&lt;ul&gt;&lt;li&gt;FinBen (Financial LLM): 33.53 (#2/20)&lt;/li&gt;&lt;li&gt;FinBen - FinNum: 0.4 (#8/21)&lt;/li&gt;&lt;li&gt;FinBen - QA: 48.89 (#8/20)&lt;/li&gt;&lt;li&gt;FinBen - FNS: 5.6 (#11/21)&lt;/li&gt;&lt;li&gt;FinBen - MultiFin: 38.89 (#11/20)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemma 4 12B&lt;/strong&gt; — ELO 1100, #731&lt;ul&gt;&lt;li&gt;LLM Stats (MRCR v2): 43.4 (#3/7)&lt;/li&gt;&lt;li&gt;LLM Stats (FLEURS): 93.1 (#4/6)&lt;/li&gt;&lt;li&gt;LLM Stats (MedXpertQA): 48.7 (#8/12)&lt;/li&gt;&lt;li&gt;LLM Stats (MathVision): 79.7 (#9/28)&lt;/li&gt;&lt;li&gt;LLM Stats (AIME 2026): 77.5 (#13/16)&lt;/li&gt;&lt;li&gt;LLM Stats (OmniDocBench 1.5): 16.4 (#13/15)&lt;/li&gt;&lt;li&gt;LLM Stats (CodeForces): 55.3 (#15/16)&lt;/li&gt;&lt;li&gt;LLM Stats (MMMLU): 83.4 (#34/48)&lt;/li&gt;&lt;li&gt;ZeroEval GPQA Diamond: 78.8 (#82/223)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;Top-10 New Scores (186)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Claude Fable 5&lt;/strong&gt; on AI Chess Leaderboard (Continuation): 1092.0 (#30)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Fable 5&lt;/strong&gt; on AI Chess Leaderboard (Reasoning): 1711.0 (#8)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Fable 5&lt;/strong&gt; on Chatbot Arena (Document): 1495.0 (#5)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Fable 5&lt;/strong&gt; on Chatbot Arena (Vision): 1307.0 (#2)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Fable 5&lt;/strong&gt; on ClockBench: 35.0 (#4)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Fable 5&lt;/strong&gt; on Epoch AI - Apex Agents: 45.0 (#3)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Fable 5&lt;/strong&gt; on LLM Stats (GDPval-AA): 1932.0 (#1)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Fable 5&lt;/strong&gt; on Lynchmark: 100.0 (#1)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Fable 5&lt;/strong&gt; on MineBench: 1790.51 (#4)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Fable 5&lt;/strong&gt; on PM-LLM-Benchmark: 35.6 (#13)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Fable 5&lt;/strong&gt; on PinchBench: 59.61 (#44)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Fable 5&lt;/strong&gt; on React Native Evals: 86.96 (#4)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Fable 5&lt;/strong&gt; on SEAL - MCP Atlas: 83.3 (#2)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Fable 5&lt;/strong&gt; on Vals AI MedCode: 56.07 (#2)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Fable 5&lt;/strong&gt; on Vals AI MortgageTax: 68.92 (#5)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Fable 5&lt;/strong&gt; on Vals AI SAGE: 51.89 (#5)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Fable 5&lt;/strong&gt; on Vals AI TaxEval v2: 76.94 (#3)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Fable 5&lt;/strong&gt; on Vellum - GPQA: 94.1 (#3)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Fable 5&lt;/strong&gt; on Vellum - HumanEval: 95.0 (#2)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Fable 5&lt;/strong&gt; on Vending-Bench 2: 4529.94 
(#18)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.7&lt;/strong&gt; on Android Bench: 68.7 (#4)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.7&lt;/strong&gt; on Evals for Every Language: 66.95 (#2)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.7&lt;/strong&gt; on Evals for Every Language - ARC: 97.23 (#6)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.7&lt;/strong&gt; on Evals for Every Language - Classification: 95.98 (#7)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.7&lt;/strong&gt; on Evals for Every Language - Language ace: 69.04 (#3)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.7&lt;/strong&gt; on Evals for Every Language - Language aeb: 50.61 (#4)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.7&lt;/strong&gt; on Evals for Every Language - Language af: 76.97 (#9)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.7&lt;/strong&gt; on Evals for Every Language - Language ak: 59.75 (#3)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.7&lt;/strong&gt; on Evals for Every Language - Language am: 67.86 (#2)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.7&lt;/strong&gt; on Evals for Every Language - Language apc: 55.53 (#5)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.7&lt;/strong&gt; on Evals for Every Language - Language ar: 70.69 (#2)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.7&lt;/strong&gt; on Evals for Every Language - Language ars: 49.83 (#13)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.7&lt;/strong&gt; on Evals for Every Language - Language ary: 44.23 (#12)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.7&lt;/strong&gt; on Evals for Every Language - Language arz: 52.06 (#2)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.7&lt;/strong&gt; on Evals for Every Language - Language as: 68.11 (#2)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.7&lt;/strong&gt; on Evals for Every Language - Language awa: 68.23 (#2)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.7&lt;/strong&gt; on Evals for Every Language - Language ay: 59.38 
(#3)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.7&lt;/strong&gt; on Evals for Every Language - Language az: 65.04 (#8)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.7&lt;/strong&gt; on Evals for Every Language - Language ba: 67.46 (#6)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.7&lt;/strong&gt; on Evals for Every Language - Language ban: 65.75 (#3)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.7&lt;/strong&gt; on Evals for Every Language - Language be: 66.48 (#4)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.7&lt;/strong&gt; on Evals for Every Language - Language bem: 59.05 (#4)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.7&lt;/strong&gt; on Evals for Every Language - Language bg: 74.44 (#4)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.7&lt;/strong&gt; on Evals for Every Language - Language bho: 67.27 (#8)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.7&lt;/strong&gt; on Evals for Every Language - Language bjn: 48.88 (#4)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.7&lt;/strong&gt; on Evals for Every Language - Language bm: 58.41 (#3)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.7&lt;/strong&gt; on Evals for Every Language - Language bn: 72.35 (#4)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.7&lt;/strong&gt; on Evals for Every Language - Language bs: 70.22 (#24)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.7&lt;/strong&gt; on Evals for Every Language - Language ca: 72.35 (#14)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.7&lt;/strong&gt; on Evals for Every Language - Language ceb: 75.18 (#6)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.7&lt;/strong&gt; on Evals for Every Language - Language ckb: 70.88 (#3)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.7&lt;/strong&gt; on Evals for Every Language - Language crh: 66.99 (#3)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.7&lt;/strong&gt; on Evals for Every Language - Language cs: 74.38 (#1)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.7&lt;/strong&gt; on Evals for 
Every Language - Language cv: 62.92 (#4)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.7&lt;/strong&gt; on Evals for Every Language - Language cy: 79.87 (#5)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.7&lt;/strong&gt; on Evals for Every Language - Language da: 74.47 (#6)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.7&lt;/strong&gt; on Evals for Every Language - Language de: 75.66 (#6)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.7&lt;/strong&gt; on Evals for Every Language - Language dz: 59.16 (#3)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.7&lt;/strong&gt; on Evals for Every Language - Language ee: 60.86 (#2)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.7&lt;/strong&gt; on Evals for Every Language - Language el: 71.56 (#8)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.7&lt;/strong&gt; on Evals for Every Language - Language en: 84.79 (#5)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.7&lt;/strong&gt; on Evals for Every Language - Language eo: 75.16 (#4)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.7&lt;/strong&gt; on Evals for Every Language - Language es: 70.89 (#19)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.7&lt;/strong&gt; on Evals for Every Language - Language et: 71.59 (#5)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.7&lt;/strong&gt; on Evals for Every Language - Language eu: 68.46 (#7)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.7&lt;/strong&gt; on Evals for Every Language - Language fa: 69.71 (#8)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.7&lt;/strong&gt; on Evals for Every Language - MGSM: 95.57 (#2)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.7&lt;/strong&gt; on Evals for Every Language - MMLU: 95.33 (#13)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.7&lt;/strong&gt; on Evals for Every Language - Translation From: 40.53 (#7)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.7&lt;/strong&gt; on Evals for Every Language - Translation To: 39.5 (#4)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 
4.7&lt;/strong&gt; on GRAB-Lite: 58.2 (#10)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.8&lt;/strong&gt; on Chess Puzzles (Epoch AI): 34.0 (#13)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.8&lt;/strong&gt; on Design Arena (Game Dev): 1300.0 (#17)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.8&lt;/strong&gt; on EQ-Bench Longform Writing: 80.8 (#3)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.8&lt;/strong&gt; on Epoch AI - Apex Agents: 42.5 (#4)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.8&lt;/strong&gt; on Epoch AI - ECI: 156.34 (#14)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.8&lt;/strong&gt; on Evals for Every Language: 66.27 (#3)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.8&lt;/strong&gt; on Evals for Every Language - ARC: 98.0 (#3)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.8&lt;/strong&gt; on Evals for Every Language - Classification: 90.31 (#21)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.8&lt;/strong&gt; on Evals for Every Language - Language ace: 66.63 (#6)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.8&lt;/strong&gt; on Evals for Every Language - Language aeb: 50.53 (#5)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.8&lt;/strong&gt; on Evals for Every Language - Language af: 78.38 (#4)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.8&lt;/strong&gt; on Evals for Every Language - Language ak: 60.02 (#2)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.8&lt;/strong&gt; on Evals for Every Language - Language am: 65.76 (#5)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.8&lt;/strong&gt; on Evals for Every Language - Language apc: 49.54 (#21)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.8&lt;/strong&gt; on Evals for Every Language - Language ars: 47.35 (#26)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.8&lt;/strong&gt; on Evals for Every Language - Language ary: 40.29 (#25)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.8&lt;/strong&gt; on Evals for Every Language - Language arz: 49.71 
(#6)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.8&lt;/strong&gt; on Evals for Every Language - Language as: 66.93 (#4)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.8&lt;/strong&gt; on Evals for Every Language - Language awa: 67.71 (#4)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.8&lt;/strong&gt; on Evals for Every Language - Language ay: 58.4 (#6)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.8&lt;/strong&gt; on Evals for Every Language - Language az: 65.38 (#5)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.8&lt;/strong&gt; on Evals for Every Language - Language ba: 67.66 (#5)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.8&lt;/strong&gt; on Evals for Every Language - Language ban: 63.8 (#5)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.8&lt;/strong&gt; on Evals for Every Language - Language bem: 60.25 (#2)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.8&lt;/strong&gt; on Evals for Every Language - Language bg: 74.44 (#5)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.8&lt;/strong&gt; on Evals for Every Language - Language bho: 67.32 (#7)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.8&lt;/strong&gt; on Evals for Every Language - Language bjn: 47.35 (#6)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.8&lt;/strong&gt; on Evals for Every Language - Language bm: 59.47 (#2)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.8&lt;/strong&gt; on Evals for Every Language - Language bn: 70.4 (#12)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.8&lt;/strong&gt; on Evals for Every Language - Language bs: 74.0 (#5)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.8&lt;/strong&gt; on Evals for Every Language - Language ca: 74.29 (#3)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.8&lt;/strong&gt; on Evals for Every Language - Language ceb: 75.82 (#5)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.8&lt;/strong&gt; on Evals for Every Language - Language chm: 63.17 (#2)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.8&lt;/strong&gt; on Evals for Every 
Language - Language ckb: 71.59 (#2)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.8&lt;/strong&gt; on Evals for Every Language - Language crh: 69.2 (#2)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.8&lt;/strong&gt; on Evals for Every Language - Language cs: 73.8 (#3)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.8&lt;/strong&gt; on Evals for Every Language - Language cv: 64.32 (#3)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.8&lt;/strong&gt; on Evals for Every Language - Language cy: 79.83 (#6)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.8&lt;/strong&gt; on Evals for Every Language - Language da: 74.57 (#5)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.8&lt;/strong&gt; on Evals for Every Language - Language de: 76.71 (#3)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.8&lt;/strong&gt; on Evals for Every Language - Language doi: 70.16 (#4)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.8&lt;/strong&gt; on Evals for Every Language - Language dz: 58.51 (#4)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.8&lt;/strong&gt; on Evals for Every Language - Language ee: 57.06 (#4)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.8&lt;/strong&gt; on Evals for Every Language - Language el: 70.34 (#13)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.8&lt;/strong&gt; on Evals for Every Language - Language en: 86.15 (#2)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.8&lt;/strong&gt; on Evals for Every Language - Language eo: 74.5 (#6)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.8&lt;/strong&gt; on Evals for Every Language - Language es: 70.97 (#18)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.8&lt;/strong&gt; on Evals for Every Language - Language et: 70.93 (#7)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.8&lt;/strong&gt; on Evals for Every Language - Language eu: 66.0 (#19)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.8&lt;/strong&gt; on Evals for Every Language - Language fa: 69.54 (#9)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 
4.8&lt;/strong&gt; on Evals for Every Language - MMLU: 98.31 (#4)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.8&lt;/strong&gt; on Evals for Every Language - Translation From: 39.86 (#9)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.8&lt;/strong&gt; on Evals for Every Language - Translation To: 38.22 (#7)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.8&lt;/strong&gt; on GRAB-Lite: 60.6 (#6)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.8&lt;/strong&gt; on OTIS Mock AIME 2024-25: 98.33 (#4)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.8&lt;/strong&gt; on SimpleQA Verified: 39.5 (#26)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.8&lt;/strong&gt; on WebDev Arena: 1545.05 (#6)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.8&lt;/strong&gt; on Wolfram LLM Benchmarking Project: 65.9 (#18)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.8&lt;/strong&gt; on ZeroBench: 17.0 (#7)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5&lt;/strong&gt; on Blueprint-Bench 2: 0.362 (#2)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5&lt;/strong&gt; on Evals for Every Language: 65.09 (#5)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5&lt;/strong&gt; on Evals for Every Language - ARC: 97.82 (#4)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5&lt;/strong&gt; on Evals for Every Language - Classification: 82.73 (#42)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5&lt;/strong&gt; on Evals for Every Language - Language ace: 67.32 (#5)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5&lt;/strong&gt; on Evals for Every Language - Language aeb: 44.61 (#22)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5&lt;/strong&gt; on Evals for Every Language - Language af: 77.33 (#8)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5&lt;/strong&gt; on Evals for Every Language - Language ak: 57.86 (#5)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5&lt;/strong&gt; on Evals for Every Language - Language am: 65.01 (#6)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5&lt;/strong&gt; on Evals for Every Language - Language apc: 50.92 
(#12)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5&lt;/strong&gt; on Evals for Every Language - Language ar: 65.19 (#18)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5&lt;/strong&gt; on Evals for Every Language - Language ars: 46.47 (#33)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5&lt;/strong&gt; on Evals for Every Language - Language ary: 47.34 (#2)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5&lt;/strong&gt; on Evals for Every Language - Language arz: 45.23 (#19)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5&lt;/strong&gt; on Evals for Every Language - Language as: 66.04 (#8)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5&lt;/strong&gt; on Evals for Every Language - Language awa: 66.14 (#8)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5&lt;/strong&gt; on Evals for Every Language - Language ay: 59.02 (#4)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5&lt;/strong&gt; on Evals for Every Language - Language az: 65.39 (#4)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5&lt;/strong&gt; on Evals for Every Language - Language ba: 64.64 (#14)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5&lt;/strong&gt; on Evals for Every Language - Language ban: 62.74 (#8)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5&lt;/strong&gt; on Evals for Every Language - Language be: 64.63 (#16)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5&lt;/strong&gt; on Evals for Every Language - Language bem: 53.46 (#8)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5&lt;/strong&gt; on Evals for Every Language - Language bg: 71.22 (#23)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5&lt;/strong&gt; on Evals for Every Language - Language bho: 67.61 (#4)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5&lt;/strong&gt; on Evals for Every Language - Language bjn: 44.06 (#12)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5&lt;/strong&gt; on Evals for Every Language - Language bm: 54.72 (#4)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5&lt;/strong&gt; on Evals for Every Language - Language bn: 69.73 (#14)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5&lt;/strong&gt; on Evals for Every Language - 
Language bs: 71.46 (#13)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5&lt;/strong&gt; on Evals for Every Language - Language ca: 73.21 (#7)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5&lt;/strong&gt; on Evals for Every Language - Language ceb: 74.54 (#10)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5&lt;/strong&gt; on Evals for Every Language - Language chm: 58.46 (#9)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5&lt;/strong&gt; on Evals for Every Language - Language ckb: 68.48 (#5)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5&lt;/strong&gt; on Evals for Every Language - Language crh: 63.78 (#15)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5&lt;/strong&gt; on Evals for Every Language - Language cs: 71.8 (#10)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5&lt;/strong&gt; on Evals for Every Language - Language cv: 59.68 (#10)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5&lt;/strong&gt; on Evals for Every Language - Language cy: 77.61 (#8)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5&lt;/strong&gt; on Evals for Every Language - Language da: 71.48 (#23)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5&lt;/strong&gt; on Evals for Every Language - Language de: 73.13 (#20)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5&lt;/strong&gt; on Evals for Every Language - Language doi: 71.32 (#2)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5&lt;/strong&gt; on Evals for Every Language - Language dz: 58.36 (#6)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5&lt;/strong&gt; on Evals for Every Language - Language ee: 56.99 (#5)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5&lt;/strong&gt; on Evals for Every Language - Language el: 71.64 (#6)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5&lt;/strong&gt; on Evals for Every Language - Language en: 85.03 (#4)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5&lt;/strong&gt; on Evals for Every Language - Language eo: 72.05 (#13)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5&lt;/strong&gt; on Evals for Every Language - Language es: 70.48 (#23)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5&lt;/strong&gt; on Evals for Every 
Language - Language et: 72.25 (#3)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5&lt;/strong&gt; on Evals for Every Language - Language eu: 67.59 (#11)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5&lt;/strong&gt; on Evals for Every Language - Language fa: 67.54 (#12)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5&lt;/strong&gt; on Evals for Every Language - MGSM: 90.21 (#5)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5&lt;/strong&gt; on Evals for Every Language - MMLU: 98.21 (#5)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5&lt;/strong&gt; on Evals for Every Language - Translation From: 40.95 (#6)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5&lt;/strong&gt; on Evals for Every Language - Translation To: 39.31 (#5)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5&lt;/strong&gt; on GRAB-Lite: 71.8 (#2)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Qwen 3.7 Max&lt;/strong&gt; on Position Bias (Lechmazur): 34.8 (#10)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Qwen 3.7 Max&lt;/strong&gt; on RuneBench: 2222.0 (#11)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Qwen 3.7 Max&lt;/strong&gt; on Wolfram LLM Benchmarking Project: 67.5 (#14)&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;New #1 Leaders (92)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;YC-Bench&lt;/strong&gt;: Claude Fable 5 (1977.6) beat Claude Opus 4.7 by 263.1&lt;/li&gt;&lt;li&gt;&lt;strong&gt;PACT (Lechmazur)&lt;/strong&gt;: Claude Fable 5 (High) (2171.0) beat GPT-5.5 (High) by 155.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Chatbot Arena (Code)&lt;/strong&gt;: Claude Fable 5 (1665.0) beat Claude Opus 4.7 (Thinking) by 98.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Chatbot Arena (Text-to-Video)&lt;/strong&gt;: gemini-omni-flash (1527.0) beat dreamina-seedance-2.0-720p by 64.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Design Arena (UI Components)&lt;/strong&gt;: Claude Fable 5 (1417.0) beat Claude Opus 4.7 by 57.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Multi-turn Debate (Lechmazur)&lt;/strong&gt;: Claude Fable 5 (High) (1770.9) beat Claude Opus 4.7 (High) by 53.8&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA GDPval&lt;/strong&gt;: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (1932.47) beat Claude Opus 4.8 (Adaptive Reasoning, Max Effort) by 42.67&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Design Arena (Data Viz)&lt;/strong&gt;: Claude Fable 5 (1381.0) beat Claude Opus 4.7 (Thinking) by 42.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Design Arena (Game Dev)&lt;/strong&gt;: Claude Fable 5 (1382.0) beat GPT-5.5 by 27.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GSMA Open-Telco - TeleTables&lt;/strong&gt;: TelecomGPT (88.0) beat OTel-LLM-8.3B-QnA by 26.2&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLM Stats (MCP-Mark)&lt;/strong&gt;: Kimi K2.7 Code (81.1) beat Qwen 3.7 Max by 20.3&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Design Arena (Image)&lt;/strong&gt;: riverflow-2.5-pro (1419.0) beat gpt-image-2 by 17.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;WDCD&lt;/strong&gt;: Qwen 3 Max (84.38) beat Claude Opus 4.7 by 14.38&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - Language ay&lt;/strong&gt;: step-3.7-flash-20260528 (77.14) beat Gemini 3.1 Pro (Preview) by 14.23&lt;/li&gt;&lt;li&gt;&lt;strong&gt;SEAL - SWE Atlas - Test 
Writing&lt;/strong&gt;: Fable-5 (Claude Code) xHigh (58.52) beat GPT-5.4 (xHigh) by 14.16&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LiveBench Python&lt;/strong&gt;: Claude Fable 5 (xHigh) (95.0) beat Claude Opus 4.5 (Thinking 64K, High) (2025-11-01) by 10.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLM Stats (FLEURS)&lt;/strong&gt;: Qwen2.5-Omni-7B (95.9) beat Gemini 1.5 Flash-8B by 9.5&lt;/li&gt;&lt;li&gt;&lt;strong&gt;CursorBench 3.1&lt;/strong&gt;: Claude Fable 5 (Max) (72.9) beat Claude Opus 4.7 by 8.1&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE) - Dart&lt;/strong&gt;: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (88.0) beat GPT-5.3 Codex (xHigh) by 8.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE) - R&lt;/strong&gt;: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (82.0) beat GPT-5.5 (High) by 8.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE) - Swift&lt;/strong&gt;: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (100.0) beat GPT-5.5 (xHigh) by 8.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Vals AI Vibe Code Bench&lt;/strong&gt;: Claude Fable 5 (90.35) beat Claude Opus 4.8 by 7.63&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Humanity's Last Exam&lt;/strong&gt;: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (53.34) beat Claude Opus 4.8 (Adaptive Reasoning, Max Effort) by 7.6&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience&lt;/strong&gt;: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (40.15) beat Gemini 3.1 Pro (Preview) by 7.22&lt;/li&gt;&lt;li&gt;&lt;strong&gt;FrontierSWE&lt;/strong&gt;: Claude Fable 5 (90.0) beat Claude Opus 4.8 by 7.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Vellum - HumanEval&lt;/strong&gt;: Claude Mythos 5 (95.5) beat Claude Opus 4.8 by 6.9&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Vellum - Humanity's Last Exam&lt;/strong&gt;: Claude Mythos 5 (64.5) beat Claude Opus 4.8 by 
6.6&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - Language crh&lt;/strong&gt;: step-3.7-flash-20260528 (73.05) beat Gemini 3.1 Pro (Preview) by 6.27&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Chatbot Arena (Text)&lt;/strong&gt;: Claude Fable 5 (1510.0) beat Claude Opus 4.6 (Thinking) by 6.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE) - Java&lt;/strong&gt;: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (79.0) beat GPT-5.3 Codex (xHigh) by 6.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Vals AI ProofBench&lt;/strong&gt;: Claude Fable 5 (77.0) beat aristotle by 6.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Business&lt;/strong&gt;: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (55.0) beat GPT-5.5 (xHigh) by 5.9&lt;/li&gt;&lt;li&gt;&lt;strong&gt;FinBen - MultiFin&lt;/strong&gt;: plutus-8B-instruct (72.22) beat Qwen 2.5 72B Instruct by 5.55&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Science, Engineering &amp; Mathematics&lt;/strong&gt;: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (57.1) beat GPT-5.5 (High) by 4.8&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Vals AI (Vals Index)&lt;/strong&gt;: Claude Fable 5 (75.14) beat Claude Opus 4.8 by 4.78&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenClawProBench&lt;/strong&gt;: GLM-5.2 (81.3) beat intern-s2-preview by 4.6&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Vals AI IOI&lt;/strong&gt;: Claude Fable 5 (72.25) beat GPT-5.4 (2026-03-05) by 4.42&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Humanities &amp; Social Sciences&lt;/strong&gt;: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (60.9) beat Gemini 3 Pro (Preview) (High) by 4.3&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Design Arena (Website)&lt;/strong&gt;: Claude Fable 5 (1345.0) beat Claude Opus 4.6 by 4.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE) - Go&lt;/strong&gt;: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) 
(88.0) beat GPT-5.5 (High) by 4.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;MathArena - ARXIV April&lt;/strong&gt;: Claude Fable 5 (Max) (70.73) beat GPT-5.5 (xHigh) by 3.66&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GSMA Open-Telco LLM Leaderboard&lt;/strong&gt;: TelecomGPT (89.64) beat OTel-LLM-8.3B-QnA by 3.66&lt;/li&gt;&lt;li&gt;&lt;strong&gt;FinBen - QA&lt;/strong&gt;: GPT-4o (78.22) beat GPT-4.5 (Preview) by 3.55&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Artificial Analysis Intelligence Index&lt;/strong&gt;: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (64.88) beat Claude Opus 4.8 (Adaptive Reasoning, Max Effort) by 3.44&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - Language cv&lt;/strong&gt;: gemma-4-31B-it-20260402 (69.3) beat Claude Opus 4.5 by 3.39&lt;/li&gt;&lt;li&gt;&lt;strong&gt;SEAL - SWE Atlas - Codebase QnA&lt;/strong&gt;: Opus 4.8 (Claude Code) (48.79) beat GPT-5.5 by 3.36&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Vals AI CorpFin v2&lt;/strong&gt;: Claude Fable 5 (71.83) beat Grok 4.3 by 3.3&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Vals AI Multimodal Index&lt;/strong&gt;: Claude Fable 5 (74.15) beat Claude Opus 4.8 by 3.26&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE)&lt;/strong&gt;: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (87.6) beat GPT-5.5 (xHigh) by 3.2&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Design Arena (3D)&lt;/strong&gt;: Claude Fable 5 (1370.0) beat Kimi K2.6 by 3.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GRAB-Lite&lt;/strong&gt;: Claude Fable 5 (74.0) beat GPT-5.4 by 3.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;WeirdML&lt;/strong&gt;: Claude Fable 5 (High) (87.85) beat GPT-5.5 (xHigh) by 2.94&lt;/li&gt;&lt;li&gt;&lt;strong&gt;BIRD-SQL&lt;/strong&gt;: Gemini-SQL2 (80.04) beat Gemini-SQL (Multitask SFT + Gemini-2.5-Pro) by 2.9&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GSMA Open-Telco - 3GPP&lt;/strong&gt;: TelecomGPT (84.22) beat OTel-LLM-8.3B-QnA by 2.82&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GSMA Open-Telco - 
TeleLogs&lt;/strong&gt;: TelecomGPT (98.96) beat OTel-LLM-8.3B-QnA by 2.66&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - MGSM&lt;/strong&gt;: Claude Opus 4.8 (96.62) beat Claude Opus 4.6 by 2.36&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - Language ban&lt;/strong&gt;: step-3.7-flash-20260528 (69.03) beat Claude Opus 4.5 by 2.32&lt;/li&gt;&lt;li&gt;&lt;strong&gt;SimpleBench&lt;/strong&gt;: Claude Fable 5 (81.9) beat Gemini 3.1 Pro (Preview) by 2.3&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Terminal-Bench Hard&lt;/strong&gt;: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (62.88) beat GPT-5.5 (xHigh) by 2.27&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Chatbot Arena (Image-to-Video)&lt;/strong&gt;: gemini-omni-flash (1475.0) beat Grok 1.5 by 2.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LiveBench Plot Unscrambling&lt;/strong&gt;: Claude Fable 5 (xHigh) (78.09) beat GPT-5.5 (High) by 1.81&lt;/li&gt;&lt;li&gt;&lt;strong&gt;UGI - Writing&lt;/strong&gt;: Claude Fable 5 (Adaptive Reasoning, High Effort) (74.23) beat Gemini 3.5 Flash (Thinking, Medium) by 1.69&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GSMA Open-Telco - srsRAN-Bench&lt;/strong&gt;: TelecomGPT (91.33) beat OTel-LLM-8.3B-QnA by 1.65&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLM Stats (OSWorld-Verified)&lt;/strong&gt;: Claude Fable 5 (85.0) beat Claude Opus 4.8 by 1.6&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE) - Python&lt;/strong&gt;: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (92.0) beat GPT-5.5 (xHigh) by 1.5&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - Language chm&lt;/strong&gt;: Claude Opus 4.7 (63.6) beat Gemini 3.1 Pro (Preview) by 1.48&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - Language doi&lt;/strong&gt;: Claude Opus 4.7 (71.84) beat Gemini 3 Pro (Preview) by 1.46&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA CritPt&lt;/strong&gt;: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (28.57) 
beat GPT-5.5 (xHigh) by 1.43&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - Language es&lt;/strong&gt;: Gemini 3.1 Flash Lite (76.16) beat Claude Opus 4.6 by 1.42&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA SciCode&lt;/strong&gt;: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (60.19) beat Gemini 3.1 Pro (Preview) by 1.28&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - Language ace&lt;/strong&gt;: step-3.7-flash-20260528 (72.48) beat Gemini 3.1 Pro (Preview) by 1.28&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - MMLU&lt;/strong&gt;: intellect-3-20251126 (100.0) beat Claude Sonnet 4.6 by 1.27&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - ARC&lt;/strong&gt;: intellect-3-20251126 (100.0) beat Gemini 3.1 Pro (Preview) by 1.26&lt;/li&gt;&lt;li&gt;&lt;strong&gt;EQ-Bench Longform Writing&lt;/strong&gt;: Claude Fable 5 (83.0) beat Claude Opus 4.7 by 1.2&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Vals AI LegalBench&lt;/strong&gt;: Claude Fable 5 (88.56) beat Gemini 3.1 Pro (Preview) by 1.16&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - Language ca&lt;/strong&gt;: Gemini 3.1 Flash Lite (76.29) beat Gemini 3 Pro (Preview) by 1.03&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Design Arena (SVG)&lt;/strong&gt;: Claude Fable 5 (1370.0) beat prism by 1.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Opper TaskBench&lt;/strong&gt;: Claude Fable 5 (96.4) beat Claude Opus 4.7 by 1.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - Language ar&lt;/strong&gt;: Claude Opus 4.8 (71.58) beat Claude Opus 4.5 by 0.95&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - Language en&lt;/strong&gt;: Gemini 3.1 Flash Lite (87.28) beat MiniMax-M2.5 by 0.77&lt;/li&gt;&lt;li&gt;&lt;strong&gt;MathArena - HMMT Feb 2026&lt;/strong&gt;: GPT-5.5 (xHigh) (98.48) beat GPT-5.4 (xHigh) by 0.75&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - Language cy&lt;/strong&gt;: Gemini 3.1 Flash Lite (82.03) beat Claude Sonnet 4.5 by 
0.65&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - Language am&lt;/strong&gt;: Gemini 3.1 Flash Lite (68.6) beat Claude Opus 4.6 by 0.59&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Vals AI MedScribe&lt;/strong&gt;: Claude Fable 5 (88.52) beat GPT-5.1 by 0.43&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - Language af&lt;/strong&gt;: Gemini 3.1 Pro (Preview) (79.41) beat Claude Sonnet 4 by 0.43&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - Language be&lt;/strong&gt;: Claude Opus 4.8 (69.43) beat Gemini 3.1 Pro (Preview) by 0.32&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLM Stats (Video-MME)&lt;/strong&gt;: MiMo-V2.5 (87.7) beat Kimi K2.5 by 0.3&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - Language ceb&lt;/strong&gt;: Gemini 3.1 Flash Lite (78.06) beat Gemini 3.1 Pro (Preview) by 0.29&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - Language el&lt;/strong&gt;: Gemini 3.1 Flash Lite (73.81) beat Claude Opus 4.5 by 0.15&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLM Stats (CMMLU)&lt;/strong&gt;: MiMo-V2.5-Pro (90.2) beat Qwen 2 72B Instruct by 0.1&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Blueprint-Bench 2&lt;/strong&gt;: Claude Fable 5 (0.386) beat GPT-5.5 by 0.02&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LiveBench Olympiad&lt;/strong&gt;: Claude Fable 5 (High) (92.18) beat Claude Opus 4.6 (Thinking, High) by 0.01&lt;/li&gt;&lt;/ul&gt;</content><summary>AI Benchmark Digest — 2026-06-14

=== DAILY ===
NEW BENCHMARKS (75)
  - Ramp SWE-Bench (Resolved (%)): leader Claude Fable 5 (87.5), 14 models
      Ramp Labs benchmark for background coding agents on realistic financial software engineering work, scored by resolved tasks with the mini-SWE-agent har</summary></entry><entry><title>AI Benchmark Digest — 2026-06-13</title><id>https://aibenchmarks.dev/digest/2026-06-13</id><updated>2026-06-13T08:02:57.174839+00:00</updated><link href="https://aibenchmarks.dev/#/digest" /><author><name>AI Benchmark Hub</name></author><content type="html">&lt;h2&gt;Daily&lt;/h2&gt;
&lt;h3&gt;Top-10 New Scores (11)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Claude 5&lt;/strong&gt; on Chess Puzzles (Epoch AI): 41.0 (#8)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude 5&lt;/strong&gt; on OTIS Mock AIME 2024-25: 99.72 (#3)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude 5&lt;/strong&gt; on SimpleQA Verified: 68.3 (#4)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Fable 5&lt;/strong&gt; on Epoch AI - Apex Agents: 45.0 (#3)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Fable 5&lt;/strong&gt; on Icelandic LLM - ARC-Challenge-IS: 72.95 (#59)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Fable 5&lt;/strong&gt; on Icelandic LLM - Belebele-IS: 90.78 (#36)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Fable 5&lt;/strong&gt; on Icelandic LLM - Inflection: 97.75 (#2)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Fable 5&lt;/strong&gt; on Icelandic LLM - WinoGrande-IS: 96.05 (#2)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Fable 5&lt;/strong&gt; on Icelandic LLM Leaderboard - Average: 87.4 (#4)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5&lt;/strong&gt; on Blueprint-Bench 2: 0.362 (#2)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Qwen 3.7 Max&lt;/strong&gt; on Wolfram LLM Benchmarking Project: 67.5 (#14)&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;New #1 Leaders (6)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Design Arena (Image)&lt;/strong&gt;: riverflow-2.5-pro (1416.0) beat gpt-image-2 by 23.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLM Stats (MCP-Mark)&lt;/strong&gt;: Kimi K2.7 Code (81.1) beat Qwen 3.7 Max by 20.3&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Icelandic LLM - WikiQA-IS&lt;/strong&gt;: Claude Fable 5 (75.39) beat Gemini 3.1 Pro (Preview) by 7.65&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Icelandic LLM - GED&lt;/strong&gt;: Claude Fable 5 (91.5) beat Claude Opus 4.7 by 7.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;BIRD-SQL&lt;/strong&gt;: Gemini-SQL2 (80.04) beat Gemini-SQL (Multitask SFT + Gemini-2.5-Pro) by 2.9&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Design Arena (Graphic Design)&lt;/strong&gt;: riverflow-2.5-pro (1474.0) beat gpt-image-2 by 1.0&lt;/li&gt;&lt;/ul&gt;</content><summary>AI Benchmark Digest — 2026-06-13

=== DAILY ===
NEW SCORES FROM TOP-10 MODELS (11)
  - Claude 5 on Chess Puzzles (Epoch AI): 41.0 Accuracy (%) (#8/44)
  - Claude 5 on OTIS Mock AIME 2024-25: 99.72 Accuracy (%) (#3/143)
  - Claude 5 on SimpleQA Verified: 68.3 Accuracy (%) (#4/53)
  - Claude Fable 5 o</summary></entry><entry><title>AI Benchmark Digest — 2026-06-12</title><id>https://aibenchmarks.dev/digest/2026-06-12</id><updated>2026-06-12T08:17:57.895837+00:00</updated><link href="https://aibenchmarks.dev/#/digest" /><author><name>AI Benchmark Hub</name></author><content type="html">&lt;h2&gt;Daily&lt;/h2&gt;
&lt;h3&gt;New Benchmarks (2)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;MathArena - ARXIV_FALSE May&lt;/strong&gt; (Accuracy (%)): leader GPT-5.5 (xhigh) (50.0), 8 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;MathArena - ARXIV May&lt;/strong&gt; (Accuracy (%)): leader Claude-Fable-5 (max) (86.67), 8 models&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;Top-10 New Scores (9)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Claude Fable 5&lt;/strong&gt; on Lynchmark: 100.0 (#1)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Fable 5&lt;/strong&gt; on MineBench: 1929.84 (#2)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.8&lt;/strong&gt; on Chess Puzzles (Epoch AI): 34.0 (#12)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.8&lt;/strong&gt; on Design Arena (Game Dev): 1250.0 (#37)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.8&lt;/strong&gt; on GRAB-Lite: 60.6 (#6)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.8&lt;/strong&gt; on OTIS Mock AIME 2024-25: 98.33 (#3)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.8&lt;/strong&gt; on SimpleQA Verified: 39.5 (#24)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5&lt;/strong&gt; on GRAB-Lite: 71.8 (#2)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Qwen 3.7 Max&lt;/strong&gt; on Position Bias (Lechmazur): 34.8 (#10)&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;New #1 Leaders (9)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Chatbot Arena (Text-to-Video)&lt;/strong&gt;: gemini-omni-flash (1527.0) beat dreamina-seedance-2.0-720p by 64.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Design Arena (UI Components)&lt;/strong&gt;: Claude Fable 5 (1411.0) beat Claude Opus 4.7 (Thinking) by 56.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Design Arena (Game Dev)&lt;/strong&gt;: Claude Fable 5 (1393.0) beat GPT-5.5 by 39.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Design Arena (SVG)&lt;/strong&gt;: Claude Fable 5 (1384.0) beat prism by 18.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;SEAL - SWE Atlas - Test Writing&lt;/strong&gt;: Fable-5 (Claude Code) xHigh (58.52) beat Opus 4.8 (Claude Code) by 12.96&lt;/li&gt;&lt;li&gt;&lt;strong&gt;MathArena - ARXIV April&lt;/strong&gt;: Claude Fable 5 (Max) (70.73) beat GPT-5.5 (xHigh) by 3.66&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GRAB-Lite&lt;/strong&gt;: Claude Fable 5 (74.0) beat GPT-5.4 by 3.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;WeirdML&lt;/strong&gt;: Claude Fable 5 (High) (87.85) beat GPT-5.5 (xHigh) by 2.94&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Chatbot Arena (Image-to-Video)&lt;/strong&gt;: gemini-omni-flash (1475.0) beat Grok 1.5 by 2.0&lt;/li&gt;&lt;/ul&gt;</content><summary>AI Benchmark Digest — 2026-06-12

=== DAILY ===
NEW BENCHMARKS (2)
  - MathArena - ARXIV_FALSE May (Accuracy (%)): leader GPT-5.5 (xhigh) (50.0), 8 models
  - MathArena - ARXIV May (Accuracy (%)): leader Claude-Fable-5 (max) (86.67), 8 models

NEW SCORES FROM TOP-10 MODELS (9)
  - Claude Fable 5 on </summary></entry><entry><title>AI Benchmark Digest — 2026-06-11</title><id>https://aibenchmarks.dev/digest/2026-06-11</id><updated>2026-06-11T08:17:01.068404+00:00</updated><link href="https://aibenchmarks.dev/#/digest" /><author><name>AI Benchmark Hub</name></author><content type="html">&lt;h2&gt;Daily&lt;/h2&gt;
&lt;h3&gt;New Benchmarks (1)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;GDPval-AA&lt;/strong&gt; (Elo): leader Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (1932.0), 390 models&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;Top-10 New Scores (3)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Claude Fable 5&lt;/strong&gt; on Chatbot Arena (Document): 1495.0 (#5)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Fable 5&lt;/strong&gt; on Chatbot Arena (Vision): 1307.0 (#2)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Fable 5&lt;/strong&gt; on React Native Evals: 86.96 (#4)&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;New #1 Leaders (12)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;PACT (Lechmazur)&lt;/strong&gt;: Claude Fable 5 (High) (2171.0) beat GPT-5.5 (High) by 155.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Chatbot Arena (Code)&lt;/strong&gt;: Claude Fable 5 (1665.0) beat Claude Opus 4.7 (Thinking) by 98.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Design Arena (Data Viz)&lt;/strong&gt;: Claude Fable 5 (1406.0) beat Claude Opus 4.7 (Thinking) by 68.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Design Arena (Website)&lt;/strong&gt;: Claude Fable 5 (1364.0) beat Claude Opus 4.6 by 23.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Design Arena (3D)&lt;/strong&gt;: Claude Fable 5 (1383.0) beat Kimi K2.6 by 17.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;FrontierSWE&lt;/strong&gt;: Claude Fable 5 (90.0) beat Claude Opus 4.8 by 7.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Chatbot Arena (Text)&lt;/strong&gt;: Claude Fable 5 (1510.0) beat Claude Opus 4.6 (Thinking) by 6.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;SimpleBench&lt;/strong&gt;: Claude Fable 5 (81.9) beat Gemini 3.1 Pro (Preview) by 2.3&lt;/li&gt;&lt;li&gt;&lt;strong&gt;UGI - Writing&lt;/strong&gt;: Claude Fable 5 (Adaptive Reasoning, High Effort) (74.23) beat Gemini 3.5 Flash (Thinking, Medium) by 1.69&lt;/li&gt;&lt;li&gt;&lt;strong&gt;EQ-Bench Longform Writing&lt;/strong&gt;: Claude Fable 5 (83.0) beat Claude Opus 4.7 by 1.2&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLM Stats (Video-MME)&lt;/strong&gt;: MiMo-V2.5 (87.7) beat Kimi K2.5 by 0.3&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLM Stats (CMMLU)&lt;/strong&gt;: MiMo-V2.5-Pro (90.2) beat Qwen 2 72B Instruct by 0.1&lt;/li&gt;&lt;/ul&gt;</content><summary>AI Benchmark Digest — 2026-06-11

=== DAILY ===
NEW BENCHMARKS (1)
  - GDPval-AA (Elo): leader Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (1932.0), 390 models

NEW SCORES FROM TOP-10 MODELS (3)
  - Claude Fable 5 on Chatbot Arena (Document): 1495.0 Elo (#5/29)
  - Claude Fabl</summary></entry><entry><title>AI Benchmark Digest — 2026-06-10</title><id>https://aibenchmarks.dev/digest/2026-06-10</id><updated>2026-06-10T09:55:36.786616+00:00</updated><link href="https://aibenchmarks.dev/#/digest" /><author><name>AI Benchmark Hub</name></author><content type="html">&lt;h2&gt;Daily&lt;/h2&gt;
&lt;h3&gt;New Benchmarks (1)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;SkateBench&lt;/strong&gt; (Success Rate (%)): leader gemini-3.1-pro-preview (96.92), 28 models&lt;br&gt;&lt;span&gt;Skateboarding-domain knowledge benchmark ranking models by how well they identify technical skateboard tricks from 390 trick definitions. SkateBench v2 reports success rate, cost, and speed.&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;New Models (1)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Claude Fable 5&lt;/strong&gt; — ELO 1871, #31&lt;ul&gt;&lt;li&gt;Blueprint-Bench 2: 0.386 (#1/14)&lt;/li&gt;&lt;li&gt;Opper TaskBench: 96.4 (#1/85)&lt;/li&gt;&lt;li&gt;LLM Stats (OSWorld-Verified): 85.0 (#1/16)&lt;/li&gt;&lt;li&gt;YC-Bench: 1977.6 (#1/21)&lt;/li&gt;&lt;li&gt;Vals AI (Vals Index): 75.14 (#1/25)&lt;/li&gt;&lt;li&gt;Vals AI Multimodal Index: 74.15 (#1/20)&lt;/li&gt;&lt;li&gt;Vals AI LegalBench: 88.56 (#1/114)&lt;/li&gt;&lt;li&gt;Vals AI CorpFin v2: 71.83 (#1/111)&lt;/li&gt;&lt;li&gt;Vals AI MedScribe: 88.52 (#1/62)&lt;/li&gt;&lt;li&gt;Vals AI ProofBench: 77.0 (#1/37)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;New #1 Leaders (55)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;YC-Bench&lt;/strong&gt;: Claude Fable 5 (1977.6) beat Claude Opus 4.7 by 263.1&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Multi-turn Debate (Lechmazur)&lt;/strong&gt;: Claude Fable 5 (High) (1770.9) beat Claude Opus 4.7 (High) by 53.8&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA GDPval&lt;/strong&gt;: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (1932.47) beat Claude Opus 4.8 (Adaptive Reasoning, Max Effort) by 42.67&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - Language ay&lt;/strong&gt;: step-3.7-flash-20260528 (77.14) beat Gemini 3.1 Pro (Preview) by 14.23&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LiveBench Python&lt;/strong&gt;: Claude Fable 5 (xHigh) (95.0) beat Claude Opus 4.5 (Thinking 64K, High) (2025-11-01) by 10.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;CursorBench 3.1&lt;/strong&gt;: Claude Fable 5 (Max) (72.9) beat Claude Opus 4.7 by 8.1&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE) - Dart&lt;/strong&gt;: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (88.0) beat GPT-5.3 Codex (xHigh) by 8.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE) - R&lt;/strong&gt;: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (82.0) beat GPT-5.5 (Medium) by 8.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE) - Swift&lt;/strong&gt;: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (100.0) beat GPT-5.5 (xHigh) by 8.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Vals AI Vibe Code Bench&lt;/strong&gt;: Claude Fable 5 (90.35) beat Claude Opus 4.8 by 7.63&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Humanity's Last Exam&lt;/strong&gt;: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (53.34) beat Claude Opus 4.8 (Adaptive Reasoning, Max Effort) by 7.6&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience&lt;/strong&gt;: Claude Fable 5 (Adaptive Reasoning, Max Effort, 
Opus 4.8 Fallback) (40.15) beat Gemini 3.1 Pro (Preview) by 7.22&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Vellum - HumanEval&lt;/strong&gt;: Claude Mythos 5 (95.5) beat Claude Opus 4.8 by 6.9&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Vellum - Humanity's Last Exam&lt;/strong&gt;: Claude Mythos 5 (64.5) beat Claude Opus 4.8 by 6.6&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - Language crh&lt;/strong&gt;: step-3.7-flash-20260528 (73.05) beat Gemini 3.1 Pro (Preview) by 6.27&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE) - Java&lt;/strong&gt;: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (79.0) beat GPT-5.3 Codex (xHigh) by 6.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Vals AI ProofBench&lt;/strong&gt;: Claude Fable 5 (77.0) beat aristotle by 6.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Business&lt;/strong&gt;: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (55.0) beat GPT-5.5 (xHigh) by 5.9&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Science, Engineering &amp; Mathematics&lt;/strong&gt;: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (57.1) beat GPT-5.5 (High) by 4.8&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Vals AI (Vals Index)&lt;/strong&gt;: Claude Fable 5 (75.14) beat Claude Opus 4.8 by 4.78&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Vals AI IOI&lt;/strong&gt;: Claude Fable 5 (72.25) beat GPT-5.4 (2026-03-05) by 4.42&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Humanities &amp; Social Sciences&lt;/strong&gt;: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (60.9) beat Gemini 3 Pro (Preview) (High) by 4.3&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE) - Go&lt;/strong&gt;: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (88.0) beat GPT-5.5 (High) by 4.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Artificial Analysis Intelligence Index&lt;/strong&gt;: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) 
(64.88) beat Claude Opus 4.8 (Adaptive Reasoning, Max Effort) by 3.44&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - Language cv&lt;/strong&gt;: gemma-4-31B-it-20260402 (69.3) beat Claude Opus 4.5 by 3.39&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Vals AI CorpFin v2&lt;/strong&gt;: Claude Fable 5 (71.83) beat Grok 4.3 by 3.3&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Vals AI Multimodal Index&lt;/strong&gt;: Claude Fable 5 (74.15) beat Claude Opus 4.8 by 3.26&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE)&lt;/strong&gt;: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (87.6) beat GPT-5.5 (xHigh) by 3.2&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - MGSM&lt;/strong&gt;: Claude Opus 4.8 (96.62) beat Claude Opus 4.6 by 2.36&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - Language ban&lt;/strong&gt;: step-3.7-flash-20260528 (69.03) beat Claude Opus 4.5 by 2.32&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Terminal-Bench Hard&lt;/strong&gt;: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (62.88) beat GPT-5.5 (xHigh) by 2.27&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LiveBench Plot Unscrambling&lt;/strong&gt;: Claude Fable 5 (xHigh) (78.09) beat GPT-5.5 (High) by 1.81&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLM Stats (OSWorld-Verified)&lt;/strong&gt;: Claude Fable 5 (85.0) beat Claude Opus 4.8 by 1.6&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE) - Python&lt;/strong&gt;: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (92.0) beat GPT-5.5 (xHigh) by 1.5&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - Language chm&lt;/strong&gt;: Claude Opus 4.7 (63.6) beat Gemini 3.1 Pro (Preview) by 1.48&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - Language doi&lt;/strong&gt;: Claude Opus 4.7 (71.84) beat Gemini 3 Pro (Preview) by 1.46&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA CritPt&lt;/strong&gt;: Claude Fable 5 (Adaptive Reasoning, Max Effort, 
Opus 4.8 Fallback) (28.57) beat GPT-5.5 (xHigh) by 1.43&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - Language es&lt;/strong&gt;: Gemini 3.1 Flash Lite (76.16) beat Claude Opus 4.6 by 1.42&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA SciCode&lt;/strong&gt;: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (60.19) beat Gemini 3.1 Pro (Preview) by 1.28&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - Language ace&lt;/strong&gt;: step-3.7-flash-20260528 (72.48) beat Gemini 3.1 Pro (Preview) by 1.28&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - MMLU&lt;/strong&gt;: intellect-3-20251126 (100.0) beat Claude Sonnet 4.6 by 1.27&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - ARC&lt;/strong&gt;: intellect-3-20251126 (100.0) beat Gemini 3.1 Pro (Preview) by 1.26&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Vals AI LegalBench&lt;/strong&gt;: Claude Fable 5 (88.56) beat Gemini 3.1 Pro (Preview) by 1.16&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - Language ca&lt;/strong&gt;: Gemini 3.1 Flash Lite (76.29) beat Gemini 3 Pro (Preview) by 1.03&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Opper TaskBench&lt;/strong&gt;: Claude Fable 5 (96.4) beat Claude Opus 4.7 by 1.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - Language ar&lt;/strong&gt;: Claude Opus 4.8 (71.58) beat Claude Opus 4.5 by 0.95&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - Language en&lt;/strong&gt;: Gemini 3.1 Flash Lite (87.28) beat MiniMax-M2.5 by 0.77&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - Language cy&lt;/strong&gt;: Gemini 3.1 Flash Lite (82.03) beat Claude Sonnet 4.5 by 0.65&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - Language am&lt;/strong&gt;: Gemini 3.1 Flash Lite (68.6) beat Claude Opus 4.6 by 0.59&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Vals AI MedScribe&lt;/strong&gt;: Claude Fable 5 (88.52) beat GPT-5.1 by 0.43&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - Language 
be&lt;/strong&gt;: Claude Opus 4.8 (69.43) beat Gemini 3.1 Pro (Preview) by 0.32&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - Language ceb&lt;/strong&gt;: Gemini 3.1 Flash Lite (78.06) beat Gemini 3.1 Pro (Preview) by 0.29&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - Language el&lt;/strong&gt;: Gemini 3.1 Flash Lite (73.81) beat Claude Opus 4.5 by 0.15&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Blueprint-Bench 2&lt;/strong&gt;: Claude Fable 5 (0.386) beat GPT-5.5 by 0.02&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LiveBench Olympiad&lt;/strong&gt;: Claude Fable 5 (High) (92.18) beat Claude Opus 4.6 (Thinking, High) by 0.01&lt;/li&gt;&lt;/ul&gt;</content><summary>AI Benchmark Digest — 2026-06-10

=== DAILY ===
NEW BENCHMARKS (1)
  - SkateBench (Success Rate (%)): leader gemini-3.1-pro-preview (96.92), 28 models
      Skateboarding-domain knowledge benchmark ranking models by how well they identify technical skateboard tricks from 390 trick definitions. Skate</summary></entry><entry><title>AI Benchmark Digest — 2026-06-10</title><id>https://aibenchmarks.dev/digest/2026-06-10</id><updated>2026-06-10T08:06:50.673963+00:00</updated><link href="https://aibenchmarks.dev/#/digest" /><author><name>AI Benchmark Hub</name></author><content type="html">&lt;h2&gt;Daily&lt;/h2&gt;
&lt;h3&gt;New Benchmarks (1)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;SkateBench&lt;/strong&gt; (Success Rate (%)): leader gemini-3.1-pro-preview (96.92), 28 models&lt;br&gt;&lt;span&gt;Skateboarding-domain knowledge benchmark ranking models by how well they identify technical skateboard tricks from 390 trick definitions. SkateBench v2 reports success rate, cost, and speed.&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;New Models (2)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Claude 5&lt;/strong&gt; — ELO 1904, #22&lt;ul&gt;&lt;li&gt;LiveBench Olympiad: 92.18 (#1/124)&lt;/li&gt;&lt;li&gt;LiveBench Plot Unscrambling: 78.09 (#1/124)&lt;/li&gt;&lt;li&gt;LiveBench Python: 95.0 (#1/124)&lt;/li&gt;&lt;li&gt;Opper TaskBench: 96.4 (#1/85)&lt;/li&gt;&lt;li&gt;Vals AI (Vals Index): 75.14 (#1/25)&lt;/li&gt;&lt;li&gt;Vals AI Multimodal Index: 74.15 (#1/20)&lt;/li&gt;&lt;li&gt;Vals AI LegalBench: 88.56 (#1/114)&lt;/li&gt;&lt;li&gt;Vals AI CorpFin v2: 71.83 (#1/111)&lt;/li&gt;&lt;li&gt;Vals AI MedScribe: 88.52 (#1/62)&lt;/li&gt;&lt;li&gt;Vals AI ProofBench: 77.0 (#1/37)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Fable 5&lt;/strong&gt; — ELO 1901, #23&lt;ul&gt;&lt;li&gt;Blueprint-Bench 2: 0.386 (#1/14)&lt;/li&gt;&lt;li&gt;LLM Stats (OSWorld-Verified): 85.0 (#1/16)&lt;/li&gt;&lt;li&gt;YC-Bench: 1977.6 (#1/21)&lt;/li&gt;&lt;li&gt;SEAL - MCP Atlas: 83.3 (#2/23)&lt;/li&gt;&lt;li&gt;Vellum - HumanEval: 95.0 (#2/38)&lt;/li&gt;&lt;li&gt;Vellum - GPQA: 94.1 (#3/57)&lt;/li&gt;&lt;li&gt;ClockBench: 35.0 (#4/27)&lt;/li&gt;&lt;li&gt;LLM Stats (GDPval-AA): 64.4 (#11/12)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;New #1 Leaders (55)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;YC-Bench&lt;/strong&gt;: Claude Fable 5 (1977.6) beat Claude Opus 4.7 by 263.1&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Multi-turn Debate (Lechmazur)&lt;/strong&gt;: Claude Fable 5 (High) (1770.9) beat Claude Opus 4.7 (High) by 53.8&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA GDPval&lt;/strong&gt;: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (1932.47) beat Claude Opus 4.8 (Adaptive Reasoning, Max Effort) by 42.67&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - Language ay&lt;/strong&gt;: step-3.7-flash-20260528 (77.14) beat Gemini 3.1 Pro (Preview) by 14.23&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LiveBench Python&lt;/strong&gt;: Claude 5 (95.0) beat Claude Opus 4.5 (Thinking 64K, High) (2025-11-01) by 10.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;CursorBench 3.1&lt;/strong&gt;: Fable 5 Max (72.9) beat Claude Opus 4.7 by 8.1&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE) - Dart&lt;/strong&gt;: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (88.0) beat GPT-5.3 Codex (xHigh) by 8.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE) - R&lt;/strong&gt;: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (82.0) beat GPT-5.5 (Medium) by 8.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE) - Swift&lt;/strong&gt;: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (100.0) beat GPT-5.5 (xHigh) by 8.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Vals AI Vibe Code Bench&lt;/strong&gt;: Claude 5 (90.35) beat Claude Opus 4.8 by 7.63&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Humanity's Last Exam&lt;/strong&gt;: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (53.34) beat Claude Opus 4.8 (Adaptive Reasoning, Max Effort) by 7.6&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience&lt;/strong&gt;: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) 
(40.15) beat Gemini 3.1 Pro (Preview) by 7.22&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Vellum - HumanEval&lt;/strong&gt;: Claude Mythos 5 (95.5) beat Claude Opus 4.8 by 6.9&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Vellum - Humanity's Last Exam&lt;/strong&gt;: Claude Mythos 5 (64.5) beat Claude Opus 4.8 by 6.6&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - Language crh&lt;/strong&gt;: step-3.7-flash-20260528 (73.05) beat Gemini 3.1 Pro (Preview) by 6.27&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE) - Java&lt;/strong&gt;: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (79.0) beat GPT-5.3 Codex (xHigh) by 6.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Vals AI ProofBench&lt;/strong&gt;: Claude 5 (77.0) beat aristotle by 6.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Business&lt;/strong&gt;: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (55.0) beat GPT-5.5 (xHigh) by 5.9&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Science, Engineering &amp; Mathematics&lt;/strong&gt;: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (57.1) beat GPT-5.5 (High) by 4.8&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Vals AI (Vals Index)&lt;/strong&gt;: Claude 5 (75.14) beat Claude Opus 4.8 by 4.78&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Vals AI IOI&lt;/strong&gt;: Claude 5 (72.25) beat GPT-5.4 (2026-03-05) by 4.42&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Humanities &amp; Social Sciences&lt;/strong&gt;: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (60.9) beat Gemini 3 Pro (Preview) (High) by 4.3&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE) - Go&lt;/strong&gt;: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (88.0) beat GPT-5.5 (High) by 4.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Artificial Analysis Intelligence Index&lt;/strong&gt;: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (64.88) beat Claude Opus 4.8 (Adaptive 
Reasoning, Max Effort) by 3.44&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - Language cv&lt;/strong&gt;: gemma-4-31B-it-20260402 (69.3) beat Claude Opus 4.5 by 3.39&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Vals AI CorpFin v2&lt;/strong&gt;: Claude 5 (71.83) beat Grok 4.3 by 3.3&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Vals AI Multimodal Index&lt;/strong&gt;: Claude 5 (74.15) beat Claude Opus 4.8 by 3.26&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE)&lt;/strong&gt;: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (87.6) beat GPT-5.5 (xHigh) by 3.2&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - MGSM&lt;/strong&gt;: Claude Opus 4.8 (96.62) beat Claude Opus 4.6 by 2.36&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - Language ban&lt;/strong&gt;: step-3.7-flash-20260528 (69.03) beat Claude Opus 4.5 by 2.32&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Terminal-Bench Hard&lt;/strong&gt;: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (62.88) beat GPT-5.5 (xHigh) by 2.27&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LiveBench Plot Unscrambling&lt;/strong&gt;: Claude 5 (78.09) beat GPT-5.5 (High) by 1.81&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLM Stats (OSWorld-Verified)&lt;/strong&gt;: Claude Fable 5 (85.0) beat Claude Opus 4.8 by 1.6&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE) - Python&lt;/strong&gt;: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (92.0) beat GPT-5.5 (xHigh) by 1.5&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - Language chm&lt;/strong&gt;: Claude Opus 4.7 (63.6) beat Gemini 3.1 Pro (Preview) by 1.48&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - Language doi&lt;/strong&gt;: Claude Opus 4.7 (71.84) beat Gemini 3 Pro (Preview) by 1.46&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA CritPt&lt;/strong&gt;: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (28.57) beat GPT-5.5 (xHigh) by 
1.43&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - Language es&lt;/strong&gt;: Gemini 3.1 Flash Lite (76.16) beat Claude Opus 4.6 by 1.42&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA SciCode&lt;/strong&gt;: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (60.19) beat Gemini 3.1 Pro (Preview) by 1.28&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - Language ace&lt;/strong&gt;: step-3.7-flash-20260528 (72.48) beat Gemini 3.1 Pro (Preview) by 1.28&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - MMLU&lt;/strong&gt;: intellect-3-20251126 (100.0) beat Claude Sonnet 4.6 by 1.27&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - ARC&lt;/strong&gt;: intellect-3-20251126 (100.0) beat Gemini 3.1 Pro (Preview) by 1.26&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Vals AI LegalBench&lt;/strong&gt;: Claude 5 (88.56) beat Gemini 3.1 Pro (Preview) by 1.16&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - Language ca&lt;/strong&gt;: Gemini 3.1 Flash Lite (76.29) beat Gemini 3 Pro (Preview) by 1.03&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Opper TaskBench&lt;/strong&gt;: Claude 5 (96.4) beat Claude Opus 4.7 by 1.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - Language ar&lt;/strong&gt;: Claude Opus 4.8 (71.58) beat Claude Opus 4.5 by 0.95&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - Language en&lt;/strong&gt;: Gemini 3.1 Flash Lite (87.28) beat MiniMax-M2.5 by 0.77&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - Language cy&lt;/strong&gt;: Gemini 3.1 Flash Lite (82.03) beat Claude Sonnet 4.5 by 0.65&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - Language am&lt;/strong&gt;: Gemini 3.1 Flash Lite (68.6) beat Claude Opus 4.6 by 0.59&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Vals AI MedScribe&lt;/strong&gt;: Claude 5 (88.52) beat GPT-5.1 by 0.43&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - Language be&lt;/strong&gt;: Claude Opus 4.8 (69.43) beat Gemini 3.1 Pro (Preview) 
by 0.32&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - Language ceb&lt;/strong&gt;: Gemini 3.1 Flash Lite (78.06) beat Gemini 3.1 Pro (Preview) by 0.29&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - Language el&lt;/strong&gt;: Gemini 3.1 Flash Lite (73.81) beat Claude Opus 4.5 by 0.15&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Blueprint-Bench 2&lt;/strong&gt;: Claude Fable 5 (0.386) beat GPT-5.5 by 0.02&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LiveBench Olympiad&lt;/strong&gt;: Claude 5 (92.18) beat Claude Opus 4.6 (Thinking) (High) by 0.01&lt;/li&gt;&lt;/ul&gt;</content><summary>AI Benchmark Digest — 2026-06-10

=== DAILY ===
NEW BENCHMARKS (1)
  - SkateBench (Success Rate (%)): leader gemini-3.1-pro-preview (96.92), 28 models
      Skateboarding-domain knowledge benchmark ranking models by how well they identify technical skateboard tricks from 390 trick definitions. Skate</summary></entry><entry><title>AI Benchmark Digest — 2026-06-09</title><id>https://aibenchmarks.dev/digest/2026-06-09</id><updated>2026-06-09T07:53:25.528997+00:00</updated><link href="https://aibenchmarks.dev/#/digest" /><author><name>AI Benchmark Hub</name></author><content type="html">&lt;h2&gt;Daily&lt;/h2&gt;
&lt;h3&gt;Top-10 New Scores (2)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5 (xHigh)&lt;/strong&gt; on SEAL - SWE Atlas - Codebase QnA: 45.43 (#2)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5 (xHigh)&lt;/strong&gt; on SEAL - SWE Atlas - Test Writing: 42.59 (#3)&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;New #1 Leaders (7)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;GSMA Open-Telco - TeleTables&lt;/strong&gt;: TelecomGPT (88.0) beat OTel-LLM-8.3B-QnA by 26.2&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GSMA Open-Telco LLM Leaderboard&lt;/strong&gt;: TelecomGPT (89.64) beat OTel-LLM-8.3B-QnA by 3.66&lt;/li&gt;&lt;li&gt;&lt;strong&gt;SEAL - SWE Atlas - Codebase QnA&lt;/strong&gt;: Opus 4.8 (Claude Code) (48.79) beat GPT-5.5 by 3.36&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GSMA Open-Telco - 3GPP&lt;/strong&gt;: TelecomGPT (84.22) beat OTel-LLM-8.3B-QnA by 2.82&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GSMA Open-Telco - TeleLogs&lt;/strong&gt;: TelecomGPT (98.96) beat OTel-LLM-8.3B-QnA by 2.66&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GSMA Open-Telco - srsRAN-Bench&lt;/strong&gt;: TelecomGPT (91.33) beat OTel-LLM-8.3B-QnA by 1.65&lt;/li&gt;&lt;li&gt;&lt;strong&gt;SEAL - SWE Atlas - Test Writing&lt;/strong&gt;: Opus 4.8 (Claude Code) (45.56) beat GPT-5.4 (xHigh) by 1.2&lt;/li&gt;&lt;/ul&gt;</content><summary>AI Benchmark Digest — 2026-06-09

=== DAILY ===
NEW SCORES FROM TOP-10 MODELS (2)
  - GPT-5.5 (xHigh) on SEAL - SWE Atlas - Codebase QnA: 45.43 Score (#2/14)
  - GPT-5.5 (xHigh) on SEAL - SWE Atlas - Test Writing: 42.59 Score (#3/14)

NEW #1 LEADERS (7)
  - GSMA Open-Telco - TeleTables (Score (%)): </summary></entry><entry><title>AI Benchmark Digest — 2026-06-07</title><id>https://aibenchmarks.dev/digest/2026-06-07</id><updated>2026-06-07T08:34:58.487719+00:00</updated><link href="https://aibenchmarks.dev/#/digest" /><author><name>AI Benchmark Hub</name></author><content type="html">&lt;h2&gt;Weekly&lt;/h2&gt;
&lt;h3&gt;New Models (2)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;MiniMax-M3&lt;/strong&gt; — ELO 1762, #83&lt;ul&gt;&lt;li&gt;LLM Stats (OmniDocBench 1.5): 91.6 (#1/13)&lt;/li&gt;&lt;li&gt;LLM Stats (Video-MME): 85.4 (#2/13)&lt;/li&gt;&lt;li&gt;OpenClawProBench: 75.1 (#2/65)&lt;/li&gt;&lt;li&gt;Vals AI MedScribe: 87.25 (#2/61)&lt;/li&gt;&lt;li&gt;AA IFBench: 82.86 (#3/429)&lt;/li&gt;&lt;li&gt;LLM Stats (Claw-Eval): 74.5 (#3/9)&lt;/li&gt;&lt;li&gt;LLM Stats (NL2Repo): 42.13 (#3/7)&lt;/li&gt;&lt;li&gt;AA GPQA Diamond: 92.93 (#4/501)&lt;/li&gt;&lt;li&gt;Vals AI CorpFin v2: 68.1 (#4/110)&lt;/li&gt;&lt;li&gt;Design Arena (3D): 1348.0 (#5/115)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;nemotron-3-ultra-550B-a55B&lt;/strong&gt; — ELO 1587, #292&lt;ul&gt;&lt;li&gt;PinchBench: 90.58 (#10/49)&lt;/li&gt;&lt;li&gt;Vals AI CorpFin v2: 65.46 (#16/110)&lt;/li&gt;&lt;li&gt;Vals AI (Vals Index): 43.99 (#18/24)&lt;/li&gt;&lt;li&gt;LiveBench Python: 75.0 (#24/122)&lt;/li&gt;&lt;li&gt;LiveBench Paraphrase: 61.15 (#33/122)&lt;/li&gt;&lt;li&gt;Vals AI TaxEval v2: 73.1 (#34/116)&lt;/li&gt;&lt;li&gt;Bullshit Benchmark: 41.8 (#34/148)&lt;/li&gt;&lt;li&gt;Vals AI MedCode: 38.62 (#35/62)&lt;/li&gt;&lt;li&gt;AI Chess Leaderboard (Reasoning): 975.0 (#39/277)&lt;/li&gt;&lt;li&gt;LiveBench Code Generation: 77.47 (#43/122)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;Top-10 New Scores (4)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5 (xHigh)&lt;/strong&gt; on IMO-Bench: 71.9 (#4)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5 Pro&lt;/strong&gt; on IUMB: 100.0 (#2)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5 Pro (xHigh)&lt;/strong&gt; on IMO-Bench: 88.1 (#2)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3 Deep Think&lt;/strong&gt; on IUMB: 87.5 (#6)&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;New #1 Leaders (10)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;EQ-Bench Creative Writing v3&lt;/strong&gt;: Claude Opus 4.7 (2050.8) beat GPT-5.4 by 144.8&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Chatbot Arena (Image-to-Video)&lt;/strong&gt;: Grok 1.5 (1473.0) beat dreamina-seedance-2.0-720p by 11.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLM Stats (Multi-Challenge)&lt;/strong&gt;: Nova 2 Pro (77.7) beat GPT-5 by 8.1&lt;/li&gt;&lt;li&gt;&lt;strong&gt;MathArena - Kangaroo 2025 Levels 11-12&lt;/strong&gt;: Claude Opus 4.8 (Thinking) (100.0) beat GPT-5.4 (xHigh) by 1.67&lt;/li&gt;&lt;li&gt;&lt;strong&gt;MathArena - APEX 2025&lt;/strong&gt;: Claude Opus 4.8 (Thinking) (81.25) beat GPT-5.5 (xHigh) by 1.04&lt;/li&gt;&lt;li&gt;&lt;strong&gt;MathArena - Kangaroo 2025 Levels 7-8&lt;/strong&gt;: Claude Opus 4.8 (Thinking) (96.67) beat GPT-5.4 (xHigh) by 0.84&lt;/li&gt;&lt;li&gt;&lt;strong&gt;MathArena - AIME 2026&lt;/strong&gt;: Claude Opus 4.8 (Thinking) (100.0) beat GPT-5.4 (xHigh) by 0.83&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLM Stats (OmniDocBench 1.5)&lt;/strong&gt;: MiniMax-M3 (91.6) beat Qwen 3.6 Plus by 0.4&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GAIA&lt;/strong&gt;: CustomGPT.ai Research Lab v44 (93.36) beat Co-Sight Pro v1.0.1 by 0.34&lt;/li&gt;&lt;li&gt;&lt;strong&gt;ForecastBench&lt;/strong&gt;: Grok 4.20 (Beta, D) (68.1) beat green-tree by 0.2&lt;/li&gt;&lt;/ul&gt;</content><summary>AI Benchmark Digest — 2026-06-07

=== WEEKLY ===
NEW MODELS (2)
  - MiniMax-M3 — ELO 1762, #83/970 (above: Gemini 3 Flash (High), below: Claude Opus 4.5 (Non-reasoning))
      LLM Stats (OmniDocBench 1.5): 91.6 (#1/13)
      LLM Stats (Video-MME): 85.4 (#2/13)
      OpenClawProBench: 75.1 (#2/65)
  </summary></entry><entry><title>AI Benchmark Digest — 2026-06-06</title><id>https://aibenchmarks.dev/digest/2026-06-06</id><updated>2026-06-06T07:45:06.870709+00:00</updated><link href="https://aibenchmarks.dev/#/digest" /><author><name>AI Benchmark Hub</name></author><content type="html">&lt;h2&gt;Daily&lt;/h2&gt;
&lt;h3&gt;New Benchmarks (20)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Pencil Puzzle Bench - Yajilin&lt;/strong&gt; (Direct-ask Success Rate (%)): leader gpt-5.2 (High) (20.0), 51 models&lt;br&gt;&lt;span&gt;PPBench direct-ask success rate on Yajilin loop-and-shading puzzles from the golden_300 split, testing exact constraint solving from puzz.link grids.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Pencil Puzzle Bench - Slitherlink&lt;/strong&gt; (Direct-ask Success Rate (%)): leader gpt-5.2 (High) (33.3), 51 models&lt;br&gt;&lt;span&gt;PPBench direct-ask success rate on Slitherlink loop puzzles, where numbered cells constrain how a single continuous loop surrounds the grid.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Pencil Puzzle Bench - Heyawake&lt;/strong&gt; (Direct-ask Success Rate (%)): leader claude-opus-4-5-high (0.0), 51 models&lt;br&gt;&lt;span&gt;PPBench direct-ask success rate on Heyawake room-shading puzzles, testing region constraints, connectivity, and line-of-sight reasoning.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Pencil Puzzle Bench - Mashu&lt;/strong&gt; (Direct-ask Success Rate (%)): leader gpt-5.2 (xHigh) (60.0), 51 models&lt;br&gt;&lt;span&gt;PPBench direct-ask success rate on Mashu loop puzzles, where black and white pearls impose turn and straight-line constraints.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Pencil Puzzle Bench - Shakashaka&lt;/strong&gt; (Direct-ask Success Rate (%)): leader claude-sonnet-4-5 (0.0), 51 models&lt;br&gt;&lt;span&gt;PPBench direct-ask success rate on Shakashaka triangle-shading puzzles, testing local clue satisfaction and global rectangle formation.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Pencil Puzzle Bench - Nurikabe&lt;/strong&gt; (Direct-ask Success Rate (%)): leader claude-opus-4-6 (Thinking) (33.3), 51 models&lt;br&gt;&lt;span&gt;PPBench direct-ask success rate on Nurikabe island puzzles, where numbered islands must be separated by one connected wall 
region.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Pencil Puzzle Bench - LITS&lt;/strong&gt; (Direct-ask Success Rate (%)): leader gpt-5.2 (xHigh) (53.3), 51 models&lt;br&gt;&lt;span&gt;PPBench direct-ask success rate on LITS tetromino-shading puzzles, testing region-wise shape placement and adjacency constraints.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Pencil Puzzle Bench - Light Up&lt;/strong&gt; (Direct-ask Success Rate (%)): leader claude-opus-4-6 (Thinking) (66.7), 51 models&lt;br&gt;&lt;span&gt;PPBench direct-ask success rate on Light Up puzzles, where lamps must illuminate every open cell while satisfying numbered black-cell clues.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Pencil Puzzle Bench - Nurimisaki&lt;/strong&gt; (Direct-ask Success Rate (%)): leader claude-opus-4-6 (Thinking) (33.3), 51 models&lt;br&gt;&lt;span&gt;PPBench direct-ask success rate on Nurimisaki puzzles, a Nurikabe-family grid task requiring connected-region reasoning around clue cells.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Pencil Puzzle Bench - Shikaku&lt;/strong&gt; (Direct-ask Success Rate (%)): leader gpt-5.2 (xHigh) (80.0), 51 models&lt;br&gt;&lt;span&gt;PPBench direct-ask success rate on Shikaku rectangle-partitioning puzzles, where each numbered clue defines one rectangle of matching area.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Pencil Puzzle Bench - Norinori&lt;/strong&gt; (Direct-ask Success Rate (%)): leader gpt-5.2 (xHigh) (93.3), 51 models&lt;br&gt;&lt;span&gt;PPBench direct-ask success rate on Norinori shading puzzles, testing room constraints and two-cell adjacency patterns.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Pencil Puzzle Bench - Double Choco&lt;/strong&gt; (Direct-ask Success Rate (%)): leader gemini-3.1-pro (6.7), 51 models&lt;br&gt;&lt;span&gt;PPBench direct-ask success rate on Double Choco region-division puzzles, testing balanced partitioning under color and shape 
constraints.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Pencil Puzzle Bench - Firefly&lt;/strong&gt; (Direct-ask Success Rate (%)): leader gpt-5.2 (xHigh) (33.3), 51 models&lt;br&gt;&lt;span&gt;PPBench direct-ask success rate on Firefly line-drawing puzzles, testing path construction from directional clues and grid constraints.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Pencil Puzzle Bench - Sashigane&lt;/strong&gt; (Direct-ask Success Rate (%)): leader mistral-large-2512 (0.0), 51 models&lt;br&gt;&lt;span&gt;PPBench direct-ask success rate on Sashigane shape-partitioning puzzles, testing right-angle region construction from numbered and directional clues.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Pencil Puzzle Bench - Sudoku&lt;/strong&gt; (Direct-ask Success Rate (%)): leader gpt-5.2 (xHigh) (20.0), 51 models&lt;br&gt;&lt;span&gt;PPBench direct-ask success rate on Sudoku puzzles, testing classic row, column, and box constraint satisfaction through exact move outputs.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Pencil Puzzle Bench - Nurimaze&lt;/strong&gt; (Direct-ask Success Rate (%)): leader claude-opus-4-6 (Thinking) (26.7), 51 models&lt;br&gt;&lt;span&gt;PPBench direct-ask success rate on Nurimaze puzzles, testing maze-style path and shading constraints in a connected grid.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Pencil Puzzle Bench - Tapa&lt;/strong&gt; (Direct-ask Success Rate (%)): leader gpt-5.2 (xHigh) (60.0), 51 models&lt;br&gt;&lt;span&gt;PPBench direct-ask success rate on Tapa shading puzzles, where clue numbers describe blocks of shaded neighboring cells.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Pencil Puzzle Bench - Kurodoko&lt;/strong&gt; (Direct-ask Success Rate (%)): leader gpt-5.2 (xHigh) (6.7), 51 models&lt;br&gt;&lt;span&gt;PPBench direct-ask success rate on Kurodoko visibility puzzles, testing shading, sight-line counts, and connected unshaded cells.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Pencil Puzzle 
Bench - Country&lt;/strong&gt; (Direct-ask Success Rate (%)): leader gemini-3.1-pro (6.7), 51 models&lt;br&gt;&lt;span&gt;PPBench direct-ask success rate on Country region puzzles, testing loop and region constraints over a partitioned grid.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Pencil Puzzle Bench - Hitori&lt;/strong&gt; (Direct-ask Success Rate (%)): leader claude-opus-4-6 (Thinking) (66.7), 51 models&lt;br&gt;&lt;span&gt;PPBench direct-ask success rate on Hitori number-grid puzzles, where repeated numbers are shaded while preserving connectivity and non-adjacency constraints.&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;New #1 Leaders (24)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;LLM Stats (Multi-Challenge)&lt;/strong&gt;: Nova 2 Pro (77.7) beat GPT-5 by 8.1&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Ukrainian LLM - Global MMLU Full UK World Religions&lt;/strong&gt;: MamayLM-Gemma-3-27B-IT-v2.0 (87.13) beat gemma-3-12B-pt by 7.6&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Ukrainian LLM - Global MMLU Full UK High School US History&lt;/strong&gt;: MamayLM-Gemma-3-27B-IT-v2.0 (91.67) beat MamayLM-Gemma-3-12B-IT-v1.0 by 5.4&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Ukrainian LLM - Global MMLU Full UK Anatomy&lt;/strong&gt;: MamayLM-Gemma-3-27B-IT-v2.0 (65.19) beat lapa-12B-pt by 5.19&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Ukrainian LLM - Global MMLU Full UK Clinical Knowledge&lt;/strong&gt;: MamayLM-Gemma-3-27B-IT-v2.0 (77.74) beat gemma-3-12B-pt by 4.53&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Ukrainian LLM - Global MMLU Full UK Professional LAW&lt;/strong&gt;: MamayLM-Gemma-3-27B-IT-v2.0 (51.5) beat gemma-3-12B-pt by 4.43&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Ukrainian LLM - Global MMLU Full UK Humanities&lt;/strong&gt;: MamayLM-Gemma-3-27B-IT-v2.0 (61.68) beat Qwen3-8B-Base by 4.12&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Ukrainian LLM - Global MMLU Full UK Computer Security&lt;/strong&gt;: MamayLM-Gemma-3-12B-IT-v2.0 (82.0) beat MamayLM-Gemma-3-12B-IT-v1.0-FP8-Static-Nadiia by 4.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Ukrainian LLM - Global MMLU Full UK Global Facts&lt;/strong&gt;: MamayLM-Gemma-3-27B-IT-v2.0 (52.0) beat Gemma 3 12B (IT) by 4.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Ukrainian LLM - Global MMLU Full UK Miscellaneous&lt;/strong&gt;: MamayLM-Gemma-3-27B-IT-v2.0 (83.52) beat MamayLM-Gemma-3-12B-IT-v1.0-FP8-Static-Nadiia by 3.95&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Ukrainian LLM - Global MMLU Full UK Prehistory&lt;/strong&gt;: MamayLM-Gemma-3-27B-IT-v2.0 (77.78) beat gemma-3-12B-pt by 3.71&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Ukrainian LLM - Global MMLU Full UK Other&lt;/strong&gt;: MamayLM-Gemma-3-27B-IT-v2.0 (74.57) 
beat gemma-3-12B-pt by 3.41&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Ukrainian LLM - Global MMLU Full UK Business Ethics&lt;/strong&gt;: MamayLM-Gemma-3-12B-IT-v2.0 (77.0) beat MamayLM-Gemma-3-12B-IT-v1.0 by 3.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Ukrainian LLM - Global MMLU Full UK High School World History&lt;/strong&gt;: MamayLM-Gemma-3-27B-IT-v2.0 (86.08) beat gemma-3-12B-pt by 1.69&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Ukrainian LLM - Global MMLU Full UK High School Microeconomics&lt;/strong&gt;: MamayLM-Gemma-3-27B-IT-v2.0 (84.45) beat Qwen3-8B-Base by 1.68&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Ukrainian LLM - Global MMLU Full UK Marketing&lt;/strong&gt;: MamayLM-Gemma-3-27B-IT-v2.0 (88.89) beat MamayLM-Gemma-3-12B-IT-v1.0 by 1.28&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Ukrainian LLM - Global MMLU Full UK Professional Psychology&lt;/strong&gt;: MamayLM-Gemma-3-27B-IT-v2.0 (70.1) beat gemma-3-12B-pt by 0.98&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Ukrainian LLM - Global MMLU Full UK Public Relations&lt;/strong&gt;: MamayLM-Gemma-3-12B-IT-v2.0 (68.18) beat lapa-12B-pt by 0.91&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Ukrainian LLM - Global MMLU Full UK High School European History&lt;/strong&gt;: MamayLM-Gemma-3-27B-IT-v2.0 (84.24) beat MamayLM-Gemma-3-12B-IT-v1.0-FP8-Static-Nadiia by 0.6&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Ukrainian LLM - Global MMLU Full UK High School Macroeconomics&lt;/strong&gt;: MamayLM-Gemma-3-27B-IT-v2.0 (76.67) beat gemma-3-12B-pt by 0.52&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Ukrainian LLM - Global MMLU Full UK Sociology&lt;/strong&gt;: MamayLM-Gemma-3-27B-IT-v2.0 (83.08) beat lapa-v0.1.2-instruct by 0.49&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLM Stats (OmniDocBench 1.5)&lt;/strong&gt;: MiniMax-M3 (91.6) beat Qwen 3.6 Plus by 0.4&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Ukrainian LLM - Global MMLU Full UK Professional Medicine&lt;/strong&gt;: MamayLM-Gemma-3-27B-IT-v2.0 (80.15) beat gemma-3-12B-pt by 0.37&lt;/li&gt;&lt;li&gt;&lt;strong&gt;ForecastBench&lt;/strong&gt;: Grok 4.20 (Beta, D) 
(68.1) beat green-tree by 0.3&lt;/li&gt;&lt;/ul&gt;</content><summary>AI Benchmark Digest — 2026-06-06

=== DAILY ===
NEW BENCHMARKS (20)
  - Pencil Puzzle Bench - Yajilin (Direct-ask Success Rate (%)): leader gpt-5.2 (High) (20.0), 51 models
      PPBench direct-ask success rate on Yajilin loop-and-shading puzzles from the golden_300 split, testing exact constraint s</summary></entry><entry><title>AI Benchmark Digest — 2026-06-04</title><id>https://aibenchmarks.dev/digest/2026-06-04</id><updated>2026-06-04T08:22:19.073162+00:00</updated><link href="https://aibenchmarks.dev/#/digest" /><author><name>AI Benchmark Hub</name></author><content type="html">&lt;h2&gt;Daily&lt;/h2&gt;
&lt;h3&gt;New #1 Leaders (1)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;GAIA&lt;/strong&gt;: CustomGPT.ai Research Lab v44 (93.36) beat Co-Sight Pro v1.0.1 by 0.34&lt;/li&gt;&lt;/ul&gt;</content><summary>AI Benchmark Digest — 2026-06-04

=== DAILY ===
NEW #1 LEADERS (1)
  - GAIA (Accuracy (%)): CustomGPT.ai Research Lab v44 (93.36) beat Co-Sight Pro v1.0.1 (93.02) by 0.34
</summary></entry><entry><title>AI Benchmark Digest — 2026-06-03</title><id>https://aibenchmarks.dev/digest/2026-06-03</id><updated>2026-06-03T08:25:40.519214+00:00</updated><link href="https://aibenchmarks.dev/#/digest" /><author><name>AI Benchmark Hub</name></author><content type="html">&lt;h2&gt;Daily&lt;/h2&gt;
&lt;h3&gt;Top-10 New Scores (2)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5 Pro&lt;/strong&gt; on IUMB: 100.0 (#2)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3 Deep Think&lt;/strong&gt; on IUMB: 87.5 (#6)&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;New #1 Leaders (4)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;MathArena - Kangaroo 2025 Levels 11-12&lt;/strong&gt;: Claude Opus 4.8 (Thinking) (100.0) beat GPT-5.4 (xHigh) by 1.67&lt;/li&gt;&lt;li&gt;&lt;strong&gt;MathArena - APEX 2025&lt;/strong&gt;: Claude Opus 4.8 (Thinking) (81.25) beat GPT-5.5 (xHigh) by 1.04&lt;/li&gt;&lt;li&gt;&lt;strong&gt;MathArena - Kangaroo 2025 Levels 7-8&lt;/strong&gt;: Claude Opus 4.8 (Thinking) (96.67) beat GPT-5.4 (xHigh) by 0.84&lt;/li&gt;&lt;li&gt;&lt;strong&gt;MathArena - AIME 2026&lt;/strong&gt;: Claude Opus 4.8 (Thinking) (100.0) beat GPT-5.4 (xHigh) by 0.83&lt;/li&gt;&lt;/ul&gt;</content><summary>AI Benchmark Digest — 2026-06-03

=== DAILY ===
NEW SCORES FROM TOP-10 MODELS (2)
  - GPT-5.5 Pro on IUMB: 100.0 Score (%) (#2/55)
  - Gemini 3 Deep Think on IUMB: 87.5 Score (%) (#6/55)

NEW #1 LEADERS (4)
  - MathArena - Kangaroo 2025 Levels 11-12 (Accuracy (%)): Claude Opus 4.8 (Thinking) (100.0)</summary></entry><entry><title>AI Benchmark Digest — 2026-06-02</title><id>https://aibenchmarks.dev/digest/2026-06-02</id><updated>2026-06-02T08:19:29.198019+00:00</updated><link href="https://aibenchmarks.dev/#/digest" /><author><name>AI Benchmark Hub</name></author><content type="html">&lt;h2&gt;Daily&lt;/h2&gt;
&lt;h3&gt;New Benchmarks (1)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;GIM&lt;/strong&gt; (IRT ability (theta)): leader GPT-5.4 Pro (High) (2.16), 46 models&lt;br&gt;&lt;span&gt;Grounded Integration Measure from Meta FAIR: 820 multimodal and text-grounded problems testing integrated reasoning across quantitative, spatial, language, world-knowledge, and document tasks. Scores are reported as IRT ability on GIM-820.&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;Top-10 New Scores (2)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5 (xHigh)&lt;/strong&gt; on IMO-Bench: 71.9 (#4)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5 Pro (xHigh)&lt;/strong&gt; on IMO-Bench: 88.1 (#2)&lt;/li&gt;&lt;/ul&gt;</content><summary>AI Benchmark Digest — 2026-06-02

=== DAILY ===
NEW BENCHMARKS (1)
  - GIM (IRT ability (theta)): leader GPT-5.4 Pro (High) (2.16), 46 models
      Grounded Integration Measure from Meta FAIR: 820 multimodal and text-grounded problems testing integrated reasoning across quantitative, spatial, langua</summary></entry><entry><title>AI Benchmark Digest — 2026-06-01</title><id>https://aibenchmarks.dev/digest/2026-06-01</id><updated>2026-06-01T08:29:45.265204+00:00</updated><link href="https://aibenchmarks.dev/#/digest" /><author><name>AI Benchmark Hub</name></author><content type="html">&lt;h2&gt;Daily&lt;/h2&gt;
&lt;h3&gt;New #1 Leaders (3)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;EQ-Bench Creative Writing v3&lt;/strong&gt;: Claude Opus 4.7 (2050.8) beat GPT-5.4 by 144.8&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Design Arena (Data Viz)&lt;/strong&gt;: GLM-5.1 (1367.0) beat Claude Opus 4.7 (Thinking) by 23.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Chatbot Arena (Image-to-Video)&lt;/strong&gt;: Grok 1.5 (1473.0) beat dreamina-seedance-2.0-720p by 11.0&lt;/li&gt;&lt;/ul&gt;</content><summary>AI Benchmark Digest — 2026-06-01

=== DAILY ===
NEW #1 LEADERS (3)
  - EQ-Bench Creative Writing v3 (Elo): Claude Opus 4.7 (2050.8) beat GPT-5.4 (1906.0) by 144.8
  - Design Arena (Data Viz) (Elo): GLM-5.1 (1367.0) beat Claude Opus 4.7 (Thinking) (1344.0) by 23.0
  - Chatbot Arena (Image-to-Video) (</summary></entry><entry><title>AI Benchmark Digest — 2026-05-30</title><id>https://aibenchmarks.dev/digest/2026-05-30</id><updated>2026-05-30T07:49:09.779753+00:00</updated><link href="https://aibenchmarks.dev/#/digest" /><author><name>AI Benchmark Hub</name></author><content type="html">&lt;h2&gt;Daily&lt;/h2&gt;
&lt;h3&gt;Top-10 New Scores (5)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.8 (Adaptive Reasoning, Max Effort)&lt;/strong&gt; on UGI - Natural Intelligence: 65.39 (#30)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.8 (Adaptive Reasoning, Max Effort)&lt;/strong&gt; on UGI - Willingness (W/10): 2.2 (#1094)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.8 (Adaptive Reasoning, Max Effort)&lt;/strong&gt; on UGI - Writing: 65.88 (#34)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.8 (Adaptive Reasoning, Max Effort)&lt;/strong&gt; on UGI Leaderboard: 52.64 (#69)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-5.4 (xHigh)&lt;/strong&gt; on Creative Writing (Lechmazur): 3.4 (#2)&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;New #1 Leaders (2)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Bullshit Benchmark&lt;/strong&gt;: Claude Opus 4.8 (96.4) beat Claude Sonnet 4.6 by 1.9&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Creative Writing (Lechmazur)&lt;/strong&gt;: GPT-5.5 (xHigh) (3.5) beat GPT-5.5 (Thinking, xHigh) by 0.3&lt;/li&gt;&lt;/ul&gt;</content><summary>AI Benchmark Digest — 2026-05-30

=== DAILY ===
NEW SCORES FROM TOP-10 MODELS (5)
  - Claude Opus 4.8 (Adaptive Reasoning, Max Effort) on UGI - Natural Intelligence: 65.39 NatInt Score (#30/1247)
  - Claude Opus 4.8 (Adaptive Reasoning, Max Effort) on UGI - Willingness (W/10): 2.2 W/10 Score (#1094/</summary></entry><entry><title>AI Benchmark Digest — 2026-05-29</title><id>https://aibenchmarks.dev/digest/2026-05-29</id><updated>2026-05-29T08:06:41.324282+00:00</updated><link href="https://aibenchmarks.dev/#/digest" /><author><name>AI Benchmark Hub</name></author><content type="html">&lt;h2&gt;Daily&lt;/h2&gt;
&lt;h3&gt;New Benchmarks (1)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;DeepSWE&lt;/strong&gt; (Pass@1 (%)): leader GPT-5.5 (xHigh) (70.0), 12 models&lt;br&gt;&lt;span&gt;DataCurve benchmark measuring frontier coding agents on original, long-horizon software engineering tasks. Reports pass rates for model configurations on realistic repository work.&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;New Models (1)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.8&lt;/strong&gt; — ELO 1801, #52&lt;ul&gt;&lt;li&gt;Clerk LLM Leaderboard: 91.3 (#1/19)&lt;/li&gt;&lt;li&gt;Vellum - HumanEval: 88.6 (#1/36)&lt;/li&gt;&lt;li&gt;Vellum - Humanity's Last Exam: 57.9 (#1/20)&lt;/li&gt;&lt;li&gt;LLM Stats (DeepSearchQA): 93.1 (#1/6)&lt;/li&gt;&lt;li&gt;LLM Stats (Include): 87.6 (#1/30)&lt;/li&gt;&lt;li&gt;LLM Stats (OSWorld-Verified): 83.4 (#1/14)&lt;/li&gt;&lt;li&gt;LLM Stats (ScreenSpot Pro): 87.9 (#1/22)&lt;/li&gt;&lt;li&gt;LLM Stats (Toolathlon): 59.9 (#1/20)&lt;/li&gt;&lt;li&gt;FrontierSWE: 83.0 (#1/11)&lt;/li&gt;&lt;li&gt;Vals AI (Vals Index): 70.17 (#1/20)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;Top-10 New Scores (2)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5 (High)&lt;/strong&gt; on WebDev Arena: 1478.93 (#16)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5 (xHigh)&lt;/strong&gt; on WebDev Arena: 1504.74 (#12)&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;New #1 Leaders (16)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;AA GDPval&lt;/strong&gt;: Claude Opus 4.8 (Adaptive Reasoning, Max Effort) (1889.8) beat GPT-5.5 (xHigh) by 120.5&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Vellum - Humanity's Last Exam&lt;/strong&gt;: Claude Opus 4.8 (57.9) beat Gemini 3 Pro by 12.1&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Clerk LLM Leaderboard&lt;/strong&gt;: Claude Opus 4.8 (91.3) beat GPT-5.4 by 11.8&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Vals AI Vibe Code Bench&lt;/strong&gt;: Claude Opus 4.8 (82.72) beat Claude Opus 4.7 by 11.72&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Epoch AI - Apex Agents&lt;/strong&gt;: gemini-3.5-flash_unknown (49.6) beat GPT-5.5 (xHigh) by 11.2&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLM Stats (OSWorld-Verified)&lt;/strong&gt;: Claude Opus 4.8 (83.4) beat Claude Mythos Preview by 3.8&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLM Stats (Toolathlon)&lt;/strong&gt;: Claude Opus 4.8 (59.9) beat Gemini 3.5 Flash by 3.4&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Vals AI Multimodal Index&lt;/strong&gt;: Claude Opus 4.8 (70.71) beat GPT-5.5 by 2.94&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Vals AI (Vals Index)&lt;/strong&gt;: Claude Opus 4.8 (70.17) beat GPT-5.5 by 2.55&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLM Stats (DeepSearchQA)&lt;/strong&gt;: Claude Opus 4.8 (93.1) beat Claude Opus 4.6 by 1.8&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLM Stats (ScreenSpot Pro)&lt;/strong&gt;: Claude Opus 4.8 (87.9) beat GPT-5.2 by 1.6&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLM Stats (Include)&lt;/strong&gt;: Claude Opus 4.8 (87.6) beat Qwen 3.7 Max by 1.4&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Artificial Analysis Intelligence Index&lt;/strong&gt;: Claude Opus 4.8 (Adaptive Reasoning, Max Effort) (61.44) beat GPT-5.5 (xHigh) by 1.2&lt;/li&gt;&lt;li&gt;&lt;strong&gt;PinchBench&lt;/strong&gt;: Claude Opus 4.8 Fast (94.49) beat Qwen Max by 1.05&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Humanity's Last Exam&lt;/strong&gt;: Claude Opus 4.8 (Adaptive Reasoning, Max Effort) (45.74) beat Gemini 3.1 Pro 
(Preview) by 1.02&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Vellum - HumanEval&lt;/strong&gt;: Claude Opus 4.8 (88.6) beat Claude Opus 4.7 by 1.0&lt;/li&gt;&lt;/ul&gt;</content><summary>AI Benchmark Digest — 2026-05-29

=== DAILY ===
NEW BENCHMARKS (1)
  - DeepSWE (Pass@1 (%)): leader GPT-5.5 (xHigh) (70.0), 12 models
      DataCurve benchmark measuring frontier coding agents on original, long-horizon software engineering tasks. Reports pass rates for model configurations on realis</summary></entry><entry><title>AI Benchmark Digest — 2026-05-28</title><id>https://aibenchmarks.dev/digest/2026-05-28</id><updated>2026-05-28T08:13:42.023730+00:00</updated><link href="https://aibenchmarks.dev/#/digest" /><author><name>AI Benchmark Hub</name></author><content type="html">&lt;h2&gt;Daily&lt;/h2&gt;
&lt;h3&gt;Top-10 New Scores (1)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5 (xHigh)&lt;/strong&gt; on SWE-rebench: 62.73 (#1)&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;New #1 Leaders (2)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Kaggle FACTS Grounding&lt;/strong&gt;: Gemma 4 26B A4B (80.87) beat GPT-5.2 by 4.7&lt;/li&gt;&lt;li&gt;&lt;strong&gt;PinchBench&lt;/strong&gt;: Qwen Max (93.44) beat Grok 0.1 by 1.37&lt;/li&gt;&lt;/ul&gt;</content><summary>AI Benchmark Digest — 2026-05-28

=== DAILY ===
NEW SCORES FROM TOP-10 MODELS (1)
  - GPT-5.5 (xHigh) on SWE-rebench: 62.73 Resolved (%) (#1/82)

NEW #1 LEADERS (2)
  - Kaggle FACTS Grounding (Score (%)): Gemma 4 26B A4B (80.87) beat GPT-5.2 (76.17) by 4.7
  - PinchBench (Success Rate (%)): Qwen Max</summary></entry><entry><title>AI Benchmark Digest — 2026-05-27</title><id>https://aibenchmarks.dev/digest/2026-05-27</id><updated>2026-05-27T08:20:58.056719+00:00</updated><link href="https://aibenchmarks.dev/#/digest" /><author><name>AI Benchmark Hub</name></author><content type="html">&lt;h2&gt;Daily&lt;/h2&gt;
&lt;h3&gt;Top-10 New Scores (1)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;GPT-5.4 (xHigh)&lt;/strong&gt; on Creative Writing (Lechmazur): 3.2 (#2)&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;New #1 Leaders (11)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;LLM Chess (Saplin)&lt;/strong&gt;: GPT-5.5 (Medium) (1532.2) beat Gemini 3.1 Pro by 20.8&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLM Stats (PolyMATH)&lt;/strong&gt;: Qwen 3.7 Max (86.5) beat Qwen 3.6 Plus by 9.1&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLM Stats (MCP-Mark)&lt;/strong&gt;: Qwen 3.7 Max (60.8) beat Kimi K2.6 by 4.9&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLM Stats (NL2Repo)&lt;/strong&gt;: Qwen 3.7 Max (47.2) beat GLM-5.1 by 4.5&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLM Stats (MMLU-ProX)&lt;/strong&gt;: Qwen 3.7 Max (87.0) beat Qwen 3.6 Plus by 2.3&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLM Stats (HMMT Feb 26)&lt;/strong&gt;: Qwen 3.7 Max (97.1) beat DeepSeek V4 Pro (Max) by 1.9&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLM Stats (MAXIFE)&lt;/strong&gt;: Qwen 3.7 Max (89.2) beat Qwen 3.6 Plus by 1.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLM Stats (Include)&lt;/strong&gt;: Qwen 3.7 Max (86.2) beat Qwen 3.5 397B A17B by 0.6&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLM Stats (IMO-AnswerBench)&lt;/strong&gt;: Qwen 3.7 Max (90.0) beat DeepSeek V4 Pro (Max) by 0.2&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Creative Writing (Lechmazur)&lt;/strong&gt;: GPT-5.5 (Thinking, xHigh) (3.2) beat GPT-5.5 by 0.2&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLM Stats (MMLU-Redux)&lt;/strong&gt;: Qwen 3.7 Max (95.0) beat Qwen 3.5 397B A17B by 0.1&lt;/li&gt;&lt;/ul&gt;</content><summary>AI Benchmark Digest — 2026-05-27

=== DAILY ===
NEW SCORES FROM TOP-10 MODELS (1)
  - GPT-5.4 (xHigh) on Creative Writing (Lechmazur): 3.2 Mean Score (#2/25)

NEW #1 LEADERS (11)
  - LLM Chess (Saplin) (ELO): GPT-5.5 (Medium) (1532.2) beat Gemini 3.1 Pro (1511.4) by 20.8
  - LLM Stats (PolyMATH) (Sc</summary></entry><entry><title>AI Benchmark Digest — 2026-05-25</title><id>https://aibenchmarks.dev/digest/2026-05-25</id><updated>2026-05-25T08:26:35.093083+00:00</updated><link href="https://aibenchmarks.dev/#/digest" /><author><name>AI Benchmark Hub</name></author><content type="html">&lt;h2&gt;Daily&lt;/h2&gt;
&lt;h3&gt;New Benchmarks (6)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;LLMEval-Logic Base&lt;/strong&gt; (Accuracy (%)): leader Seed 2.0 Pro (Thinking) (75.5), 14 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLMEval-Logic Hard&lt;/strong&gt; (Accuracy (%)): leader Gemini 3.1 Pro (Thinking) (37.5), 14 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLMEval-Logic Hard Sub-Q&lt;/strong&gt; (Accuracy (%)): leader Claude Opus 4.6 (Thinking) (76.6), 14 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLMEval-Logic Formalization Free&lt;/strong&gt; (Accuracy (%)): leader Gemini 3.1 Pro (Thinking) (45.1), 14 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLMEval-Logic Formalization Fixed&lt;/strong&gt; (Accuracy (%)): leader GPT-5.4 Pro (No-Think) (60.2), 14 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;ExploitBench v8-bench&lt;/strong&gt; (Mean Capability (%)): leader Claude Mythos Preview (69.0), 9 models&lt;br&gt;&lt;span&gt;V8 exploitation ladder benchmark measuring how far AI agents climb from code reachability through crash reproduction, exploit primitives, and arbitrary code execution. Reports mean capability across 41 V8 bug environments.&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;</content><summary>AI Benchmark Digest — 2026-05-25

=== DAILY ===
NEW BENCHMARKS (6)
  - LLMEval-Logic Base (Accuracy (%)): leader Seed 2.0 Pro (Thinking) (75.5), 14 models
  - LLMEval-Logic Hard (Accuracy (%)): leader Gemini 3.1 Pro (Thinking) (37.5), 14 models
  - LLMEval-Logic Hard Sub-Q (Accuracy (%)): leader Cla</summary></entry><entry><title>AI Benchmark Digest — 2026-05-24</title><id>https://aibenchmarks.dev/digest/2026-05-24</id><updated>2026-05-24T07:56:34.401567+00:00</updated><link href="https://aibenchmarks.dev/#/digest" /><author><name>AI Benchmark Hub</name></author><content type="html">&lt;h2&gt;Daily&lt;/h2&gt;
&lt;h3&gt;New Benchmarks (14)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;NanoGPT-Bench&lt;/strong&gt; (% of Human Progress Recovered): leader Claude Opus 4.6 (9.3), 2 models&lt;br&gt;&lt;span&gt;Autonomous research benchmark built on the NanoGPT Speedrun, measuring how much of five months of human pretraining-speedup progress coding agents recover under a fixed H100 compute budget.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;CursorBench 3.1&lt;/strong&gt; (Score (%)): leader Claude Opus 4.7 (64.8), 7 models&lt;br&gt;&lt;span&gt;Cursor benchmark of ambiguous, multi-file coding tasks from real Cursor sessions, with models scored by task success percentage and average cost per task.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;SMDD-Bench&lt;/strong&gt; (Pass Rate (%)): leader GPT-5.4 (Medium) (40.2), 7 models&lt;br&gt;&lt;span&gt;Small molecule drug design agent benchmark with sandboxed Python, Boltz structure prediction, and ADMET tooling. Measures pass rate across 502 computationally verifiable chemistry tasks.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;SMDD-Bench Diversity&lt;/strong&gt; (Avg Successful): leader Claude Sonnet 4.6 (8.4), 7 models&lt;br&gt;&lt;span&gt;SMDD-Bench diversity slice measuring whether agents generate multiple distinct, novel, successful molecule designs across repeated Lead Optimization rollouts.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Blueprint-Bench 2&lt;/strong&gt; (Connectivity Similarity Score): leader GPT 5.5 (0.362), 12 models&lt;br&gt;&lt;span&gt;Andon Labs spatial reasoning benchmark where agents convert apartment photographs into 2D floor plans, scored by normalized connectivity similarity against ground truth layouts.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;PACT (Lechmazur)&lt;/strong&gt; (CMS Points): leader GPT-5.5 (high) (59.0), 25 models&lt;br&gt;&lt;span&gt;Pairwise Auction Conversation Testbed for multi-round buyer-seller bargaining. 
LLMs negotiate over 20 rounds with hidden private values, scored by Composite Model Score from head-to-head surplus capture.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;FormationEval&lt;/strong&gt; (Accuracy (%)): leader gemini-3-pro-preview (99.8), 72 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Chinese Classical Bench&lt;/strong&gt; (Average Score (%)): leader claude-opus-4-7 (66.21), 10 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Chinese Classical Bench - Translate Judge&lt;/strong&gt; (Score (%)): leader claude-opus-4-7-thinking (80.2), 10 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Chinese Classical Bench - Punctuate Punct F1&lt;/strong&gt; (Score (%)): leader claude-opus-4-7 (80.02), 10 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Chinese Classical Bench - Char-Gloss Judge&lt;/strong&gt; (Score (%)): leader claude-opus-4-7-thinking (73.6), 10 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Chinese Classical Bench - Idiom-Source Book EM&lt;/strong&gt; (Score (%)): leader deepseek-3.2 (74.0), 10 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Chinese Classical Bench - Fill-In Exact&lt;/strong&gt; (Score (%)): leader claude-opus-4-7-thinking (88.0), 10 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Chinese Classical Bench - Compress Efficiency&lt;/strong&gt; (Score (%)): leader deepseek-3.2 (16.32), 9 models&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;Top-10 New Scores (1)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.1 Pro (High)&lt;/strong&gt; on CLBench: 20.8 (#8)&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;New #1 Leaders (5)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language&lt;/strong&gt;: Gemini 3.1 Pro (69.11) beat Gemini 2.5 Flash by 6.52&lt;/li&gt;&lt;li&gt;&lt;strong&gt;CLBench&lt;/strong&gt;: GPT-5.4 (xHigh) (27.9) beat GPT-5.1 (High) by 4.2&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LiveBench Logic With Navigation&lt;/strong&gt;: Qwen Max (84.0) beat Claude Opus 4.6 (Thinking) by 4.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Spider 2.0-Lite&lt;/strong&gt;: DivSkill-SQL (73.13) beat SOMA-SQL by 1.11&lt;/li&gt;&lt;li&gt;&lt;strong&gt;PinchBench&lt;/strong&gt;: Grok 0.1 (92.07) beat Claude Opus 4.7 by 0.49&lt;/li&gt;&lt;/ul&gt;</content><summary>AI Benchmark Digest — 2026-05-24

=== DAILY ===
NEW BENCHMARKS (14)
  - NanoGPT-Bench (% of Human Progress Recovered): leader Claude Opus 4.6 (9.3), 2 models
      Autonomous research benchmark built on the NanoGPT Speedrun, measuring how much of five months of human pretraining-speedup progress cod</summary></entry><entry><title>AI Benchmark Digest — 2026-05-23</title><id>https://aibenchmarks.dev/digest/2026-05-23</id><updated>2026-05-23T07:20:10.541511+00:00</updated><link href="https://aibenchmarks.dev/#/digest" /><author><name>AI Benchmark Hub</name></author><content type="html">&lt;h2&gt;Daily&lt;/h2&gt;
&lt;h3&gt;New #1 Leaders (1)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;OSWorld&lt;/strong&gt;: Opus 4.7 (83.64) beat Holo3-35B-A3B by 1.08&lt;/li&gt;&lt;/ul&gt;</content><summary>AI Benchmark Digest — 2026-05-23

=== DAILY ===
NEW #1 LEADERS (1)
  - OSWorld (Success Rate (%)): Opus 4.7 (83.64) beat Holo3-35B-A3B (82.56) by 1.08
</summary></entry><entry><title>AI Benchmark Digest — 2026-05-22</title><id>https://aibenchmarks.dev/digest/2026-05-22</id><updated>2026-05-22T07:36:15.662013+00:00</updated><link href="https://aibenchmarks.dev/#/digest" /><author><name>AI Benchmark Hub</name></author><content type="html">&lt;h2&gt;Daily&lt;/h2&gt;
&lt;h3&gt;Top-10 New Scores (1)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5 (High)&lt;/strong&gt; on Sycophancy (Lechmazur): 3.5 (#11)&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;New #1 Leaders (2)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;UGI - Writing&lt;/strong&gt;: gemini-3.5-flash (thinking_level=medium) (72.54) beat gemini-3.1-pro-preview (thinking_level=low) by 0.39&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Arabic Broad Leaderboard&lt;/strong&gt;: gemini-3.5-flash (9.253) beat gemini-3-pro-preview by 0.05&lt;/li&gt;&lt;/ul&gt;</content><summary>AI Benchmark Digest — 2026-05-22

=== DAILY ===
NEW SCORES FROM TOP-10 MODELS (1)
  - GPT-5.5 (High) on Sycophancy (Lechmazur): 3.5 Sycophancy rate % (lower is better) (#11/31)

NEW #1 LEADERS (2)
  - UGI - Writing (Writing Score): gemini-3.5-flash (thinking_level=medium) (72.54) beat gemini-3.1-pro</summary></entry><entry><title>AI Benchmark Digest — 2026-05-21</title><id>https://aibenchmarks.dev/digest/2026-05-21</id><updated>2026-05-21T07:40:34.045646+00:00</updated><link href="https://aibenchmarks.dev/#/digest" /><author><name>AI Benchmark Hub</name></author><content type="html">&lt;h2&gt;Daily&lt;/h2&gt;
&lt;h3&gt;Top-10 New Scores (1)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on WeirdML: 62.64 (#17)&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;New #1 Leaders (3)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Kaggle Game Arena Poker (Heads Up)&lt;/strong&gt;: GPT-5.5 (73.93) beat GPT-5.2 by 33.93&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA APEX-Agents&lt;/strong&gt;: Gemini 3.5 Flash (high) (47.05) beat GPT-5.5 (xhigh) by 9.37&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LA Leaderboard&lt;/strong&gt;: Qwen2.5-14B-Instruct-GPTQ-Int8 (63.6) beat gemma-2-9b-it by 0.27&lt;/li&gt;&lt;/ul&gt;</content><summary>AI Benchmark Digest — 2026-05-21

=== DAILY ===
NEW SCORES FROM TOP-10 MODELS (1)
  - Gemini 3.5 Flash (High) on WeirdML: 62.64 Average Score (#17/124)

NEW #1 LEADERS (3)
  - Kaggle Game Arena Poker (Heads Up) (Mean BB/100): GPT-5.5 (73.93) beat GPT-5.2 (40.0) by 33.93
  - AA APEX-Agents (Pass@1 (%</summary></entry><entry><title>AI Benchmark Digest — 2026-05-20</title><id>https://aibenchmarks.dev/digest/2026-05-20</id><updated>2026-05-20T07:43:37.557151+00:00</updated><link href="https://aibenchmarks.dev/#/digest" /><author><name>AI Benchmark Hub</name></author><content type="html">&lt;h2&gt;Daily&lt;/h2&gt;
&lt;h3&gt;New Models (1)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; — ELO 1942, #9&lt;ul&gt;&lt;li&gt;AA MMMU-Pro: 84.28 (#1/190)&lt;/li&gt;&lt;li&gt;SEAL - MCP Atlas: 83.6 (#1/21)&lt;/li&gt;&lt;li&gt;AA Omniscience: 22.68 (#3/393)&lt;/li&gt;&lt;li&gt;AA Omniscience - Law: 57.4 (#4/393)&lt;/li&gt;&lt;li&gt;AA Omniscience - Software Engineering (SWE) - PHP: 84.0 (#4/393)&lt;/li&gt;&lt;li&gt;AA Humanity's Last Exam: 40.96 (#5/484)&lt;/li&gt;&lt;li&gt;AA GPQA Diamond: 92.22 (#6/488)&lt;/li&gt;&lt;li&gt;AA Omniscience - Science, Engineering &amp; Mathematics: 50.1 (#6/393)&lt;/li&gt;&lt;li&gt;AA GDPval: 1655.7 (#7/365)&lt;/li&gt;&lt;li&gt;AA Omniscience - Humanities &amp; Social Sciences: 52.3 (#7/393)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;Top-10 New Scores (34)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5 (High)&lt;/strong&gt; on Multi-turn Debate (Lechmazur): 1583.6 (#5)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on AA CritPt: 13.14 (#8)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on AA GDPval: 1655.7 (#7)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on AA GPQA Diamond: 92.22 (#6)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on AA Humanity's Last Exam: 40.96 (#5)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on AA IFBench: 76.33 (#17)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on AA Long Context Reasoning: 69.33 (#27)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on AA Omniscience: 22.68 (#3)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on AA Omniscience - Business: 45.8 (#8)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on AA Omniscience - Health: 40.2 (#14)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on AA Omniscience - Humanities &amp; Social Sciences: 52.3 (#7)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on AA Omniscience - Law: 57.4 (#4)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on AA Omniscience - Science, Engineering &amp; Mathematics: 50.1 (#6)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on AA Omniscience - Software Engineering (SWE): 65.5 (#16)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on AA Omniscience - Software Engineering (SWE) - C: 80.0 (#18)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on AA Omniscience - Software Engineering (SWE) - Dart: 60.0 (#14)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on AA Omniscience - 
Software Engineering (SWE) - Go: 50.0 (#32)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on AA Omniscience - Software Engineering (SWE) - HTML: 72.0 (#17)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on AA Omniscience - Software Engineering (SWE) - Java: 51.0 (#16)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on AA Omniscience - Software Engineering (SWE) - JavaScript: 71.82 (#14)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on AA Omniscience - Software Engineering (SWE) - Julia: 60.0 (#13)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on AA Omniscience - Software Engineering (SWE) - Kotlin: 56.0 (#22)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on AA Omniscience - Software Engineering (SWE) - PHP: 84.0 (#4)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on AA Omniscience - Software Engineering (SWE) - Python: 61.0 (#24)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on AA Omniscience - Software Engineering (SWE) - R: 56.0 (#18)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on AA Omniscience - Software Engineering (SWE) - Rust: 80.0 (#8)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on AA Omniscience - Software Engineering (SWE) - Swift: 72.0 (#20)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on AA Omniscience - Software Engineering (SWE) - TypeScript: 67.78 (#16)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on AA SciCode: 53.12 (#11)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on AA TAU-2 Bench: 95.32 (#20)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on AA Terminal-Bench Hard: 40.91 (#36)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on ARC-AGI-1: 92.5 
(#16)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on ARC-AGI-2: 72.08 (#12)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on Artificial Analysis Intelligence Index: 55.33 (#8)&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;New #1 Leaders (5)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;LLM Stats (GDPval-AA)&lt;/strong&gt;: Gemini 3.5 Flash (165600.0) beat Claude Sonnet 4.6 by 2300.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLM Stats (MCP Atlas)&lt;/strong&gt;: Gemini 3.5 Flash (83.6) beat Claude Opus 4.7 by 6.3&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA MMMU-Pro&lt;/strong&gt;: Gemini 3.5 Flash (high) (84.28) beat Gemini 3.1 Pro Preview by 1.85&lt;/li&gt;&lt;li&gt;&lt;strong&gt;SEAL - MCP Atlas&lt;/strong&gt;: gemini-3.5-flash (high) (83.6) beat Muse Spark by 1.4&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLM Stats (Toolathlon)&lt;/strong&gt;: Gemini 3.5 Flash (56.5) beat GPT-5.5 by 0.9&lt;/li&gt;&lt;/ul&gt;</content><summary>AI Benchmark Digest — 2026-05-20

=== DAILY ===
NEW MODELS (1)
  - Gemini 3.5 Flash (High) — ELO 1942, #9/609 (above: Claude Opus 4.7 (Thinking), below: GPT-5.5 (High))
      AA MMMU-Pro: 84.28 (#1/190)
      SEAL - MCP Atlas: 83.6 (#1/21)
      AA Omniscience: 22.68 (#3/393)
      AA Omniscience - </summary></entry><entry><title>AI Benchmark Digest — 2026-05-17</title><id>https://aibenchmarks.dev/digest/2026-05-17</id><updated>2026-05-17T08:02:54.093472+00:00</updated><link href="https://aibenchmarks.dev/#/digest" /><author><name>AI Benchmark Hub</name></author><content type="html">&lt;h2&gt;Daily&lt;/h2&gt;
&lt;h3&gt;New #1 Leaders (1)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;OpenClawProBench&lt;/strong&gt;: intern-s2-preview (76.7) beat Sensenova 6.7 Flash Lite by 3.0&lt;/li&gt;&lt;/ul&gt;
&lt;hr/&gt;
&lt;h2&gt;Weekly&lt;/h2&gt;
&lt;h3&gt;Top-10 New Scores (3)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.7 (Thinking)&lt;/strong&gt; on SEAL Showdown: 1115.7 (#12)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.7 (Thinking)&lt;/strong&gt; on WeirdML: 75.45 (#8)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5 (xHigh)&lt;/strong&gt; on Chatbot Arena (Code): 1501.0 (#9)&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;New #1 Leaders (16)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;MathArena - ARXIVLEAN March&lt;/strong&gt;: AlephProver (34.15) beat Aristotle by 17.08&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenCompass Reasoning - Common&lt;/strong&gt;: Doubao-Seed-2-0-Pro-260215 (high) (82.1) beat Gemini-3-Pro-Preview by 8.5&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenCompass Math - College&lt;/strong&gt;: Doubao-Seed-2-0-Pro-260215 (high) (83.8) beat Kimi-K2.5 by 7.3&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenClawProBench&lt;/strong&gt;: intern-s2-preview (76.7) beat qwen3.5-397b-a17b by 6.3&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Tau3-Bench Banking_Knowledge&lt;/strong&gt;: GPT-5.5 (37.4) beat Distyl ButtonAgent by 6.2&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenCompass Knowledge - Social Science&lt;/strong&gt;: Gemini-3.1-Pro-Preview (97.5) beat Gemini-3-Pro-Preview by 4.3&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenCompass LLM - Math&lt;/strong&gt;: Doubao-Seed-2-0-Pro-260215 (high) (77.3) beat Qwen3-Max-2026-01-23 by 4.1&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenCompass LLM - Reasoning&lt;/strong&gt;: Doubao-Seed-2-0-Pro-260215 (high) (65.2) beat Gemini-3-Pro-Preview by 3.7&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenCompass Math - Competition&lt;/strong&gt;: Kimi-K2.6 (72.1) beat Qwen3-Max-2026-01-23 by 2.1&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenCompass Reasoning - Academic&lt;/strong&gt;: GPT-5.4-2026-03-05 (high) (52.0) beat GPT-5.2-2025-12-11 (high) by 1.5&lt;/li&gt;&lt;li&gt;&lt;strong&gt;VisuLogic&lt;/strong&gt;: PEREA-1.0new (52.8) beat Human by 1.4&lt;/li&gt;&lt;li&gt;&lt;strong&gt;WeirdML&lt;/strong&gt;: gpt-5.5 (xhigh) (84.91) beat gpt-5.5 (high) by 1.01&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GAIA&lt;/strong&gt;: Co-Sight Pro v1.0.1 (93.02) beat OPS-Agentic-Search by 0.66&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenCompass Knowledge - Engineering&lt;/strong&gt;: GPT-5.4-2026-03-05 (high) (96.2) beat Gemini-3-Pro-Preview by 0.4&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA TAU-2 Bench&lt;/strong&gt;: JT-35B-Flash 
(99.12) beat GLM-4.7-Flash (Reasoning) by 0.32&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AISI Cyber TLO 10M&lt;/strong&gt;: GPT-5.5 (10.0) beat Claude Opus 4.6 by 0.2&lt;/li&gt;&lt;/ul&gt;</content><summary>AI Benchmark Digest — 2026-05-17

=== DAILY ===
NEW #1 LEADERS (1)
  - OpenClawProBench (Overall Score (%)): intern-s2-preview (76.7) beat Sensenova 6.7 Flash Lite (73.7) by 3.0

=== WEEKLY ===
NEW SCORES FROM TOP-10 MODELS (3)
  - Claude Opus 4.7 (Thinking) on SEAL Showdown: 1115.7 Arena Score (#12</summary></entry><entry><title>AI Benchmark Digest — 2026-05-16</title><id>https://aibenchmarks.dev/digest/2026-05-16</id><updated>2026-05-16T07:15:27.727063+00:00</updated><link href="https://aibenchmarks.dev/#/digest" /><author><name>AI Benchmark Hub</name></author><content type="html">&lt;h2&gt;Daily&lt;/h2&gt;
&lt;h3&gt;Top-10 New Scores (1)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5 (xHigh)&lt;/strong&gt; on Chatbot Arena (Code): 1501.0 (#9)&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;New #1 Leaders (2)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;MathArena - ARXIVLEAN March&lt;/strong&gt;: AlephProver (34.15) beat Aristotle by 17.08&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GAIA&lt;/strong&gt;: Co-Sight Pro v1.0.1 (93.02) beat OPS-Agentic-Search by 0.66&lt;/li&gt;&lt;/ul&gt;</content><summary>AI Benchmark Digest — 2026-05-16

=== DAILY ===
NEW SCORES FROM TOP-10 MODELS (1)
  - GPT-5.5 (xHigh) on Chatbot Arena (Code): 1501.0 Elo (#9/79)

NEW #1 LEADERS (2)
  - MathArena - ARXIVLEAN March (Accuracy (%)): AlephProver (34.15) beat Aristotle (17.07) by 17.08
  - GAIA (Accuracy (%)): Co-Sight </summary></entry><entry><title>AI Benchmark Digest — 2026-05-14</title><id>https://aibenchmarks.dev/digest/2026-05-14</id><updated>2026-05-14T07:26:43.169192+00:00</updated><link href="https://aibenchmarks.dev/#/digest" /><author><name>AI Benchmark Hub</name></author><content type="html">&lt;h2&gt;Daily&lt;/h2&gt;
&lt;h3&gt;New Models (4)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Doubao-Seed-2-0-Pro-260215 (High)&lt;/strong&gt; — ELO 1781, #73&lt;ul&gt;&lt;li&gt;OpenCompass LLM - Reasoning: 65.2 (#1/23)&lt;/li&gt;&lt;li&gt;OpenCompass LLM - Math: 77.3 (#1/23)&lt;/li&gt;&lt;li&gt;OpenCompass Knowledge - Humanities: 95.0 (#1/23)&lt;/li&gt;&lt;li&gt;OpenCompass Reasoning - Common: 82.1 (#1/23)&lt;/li&gt;&lt;li&gt;OpenCompass Math - College: 83.8 (#1/23)&lt;/li&gt;&lt;li&gt;OpenCompass LLM - Language: 77.3 (#3/23)&lt;/li&gt;&lt;li&gt;OpenCompass Language - Creation: 77.1 (#3/23)&lt;/li&gt;&lt;li&gt;OpenCompass Knowledge - Science: 94.6 (#3/23)&lt;/li&gt;&lt;li&gt;OpenCompass LLM - Agent: 44.2 (#4/23)&lt;/li&gt;&lt;li&gt;OpenCompass Language - NLP: 69.6 (#4/23)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Doubao-Seed-2-0-Lite-260215 (High)&lt;/strong&gt; — ELO 1741, #103&lt;ul&gt;&lt;li&gt;OpenCompass Reasoning - Common: 78.1 (#2/23)&lt;/li&gt;&lt;li&gt;OpenCompass Language - Creation: 77.1 (#4/23)&lt;/li&gt;&lt;li&gt;OpenCompass LLM - Language: 74.4 (#6/23)&lt;/li&gt;&lt;li&gt;OpenCompass LLM - Agent: 42.4 (#6/23)&lt;/li&gt;&lt;li&gt;OpenCompass Agent - Tool Use: 42.4 (#6/23)&lt;/li&gt;&lt;li&gt;OpenCompass Knowledge - Science: 91.7 (#7/23)&lt;/li&gt;&lt;li&gt;OpenCompass LLM - Reasoning: 59.5 (#8/23)&lt;/li&gt;&lt;li&gt;OpenCompass Language - NLP: 67.1 (#8/23)&lt;/li&gt;&lt;li&gt;OpenCompass Language - Instruction Following: 72.5 (#8/23)&lt;/li&gt;&lt;li&gt;OpenCompass Math - College: 77.1 (#8/23)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Hy3-preview (High)&lt;/strong&gt; — ELO 1729, #110&lt;ul&gt;&lt;li&gt;OpenCompass Math - College: 81.3 (#3/23)&lt;/li&gt;&lt;li&gt;OpenCompass Language - Instruction Following: 76.0 (#4/23)&lt;/li&gt;&lt;li&gt;OpenCompass LLM - Math: 74.5 (#5/23)&lt;/li&gt;&lt;li&gt;OpenCompass Language - Creation: 75.4 (#5/23)&lt;/li&gt;&lt;li&gt;OpenCompass LLM - Language: 74.4 (#7/23)&lt;/li&gt;&lt;li&gt;OpenCompass Reasoning - 
Academic: 43.6 (#8/23)&lt;/li&gt;&lt;li&gt;OpenCompass LLM - Reasoning: 58.5 (#10/23)&lt;/li&gt;&lt;li&gt;OpenCompass Math - Competition: 67.6 (#10/23)&lt;/li&gt;&lt;li&gt;OpenCompass LLM - Agent: 28.7 (#12/23)&lt;/li&gt;&lt;li&gt;OpenCompass Reasoning - Common: 73.5 (#12/23)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Ring-2.5-1T&lt;/strong&gt; — ELO 1711, #119&lt;ul&gt;&lt;li&gt;OpenCompass Knowledge - Social Science: 92.9 (#5/23)&lt;/li&gt;&lt;li&gt;OpenCompass Language - NLP: 65.4 (#11/23)&lt;/li&gt;&lt;li&gt;OpenCompass Language - Creation: 68.8 (#12/23)&lt;/li&gt;&lt;li&gt;OpenCompass Knowledge - Humanities: 90.0 (#12/23)&lt;/li&gt;&lt;li&gt;OpenCompass LLM - Agent: 25.0 (#13/23)&lt;/li&gt;&lt;li&gt;OpenCompass Math - College: 75.0 (#13/23)&lt;/li&gt;&lt;li&gt;OpenCompass Agent - Tool Use: 25.0 (#13/23)&lt;/li&gt;&lt;li&gt;OpenCompass LLM - Knowledge: 89.4 (#14/23)&lt;/li&gt;&lt;li&gt;OpenCompass Knowledge - Engineering: 90.8 (#14/23)&lt;/li&gt;&lt;li&gt;OpenCompass LLM - Language: 69.8 (#15/23)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;Top-10 New Scores (1)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.7 (Thinking)&lt;/strong&gt; on WeirdML: 75.45 (#8)&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;New #1 Leaders (9)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;OpenCompass Reasoning - Common&lt;/strong&gt;: Doubao-Seed-2-0-Pro-260215 (high) (82.1) beat Gemini-3-Pro-Preview by 8.5&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenCompass Math - College&lt;/strong&gt;: Doubao-Seed-2-0-Pro-260215 (high) (83.8) beat Kimi-K2.5 by 7.3&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Tau3-Bench Banking_Knowledge&lt;/strong&gt;: GPT-5.5 (37.4) beat Distyl ButtonAgent by 6.2&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenCompass Knowledge - Social Science&lt;/strong&gt;: Gemini-3.1-Pro-Preview (97.5) beat Gemini-3-Pro-Preview by 4.3&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenCompass LLM - Math&lt;/strong&gt;: Doubao-Seed-2-0-Pro-260215 (high) (77.3) beat Qwen3-Max-2026-01-23 by 4.1&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenCompass LLM - Reasoning&lt;/strong&gt;: Doubao-Seed-2-0-Pro-260215 (high) (65.2) beat Gemini-3-Pro-Preview by 3.7&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenCompass Math - Competition&lt;/strong&gt;: Kimi-K2.6 (72.1) beat Qwen3-Max-2026-01-23 by 2.1&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenCompass Reasoning - Academic&lt;/strong&gt;: GPT-5.4-2026-03-05 (high) (52.0) beat GPT-5.2-2025-12-11 (high) by 1.5&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenCompass Knowledge - Engineering&lt;/strong&gt;: GPT-5.4-2026-03-05 (high) (96.2) beat Gemini-3-Pro-Preview by 0.4&lt;/li&gt;&lt;/ul&gt;</content><summary>AI Benchmark Digest — 2026-05-14

=== DAILY ===
NEW MODELS (4)
  - Doubao-Seed-2-0-Pro-260215 (High) — ELO 1781, #73/796 (above: GPT-5.2 (Low), below: GLM-5-Turbo)
      OpenCompass LLM - Reasoning: 65.2 (#1/23)
      OpenCompass LLM - Math: 77.3 (#1/23)
      OpenCompass Knowledge - Humanities: 95.</summary></entry></feed>