<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xsl" href="/feed.xsl"?>
<feed xmlns="http://www.w3.org/2005/Atom"><title>AI Benchmark Digest</title><subtitle>AI benchmark changes — new models, leader shifts, and trends</subtitle><link href="https://aibenchmarks.dev/data/feed.xml" rel="self" /><link href="https://aibenchmarks.dev/#/digest" rel="alternate" /><id>https://aibenchmarks.dev/feed</id><icon>https://aibenchmarks.dev/favicon.ico</icon><updated>2026-06-12T08:17:57.895837+00:00</updated><entry><title>AI Benchmark Digest — 2026-06-12</title><id>https://aibenchmarks.dev/digest/2026-06-12</id><updated>2026-06-12T08:17:57.895837+00:00</updated><link href="https://aibenchmarks.dev/#/digest" /><author><name>AI Benchmark Hub</name></author><content type="html">&lt;h2&gt;Daily&lt;/h2&gt;
&lt;h3&gt;New Benchmarks (2)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;MathArena - ARXIV_FALSE May&lt;/strong&gt; (Accuracy (%)): leader GPT-5.5 (xhigh) (50.0), 8 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;MathArena - ARXIV May&lt;/strong&gt; (Accuracy (%)): leader Claude-Fable-5 (max) (86.67), 8 models&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;Top-10 New Scores (9)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Claude Fable 5&lt;/strong&gt; on Lynchmark: 100.0 (#1)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Fable 5&lt;/strong&gt; on MineBench: 1929.84 (#2)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.8&lt;/strong&gt; on Chess Puzzles (Epoch AI): 34.0 (#12)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.8&lt;/strong&gt; on Design Arena (Game Dev): 1250.0 (#37)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.8&lt;/strong&gt; on GRAB-Lite: 60.6 (#6)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.8&lt;/strong&gt; on OTIS Mock AIME 2024-25: 98.33 (#3)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.8&lt;/strong&gt; on SimpleQA Verified: 39.5 (#24)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5&lt;/strong&gt; on GRAB-Lite: 71.8 (#2)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Qwen 3.7 Max&lt;/strong&gt; on Position Bias (Lechmazur): 34.8 (#10)&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;New #1 Leaders (9)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Chatbot Arena (Text-to-Video)&lt;/strong&gt;: gemini-omni-flash (1527.0) beat dreamina-seedance-2.0-720p by 64.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Design Arena (UI Components)&lt;/strong&gt;: Claude Fable 5 (1411.0) beat Claude Opus 4.7 (Thinking) by 56.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Design Arena (Game Dev)&lt;/strong&gt;: Claude Fable 5 (1393.0) beat GPT-5.5 by 39.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Design Arena (SVG)&lt;/strong&gt;: Claude Fable 5 (1384.0) beat prism by 18.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;SEAL - SWE Atlas - Test Writing&lt;/strong&gt;: Fable-5 (Claude Code) xHigh (58.52) beat Opus 4.8 (Claude Code) by 12.96&lt;/li&gt;&lt;li&gt;&lt;strong&gt;MathArena - ARXIV April&lt;/strong&gt;: Claude 5 (70.73) beat GPT-5.5 (xHigh) by 3.66&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GRAB-Lite&lt;/strong&gt;: Claude Fable 5 (74.0) beat GPT-5.4 by 3.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;WeirdML&lt;/strong&gt;: Claude 5 (87.85) beat GPT-5.5 (xHigh) by 2.94&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Chatbot Arena (Image-to-Video)&lt;/strong&gt;: gemini-omni-flash (1475.0) beat Grok 1.5 by 2.0&lt;/li&gt;&lt;/ul&gt;</content><summary>AI Benchmark Digest — 2026-06-12

=== DAILY ===
NEW BENCHMARKS (2)
  - MathArena - ARXIV_FALSE May (Accuracy (%)): leader GPT-5.5 (xhigh) (50.0), 8 models
  - MathArena - ARXIV May (Accuracy (%)): leader Claude-Fable-5 (max) (86.67), 8 models

NEW SCORES FROM TOP-10 MODELS (9)
  - Claude Fable 5 on </summary></entry><entry><title>AI Benchmark Digest — 2026-06-11</title><id>https://aibenchmarks.dev/digest/2026-06-11</id><updated>2026-06-11T08:17:01.068404+00:00</updated><link href="https://aibenchmarks.dev/#/digest" /><author><name>AI Benchmark Hub</name></author><content type="html">&lt;h2&gt;Daily&lt;/h2&gt;
&lt;h3&gt;New Benchmarks (1)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;GDPval-AA&lt;/strong&gt; (Elo): leader Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (1932.0), 390 models&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;Top-10 New Scores (3)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Claude Fable 5&lt;/strong&gt; on Chatbot Arena (Document): 1495.0 (#5)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Fable 5&lt;/strong&gt; on Chatbot Arena (Vision): 1307.0 (#2)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Fable 5&lt;/strong&gt; on React Native Evals: 86.96 (#4)&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;New #1 Leaders (12)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;PACT (Lechmazur)&lt;/strong&gt;: Claude Fable 5 (High) (2171.0) beat GPT-5.5 (High) by 155.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Chatbot Arena (Code)&lt;/strong&gt;: Claude Fable 5 (1665.0) beat Claude Opus 4.7 (Thinking) by 98.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Design Arena (Data Viz)&lt;/strong&gt;: Claude Fable 5 (1406.0) beat Claude Opus 4.7 (Thinking) by 68.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Design Arena (Website)&lt;/strong&gt;: Claude Fable 5 (1364.0) beat Claude Opus 4.6 by 23.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Design Arena (3D)&lt;/strong&gt;: Claude Fable 5 (1383.0) beat Kimi K2.6 by 17.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;FrontierSWE&lt;/strong&gt;: Claude Fable 5 (90.0) beat Claude Opus 4.8 by 7.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Chatbot Arena (Text)&lt;/strong&gt;: Claude Fable 5 (1510.0) beat Claude Opus 4.6 (Thinking) by 6.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;SimpleBench&lt;/strong&gt;: Claude Fable (81.9) beat Gemini 3.1 Pro (Preview) by 2.3&lt;/li&gt;&lt;li&gt;&lt;strong&gt;UGI - Writing&lt;/strong&gt;: Claude 5 (74.23) beat Gemini 3.5 Flash (Thinking, Medium) by 1.69&lt;/li&gt;&lt;li&gt;&lt;strong&gt;EQ-Bench Longform Writing&lt;/strong&gt;: Claude Fable 5 (83.0) beat Claude Opus 4.7 by 1.2&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLM Stats (Video-MME)&lt;/strong&gt;: MiMo-V2.5 (87.7) beat Kimi K2.5 by 0.3&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLM Stats (CMMLU)&lt;/strong&gt;: MiMo-V2.5-Pro (90.2) beat Qwen 2 72B Instruct by 0.1&lt;/li&gt;&lt;/ul&gt;</content><summary>AI Benchmark Digest — 2026-06-11

=== DAILY ===
NEW BENCHMARKS (1)
  - GDPval-AA (Elo): leader Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (1932.0), 390 models

NEW SCORES FROM TOP-10 MODELS (3)
  - Claude Fable 5 on Chatbot Arena (Document): 1495.0 Elo (#5/29)
  - Claude Fabl</summary></entry><entry><title>AI Benchmark Digest — 2026-06-10</title><id>https://aibenchmarks.dev/digest/2026-06-10</id><updated>2026-06-10T09:55:36.786616+00:00</updated><link href="https://aibenchmarks.dev/#/digest" /><author><name>AI Benchmark Hub</name></author><content type="html">&lt;h2&gt;Daily&lt;/h2&gt;
&lt;h3&gt;New Benchmarks (1)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;SkateBench&lt;/strong&gt; (Success Rate (%)): leader gemini-3.1-pro-preview (96.92), 28 models&lt;br&gt;&lt;span&gt;Skateboarding-domain knowledge benchmark ranking models by how well they identify technical skateboard tricks from 390 trick definitions. SkateBench v2 reports success rate, cost, and speed.&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;New Models (1)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Claude Fable 5&lt;/strong&gt; — Elo 1871, #31&lt;ul&gt;&lt;li&gt;Blueprint-Bench 2: 0.386 (#1/14)&lt;/li&gt;&lt;li&gt;Opper TaskBench: 96.4 (#1/85)&lt;/li&gt;&lt;li&gt;LLM Stats (OSWorld-Verified): 85.0 (#1/16)&lt;/li&gt;&lt;li&gt;YC-Bench: 1977.6 (#1/21)&lt;/li&gt;&lt;li&gt;Vals AI (Vals Index): 75.14 (#1/25)&lt;/li&gt;&lt;li&gt;Vals AI Multimodal Index: 74.15 (#1/20)&lt;/li&gt;&lt;li&gt;Vals AI LegalBench: 88.56 (#1/114)&lt;/li&gt;&lt;li&gt;Vals AI CorpFin v2: 71.83 (#1/111)&lt;/li&gt;&lt;li&gt;Vals AI MedScribe: 88.52 (#1/62)&lt;/li&gt;&lt;li&gt;Vals AI ProofBench: 77.0 (#1/37)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;New #1 Leaders (55)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;YC-Bench&lt;/strong&gt;: Claude Fable 5 (1977.6) beat Claude Opus 4.7 by 263.1&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Multi-turn Debate (Lechmazur)&lt;/strong&gt;: Claude Fable 5 (High) (1770.9) beat Claude Opus 4.7 (High) by 53.8&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA GDPval&lt;/strong&gt;: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (1932.47) beat Claude Opus 4.8 (Adaptive Reasoning, Max Effort) by 42.67&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - Language ay&lt;/strong&gt;: step-3.7-flash-20260528 (77.14) beat Gemini 3.1 Pro (Preview) by 14.23&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LiveBench Python&lt;/strong&gt;: Claude Fable 5 (xHigh) (95.0) beat Claude Opus 4.5 (Thinking 64K, High) (2025-11-01) by 10.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;CursorBench 3.1&lt;/strong&gt;: Fable 5 Max (72.9) beat Claude Opus 4.7 by 8.1&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE) - Dart&lt;/strong&gt;: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (88.0) beat GPT-5.3 Codex (xHigh) by 8.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE) - R&lt;/strong&gt;: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (82.0) beat GPT-5.5 (Medium) by 8.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE) - Swift&lt;/strong&gt;: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (100.0) beat GPT-5.5 (xHigh) by 8.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Vals AI Vibe Code Bench&lt;/strong&gt;: Claude Fable 5 (90.35) beat Claude Opus 4.8 by 7.63&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Humanity's Last Exam&lt;/strong&gt;: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (53.34) beat Claude Opus 4.8 (Adaptive Reasoning, Max Effort) by 7.6&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience&lt;/strong&gt;: Claude Fable 5 (Adaptive Reasoning, Max Effort, 
Opus 4.8 Fallback) (40.15) beat Gemini 3.1 Pro (Preview) by 7.22&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Vellum - HumanEval&lt;/strong&gt;: Claude Mythos 5 (95.5) beat Claude Opus 4.8 by 6.9&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Vellum - Humanity's Last Exam&lt;/strong&gt;: Claude Mythos 5 (64.5) beat Claude Opus 4.8 by 6.6&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - Language crh&lt;/strong&gt;: step-3.7-flash-20260528 (73.05) beat Gemini 3.1 Pro (Preview) by 6.27&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE) - Java&lt;/strong&gt;: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (79.0) beat GPT-5.3 Codex (xHigh) by 6.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Vals AI ProofBench&lt;/strong&gt;: Claude Fable 5 (77.0) beat aristotle by 6.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Business&lt;/strong&gt;: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (55.0) beat GPT-5.5 (xHigh) by 5.9&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Science, Engineering &amp; Mathematics&lt;/strong&gt;: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (57.1) beat GPT-5.5 (High) by 4.8&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Vals AI (Vals Index)&lt;/strong&gt;: Claude Fable 5 (75.14) beat Claude Opus 4.8 by 4.78&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Vals AI IOI&lt;/strong&gt;: Claude Fable 5 (72.25) beat GPT-5.4 (2026-03-05) by 4.42&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Humanities &amp; Social Sciences&lt;/strong&gt;: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (60.9) beat Gemini 3 Pro (Preview) (High) by 4.3&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE) - Go&lt;/strong&gt;: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (88.0) beat GPT-5.5 (High) by 4.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Artificial Analysis Intelligence Index&lt;/strong&gt;: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) 
(64.88) beat Claude Opus 4.8 (Adaptive Reasoning, Max Effort) by 3.44&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - Language cv&lt;/strong&gt;: gemma-4-31B-it-20260402 (69.3) beat Claude Opus 4.5 by 3.39&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Vals AI CorpFin v2&lt;/strong&gt;: Claude Fable 5 (71.83) beat Grok 4.3 by 3.3&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Vals AI Multimodal Index&lt;/strong&gt;: Claude Fable 5 (74.15) beat Claude Opus 4.8 by 3.26&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE)&lt;/strong&gt;: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (87.6) beat GPT-5.5 (xHigh) by 3.2&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - MGSM&lt;/strong&gt;: Claude Opus 4.8 (96.62) beat Claude Opus 4.6 by 2.36&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - Language ban&lt;/strong&gt;: step-3.7-flash-20260528 (69.03) beat Claude Opus 4.5 by 2.32&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Terminal-Bench Hard&lt;/strong&gt;: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (62.88) beat GPT-5.5 (xHigh) by 2.27&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LiveBench Plot Unscrambling&lt;/strong&gt;: Claude Fable 5 (xHigh) (78.09) beat GPT-5.5 (High) by 1.81&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLM Stats (OSWorld-Verified)&lt;/strong&gt;: Claude Fable 5 (85.0) beat Claude Opus 4.8 by 1.6&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE) - Python&lt;/strong&gt;: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (92.0) beat GPT-5.5 (xHigh) by 1.5&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - Language chm&lt;/strong&gt;: Claude Opus 4.7 (63.6) beat Gemini 3.1 Pro (Preview) by 1.48&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - Language doi&lt;/strong&gt;: Claude Opus 4.7 (71.84) beat Gemini 3 Pro (Preview) by 1.46&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA CritPt&lt;/strong&gt;: Claude Fable 5 (Adaptive Reasoning, Max Effort, 
Opus 4.8 Fallback) (28.57) beat GPT-5.5 (xHigh) by 1.43&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - Language es&lt;/strong&gt;: Gemini 3.1 Flash Lite (76.16) beat Claude Opus 4.6 by 1.42&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA SciCode&lt;/strong&gt;: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (60.19) beat Gemini 3.1 Pro (Preview) by 1.28&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - Language ace&lt;/strong&gt;: step-3.7-flash-20260528 (72.48) beat Gemini 3.1 Pro (Preview) by 1.28&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - MMLU&lt;/strong&gt;: intellect-3-20251126 (100.0) beat Claude Sonnet 4.6 by 1.27&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - ARC&lt;/strong&gt;: intellect-3-20251126 (100.0) beat Gemini 3.1 Pro (Preview) by 1.26&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Vals AI LegalBench&lt;/strong&gt;: Claude Fable 5 (88.56) beat Gemini 3.1 Pro (Preview) by 1.16&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - Language ca&lt;/strong&gt;: Gemini 3.1 Flash Lite (76.29) beat Gemini 3 Pro (Preview) by 1.03&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Opper TaskBench&lt;/strong&gt;: Claude Fable 5 (96.4) beat Claude Opus 4.7 by 1.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - Language ar&lt;/strong&gt;: Claude Opus 4.8 (71.58) beat Claude Opus 4.5 by 0.95&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - Language en&lt;/strong&gt;: Gemini 3.1 Flash Lite (87.28) beat MiniMax-M2.5 by 0.77&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - Language cy&lt;/strong&gt;: Gemini 3.1 Flash Lite (82.03) beat Claude Sonnet 4.5 by 0.65&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - Language am&lt;/strong&gt;: Gemini 3.1 Flash Lite (68.6) beat Claude Opus 4.6 by 0.59&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Vals AI MedScribe&lt;/strong&gt;: Claude Fable 5 (88.52) beat GPT-5.1 by 0.43&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - Language 
be&lt;/strong&gt;: Claude Opus 4.8 (69.43) beat Gemini 3.1 Pro (Preview) by 0.32&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - Language ceb&lt;/strong&gt;: Gemini 3.1 Flash Lite (78.06) beat Gemini 3.1 Pro (Preview) by 0.29&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - Language el&lt;/strong&gt;: Gemini 3.1 Flash Lite (73.81) beat Claude Opus 4.5 by 0.15&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Blueprint-Bench 2&lt;/strong&gt;: Claude Fable 5 (0.386) beat GPT-5.5 by 0.02&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LiveBench Olympiad&lt;/strong&gt;: Claude Fable 5 (High) (92.18) beat Claude Opus 4.6 (Thinking, High) by 0.01&lt;/li&gt;&lt;/ul&gt;</content><summary>AI Benchmark Digest — 2026-06-10

=== DAILY ===
NEW BENCHMARKS (1)
  - SkateBench (Success Rate (%)): leader gemini-3.1-pro-preview (96.92), 28 models
      Skateboarding-domain knowledge benchmark ranking models by how well they identify technical skateboard tricks from 390 trick definitions. Skate</summary></entry><entry><title>AI Benchmark Digest — 2026-06-10</title><id>https://aibenchmarks.dev/digest/2026-06-10</id><updated>2026-06-10T08:06:50.673963+00:00</updated><link href="https://aibenchmarks.dev/#/digest" /><author><name>AI Benchmark Hub</name></author><content type="html">&lt;h2&gt;Daily&lt;/h2&gt;
&lt;h3&gt;New Benchmarks (1)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;SkateBench&lt;/strong&gt; (Success Rate (%)): leader gemini-3.1-pro-preview (96.92), 28 models&lt;br&gt;&lt;span&gt;Skateboarding-domain knowledge benchmark ranking models by how well they identify technical skateboard tricks from 390 trick definitions. SkateBench v2 reports success rate, cost, and speed.&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;New Models (2)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Claude 5&lt;/strong&gt; — Elo 1904, #22&lt;ul&gt;&lt;li&gt;LiveBench Olympiad: 92.18 (#1/124)&lt;/li&gt;&lt;li&gt;LiveBench Plot Unscrambling: 78.09 (#1/124)&lt;/li&gt;&lt;li&gt;LiveBench Python: 95.0 (#1/124)&lt;/li&gt;&lt;li&gt;Opper TaskBench: 96.4 (#1/85)&lt;/li&gt;&lt;li&gt;Vals AI (Vals Index): 75.14 (#1/25)&lt;/li&gt;&lt;li&gt;Vals AI Multimodal Index: 74.15 (#1/20)&lt;/li&gt;&lt;li&gt;Vals AI LegalBench: 88.56 (#1/114)&lt;/li&gt;&lt;li&gt;Vals AI CorpFin v2: 71.83 (#1/111)&lt;/li&gt;&lt;li&gt;Vals AI MedScribe: 88.52 (#1/62)&lt;/li&gt;&lt;li&gt;Vals AI ProofBench: 77.0 (#1/37)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Fable 5&lt;/strong&gt; — Elo 1901, #23&lt;ul&gt;&lt;li&gt;Blueprint-Bench 2: 0.386 (#1/14)&lt;/li&gt;&lt;li&gt;LLM Stats (OSWorld-Verified): 85.0 (#1/16)&lt;/li&gt;&lt;li&gt;YC-Bench: 1977.6 (#1/21)&lt;/li&gt;&lt;li&gt;SEAL - MCP Atlas: 83.3 (#2/23)&lt;/li&gt;&lt;li&gt;Vellum - HumanEval: 95.0 (#2/38)&lt;/li&gt;&lt;li&gt;Vellum - GPQA: 94.1 (#3/57)&lt;/li&gt;&lt;li&gt;ClockBench: 35.0 (#4/27)&lt;/li&gt;&lt;li&gt;LLM Stats (GDPval-AA): 64.4 (#11/12)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;New #1 Leaders (55)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;YC-Bench&lt;/strong&gt;: Claude Fable 5 (1977.6) beat Claude Opus 4.7 by 263.1&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Multi-turn Debate (Lechmazur)&lt;/strong&gt;: Claude Fable 5 (High) (1770.9) beat Claude Opus 4.7 (High) by 53.8&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA GDPval&lt;/strong&gt;: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (1932.47) beat Claude Opus 4.8 (Adaptive Reasoning, Max Effort) by 42.67&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - Language ay&lt;/strong&gt;: step-3.7-flash-20260528 (77.14) beat Gemini 3.1 Pro (Preview) by 14.23&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LiveBench Python&lt;/strong&gt;: Claude 5 (95.0) beat Claude Opus 4.5 (Thinking 64K, High) (2025-11-01) by 10.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;CursorBench 3.1&lt;/strong&gt;: Fable 5 Max (72.9) beat Claude Opus 4.7 by 8.1&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE) - Dart&lt;/strong&gt;: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (88.0) beat GPT-5.3 Codex (xHigh) by 8.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE) - R&lt;/strong&gt;: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (82.0) beat GPT-5.5 (Medium) by 8.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE) - Swift&lt;/strong&gt;: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (100.0) beat GPT-5.5 (xHigh) by 8.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Vals AI Vibe Code Bench&lt;/strong&gt;: Claude 5 (90.35) beat Claude Opus 4.8 by 7.63&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Humanity's Last Exam&lt;/strong&gt;: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (53.34) beat Claude Opus 4.8 (Adaptive Reasoning, Max Effort) by 7.6&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience&lt;/strong&gt;: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) 
(40.15) beat Gemini 3.1 Pro (Preview) by 7.22&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Vellum - HumanEval&lt;/strong&gt;: Claude Mythos 5 (95.5) beat Claude Opus 4.8 by 6.9&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Vellum - Humanity's Last Exam&lt;/strong&gt;: Claude Mythos 5 (64.5) beat Claude Opus 4.8 by 6.6&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - Language crh&lt;/strong&gt;: step-3.7-flash-20260528 (73.05) beat Gemini 3.1 Pro (Preview) by 6.27&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE) - Java&lt;/strong&gt;: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (79.0) beat GPT-5.3 Codex (xHigh) by 6.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Vals AI ProofBench&lt;/strong&gt;: Claude 5 (77.0) beat aristotle by 6.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Business&lt;/strong&gt;: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (55.0) beat GPT-5.5 (xHigh) by 5.9&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Science, Engineering &amp; Mathematics&lt;/strong&gt;: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (57.1) beat GPT-5.5 (High) by 4.8&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Vals AI (Vals Index)&lt;/strong&gt;: Claude 5 (75.14) beat Claude Opus 4.8 by 4.78&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Vals AI IOI&lt;/strong&gt;: Claude 5 (72.25) beat GPT-5.4 (2026-03-05) by 4.42&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Humanities &amp; Social Sciences&lt;/strong&gt;: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (60.9) beat Gemini 3 Pro (Preview) (High) by 4.3&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE) - Go&lt;/strong&gt;: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (88.0) beat GPT-5.5 (High) by 4.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Artificial Analysis Intelligence Index&lt;/strong&gt;: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (64.88) beat Claude Opus 4.8 (Adaptive 
Reasoning, Max Effort) by 3.44&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - Language cv&lt;/strong&gt;: gemma-4-31B-it-20260402 (69.3) beat Claude Opus 4.5 by 3.39&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Vals AI CorpFin v2&lt;/strong&gt;: Claude 5 (71.83) beat Grok 4.3 by 3.3&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Vals AI Multimodal Index&lt;/strong&gt;: Claude 5 (74.15) beat Claude Opus 4.8 by 3.26&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE)&lt;/strong&gt;: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (87.6) beat GPT-5.5 (xHigh) by 3.2&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - MGSM&lt;/strong&gt;: Claude Opus 4.8 (96.62) beat Claude Opus 4.6 by 2.36&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - Language ban&lt;/strong&gt;: step-3.7-flash-20260528 (69.03) beat Claude Opus 4.5 by 2.32&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Terminal-Bench Hard&lt;/strong&gt;: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (62.88) beat GPT-5.5 (xHigh) by 2.27&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LiveBench Plot Unscrambling&lt;/strong&gt;: Claude 5 (78.09) beat GPT-5.5 (High) by 1.81&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLM Stats (OSWorld-Verified)&lt;/strong&gt;: Claude Fable 5 (85.0) beat Claude Opus 4.8 by 1.6&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE) - Python&lt;/strong&gt;: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (92.0) beat GPT-5.5 (xHigh) by 1.5&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - Language chm&lt;/strong&gt;: Claude Opus 4.7 (63.6) beat Gemini 3.1 Pro (Preview) by 1.48&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - Language doi&lt;/strong&gt;: Claude Opus 4.7 (71.84) beat Gemini 3 Pro (Preview) by 1.46&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA CritPt&lt;/strong&gt;: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (28.57) beat GPT-5.5 (xHigh) by 
1.43&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - Language es&lt;/strong&gt;: Gemini 3.1 Flash Lite (76.16) beat Claude Opus 4.6 by 1.42&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA SciCode&lt;/strong&gt;: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (60.19) beat Gemini 3.1 Pro (Preview) by 1.28&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - Language ace&lt;/strong&gt;: step-3.7-flash-20260528 (72.48) beat Gemini 3.1 Pro (Preview) by 1.28&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - MMLU&lt;/strong&gt;: intellect-3-20251126 (100.0) beat Claude Sonnet 4.6 by 1.27&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - ARC&lt;/strong&gt;: intellect-3-20251126 (100.0) beat Gemini 3.1 Pro (Preview) by 1.26&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Vals AI LegalBench&lt;/strong&gt;: Claude 5 (88.56) beat Gemini 3.1 Pro (Preview) by 1.16&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - Language ca&lt;/strong&gt;: Gemini 3.1 Flash Lite (76.29) beat Gemini 3 Pro (Preview) by 1.03&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Opper TaskBench&lt;/strong&gt;: Claude 5 (96.4) beat Claude Opus 4.7 by 1.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - Language ar&lt;/strong&gt;: Claude Opus 4.8 (71.58) beat Claude Opus 4.5 by 0.95&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - Language en&lt;/strong&gt;: Gemini 3.1 Flash Lite (87.28) beat MiniMax-M2.5 by 0.77&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - Language cy&lt;/strong&gt;: Gemini 3.1 Flash Lite (82.03) beat Claude Sonnet 4.5 by 0.65&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - Language am&lt;/strong&gt;: Gemini 3.1 Flash Lite (68.6) beat Claude Opus 4.6 by 0.59&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Vals AI MedScribe&lt;/strong&gt;: Claude 5 (88.52) beat GPT-5.1 by 0.43&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - Language be&lt;/strong&gt;: Claude Opus 4.8 (69.43) beat Gemini 3.1 Pro (Preview) 
by 0.32&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - Language ceb&lt;/strong&gt;: Gemini 3.1 Flash Lite (78.06) beat Gemini 3.1 Pro (Preview) by 0.29&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language - Language el&lt;/strong&gt;: Gemini 3.1 Flash Lite (73.81) beat Claude Opus 4.5 by 0.15&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Blueprint-Bench 2&lt;/strong&gt;: Claude Fable 5 (0.386) beat GPT-5.5 by 0.02&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LiveBench Olympiad&lt;/strong&gt;: Claude 5 (92.18) beat Claude Opus 4.6 (Thinking) (High) by 0.01&lt;/li&gt;&lt;/ul&gt;</content><summary>AI Benchmark Digest — 2026-06-10

=== DAILY ===
NEW BENCHMARKS (1)
  - SkateBench (Success Rate (%)): leader gemini-3.1-pro-preview (96.92), 28 models
      Skateboarding-domain knowledge benchmark ranking models by how well they identify technical skateboard tricks from 390 trick definitions. Skate</summary></entry><entry><title>AI Benchmark Digest — 2026-06-09</title><id>https://aibenchmarks.dev/digest/2026-06-09</id><updated>2026-06-09T07:53:25.528997+00:00</updated><link href="https://aibenchmarks.dev/#/digest" /><author><name>AI Benchmark Hub</name></author><content type="html">&lt;h2&gt;Daily&lt;/h2&gt;
&lt;h3&gt;Top-10 New Scores (2)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5 (xHigh)&lt;/strong&gt; on SEAL - SWE Atlas - Codebase QnA: 45.43 (#2)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5 (xHigh)&lt;/strong&gt; on SEAL - SWE Atlas - Test Writing: 42.59 (#3)&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;New #1 Leaders (7)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;GSMA Open-Telco - TeleTables&lt;/strong&gt;: TelecomGPT (88.0) beat OTel-LLM-8.3B-QnA by 26.2&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GSMA Open-Telco LLM Leaderboard&lt;/strong&gt;: TelecomGPT (89.64) beat OTel-LLM-8.3B-QnA by 3.66&lt;/li&gt;&lt;li&gt;&lt;strong&gt;SEAL - SWE Atlas - Codebase QnA&lt;/strong&gt;: Opus 4.8 (Claude Code) (48.79) beat GPT-5.5 by 3.36&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GSMA Open-Telco - 3GPP&lt;/strong&gt;: TelecomGPT (84.22) beat OTel-LLM-8.3B-QnA by 2.82&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GSMA Open-Telco - TeleLogs&lt;/strong&gt;: TelecomGPT (98.96) beat OTel-LLM-8.3B-QnA by 2.66&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GSMA Open-Telco - srsRAN-Bench&lt;/strong&gt;: TelecomGPT (91.33) beat OTel-LLM-8.3B-QnA by 1.65&lt;/li&gt;&lt;li&gt;&lt;strong&gt;SEAL - SWE Atlas - Test Writing&lt;/strong&gt;: Opus 4.8 (Claude Code) (45.56) beat GPT-5.4 (xHigh) by 1.2&lt;/li&gt;&lt;/ul&gt;</content><summary>AI Benchmark Digest — 2026-06-09

=== DAILY ===
NEW SCORES FROM TOP-10 MODELS (2)
  - GPT-5.5 (xHigh) on SEAL - SWE Atlas - Codebase QnA: 45.43 Score (#2/14)
  - GPT-5.5 (xHigh) on SEAL - SWE Atlas - Test Writing: 42.59 Score (#3/14)

NEW #1 LEADERS (7)
  - GSMA Open-Telco - TeleTables (Score (%)): </summary></entry><entry><title>AI Benchmark Digest — 2026-06-07</title><id>https://aibenchmarks.dev/digest/2026-06-07</id><updated>2026-06-07T08:34:58.487719+00:00</updated><link href="https://aibenchmarks.dev/#/digest" /><author><name>AI Benchmark Hub</name></author><content type="html">&lt;h2&gt;Weekly&lt;/h2&gt;
&lt;h3&gt;New Models (2)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;MiniMax-M3&lt;/strong&gt; — ELO 1762, #83&lt;ul&gt;&lt;li&gt;LLM Stats (OmniDocBench 1.5): 91.6 (#1/13)&lt;/li&gt;&lt;li&gt;LLM Stats (Video-MME): 85.4 (#2/13)&lt;/li&gt;&lt;li&gt;OpenClawProBench: 75.1 (#2/65)&lt;/li&gt;&lt;li&gt;Vals AI MedScribe: 87.25 (#2/61)&lt;/li&gt;&lt;li&gt;AA IFBench: 82.86 (#3/429)&lt;/li&gt;&lt;li&gt;LLM Stats (Claw-Eval): 74.5 (#3/9)&lt;/li&gt;&lt;li&gt;LLM Stats (NL2Repo): 42.13 (#3/7)&lt;/li&gt;&lt;li&gt;AA GPQA Diamond: 92.93 (#4/501)&lt;/li&gt;&lt;li&gt;Vals AI CorpFin v2: 68.1 (#4/110)&lt;/li&gt;&lt;li&gt;Design Arena (3D): 1348.0 (#5/115)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;nemotron-3-ultra-550B-a55B&lt;/strong&gt; — ELO 1587, #292&lt;ul&gt;&lt;li&gt;PinchBench: 90.58 (#10/49)&lt;/li&gt;&lt;li&gt;Vals AI CorpFin v2: 65.46 (#16/110)&lt;/li&gt;&lt;li&gt;Vals AI (Vals Index): 43.99 (#18/24)&lt;/li&gt;&lt;li&gt;LiveBench Python: 75.0 (#24/122)&lt;/li&gt;&lt;li&gt;LiveBench Paraphrase: 61.15 (#33/122)&lt;/li&gt;&lt;li&gt;Vals AI TaxEval v2: 73.1 (#34/116)&lt;/li&gt;&lt;li&gt;Bullshit Benchmark: 41.8 (#34/148)&lt;/li&gt;&lt;li&gt;Vals AI MedCode: 38.62 (#35/62)&lt;/li&gt;&lt;li&gt;AI Chess Leaderboard (Reasoning): 975.0 (#39/277)&lt;/li&gt;&lt;li&gt;LiveBench Code Generation: 77.47 (#43/122)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;Top-10 New Scores (4)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5 (xHigh)&lt;/strong&gt; on IMO-Bench: 71.9 (#4)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5 Pro&lt;/strong&gt; on IUMB: 100.0 (#2)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5 Pro (xHigh)&lt;/strong&gt; on IMO-Bench: 88.1 (#2)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3 Deep Think&lt;/strong&gt; on IUMB: 87.5 (#6)&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;New #1 Leaders (10)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;EQ-Bench Creative Writing v3&lt;/strong&gt;: Claude Opus 4.7 (2050.8) beat GPT-5.4 by 144.8&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Chatbot Arena (Image-to-Video)&lt;/strong&gt;: Grok 1.5 (1473.0) beat dreamina-seedance-2.0-720p by 11.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLM Stats (Multi-Challenge)&lt;/strong&gt;: Nova 2 Pro (77.7) beat GPT-5 by 8.1&lt;/li&gt;&lt;li&gt;&lt;strong&gt;MathArena - Kangaroo 2025 Levels 11-12&lt;/strong&gt;: Claude Opus 4.8 (Thinking) (100.0) beat GPT-5.4 (xHigh) by 1.67&lt;/li&gt;&lt;li&gt;&lt;strong&gt;MathArena - APEX 2025&lt;/strong&gt;: Claude Opus 4.8 (Thinking) (81.25) beat GPT-5.5 (xHigh) by 1.04&lt;/li&gt;&lt;li&gt;&lt;strong&gt;MathArena - Kangaroo 2025 Levels 7-8&lt;/strong&gt;: Claude Opus 4.8 (Thinking) (96.67) beat GPT-5.4 (xHigh) by 0.84&lt;/li&gt;&lt;li&gt;&lt;strong&gt;MathArena - AIME 2026&lt;/strong&gt;: Claude Opus 4.8 (Thinking) (100.0) beat GPT-5.4 (xHigh) by 0.83&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLM Stats (OmniDocBench 1.5)&lt;/strong&gt;: MiniMax-M3 (91.6) beat Qwen 3.6 Plus by 0.4&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GAIA&lt;/strong&gt;: CustomGPT.ai Research Lab v44 (93.36) beat Co-Sight Pro v1.0.1 by 0.34&lt;/li&gt;&lt;li&gt;&lt;strong&gt;ForecastBench&lt;/strong&gt;: Grok 4.20 (Beta, D) (68.1) beat green-tree by 0.2&lt;/li&gt;&lt;/ul&gt;</content><summary>AI Benchmark Digest — 2026-06-07

=== WEEKLY ===
NEW MODELS (2)
  - MiniMax-M3 — ELO 1762, #83/970 (above: Gemini 3 Flash (High), below: Claude Opus 4.5 (Non-reasoning))
      LLM Stats (OmniDocBench 1.5): 91.6 (#1/13)
      LLM Stats (Video-MME): 85.4 (#2/13)
      OpenClawProBench: 75.1 (#2/65)
  </summary></entry><entry><title>AI Benchmark Digest — 2026-06-06</title><id>https://aibenchmarks.dev/digest/2026-06-06</id><updated>2026-06-06T07:45:06.870709+00:00</updated><link href="https://aibenchmarks.dev/#/digest" /><author><name>AI Benchmark Hub</name></author><content type="html">&lt;h2&gt;Daily&lt;/h2&gt;
&lt;h3&gt;New Benchmarks (20)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Pencil Puzzle Bench - Yajilin&lt;/strong&gt; (Direct-ask Success Rate (%)): leader gpt-5.2 (High) (20.0), 51 models&lt;br&gt;&lt;span&gt;PPBench direct-ask success rate on Yajilin loop-and-shading puzzles from the golden_300 split, testing exact constraint solving from puzz.link grids.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Pencil Puzzle Bench - Slitherlink&lt;/strong&gt; (Direct-ask Success Rate (%)): leader gpt-5.2 (High) (33.3), 51 models&lt;br&gt;&lt;span&gt;PPBench direct-ask success rate on Slitherlink loop puzzles, where numbered cells constrain how a single continuous loop surrounds the grid.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Pencil Puzzle Bench - Heyawake&lt;/strong&gt; (Direct-ask Success Rate (%)): leader claude-opus-4-5-high (0.0), 51 models&lt;br&gt;&lt;span&gt;PPBench direct-ask success rate on Heyawake room-shading puzzles, testing region constraints, connectivity, and line-of-sight reasoning.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Pencil Puzzle Bench - Mashu&lt;/strong&gt; (Direct-ask Success Rate (%)): leader gpt-5.2 (xHigh) (60.0), 51 models&lt;br&gt;&lt;span&gt;PPBench direct-ask success rate on Mashu loop puzzles, where black and white pearls impose turn and straight-line constraints.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Pencil Puzzle Bench - Shakashaka&lt;/strong&gt; (Direct-ask Success Rate (%)): leader claude-sonnet-4-5 (0.0), 51 models&lt;br&gt;&lt;span&gt;PPBench direct-ask success rate on Shakashaka triangle-shading puzzles, testing local clue satisfaction and global rectangle formation.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Pencil Puzzle Bench - Nurikabe&lt;/strong&gt; (Direct-ask Success Rate (%)): leader claude-opus-4-6 (Thinking) (33.3), 51 models&lt;br&gt;&lt;span&gt;PPBench direct-ask success rate on Nurikabe island puzzles, where numbered islands must be separated by one connected wall 
region.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Pencil Puzzle Bench - LITS&lt;/strong&gt; (Direct-ask Success Rate (%)): leader gpt-5.2 (xHigh) (53.3), 51 models&lt;br&gt;&lt;span&gt;PPBench direct-ask success rate on LITS tetromino-shading puzzles, testing region-wise shape placement and adjacency constraints.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Pencil Puzzle Bench - Light Up&lt;/strong&gt; (Direct-ask Success Rate (%)): leader claude-opus-4-6 (Thinking) (66.7), 51 models&lt;br&gt;&lt;span&gt;PPBench direct-ask success rate on Light Up puzzles, where lamps must illuminate every open cell while satisfying numbered black-cell clues.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Pencil Puzzle Bench - Nurimisaki&lt;/strong&gt; (Direct-ask Success Rate (%)): leader claude-opus-4-6 (Thinking) (33.3), 51 models&lt;br&gt;&lt;span&gt;PPBench direct-ask success rate on Nurimisaki puzzles, a Nurikabe-family grid task requiring connected-region reasoning around clue cells.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Pencil Puzzle Bench - Shikaku&lt;/strong&gt; (Direct-ask Success Rate (%)): leader gpt-5.2 (xHigh) (80.0), 51 models&lt;br&gt;&lt;span&gt;PPBench direct-ask success rate on Shikaku rectangle-partitioning puzzles, where each numbered clue defines one rectangle of matching area.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Pencil Puzzle Bench - Norinori&lt;/strong&gt; (Direct-ask Success Rate (%)): leader gpt-5.2 (xHigh) (93.3), 51 models&lt;br&gt;&lt;span&gt;PPBench direct-ask success rate on Norinori shading puzzles, testing room constraints and two-cell adjacency patterns.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Pencil Puzzle Bench - Double Choco&lt;/strong&gt; (Direct-ask Success Rate (%)): leader gemini-3.1-pro (6.7), 51 models&lt;br&gt;&lt;span&gt;PPBench direct-ask success rate on Double Choco region-division puzzles, testing balanced partitioning under color and shape 
constraints.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Pencil Puzzle Bench - Firefly&lt;/strong&gt; (Direct-ask Success Rate (%)): leader gpt-5.2 (xHigh) (33.3), 51 models&lt;br&gt;&lt;span&gt;PPBench direct-ask success rate on Firefly line-drawing puzzles, testing path construction from directional clues and grid constraints.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Pencil Puzzle Bench - Sashigane&lt;/strong&gt; (Direct-ask Success Rate (%)): leader mistral-large-2512 (0.0), 51 models&lt;br&gt;&lt;span&gt;PPBench direct-ask success rate on Sashigane shape-partitioning puzzles, testing right-angle region construction from numbered and directional clues.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Pencil Puzzle Bench - Sudoku&lt;/strong&gt; (Direct-ask Success Rate (%)): leader gpt-5.2 (xHigh) (20.0), 51 models&lt;br&gt;&lt;span&gt;PPBench direct-ask success rate on Sudoku puzzles, testing classic row, column, and box constraint satisfaction through exact move outputs.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Pencil Puzzle Bench - Nurimaze&lt;/strong&gt; (Direct-ask Success Rate (%)): leader claude-opus-4-6 (Thinking) (26.7), 51 models&lt;br&gt;&lt;span&gt;PPBench direct-ask success rate on Nurimaze puzzles, testing maze-style path and shading constraints in a connected grid.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Pencil Puzzle Bench - Tapa&lt;/strong&gt; (Direct-ask Success Rate (%)): leader gpt-5.2 (xHigh) (60.0), 51 models&lt;br&gt;&lt;span&gt;PPBench direct-ask success rate on Tapa shading puzzles, where clue numbers describe blocks of shaded neighboring cells.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Pencil Puzzle Bench - Kurodoko&lt;/strong&gt; (Direct-ask Success Rate (%)): leader gpt-5.2 (xHigh) (6.7), 51 models&lt;br&gt;&lt;span&gt;PPBench direct-ask success rate on Kurodoko visibility puzzles, testing shading, sight-line counts, and connected unshaded cells.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Pencil Puzzle 
Bench - Country&lt;/strong&gt; (Direct-ask Success Rate (%)): leader gemini-3.1-pro (6.7), 51 models&lt;br&gt;&lt;span&gt;PPBench direct-ask success rate on Country region puzzles, testing loop and region constraints over a partitioned grid.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Pencil Puzzle Bench - Hitori&lt;/strong&gt; (Direct-ask Success Rate (%)): leader claude-opus-4-6 (Thinking) (66.7), 51 models&lt;br&gt;&lt;span&gt;PPBench direct-ask success rate on Hitori number-grid puzzles, where repeated numbers are shaded while preserving connectivity and non-adjacency constraints.&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;New #1 Leaders (24)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;LLM Stats (Multi-Challenge)&lt;/strong&gt;: Nova 2 Pro (77.7) beat GPT-5 by 8.1&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Ukrainian LLM - Global MMLU Full UK World Religions&lt;/strong&gt;: MamayLM-Gemma-3-27B-IT-v2.0 (87.13) beat gemma-3-12B-pt by 7.6&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Ukrainian LLM - Global MMLU Full UK High School US History&lt;/strong&gt;: MamayLM-Gemma-3-27B-IT-v2.0 (91.67) beat MamayLM-Gemma-3-12B-IT-v1.0 by 5.4&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Ukrainian LLM - Global MMLU Full UK Anatomy&lt;/strong&gt;: MamayLM-Gemma-3-27B-IT-v2.0 (65.19) beat lapa-12B-pt by 5.19&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Ukrainian LLM - Global MMLU Full UK Clinical Knowledge&lt;/strong&gt;: MamayLM-Gemma-3-27B-IT-v2.0 (77.74) beat gemma-3-12B-pt by 4.53&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Ukrainian LLM - Global MMLU Full UK Professional LAW&lt;/strong&gt;: MamayLM-Gemma-3-27B-IT-v2.0 (51.5) beat gemma-3-12B-pt by 4.43&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Ukrainian LLM - Global MMLU Full UK Humanities&lt;/strong&gt;: MamayLM-Gemma-3-27B-IT-v2.0 (61.68) beat Qwen3-8B-Base by 4.12&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Ukrainian LLM - Global MMLU Full UK Computer Security&lt;/strong&gt;: MamayLM-Gemma-3-12B-IT-v2.0 (82.0) beat MamayLM-Gemma-3-12B-IT-v1.0-FP8-Static-Nadiia by 4.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Ukrainian LLM - Global MMLU Full UK Global Facts&lt;/strong&gt;: MamayLM-Gemma-3-27B-IT-v2.0 (52.0) beat Gemma 3 12B (IT) by 4.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Ukrainian LLM - Global MMLU Full UK Miscellaneous&lt;/strong&gt;: MamayLM-Gemma-3-27B-IT-v2.0 (83.52) beat MamayLM-Gemma-3-12B-IT-v1.0-FP8-Static-Nadiia by 3.95&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Ukrainian LLM - Global MMLU Full UK Prehistory&lt;/strong&gt;: MamayLM-Gemma-3-27B-IT-v2.0 (77.78) beat gemma-3-12B-pt by 3.71&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Ukrainian LLM - Global MMLU Full UK Other&lt;/strong&gt;: MamayLM-Gemma-3-27B-IT-v2.0 (74.57) 
beat gemma-3-12B-pt by 3.41&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Ukrainian LLM - Global MMLU Full UK Business Ethics&lt;/strong&gt;: MamayLM-Gemma-3-12B-IT-v2.0 (77.0) beat MamayLM-Gemma-3-12B-IT-v1.0 by 3.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Ukrainian LLM - Global MMLU Full UK High School World History&lt;/strong&gt;: MamayLM-Gemma-3-27B-IT-v2.0 (86.08) beat gemma-3-12B-pt by 1.69&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Ukrainian LLM - Global MMLU Full UK High School Microeconomics&lt;/strong&gt;: MamayLM-Gemma-3-27B-IT-v2.0 (84.45) beat Qwen3-8B-Base by 1.68&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Ukrainian LLM - Global MMLU Full UK Marketing&lt;/strong&gt;: MamayLM-Gemma-3-27B-IT-v2.0 (88.89) beat MamayLM-Gemma-3-12B-IT-v1.0 by 1.28&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Ukrainian LLM - Global MMLU Full UK Professional Psychology&lt;/strong&gt;: MamayLM-Gemma-3-27B-IT-v2.0 (70.1) beat gemma-3-12B-pt by 0.98&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Ukrainian LLM - Global MMLU Full UK Public Relations&lt;/strong&gt;: MamayLM-Gemma-3-12B-IT-v2.0 (68.18) beat lapa-12B-pt by 0.91&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Ukrainian LLM - Global MMLU Full UK High School European History&lt;/strong&gt;: MamayLM-Gemma-3-27B-IT-v2.0 (84.24) beat MamayLM-Gemma-3-12B-IT-v1.0-FP8-Static-Nadiia by 0.6&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Ukrainian LLM - Global MMLU Full UK High School Macroeconomics&lt;/strong&gt;: MamayLM-Gemma-3-27B-IT-v2.0 (76.67) beat gemma-3-12B-pt by 0.52&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Ukrainian LLM - Global MMLU Full UK Sociology&lt;/strong&gt;: MamayLM-Gemma-3-27B-IT-v2.0 (83.08) beat lapa-v0.1.2-instruct by 0.49&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLM Stats (OmniDocBench 1.5)&lt;/strong&gt;: MiniMax-M3 (91.6) beat Qwen 3.6 Plus by 0.4&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Ukrainian LLM - Global MMLU Full UK Professional Medicine&lt;/strong&gt;: MamayLM-Gemma-3-27B-IT-v2.0 (80.15) beat gemma-3-12B-pt by 0.37&lt;/li&gt;&lt;li&gt;&lt;strong&gt;ForecastBench&lt;/strong&gt;: Grok 4.20 (Beta, D) 
(68.1) beat green-tree by 0.3&lt;/li&gt;&lt;/ul&gt;</content><summary>AI Benchmark Digest — 2026-06-06

=== DAILY ===
NEW BENCHMARKS (20)
  - Pencil Puzzle Bench - Yajilin (Direct-ask Success Rate (%)): leader gpt-5.2 (High) (20.0), 51 models
      PPBench direct-ask success rate on Yajilin loop-and-shading puzzles from the golden_300 split, testing exact constraint s</summary></entry><entry><title>AI Benchmark Digest — 2026-06-04</title><id>https://aibenchmarks.dev/digest/2026-06-04</id><updated>2026-06-04T08:22:19.073162+00:00</updated><link href="https://aibenchmarks.dev/#/digest" /><author><name>AI Benchmark Hub</name></author><content type="html">&lt;h2&gt;Daily&lt;/h2&gt;
&lt;h3&gt;New #1 Leaders (1)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;GAIA&lt;/strong&gt;: CustomGPT.ai Research Lab v44 (93.36) beat Co-Sight Pro v1.0.1 by 0.34&lt;/li&gt;&lt;/ul&gt;</content><summary>AI Benchmark Digest — 2026-06-04

=== DAILY ===
NEW #1 LEADERS (1)
  - GAIA (Accuracy (%)): CustomGPT.ai Research Lab v44 (93.36) beat Co-Sight Pro v1.0.1 (93.02) by 0.34
</summary></entry><entry><title>AI Benchmark Digest — 2026-06-03</title><id>https://aibenchmarks.dev/digest/2026-06-03</id><updated>2026-06-03T08:25:40.519214+00:00</updated><link href="https://aibenchmarks.dev/#/digest" /><author><name>AI Benchmark Hub</name></author><content type="html">&lt;h2&gt;Daily&lt;/h2&gt;
&lt;h3&gt;Top-10 New Scores (2)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5 Pro&lt;/strong&gt; on IUMB: 100.0 (#2)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3 Deep Think&lt;/strong&gt; on IUMB: 87.5 (#6)&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;New #1 Leaders (4)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;MathArena - Kangaroo 2025 Levels 11-12&lt;/strong&gt;: Claude Opus 4.8 (Thinking) (100.0) beat GPT-5.4 (xHigh) by 1.67&lt;/li&gt;&lt;li&gt;&lt;strong&gt;MathArena - APEX 2025&lt;/strong&gt;: Claude Opus 4.8 (Thinking) (81.25) beat GPT-5.5 (xHigh) by 1.04&lt;/li&gt;&lt;li&gt;&lt;strong&gt;MathArena - Kangaroo 2025 Levels 7-8&lt;/strong&gt;: Claude Opus 4.8 (Thinking) (96.67) beat GPT-5.4 (xHigh) by 0.84&lt;/li&gt;&lt;li&gt;&lt;strong&gt;MathArena - AIME 2026&lt;/strong&gt;: Claude Opus 4.8 (Thinking) (100.0) beat GPT-5.4 (xHigh) by 0.83&lt;/li&gt;&lt;/ul&gt;</content><summary>AI Benchmark Digest — 2026-06-03

=== DAILY ===
NEW SCORES FROM TOP-10 MODELS (2)
  - GPT-5.5 Pro on IUMB: 100.0 Score (%) (#2/55)
  - Gemini 3 Deep Think on IUMB: 87.5 Score (%) (#6/55)

NEW #1 LEADERS (4)
  - MathArena - Kangaroo 2025 Levels 11-12 (Accuracy (%)): Claude Opus 4.8 (Thinking) (100.0)</summary></entry><entry><title>AI Benchmark Digest — 2026-06-02</title><id>https://aibenchmarks.dev/digest/2026-06-02</id><updated>2026-06-02T08:19:29.198019+00:00</updated><link href="https://aibenchmarks.dev/#/digest" /><author><name>AI Benchmark Hub</name></author><content type="html">&lt;h2&gt;Daily&lt;/h2&gt;
&lt;h3&gt;New Benchmarks (1)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;GIM&lt;/strong&gt; (IRT ability (theta)): leader GPT-5.4 Pro (High) (2.16), 46 models&lt;br&gt;&lt;span&gt;Grounded Integration Measure from Meta FAIR: 820 multimodal and text-grounded problems testing integrated reasoning across quantitative, spatial, language, world-knowledge, and document tasks. Scores are reported as IRT ability on GIM-820.&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;Top-10 New Scores (2)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5 (xHigh)&lt;/strong&gt; on IMO-Bench: 71.9 (#4)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5 Pro (xHigh)&lt;/strong&gt; on IMO-Bench: 88.1 (#2)&lt;/li&gt;&lt;/ul&gt;</content><summary>AI Benchmark Digest — 2026-06-02

=== DAILY ===
NEW BENCHMARKS (1)
  - GIM (IRT ability (theta)): leader GPT-5.4 Pro (High) (2.16), 46 models
      Grounded Integration Measure from Meta FAIR: 820 multimodal and text-grounded problems testing integrated reasoning across quantitative, spatial, langua</summary></entry><entry><title>AI Benchmark Digest — 2026-06-01</title><id>https://aibenchmarks.dev/digest/2026-06-01</id><updated>2026-06-01T08:29:45.265204+00:00</updated><link href="https://aibenchmarks.dev/#/digest" /><author><name>AI Benchmark Hub</name></author><content type="html">&lt;h2&gt;Daily&lt;/h2&gt;
&lt;h3&gt;New #1 Leaders (3)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;EQ-Bench Creative Writing v3&lt;/strong&gt;: Claude Opus 4.7 (2050.8) beat GPT-5.4 by 144.8&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Design Arena (Data Viz)&lt;/strong&gt;: GLM-5.1 (1367.0) beat Claude Opus 4.7 (Thinking) by 23.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Chatbot Arena (Image-to-Video)&lt;/strong&gt;: Grok 1.5 (1473.0) beat dreamina-seedance-2.0-720p by 11.0&lt;/li&gt;&lt;/ul&gt;</content><summary>AI Benchmark Digest — 2026-06-01

=== DAILY ===
NEW #1 LEADERS (3)
  - EQ-Bench Creative Writing v3 (Elo): Claude Opus 4.7 (2050.8) beat GPT-5.4 (1906.0) by 144.8
  - Design Arena (Data Viz) (Elo): GLM-5.1 (1367.0) beat Claude Opus 4.7 (Thinking) (1344.0) by 23.0
  - Chatbot Arena (Image-to-Video) (</summary></entry><entry><title>AI Benchmark Digest — 2026-05-30</title><id>https://aibenchmarks.dev/digest/2026-05-30</id><updated>2026-05-30T07:49:09.779753+00:00</updated><link href="https://aibenchmarks.dev/#/digest" /><author><name>AI Benchmark Hub</name></author><content type="html">&lt;h2&gt;Daily&lt;/h2&gt;
&lt;h3&gt;Top-10 New Scores (5)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.8 (Adaptive Reasoning, Max Effort)&lt;/strong&gt; on UGI - Natural Intelligence: 65.39 (#30)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.8 (Adaptive Reasoning, Max Effort)&lt;/strong&gt; on UGI - Willingness (W/10): 2.2 (#1094)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.8 (Adaptive Reasoning, Max Effort)&lt;/strong&gt; on UGI - Writing: 65.88 (#34)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.8 (Adaptive Reasoning, Max Effort)&lt;/strong&gt; on UGI Leaderboard: 52.64 (#69)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-5.4 (xHigh)&lt;/strong&gt; on Creative Writing (Lechmazur): 3.4 (#2)&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;New #1 Leaders (2)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Bullshit Benchmark&lt;/strong&gt;: Claude Opus 4.8 (96.4) beat Claude Sonnet 4.6 by 1.9&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Creative Writing (Lechmazur)&lt;/strong&gt;: GPT-5.5 (xHigh) (3.5) beat GPT-5.5 (Thinking, xHigh) by 0.3&lt;/li&gt;&lt;/ul&gt;</content><summary>AI Benchmark Digest — 2026-05-30

=== DAILY ===
NEW SCORES FROM TOP-10 MODELS (5)
  - Claude Opus 4.8 (Adaptive Reasoning, Max Effort) on UGI - Natural Intelligence: 65.39 NatInt Score (#30/1247)
  - Claude Opus 4.8 (Adaptive Reasoning, Max Effort) on UGI - Willingness (W/10): 2.2 W/10 Score (#1094/</summary></entry><entry><title>AI Benchmark Digest — 2026-05-29</title><id>https://aibenchmarks.dev/digest/2026-05-29</id><updated>2026-05-29T08:06:41.324282+00:00</updated><link href="https://aibenchmarks.dev/#/digest" /><author><name>AI Benchmark Hub</name></author><content type="html">&lt;h2&gt;Daily&lt;/h2&gt;
&lt;h3&gt;New Benchmarks (1)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;DeepSWE&lt;/strong&gt; (Pass@1 (%)): leader GPT-5.5 (xHigh) (70.0), 12 models&lt;br&gt;&lt;span&gt;DataCurve benchmark measuring frontier coding agents on original, long-horizon software engineering tasks. Reports pass rates for model configurations on realistic repository work.&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;New Models (1)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.8&lt;/strong&gt; — ELO 1801, #52&lt;ul&gt;&lt;li&gt;Clerk LLM Leaderboard: 91.3 (#1/19)&lt;/li&gt;&lt;li&gt;Vellum - HumanEval: 88.6 (#1/36)&lt;/li&gt;&lt;li&gt;Vellum - Humanity's Last Exam: 57.9 (#1/20)&lt;/li&gt;&lt;li&gt;LLM Stats (DeepSearchQA): 93.1 (#1/6)&lt;/li&gt;&lt;li&gt;LLM Stats (Include): 87.6 (#1/30)&lt;/li&gt;&lt;li&gt;LLM Stats (OSWorld-Verified): 83.4 (#1/14)&lt;/li&gt;&lt;li&gt;LLM Stats (ScreenSpot Pro): 87.9 (#1/22)&lt;/li&gt;&lt;li&gt;LLM Stats (Toolathlon): 59.9 (#1/20)&lt;/li&gt;&lt;li&gt;FrontierSWE: 83.0 (#1/11)&lt;/li&gt;&lt;li&gt;Vals AI (Vals Index): 70.17 (#1/20)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;Top-10 New Scores (2)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5 (High)&lt;/strong&gt; on WebDev Arena: 1478.93 (#16)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5 (xHigh)&lt;/strong&gt; on WebDev Arena: 1504.74 (#12)&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;New #1 Leaders (16)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;AA GDPval&lt;/strong&gt;: Claude Opus 4.8 (Adaptive Reasoning, Max Effort) (1889.8) beat GPT-5.5 (xHigh) by 120.5&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Vellum - Humanity's Last Exam&lt;/strong&gt;: Claude Opus 4.8 (57.9) beat Gemini 3 Pro by 12.1&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Clerk LLM Leaderboard&lt;/strong&gt;: Claude Opus 4.8 (91.3) beat GPT-5.4 by 11.8&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Vals AI Vibe Code Bench&lt;/strong&gt;: Claude Opus 4.8 (82.72) beat Claude Opus 4.7 by 11.72&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Epoch AI - Apex Agents&lt;/strong&gt;: gemini-3.5-flash_unknown (49.6) beat GPT-5.5 (xHigh) by 11.2&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLM Stats (OSWorld-Verified)&lt;/strong&gt;: Claude Opus 4.8 (83.4) beat Claude Mythos Preview by 3.8&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLM Stats (Toolathlon)&lt;/strong&gt;: Claude Opus 4.8 (59.9) beat Gemini 3.5 Flash by 3.4&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Vals AI Multimodal Index&lt;/strong&gt;: Claude Opus 4.8 (70.71) beat GPT-5.5 by 2.94&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Vals AI (Vals Index)&lt;/strong&gt;: Claude Opus 4.8 (70.17) beat GPT-5.5 by 2.55&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLM Stats (DeepSearchQA)&lt;/strong&gt;: Claude Opus 4.8 (93.1) beat Claude Opus 4.6 by 1.8&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLM Stats (ScreenSpot Pro)&lt;/strong&gt;: Claude Opus 4.8 (87.9) beat GPT-5.2 by 1.6&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLM Stats (Include)&lt;/strong&gt;: Claude Opus 4.8 (87.6) beat Qwen 3.7 Max by 1.4&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Artificial Analysis Intelligence Index&lt;/strong&gt;: Claude Opus 4.8 (Adaptive Reasoning, Max Effort) (61.44) beat GPT-5.5 (xHigh) by 1.2&lt;/li&gt;&lt;li&gt;&lt;strong&gt;PinchBench&lt;/strong&gt;: Claude Opus 4.8 Fast (94.49) beat Qwen Max by 1.05&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Humanity's Last Exam&lt;/strong&gt;: Claude Opus 4.8 (Adaptive Reasoning, Max Effort) (45.74) beat Gemini 3.1 Pro 
(Preview) by 1.02&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Vellum - HumanEval&lt;/strong&gt;: Claude Opus 4.8 (88.6) beat Claude Opus 4.7 by 1.0&lt;/li&gt;&lt;/ul&gt;</content><summary>AI Benchmark Digest — 2026-05-29

=== DAILY ===
NEW BENCHMARKS (1)
  - DeepSWE (Pass@1 (%)): leader GPT-5.5 (xHigh) (70.0), 12 models
      DataCurve benchmark measuring frontier coding agents on original, long-horizon software engineering tasks. Reports pass rates for model configurations on realis</summary></entry><entry><title>AI Benchmark Digest — 2026-05-28</title><id>https://aibenchmarks.dev/digest/2026-05-28</id><updated>2026-05-28T08:13:42.023730+00:00</updated><link href="https://aibenchmarks.dev/#/digest" /><author><name>AI Benchmark Hub</name></author><content type="html">&lt;h2&gt;Daily&lt;/h2&gt;
&lt;h3&gt;Top-10 New Scores (1)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5 (xHigh)&lt;/strong&gt; on SWE-rebench: 62.73 (#1)&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;New #1 Leaders (2)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Kaggle FACTS Grounding&lt;/strong&gt;: Gemma 4 26B A4B (80.87) beat GPT-5.2 by 4.7&lt;/li&gt;&lt;li&gt;&lt;strong&gt;PinchBench&lt;/strong&gt;: Qwen Max (93.44) beat Grok 0.1 by 1.37&lt;/li&gt;&lt;/ul&gt;</content><summary>AI Benchmark Digest — 2026-05-28

=== DAILY ===
NEW SCORES FROM TOP-10 MODELS (1)
  - GPT-5.5 (xHigh) on SWE-rebench: 62.73 Resolved (%) (#1/82)

NEW #1 LEADERS (2)
  - Kaggle FACTS Grounding (Score (%)): Gemma 4 26B A4B (80.87) beat GPT-5.2 (76.17) by 4.7
  - PinchBench (Success Rate (%)): Qwen Max</summary></entry><entry><title>AI Benchmark Digest — 2026-05-27</title><id>https://aibenchmarks.dev/digest/2026-05-27</id><updated>2026-05-27T08:20:58.056719+00:00</updated><link href="https://aibenchmarks.dev/#/digest" /><author><name>AI Benchmark Hub</name></author><content type="html">&lt;h2&gt;Daily&lt;/h2&gt;
&lt;h3&gt;Top-10 New Scores (1)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;GPT-5.4 (xHigh)&lt;/strong&gt; on Creative Writing (Lechmazur): 3.2 (#2)&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;New #1 Leaders (11)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;LLM Chess (Saplin)&lt;/strong&gt;: GPT-5.5 (Medium) (1532.2) beat Gemini 3.1 Pro by 20.8&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLM Stats (PolyMATH)&lt;/strong&gt;: Qwen 3.7 Max (86.5) beat Qwen 3.6 Plus by 9.1&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLM Stats (MCP-Mark)&lt;/strong&gt;: Qwen 3.7 Max (60.8) beat Kimi K2.6 by 4.9&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLM Stats (NL2Repo)&lt;/strong&gt;: Qwen 3.7 Max (47.2) beat GLM-5.1 by 4.5&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLM Stats (MMLU-ProX)&lt;/strong&gt;: Qwen 3.7 Max (87.0) beat Qwen 3.6 Plus by 2.3&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLM Stats (HMMT Feb 26)&lt;/strong&gt;: Qwen 3.7 Max (97.1) beat DeepSeek V4 Pro (Max) by 1.9&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLM Stats (MAXIFE)&lt;/strong&gt;: Qwen 3.7 Max (89.2) beat Qwen 3.6 Plus by 1.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLM Stats (Include)&lt;/strong&gt;: Qwen 3.7 Max (86.2) beat Qwen 3.5 397B A17B by 0.6&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLM Stats (IMO-AnswerBench)&lt;/strong&gt;: Qwen 3.7 Max (90.0) beat DeepSeek V4 Pro (Max) by 0.2&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Creative Writing (Lechmazur)&lt;/strong&gt;: GPT-5.5 (Thinking, xHigh) (3.2) beat GPT-5.5 by 0.2&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLM Stats (MMLU-Redux)&lt;/strong&gt;: Qwen 3.7 Max (95.0) beat Qwen 3.5 397B A17B by 0.1&lt;/li&gt;&lt;/ul&gt;</content><summary>AI Benchmark Digest — 2026-05-27

=== DAILY ===
NEW SCORES FROM TOP-10 MODELS (1)
  - GPT-5.4 (xHigh) on Creative Writing (Lechmazur): 3.2 Mean Score (#2/25)

NEW #1 LEADERS (11)
  - LLM Chess (Saplin) (ELO): GPT-5.5 (Medium) (1532.2) beat Gemini 3.1 Pro (1511.4) by 20.8
  - LLM Stats (PolyMATH) (Sc</summary></entry><entry><title>AI Benchmark Digest — 2026-05-25</title><id>https://aibenchmarks.dev/digest/2026-05-25</id><updated>2026-05-25T08:26:35.093083+00:00</updated><link href="https://aibenchmarks.dev/#/digest" /><author><name>AI Benchmark Hub</name></author><content type="html">&lt;h2&gt;Daily&lt;/h2&gt;
&lt;h3&gt;New Benchmarks (6)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;LLMEval-Logic Base&lt;/strong&gt; (Accuracy (%)): leader Seed 2.0 Pro (Thinking) (75.5), 14 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLMEval-Logic Hard&lt;/strong&gt; (Accuracy (%)): leader Gemini 3.1 Pro (Thinking) (37.5), 14 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLMEval-Logic Hard Sub-Q&lt;/strong&gt; (Accuracy (%)): leader Claude Opus 4.6 (Thinking) (76.6), 14 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLMEval-Logic Formalization Free&lt;/strong&gt; (Accuracy (%)): leader Gemini 3.1 Pro (Thinking) (45.1), 14 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLMEval-Logic Formalization Fixed&lt;/strong&gt; (Accuracy (%)): leader GPT-5.4 Pro (No-Think) (60.2), 14 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;ExploitBench v8-bench&lt;/strong&gt; (Mean Capability (%)): leader Claude Mythos Preview (69.0), 9 models&lt;br&gt;&lt;span&gt;V8 exploitation ladder benchmark measuring how far AI agents climb from code reachability through crash reproduction, exploit primitives, and arbitrary code execution. Reports mean capability across 41 V8 bug environments.&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;</content><summary>AI Benchmark Digest — 2026-05-25

=== DAILY ===
NEW BENCHMARKS (6)
  - LLMEval-Logic Base (Accuracy (%)): leader Seed 2.0 Pro (Thinking) (75.5), 14 models
  - LLMEval-Logic Hard (Accuracy (%)): leader Gemini 3.1 Pro (Thinking) (37.5), 14 models
  - LLMEval-Logic Hard Sub-Q (Accuracy (%)): leader Cla</summary></entry><entry><title>AI Benchmark Digest — 2026-05-24</title><id>https://aibenchmarks.dev/digest/2026-05-24</id><updated>2026-05-24T07:56:34.401567+00:00</updated><link href="https://aibenchmarks.dev/#/digest" /><author><name>AI Benchmark Hub</name></author><content type="html">&lt;h2&gt;Daily&lt;/h2&gt;
&lt;h3&gt;New Benchmarks (14)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;NanoGPT-Bench&lt;/strong&gt; (% of Human Progress Recovered): leader Claude Opus 4.6 (9.3), 2 models&lt;br&gt;&lt;span&gt;Autonomous research benchmark built on the NanoGPT Speedrun, measuring how much of five months of human pretraining-speedup progress coding agents recover under a fixed H100 compute budget.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;CursorBench 3.1&lt;/strong&gt; (Score (%)): leader Claude Opus 4.7 (64.8), 7 models&lt;br&gt;&lt;span&gt;Cursor benchmark of ambiguous, multi-file coding tasks from real Cursor sessions, with models scored by task success percentage and average cost per task.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;SMDD-Bench&lt;/strong&gt; (Pass Rate (%)): leader GPT-5.4 (Medium) (40.2), 7 models&lt;br&gt;&lt;span&gt;Small molecule drug design agent benchmark with sandboxed Python, Boltz structure prediction, and ADMET tooling. Measures pass rate across 502 computationally verifiable chemistry tasks.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;SMDD-Bench Diversity&lt;/strong&gt; (Avg Successful): leader Claude Sonnet 4.6 (8.4), 7 models&lt;br&gt;&lt;span&gt;SMDD-Bench diversity slice measuring whether agents generate multiple distinct, novel, successful molecule designs across repeated Lead Optimization rollouts.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Blueprint-Bench 2&lt;/strong&gt; (Connectivity Similarity Score): leader GPT 5.5 (0.362), 12 models&lt;br&gt;&lt;span&gt;Andon Labs spatial reasoning benchmark where agents convert apartment photographs into 2D floor plans, scored by normalized connectivity similarity against ground truth layouts.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;PACT (Lechmazur)&lt;/strong&gt; (CMS Points): leader GPT-5.5 (high) (59.0), 25 models&lt;br&gt;&lt;span&gt;Pairwise Auction Conversation Testbed for multi-round buyer-seller bargaining. 
LLMs negotiate over 20 rounds with hidden private values, scored by Composite Model Score from head-to-head surplus capture.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;FormationEval&lt;/strong&gt; (Accuracy (%)): leader gemini-3-pro-preview (99.8), 72 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Chinese Classical Bench&lt;/strong&gt; (Average Score (%)): leader claude-opus-4-7 (66.21), 10 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Chinese Classical Bench - Translate Judge&lt;/strong&gt; (Score (%)): leader claude-opus-4-7-thinking (80.2), 10 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Chinese Classical Bench - Punctuate Punct F1&lt;/strong&gt; (Score (%)): leader claude-opus-4-7 (80.02), 10 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Chinese Classical Bench - Char-Gloss Judge&lt;/strong&gt; (Score (%)): leader claude-opus-4-7-thinking (73.6), 10 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Chinese Classical Bench - Idiom-Source Book EM&lt;/strong&gt; (Score (%)): leader deepseek-3.2 (74.0), 10 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Chinese Classical Bench - Fill-In Exact&lt;/strong&gt; (Score (%)): leader claude-opus-4-7-thinking (88.0), 10 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Chinese Classical Bench - Compress Efficiency&lt;/strong&gt; (Score (%)): leader deepseek-3.2 (16.32), 9 models&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;Top-10 New Scores (1)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.1 Pro (High)&lt;/strong&gt; on CLBench: 20.8 (#8)&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;New #1 Leaders (5)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language&lt;/strong&gt;: Gemini 3.1 Pro (69.11) beat Gemini 2.5 Flash by 6.52&lt;/li&gt;&lt;li&gt;&lt;strong&gt;CLBench&lt;/strong&gt;: GPT-5.4 (xHigh) (27.9) beat GPT-5.1 (High) by 4.2&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LiveBench Logic With Navigation&lt;/strong&gt;: Qwen Max (84.0) beat Claude Opus 4.6 (Thinking) by 4.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Spider 2.0-Lite&lt;/strong&gt;: DivSkill-SQL (73.13) beat SOMA-SQL by 1.11&lt;/li&gt;&lt;li&gt;&lt;strong&gt;PinchBench&lt;/strong&gt;: Grok 0.1 (92.07) beat Claude Opus 4.7 by 0.49&lt;/li&gt;&lt;/ul&gt;</content><summary>AI Benchmark Digest — 2026-05-24

=== DAILY ===
NEW BENCHMARKS (14)
  - NanoGPT-Bench (% of Human Progress Recovered): leader Claude Opus 4.6 (9.3), 2 models
      Autonomous research benchmark built on the NanoGPT Speedrun, measuring how much of five months of human pretraining-speedup progress cod</summary></entry><entry><title>AI Benchmark Digest — 2026-05-23</title><id>https://aibenchmarks.dev/digest/2026-05-23</id><updated>2026-05-23T07:20:10.541511+00:00</updated><link href="https://aibenchmarks.dev/#/digest" /><author><name>AI Benchmark Hub</name></author><content type="html">&lt;h2&gt;Daily&lt;/h2&gt;
&lt;h3&gt;New #1 Leaders (1)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;OSWorld&lt;/strong&gt;: Opus 4.7 (83.64) beat Holo3-35B-A3B by 1.08&lt;/li&gt;&lt;/ul&gt;</content><summary>AI Benchmark Digest — 2026-05-23

=== DAILY ===
NEW #1 LEADERS (1)
  - OSWorld (Success Rate (%)): Opus 4.7 (83.64) beat Holo3-35B-A3B (82.56) by 1.08
</summary></entry><entry><title>AI Benchmark Digest — 2026-05-22</title><id>https://aibenchmarks.dev/digest/2026-05-22</id><updated>2026-05-22T07:36:15.662013+00:00</updated><link href="https://aibenchmarks.dev/#/digest" /><author><name>AI Benchmark Hub</name></author><content type="html">&lt;h2&gt;Daily&lt;/h2&gt;
&lt;h3&gt;Top-10 New Scores (1)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5 (High)&lt;/strong&gt; on Sycophancy (Lechmazur): 3.5 (#11)&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;New #1 Leaders (2)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;UGI - Writing&lt;/strong&gt;: gemini-3.5-flash (thinking_level=medium) (72.54) beat gemini-3.1-pro-preview (thinking_level=low) by 0.39&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Arabic Broad Leaderboard&lt;/strong&gt;: gemini-3.5-flash (9.253) beat gemini-3-pro-preview by 0.05&lt;/li&gt;&lt;/ul&gt;</content><summary>AI Benchmark Digest — 2026-05-22

=== DAILY ===
NEW SCORES FROM TOP-10 MODELS (1)
  - GPT-5.5 (High) on Sycophancy (Lechmazur): 3.5 Sycophancy rate % (lower is better) (#11/31)

NEW #1 LEADERS (2)
  - UGI - Writing (Writing Score): gemini-3.5-flash (thinking_level=medium) (72.54) beat gemini-3.1-pro</summary></entry><entry><title>AI Benchmark Digest — 2026-05-21</title><id>https://aibenchmarks.dev/digest/2026-05-21</id><updated>2026-05-21T07:40:34.045646+00:00</updated><link href="https://aibenchmarks.dev/#/digest" /><author><name>AI Benchmark Hub</name></author><content type="html">&lt;h2&gt;Daily&lt;/h2&gt;
&lt;h3&gt;Top-10 New Scores (1)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on WeirdML: 62.64 (#17)&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;New #1 Leaders (3)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Kaggle Game Arena Poker (Heads Up)&lt;/strong&gt;: GPT-5.5 (73.93) beat GPT-5.2 by 33.93&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA APEX-Agents&lt;/strong&gt;: Gemini 3.5 Flash (high) (47.05) beat GPT-5.5 (xhigh) by 9.37&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LA Leaderboard&lt;/strong&gt;: Qwen2.5-14B-Instruct-GPTQ-Int8 (63.6) beat gemma-2-9b-it by 0.27&lt;/li&gt;&lt;/ul&gt;</content><summary>AI Benchmark Digest — 2026-05-21

=== DAILY ===
NEW SCORES FROM TOP-10 MODELS (1)
  - Gemini 3.5 Flash (High) on WeirdML: 62.64 Average Score (#17/124)

NEW #1 LEADERS (3)
  - Kaggle Game Arena Poker (Heads Up) (Mean BB/100): GPT-5.5 (73.93) beat GPT-5.2 (40.0) by 33.93
  - AA APEX-Agents (Pass@1 (%</summary></entry><entry><title>AI Benchmark Digest — 2026-05-20</title><id>https://aibenchmarks.dev/digest/2026-05-20</id><updated>2026-05-20T07:43:37.557151+00:00</updated><link href="https://aibenchmarks.dev/#/digest" /><author><name>AI Benchmark Hub</name></author><content type="html">&lt;h2&gt;Daily&lt;/h2&gt;
&lt;h3&gt;New Models (1)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; — ELO 1942, #9&lt;ul&gt;&lt;li&gt;AA MMMU-Pro: 84.28 (#1/190)&lt;/li&gt;&lt;li&gt;SEAL - MCP Atlas: 83.6 (#1/21)&lt;/li&gt;&lt;li&gt;AA Omniscience: 22.68 (#3/393)&lt;/li&gt;&lt;li&gt;AA Omniscience - Law: 57.4 (#4/393)&lt;/li&gt;&lt;li&gt;AA Omniscience - Software Engineering (SWE) - PHP: 84.0 (#4/393)&lt;/li&gt;&lt;li&gt;AA Humanity's Last Exam: 40.96 (#5/484)&lt;/li&gt;&lt;li&gt;AA GPQA Diamond: 92.22 (#6/488)&lt;/li&gt;&lt;li&gt;AA Omniscience - Science, Engineering &amp; Mathematics: 50.1 (#6/393)&lt;/li&gt;&lt;li&gt;AA GDPval: 1655.7 (#7/365)&lt;/li&gt;&lt;li&gt;AA Omniscience - Humanities &amp; Social Sciences: 52.3 (#7/393)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;Top-10 New Scores (34)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5 (High)&lt;/strong&gt; on Multi-turn Debate (Lechmazur): 1583.6 (#5)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on AA CritPt: 13.14 (#8)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on AA GDPval: 1655.7 (#7)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on AA GPQA Diamond: 92.22 (#6)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on AA Humanity's Last Exam: 40.96 (#5)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on AA IFBench: 76.33 (#17)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on AA Long Context Reasoning: 69.33 (#27)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on AA Omniscience: 22.68 (#3)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on AA Omniscience - Business: 45.8 (#8)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on AA Omniscience - Health: 40.2 (#14)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on AA Omniscience - Humanities &amp; Social Sciences: 52.3 (#7)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on AA Omniscience - Law: 57.4 (#4)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on AA Omniscience - Science, Engineering &amp; Mathematics: 50.1 (#6)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on AA Omniscience - Software Engineering (SWE): 65.5 (#16)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on AA Omniscience - Software Engineering (SWE) - C: 80.0 (#18)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on AA Omniscience - Software Engineering (SWE) - Dart: 60.0 (#14)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on AA Omniscience - 
Software Engineering (SWE) - Go: 50.0 (#32)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on AA Omniscience - Software Engineering (SWE) - HTML: 72.0 (#17)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on AA Omniscience - Software Engineering (SWE) - Java: 51.0 (#16)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on AA Omniscience - Software Engineering (SWE) - JavaScript: 71.82 (#14)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on AA Omniscience - Software Engineering (SWE) - Julia: 60.0 (#13)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on AA Omniscience - Software Engineering (SWE) - Kotlin: 56.0 (#22)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on AA Omniscience - Software Engineering (SWE) - PHP: 84.0 (#4)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on AA Omniscience - Software Engineering (SWE) - Python: 61.0 (#24)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on AA Omniscience - Software Engineering (SWE) - R: 56.0 (#18)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on AA Omniscience - Software Engineering (SWE) - Rust: 80.0 (#8)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on AA Omniscience - Software Engineering (SWE) - Swift: 72.0 (#20)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on AA Omniscience - Software Engineering (SWE) - TypeScript: 67.78 (#16)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on AA SciCode: 53.12 (#11)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on AA TAU-2 Bench: 95.32 (#20)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on AA Terminal-Bench Hard: 40.91 (#36)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on ARC-AGI-1: 92.5 
(#16)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on ARC-AGI-2: 72.08 (#12)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on Artificial Analysis Intelligence Index: 55.33 (#8)&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;New #1 Leaders (5)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;LLM Stats (GDPval-AA)&lt;/strong&gt;: Gemini 3.5 Flash (165600.0) beat Claude Sonnet 4.6 by 2300.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLM Stats (MCP Atlas)&lt;/strong&gt;: Gemini 3.5 Flash (83.6) beat Claude Opus 4.7 by 6.3&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA MMMU-Pro&lt;/strong&gt;: Gemini 3.5 Flash (high) (84.28) beat Gemini 3.1 Pro Preview by 1.85&lt;/li&gt;&lt;li&gt;&lt;strong&gt;SEAL - MCP Atlas&lt;/strong&gt;: gemini-3.5-flash (high) (83.6) beat Muse Spark by 1.4&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLM Stats (Toolathlon)&lt;/strong&gt;: Gemini 3.5 Flash (56.5) beat GPT-5.5 by 0.9&lt;/li&gt;&lt;/ul&gt;</content><summary>AI Benchmark Digest — 2026-05-20

=== DAILY ===
NEW MODELS (1)
  - Gemini 3.5 Flash (High) — ELO 1942, #9/609 (above: Claude Opus 4.7 (Thinking), below: GPT-5.5 (High))
      AA MMMU-Pro: 84.28 (#1/190)
      SEAL - MCP Atlas: 83.6 (#1/21)
      AA Omniscience: 22.68 (#3/393)
      AA Omniscience - </summary></entry><entry><title>AI Benchmark Digest — 2026-05-17</title><id>https://aibenchmarks.dev/digest/2026-05-17</id><updated>2026-05-17T08:02:54.093472+00:00</updated><link href="https://aibenchmarks.dev/#/digest" /><author><name>AI Benchmark Hub</name></author><content type="html">&lt;h2&gt;Daily&lt;/h2&gt;
&lt;h3&gt;New #1 Leaders (1)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;OpenClawProBench&lt;/strong&gt;: intern-s2-preview (76.7) beat Sensenova 6.7 Flash Lite by 3.0&lt;/li&gt;&lt;/ul&gt;
&lt;hr/&gt;
&lt;h2&gt;Weekly&lt;/h2&gt;
&lt;h3&gt;Top-10 New Scores (3)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.7 (Thinking)&lt;/strong&gt; on SEAL Showdown: 1115.7 (#12)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.7 (Thinking)&lt;/strong&gt; on WeirdML: 75.45 (#8)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5 (xHigh)&lt;/strong&gt; on Chatbot Arena (Code): 1501.0 (#9)&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;New #1 Leaders (16)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;MathArena - ARXIVLEAN March&lt;/strong&gt;: AlephProver (34.15) beat Aristotle by 17.08&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenCompass Reasoning - Common&lt;/strong&gt;: Doubao-Seed-2-0-Pro-260215 (high) (82.1) beat Gemini-3-Pro-Preview by 8.5&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenCompass Math - College&lt;/strong&gt;: Doubao-Seed-2-0-Pro-260215 (high) (83.8) beat Kimi-K2.5 by 7.3&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenClawProBench&lt;/strong&gt;: intern-s2-preview (76.7) beat qwen3.5-397b-a17b by 6.3&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Tau3-Bench Banking_Knowledge&lt;/strong&gt;: GPT-5.5 (37.4) beat Distyl ButtonAgent by 6.2&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenCompass Knowledge - Social Science&lt;/strong&gt;: Gemini-3.1-Pro-Preview (97.5) beat Gemini-3-Pro-Preview by 4.3&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenCompass LLM - Math&lt;/strong&gt;: Doubao-Seed-2-0-Pro-260215 (high) (77.3) beat Qwen3-Max-2026-01-23 by 4.1&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenCompass LLM - Reasoning&lt;/strong&gt;: Doubao-Seed-2-0-Pro-260215 (high) (65.2) beat Gemini-3-Pro-Preview by 3.7&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenCompass Math - Competition&lt;/strong&gt;: Kimi-K2.6 (72.1) beat Qwen3-Max-2026-01-23 by 2.1&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenCompass Reasoning - Academic&lt;/strong&gt;: GPT-5.4-2026-03-05 (high) (52.0) beat GPT-5.2-2025-12-11 (high) by 1.5&lt;/li&gt;&lt;li&gt;&lt;strong&gt;VisuLogic&lt;/strong&gt;: PEREA-1.0new (52.8) beat Human by 1.4&lt;/li&gt;&lt;li&gt;&lt;strong&gt;WeirdML&lt;/strong&gt;: gpt-5.5 (xhigh) (84.91) beat gpt-5.5 (high) by 1.01&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GAIA&lt;/strong&gt;: Co-Sight Pro v1.0.1 (93.02) beat OPS-Agentic-Search by 0.66&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenCompass Knowledge - Engineering&lt;/strong&gt;: GPT-5.4-2026-03-05 (high) (96.2) beat Gemini-3-Pro-Preview by 0.4&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA TAU-2 Bench&lt;/strong&gt;: JT-35B-Flash 
(99.12) beat GLM-4.7-Flash (Reasoning) by 0.32&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AISI Cyber TLO 10M&lt;/strong&gt;: GPT-5.5 (10.0) beat Claude Opus 4.6 by 0.2&lt;/li&gt;&lt;/ul&gt;</content><summary>AI Benchmark Digest — 2026-05-17

=== DAILY ===
NEW #1 LEADERS (1)
  - OpenClawProBench (Overall Score (%)): intern-s2-preview (76.7) beat Sensenova 6.7 Flash Lite (73.7) by 3.0

=== WEEKLY ===
NEW SCORES FROM TOP-10 MODELS (3)
  - Claude Opus 4.7 (Thinking) on SEAL Showdown: 1115.7 Arena Score (#12</summary></entry><entry><title>AI Benchmark Digest — 2026-05-16</title><id>https://aibenchmarks.dev/digest/2026-05-16</id><updated>2026-05-16T07:15:27.727063+00:00</updated><link href="https://aibenchmarks.dev/#/digest" /><author><name>AI Benchmark Hub</name></author><content type="html">&lt;h2&gt;Daily&lt;/h2&gt;
&lt;h3&gt;Top-10 New Scores (1)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5 (xHigh)&lt;/strong&gt; on Chatbot Arena (Code): 1501.0 (#9)&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;New #1 Leaders (2)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;MathArena - ARXIVLEAN March&lt;/strong&gt;: AlephProver (34.15) beat Aristotle by 17.08&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GAIA&lt;/strong&gt;: Co-Sight Pro v1.0.1 (93.02) beat OPS-Agentic-Search by 0.66&lt;/li&gt;&lt;/ul&gt;</content><summary>AI Benchmark Digest — 2026-05-16

=== DAILY ===
NEW SCORES FROM TOP-10 MODELS (1)
  - GPT-5.5 (xHigh) on Chatbot Arena (Code): 1501.0 Elo (#9/79)

NEW #1 LEADERS (2)
  - MathArena - ARXIVLEAN March (Accuracy (%)): AlephProver (34.15) beat Aristotle (17.07) by 17.08
  - GAIA (Accuracy (%)): Co-Sight </summary></entry><entry><title>AI Benchmark Digest — 2026-05-14</title><id>https://aibenchmarks.dev/digest/2026-05-14</id><updated>2026-05-14T07:26:43.169192+00:00</updated><link href="https://aibenchmarks.dev/#/digest" /><author><name>AI Benchmark Hub</name></author><content type="html">&lt;h2&gt;Daily&lt;/h2&gt;
&lt;h3&gt;New Models (4)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Doubao-Seed-2-0-Pro-260215 (High)&lt;/strong&gt; — ELO 1781, #73&lt;ul&gt;&lt;li&gt;OpenCompass LLM - Reasoning: 65.2 (#1/23)&lt;/li&gt;&lt;li&gt;OpenCompass LLM - Math: 77.3 (#1/23)&lt;/li&gt;&lt;li&gt;OpenCompass Knowledge - Humanities: 95.0 (#1/23)&lt;/li&gt;&lt;li&gt;OpenCompass Reasoning - Common: 82.1 (#1/23)&lt;/li&gt;&lt;li&gt;OpenCompass Math - College: 83.8 (#1/23)&lt;/li&gt;&lt;li&gt;OpenCompass LLM - Language: 77.3 (#3/23)&lt;/li&gt;&lt;li&gt;OpenCompass Language - Creation: 77.1 (#3/23)&lt;/li&gt;&lt;li&gt;OpenCompass Knowledge - Science: 94.6 (#3/23)&lt;/li&gt;&lt;li&gt;OpenCompass LLM - Agent: 44.2 (#4/23)&lt;/li&gt;&lt;li&gt;OpenCompass Language - NLP: 69.6 (#4/23)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Doubao-Seed-2-0-Lite-260215 (High)&lt;/strong&gt; — ELO 1741, #103&lt;ul&gt;&lt;li&gt;OpenCompass Reasoning - Common: 78.1 (#2/23)&lt;/li&gt;&lt;li&gt;OpenCompass Language - Creation: 77.1 (#4/23)&lt;/li&gt;&lt;li&gt;OpenCompass LLM - Language: 74.4 (#6/23)&lt;/li&gt;&lt;li&gt;OpenCompass LLM - Agent: 42.4 (#6/23)&lt;/li&gt;&lt;li&gt;OpenCompass Agent - Tool Use: 42.4 (#6/23)&lt;/li&gt;&lt;li&gt;OpenCompass Knowledge - Science: 91.7 (#7/23)&lt;/li&gt;&lt;li&gt;OpenCompass LLM - Reasoning: 59.5 (#8/23)&lt;/li&gt;&lt;li&gt;OpenCompass Language - NLP: 67.1 (#8/23)&lt;/li&gt;&lt;li&gt;OpenCompass Language - Instruction Following: 72.5 (#8/23)&lt;/li&gt;&lt;li&gt;OpenCompass Math - College: 77.1 (#8/23)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Hy3-preview (High)&lt;/strong&gt; — ELO 1729, #110&lt;ul&gt;&lt;li&gt;OpenCompass Math - College: 81.3 (#3/23)&lt;/li&gt;&lt;li&gt;OpenCompass Language - Instruction Following: 76.0 (#4/23)&lt;/li&gt;&lt;li&gt;OpenCompass LLM - Math: 74.5 (#5/23)&lt;/li&gt;&lt;li&gt;OpenCompass Language - Creation: 75.4 (#5/23)&lt;/li&gt;&lt;li&gt;OpenCompass LLM - Language: 74.4 (#7/23)&lt;/li&gt;&lt;li&gt;OpenCompass Reasoning - 
Academic: 43.6 (#8/23)&lt;/li&gt;&lt;li&gt;OpenCompass LLM - Reasoning: 58.5 (#10/23)&lt;/li&gt;&lt;li&gt;OpenCompass Math - Competition: 67.6 (#10/23)&lt;/li&gt;&lt;li&gt;OpenCompass LLM - Agent: 28.7 (#12/23)&lt;/li&gt;&lt;li&gt;OpenCompass Reasoning - Common: 73.5 (#12/23)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Ring-2.5-1T&lt;/strong&gt; — ELO 1711, #119&lt;ul&gt;&lt;li&gt;OpenCompass Knowledge - Social Science: 92.9 (#5/23)&lt;/li&gt;&lt;li&gt;OpenCompass Language - NLP: 65.4 (#11/23)&lt;/li&gt;&lt;li&gt;OpenCompass Language - Creation: 68.8 (#12/23)&lt;/li&gt;&lt;li&gt;OpenCompass Knowledge - Humanities: 90.0 (#12/23)&lt;/li&gt;&lt;li&gt;OpenCompass LLM - Agent: 25.0 (#13/23)&lt;/li&gt;&lt;li&gt;OpenCompass Math - College: 75.0 (#13/23)&lt;/li&gt;&lt;li&gt;OpenCompass Agent - Tool Use: 25.0 (#13/23)&lt;/li&gt;&lt;li&gt;OpenCompass LLM - Knowledge: 89.4 (#14/23)&lt;/li&gt;&lt;li&gt;OpenCompass Knowledge - Engineering: 90.8 (#14/23)&lt;/li&gt;&lt;li&gt;OpenCompass LLM - Language: 69.8 (#15/23)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;Top-10 New Scores (1)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.7 (Thinking)&lt;/strong&gt; on WeirdML: 75.45 (#8)&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;New #1 Leaders (9)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;OpenCompass Reasoning - Common&lt;/strong&gt;: Doubao-Seed-2-0-Pro-260215 (high) (82.1) beat Gemini-3-Pro-Preview by 8.5&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenCompass Math - College&lt;/strong&gt;: Doubao-Seed-2-0-Pro-260215 (high) (83.8) beat Kimi-K2.5 by 7.3&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Tau3-Bench Banking_Knowledge&lt;/strong&gt;: GPT-5.5 (37.4) beat Distyl ButtonAgent by 6.2&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenCompass Knowledge - Social Science&lt;/strong&gt;: Gemini-3.1-Pro-Preview (97.5) beat Gemini-3-Pro-Preview by 4.3&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenCompass LLM - Math&lt;/strong&gt;: Doubao-Seed-2-0-Pro-260215 (high) (77.3) beat Qwen3-Max-2026-01-23 by 4.1&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenCompass LLM - Reasoning&lt;/strong&gt;: Doubao-Seed-2-0-Pro-260215 (high) (65.2) beat Gemini-3-Pro-Preview by 3.7&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenCompass Math - Competition&lt;/strong&gt;: Kimi-K2.6 (72.1) beat Qwen3-Max-2026-01-23 by 2.1&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenCompass Reasoning - Academic&lt;/strong&gt;: GPT-5.4-2026-03-05 (high) (52.0) beat GPT-5.2-2025-12-11 (high) by 1.5&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenCompass Knowledge - Engineering&lt;/strong&gt;: GPT-5.4-2026-03-05 (high) (96.2) beat Gemini-3-Pro-Preview by 0.4&lt;/li&gt;&lt;/ul&gt;</content><summary>AI Benchmark Digest — 2026-05-14

=== DAILY ===
NEW MODELS (4)
  - Doubao-Seed-2-0-Pro-260215 (High) — ELO 1781, #73/796 (above: GPT-5.2 (Low), below: GLM-5-Turbo)
      OpenCompass LLM - Reasoning: 65.2 (#1/23)
      OpenCompass LLM - Math: 77.3 (#1/23)
      OpenCompass Knowledge - Humanities: 95.</summary></entry><entry><title>AI Benchmark Digest — 2026-05-13</title><id>https://aibenchmarks.dev/digest/2026-05-13</id><updated>2026-05-13T07:29:12.582080+00:00</updated><link href="https://aibenchmarks.dev/#/digest" /><author><name>AI Benchmark Hub</name></author><content type="html">&lt;h2&gt;Daily&lt;/h2&gt;
&lt;h3&gt;New Benchmarks (2)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;ProgramBench&lt;/strong&gt; (Resolved (%)): leader GPT-5.5 (xHigh) (0.5), 13 models&lt;br&gt;&lt;span&gt;Meta and Stanford benchmark testing whether language-model agents can rebuild complete programs from only a compiled binary and documentation. Agents use mini-SWE-agent across 200 open-source program recreation tasks and are scored by hidden behavioral tests.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;ProgramBench Almost&lt;/strong&gt; (Almost (%)): leader GPT-5.5 (xHigh) (13.5), 13 models&lt;br&gt;&lt;span&gt;Companion ProgramBench metric that counts near-complete program recreations: tasks where the generated implementation passes most hidden behavioral tests but does not fully resolve the benchmark task.&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;New Models (1)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;JT-35B-Flash&lt;/strong&gt; — ELO 1693, #141&lt;ul&gt;&lt;li&gt;AA TAU-2 Bench: 99.1 (#1/405)&lt;/li&gt;&lt;li&gt;AA Omniscience - Software Engineering (SWE) - Go: 36.0 (#50/391)&lt;/li&gt;&lt;li&gt;AA Omniscience - Software Engineering (SWE) - Java: 29.0 (#58/391)&lt;/li&gt;&lt;li&gt;AA Omniscience - Software Engineering (SWE) - HTML: 48.0 (#60/391)&lt;/li&gt;&lt;li&gt;AA Omniscience - Software Engineering (SWE) - JavaScript: 41.82 (#75/391)&lt;/li&gt;&lt;li&gt;AA GPQA Diamond: 82.9 (#76/486)&lt;/li&gt;&lt;li&gt;AA Omniscience - Software Engineering (SWE) - C: 53.0 (#78/391)&lt;/li&gt;&lt;li&gt;AA Omniscience - Software Engineering (SWE) - PHP: 38.0 (#79/391)&lt;/li&gt;&lt;li&gt;AA Omniscience - Software Engineering (SWE) - TypeScript: 36.67 (#82/391)&lt;/li&gt;&lt;li&gt;AA Omniscience - Software Engineering (SWE): 35.0 (#83/391)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;Top-10 New Scores (1)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5 (xHigh)&lt;/strong&gt; on WeirdML: 84.91 (#1)&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;New #1 Leaders (2)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;WeirdML&lt;/strong&gt;: gpt-5.5 (xhigh) (84.91) beat gpt-5.5 (high) by 1.01&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA TAU-2 Bench&lt;/strong&gt;: JT-35B-Flash (99.1) beat GLM-4.7-Flash (Reasoning) by 0.3&lt;/li&gt;&lt;/ul&gt;</content><summary>AI Benchmark Digest — 2026-05-13

=== DAILY ===
NEW BENCHMARKS (2)
  - ProgramBench (Resolved (%)): leader GPT-5.5 (xHigh) (0.5), 13 models
      Meta and Stanford benchmark testing whether language-model agents can rebuild complete programs from only a compiled binary and documentation. Agents use </summary></entry><entry><title>AI Benchmark Digest — 2026-05-11</title><id>https://aibenchmarks.dev/digest/2026-05-11</id><updated>2026-05-11T08:08:39.844852+00:00</updated><link href="https://aibenchmarks.dev/#/digest" /><author><name>AI Benchmark Hub</name></author><content type="html">&lt;h2&gt;Daily&lt;/h2&gt;
&lt;h3&gt;New #1 Leaders (2)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;OpenClawProBench&lt;/strong&gt;: Sensenova 6.7 Flash Lite (73.7) beat qwen3.5-397b-a17b by 3.3&lt;/li&gt;&lt;li&gt;&lt;strong&gt;VisuLogic&lt;/strong&gt;: PEREA-1.0new (52.8) beat Human by 1.4&lt;/li&gt;&lt;/ul&gt;</content><summary>AI Benchmark Digest — 2026-05-11

=== DAILY ===
NEW #1 LEADERS (2)
  - OpenClawProBench (Overall Score (%)): Sensenova 6.7 Flash Lite (73.7) beat qwen3.5-397b-a17b (70.4) by 3.3
  - VisuLogic (Overall Accuracy (%)): PEREA-1.0new (52.8) beat Human (51.4) by 1.4
</summary></entry><entry><title>AI Benchmark Digest — 2026-05-10</title><id>https://aibenchmarks.dev/digest/2026-05-10</id><updated>2026-05-10T07:49:15.895022+00:00</updated><link href="https://aibenchmarks.dev/#/digest" /><author><name>AI Benchmark Hub</name></author><content type="html">&lt;h2&gt;Daily&lt;/h2&gt;
&lt;h3&gt;New Benchmarks (43)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;AA Global-MMLU-Lite - Arabic&lt;/strong&gt; (Accuracy (%)): leader Gemini 3.1 Pro Preview (93.0), 119 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Global-MMLU-Lite - Bengali&lt;/strong&gt; (Accuracy (%)): leader Gemini 3.1 Pro Preview (92.17), 119 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Global-MMLU-Lite - German&lt;/strong&gt; (Accuracy (%)): leader Gemini 3.1 Pro Preview (94.75), 119 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Global-MMLU-Lite - English&lt;/strong&gt; (Accuracy (%)): leader Claude Opus 4.6 (Adaptive Reasoning, Max Effort) (95.17), 120 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Global-MMLU-Lite - Spanish&lt;/strong&gt; (Accuracy (%)): leader Gemini 3.1 Pro Preview (94.42), 118 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Global-MMLU-Lite - French&lt;/strong&gt; (Accuracy (%)): leader Gemini 3.1 Pro Preview (93.67), 118 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Global-MMLU-Lite - Hindi&lt;/strong&gt; (Accuracy (%)): leader Gemini 3.1 Pro Preview (93.67), 117 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Global-MMLU-Lite - Indonesian&lt;/strong&gt; (Accuracy (%)): leader Gemini 3.1 Pro Preview (93.67), 118 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Global-MMLU-Lite - Italian&lt;/strong&gt; (Accuracy (%)): leader Gemini 3.1 Pro Preview (94.58), 117 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Global-MMLU-Lite - Japanese&lt;/strong&gt; (Accuracy (%)): leader Gemini 3.1 Pro Preview (93.67), 116 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Global-MMLU-Lite - Korean&lt;/strong&gt; (Accuracy (%)): leader Claude Opus 4.6 (Adaptive Reasoning, Max Effort) (93.0), 116 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Global-MMLU-Lite - Burmese&lt;/strong&gt; (Accuracy (%)): leader Gemini 3.1 Pro Preview (91.17), 111 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Global-MMLU-Lite - Portuguese&lt;/strong&gt; (Accuracy (%)): leader Gemini 3.1 Pro Preview (94.25), 113 
models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Global-MMLU-Lite - Swahili&lt;/strong&gt; (Accuracy (%)): leader Gemini 3.1 Pro Preview (92.33), 112 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Global-MMLU-Lite - Yoruba&lt;/strong&gt; (Accuracy (%)): leader Gemini 3.1 Pro Preview (88.75), 112 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Global-MMLU-Lite - Chinese&lt;/strong&gt; (Accuracy (%)): leader Claude Opus 4.6 (Adaptive Reasoning, Max Effort) (93.58), 113 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Business&lt;/strong&gt; (Accuracy (%)): leader GPT-5.5 (xhigh) (49.1), 388 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Health&lt;/strong&gt; (Accuracy (%)): leader GPT-5.5 (medium) (48.8), 388 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Humanities &amp; Social Sciences&lt;/strong&gt; (Accuracy (%)): leader Gemini 3 Pro Preview (high) (56.6), 388 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Law&lt;/strong&gt; (Accuracy (%)): leader Gemini 3 Pro Preview (high) (64.3), 388 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Science, Engineering &amp; Mathematics&lt;/strong&gt; (Accuracy (%)): leader GPT-5.5 (high) (52.3), 388 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE)&lt;/strong&gt; (Accuracy (%)): leader GPT-5.5 (xhigh) (84.4), 388 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE) - C&lt;/strong&gt; (Accuracy (%)): leader GPT-5.5 (high) (92.0), 388 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE) - Dart&lt;/strong&gt; (Accuracy (%)): leader GPT-5.3 Codex (xhigh) (80.0), 388 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE) - Go&lt;/strong&gt; (Accuracy (%)): leader GPT-5.5 (high) (84.0), 388 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE) - HTML&lt;/strong&gt; (Accuracy (%)): leader GPT-5.5 (medium) (90.0), 388 
models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE) - Java&lt;/strong&gt; (Accuracy (%)): leader GPT-5.3 Codex (xhigh) (73.0), 388 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE) - JavaScript&lt;/strong&gt; (Accuracy (%)): leader GPT-5.3 Codex (xhigh) (90.91), 388 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE) - Julia&lt;/strong&gt; (Accuracy (%)): leader GPT-5.4 (low) (88.0), 388 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE) - Kotlin&lt;/strong&gt; (Accuracy (%)): leader GPT-5.3 Codex (xhigh) (90.0), 388 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE) - PHP&lt;/strong&gt; (Accuracy (%)): leader GPT-5.5 (medium) (92.0), 388 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE) - Python&lt;/strong&gt; (Accuracy (%)): leader GPT-5.5 (xhigh) (90.5), 388 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE) - R&lt;/strong&gt; (Accuracy (%)): leader GPT-5.5 (medium) (74.0), 388 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE) - Rust&lt;/strong&gt; (Accuracy (%)): leader GPT-5.5 (xhigh) (92.0), 388 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE) - Swift&lt;/strong&gt; (Accuracy (%)): leader GPT-5.5 (xhigh) (92.0), 388 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE) - TypeScript&lt;/strong&gt; (Accuracy (%)): leader GPT-5.5 (xhigh) (91.11), 388 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;EuroEval Albanian NLU - MMS SQ&lt;/strong&gt; (Sentiment classification Score (%)): leader gemini-3-flash-preview#no-thinking (32.13), 196 models&lt;br&gt;&lt;span&gt;EuroEval Albanian NLU task column for the MMS SQ dataset, measuring sentiment classification from the public albanian_nlu.csv 
leaderboard.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;EuroEval Albanian NLU - WikiANN SQ&lt;/strong&gt; (Named entity recognition Score (%)): leader multilingual-e5-large (86.6), 200 models&lt;br&gt;&lt;span&gt;EuroEval Albanian NLU task column for the WikiANN SQ dataset, measuring named entity recognition from the public albanian_nlu.csv leaderboard.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;EuroEval Albanian NLU - ScaLA SQ&lt;/strong&gt; (Linguistic acceptability Score (%)): leader gemini-3.1-pro-preview (78.55), 166 models&lt;br&gt;&lt;span&gt;EuroEval Albanian NLU task column for the ScaLA SQ dataset, measuring linguistic acceptability from the public albanian_nlu.csv leaderboard.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;EuroEval Albanian NLU - MultiWikiQA SQ&lt;/strong&gt; (Reading comprehension Score (%)): leader Qwen3.5-9B-Base (70.8), 200 models&lt;br&gt;&lt;span&gt;EuroEval Albanian NLU task column for the MultiWikiQA SQ dataset, measuring reading comprehension from the public albanian_nlu.csv leaderboard.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;EuroEval Bosnian NLU - MMS BS&lt;/strong&gt; (Sentiment classification Score (%)): leader gpt-4.1-mini-2025-04-14 (56.43), 208 models&lt;br&gt;&lt;span&gt;EuroEval Bosnian NLU task column for the MMS BS dataset, measuring sentiment classification from the public bosnian_nlu.csv leaderboard.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;EuroEval Bosnian NLU - WikiANN BS&lt;/strong&gt; (Named entity recognition Score (%)): leader multilingual-e5-large (84.87), 212 models&lt;br&gt;&lt;span&gt;EuroEval Bosnian NLU task column for the WikiANN BS dataset, measuring named entity recognition from the public bosnian_nlu.csv leaderboard.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;EuroEval Bosnian NLU - Multi Wiki QA BS&lt;/strong&gt; (Reading comprehension Score (%)): leader Olmo-3-1125-32B (78.64), 211 models&lt;br&gt;&lt;span&gt;EuroEval Bosnian NLU task column for the Multi Wiki QA BS dataset, 
measuring reading comprehension from the public bosnian_nlu.csv leaderboard.&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;
&lt;hr/&gt;
&lt;h2&gt;Weekly&lt;/h2&gt;
&lt;h3&gt;New Benchmarks (43)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;AA Global-MMLU-Lite - Arabic&lt;/strong&gt; (Accuracy (%)): leader Gemini 3.1 Pro Preview (93.0), 119 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Global-MMLU-Lite - Bengali&lt;/strong&gt; (Accuracy (%)): leader Gemini 3.1 Pro Preview (92.17), 119 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Global-MMLU-Lite - German&lt;/strong&gt; (Accuracy (%)): leader Gemini 3.1 Pro Preview (94.75), 119 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Global-MMLU-Lite - English&lt;/strong&gt; (Accuracy (%)): leader Claude Opus 4.6 (Adaptive Reasoning, Max Effort) (95.17), 120 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Global-MMLU-Lite - Spanish&lt;/strong&gt; (Accuracy (%)): leader Gemini 3.1 Pro Preview (94.42), 118 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Global-MMLU-Lite - French&lt;/strong&gt; (Accuracy (%)): leader Gemini 3.1 Pro Preview (93.67), 118 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Global-MMLU-Lite - Hindi&lt;/strong&gt; (Accuracy (%)): leader Gemini 3.1 Pro Preview (93.67), 117 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Global-MMLU-Lite - Indonesian&lt;/strong&gt; (Accuracy (%)): leader Gemini 3.1 Pro Preview (93.67), 118 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Global-MMLU-Lite - Italian&lt;/strong&gt; (Accuracy (%)): leader Gemini 3.1 Pro Preview (94.58), 117 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Global-MMLU-Lite - Japanese&lt;/strong&gt; (Accuracy (%)): leader Gemini 3.1 Pro Preview (93.67), 116 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Global-MMLU-Lite - Korean&lt;/strong&gt; (Accuracy (%)): leader Claude Opus 4.6 (Adaptive Reasoning, Max Effort) (93.0), 116 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Global-MMLU-Lite - Burmese&lt;/strong&gt; (Accuracy (%)): leader Gemini 3.1 Pro Preview (91.17), 111 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Global-MMLU-Lite - Portuguese&lt;/strong&gt; (Accuracy (%)): leader Gemini 3.1 Pro Preview (94.25), 113 
models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Global-MMLU-Lite - Swahili&lt;/strong&gt; (Accuracy (%)): leader Gemini 3.1 Pro Preview (92.33), 112 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Global-MMLU-Lite - Yoruba&lt;/strong&gt; (Accuracy (%)): leader Gemini 3.1 Pro Preview (88.75), 112 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Global-MMLU-Lite - Chinese&lt;/strong&gt; (Accuracy (%)): leader Claude Opus 4.6 (Adaptive Reasoning, Max Effort) (93.58), 113 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Business&lt;/strong&gt; (Accuracy (%)): leader GPT-5.5 (xhigh) (49.1), 388 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Health&lt;/strong&gt; (Accuracy (%)): leader GPT-5.5 (medium) (48.8), 388 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Humanities &amp; Social Sciences&lt;/strong&gt; (Accuracy (%)): leader Gemini 3 Pro Preview (high) (56.6), 388 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Law&lt;/strong&gt; (Accuracy (%)): leader Gemini 3 Pro Preview (high) (64.3), 388 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Science, Engineering &amp; Mathematics&lt;/strong&gt; (Accuracy (%)): leader GPT-5.5 (high) (52.3), 388 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE)&lt;/strong&gt; (Accuracy (%)): leader GPT-5.5 (xhigh) (84.4), 388 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE) - C&lt;/strong&gt; (Accuracy (%)): leader GPT-5.5 (high) (92.0), 388 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE) - Dart&lt;/strong&gt; (Accuracy (%)): leader GPT-5.3 Codex (xhigh) (80.0), 388 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE) - Go&lt;/strong&gt; (Accuracy (%)): leader GPT-5.5 (high) (84.0), 388 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE) - HTML&lt;/strong&gt; (Accuracy (%)): leader GPT-5.5 (medium) (90.0), 388 
models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE) - Java&lt;/strong&gt; (Accuracy (%)): leader GPT-5.3 Codex (xhigh) (73.0), 388 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE) - JavaScript&lt;/strong&gt; (Accuracy (%)): leader GPT-5.3 Codex (xhigh) (90.91), 388 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE) - Julia&lt;/strong&gt; (Accuracy (%)): leader GPT-5.4 (low) (88.0), 388 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE) - Kotlin&lt;/strong&gt; (Accuracy (%)): leader GPT-5.3 Codex (xhigh) (90.0), 388 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE) - PHP&lt;/strong&gt; (Accuracy (%)): leader GPT-5.5 (medium) (92.0), 388 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE) - Python&lt;/strong&gt; (Accuracy (%)): leader GPT-5.5 (xhigh) (90.5), 388 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE) - R&lt;/strong&gt; (Accuracy (%)): leader GPT-5.5 (medium) (74.0), 388 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE) - Rust&lt;/strong&gt; (Accuracy (%)): leader GPT-5.5 (xhigh) (92.0), 388 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE) - Swift&lt;/strong&gt; (Accuracy (%)): leader GPT-5.5 (xhigh) (92.0), 388 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE) - TypeScript&lt;/strong&gt; (Accuracy (%)): leader GPT-5.5 (xhigh) (91.11), 388 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;EuroEval Albanian NLU - MMS SQ&lt;/strong&gt; (Sentiment classification Score (%)): leader gemini-3-flash-preview#no-thinking (32.13), 196 models&lt;br&gt;&lt;span&gt;EuroEval Albanian NLU task column for the MMS SQ dataset, measuring sentiment classification from the public albanian_nlu.csv 
leaderboard.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;EuroEval Albanian NLU - WikiANN SQ&lt;/strong&gt; (Named entity recognition Score (%)): leader multilingual-e5-large (86.6), 200 models&lt;br&gt;&lt;span&gt;EuroEval Albanian NLU task column for the WikiANN SQ dataset, measuring named entity recognition from the public albanian_nlu.csv leaderboard.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;EuroEval Albanian NLU - ScaLA SQ&lt;/strong&gt; (Linguistic acceptability Score (%)): leader gemini-3.1-pro-preview (78.55), 166 models&lt;br&gt;&lt;span&gt;EuroEval Albanian NLU task column for the ScaLA SQ dataset, measuring linguistic acceptability from the public albanian_nlu.csv leaderboard.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;EuroEval Albanian NLU - MultiWikiQA SQ&lt;/strong&gt; (Reading comprehension Score (%)): leader Qwen3.5-9B-Base (70.8), 200 models&lt;br&gt;&lt;span&gt;EuroEval Albanian NLU task column for the MultiWikiQA SQ dataset, measuring reading comprehension from the public albanian_nlu.csv leaderboard.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;EuroEval Bosnian NLU - MMS BS&lt;/strong&gt; (Sentiment classification Score (%)): leader gpt-4.1-mini-2025-04-14 (56.43), 208 models&lt;br&gt;&lt;span&gt;EuroEval Bosnian NLU task column for the MMS BS dataset, measuring sentiment classification from the public bosnian_nlu.csv leaderboard.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;EuroEval Bosnian NLU - WikiANN BS&lt;/strong&gt; (Named entity recognition Score (%)): leader multilingual-e5-large (84.87), 212 models&lt;br&gt;&lt;span&gt;EuroEval Bosnian NLU task column for the WikiANN BS dataset, measuring named entity recognition from the public bosnian_nlu.csv leaderboard.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;EuroEval Bosnian NLU - Multi Wiki QA BS&lt;/strong&gt; (Reading comprehension Score (%)): leader Olmo-3-1125-32B (78.64), 211 models&lt;br&gt;&lt;span&gt;EuroEval Bosnian NLU task column for the Multi Wiki QA BS dataset, 
measuring reading comprehension from the public bosnian_nlu.csv leaderboard.&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;New Models (38)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;GLM-5V Turbo (Reasoning)&lt;/strong&gt; — ELO 1738, #102&lt;ul&gt;&lt;li&gt;AA TAU-2 Bench: 98.5 (#3/402)&lt;/li&gt;&lt;li&gt;AA GDPval: 1330.87 (#43/360)&lt;/li&gt;&lt;li&gt;AA MMMU-Pro: 72.8 (#44/188)&lt;/li&gt;&lt;li&gt;AA SciCode: 43.5 (#52/477)&lt;/li&gt;&lt;li&gt;Artificial Analysis Intelligence Index: 42.85 (#56/482)&lt;/li&gt;&lt;li&gt;AA Terminal-Bench Hard: 32.6 (#79/397)&lt;/li&gt;&lt;li&gt;AA Omniscience: -18.98 (#80/388)&lt;/li&gt;&lt;li&gt;AA Long Context Reasoning: 61.0 (#84/411)&lt;/li&gt;&lt;li&gt;AA Humanity's Last Exam: 15.8 (#91/479)&lt;/li&gt;&lt;li&gt;AA GPQA Diamond: 80.9 (#96/483)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;ERNIE 5.0 Thinking Preview&lt;/strong&gt; — ELO 1631, #214&lt;ul&gt;&lt;li&gt;AA LiveCodeBench: 81.2 (#24/343)&lt;/li&gt;&lt;li&gt;AA Global-MMLU-Lite: 86.5 (#33/120)&lt;/li&gt;&lt;li&gt;AA AIME 2025: 85.0 (#46/269)&lt;/li&gt;&lt;li&gt;AA MMLU-Pro: 83.0 (#60/345)&lt;/li&gt;&lt;li&gt;AA CritPt: 1.4 (#68/388)&lt;/li&gt;&lt;li&gt;AA MMMU-Pro: 64.6 (#90/188)&lt;/li&gt;&lt;li&gt;AA TAU-2 Bench: 83.9 (#94/402)&lt;/li&gt;&lt;li&gt;AA Humanity's Last Exam: 12.7 (#116/479)&lt;/li&gt;&lt;li&gt;AA Terminal-Bench Hard: 25.0 (#119/397)&lt;/li&gt;&lt;li&gt;AA GPQA Diamond: 77.7 (#124/483)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;K-EXAONE (Reasoning)&lt;/strong&gt; — ELO 1603, #245&lt;ul&gt;&lt;li&gt;AA AIME 2025: 90.3 (#25/269)&lt;/li&gt;&lt;li&gt;AA LiveCodeBench: 76.8 (#41/343)&lt;/li&gt;&lt;li&gt;AA MMLU-Pro: 83.8 (#44/345)&lt;/li&gt;&lt;li&gt;AA CritPt: 1.1 (#76/388)&lt;/li&gt;&lt;li&gt;AA Global-MMLU-Lite: 78.86 (#80/120)&lt;/li&gt;&lt;li&gt;AA IFBench: 64.7 (#85/411)&lt;/li&gt;&lt;li&gt;AA Humanity's Last Exam: 13.1 (#111/479)&lt;/li&gt;&lt;li&gt;AA Long Context Reasoning: 55.7 (#117/411)&lt;/li&gt;&lt;li&gt;AA GPQA Diamond: 78.3 (#119/483)&lt;/li&gt;&lt;li&gt;AA TAU-2 Bench: 74.3 
(#121/402)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;EXAONE 4.5 33B&lt;/strong&gt; — ELO 1578, #277&lt;ul&gt;&lt;li&gt;AA MMMU-Pro: 67.3 (#77/188)&lt;/li&gt;&lt;li&gt;AA GPQA Diamond: 79.4 (#106/483)&lt;/li&gt;&lt;li&gt;AA IFBench: 58.0 (#107/411)&lt;/li&gt;&lt;li&gt;AA TAU-2 Bench: 78.1 (#112/402)&lt;/li&gt;&lt;li&gt;AA CritPt: 0.3 (#128/388)&lt;/li&gt;&lt;li&gt;AA Humanity's Last Exam: 11.6 (#131/479)&lt;/li&gt;&lt;li&gt;AA Terminal-Bench Hard: 20.5 (#144/397)&lt;/li&gt;&lt;li&gt;Artificial Analysis Intelligence Index: 30.23 (#147/482)&lt;/li&gt;&lt;li&gt;AA Long Context Reasoning: 49.3 (#150/411)&lt;/li&gt;&lt;li&gt;AA GDPval: 812.72 (#163/360)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;K2-V2 (High)&lt;/strong&gt; — ELO 1562, #294&lt;ul&gt;&lt;li&gt;AA AIME 2025: 78.3 (#71/269)&lt;/li&gt;&lt;li&gt;AA LiveCodeBench: 69.4 (#76/343)&lt;/li&gt;&lt;li&gt;AA Global-MMLU-Lite: 78.6 (#82/120)&lt;/li&gt;&lt;li&gt;AA IFBench: 60.1 (#102/411)&lt;/li&gt;&lt;li&gt;AA MMLU-Pro: 78.6 (#135/345)&lt;/li&gt;&lt;li&gt;AA Humanity's Last Exam: 9.8 (#157/479)&lt;/li&gt;&lt;li&gt;AA Long Context Reasoning: 33.3 (#211/411)&lt;/li&gt;&lt;li&gt;AA Terminal-Bench Hard: 9.8 (#212/397)&lt;/li&gt;&lt;li&gt;AA GPQA Diamond: 68.1 (#222/483)&lt;/li&gt;&lt;li&gt;Artificial Analysis Intelligence Index: 20.61 (#232/482)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Solar Open 100B (Reasoning)&lt;/strong&gt; — ELO 1555, #307&lt;ul&gt;&lt;li&gt;AA Global-MMLU-Lite: 81.58 (#61/120)&lt;/li&gt;&lt;li&gt;AA IFBench: 57.7 (#110/411)&lt;/li&gt;&lt;li&gt;AA Humanity's Last Exam: 9.2 (#170/479)&lt;/li&gt;&lt;li&gt;AA TAU-2 Bench: 48.2 (#180/402)&lt;/li&gt;&lt;li&gt;AA Long Context Reasoning: 36.0 (#195/411)&lt;/li&gt;&lt;li&gt;AA CritPt: 0.0 (#204/388)&lt;/li&gt;&lt;li&gt;AA GDPval: 666.33 (#207/360)&lt;/li&gt;&lt;li&gt;Artificial Analysis Intelligence Index: 21.67 (#224/482)&lt;/li&gt;&lt;li&gt;AA GPQA Diamond: 65.7 (#243/483)&lt;/li&gt;&lt;li&gt;AA Omniscience: -54.1 
(#262/388)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;JT-MINI&lt;/strong&gt; — ELO 1546, #324&lt;ul&gt;&lt;li&gt;AA TAU-2 Bench: 93.0 (#40/402)&lt;/li&gt;&lt;li&gt;AA Terminal-Bench Hard: 18.2 (#154/397)&lt;/li&gt;&lt;li&gt;AA GDPval: 831.97 (#157/360)&lt;/li&gt;&lt;li&gt;Artificial Analysis Intelligence Index: 25.37 (#187/482)&lt;/li&gt;&lt;li&gt;AA Humanity's Last Exam: 6.6 (#223/479)&lt;/li&gt;&lt;li&gt;AA GPQA Diamond: 67.6 (#225/483)&lt;/li&gt;&lt;li&gt;AA CritPt: 0.0 (#263/388)&lt;/li&gt;&lt;li&gt;AA IFBench: 36.7 (#277/411)&lt;/li&gt;&lt;li&gt;AA SciCode: 27.2 (#292/477)&lt;/li&gt;&lt;li&gt;AA Long Context Reasoning: 11.7 (#308/411)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;K2 Think V2&lt;/strong&gt; — ELO 1545, #328&lt;ul&gt;&lt;li&gt;AA IFBench: 62.8 (#94/411)&lt;/li&gt;&lt;li&gt;AA Omniscience: -33.92 (#125/388)&lt;/li&gt;&lt;li&gt;AA Long Context Reasoning: 52.7 (#135/411)&lt;/li&gt;&lt;li&gt;AA Humanity's Last Exam: 9.5 (#165/479)&lt;/li&gt;&lt;li&gt;AA GPQA Diamond: 71.3 (#192/483)&lt;/li&gt;&lt;li&gt;Artificial Analysis Intelligence Index: 24.12 (#201/482)&lt;/li&gt;&lt;li&gt;AA GDPval: 607.98 (#222/360)&lt;/li&gt;&lt;li&gt;AA SciCode: 33.0 (#223/477)&lt;/li&gt;&lt;li&gt;AA Terminal-Bench Hard: 6.8 (#240/397)&lt;/li&gt;&lt;li&gt;AA CritPt: 0.0 (#252/388)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;HyperCLOVA X SEED Think (32B)&lt;/strong&gt; — ELO 1537, #342&lt;ul&gt;&lt;li&gt;AA TAU-2 Bench: 87.4 (#68/402)&lt;/li&gt;&lt;li&gt;AA Global-MMLU-Lite: 78.6 (#83/120)&lt;/li&gt;&lt;li&gt;AA LiveCodeBench: 62.9 (#107/343)&lt;/li&gt;&lt;li&gt;AA AIME 2025: 59.0 (#118/269)&lt;/li&gt;&lt;li&gt;AA MMLU-Pro: 78.5 (#137/345)&lt;/li&gt;&lt;li&gt;AA Terminal-Bench Hard: 12.1 (#194/397)&lt;/li&gt;&lt;li&gt;AA GDPval: 678.83 (#199/360)&lt;/li&gt;&lt;li&gt;Artificial Analysis Intelligence Index: 23.72 (#204/482)&lt;/li&gt;&lt;li&gt;AA Omniscience: -52.87 (#255/388)&lt;/li&gt;&lt;li&gt;AA CritPt: 0.0 
(#257/388)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Mi:dm K 2.5 Pro&lt;/strong&gt; — ELO 1527, #352&lt;ul&gt;&lt;li&gt;AA TAU-2 Bench: 86.5 (#75/402)&lt;/li&gt;&lt;li&gt;AA AIME 2025: 76.7 (#77/269)&lt;/li&gt;&lt;li&gt;AA LiveCodeBench: 65.6 (#92/343)&lt;/li&gt;&lt;li&gt;AA Global-MMLU-Lite: 74.23 (#94/120)&lt;/li&gt;&lt;li&gt;AA MMLU-Pro: 80.9 (#97/345)&lt;/li&gt;&lt;li&gt;AA IFBench: 49.3 (#155/411)&lt;/li&gt;&lt;li&gt;AA Humanity's Last Exam: 7.7 (#195/479)&lt;/li&gt;&lt;li&gt;AA GPQA Diamond: 70.1 (#200/483)&lt;/li&gt;&lt;li&gt;Artificial Analysis Intelligence Index: 23.06 (#213/482)&lt;/li&gt;&lt;li&gt;AA GDPval: 643.11 (#213/360)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Motif-2-12.7B (Reasoning)&lt;/strong&gt; — ELO 1520, #366&lt;ul&gt;&lt;li&gt;AA AIME 2025: 80.3 (#65/269)&lt;/li&gt;&lt;li&gt;AA LiveCodeBench: 65.1 (#97/343)&lt;/li&gt;&lt;li&gt;AA IFBench: 57.0 (#113/411)&lt;/li&gt;&lt;li&gt;AA MMLU-Pro: 79.6 (#122/345)&lt;/li&gt;&lt;li&gt;AA Humanity's Last Exam: 8.2 (#183/479)&lt;/li&gt;&lt;li&gt;AA TAU-2 Bench: 46.5 (#185/402)&lt;/li&gt;&lt;li&gt;AA GPQA Diamond: 69.5 (#210/483)&lt;/li&gt;&lt;li&gt;Artificial Analysis Intelligence Index: 19.08 (#244/482)&lt;/li&gt;&lt;li&gt;AA CritPt: 0.0 (#250/388)&lt;/li&gt;&lt;li&gt;AA GDPval: 485.33 (#255/360)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Mi:dm K 2.5 Pro Preview&lt;/strong&gt; — ELO 1517, #371&lt;ul&gt;&lt;li&gt;AA Global-MMLU-Lite: 81.43 (#63/120)&lt;/li&gt;&lt;li&gt;AA AIME 2025: 78.7 (#70/269)&lt;/li&gt;&lt;li&gt;AA MMLU-Pro: 81.3 (#92/345)&lt;/li&gt;&lt;li&gt;AA LiveCodeBench: 57.6 (#125/343)&lt;/li&gt;&lt;li&gt;AA Humanity's Last Exam: 8.8 (#175/479)&lt;/li&gt;&lt;li&gt;AA TAU-2 Bench: 49.4 (#177/402)&lt;/li&gt;&lt;li&gt;AA IFBench: 45.6 (#180/411)&lt;/li&gt;&lt;li&gt;AA GPQA Diamond: 72.2 (#185/483)&lt;/li&gt;&lt;li&gt;AA SciCode: 29.7 (#251/477)&lt;/li&gt;&lt;li&gt;AA CritPt: 0.0 (#255/388)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;K2-V2 
(Medium)&lt;/strong&gt; — ELO 1512, #382&lt;ul&gt;&lt;li&gt;AA Global-MMLU-Lite: 76.7 (#87/120)&lt;/li&gt;&lt;li&gt;AA AIME 2025: 64.7 (#107/269)&lt;/li&gt;&lt;li&gt;AA IFBench: 55.1 (#122/411)&lt;/li&gt;&lt;li&gt;AA LiveCodeBench: 54.1 (#137/343)&lt;/li&gt;&lt;li&gt;AA MMLU-Pro: 76.1 (#165/345)&lt;/li&gt;&lt;li&gt;AA Terminal-Bench Hard: 8.3 (#220/397)&lt;/li&gt;&lt;li&gt;AA Omniscience: -49.97 (#222/388)&lt;/li&gt;&lt;li&gt;AA GDPval: 578.73 (#227/360)&lt;/li&gt;&lt;li&gt;AA Long Context Reasoning: 28.0 (#232/411)&lt;/li&gt;&lt;li&gt;AA CritPt: 0.0 (#251/388)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Granite 4.1 30B&lt;/strong&gt; — ELO 1491, #425&lt;ul&gt;&lt;li&gt;AA IFBench: 44.4 (#191/411)&lt;/li&gt;&lt;li&gt;AA TAU-2 Bench: 42.1 (#198/402)&lt;/li&gt;&lt;li&gt;AA CritPt: 0.0 (#228/388)&lt;/li&gt;&lt;li&gt;AA GDPval: 495.5 (#253/360)&lt;/li&gt;&lt;li&gt;AA Long Context Reasoning: 18.7 (#273/411)&lt;/li&gt;&lt;li&gt;AA Terminal-Bench Hard: 2.3 (#310/397)&lt;/li&gt;&lt;li&gt;AA SciCode: 25.8 (#315/477)&lt;/li&gt;&lt;li&gt;Artificial Analysis Intelligence Index: 14.69 (#324/482)&lt;/li&gt;&lt;li&gt;AA Omniscience: -67.78 (#342/388)&lt;/li&gt;&lt;li&gt;AA GPQA Diamond: 48.1 (#354/483)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;K-EXAONE (Non-reasoning)&lt;/strong&gt; — ELO 1487, #432&lt;ul&gt;&lt;li&gt;AA MMLU-Pro: 81.0 (#94/345)&lt;/li&gt;&lt;li&gt;AA Global-MMLU-Lite: 71.03 (#104/120)&lt;/li&gt;&lt;li&gt;AA AIME 2025: 44.0 (#150/269)&lt;/li&gt;&lt;li&gt;AA Long Context Reasoning: 47.0 (#157/411)&lt;/li&gt;&lt;li&gt;AA TAU-2 Bench: 59.1 (#162/402)&lt;/li&gt;&lt;li&gt;AA GDPval: 767.0 (#174/360)&lt;/li&gt;&lt;li&gt;Artificial Analysis Intelligence Index: 23.41 (#207/482)&lt;/li&gt;&lt;li&gt;AA GPQA Diamond: 69.5 (#209/483)&lt;/li&gt;&lt;li&gt;AA Terminal-Bench Hard: 6.8 (#239/397)&lt;/li&gt;&lt;li&gt;AA CritPt: 0.0 (#242/388)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;K2-V2 (Low)&lt;/strong&gt; — ELO 1483, 
#444&lt;ul&gt;&lt;li&gt;AA Global-MMLU-Lite: 71.44 (#103/120)&lt;/li&gt;&lt;li&gt;AA AIME 2025: 35.3 (#173/269)&lt;/li&gt;&lt;li&gt;AA LiveCodeBench: 39.3 (#187/343)&lt;/li&gt;&lt;li&gt;AA MMLU-Pro: 71.3 (#212/345)&lt;/li&gt;&lt;li&gt;AA Omniscience: -48.07 (#212/388)&lt;/li&gt;&lt;li&gt;AA IFBench: 41.0 (#233/411)&lt;/li&gt;&lt;li&gt;AA CritPt: 0.0 (#254/388)&lt;/li&gt;&lt;li&gt;AA Long Context Reasoning: 19.0 (#271/411)&lt;/li&gt;&lt;li&gt;AA Terminal-Bench Hard: 4.5 (#277/397)&lt;/li&gt;&lt;li&gt;AA GDPval: 367.48 (#285/360)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Solar Pro 2 (Reasoning)&lt;/strong&gt; — ELO 1479, #450&lt;ul&gt;&lt;li&gt;AA MATH-500: 96.7 (#30/193)&lt;/li&gt;&lt;li&gt;AA Global-MMLU-Lite: 79.61 (#78/120)&lt;/li&gt;&lt;li&gt;AA MMLU-Pro: 80.5 (#107/345)&lt;/li&gt;&lt;li&gt;AA LiveCodeBench: 61.6 (#113/343)&lt;/li&gt;&lt;li&gt;AA AIME 2025: 61.3 (#115/269)&lt;/li&gt;&lt;li&gt;AA CritPt: 0.0 (#206/388)&lt;/li&gt;&lt;li&gt;AA Humanity's Last Exam: 7.0 (#213/479)&lt;/li&gt;&lt;li&gt;AA GPQA Diamond: 68.7 (#215/483)&lt;/li&gt;&lt;li&gt;AA SciCode: 30.2 (#246/477)&lt;/li&gt;&lt;li&gt;AA TAU-2 Bench: 28.1 (#251/402)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemma 4 E4B (Reasoning)&lt;/strong&gt; — ELO 1474, #458&lt;ul&gt;&lt;li&gt;AA Omniscience: -20.05 (#82/388)&lt;/li&gt;&lt;li&gt;AA CritPt: 0.6 (#104/388)&lt;/li&gt;&lt;li&gt;AA MMMU-Pro: 51.4 (#143/188)&lt;/li&gt;&lt;li&gt;AA IFBench: 44.2 (#193/411)&lt;/li&gt;&lt;li&gt;AA Terminal-Bench Hard: 8.3 (#218/397)&lt;/li&gt;&lt;li&gt;AA Long Context Reasoning: 30.7 (#222/411)&lt;/li&gt;&lt;li&gt;Artificial Analysis Intelligence Index: 18.76 (#250/482)&lt;/li&gt;&lt;li&gt;AA GPQA Diamond: 57.6 (#297/483)&lt;/li&gt;&lt;li&gt;AA GDPval: 304.3 (#312/360)&lt;/li&gt;&lt;li&gt;AA TAU-2 Bench: 20.8 (#314/402)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;EXAONE 4.0 32B (Reasoning)&lt;/strong&gt; — ELO 1473, #461&lt;ul&gt;&lt;li&gt;AA MATH-500: 97.7 
(#21/193)&lt;/li&gt;&lt;li&gt;AA LiveCodeBench: 74.7 (#48/343)&lt;/li&gt;&lt;li&gt;AA AIME 2025: 80.0 (#68/269)&lt;/li&gt;&lt;li&gt;AA MMLU-Pro: 81.8 (#82/345)&lt;/li&gt;&lt;li&gt;AA Global-MMLU-Lite: 73.46 (#97/120)&lt;/li&gt;&lt;li&gt;AA Humanity's Last Exam: 10.5 (#145/479)&lt;/li&gt;&lt;li&gt;AA GPQA Diamond: 73.9 (#167/483)&lt;/li&gt;&lt;li&gt;AA SciCode: 34.4 (#203/477)&lt;/li&gt;&lt;li&gt;AA CritPt: 0.0 (#240/388)&lt;/li&gt;&lt;li&gt;AA GDPval: 499.86 (#249/360)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Tri-21B-Think Preview&lt;/strong&gt; — ELO 1473, #462&lt;ul&gt;&lt;li&gt;AA TAU-2 Bench: 93.3 (#38/402)&lt;/li&gt;&lt;li&gt;AA IFBench: 47.1 (#169/411)&lt;/li&gt;&lt;li&gt;Artificial Analysis Intelligence Index: 19.99 (#236/482)&lt;/li&gt;&lt;li&gt;AA Humanity's Last Exam: 5.7 (#257/479)&lt;/li&gt;&lt;li&gt;AA CritPt: 0.0 (#259/388)&lt;/li&gt;&lt;li&gt;AA Omniscience: -55.28 (#267/388)&lt;/li&gt;&lt;li&gt;AA Long Context Reasoning: 14.7 (#294/411)&lt;/li&gt;&lt;li&gt;AA GDPval: 337.02 (#299/360)&lt;/li&gt;&lt;li&gt;AA Terminal-Bench Hard: 2.3 (#315/397)&lt;/li&gt;&lt;li&gt;AA GPQA Diamond: 53.8 (#320/483)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Tri-21B-Think&lt;/strong&gt; — ELO 1468, #468&lt;ul&gt;&lt;li&gt;AA TAU-2 Bench: 81.0 (#103/402)&lt;/li&gt;&lt;li&gt;AA IFBench: 54.6 (#124/411)&lt;/li&gt;&lt;li&gt;AA CritPt: 0.3 (#132/388)&lt;/li&gt;&lt;li&gt;AA Humanity's Last Exam: 6.1 (#241/479)&lt;/li&gt;&lt;li&gt;Artificial Analysis Intelligence Index: 18.62 (#258/482)&lt;/li&gt;&lt;li&gt;AA GPQA Diamond: 60.1 (#279/483)&lt;/li&gt;&lt;li&gt;AA GDPval: 374.11 (#282/360)&lt;/li&gt;&lt;li&gt;AA Long Context Reasoning: 11.0 (#312/411)&lt;/li&gt;&lt;li&gt;AA Omniscience: -63.3 (#321/388)&lt;/li&gt;&lt;li&gt;AA Terminal-Bench Hard: 0.8 (#342/397)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-4o (March 2025, chatgpt-4o-latest)&lt;/strong&gt; — ELO 1449, #500&lt;ul&gt;&lt;li&gt;AA MATH-500: 89.3 
(#73/193)&lt;/li&gt;&lt;li&gt;AA MMLU-Pro: 80.3 (#110/345)&lt;/li&gt;&lt;li&gt;AA SciCode: 36.6 (#165/477)&lt;/li&gt;&lt;li&gt;AA LiveCodeBench: 42.5 (#170/343)&lt;/li&gt;&lt;li&gt;AA AIME 2025: 25.7 (#196/269)&lt;/li&gt;&lt;li&gt;AA GPQA Diamond: 65.5 (#247/483)&lt;/li&gt;&lt;li&gt;Artificial Analysis Intelligence Index: 18.56 (#260/482)&lt;/li&gt;&lt;li&gt;AA Humanity's Last Exam: 5.0 (#305/479)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Llama 3.3 Nemotron Super 49B v1 (Reasoning)&lt;/strong&gt; — ELO 1448, #502&lt;ul&gt;&lt;li&gt;AA MATH-500: 95.9 (#36/193)&lt;/li&gt;&lt;li&gt;AA AIME 2025: 54.7 (#132/269)&lt;/li&gt;&lt;li&gt;AA MMLU-Pro: 78.5 (#136/345)&lt;/li&gt;&lt;li&gt;AA CritPt: 0.0 (#215/388)&lt;/li&gt;&lt;li&gt;AA Humanity's Last Exam: 6.5 (#227/479)&lt;/li&gt;&lt;li&gt;AA LiveCodeBench: 27.7 (#238/343)&lt;/li&gt;&lt;li&gt;AA GPQA Diamond: 64.3 (#251/483)&lt;/li&gt;&lt;li&gt;Artificial Analysis Intelligence Index: 18.49 (#262/482)&lt;/li&gt;&lt;li&gt;AA TAU-2 Bench: 26.9 (#262/402)&lt;/li&gt;&lt;li&gt;AA IFBench: 38.1 (#262/411)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Solar Pro 2 (Non-reasoning)&lt;/strong&gt; — ELO 1435, #524&lt;ul&gt;&lt;li&gt;AA MATH-500: 88.9 (#76/193)&lt;/li&gt;&lt;li&gt;AA Global-MMLU-Lite: 75.34 (#91/120)&lt;/li&gt;&lt;li&gt;AA LiveCodeBench: 42.4 (#172/343)&lt;/li&gt;&lt;li&gt;AA MMLU-Pro: 75.0 (#178/345)&lt;/li&gt;&lt;li&gt;AA AIME 2025: 30.0 (#186/269)&lt;/li&gt;&lt;li&gt;AA CritPt: 0.0 (#203/388)&lt;/li&gt;&lt;li&gt;AA TAU-2 Bench: 31.9 (#230/402)&lt;/li&gt;&lt;li&gt;AA GDPval: 447.04 (#265/360)&lt;/li&gt;&lt;li&gt;AA Terminal-Bench Hard: 4.5 (#273/397)&lt;/li&gt;&lt;li&gt;AA IFBench: 33.7 (#306/411)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Llama 3.3 Nemotron Super 49B v1 (Non-reasoning)&lt;/strong&gt; — ELO 1408, #560&lt;ul&gt;&lt;li&gt;AA MATH-500: 77.5 (#113/193)&lt;/li&gt;&lt;li&gt;AA CritPt: 0.0 (#216/388)&lt;/li&gt;&lt;li&gt;AA Omniscience: -49.68 
(#219/388)&lt;/li&gt;&lt;li&gt;AA MMLU-Pro: 69.8 (#221/345)&lt;/li&gt;&lt;li&gt;AA LiveCodeBench: 28.0 (#235/343)&lt;/li&gt;&lt;li&gt;AA AIME 2025: 7.7 (#237/269)&lt;/li&gt;&lt;li&gt;AA IFBench: 39.5 (#247/411)&lt;/li&gt;&lt;li&gt;AA Long Context Reasoning: 11.3 (#309/411)&lt;/li&gt;&lt;li&gt;AA GPQA Diamond: 51.7 (#330/483)&lt;/li&gt;&lt;li&gt;Artificial Analysis Intelligence Index: 14.35 (#336/482)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;NVIDIA Nemotron 3 Nano 4B&lt;/strong&gt; — ELO 1388, #586&lt;ul&gt;&lt;li&gt;AA IFBench: 58.2 (#106/411)&lt;/li&gt;&lt;li&gt;AA CritPt: 0.0 (#211/388)&lt;/li&gt;&lt;li&gt;AA Terminal-Bench Hard: 6.8 (#238/397)&lt;/li&gt;&lt;li&gt;AA TAU-2 Bench: 28.1 (#252/402)&lt;/li&gt;&lt;li&gt;AA GDPval: 476.83 (#258/360)&lt;/li&gt;&lt;li&gt;AA Long Context Reasoning: 16.7 (#286/411)&lt;/li&gt;&lt;li&gt;AA Humanity's Last Exam: 4.8 (#323/479)&lt;/li&gt;&lt;li&gt;Artificial Analysis Intelligence Index: 14.68 (#325/482)&lt;/li&gt;&lt;li&gt;AA GPQA Diamond: 51.3 (#338/483)&lt;/li&gt;&lt;li&gt;AA Omniscience: -71.53 (#351/388)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Granite 4.1 3B&lt;/strong&gt; — ELO 1380, #595&lt;ul&gt;&lt;li&gt;AA CritPt: 0.0 (#232/388)&lt;/li&gt;&lt;li&gt;AA GDPval: 366.32 (#286/360)&lt;/li&gt;&lt;li&gt;AA IFBench: 33.7 (#307/411)&lt;/li&gt;&lt;li&gt;AA Terminal-Bench Hard: 2.3 (#312/397)&lt;/li&gt;&lt;li&gt;AA TAU-2 Bench: 19.6 (#323/402)&lt;/li&gt;&lt;li&gt;AA Long Context Reasoning: 3.0 (#341/411)&lt;/li&gt;&lt;li&gt;AA Omniscience: -77.38 (#370/388)&lt;/li&gt;&lt;li&gt;AA SciCode: 11.9 (#412/477)&lt;/li&gt;&lt;li&gt;Artificial Analysis Intelligence Index: 8.54 (#435/482)&lt;/li&gt;&lt;li&gt;AA GPQA Diamond: 31.4 (#441/483)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemma 4 E2B (Reasoning)&lt;/strong&gt; — ELO 1376, #604&lt;ul&gt;&lt;li&gt;AA Omniscience: -23.98 (#94/388)&lt;/li&gt;&lt;li&gt;AA MMMU-Pro: 44.6 (#160/188)&lt;/li&gt;&lt;li&gt;AA CritPt: 0.0 
(#170/388)&lt;/li&gt;&lt;li&gt;AA IFBench: 38.0 (#265/411)&lt;/li&gt;&lt;li&gt;AA Long Context Reasoning: 15.0 (#292/411)&lt;/li&gt;&lt;li&gt;AA Terminal-Bench Hard: 3.0 (#299/397)&lt;/li&gt;&lt;li&gt;Artificial Analysis Intelligence Index: 15.21 (#309/482)&lt;/li&gt;&lt;li&gt;AA TAU-2 Bench: 20.8 (#315/402)&lt;/li&gt;&lt;li&gt;AA Humanity's Last Exam: 4.8 (#322/479)&lt;/li&gt;&lt;li&gt;AA GDPval: 272.59 (#338/360)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Llama 3.1 Nemotron Nano 4B v1.1 (Reasoning)&lt;/strong&gt; — ELO 1351, #630&lt;ul&gt;&lt;li&gt;AA MATH-500: 94.7 (#41/193)&lt;/li&gt;&lt;li&gt;AA AIME 2025: 50.0 (#140/269)&lt;/li&gt;&lt;li&gt;AA LiveCodeBench: 49.3 (#153/343)&lt;/li&gt;&lt;li&gt;AA MMLU-Pro: 55.6 (#283/345)&lt;/li&gt;&lt;li&gt;AA Humanity's Last Exam: 5.1 (#289/479)&lt;/li&gt;&lt;li&gt;Artificial Analysis Intelligence Index: 14.43 (#334/482)&lt;/li&gt;&lt;li&gt;AA Long Context Reasoning: 0.0 (#358/411)&lt;/li&gt;&lt;li&gt;AA TAU-2 Bench: 11.7 (#362/402)&lt;/li&gt;&lt;li&gt;AA IFBench: 25.5 (#375/411)&lt;/li&gt;&lt;li&gt;AA GPQA Diamond: 40.8 (#393/483)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Ling-mini-2.0&lt;/strong&gt; — ELO 1346, #635&lt;ul&gt;&lt;li&gt;AA AIME 2025: 49.3 (#142/269)&lt;/li&gt;&lt;li&gt;AA LiveCodeBench: 42.9 (#169/343)&lt;/li&gt;&lt;li&gt;AA MMLU-Pro: 67.1 (#243/345)&lt;/li&gt;&lt;li&gt;AA CritPt: 0.0 (#284/388)&lt;/li&gt;&lt;li&gt;AA Humanity's Last Exam: 5.0 (#304/479)&lt;/li&gt;&lt;li&gt;AA GPQA Diamond: 56.2 (#306/483)&lt;/li&gt;&lt;li&gt;AA Long Context Reasoning: 6.7 (#329/411)&lt;/li&gt;&lt;li&gt;AA GDPval: 264.15 (#341/360)&lt;/li&gt;&lt;li&gt;AA Terminal-Bench Hard: 0.8 (#345/397)&lt;/li&gt;&lt;li&gt;AA TAU-2 Bench: 13.2 (#356/402)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Jamba Reasoning 3B&lt;/strong&gt; — ELO 1320, #657&lt;ul&gt;&lt;li&gt;AA IFBench: 52.4 (#137/411)&lt;/li&gt;&lt;li&gt;AA AIME 2025: 10.7 (#231/269)&lt;/li&gt;&lt;li&gt;AA LiveCodeBench: 21.0 
(#267/343)&lt;/li&gt;&lt;li&gt;AA CritPt: 0.0 (#268/388)&lt;/li&gt;&lt;li&gt;AA MMLU-Pro: 57.7 (#274/345)&lt;/li&gt;&lt;li&gt;AA Long Context Reasoning: 7.0 (#323/411)&lt;/li&gt;&lt;li&gt;AA TAU-2 Bench: 15.8 (#342/402)&lt;/li&gt;&lt;li&gt;AA Terminal-Bench Hard: 0.8 (#344/397)&lt;/li&gt;&lt;li&gt;AA GDPval: 257.67 (#345/360)&lt;/li&gt;&lt;li&gt;AA Humanity's Last Exam: 4.6 (#347/479)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Exaone 4.0 1.2B (Reasoning)&lt;/strong&gt; — ELO 1266, #696&lt;ul&gt;&lt;li&gt;AA AIME 2025: 50.3 (#139/269)&lt;/li&gt;&lt;li&gt;AA LiveCodeBench: 51.6 (#143/343)&lt;/li&gt;&lt;li&gt;AA CritPt: 0.0 (#241/388)&lt;/li&gt;&lt;li&gt;AA Humanity's Last Exam: 5.8 (#251/479)&lt;/li&gt;&lt;li&gt;AA MMLU-Pro: 58.8 (#268/345)&lt;/li&gt;&lt;li&gt;AA GDPval: 296.88 (#317/360)&lt;/li&gt;&lt;li&gt;AA GPQA Diamond: 51.5 (#336/483)&lt;/li&gt;&lt;li&gt;AA TAU-2 Bench: 16.4 (#338/402)&lt;/li&gt;&lt;li&gt;AA Long Context Reasoning: 0.0 (#370/411)&lt;/li&gt;&lt;li&gt;AA Terminal-Bench Hard: 0.0 (#377/397)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Exaone 4.0 1.2B (Non-reasoning)&lt;/strong&gt; — ELO 1262, #697&lt;ul&gt;&lt;li&gt;AA AIME 2025: 24.0 (#200/269)&lt;/li&gt;&lt;li&gt;AA LiveCodeBench: 29.3 (#226/343)&lt;/li&gt;&lt;li&gt;AA CritPt: 0.0 (#239/388)&lt;/li&gt;&lt;li&gt;AA Humanity's Last Exam: 5.8 (#250/479)&lt;/li&gt;&lt;li&gt;AA MMLU-Pro: 50.0 (#294/345)&lt;/li&gt;&lt;li&gt;AA GDPval: 298.76 (#316/360)&lt;/li&gt;&lt;li&gt;AA TAU-2 Bench: 20.5 (#318/402)&lt;/li&gt;&lt;li&gt;AA Long Context Reasoning: 0.0 (#369/411)&lt;/li&gt;&lt;li&gt;AA Terminal-Bench Hard: 0.0 (#376/397)&lt;/li&gt;&lt;li&gt;AA IFBench: 25.3 (#376/411)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Granite 4.0 1B&lt;/strong&gt; — ELO 1258, #701&lt;ul&gt;&lt;li&gt;AA CritPt: 0.0 (#234/388)&lt;/li&gt;&lt;li&gt;AA AIME 2025: 6.3 (#244/269)&lt;/li&gt;&lt;li&gt;AA Humanity's Last Exam: 5.1 (#292/479)&lt;/li&gt;&lt;li&gt;AA TAU-2 Bench: 22.8 
(#294/402)&lt;/li&gt;&lt;li&gt;AA MMLU-Pro: 32.5 (#331/345)&lt;/li&gt;&lt;li&gt;AA LiveCodeBench: 4.7 (#333/343)&lt;/li&gt;&lt;li&gt;AA Long Context Reasoning: 4.0 (#340/411)&lt;/li&gt;&lt;li&gt;AA GDPval: 259.61 (#342/360)&lt;/li&gt;&lt;li&gt;AA Terminal-Bench Hard: 0.0 (#373/397)&lt;/li&gt;&lt;li&gt;AA Omniscience: -81.82 (#377/388)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Granite 4.0 H 350M&lt;/strong&gt; — ELO 1137, #759&lt;ul&gt;&lt;li&gt;AA CritPt: 0.0 (#227/388)&lt;/li&gt;&lt;li&gt;AA Humanity's Last Exam: 6.4 (#228/479)&lt;/li&gt;&lt;li&gt;AA AIME 2025: 1.3 (#262/269)&lt;/li&gt;&lt;li&gt;AA GDPval: 294.09 (#319/360)&lt;/li&gt;&lt;li&gt;AA LiveCodeBench: 1.9 (#339/343)&lt;/li&gt;&lt;li&gt;AA MMLU-Pro: 12.7 (#343/345)&lt;/li&gt;&lt;li&gt;AA TAU-2 Bench: 14.6 (#349/402)&lt;/li&gt;&lt;li&gt;AA Long Context Reasoning: 0.0 (#366/411)&lt;/li&gt;&lt;li&gt;AA Terminal-Bench Hard: 0.0 (#369/397)&lt;/li&gt;&lt;li&gt;AA Omniscience: -87.25 (#387/388)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OLMo 2 32B&lt;/strong&gt; — ELO 1037, #780&lt;ul&gt;&lt;li&gt;AA AIME 2025: 3.3 (#256/269)&lt;/li&gt;&lt;li&gt;AA IFBench: 38.1 (#264/411)&lt;/li&gt;&lt;li&gt;AA MMLU-Pro: 51.1 (#292/345)&lt;/li&gt;&lt;li&gt;AA LiveCodeBench: 6.8 (#328/343)&lt;/li&gt;&lt;li&gt;AA Terminal-Bench Hard: 0.0 (#391/397)&lt;/li&gt;&lt;li&gt;AA Long Context Reasoning: 0.0 (#393/411)&lt;/li&gt;&lt;li&gt;Artificial Analysis Intelligence Index: 10.57 (#397/482)&lt;/li&gt;&lt;li&gt;AA TAU-2 Bench: 0.0 (#401/402)&lt;/li&gt;&lt;li&gt;AA GPQA Diamond: 32.8 (#429/483)&lt;/li&gt;&lt;li&gt;AA SciCode: 8.0 (#437/477)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Phi-3 Mini Instruct 3.8B&lt;/strong&gt; — ELO 1025, #781&lt;ul&gt;&lt;li&gt;AA MATH-500: 45.7 (#172/193)&lt;/li&gt;&lt;li&gt;AA AIME 2025: 0.3 (#265/269)&lt;/li&gt;&lt;li&gt;AA MMLU-Pro: 43.5 (#308/345)&lt;/li&gt;&lt;li&gt;AA LiveCodeBench: 11.6 (#308/343)&lt;/li&gt;&lt;li&gt;AA Long Context Reasoning: 2.0 
(#345/411)&lt;/li&gt;&lt;li&gt;AA Humanity's Last Exam: 4.4 (#372/479)&lt;/li&gt;&lt;li&gt;AA IFBench: 23.9 (#382/411)&lt;/li&gt;&lt;li&gt;AA Terminal-Bench Hard: 0.0 (#388/397)&lt;/li&gt;&lt;li&gt;AA TAU-2 Bench: 0.0 (#398/402)&lt;/li&gt;&lt;li&gt;Artificial Analysis Intelligence Index: 10.1 (#407/482)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OLMo 2 7B&lt;/strong&gt; — ELO 958, #787&lt;ul&gt;&lt;li&gt;AA AIME 2025: 0.7 (#263/269)&lt;/li&gt;&lt;li&gt;AA Humanity's Last Exam: 5.5 (#265/479)&lt;/li&gt;&lt;li&gt;AA MMLU-Pro: 28.2 (#334/345)&lt;/li&gt;&lt;li&gt;AA LiveCodeBench: 4.1 (#335/343)&lt;/li&gt;&lt;li&gt;AA IFBench: 24.4 (#381/411)&lt;/li&gt;&lt;li&gt;AA Terminal-Bench Hard: 0.0 (#390/397)&lt;/li&gt;&lt;li&gt;AA Long Context Reasoning: 0.0 (#391/411)&lt;/li&gt;&lt;li&gt;AA TAU-2 Bench: 0.0 (#399/402)&lt;/li&gt;&lt;li&gt;Artificial Analysis Intelligence Index: 9.3 (#423/482)&lt;/li&gt;&lt;li&gt;AA GPQA Diamond: 28.8 (#455/483)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;Top-10 New Scores (7)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Claude Mythos Preview&lt;/strong&gt; on METR Benchmark: 17.41 (#1)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-5.4 (xHigh)&lt;/strong&gt; on OpenClawProBench: 68.0 (#8)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5 (xHigh)&lt;/strong&gt; on OpenClawProBench: 69.3 (#4)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5 (xHigh)&lt;/strong&gt; on Wolfram LLM Benchmarking Project: 68.8 (#6)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5 Pro&lt;/strong&gt; on Epoch AI - ECI: 159.5 (#3)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5 Pro&lt;/strong&gt; on PinchBench: 18.11 (#39)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5 Pro&lt;/strong&gt; on VoxelBench: 2107.0 (#1)&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;New #1 Leaders (14)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;FoodTruckBench&lt;/strong&gt;: GPT-5.5 (61408.0) beat Claude Opus 4.6 by 11889.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LIBRA - ruBABILongQA2&lt;/strong&gt;: Qwen_Qwen3-30B-A3B-Instruct-2507 (64.72) beat GPT-4o by 28.05&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LIBRA - ruQuALITY&lt;/strong&gt;: 01-ai_Yi-9B-200K (95.9) beat GPT-4o by 12.57&lt;/li&gt;&lt;li&gt;&lt;strong&gt;SEAL - AudioMultiChallenge - Audio Output&lt;/strong&gt;: gpt-realtime-2 (xHigh) (48.45) beat gemini-3.1-flash-live-preview (Thinking) by 12.39&lt;/li&gt;&lt;li&gt;&lt;strong&gt;FrontierSWE&lt;/strong&gt;: GPT-5.5 (83.0) beat Claude Opus 4.7 by 9.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;FrontierMath - Tier 4&lt;/strong&gt;: AI co-mathematician (47.9) beat GPT-5.5 Pro (xhigh) by 8.3&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Story Theory Bench&lt;/strong&gt;: glm-5 (99.6) beat deepseek-v3.2 by 7.4&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Kaggle FACTS Parametric&lt;/strong&gt;: Gemini 3.1 Pro Preview (78.96) beat Gemini 3 Flash Preview by 6.7&lt;/li&gt;&lt;li&gt;&lt;strong&gt;SEAL - SWE Atlas - Codebase QnA&lt;/strong&gt;: GPT 5.5 (Codex) (45.43) beat Gpt 5.4 xHigh (Codex) by 4.63&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LIBRA - ruSciAbstractRetrieval&lt;/strong&gt;: Qwen_Qwen3-30B-A3B-Instruct-2507 (81.5) beat GLM-4 9B Chat by 3.69&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Kaggle FACTS (Google)&lt;/strong&gt;: GPT-5.5 (71.19) beat Gemini 3.1 Pro Preview by 3.48&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LIBRA - ruBABILongQA1&lt;/strong&gt;: Qwen_Qwen3-30B-A3B-Instruct-2507 (80.5) beat GPT-4o by 2.17&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Android Bench&lt;/strong&gt;: GPT 5.5 (74.0) beat GPT-5.4 by 1.6&lt;/li&gt;&lt;li&gt;&lt;strong&gt;ForecastBench&lt;/strong&gt;: green tree (68.2) beat Cassi ensemble_2_crowdadj by 0.4&lt;/li&gt;&lt;/ul&gt;</content><summary>AI Benchmark Digest — 2026-05-10

=== DAILY ===
NEW BENCHMARKS (43)
  - AA Global-MMLU-Lite - Arabic (Accuracy (%)): leader Gemini 3.1 Pro Preview (93.0), 119 models
  - AA Global-MMLU-Lite - Bengali (Accuracy (%)): leader Gemini 3.1 Pro Preview (92.17), 119 models
  - AA Global-MMLU-Lite - German (</summary></entry><entry><title>AI Benchmark Digest — 2026-05-09</title><id>https://aibenchmarks.dev/digest/2026-05-09</id><updated>2026-05-09T07:40:39.118338+00:00</updated><link href="https://aibenchmarks.dev/#/digest" /><author><name>AI Benchmark Hub</name></author><content type="html">&lt;h2&gt;Daily&lt;/h2&gt;
&lt;h3&gt;New Benchmarks (8)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Factory Code Review Benchmark&lt;/strong&gt; (Mean F1 (%)): leader GPT-5.2 (60.5), 13 models&lt;br&gt;&lt;span&gt;Factory benchmark for code review quality, scoring model comments against expected findings with mean F1 across realistic pull request review tasks.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;EuroEval Albanian NLU&lt;/strong&gt; (NLU Average Score (%)): leader gemini-3.1-pro-preview (61.17), 208 models&lt;br&gt;&lt;span&gt;Albanian-language EuroEval natural-language-understanding suite, separating NLU task performance from the broader all-task EuroEval aggregate.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;EuroEval Bosnian NLU&lt;/strong&gt; (NLU Average Score (%)): leader Ministral-3-14B-Reasoning-2512 (66.0), 214 models&lt;br&gt;&lt;span&gt;Bosnian-language EuroEval natural-language-understanding suite, separating NLU task performance from the broader all-task EuroEval aggregate.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;EuroEval Albanian Knowledge&lt;/strong&gt; (Knowledge Average Score (%)): leader gemini-3-flash-preview#thinking (96.46), 167 models&lt;br&gt;&lt;span&gt;EuroEval Albanian knowledge category: language-specific factual or domain-knowledge tasks from EuroEval&amp;#x27;s public albanian_all.csv leaderboard, scored as the average task score for each model.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;EuroEval Albanian Common Sense Reasoning&lt;/strong&gt; (Common Sense Reasoning Average Score (%)): leader gemini-3.1-pro-preview (85.24), 155 models&lt;br&gt;&lt;span&gt;EuroEval Albanian common-sense reasoning category: language-specific commonsense tasks from EuroEval&amp;#x27;s public albanian_all.csv leaderboard, scored as the average task score for each model.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;IMO-Bench&lt;/strong&gt; (Advanced ProofBench Accuracy (%)): leader Aletheia (91.9), 9 models&lt;br&gt;&lt;span&gt;Advanced IMO-ProofBench leaderboard for rigorous mathematical proof writing on olympiad-level problems.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;ChartMuseum&lt;/strong&gt; (Overall Accuracy (%)): leader Gemini-3.1-Pro (80.7), 22 models&lt;br&gt;&lt;span&gt;Chart question-answering benchmark over real-world charts, testing visual, textual, and synthesis reasoning.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;SvelteBench&lt;/strong&gt; (Average pass@1 (%)): leader claude-opus-4-6 (100.0), 123 models&lt;br&gt;&lt;span&gt;Frontend coding benchmark for Svelte component tasks, scored by average pass@1.&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;New Models (1)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Grok 4.3 (Non-reasoning)&lt;/strong&gt; — ELO 1647, #259&lt;ul&gt;&lt;li&gt;AA GDPval: 1306.14 (#52/360)&lt;/li&gt;&lt;li&gt;AA MMMU-Pro: 64.8 (#88/188)&lt;/li&gt;&lt;li&gt;AA Omniscience: -32.3 (#121/388)&lt;/li&gt;&lt;li&gt;Artificial Analysis Intelligence Index: 31.02 (#139/482)&lt;/li&gt;&lt;li&gt;AA SciCode: 37.4 (#146/477)&lt;/li&gt;&lt;li&gt;AA TAU-2 Bench: 65.8 (#148/402)&lt;/li&gt;&lt;li&gt;AA Terminal-Bench Hard: 18.9 (#149/397)&lt;/li&gt;&lt;li&gt;AA IFBench: 47.6 (#165/411)&lt;/li&gt;&lt;li&gt;AA CritPt: 0.0 (#182/388)&lt;/li&gt;&lt;li&gt;AA Humanity's Last Exam: 6.5 (#226/479)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;Top-10 New Scores (1)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5 (xHigh)&lt;/strong&gt; on Wolfram LLM Benchmarking Project: 68.8 (#6)&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;New #1 Leaders (4)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;FrontierMath - Tier 4&lt;/strong&gt;: AI co-mathematician (47.9) beat GPT-5.5 Pro (xhigh) by 8.3&lt;/li&gt;&lt;li&gt;&lt;strong&gt;METR Benchmark&lt;/strong&gt;: claude mythos preview early (17.41) beat claude opus 4 6 by 5.43&lt;/li&gt;&lt;li&gt;&lt;strong&gt;METR Benchmark (80% Horizon)&lt;/strong&gt;: claude mythos preview early (3.1) beat gemini 3 1 pro by 1.6&lt;/li&gt;&lt;li&gt;&lt;strong&gt;ForecastBench&lt;/strong&gt;: green tree (68.2) beat Cassi ensemble_2_crowdadj by 0.4&lt;/li&gt;&lt;/ul&gt;</content><summary>AI Benchmark Digest — 2026-05-09

=== DAILY ===
NEW BENCHMARKS (8)
  - Factory Code Review Benchmark (Mean F1 (%)): leader GPT-5.2 (60.5), 13 models
      Factory benchmark for code review quality, scoring model comments against expected findings with mean F1 across realistic pull request review tas</summary></entry><entry><title>AI Benchmark Digest — 2026-05-08</title><id>https://aibenchmarks.dev/digest/2026-05-08</id><updated>2026-05-08T07:40:34.661988+00:00</updated><link href="https://aibenchmarks.dev/#/digest" /><author><name>AI Benchmark Hub</name></author><content type="html">&lt;h2&gt;Daily&lt;/h2&gt;
&lt;h3&gt;New Benchmarks (8)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;EuroEval Albanian NLU&lt;/strong&gt; (NLU Average Score (%)): leader gemini-3.1-pro-preview (61.17), 208 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;EuroEval Bosnian NLU&lt;/strong&gt; (NLU Average Score (%)): leader Ministral-3-14B-Reasoning-2512 (66.0), 214 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;EuroEval Albanian Knowledge&lt;/strong&gt; (Knowledge Average Score (%)): leader gemini-3-flash-preview#thinking (96.46), 167 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;EuroEval Albanian Common Sense Reasoning&lt;/strong&gt; (Common Sense Reasoning Average Score (%)): leader gemini-3.1-pro-preview (85.24), 155 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;MoNaCo&lt;/strong&gt; (F1): leader o3 (61.18), 15 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;IMO-Bench&lt;/strong&gt; (Advanced ProofBench Accuracy (%)): leader Aletheia (91.9), 9 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;ChartMuseum&lt;/strong&gt; (Overall Accuracy (%)): leader Gemini-3.1-Pro (80.7), 22 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;SvelteBench&lt;/strong&gt; (Average pass@1 (%)): leader claude-opus-4-6 (100.0), 123 models&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;New #1 Leaders (3)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;SEAL - AudioMultiChallenge - Audio Output&lt;/strong&gt;: gpt-realtime-2 (xHigh) (48.45) beat gemini-3.1-flash-live-preview (Thinking) by 12.39&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Story Theory Bench&lt;/strong&gt;: glm-5 (99.6) beat deepseek-v3.2 by 7.4&lt;/li&gt;&lt;li&gt;&lt;strong&gt;SEAL - SWE Atlas - Codebase QnA&lt;/strong&gt;: GPT 5.5 (Codex) (45.43) beat Gpt 5.4 xHigh (Codex) by 4.63&lt;/li&gt;&lt;/ul&gt;</content><summary>AI Benchmark Digest — 2026-05-08

=== DAILY ===
NEW BENCHMARKS (8)
  - EuroEval Albanian NLU (NLU Average Score (%)): leader gemini-3.1-pro-preview (61.17), 208 models
  - EuroEval Bosnian NLU (NLU Average Score (%)): leader Ministral-3-14B-Reasoning-2512 (66.0), 214 models
  - EuroEval Albanian Kno</summary></entry><entry><title>AI Benchmark Digest — 2026-05-07</title><id>https://aibenchmarks.dev/digest/2026-05-07</id><updated>2026-05-07T07:40:24.104745+00:00</updated><link href="https://aibenchmarks.dev/#/digest" /><author><name>AI Benchmark Hub</name></author><content type="html">&lt;h2&gt;Daily&lt;/h2&gt;
&lt;h3&gt;New Benchmarks (19)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;LIBRA - MatreshkaNames *&lt;/strong&gt; (Dataset Total Score (%)): leader Qwen_Qwen3-30B-A3B-Instruct-2507 (81.2), 7 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LIBRA - ruSciPassageCount *&lt;/strong&gt; (Dataset Total Score (%)): leader Qwen_Qwen3-30B-A3B-Instruct-2507 (25.77), 7 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LIBRA - ru2WikiMultihopQA *&lt;/strong&gt; (Dataset Total Score (%)): leader Qwen_Qwen3-30B-A3B-Instruct-2507 (66.63), 7 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LIBRA - LongContextMultiQ *&lt;/strong&gt; (Dataset Total Score (%)): leader 01-ai_Yi-34B-200K (53.14), 7 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LIBRA - LibrusecMHQA *&lt;/strong&gt; (Dataset Total Score (%)): leader Qwen_Qwen3-30B-A3B-Instruct-2507 (51.0), 7 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LIBRA - ruBABILongQA3 *&lt;/strong&gt; (Dataset Total Score (%)): leader Qwen_Qwen3-30B-A3B-Instruct-2507 (38.38), 7 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Kernel Arena - KernelBench HIP&lt;/strong&gt; (Mean Correctness+Speedup): leader GPT-5.2 (15.463), 11 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Kernel Arena - WaferBench NVFP4&lt;/strong&gt; (Mean Correctness+Speedup): leader Gemini 3.1 Pro (2.274), 4 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;MathArena - ARXIV_FALSE April&lt;/strong&gt; (Accuracy (%)): leader GPT-5.5 (xhigh) (72.13), 6 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;MathArena - ARXIV April&lt;/strong&gt; (Accuracy (%)): leader GPT-5.5 (xhigh) (65.48), 6 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;METR Benchmark (80% Horizon)&lt;/strong&gt; (80% Time Horizon (hours)): leader gemini 3 1 pro (1.5), 25 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLM Stats (HealthBench)&lt;/strong&gt; (Score (%)): leader Kimi K2-Thinking-0905 (58.0), 5 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;SCORE Robustness (Accuracy)&lt;/strong&gt; (Average Accuracy (%)): leader Llama-3.1-70B-Instruct (67.02), 6 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;SCORE Robustness (Consistency)&lt;/strong&gt; (Average Consistency Rate (%)): leader Llama-3.1-70B-Instruct (72.39), 6 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Multilingual MMLU Leaderboard&lt;/strong&gt; (Average Accuracy (%)): leader Claude-3.5-Sonnet (77.39), 17 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Pinocchio Italian Leaderboard&lt;/strong&gt; (Average Accuracy (%)): leader gemma-2-27b-it (70.97), 45 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Ukrainian LLM Leaderboard&lt;/strong&gt; (Average Score (%)): leader gemma-4-26B-A4B-it (reasoning) (63.29), 13 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Arabic Broad Leaderboard&lt;/strong&gt; (Average Score (0-10)): leader gemini-3-pro-preview (9.204), 87 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Darija Chatbot Arena&lt;/strong&gt; (Elo Rating): leader GPT-4o (1404.8), 13 models&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;New #1 Leaders (3)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;FoodTruckBench&lt;/strong&gt;: GPT-5.5 (61408.0) beat Claude Opus 4.6 by 11889.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;ASCIIBench&lt;/strong&gt;: claude-opus-4.5 (1656.0) beat claude-opus-4.1 by 5.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Kaggle FACTS Parametric&lt;/strong&gt;: Gemini 3.1 Pro Preview (78.96) beat GPT-5.5 by 0.92&lt;/li&gt;&lt;/ul&gt;</content><summary>AI Benchmark Digest — 2026-05-07

=== DAILY ===
NEW BENCHMARKS (19)
  - LIBRA - MatreshkaNames * (Dataset Total Score (%)): leader Qwen_Qwen3-30B-A3B-Instruct-2507 (81.2), 7 models
  - LIBRA - ruSciPassageCount * (Dataset Total Score (%)): leader Qwen_Qwen3-30B-A3B-Instruct-2507 (25.77), 7 models
  </summary></entry></feed>