<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xsl" href="/feed.xsl"?>
<feed xmlns="http://www.w3.org/2005/Atom"><title>AI Benchmark Digest</title><subtitle>AI benchmark changes — new models, leader shifts, and trends</subtitle><link href="https://aibenchmarks.dev/data/feed.xml" rel="self" /><link href="https://aibenchmarks.dev/#/digest" rel="alternate" /><id>https://aibenchmarks.dev/feed</id><icon>https://aibenchmarks.dev/favicon.ico</icon><updated>2026-06-04T08:22:19.073162+00:00</updated><entry><title>AI Benchmark Digest — 2026-06-04</title><id>https://aibenchmarks.dev/digest/2026-06-04</id><updated>2026-06-04T08:22:19.073162+00:00</updated><link href="https://aibenchmarks.dev/#/digest" /><author><name>AI Benchmark Hub</name></author><content type="html">&lt;h2&gt;Daily&lt;/h2&gt;
&lt;h3&gt;New #1 Leaders (1)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;GAIA&lt;/strong&gt;: CustomGPT.ai Research Lab v44 (93.36) beat Co-Sight Pro v1.0.1 by 0.34&lt;/li&gt;&lt;/ul&gt;</content><summary>AI Benchmark Digest — 2026-06-04

=== DAILY ===
NEW #1 LEADERS (1)
  - GAIA (Accuracy (%)): CustomGPT.ai Research Lab v44 (93.36) beat Co-Sight Pro v1.0.1 (93.02) by 0.34
</summary></entry><entry><title>AI Benchmark Digest — 2026-06-03</title><id>https://aibenchmarks.dev/digest/2026-06-03</id><updated>2026-06-03T08:25:40.519214+00:00</updated><link href="https://aibenchmarks.dev/#/digest" /><author><name>AI Benchmark Hub</name></author><content type="html">&lt;h2&gt;Daily&lt;/h2&gt;
&lt;h3&gt;Top-10 New Scores (2)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5 Pro&lt;/strong&gt; on IUMB: 100.0 (#2)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3 Deep Think&lt;/strong&gt; on IUMB: 87.5 (#6)&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;New #1 Leaders (4)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;MathArena - Kangaroo 2025 Levels 11-12&lt;/strong&gt;: Claude Opus 4.8 (Thinking) (100.0) beat GPT-5.4 (xHigh) by 1.67&lt;/li&gt;&lt;li&gt;&lt;strong&gt;MathArena - APEX 2025&lt;/strong&gt;: Claude Opus 4.8 (Thinking) (81.25) beat GPT-5.5 (xHigh) by 1.04&lt;/li&gt;&lt;li&gt;&lt;strong&gt;MathArena - Kangaroo 2025 Levels 7-8&lt;/strong&gt;: Claude Opus 4.8 (Thinking) (96.67) beat GPT-5.4 (xHigh) by 0.84&lt;/li&gt;&lt;li&gt;&lt;strong&gt;MathArena - AIME 2026&lt;/strong&gt;: Claude Opus 4.8 (Thinking) (100.0) beat GPT-5.4 (xHigh) by 0.83&lt;/li&gt;&lt;/ul&gt;</content><summary>AI Benchmark Digest — 2026-06-03

=== DAILY ===
NEW SCORES FROM TOP-10 MODELS (2)
  - GPT-5.5 Pro on IUMB: 100.0 Score (%) (#2/55)
  - Gemini 3 Deep Think on IUMB: 87.5 Score (%) (#6/55)

NEW #1 LEADERS (4)
  - MathArena - Kangaroo 2025 Levels 11-12 (Accuracy (%)): Claude Opus 4.8 (Thinking) (100.0)</summary></entry><entry><title>AI Benchmark Digest — 2026-06-02</title><id>https://aibenchmarks.dev/digest/2026-06-02</id><updated>2026-06-02T08:19:29.198019+00:00</updated><link href="https://aibenchmarks.dev/#/digest" /><author><name>AI Benchmark Hub</name></author><content type="html">&lt;h2&gt;Daily&lt;/h2&gt;
&lt;h3&gt;New Benchmarks (1)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;GIM&lt;/strong&gt; (IRT ability (theta)): leader GPT-5.4 Pro (High) (2.16), 46 models&lt;br&gt;&lt;span&gt;Grounded Integration Measure from Meta FAIR: 820 multimodal and text-grounded problems testing integrated reasoning across quantitative, spatial, language, world-knowledge, and document tasks. Scores are reported as IRT ability on GIM-820.&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;Top-10 New Scores (2)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5 (xHigh)&lt;/strong&gt; on IMO-Bench: 71.9 (#4)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5 Pro (xHigh)&lt;/strong&gt; on IMO-Bench: 88.1 (#2)&lt;/li&gt;&lt;/ul&gt;</content><summary>AI Benchmark Digest — 2026-06-02

=== DAILY ===
NEW BENCHMARKS (1)
  - GIM (IRT ability (theta)): leader GPT-5.4 Pro (High) (2.16), 46 models
      Grounded Integration Measure from Meta FAIR: 820 multimodal and text-grounded problems testing integrated reasoning across quantitative, spatial, langua</summary></entry><entry><title>AI Benchmark Digest — 2026-06-01</title><id>https://aibenchmarks.dev/digest/2026-06-01</id><updated>2026-06-01T08:29:45.265204+00:00</updated><link href="https://aibenchmarks.dev/#/digest" /><author><name>AI Benchmark Hub</name></author><content type="html">&lt;h2&gt;Daily&lt;/h2&gt;
&lt;h3&gt;New #1 Leaders (3)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;EQ-Bench Creative Writing v3&lt;/strong&gt;: Claude Opus 4.7 (2050.8) beat GPT-5.4 by 144.8&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Design Arena (Data Viz)&lt;/strong&gt;: GLM-5.1 (1367.0) beat Claude Opus 4.7 (Thinking) by 23.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Chatbot Arena (Image-to-Video)&lt;/strong&gt;: Grok 1.5 (1473.0) beat dreamina-seedance-2.0-720p by 11.0&lt;/li&gt;&lt;/ul&gt;</content><summary>AI Benchmark Digest — 2026-06-01

=== DAILY ===
NEW #1 LEADERS (3)
  - EQ-Bench Creative Writing v3 (Elo): Claude Opus 4.7 (2050.8) beat GPT-5.4 (1906.0) by 144.8
  - Design Arena (Data Viz) (Elo): GLM-5.1 (1367.0) beat Claude Opus 4.7 (Thinking) (1344.0) by 23.0
  - Chatbot Arena (Image-to-Video) (</summary></entry><entry><title>AI Benchmark Digest — 2026-05-30</title><id>https://aibenchmarks.dev/digest/2026-05-30</id><updated>2026-05-30T07:49:09.779753+00:00</updated><link href="https://aibenchmarks.dev/#/digest" /><author><name>AI Benchmark Hub</name></author><content type="html">&lt;h2&gt;Daily&lt;/h2&gt;
&lt;h3&gt;Top-10 New Scores (5)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.8 (Adaptive Reasoning, Max Effort)&lt;/strong&gt; on UGI - Natural Intelligence: 65.39 (#30)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.8 (Adaptive Reasoning, Max Effort)&lt;/strong&gt; on UGI - Willingness (W/10): 2.2 (#1094)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.8 (Adaptive Reasoning, Max Effort)&lt;/strong&gt; on UGI - Writing: 65.88 (#34)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.8 (Adaptive Reasoning, Max Effort)&lt;/strong&gt; on UGI Leaderboard: 52.64 (#69)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-5.4 (xHigh)&lt;/strong&gt; on Creative Writing (Lechmazur): 3.4 (#2)&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;New #1 Leaders (2)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Bullshit Benchmark&lt;/strong&gt;: Claude Opus 4.8 (96.4) beat Claude Sonnet 4.6 by 1.9&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Creative Writing (Lechmazur)&lt;/strong&gt;: GPT-5.5 (xHigh) (3.5) beat GPT-5.5 (Thinking, xHigh) by 0.3&lt;/li&gt;&lt;/ul&gt;</content><summary>AI Benchmark Digest — 2026-05-30

=== DAILY ===
NEW SCORES FROM TOP-10 MODELS (5)
  - Claude Opus 4.8 (Adaptive Reasoning, Max Effort) on UGI - Natural Intelligence: 65.39 NatInt Score (#30/1247)
  - Claude Opus 4.8 (Adaptive Reasoning, Max Effort) on UGI - Willingness (W/10): 2.2 W/10 Score (#1094/</summary></entry><entry><title>AI Benchmark Digest — 2026-05-29</title><id>https://aibenchmarks.dev/digest/2026-05-29</id><updated>2026-05-29T08:06:41.324282+00:00</updated><link href="https://aibenchmarks.dev/#/digest" /><author><name>AI Benchmark Hub</name></author><content type="html">&lt;h2&gt;Daily&lt;/h2&gt;
&lt;h3&gt;New Benchmarks (1)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;DeepSWE&lt;/strong&gt; (Pass@1 (%)): leader GPT-5.5 (xHigh) (70.0), 12 models&lt;br&gt;&lt;span&gt;DataCurve benchmark measuring frontier coding agents on original, long-horizon software engineering tasks. Reports pass rates for model configurations on realistic repository work.&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;New Models (1)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.8&lt;/strong&gt; — ELO 1801, #52&lt;ul&gt;&lt;li&gt;Clerk LLM Leaderboard: 91.3 (#1/19)&lt;/li&gt;&lt;li&gt;Vellum - HumanEval: 88.6 (#1/36)&lt;/li&gt;&lt;li&gt;Vellum - Humanity's Last Exam: 57.9 (#1/20)&lt;/li&gt;&lt;li&gt;LLM Stats (DeepSearchQA): 93.1 (#1/6)&lt;/li&gt;&lt;li&gt;LLM Stats (Include): 87.6 (#1/30)&lt;/li&gt;&lt;li&gt;LLM Stats (OSWorld-Verified): 83.4 (#1/14)&lt;/li&gt;&lt;li&gt;LLM Stats (ScreenSpot Pro): 87.9 (#1/22)&lt;/li&gt;&lt;li&gt;LLM Stats (Toolathlon): 59.9 (#1/20)&lt;/li&gt;&lt;li&gt;FrontierSWE: 83.0 (#1/11)&lt;/li&gt;&lt;li&gt;Vals AI (Vals Index): 70.17 (#1/20)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;Top-10 New Scores (2)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5 (High)&lt;/strong&gt; on WebDev Arena: 1478.93 (#16)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5 (xHigh)&lt;/strong&gt; on WebDev Arena: 1504.74 (#12)&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;New #1 Leaders (16)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;AA GDPval&lt;/strong&gt;: Claude Opus 4.8 (Adaptive Reasoning, Max Effort) (1889.8) beat GPT-5.5 (xHigh) by 120.5&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Vellum - Humanity's Last Exam&lt;/strong&gt;: Claude Opus 4.8 (57.9) beat Gemini 3 Pro by 12.1&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Clerk LLM Leaderboard&lt;/strong&gt;: Claude Opus 4.8 (91.3) beat GPT-5.4 by 11.8&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Vals AI Vibe Code Bench&lt;/strong&gt;: Claude Opus 4.8 (82.72) beat Claude Opus 4.7 by 11.72&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Epoch AI - Apex Agents&lt;/strong&gt;: gemini-3.5-flash_unknown (49.6) beat GPT-5.5 (xHigh) by 11.2&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLM Stats (OSWorld-Verified)&lt;/strong&gt;: Claude Opus 4.8 (83.4) beat Claude Mythos Preview by 3.8&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLM Stats (Toolathlon)&lt;/strong&gt;: Claude Opus 4.8 (59.9) beat Gemini 3.5 Flash by 3.4&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Vals AI Multimodal Index&lt;/strong&gt;: Claude Opus 4.8 (70.71) beat GPT-5.5 by 2.94&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Vals AI (Vals Index)&lt;/strong&gt;: Claude Opus 4.8 (70.17) beat GPT-5.5 by 2.55&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLM Stats (DeepSearchQA)&lt;/strong&gt;: Claude Opus 4.8 (93.1) beat Claude Opus 4.6 by 1.8&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLM Stats (ScreenSpot Pro)&lt;/strong&gt;: Claude Opus 4.8 (87.9) beat GPT-5.2 by 1.6&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLM Stats (Include)&lt;/strong&gt;: Claude Opus 4.8 (87.6) beat Qwen 3.7 Max by 1.4&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Artificial Analysis Intelligence Index&lt;/strong&gt;: Claude Opus 4.8 (Adaptive Reasoning, Max Effort) (61.44) beat GPT-5.5 (xHigh) by 1.2&lt;/li&gt;&lt;li&gt;&lt;strong&gt;PinchBench&lt;/strong&gt;: Claude Opus 4.8 Fast (94.49) beat Qwen Max by 1.05&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Humanity's Last Exam&lt;/strong&gt;: Claude Opus 4.8 (Adaptive Reasoning, Max Effort) (45.74) beat Gemini 3.1 Pro (Preview) by 1.02&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Vellum - HumanEval&lt;/strong&gt;: Claude Opus 4.8 (88.6) beat Claude Opus 4.7 by 1.0&lt;/li&gt;&lt;/ul&gt;</content><summary>AI Benchmark Digest — 2026-05-29

=== DAILY ===
NEW BENCHMARKS (1)
  - DeepSWE (Pass@1 (%)): leader GPT-5.5 (xHigh) (70.0), 12 models
      DataCurve benchmark measuring frontier coding agents on original, long-horizon software engineering tasks. Reports pass rates for model configurations on realis</summary></entry><entry><title>AI Benchmark Digest — 2026-05-28</title><id>https://aibenchmarks.dev/digest/2026-05-28</id><updated>2026-05-28T08:13:42.023730+00:00</updated><link href="https://aibenchmarks.dev/#/digest" /><author><name>AI Benchmark Hub</name></author><content type="html">&lt;h2&gt;Daily&lt;/h2&gt;
&lt;h3&gt;Top-10 New Scores (1)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5 (xHigh)&lt;/strong&gt; on SWE-rebench: 62.73 (#1)&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;New #1 Leaders (2)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Kaggle FACTS Grounding&lt;/strong&gt;: Gemma 4 26B A4B (80.87) beat GPT-5.2 by 4.7&lt;/li&gt;&lt;li&gt;&lt;strong&gt;PinchBench&lt;/strong&gt;: Qwen Max (93.44) beat Grok 0.1 by 1.37&lt;/li&gt;&lt;/ul&gt;</content><summary>AI Benchmark Digest — 2026-05-28

=== DAILY ===
NEW SCORES FROM TOP-10 MODELS (1)
  - GPT-5.5 (xHigh) on SWE-rebench: 62.73 Resolved (%) (#1/82)

NEW #1 LEADERS (2)
  - Kaggle FACTS Grounding (Score (%)): Gemma 4 26B A4B (80.87) beat GPT-5.2 (76.17) by 4.7
  - PinchBench (Success Rate (%)): Qwen Max</summary></entry><entry><title>AI Benchmark Digest — 2026-05-27</title><id>https://aibenchmarks.dev/digest/2026-05-27</id><updated>2026-05-27T08:20:58.056719+00:00</updated><link href="https://aibenchmarks.dev/#/digest" /><author><name>AI Benchmark Hub</name></author><content type="html">&lt;h2&gt;Daily&lt;/h2&gt;
&lt;h3&gt;Top-10 New Scores (1)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;GPT-5.4 (xHigh)&lt;/strong&gt; on Creative Writing (Lechmazur): 3.2 (#2)&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;New #1 Leaders (11)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;LLM Chess (Saplin)&lt;/strong&gt;: GPT-5.5 (Medium) (1532.2) beat Gemini 3.1 Pro by 20.8&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLM Stats (PolyMATH)&lt;/strong&gt;: Qwen 3.7 Max (86.5) beat Qwen 3.6 Plus by 9.1&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLM Stats (MCP-Mark)&lt;/strong&gt;: Qwen 3.7 Max (60.8) beat Kimi K2.6 by 4.9&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLM Stats (NL2Repo)&lt;/strong&gt;: Qwen 3.7 Max (47.2) beat GLM-5.1 by 4.5&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLM Stats (MMLU-ProX)&lt;/strong&gt;: Qwen 3.7 Max (87.0) beat Qwen 3.6 Plus by 2.3&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLM Stats (HMMT Feb 26)&lt;/strong&gt;: Qwen 3.7 Max (97.1) beat DeepSeek V4 Pro (Max) by 1.9&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLM Stats (MAXIFE)&lt;/strong&gt;: Qwen 3.7 Max (89.2) beat Qwen 3.6 Plus by 1.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLM Stats (Include)&lt;/strong&gt;: Qwen 3.7 Max (86.2) beat Qwen 3.5 397B A17B by 0.6&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLM Stats (IMO-AnswerBench)&lt;/strong&gt;: Qwen 3.7 Max (90.0) beat DeepSeek V4 Pro (Max) by 0.2&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Creative Writing (Lechmazur)&lt;/strong&gt;: GPT-5.5 (Thinking, xHigh) (3.2) beat GPT-5.5 by 0.2&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLM Stats (MMLU-Redux)&lt;/strong&gt;: Qwen 3.7 Max (95.0) beat Qwen 3.5 397B A17B by 0.1&lt;/li&gt;&lt;/ul&gt;</content><summary>AI Benchmark Digest — 2026-05-27

=== DAILY ===
NEW SCORES FROM TOP-10 MODELS (1)
  - GPT-5.4 (xHigh) on Creative Writing (Lechmazur): 3.2 Mean Score (#2/25)

NEW #1 LEADERS (11)
  - LLM Chess (Saplin) (ELO): GPT-5.5 (Medium) (1532.2) beat Gemini 3.1 Pro (1511.4) by 20.8
  - LLM Stats (PolyMATH) (Sc</summary></entry><entry><title>AI Benchmark Digest — 2026-05-25</title><id>https://aibenchmarks.dev/digest/2026-05-25</id><updated>2026-05-25T08:26:35.093083+00:00</updated><link href="https://aibenchmarks.dev/#/digest" /><author><name>AI Benchmark Hub</name></author><content type="html">&lt;h2&gt;Daily&lt;/h2&gt;
&lt;h3&gt;New Benchmarks (6)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;LLMEval-Logic Base&lt;/strong&gt; (Accuracy (%)): leader Seed 2.0 Pro (Thinking) (75.5), 14 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLMEval-Logic Hard&lt;/strong&gt; (Accuracy (%)): leader Gemini 3.1 Pro (Thinking) (37.5), 14 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLMEval-Logic Hard Sub-Q&lt;/strong&gt; (Accuracy (%)): leader Claude Opus 4.6 (Thinking) (76.6), 14 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLMEval-Logic Formalization Free&lt;/strong&gt; (Accuracy (%)): leader Gemini 3.1 Pro (Thinking) (45.1), 14 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLMEval-Logic Formalization Fixed&lt;/strong&gt; (Accuracy (%)): leader GPT-5.4 Pro (No-Think) (60.2), 14 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;ExploitBench v8-bench&lt;/strong&gt; (Mean Capability (%)): leader Claude Mythos Preview (69.0), 9 models&lt;br&gt;&lt;span&gt;V8 exploitation ladder benchmark measuring how far AI agents climb from code reachability through crash reproduction, exploit primitives, and arbitrary code execution. Reports mean capability across 41 V8 bug environments.&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;</content><summary>AI Benchmark Digest — 2026-05-25

=== DAILY ===
NEW BENCHMARKS (6)
  - LLMEval-Logic Base (Accuracy (%)): leader Seed 2.0 Pro (Thinking) (75.5), 14 models
  - LLMEval-Logic Hard (Accuracy (%)): leader Gemini 3.1 Pro (Thinking) (37.5), 14 models
  - LLMEval-Logic Hard Sub-Q (Accuracy (%)): leader Cla</summary></entry><entry><title>AI Benchmark Digest — 2026-05-24</title><id>https://aibenchmarks.dev/digest/2026-05-24</id><updated>2026-05-24T07:56:34.401567+00:00</updated><link href="https://aibenchmarks.dev/#/digest" /><author><name>AI Benchmark Hub</name></author><content type="html">&lt;h2&gt;Daily&lt;/h2&gt;
&lt;h3&gt;New Benchmarks (14)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;NanoGPT-Bench&lt;/strong&gt; (% of Human Progress Recovered): leader Claude Opus 4.6 (9.3), 2 models&lt;br&gt;&lt;span&gt;Autonomous research benchmark built on the NanoGPT Speedrun, measuring how much of five months of human pretraining-speedup progress coding agents recover under a fixed H100 compute budget.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;CursorBench 3.1&lt;/strong&gt; (Score (%)): leader Claude Opus 4.7 (64.8), 7 models&lt;br&gt;&lt;span&gt;Cursor benchmark of ambiguous, multi-file coding tasks from real Cursor sessions, with models scored by task success percentage and average cost per task.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;SMDD-Bench&lt;/strong&gt; (Pass Rate (%)): leader GPT-5.4 (Medium) (40.2), 7 models&lt;br&gt;&lt;span&gt;Small molecule drug design agent benchmark with sandboxed Python, Boltz structure prediction, and ADMET tooling. Measures pass rate across 502 computationally verifiable chemistry tasks.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;SMDD-Bench Diversity&lt;/strong&gt; (Avg Successful): leader Claude Sonnet 4.6 (8.4), 7 models&lt;br&gt;&lt;span&gt;SMDD-Bench diversity slice measuring whether agents generate multiple distinct, novel, successful molecule designs across repeated Lead Optimization rollouts.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Blueprint-Bench 2&lt;/strong&gt; (Connectivity Similarity Score): leader GPT 5.5 (0.362), 12 models&lt;br&gt;&lt;span&gt;Andon Labs spatial reasoning benchmark where agents convert apartment photographs into 2D floor plans, scored by normalized connectivity similarity against ground truth layouts.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;PACT (Lechmazur)&lt;/strong&gt; (CMS Points): leader GPT-5.5 (high) (59.0), 25 models&lt;br&gt;&lt;span&gt;Pairwise Auction Conversation Testbed for multi-round buyer-seller bargaining. LLMs negotiate over 20 rounds with hidden private values, scored by Composite Model Score from head-to-head surplus capture.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;FormationEval&lt;/strong&gt; (Accuracy (%)): leader gemini-3-pro-preview (99.8), 72 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Chinese Classical Bench&lt;/strong&gt; (Average Score (%)): leader claude-opus-4-7 (66.21), 10 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Chinese Classical Bench - Translate Judge&lt;/strong&gt; (Score (%)): leader claude-opus-4-7-thinking (80.2), 10 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Chinese Classical Bench - Punctuate Punct F1&lt;/strong&gt; (Score (%)): leader claude-opus-4-7 (80.02), 10 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Chinese Classical Bench - Char-Gloss Judge&lt;/strong&gt; (Score (%)): leader claude-opus-4-7-thinking (73.6), 10 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Chinese Classical Bench - Idiom-Source Book EM&lt;/strong&gt; (Score (%)): leader deepseek-3.2 (74.0), 10 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Chinese Classical Bench - Fill-In Exact&lt;/strong&gt; (Score (%)): leader claude-opus-4-7-thinking (88.0), 10 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Chinese Classical Bench - Compress Efficiency&lt;/strong&gt; (Score (%)): leader deepseek-3.2 (16.32), 9 models&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;Top-10 New Scores (1)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.1 Pro (High)&lt;/strong&gt; on CLBench: 20.8 (#8)&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;New #1 Leaders (5)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Evals for Every Language&lt;/strong&gt;: Gemini 3.1 Pro (69.11) beat Gemini 2.5 Flash by 6.52&lt;/li&gt;&lt;li&gt;&lt;strong&gt;CLBench&lt;/strong&gt;: GPT-5.4 (xHigh) (27.9) beat GPT-5.1 (High) by 4.2&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LiveBench Logic With Navigation&lt;/strong&gt;: Qwen Max (84.0) beat Claude Opus 4.6 (Thinking) by 4.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Spider 2.0-Lite&lt;/strong&gt;: DivSkill-SQL (73.13) beat SOMA-SQL by 1.11&lt;/li&gt;&lt;li&gt;&lt;strong&gt;PinchBench&lt;/strong&gt;: Grok 0.1 (92.07) beat Claude Opus 4.7 by 0.49&lt;/li&gt;&lt;/ul&gt;</content><summary>AI Benchmark Digest — 2026-05-24

=== DAILY ===
NEW BENCHMARKS (14)
  - NanoGPT-Bench (% of Human Progress Recovered): leader Claude Opus 4.6 (9.3), 2 models
      Autonomous research benchmark built on the NanoGPT Speedrun, measuring how much of five months of human pretraining-speedup progress cod</summary></entry><entry><title>AI Benchmark Digest — 2026-05-23</title><id>https://aibenchmarks.dev/digest/2026-05-23</id><updated>2026-05-23T07:20:10.541511+00:00</updated><link href="https://aibenchmarks.dev/#/digest" /><author><name>AI Benchmark Hub</name></author><content type="html">&lt;h2&gt;Daily&lt;/h2&gt;
&lt;h3&gt;New #1 Leaders (1)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;OSWorld&lt;/strong&gt;: Opus 4.7 (83.64) beat Holo3-35B-A3B by 1.08&lt;/li&gt;&lt;/ul&gt;</content><summary>AI Benchmark Digest — 2026-05-23

=== DAILY ===
NEW #1 LEADERS (1)
  - OSWorld (Success Rate (%)): Opus 4.7 (83.64) beat Holo3-35B-A3B (82.56) by 1.08
</summary></entry><entry><title>AI Benchmark Digest — 2026-05-22</title><id>https://aibenchmarks.dev/digest/2026-05-22</id><updated>2026-05-22T07:36:15.662013+00:00</updated><link href="https://aibenchmarks.dev/#/digest" /><author><name>AI Benchmark Hub</name></author><content type="html">&lt;h2&gt;Daily&lt;/h2&gt;
&lt;h3&gt;Top-10 New Scores (1)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5 (High)&lt;/strong&gt; on Sycophancy (Lechmazur): 3.5 (#11)&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;New #1 Leaders (2)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;UGI - Writing&lt;/strong&gt;: gemini-3.5-flash (thinking_level=medium) (72.54) beat gemini-3.1-pro-preview (thinking_level=low) by 0.39&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Arabic Broad Leaderboard&lt;/strong&gt;: gemini-3.5-flash (9.253) beat gemini-3-pro-preview by 0.05&lt;/li&gt;&lt;/ul&gt;</content><summary>AI Benchmark Digest — 2026-05-22

=== DAILY ===
NEW SCORES FROM TOP-10 MODELS (1)
  - GPT-5.5 (High) on Sycophancy (Lechmazur): 3.5 Sycophancy rate % (lower is better) (#11/31)

NEW #1 LEADERS (2)
  - UGI - Writing (Writing Score): gemini-3.5-flash (thinking_level=medium) (72.54) beat gemini-3.1-pro</summary></entry><entry><title>AI Benchmark Digest — 2026-05-21</title><id>https://aibenchmarks.dev/digest/2026-05-21</id><updated>2026-05-21T07:40:34.045646+00:00</updated><link href="https://aibenchmarks.dev/#/digest" /><author><name>AI Benchmark Hub</name></author><content type="html">&lt;h2&gt;Daily&lt;/h2&gt;
&lt;h3&gt;Top-10 New Scores (1)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on WeirdML: 62.64 (#17)&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;New #1 Leaders (3)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Kaggle Game Arena Poker (Heads Up)&lt;/strong&gt;: GPT-5.5 (73.93) beat GPT-5.2 by 33.93&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA APEX-Agents&lt;/strong&gt;: Gemini 3.5 Flash (high) (47.05) beat GPT-5.5 (xhigh) by 9.37&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LA Leaderboard&lt;/strong&gt;: Qwen2.5-14B-Instruct-GPTQ-Int8 (63.6) beat gemma-2-9b-it by 0.27&lt;/li&gt;&lt;/ul&gt;</content><summary>AI Benchmark Digest — 2026-05-21

=== DAILY ===
NEW SCORES FROM TOP-10 MODELS (1)
  - Gemini 3.5 Flash (High) on WeirdML: 62.64 Average Score (#17/124)

NEW #1 LEADERS (3)
  - Kaggle Game Arena Poker (Heads Up) (Mean BB/100): GPT-5.5 (73.93) beat GPT-5.2 (40.0) by 33.93
  - AA APEX-Agents (Pass@1 (%</summary></entry><entry><title>AI Benchmark Digest — 2026-05-20</title><id>https://aibenchmarks.dev/digest/2026-05-20</id><updated>2026-05-20T07:43:37.557151+00:00</updated><link href="https://aibenchmarks.dev/#/digest" /><author><name>AI Benchmark Hub</name></author><content type="html">&lt;h2&gt;Daily&lt;/h2&gt;
&lt;h3&gt;New Models (1)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; — ELO 1942, #9&lt;ul&gt;&lt;li&gt;AA MMMU-Pro: 84.28 (#1/190)&lt;/li&gt;&lt;li&gt;SEAL - MCP Atlas: 83.6 (#1/21)&lt;/li&gt;&lt;li&gt;AA Omniscience: 22.68 (#3/393)&lt;/li&gt;&lt;li&gt;AA Omniscience - Law: 57.4 (#4/393)&lt;/li&gt;&lt;li&gt;AA Omniscience - Software Engineering (SWE) - PHP: 84.0 (#4/393)&lt;/li&gt;&lt;li&gt;AA Humanity's Last Exam: 40.96 (#5/484)&lt;/li&gt;&lt;li&gt;AA GPQA Diamond: 92.22 (#6/488)&lt;/li&gt;&lt;li&gt;AA Omniscience - Science, Engineering &amp; Mathematics: 50.1 (#6/393)&lt;/li&gt;&lt;li&gt;AA GDPval: 1655.7 (#7/365)&lt;/li&gt;&lt;li&gt;AA Omniscience - Humanities &amp; Social Sciences: 52.3 (#7/393)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;Top-10 New Scores (34)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5 (High)&lt;/strong&gt; on Multi-turn Debate (Lechmazur): 1583.6 (#5)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on AA CritPt: 13.14 (#8)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on AA GDPval: 1655.7 (#7)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on AA GPQA Diamond: 92.22 (#6)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on AA Humanity's Last Exam: 40.96 (#5)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on AA IFBench: 76.33 (#17)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on AA Long Context Reasoning: 69.33 (#27)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on AA Omniscience: 22.68 (#3)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on AA Omniscience - Business: 45.8 (#8)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on AA Omniscience - Health: 40.2 (#14)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on AA Omniscience - Humanities &amp; Social Sciences: 52.3 (#7)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on AA Omniscience - Law: 57.4 (#4)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on AA Omniscience - Science, Engineering &amp; Mathematics: 50.1 (#6)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on AA Omniscience - Software Engineering (SWE): 65.5 (#16)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on AA Omniscience - Software Engineering (SWE) - C: 80.0 (#18)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on AA Omniscience - Software Engineering (SWE) - Dart: 60.0 (#14)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on AA Omniscience - Software Engineering (SWE) - Go: 50.0 (#32)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on AA Omniscience - Software Engineering (SWE) - HTML: 72.0 (#17)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on AA Omniscience - Software Engineering (SWE) - Java: 51.0 (#16)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on AA Omniscience - Software Engineering (SWE) - JavaScript: 71.82 (#14)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on AA Omniscience - Software Engineering (SWE) - Julia: 60.0 (#13)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on AA Omniscience - Software Engineering (SWE) - Kotlin: 56.0 (#22)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on AA Omniscience - Software Engineering (SWE) - PHP: 84.0 (#4)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on AA Omniscience - Software Engineering (SWE) - Python: 61.0 (#24)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on AA Omniscience - Software Engineering (SWE) - R: 56.0 (#18)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on AA Omniscience - Software Engineering (SWE) - Rust: 80.0 (#8)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on AA Omniscience - Software Engineering (SWE) - Swift: 72.0 (#20)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on AA Omniscience - Software Engineering (SWE) - TypeScript: 67.78 (#16)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on AA SciCode: 53.12 (#11)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on AA TAU-2 Bench: 95.32 (#20)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on AA Terminal-Bench Hard: 40.91 (#36)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on ARC-AGI-1: 92.5 (#16)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on ARC-AGI-2: 72.08 (#12)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 3.5 Flash (High)&lt;/strong&gt; on Artificial Analysis Intelligence Index: 55.33 (#8)&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;New #1 Leaders (5)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;LLM Stats (GDPval-AA)&lt;/strong&gt;: Gemini 3.5 Flash (165600.0) beat Claude Sonnet 4.6 by 2300.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLM Stats (MCP Atlas)&lt;/strong&gt;: Gemini 3.5 Flash (83.6) beat Claude Opus 4.7 by 6.3&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA MMMU-Pro&lt;/strong&gt;: Gemini 3.5 Flash (high) (84.28) beat Gemini 3.1 Pro Preview by 1.85&lt;/li&gt;&lt;li&gt;&lt;strong&gt;SEAL - MCP Atlas&lt;/strong&gt;: gemini-3.5-flash (high) (83.6) beat Muse Spark by 1.4&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLM Stats (Toolathlon)&lt;/strong&gt;: Gemini 3.5 Flash (56.5) beat GPT-5.5 by 0.9&lt;/li&gt;&lt;/ul&gt;</content><summary>AI Benchmark Digest — 2026-05-20

=== DAILY ===
NEW MODELS (1)
  - Gemini 3.5 Flash (High) — ELO 1942, #9/609 (above: Claude Opus 4.7 (Thinking), below: GPT-5.5 (High))
      AA MMMU-Pro: 84.28 (#1/190)
      SEAL - MCP Atlas: 83.6 (#1/21)
      AA Omniscience: 22.68 (#3/393)
      AA Omniscience - </summary></entry><entry><title>AI Benchmark Digest — 2026-05-17</title><id>https://aibenchmarks.dev/digest/2026-05-17</id><updated>2026-05-17T08:02:54.093472+00:00</updated><link href="https://aibenchmarks.dev/#/digest" /><author><name>AI Benchmark Hub</name></author><content type="html">&lt;h2&gt;Daily&lt;/h2&gt;
&lt;h3&gt;New #1 Leaders (1)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;OpenClawProBench&lt;/strong&gt;: intern-s2-preview (76.7) beat Sensenova 6.7 Flash Lite by 3.0&lt;/li&gt;&lt;/ul&gt;
&lt;hr/&gt;
&lt;h2&gt;Weekly&lt;/h2&gt;
&lt;h3&gt;Top-10 New Scores (3)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.7 (Thinking)&lt;/strong&gt; on SEAL Showdown: 1115.7 (#12)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.7 (Thinking)&lt;/strong&gt; on WeirdML: 75.45 (#8)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5 (xHigh)&lt;/strong&gt; on Chatbot Arena (Code): 1501.0 (#9)&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;New #1 Leaders (16)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;MathArena - ARXIVLEAN March&lt;/strong&gt;: AlephProver (34.15) beat Aristotle by 17.08&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenCompass Reasoning - Common&lt;/strong&gt;: Doubao-Seed-2-0-Pro-260215 (high) (82.1) beat Gemini-3-Pro-Preview by 8.5&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenCompass Math - College&lt;/strong&gt;: Doubao-Seed-2-0-Pro-260215 (high) (83.8) beat Kimi-K2.5 by 7.3&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenClawProBench&lt;/strong&gt;: intern-s2-preview (76.7) beat qwen3.5-397b-a17b by 6.3&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Tau3-Bench Banking_Knowledge&lt;/strong&gt;: GPT-5.5 (37.4) beat Distyl ButtonAgent by 6.2&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenCompass Knowledge - Social Science&lt;/strong&gt;: Gemini-3.1-Pro-Preview (97.5) beat Gemini-3-Pro-Preview by 4.3&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenCompass LLM - Math&lt;/strong&gt;: Doubao-Seed-2-0-Pro-260215 (high) (77.3) beat Qwen3-Max-2026-01-23 by 4.1&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenCompass LLM - Reasoning&lt;/strong&gt;: Doubao-Seed-2-0-Pro-260215 (high) (65.2) beat Gemini-3-Pro-Preview by 3.7&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenCompass Math - Competition&lt;/strong&gt;: Kimi-K2.6 (72.1) beat Qwen3-Max-2026-01-23 by 2.1&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenCompass Reasoning - Academic&lt;/strong&gt;: GPT-5.4-2026-03-05 (high) (52.0) beat GPT-5.2-2025-12-11 (high) by 1.5&lt;/li&gt;&lt;li&gt;&lt;strong&gt;VisuLogic&lt;/strong&gt;: PEREA-1.0new (52.8) beat Human by 1.4&lt;/li&gt;&lt;li&gt;&lt;strong&gt;WeirdML&lt;/strong&gt;: gpt-5.5 (xhigh) (84.91) beat gpt-5.5 (high) by 1.01&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GAIA&lt;/strong&gt;: Co-Sight Pro v1.0.1 (93.02) beat OPS-Agentic-Search by 0.66&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenCompass Knowledge - Engineering&lt;/strong&gt;: GPT-5.4-2026-03-05 (high) (96.2) beat Gemini-3-Pro-Preview by 0.4&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA TAU-2 Bench&lt;/strong&gt;: JT-35B-Flash (99.12) beat GLM-4.7-Flash (Reasoning) by 0.32&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AISI Cyber TLO 10M&lt;/strong&gt;: GPT-5.5 (10.0) beat Claude Opus 4.6 by 0.2&lt;/li&gt;&lt;/ul&gt;</content><summary>AI Benchmark Digest — 2026-05-17

=== DAILY ===
NEW #1 LEADERS (1)
  - OpenClawProBench (Overall Score (%)): intern-s2-preview (76.7) beat Sensenova 6.7 Flash Lite (73.7) by 3.0

=== WEEKLY ===
NEW SCORES FROM TOP-10 MODELS (3)
  - Claude Opus 4.7 (Thinking) on SEAL Showdown: 1115.7 Arena Score (#12</summary></entry><entry><title>AI Benchmark Digest — 2026-05-16</title><id>https://aibenchmarks.dev/digest/2026-05-16</id><updated>2026-05-16T07:15:27.727063+00:00</updated><link href="https://aibenchmarks.dev/#/digest" /><author><name>AI Benchmark Hub</name></author><content type="html">&lt;h2&gt;Daily&lt;/h2&gt;
&lt;h3&gt;Top-10 New Scores (1)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5 (xHigh)&lt;/strong&gt; on Chatbot Arena (Code): 1501.0 (#9)&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;New #1 Leaders (2)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;MathArena - ARXIVLEAN March&lt;/strong&gt;: AlephProver (34.15) beat Aristotle by 17.08&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GAIA&lt;/strong&gt;: Co-Sight Pro v1.0.1 (93.02) beat OPS-Agentic-Search by 0.66&lt;/li&gt;&lt;/ul&gt;</content><summary>AI Benchmark Digest — 2026-05-16

=== DAILY ===
NEW SCORES FROM TOP-10 MODELS (1)
  - GPT-5.5 (xHigh) on Chatbot Arena (Code): 1501.0 Elo (#9/79)

NEW #1 LEADERS (2)
  - MathArena - ARXIVLEAN March (Accuracy (%)): AlephProver (34.15) beat Aristotle (17.07) by 17.08
  - GAIA (Accuracy (%)): Co-Sight </summary></entry><entry><title>AI Benchmark Digest — 2026-05-14</title><id>https://aibenchmarks.dev/digest/2026-05-14</id><updated>2026-05-14T07:26:43.169192+00:00</updated><link href="https://aibenchmarks.dev/#/digest" /><author><name>AI Benchmark Hub</name></author><content type="html">&lt;h2&gt;Daily&lt;/h2&gt;
&lt;h3&gt;New Models (4)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Doubao-Seed-2-0-Pro-260215 (High)&lt;/strong&gt; — ELO 1781, #73&lt;ul&gt;&lt;li&gt;OpenCompass LLM - Reasoning: 65.2 (#1/23)&lt;/li&gt;&lt;li&gt;OpenCompass LLM - Math: 77.3 (#1/23)&lt;/li&gt;&lt;li&gt;OpenCompass Knowledge - Humanities: 95.0 (#1/23)&lt;/li&gt;&lt;li&gt;OpenCompass Reasoning - Common: 82.1 (#1/23)&lt;/li&gt;&lt;li&gt;OpenCompass Math - College: 83.8 (#1/23)&lt;/li&gt;&lt;li&gt;OpenCompass LLM - Language: 77.3 (#3/23)&lt;/li&gt;&lt;li&gt;OpenCompass Language - Creation: 77.1 (#3/23)&lt;/li&gt;&lt;li&gt;OpenCompass Knowledge - Science: 94.6 (#3/23)&lt;/li&gt;&lt;li&gt;OpenCompass LLM - Agent: 44.2 (#4/23)&lt;/li&gt;&lt;li&gt;OpenCompass Language - NLP: 69.6 (#4/23)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Doubao-Seed-2-0-Lite-260215 (High)&lt;/strong&gt; — ELO 1741, #103&lt;ul&gt;&lt;li&gt;OpenCompass Reasoning - Common: 78.1 (#2/23)&lt;/li&gt;&lt;li&gt;OpenCompass Language - Creation: 77.1 (#4/23)&lt;/li&gt;&lt;li&gt;OpenCompass LLM - Language: 74.4 (#6/23)&lt;/li&gt;&lt;li&gt;OpenCompass LLM - Agent: 42.4 (#6/23)&lt;/li&gt;&lt;li&gt;OpenCompass Agent - Tool Use: 42.4 (#6/23)&lt;/li&gt;&lt;li&gt;OpenCompass Knowledge - Science: 91.7 (#7/23)&lt;/li&gt;&lt;li&gt;OpenCompass LLM - Reasoning: 59.5 (#8/23)&lt;/li&gt;&lt;li&gt;OpenCompass Language - NLP: 67.1 (#8/23)&lt;/li&gt;&lt;li&gt;OpenCompass Language - Instruction Following: 72.5 (#8/23)&lt;/li&gt;&lt;li&gt;OpenCompass Math - College: 77.1 (#8/23)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Hy3-preview (High)&lt;/strong&gt; — ELO 1729, #110&lt;ul&gt;&lt;li&gt;OpenCompass Math - College: 81.3 (#3/23)&lt;/li&gt;&lt;li&gt;OpenCompass Language - Instruction Following: 76.0 (#4/23)&lt;/li&gt;&lt;li&gt;OpenCompass LLM - Math: 74.5 (#5/23)&lt;/li&gt;&lt;li&gt;OpenCompass Language - Creation: 75.4 (#5/23)&lt;/li&gt;&lt;li&gt;OpenCompass LLM - Language: 74.4 (#7/23)&lt;/li&gt;&lt;li&gt;OpenCompass Reasoning - Academic: 43.6 (#8/23)&lt;/li&gt;&lt;li&gt;OpenCompass LLM - Reasoning: 58.5 (#10/23)&lt;/li&gt;&lt;li&gt;OpenCompass Math - Competition: 67.6 (#10/23)&lt;/li&gt;&lt;li&gt;OpenCompass LLM - Agent: 28.7 (#12/23)&lt;/li&gt;&lt;li&gt;OpenCompass Reasoning - Common: 73.5 (#12/23)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Ring-2.5-1T&lt;/strong&gt; — ELO 1711, #119&lt;ul&gt;&lt;li&gt;OpenCompass Knowledge - Social Science: 92.9 (#5/23)&lt;/li&gt;&lt;li&gt;OpenCompass Language - NLP: 65.4 (#11/23)&lt;/li&gt;&lt;li&gt;OpenCompass Language - Creation: 68.8 (#12/23)&lt;/li&gt;&lt;li&gt;OpenCompass Knowledge - Humanities: 90.0 (#12/23)&lt;/li&gt;&lt;li&gt;OpenCompass LLM - Agent: 25.0 (#13/23)&lt;/li&gt;&lt;li&gt;OpenCompass Math - College: 75.0 (#13/23)&lt;/li&gt;&lt;li&gt;OpenCompass Agent - Tool Use: 25.0 (#13/23)&lt;/li&gt;&lt;li&gt;OpenCompass LLM - Knowledge: 89.4 (#14/23)&lt;/li&gt;&lt;li&gt;OpenCompass Knowledge - Engineering: 90.8 (#14/23)&lt;/li&gt;&lt;li&gt;OpenCompass LLM - Language: 69.8 (#15/23)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;Top-10 New Scores (1)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Claude Opus 4.7 (Thinking)&lt;/strong&gt; on WeirdML: 75.45 (#8)&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;New #1 Leaders (9)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;OpenCompass Reasoning - Common&lt;/strong&gt;: Doubao-Seed-2-0-Pro-260215 (high) (82.1) beat Gemini-3-Pro-Preview by 8.5&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenCompass Math - College&lt;/strong&gt;: Doubao-Seed-2-0-Pro-260215 (high) (83.8) beat Kimi-K2.5 by 7.3&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Tau3-Bench Banking_Knowledge&lt;/strong&gt;: GPT-5.5 (37.4) beat Distyl ButtonAgent by 6.2&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenCompass Knowledge - Social Science&lt;/strong&gt;: Gemini-3.1-Pro-Preview (97.5) beat Gemini-3-Pro-Preview by 4.3&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenCompass LLM - Math&lt;/strong&gt;: Doubao-Seed-2-0-Pro-260215 (high) (77.3) beat Qwen3-Max-2026-01-23 by 4.1&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenCompass LLM - Reasoning&lt;/strong&gt;: Doubao-Seed-2-0-Pro-260215 (high) (65.2) beat Gemini-3-Pro-Preview by 3.7&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenCompass Math - Competition&lt;/strong&gt;: Kimi-K2.6 (72.1) beat Qwen3-Max-2026-01-23 by 2.1&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenCompass Reasoning - Academic&lt;/strong&gt;: GPT-5.4-2026-03-05 (high) (52.0) beat GPT-5.2-2025-12-11 (high) by 1.5&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenCompass Knowledge - Engineering&lt;/strong&gt;: GPT-5.4-2026-03-05 (high) (96.2) beat Gemini-3-Pro-Preview by 0.4&lt;/li&gt;&lt;/ul&gt;</content><summary>AI Benchmark Digest — 2026-05-14

=== DAILY ===
NEW MODELS (4)
  - Doubao-Seed-2-0-Pro-260215 (High) — ELO 1781, #73/796 (above: GPT-5.2 (Low), below: GLM-5-Turbo)
      OpenCompass LLM - Reasoning: 65.2 (#1/23)
      OpenCompass LLM - Math: 77.3 (#1/23)
      OpenCompass Knowledge - Humanities: 95.</summary></entry><entry><title>AI Benchmark Digest — 2026-05-13</title><id>https://aibenchmarks.dev/digest/2026-05-13</id><updated>2026-05-13T07:29:12.582080+00:00</updated><link href="https://aibenchmarks.dev/#/digest" /><author><name>AI Benchmark Hub</name></author><content type="html">&lt;h2&gt;Daily&lt;/h2&gt;
&lt;h3&gt;New Benchmarks (2)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;ProgramBench&lt;/strong&gt; (Resolved (%)): leader GPT-5.5 (xHigh) (0.5), 13 models&lt;br&gt;&lt;span&gt;Meta and Stanford benchmark testing whether language-model agents can rebuild complete programs from only a compiled binary and documentation. Agents use mini-SWE-agent across 200 open-source program recreation tasks and are scored by hidden behavioral tests.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;ProgramBench Almost&lt;/strong&gt; (Almost (%)): leader GPT-5.5 (xHigh) (13.5), 13 models&lt;br&gt;&lt;span&gt;Companion ProgramBench metric that counts near-complete program recreations: tasks where the generated implementation passes most hidden behavioral tests but does not fully resolve the benchmark task.&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;New Models (1)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;JT-35B-Flash&lt;/strong&gt; — ELO 1693, #141&lt;ul&gt;&lt;li&gt;AA TAU-2 Bench: 99.1 (#1/405)&lt;/li&gt;&lt;li&gt;AA Omniscience - Software Engineering (SWE) - Go: 36.0 (#50/391)&lt;/li&gt;&lt;li&gt;AA Omniscience - Software Engineering (SWE) - Java: 29.0 (#58/391)&lt;/li&gt;&lt;li&gt;AA Omniscience - Software Engineering (SWE) - HTML: 48.0 (#60/391)&lt;/li&gt;&lt;li&gt;AA Omniscience - Software Engineering (SWE) - JavaScript: 41.82 (#75/391)&lt;/li&gt;&lt;li&gt;AA GPQA Diamond: 82.9 (#76/486)&lt;/li&gt;&lt;li&gt;AA Omniscience - Software Engineering (SWE) - C: 53.0 (#78/391)&lt;/li&gt;&lt;li&gt;AA Omniscience - Software Engineering (SWE) - PHP: 38.0 (#79/391)&lt;/li&gt;&lt;li&gt;AA Omniscience - Software Engineering (SWE) - TypeScript: 36.67 (#82/391)&lt;/li&gt;&lt;li&gt;AA Omniscience - Software Engineering (SWE): 35.0 (#83/391)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;Top-10 New Scores (1)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5 (xHigh)&lt;/strong&gt; on WeirdML: 84.91 (#1)&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;New #1 Leaders (2)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;WeirdML&lt;/strong&gt;: gpt-5.5 (xhigh) (84.91) beat gpt-5.5 (high) by 1.01&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA TAU-2 Bench&lt;/strong&gt;: JT-35B-Flash (99.1) beat GLM-4.7-Flash (Reasoning) by 0.3&lt;/li&gt;&lt;/ul&gt;</content><summary>AI Benchmark Digest — 2026-05-13

=== DAILY ===
NEW BENCHMARKS (2)
  - ProgramBench (Resolved (%)): leader GPT-5.5 (xHigh) (0.5), 13 models
      Meta and Stanford benchmark testing whether language-model agents can rebuild complete programs from only a compiled binary and documentation. Agents use </summary></entry><entry><title>AI Benchmark Digest — 2026-05-11</title><id>https://aibenchmarks.dev/digest/2026-05-11</id><updated>2026-05-11T08:08:39.844852+00:00</updated><link href="https://aibenchmarks.dev/#/digest" /><author><name>AI Benchmark Hub</name></author><content type="html">&lt;h2&gt;Daily&lt;/h2&gt;
&lt;h3&gt;New #1 Leaders (2)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;OpenClawProBench&lt;/strong&gt;: Sensenova 6.7 Flash Lite (73.7) beat qwen3.5-397b-a17b by 3.3&lt;/li&gt;&lt;li&gt;&lt;strong&gt;VisuLogic&lt;/strong&gt;: PEREA-1.0new (52.8) beat Human by 1.4&lt;/li&gt;&lt;/ul&gt;</content><summary>AI Benchmark Digest — 2026-05-11

=== DAILY ===
NEW #1 LEADERS (2)
  - OpenClawProBench (Overall Score (%)): Sensenova 6.7 Flash Lite (73.7) beat qwen3.5-397b-a17b (70.4) by 3.3
  - VisuLogic (Overall Accuracy (%)): PEREA-1.0new (52.8) beat Human (51.4) by 1.4
</summary></entry><entry><title>AI Benchmark Digest — 2026-05-10</title><id>https://aibenchmarks.dev/digest/2026-05-10</id><updated>2026-05-10T07:49:15.895022+00:00</updated><link href="https://aibenchmarks.dev/#/digest" /><author><name>AI Benchmark Hub</name></author><content type="html">&lt;h2&gt;Daily&lt;/h2&gt;
&lt;h3&gt;New Benchmarks (43)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;AA Global-MMLU-Lite - Arabic&lt;/strong&gt; (Accuracy (%)): leader Gemini 3.1 Pro Preview (93.0), 119 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Global-MMLU-Lite - Bengali&lt;/strong&gt; (Accuracy (%)): leader Gemini 3.1 Pro Preview (92.17), 119 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Global-MMLU-Lite - German&lt;/strong&gt; (Accuracy (%)): leader Gemini 3.1 Pro Preview (94.75), 119 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Global-MMLU-Lite - English&lt;/strong&gt; (Accuracy (%)): leader Claude Opus 4.6 (Adaptive Reasoning, Max Effort) (95.17), 120 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Global-MMLU-Lite - Spanish&lt;/strong&gt; (Accuracy (%)): leader Gemini 3.1 Pro Preview (94.42), 118 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Global-MMLU-Lite - French&lt;/strong&gt; (Accuracy (%)): leader Gemini 3.1 Pro Preview (93.67), 118 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Global-MMLU-Lite - Hindi&lt;/strong&gt; (Accuracy (%)): leader Gemini 3.1 Pro Preview (93.67), 117 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Global-MMLU-Lite - Indonesian&lt;/strong&gt; (Accuracy (%)): leader Gemini 3.1 Pro Preview (93.67), 118 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Global-MMLU-Lite - Italian&lt;/strong&gt; (Accuracy (%)): leader Gemini 3.1 Pro Preview (94.58), 117 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Global-MMLU-Lite - Japanese&lt;/strong&gt; (Accuracy (%)): leader Gemini 3.1 Pro Preview (93.67), 116 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Global-MMLU-Lite - Korean&lt;/strong&gt; (Accuracy (%)): leader Claude Opus 4.6 (Adaptive Reasoning, Max Effort) (93.0), 116 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Global-MMLU-Lite - Burmese&lt;/strong&gt; (Accuracy (%)): leader Gemini 3.1 Pro Preview (91.17), 111 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Global-MMLU-Lite - Portuguese&lt;/strong&gt; (Accuracy (%)): leader Gemini 3.1 Pro Preview (94.25), 113 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Global-MMLU-Lite - Swahili&lt;/strong&gt; (Accuracy (%)): leader Gemini 3.1 Pro Preview (92.33), 112 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Global-MMLU-Lite - Yoruba&lt;/strong&gt; (Accuracy (%)): leader Gemini 3.1 Pro Preview (88.75), 112 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Global-MMLU-Lite - Chinese&lt;/strong&gt; (Accuracy (%)): leader Claude Opus 4.6 (Adaptive Reasoning, Max Effort) (93.58), 113 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Business&lt;/strong&gt; (Accuracy (%)): leader GPT-5.5 (xhigh) (49.1), 388 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Health&lt;/strong&gt; (Accuracy (%)): leader GPT-5.5 (medium) (48.8), 388 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Humanities &amp; Social Sciences&lt;/strong&gt; (Accuracy (%)): leader Gemini 3 Pro Preview (high) (56.6), 388 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Law&lt;/strong&gt; (Accuracy (%)): leader Gemini 3 Pro Preview (high) (64.3), 388 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Science, Engineering &amp; Mathematics&lt;/strong&gt; (Accuracy (%)): leader GPT-5.5 (high) (52.3), 388 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE)&lt;/strong&gt; (Accuracy (%)): leader GPT-5.5 (xhigh) (84.4), 388 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE) - C&lt;/strong&gt; (Accuracy (%)): leader GPT-5.5 (high) (92.0), 388 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE) - Dart&lt;/strong&gt; (Accuracy (%)): leader GPT-5.3 Codex (xhigh) (80.0), 388 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE) - Go&lt;/strong&gt; (Accuracy (%)): leader GPT-5.5 (high) (84.0), 388 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE) - HTML&lt;/strong&gt; (Accuracy (%)): leader GPT-5.5 (medium) (90.0), 388 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE) - Java&lt;/strong&gt; (Accuracy (%)): leader GPT-5.3 Codex (xhigh) (73.0), 388 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE) - JavaScript&lt;/strong&gt; (Accuracy (%)): leader GPT-5.3 Codex (xhigh) (90.91), 388 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE) - Julia&lt;/strong&gt; (Accuracy (%)): leader GPT-5.4 (low) (88.0), 388 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE) - Kotlin&lt;/strong&gt; (Accuracy (%)): leader GPT-5.3 Codex (xhigh) (90.0), 388 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE) - PHP&lt;/strong&gt; (Accuracy (%)): leader GPT-5.5 (medium) (92.0), 388 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE) - Python&lt;/strong&gt; (Accuracy (%)): leader GPT-5.5 (xhigh) (90.5), 388 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE) - R&lt;/strong&gt; (Accuracy (%)): leader GPT-5.5 (medium) (74.0), 388 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE) - Rust&lt;/strong&gt; (Accuracy (%)): leader GPT-5.5 (xhigh) (92.0), 388 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE) - Swift&lt;/strong&gt; (Accuracy (%)): leader GPT-5.5 (xhigh) (92.0), 388 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE) - TypeScript&lt;/strong&gt; (Accuracy (%)): leader GPT-5.5 (xhigh) (91.11), 388 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;EuroEval Albanian NLU - MMS SQ&lt;/strong&gt; (Sentiment classification Score (%)): leader gemini-3-flash-preview#no-thinking (32.13), 196 models&lt;br&gt;&lt;span&gt;EuroEval Albanian NLU task column for the MMS SQ dataset, measuring sentiment classification from the public albanian_nlu.csv leaderboard.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;EuroEval Albanian NLU - WikiANN SQ&lt;/strong&gt; (Named entity recognition Score (%)): leader multilingual-e5-large (86.6), 200 models&lt;br&gt;&lt;span&gt;EuroEval Albanian NLU task column for the WikiANN SQ dataset, measuring named entity recognition from the public albanian_nlu.csv leaderboard.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;EuroEval Albanian NLU - ScaLA SQ&lt;/strong&gt; (Linguistic acceptability Score (%)): leader gemini-3.1-pro-preview (78.55), 166 models&lt;br&gt;&lt;span&gt;EuroEval Albanian NLU task column for the ScaLA SQ dataset, measuring linguistic acceptability from the public albanian_nlu.csv leaderboard.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;EuroEval Albanian NLU - MultiWikiQA SQ&lt;/strong&gt; (Reading comprehension Score (%)): leader Qwen3.5-9B-Base (70.8), 200 models&lt;br&gt;&lt;span&gt;EuroEval Albanian NLU task column for the MultiWikiQA SQ dataset, measuring reading comprehension from the public albanian_nlu.csv leaderboard.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;EuroEval Bosnian NLU - MMS BS&lt;/strong&gt; (Sentiment classification Score (%)): leader gpt-4.1-mini-2025-04-14 (56.43), 208 models&lt;br&gt;&lt;span&gt;EuroEval Bosnian NLU task column for the MMS BS dataset, measuring sentiment classification from the public bosnian_nlu.csv leaderboard.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;EuroEval Bosnian NLU - WikiANN BS&lt;/strong&gt; (Named entity recognition Score (%)): leader multilingual-e5-large (84.87), 212 models&lt;br&gt;&lt;span&gt;EuroEval Bosnian NLU task column for the WikiANN BS dataset, measuring named entity recognition from the public bosnian_nlu.csv leaderboard.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;EuroEval Bosnian NLU - Multi Wiki QA BS&lt;/strong&gt; (Reading comprehension Score (%)): leader Olmo-3-1125-32B (78.64), 211 models&lt;br&gt;&lt;span&gt;EuroEval Bosnian NLU task column for the Multi Wiki QA BS dataset, measuring reading comprehension from the public bosnian_nlu.csv leaderboard.&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;
&lt;hr/&gt;
&lt;h2&gt;Weekly&lt;/h2&gt;
&lt;h3&gt;New Benchmarks (43)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;AA Global-MMLU-Lite - Arabic&lt;/strong&gt; (Accuracy (%)): leader Gemini 3.1 Pro Preview (93.0), 119 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Global-MMLU-Lite - Bengali&lt;/strong&gt; (Accuracy (%)): leader Gemini 3.1 Pro Preview (92.17), 119 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Global-MMLU-Lite - German&lt;/strong&gt; (Accuracy (%)): leader Gemini 3.1 Pro Preview (94.75), 119 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Global-MMLU-Lite - English&lt;/strong&gt; (Accuracy (%)): leader Claude Opus 4.6 (Adaptive Reasoning, Max Effort) (95.17), 120 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Global-MMLU-Lite - Spanish&lt;/strong&gt; (Accuracy (%)): leader Gemini 3.1 Pro Preview (94.42), 118 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Global-MMLU-Lite - French&lt;/strong&gt; (Accuracy (%)): leader Gemini 3.1 Pro Preview (93.67), 118 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Global-MMLU-Lite - Hindi&lt;/strong&gt; (Accuracy (%)): leader Gemini 3.1 Pro Preview (93.67), 117 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Global-MMLU-Lite - Indonesian&lt;/strong&gt; (Accuracy (%)): leader Gemini 3.1 Pro Preview (93.67), 118 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Global-MMLU-Lite - Italian&lt;/strong&gt; (Accuracy (%)): leader Gemini 3.1 Pro Preview (94.58), 117 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Global-MMLU-Lite - Japanese&lt;/strong&gt; (Accuracy (%)): leader Gemini 3.1 Pro Preview (93.67), 116 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Global-MMLU-Lite - Korean&lt;/strong&gt; (Accuracy (%)): leader Claude Opus 4.6 (Adaptive Reasoning, Max Effort) (93.0), 116 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Global-MMLU-Lite - Burmese&lt;/strong&gt; (Accuracy (%)): leader Gemini 3.1 Pro Preview (91.17), 111 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Global-MMLU-Lite - Portuguese&lt;/strong&gt; (Accuracy (%)): leader Gemini 3.1 Pro Preview (94.25), 113 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Global-MMLU-Lite - Swahili&lt;/strong&gt; (Accuracy (%)): leader Gemini 3.1 Pro Preview (92.33), 112 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Global-MMLU-Lite - Yoruba&lt;/strong&gt; (Accuracy (%)): leader Gemini 3.1 Pro Preview (88.75), 112 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Global-MMLU-Lite - Chinese&lt;/strong&gt; (Accuracy (%)): leader Claude Opus 4.6 (Adaptive Reasoning, Max Effort) (93.58), 113 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Business&lt;/strong&gt; (Accuracy (%)): leader GPT-5.5 (xhigh) (49.1), 388 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Health&lt;/strong&gt; (Accuracy (%)): leader GPT-5.5 (medium) (48.8), 388 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Humanities &amp; Social Sciences&lt;/strong&gt; (Accuracy (%)): leader Gemini 3 Pro Preview (high) (56.6), 388 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Law&lt;/strong&gt; (Accuracy (%)): leader Gemini 3 Pro Preview (high) (64.3), 388 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Science, Engineering &amp; Mathematics&lt;/strong&gt; (Accuracy (%)): leader GPT-5.5 (high) (52.3), 388 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE)&lt;/strong&gt; (Accuracy (%)): leader GPT-5.5 (xhigh) (84.4), 388 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE) - C&lt;/strong&gt; (Accuracy (%)): leader GPT-5.5 (high) (92.0), 388 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE) - Dart&lt;/strong&gt; (Accuracy (%)): leader GPT-5.3 Codex (xhigh) (80.0), 388 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE) - Go&lt;/strong&gt; (Accuracy (%)): leader GPT-5.5 (high) (84.0), 388 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE) - HTML&lt;/strong&gt; (Accuracy (%)): leader GPT-5.5 (medium) (90.0), 388 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE) - Java&lt;/strong&gt; (Accuracy (%)): leader GPT-5.3 Codex (xhigh) (73.0), 388 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE) - JavaScript&lt;/strong&gt; (Accuracy (%)): leader GPT-5.3 Codex (xhigh) (90.91), 388 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE) - Julia&lt;/strong&gt; (Accuracy (%)): leader GPT-5.4 (low) (88.0), 388 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE) - Kotlin&lt;/strong&gt; (Accuracy (%)): leader GPT-5.3 Codex (xhigh) (90.0), 388 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE) - PHP&lt;/strong&gt; (Accuracy (%)): leader GPT-5.5 (medium) (92.0), 388 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE) - Python&lt;/strong&gt; (Accuracy (%)): leader GPT-5.5 (xhigh) (90.5), 388 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE) - R&lt;/strong&gt; (Accuracy (%)): leader GPT-5.5 (medium) (74.0), 388 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE) - Rust&lt;/strong&gt; (Accuracy (%)): leader GPT-5.5 (xhigh) (92.0), 388 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE) - Swift&lt;/strong&gt; (Accuracy (%)): leader GPT-5.5 (xhigh) (92.0), 388 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;AA Omniscience - Software Engineering (SWE) - TypeScript&lt;/strong&gt; (Accuracy (%)): leader GPT-5.5 (xhigh) (91.11), 388 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;EuroEval Albanian NLU - MMS SQ&lt;/strong&gt; (Sentiment classification Score (%)): leader gemini-3-flash-preview#no-thinking (32.13), 196 models&lt;br&gt;&lt;span&gt;EuroEval Albanian NLU task column for the MMS SQ dataset, measuring sentiment classification from the public albanian_nlu.csv leaderboard.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;EuroEval Albanian NLU - WikiANN SQ&lt;/strong&gt; (Named entity recognition Score (%)): leader multilingual-e5-large (86.6), 200 models&lt;br&gt;&lt;span&gt;EuroEval Albanian NLU task column for the WikiANN SQ dataset, measuring named entity recognition from the public albanian_nlu.csv leaderboard.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;EuroEval Albanian NLU - ScaLA SQ&lt;/strong&gt; (Linguistic acceptability Score (%)): leader gemini-3.1-pro-preview (78.55), 166 models&lt;br&gt;&lt;span&gt;EuroEval Albanian NLU task column for the ScaLA SQ dataset, measuring linguistic acceptability from the public albanian_nlu.csv leaderboard.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;EuroEval Albanian NLU - MultiWikiQA SQ&lt;/strong&gt; (Reading comprehension Score (%)): leader Qwen3.5-9B-Base (70.8), 200 models&lt;br&gt;&lt;span&gt;EuroEval Albanian NLU task column for the MultiWikiQA SQ dataset, measuring reading comprehension from the public albanian_nlu.csv leaderboard.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;EuroEval Bosnian NLU - MMS BS&lt;/strong&gt; (Sentiment classification Score (%)): leader gpt-4.1-mini-2025-04-14 (56.43), 208 models&lt;br&gt;&lt;span&gt;EuroEval Bosnian NLU task column for the MMS BS dataset, measuring sentiment classification from the public bosnian_nlu.csv leaderboard.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;EuroEval Bosnian NLU - WikiANN BS&lt;/strong&gt; (Named entity recognition Score (%)): leader multilingual-e5-large (84.87), 212 models&lt;br&gt;&lt;span&gt;EuroEval Bosnian NLU task column for the WikiANN BS dataset, measuring named entity recognition from the public bosnian_nlu.csv leaderboard.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;EuroEval Bosnian NLU - Multi Wiki QA BS&lt;/strong&gt; (Reading comprehension Score (%)): leader Olmo-3-1125-32B (78.64), 211 models&lt;br&gt;&lt;span&gt;EuroEval Bosnian NLU task column for the Multi Wiki QA BS dataset, measuring reading comprehension from the public bosnian_nlu.csv leaderboard.&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;New Models (38)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;GLM-5V Turbo (Reasoning)&lt;/strong&gt; — ELO 1738, #102&lt;ul&gt;&lt;li&gt;AA TAU-2 Bench: 98.5 (#3/402)&lt;/li&gt;&lt;li&gt;AA GDPval: 1330.87 (#43/360)&lt;/li&gt;&lt;li&gt;AA MMMU-Pro: 72.8 (#44/188)&lt;/li&gt;&lt;li&gt;AA SciCode: 43.5 (#52/477)&lt;/li&gt;&lt;li&gt;Artificial Analysis Intelligence Index: 42.85 (#56/482)&lt;/li&gt;&lt;li&gt;AA Terminal-Bench Hard: 32.6 (#79/397)&lt;/li&gt;&lt;li&gt;AA Omniscience: -18.98 (#80/388)&lt;/li&gt;&lt;li&gt;AA Long Context Reasoning: 61.0 (#84/411)&lt;/li&gt;&lt;li&gt;AA Humanity's Last Exam: 15.8 (#91/479)&lt;/li&gt;&lt;li&gt;AA GPQA Diamond: 80.9 (#96/483)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;ERNIE 5.0 Thinking Preview&lt;/strong&gt; — ELO 1631, #214&lt;ul&gt;&lt;li&gt;AA LiveCodeBench: 81.2 (#24/343)&lt;/li&gt;&lt;li&gt;AA Global-MMLU-Lite: 86.5 (#33/120)&lt;/li&gt;&lt;li&gt;AA AIME 2025: 85.0 (#46/269)&lt;/li&gt;&lt;li&gt;AA MMLU-Pro: 83.0 (#60/345)&lt;/li&gt;&lt;li&gt;AA CritPt: 1.4 (#68/388)&lt;/li&gt;&lt;li&gt;AA MMMU-Pro: 64.6 (#90/188)&lt;/li&gt;&lt;li&gt;AA TAU-2 Bench: 83.9 (#94/402)&lt;/li&gt;&lt;li&gt;AA Humanity's Last Exam: 12.7 (#116/479)&lt;/li&gt;&lt;li&gt;AA Terminal-Bench Hard: 25.0 (#119/397)&lt;/li&gt;&lt;li&gt;AA GPQA Diamond: 77.7 (#124/483)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;K-EXAONE (Reasoning)&lt;/strong&gt; — ELO 1603, #245&lt;ul&gt;&lt;li&gt;AA AIME 2025: 90.3 (#25/269)&lt;/li&gt;&lt;li&gt;AA LiveCodeBench: 76.8 (#41/343)&lt;/li&gt;&lt;li&gt;AA MMLU-Pro: 83.8 (#44/345)&lt;/li&gt;&lt;li&gt;AA CritPt: 1.1 (#76/388)&lt;/li&gt;&lt;li&gt;AA Global-MMLU-Lite: 78.86 (#80/120)&lt;/li&gt;&lt;li&gt;AA IFBench: 64.7 (#85/411)&lt;/li&gt;&lt;li&gt;AA Humanity's Last Exam: 13.1 (#111/479)&lt;/li&gt;&lt;li&gt;AA Long Context Reasoning: 55.7 (#117/411)&lt;/li&gt;&lt;li&gt;AA GPQA Diamond: 78.3 (#119/483)&lt;/li&gt;&lt;li&gt;AA TAU-2 Bench: 74.3 (#121/402)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;EXAONE 4.5 33B&lt;/strong&gt; — ELO 1578, #277&lt;ul&gt;&lt;li&gt;AA MMMU-Pro: 67.3 (#77/188)&lt;/li&gt;&lt;li&gt;AA GPQA Diamond: 79.4 (#106/483)&lt;/li&gt;&lt;li&gt;AA IFBench: 58.0 (#107/411)&lt;/li&gt;&lt;li&gt;AA TAU-2 Bench: 78.1 (#112/402)&lt;/li&gt;&lt;li&gt;AA CritPt: 0.3 (#128/388)&lt;/li&gt;&lt;li&gt;AA Humanity's Last Exam: 11.6 (#131/479)&lt;/li&gt;&lt;li&gt;AA Terminal-Bench Hard: 20.5 (#144/397)&lt;/li&gt;&lt;li&gt;Artificial Analysis Intelligence Index: 30.23 (#147/482)&lt;/li&gt;&lt;li&gt;AA Long Context Reasoning: 49.3 (#150/411)&lt;/li&gt;&lt;li&gt;AA GDPval: 812.72 (#163/360)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;K2-V2 (High)&lt;/strong&gt; — ELO 1562, #294&lt;ul&gt;&lt;li&gt;AA AIME 2025: 78.3 (#71/269)&lt;/li&gt;&lt;li&gt;AA LiveCodeBench: 69.4 (#76/343)&lt;/li&gt;&lt;li&gt;AA Global-MMLU-Lite: 78.6 (#82/120)&lt;/li&gt;&lt;li&gt;AA IFBench: 60.1 (#102/411)&lt;/li&gt;&lt;li&gt;AA MMLU-Pro: 78.6 (#135/345)&lt;/li&gt;&lt;li&gt;AA Humanity's Last Exam: 9.8 (#157/479)&lt;/li&gt;&lt;li&gt;AA Long Context Reasoning: 33.3 (#211/411)&lt;/li&gt;&lt;li&gt;AA Terminal-Bench Hard: 9.8 (#212/397)&lt;/li&gt;&lt;li&gt;AA GPQA Diamond: 68.1 (#222/483)&lt;/li&gt;&lt;li&gt;Artificial Analysis Intelligence Index: 20.61 (#232/482)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Solar Open 100B (Reasoning)&lt;/strong&gt; — ELO 1555, #307&lt;ul&gt;&lt;li&gt;AA Global-MMLU-Lite: 81.58 (#61/120)&lt;/li&gt;&lt;li&gt;AA IFBench: 57.7 (#110/411)&lt;/li&gt;&lt;li&gt;AA Humanity's Last Exam: 9.2 (#170/479)&lt;/li&gt;&lt;li&gt;AA TAU-2 Bench: 48.2 (#180/402)&lt;/li&gt;&lt;li&gt;AA Long Context Reasoning: 36.0 (#195/411)&lt;/li&gt;&lt;li&gt;AA CritPt: 0.0 (#204/388)&lt;/li&gt;&lt;li&gt;AA GDPval: 666.33 (#207/360)&lt;/li&gt;&lt;li&gt;Artificial Analysis Intelligence Index: 21.67 (#224/482)&lt;/li&gt;&lt;li&gt;AA GPQA Diamond: 65.7 (#243/483)&lt;/li&gt;&lt;li&gt;AA Omniscience: -54.1 (#262/388)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;JT-MINI&lt;/strong&gt; — ELO 1546, #324&lt;ul&gt;&lt;li&gt;AA TAU-2 Bench: 93.0 (#40/402)&lt;/li&gt;&lt;li&gt;AA Terminal-Bench Hard: 18.2 (#154/397)&lt;/li&gt;&lt;li&gt;AA GDPval: 831.97 (#157/360)&lt;/li&gt;&lt;li&gt;Artificial Analysis Intelligence Index: 25.37 (#187/482)&lt;/li&gt;&lt;li&gt;AA Humanity's Last Exam: 6.6 (#223/479)&lt;/li&gt;&lt;li&gt;AA GPQA Diamond: 67.6 (#225/483)&lt;/li&gt;&lt;li&gt;AA CritPt: 0.0 (#263/388)&lt;/li&gt;&lt;li&gt;AA IFBench: 36.7 (#277/411)&lt;/li&gt;&lt;li&gt;AA SciCode: 27.2 (#292/477)&lt;/li&gt;&lt;li&gt;AA Long Context Reasoning: 11.7 (#308/411)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;K2 Think V2&lt;/strong&gt; — ELO 1545, #328&lt;ul&gt;&lt;li&gt;AA IFBench: 62.8 (#94/411)&lt;/li&gt;&lt;li&gt;AA Omniscience: -33.92 (#125/388)&lt;/li&gt;&lt;li&gt;AA Long Context Reasoning: 52.7 (#135/411)&lt;/li&gt;&lt;li&gt;AA Humanity's Last Exam: 9.5 (#165/479)&lt;/li&gt;&lt;li&gt;AA GPQA Diamond: 71.3 (#192/483)&lt;/li&gt;&lt;li&gt;Artificial Analysis Intelligence Index: 24.12 (#201/482)&lt;/li&gt;&lt;li&gt;AA GDPval: 607.98 (#222/360)&lt;/li&gt;&lt;li&gt;AA SciCode: 33.0 (#223/477)&lt;/li&gt;&lt;li&gt;AA Terminal-Bench Hard: 6.8 (#240/397)&lt;/li&gt;&lt;li&gt;AA CritPt: 0.0 (#252/388)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;HyperCLOVA X SEED Think (32B)&lt;/strong&gt; — ELO 1537, #342&lt;ul&gt;&lt;li&gt;AA TAU-2 Bench: 87.4 (#68/402)&lt;/li&gt;&lt;li&gt;AA Global-MMLU-Lite: 78.6 (#83/120)&lt;/li&gt;&lt;li&gt;AA LiveCodeBench: 62.9 (#107/343)&lt;/li&gt;&lt;li&gt;AA AIME 2025: 59.0 (#118/269)&lt;/li&gt;&lt;li&gt;AA MMLU-Pro: 78.5 (#137/345)&lt;/li&gt;&lt;li&gt;AA Terminal-Bench Hard: 12.1 (#194/397)&lt;/li&gt;&lt;li&gt;AA GDPval: 678.83 (#199/360)&lt;/li&gt;&lt;li&gt;Artificial Analysis Intelligence Index: 23.72 (#204/482)&lt;/li&gt;&lt;li&gt;AA Omniscience: -52.87 (#255/388)&lt;/li&gt;&lt;li&gt;AA CritPt: 0.0 (#257/388)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Mi:dm K 2.5 Pro&lt;/strong&gt; — ELO 1527, #352&lt;ul&gt;&lt;li&gt;AA TAU-2 Bench: 86.5 (#75/402)&lt;/li&gt;&lt;li&gt;AA AIME 2025: 76.7 (#77/269)&lt;/li&gt;&lt;li&gt;AA LiveCodeBench: 65.6 (#92/343)&lt;/li&gt;&lt;li&gt;AA Global-MMLU-Lite: 74.23 (#94/120)&lt;/li&gt;&lt;li&gt;AA MMLU-Pro: 80.9 (#97/345)&lt;/li&gt;&lt;li&gt;AA IFBench: 49.3 (#155/411)&lt;/li&gt;&lt;li&gt;AA Humanity's Last Exam: 7.7 (#195/479)&lt;/li&gt;&lt;li&gt;AA GPQA Diamond: 70.1 (#200/483)&lt;/li&gt;&lt;li&gt;Artificial Analysis Intelligence Index: 23.06 (#213/482)&lt;/li&gt;&lt;li&gt;AA GDPval: 643.11 (#213/360)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Motif-2-12.7B (Reasoning)&lt;/strong&gt; — ELO 1520, #366&lt;ul&gt;&lt;li&gt;AA AIME 2025: 80.3 (#65/269)&lt;/li&gt;&lt;li&gt;AA LiveCodeBench: 65.1 (#97/343)&lt;/li&gt;&lt;li&gt;AA IFBench: 57.0 (#113/411)&lt;/li&gt;&lt;li&gt;AA MMLU-Pro: 79.6 (#122/345)&lt;/li&gt;&lt;li&gt;AA Humanity's Last Exam: 8.2 (#183/479)&lt;/li&gt;&lt;li&gt;AA TAU-2 Bench: 46.5 (#185/402)&lt;/li&gt;&lt;li&gt;AA GPQA Diamond: 69.5 (#210/483)&lt;/li&gt;&lt;li&gt;Artificial Analysis Intelligence Index: 19.08 (#244/482)&lt;/li&gt;&lt;li&gt;AA CritPt: 0.0 (#250/388)&lt;/li&gt;&lt;li&gt;AA GDPval: 485.33 (#255/360)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Mi:dm K 2.5 Pro Preview&lt;/strong&gt; — ELO 1517, #371&lt;ul&gt;&lt;li&gt;AA Global-MMLU-Lite: 81.43 (#63/120)&lt;/li&gt;&lt;li&gt;AA AIME 2025: 78.7 (#70/269)&lt;/li&gt;&lt;li&gt;AA MMLU-Pro: 81.3 (#92/345)&lt;/li&gt;&lt;li&gt;AA LiveCodeBench: 57.6 (#125/343)&lt;/li&gt;&lt;li&gt;AA Humanity's Last Exam: 8.8 (#175/479)&lt;/li&gt;&lt;li&gt;AA TAU-2 Bench: 49.4 (#177/402)&lt;/li&gt;&lt;li&gt;AA IFBench: 45.6 (#180/411)&lt;/li&gt;&lt;li&gt;AA GPQA Diamond: 72.2 (#185/483)&lt;/li&gt;&lt;li&gt;AA SciCode: 29.7 (#251/477)&lt;/li&gt;&lt;li&gt;AA CritPt: 0.0 (#255/388)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;K2-V2 (Medium)&lt;/strong&gt; — ELO 1512, #382&lt;ul&gt;&lt;li&gt;AA Global-MMLU-Lite: 76.7 (#87/120)&lt;/li&gt;&lt;li&gt;AA AIME 2025: 64.7 (#107/269)&lt;/li&gt;&lt;li&gt;AA IFBench: 55.1 (#122/411)&lt;/li&gt;&lt;li&gt;AA LiveCodeBench: 54.1 (#137/343)&lt;/li&gt;&lt;li&gt;AA MMLU-Pro: 76.1 (#165/345)&lt;/li&gt;&lt;li&gt;AA Terminal-Bench Hard: 8.3 (#220/397)&lt;/li&gt;&lt;li&gt;AA Omniscience: -49.97 (#222/388)&lt;/li&gt;&lt;li&gt;AA GDPval: 578.73 (#227/360)&lt;/li&gt;&lt;li&gt;AA Long Context Reasoning: 28.0 (#232/411)&lt;/li&gt;&lt;li&gt;AA CritPt: 0.0 (#251/388)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Granite 4.1 30B&lt;/strong&gt; — ELO 1491, #425&lt;ul&gt;&lt;li&gt;AA IFBench: 44.4 (#191/411)&lt;/li&gt;&lt;li&gt;AA TAU-2 Bench: 42.1 (#198/402)&lt;/li&gt;&lt;li&gt;AA CritPt: 0.0 (#228/388)&lt;/li&gt;&lt;li&gt;AA GDPval: 495.5 (#253/360)&lt;/li&gt;&lt;li&gt;AA Long Context Reasoning: 18.7 (#273/411)&lt;/li&gt;&lt;li&gt;AA Terminal-Bench Hard: 2.3 (#310/397)&lt;/li&gt;&lt;li&gt;AA SciCode: 25.8 (#315/477)&lt;/li&gt;&lt;li&gt;Artificial Analysis Intelligence Index: 14.69 (#324/482)&lt;/li&gt;&lt;li&gt;AA Omniscience: -67.78 (#342/388)&lt;/li&gt;&lt;li&gt;AA GPQA Diamond: 48.1 (#354/483)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;K-EXAONE (Non-reasoning)&lt;/strong&gt; — ELO 1487, #432&lt;ul&gt;&lt;li&gt;AA MMLU-Pro: 81.0 (#94/345)&lt;/li&gt;&lt;li&gt;AA Global-MMLU-Lite: 71.03 (#104/120)&lt;/li&gt;&lt;li&gt;AA AIME 2025: 44.0 (#150/269)&lt;/li&gt;&lt;li&gt;AA Long Context Reasoning: 47.0 (#157/411)&lt;/li&gt;&lt;li&gt;AA TAU-2 Bench: 59.1 (#162/402)&lt;/li&gt;&lt;li&gt;AA GDPval: 767.0 (#174/360)&lt;/li&gt;&lt;li&gt;Artificial Analysis Intelligence Index: 23.41 (#207/482)&lt;/li&gt;&lt;li&gt;AA GPQA Diamond: 69.5 (#209/483)&lt;/li&gt;&lt;li&gt;AA Terminal-Bench Hard: 6.8 (#239/397)&lt;/li&gt;&lt;li&gt;AA CritPt: 0.0 (#242/388)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;K2-V2 (Low)&lt;/strong&gt; — ELO 1483, #444&lt;ul&gt;&lt;li&gt;AA Global-MMLU-Lite: 71.44 (#103/120)&lt;/li&gt;&lt;li&gt;AA AIME 2025: 35.3 (#173/269)&lt;/li&gt;&lt;li&gt;AA LiveCodeBench: 39.3 (#187/343)&lt;/li&gt;&lt;li&gt;AA MMLU-Pro: 71.3 (#212/345)&lt;/li&gt;&lt;li&gt;AA Omniscience: -48.07 (#212/388)&lt;/li&gt;&lt;li&gt;AA IFBench: 41.0 (#233/411)&lt;/li&gt;&lt;li&gt;AA CritPt: 0.0 (#254/388)&lt;/li&gt;&lt;li&gt;AA Long Context Reasoning: 19.0 (#271/411)&lt;/li&gt;&lt;li&gt;AA Terminal-Bench Hard: 4.5 (#277/397)&lt;/li&gt;&lt;li&gt;AA GDPval: 367.48 (#285/360)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Solar Pro 2 (Reasoning)&lt;/strong&gt; — ELO 1479, #450&lt;ul&gt;&lt;li&gt;AA MATH-500: 96.7 (#30/193)&lt;/li&gt;&lt;li&gt;AA Global-MMLU-Lite: 79.61 (#78/120)&lt;/li&gt;&lt;li&gt;AA MMLU-Pro: 80.5 (#107/345)&lt;/li&gt;&lt;li&gt;AA LiveCodeBench: 61.6 (#113/343)&lt;/li&gt;&lt;li&gt;AA AIME 2025: 61.3 (#115/269)&lt;/li&gt;&lt;li&gt;AA CritPt: 0.0 (#206/388)&lt;/li&gt;&lt;li&gt;AA Humanity's Last Exam: 7.0 (#213/479)&lt;/li&gt;&lt;li&gt;AA GPQA Diamond: 68.7 (#215/483)&lt;/li&gt;&lt;li&gt;AA SciCode: 30.2 (#246/477)&lt;/li&gt;&lt;li&gt;AA TAU-2 Bench: 28.1 (#251/402)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemma 4 E4B (Reasoning)&lt;/strong&gt; — ELO 1474, #458&lt;ul&gt;&lt;li&gt;AA Omniscience: -20.05 (#82/388)&lt;/li&gt;&lt;li&gt;AA CritPt: 0.6 (#104/388)&lt;/li&gt;&lt;li&gt;AA MMMU-Pro: 51.4 (#143/188)&lt;/li&gt;&lt;li&gt;AA IFBench: 44.2 (#193/411)&lt;/li&gt;&lt;li&gt;AA Terminal-Bench Hard: 8.3 (#218/397)&lt;/li&gt;&lt;li&gt;AA Long Context Reasoning: 30.7 (#222/411)&lt;/li&gt;&lt;li&gt;Artificial Analysis Intelligence Index: 18.76 (#250/482)&lt;/li&gt;&lt;li&gt;AA GPQA Diamond: 57.6 (#297/483)&lt;/li&gt;&lt;li&gt;AA GDPval: 304.3 (#312/360)&lt;/li&gt;&lt;li&gt;AA TAU-2 Bench: 20.8 (#314/402)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;EXAONE 4.0 32B (Reasoning)&lt;/strong&gt; — ELO 1473, #461&lt;ul&gt;&lt;li&gt;AA MATH-500: 97.7 (#21/193)&lt;/li&gt;&lt;li&gt;AA LiveCodeBench: 74.7 (#48/343)&lt;/li&gt;&lt;li&gt;AA AIME 2025: 80.0 (#68/269)&lt;/li&gt;&lt;li&gt;AA MMLU-Pro: 81.8 (#82/345)&lt;/li&gt;&lt;li&gt;AA Global-MMLU-Lite: 73.46 (#97/120)&lt;/li&gt;&lt;li&gt;AA Humanity's Last Exam: 10.5 (#145/479)&lt;/li&gt;&lt;li&gt;AA GPQA Diamond: 73.9 (#167/483)&lt;/li&gt;&lt;li&gt;AA SciCode: 34.4 (#203/477)&lt;/li&gt;&lt;li&gt;AA CritPt: 0.0 (#240/388)&lt;/li&gt;&lt;li&gt;AA GDPval: 499.86 (#249/360)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Tri-21B-Think Preview&lt;/strong&gt; — ELO 1473, #462&lt;ul&gt;&lt;li&gt;AA TAU-2 Bench: 93.3 (#38/402)&lt;/li&gt;&lt;li&gt;AA IFBench: 47.1 (#169/411)&lt;/li&gt;&lt;li&gt;Artificial Analysis Intelligence Index: 19.99 (#236/482)&lt;/li&gt;&lt;li&gt;AA Humanity's Last Exam: 5.7 (#257/479)&lt;/li&gt;&lt;li&gt;AA CritPt: 0.0 (#259/388)&lt;/li&gt;&lt;li&gt;AA Omniscience: -55.28 (#267/388)&lt;/li&gt;&lt;li&gt;AA Long Context Reasoning: 14.7 (#294/411)&lt;/li&gt;&lt;li&gt;AA GDPval: 337.02 (#299/360)&lt;/li&gt;&lt;li&gt;AA Terminal-Bench Hard: 2.3 (#315/397)&lt;/li&gt;&lt;li&gt;AA GPQA Diamond: 53.8 (#320/483)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Tri-21B-Think&lt;/strong&gt; — ELO 1468, #468&lt;ul&gt;&lt;li&gt;AA TAU-2 Bench: 81.0 (#103/402)&lt;/li&gt;&lt;li&gt;AA IFBench: 54.6 (#124/411)&lt;/li&gt;&lt;li&gt;AA CritPt: 0.3 (#132/388)&lt;/li&gt;&lt;li&gt;AA Humanity's Last Exam: 6.1 (#241/479)&lt;/li&gt;&lt;li&gt;Artificial Analysis Intelligence Index: 18.62 (#258/482)&lt;/li&gt;&lt;li&gt;AA GPQA Diamond: 60.1 (#279/483)&lt;/li&gt;&lt;li&gt;AA GDPval: 374.11 (#282/360)&lt;/li&gt;&lt;li&gt;AA Long Context Reasoning: 11.0 (#312/411)&lt;/li&gt;&lt;li&gt;AA Omniscience: -63.3 (#321/388)&lt;/li&gt;&lt;li&gt;AA Terminal-Bench Hard: 0.8 (#342/397)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-4o (March 2025, chatgpt-4o-latest)&lt;/strong&gt; — ELO 1449, #500&lt;ul&gt;&lt;li&gt;AA MATH-500: 89.3 (#73/193)&lt;/li&gt;&lt;li&gt;AA MMLU-Pro: 80.3 (#110/345)&lt;/li&gt;&lt;li&gt;AA SciCode: 36.6 (#165/477)&lt;/li&gt;&lt;li&gt;AA LiveCodeBench: 42.5 (#170/343)&lt;/li&gt;&lt;li&gt;AA AIME 2025: 25.7 (#196/269)&lt;/li&gt;&lt;li&gt;AA GPQA Diamond: 65.5 (#247/483)&lt;/li&gt;&lt;li&gt;Artificial Analysis Intelligence Index: 18.56 (#260/482)&lt;/li&gt;&lt;li&gt;AA Humanity's Last Exam: 5.0 (#305/479)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Llama 3.3 Nemotron Super 49B v1 (Reasoning)&lt;/strong&gt; — ELO 1448, #502&lt;ul&gt;&lt;li&gt;AA MATH-500: 95.9 (#36/193)&lt;/li&gt;&lt;li&gt;AA AIME 2025: 54.7 (#132/269)&lt;/li&gt;&lt;li&gt;AA MMLU-Pro: 78.5 (#136/345)&lt;/li&gt;&lt;li&gt;AA CritPt: 0.0 (#215/388)&lt;/li&gt;&lt;li&gt;AA Humanity's Last Exam: 6.5 (#227/479)&lt;/li&gt;&lt;li&gt;AA LiveCodeBench: 27.7 (#238/343)&lt;/li&gt;&lt;li&gt;AA GPQA Diamond: 64.3 (#251/483)&lt;/li&gt;&lt;li&gt;Artificial Analysis Intelligence Index: 18.49 (#262/482)&lt;/li&gt;&lt;li&gt;AA TAU-2 Bench: 26.9 (#262/402)&lt;/li&gt;&lt;li&gt;AA IFBench: 38.1 (#262/411)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Solar Pro 2 (Non-reasoning)&lt;/strong&gt; — ELO 1435, #524&lt;ul&gt;&lt;li&gt;AA MATH-500: 88.9 (#76/193)&lt;/li&gt;&lt;li&gt;AA Global-MMLU-Lite: 75.34 (#91/120)&lt;/li&gt;&lt;li&gt;AA LiveCodeBench: 42.4 (#172/343)&lt;/li&gt;&lt;li&gt;AA MMLU-Pro: 75.0 (#178/345)&lt;/li&gt;&lt;li&gt;AA AIME 2025: 30.0 (#186/269)&lt;/li&gt;&lt;li&gt;AA CritPt: 0.0 (#203/388)&lt;/li&gt;&lt;li&gt;AA TAU-2 Bench: 31.9 (#230/402)&lt;/li&gt;&lt;li&gt;AA GDPval: 447.04 (#265/360)&lt;/li&gt;&lt;li&gt;AA Terminal-Bench Hard: 4.5 (#273/397)&lt;/li&gt;&lt;li&gt;AA IFBench: 33.7 (#306/411)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Llama 3.3 Nemotron Super 49B v1 (Non-reasoning)&lt;/strong&gt; — ELO 1408, #560&lt;ul&gt;&lt;li&gt;AA MATH-500: 77.5 (#113/193)&lt;/li&gt;&lt;li&gt;AA CritPt: 0.0 (#216/388)&lt;/li&gt;&lt;li&gt;AA Omniscience: -49.68 (#219/388)&lt;/li&gt;&lt;li&gt;AA MMLU-Pro: 69.8 (#221/345)&lt;/li&gt;&lt;li&gt;AA LiveCodeBench: 28.0 (#235/343)&lt;/li&gt;&lt;li&gt;AA AIME 2025: 7.7 (#237/269)&lt;/li&gt;&lt;li&gt;AA IFBench: 39.5 (#247/411)&lt;/li&gt;&lt;li&gt;AA Long Context Reasoning: 11.3 (#309/411)&lt;/li&gt;&lt;li&gt;AA GPQA Diamond: 51.7 (#330/483)&lt;/li&gt;&lt;li&gt;Artificial Analysis Intelligence Index: 14.35 (#336/482)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;NVIDIA Nemotron 3 Nano 4B&lt;/strong&gt; — ELO 1388, #586&lt;ul&gt;&lt;li&gt;AA IFBench: 58.2 (#106/411)&lt;/li&gt;&lt;li&gt;AA CritPt: 0.0 (#211/388)&lt;/li&gt;&lt;li&gt;AA Terminal-Bench Hard: 6.8 (#238/397)&lt;/li&gt;&lt;li&gt;AA TAU-2 Bench: 28.1 (#252/402)&lt;/li&gt;&lt;li&gt;AA GDPval: 476.83 (#258/360)&lt;/li&gt;&lt;li&gt;AA Long Context Reasoning: 16.7 (#286/411)&lt;/li&gt;&lt;li&gt;AA Humanity's Last Exam: 4.8 (#323/479)&lt;/li&gt;&lt;li&gt;Artificial Analysis Intelligence Index: 14.68 (#325/482)&lt;/li&gt;&lt;li&gt;AA GPQA Diamond: 51.3 (#338/483)&lt;/li&gt;&lt;li&gt;AA Omniscience: -71.53 (#351/388)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Granite 4.1 3B&lt;/strong&gt; — ELO 1380, #595&lt;ul&gt;&lt;li&gt;AA CritPt: 0.0 (#232/388)&lt;/li&gt;&lt;li&gt;AA GDPval: 366.32 (#286/360)&lt;/li&gt;&lt;li&gt;AA IFBench: 33.7 (#307/411)&lt;/li&gt;&lt;li&gt;AA Terminal-Bench Hard: 2.3 (#312/397)&lt;/li&gt;&lt;li&gt;AA TAU-2 Bench: 19.6 (#323/402)&lt;/li&gt;&lt;li&gt;AA Long Context Reasoning: 3.0 (#341/411)&lt;/li&gt;&lt;li&gt;AA Omniscience: -77.38 (#370/388)&lt;/li&gt;&lt;li&gt;AA SciCode: 11.9 (#412/477)&lt;/li&gt;&lt;li&gt;Artificial Analysis Intelligence Index: 8.54 (#435/482)&lt;/li&gt;&lt;li&gt;AA GPQA Diamond: 31.4 (#441/483)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemma 4 E2B (Reasoning)&lt;/strong&gt; — ELO 1376, #604&lt;ul&gt;&lt;li&gt;AA Omniscience: -23.98 (#94/388)&lt;/li&gt;&lt;li&gt;AA MMMU-Pro: 44.6 (#160/188)&lt;/li&gt;&lt;li&gt;AA CritPt: 0.0 (#170/388)&lt;/li&gt;&lt;li&gt;AA IFBench: 38.0 (#265/411)&lt;/li&gt;&lt;li&gt;AA Long Context Reasoning: 15.0 (#292/411)&lt;/li&gt;&lt;li&gt;AA Terminal-Bench Hard: 3.0 (#299/397)&lt;/li&gt;&lt;li&gt;Artificial Analysis Intelligence Index: 15.21 (#309/482)&lt;/li&gt;&lt;li&gt;AA TAU-2 Bench: 20.8 (#315/402)&lt;/li&gt;&lt;li&gt;AA Humanity's Last Exam: 4.8 (#322/479)&lt;/li&gt;&lt;li&gt;AA GDPval: 272.59 (#338/360)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Llama 3.1 Nemotron Nano 4B v1.1 (Reasoning)&lt;/strong&gt; — ELO 1351, #630&lt;ul&gt;&lt;li&gt;AA MATH-500: 94.7 (#41/193)&lt;/li&gt;&lt;li&gt;AA AIME 2025: 50.0 (#140/269)&lt;/li&gt;&lt;li&gt;AA LiveCodeBench: 49.3 (#153/343)&lt;/li&gt;&lt;li&gt;AA MMLU-Pro: 55.6 (#283/345)&lt;/li&gt;&lt;li&gt;AA Humanity's Last Exam: 5.1 (#289/479)&lt;/li&gt;&lt;li&gt;Artificial Analysis Intelligence Index: 14.43 (#334/482)&lt;/li&gt;&lt;li&gt;AA Long Context Reasoning: 0.0 (#358/411)&lt;/li&gt;&lt;li&gt;AA TAU-2 Bench: 11.7 (#362/402)&lt;/li&gt;&lt;li&gt;AA IFBench: 25.5 (#375/411)&lt;/li&gt;&lt;li&gt;AA GPQA Diamond: 40.8 (#393/483)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Ling-mini-2.0&lt;/strong&gt; — ELO 1346, #635&lt;ul&gt;&lt;li&gt;AA AIME 2025: 49.3 (#142/269)&lt;/li&gt;&lt;li&gt;AA LiveCodeBench: 42.9 (#169/343)&lt;/li&gt;&lt;li&gt;AA MMLU-Pro: 67.1 (#243/345)&lt;/li&gt;&lt;li&gt;AA CritPt: 0.0 (#284/388)&lt;/li&gt;&lt;li&gt;AA Humanity's Last Exam: 5.0 (#304/479)&lt;/li&gt;&lt;li&gt;AA GPQA Diamond: 56.2 (#306/483)&lt;/li&gt;&lt;li&gt;AA Long Context Reasoning: 6.7 (#329/411)&lt;/li&gt;&lt;li&gt;AA GDPval: 264.15 (#341/360)&lt;/li&gt;&lt;li&gt;AA Terminal-Bench Hard: 0.8 (#345/397)&lt;/li&gt;&lt;li&gt;AA TAU-2 Bench: 13.2 (#356/402)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Jamba Reasoning 3B&lt;/strong&gt; — ELO 1320, #657&lt;ul&gt;&lt;li&gt;AA IFBench: 52.4 (#137/411)&lt;/li&gt;&lt;li&gt;AA AIME 2025: 10.7 (#231/269)&lt;/li&gt;&lt;li&gt;AA LiveCodeBench: 21.0 (#267/343)&lt;/li&gt;&lt;li&gt;AA CritPt: 0.0 (#268/388)&lt;/li&gt;&lt;li&gt;AA MMLU-Pro: 57.7 (#274/345)&lt;/li&gt;&lt;li&gt;AA Long Context Reasoning: 7.0 (#323/411)&lt;/li&gt;&lt;li&gt;AA TAU-2 Bench: 15.8 (#342/402)&lt;/li&gt;&lt;li&gt;AA Terminal-Bench Hard: 0.8 (#344/397)&lt;/li&gt;&lt;li&gt;AA GDPval: 257.67 (#345/360)&lt;/li&gt;&lt;li&gt;AA Humanity's Last Exam: 4.6 (#347/479)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Exaone 4.0 1.2B (Reasoning)&lt;/strong&gt; — ELO 1266, #696&lt;ul&gt;&lt;li&gt;AA AIME 2025: 50.3 (#139/269)&lt;/li&gt;&lt;li&gt;AA LiveCodeBench: 51.6 (#143/343)&lt;/li&gt;&lt;li&gt;AA CritPt: 0.0 (#241/388)&lt;/li&gt;&lt;li&gt;AA Humanity's Last Exam: 5.8 (#251/479)&lt;/li&gt;&lt;li&gt;AA MMLU-Pro: 58.8 (#268/345)&lt;/li&gt;&lt;li&gt;AA GDPval: 296.88 (#317/360)&lt;/li&gt;&lt;li&gt;AA GPQA Diamond: 51.5 (#336/483)&lt;/li&gt;&lt;li&gt;AA TAU-2 Bench: 16.4 (#338/402)&lt;/li&gt;&lt;li&gt;AA Long Context Reasoning: 0.0 (#370/411)&lt;/li&gt;&lt;li&gt;AA Terminal-Bench Hard: 0.0 (#377/397)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Exaone 4.0 1.2B (Non-reasoning)&lt;/strong&gt; — ELO 1262, #697&lt;ul&gt;&lt;li&gt;AA AIME 2025: 24.0 (#200/269)&lt;/li&gt;&lt;li&gt;AA LiveCodeBench: 29.3 (#226/343)&lt;/li&gt;&lt;li&gt;AA CritPt: 0.0 (#239/388)&lt;/li&gt;&lt;li&gt;AA Humanity's Last Exam: 5.8 (#250/479)&lt;/li&gt;&lt;li&gt;AA MMLU-Pro: 50.0 (#294/345)&lt;/li&gt;&lt;li&gt;AA GDPval: 298.76 (#316/360)&lt;/li&gt;&lt;li&gt;AA TAU-2 Bench: 20.5 (#318/402)&lt;/li&gt;&lt;li&gt;AA Long Context Reasoning: 0.0 (#369/411)&lt;/li&gt;&lt;li&gt;AA Terminal-Bench Hard: 0.0 (#376/397)&lt;/li&gt;&lt;li&gt;AA IFBench: 25.3 (#376/411)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Granite 4.0 1B&lt;/strong&gt; — ELO 1258, #701&lt;ul&gt;&lt;li&gt;AA CritPt: 0.0 (#234/388)&lt;/li&gt;&lt;li&gt;AA AIME 2025: 6.3 (#244/269)&lt;/li&gt;&lt;li&gt;AA Humanity's Last Exam: 5.1 (#292/479)&lt;/li&gt;&lt;li&gt;AA TAU-2 Bench: 22.8 (#294/402)&lt;/li&gt;&lt;li&gt;AA MMLU-Pro: 32.5 (#331/345)&lt;/li&gt;&lt;li&gt;AA LiveCodeBench: 4.7 (#333/343)&lt;/li&gt;&lt;li&gt;AA Long Context Reasoning: 4.0 (#340/411)&lt;/li&gt;&lt;li&gt;AA GDPval: 259.61 (#342/360)&lt;/li&gt;&lt;li&gt;AA Terminal-Bench Hard: 0.0 (#373/397)&lt;/li&gt;&lt;li&gt;AA Omniscience: -81.82 (#377/388)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Granite 4.0 H 350M&lt;/strong&gt; — ELO 1137, #759&lt;ul&gt;&lt;li&gt;AA CritPt: 0.0 (#227/388)&lt;/li&gt;&lt;li&gt;AA Humanity's Last Exam: 6.4 (#228/479)&lt;/li&gt;&lt;li&gt;AA AIME 2025: 1.3 (#262/269)&lt;/li&gt;&lt;li&gt;AA GDPval: 294.09 (#319/360)&lt;/li&gt;&lt;li&gt;AA LiveCodeBench: 1.9 (#339/343)&lt;/li&gt;&lt;li&gt;AA MMLU-Pro: 12.7 (#343/345)&lt;/li&gt;&lt;li&gt;AA TAU-2 Bench: 14.6 (#349/402)&lt;/li&gt;&lt;li&gt;AA Long Context Reasoning: 0.0 (#366/411)&lt;/li&gt;&lt;li&gt;AA Terminal-Bench Hard: 0.0 (#369/397)&lt;/li&gt;&lt;li&gt;AA Omniscience: -87.25 (#387/388)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OLMo 2 32B&lt;/strong&gt; — ELO 1037, #780&lt;ul&gt;&lt;li&gt;AA AIME 2025: 3.3 (#256/269)&lt;/li&gt;&lt;li&gt;AA IFBench: 38.1 (#264/411)&lt;/li&gt;&lt;li&gt;AA MMLU-Pro: 51.1 (#292/345)&lt;/li&gt;&lt;li&gt;AA LiveCodeBench: 6.8 (#328/343)&lt;/li&gt;&lt;li&gt;AA Terminal-Bench Hard: 0.0 (#391/397)&lt;/li&gt;&lt;li&gt;AA Long Context Reasoning: 0.0 (#393/411)&lt;/li&gt;&lt;li&gt;Artificial Analysis Intelligence Index: 10.57 (#397/482)&lt;/li&gt;&lt;li&gt;AA TAU-2 Bench: 0.0 (#401/402)&lt;/li&gt;&lt;li&gt;AA GPQA Diamond: 32.8 (#429/483)&lt;/li&gt;&lt;li&gt;AA SciCode: 8.0 (#437/477)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Phi-3 Mini Instruct 3.8B&lt;/strong&gt; — ELO 1025, #781&lt;ul&gt;&lt;li&gt;AA MATH-500: 45.7 (#172/193)&lt;/li&gt;&lt;li&gt;AA AIME 2025: 0.3 (#265/269)&lt;/li&gt;&lt;li&gt;AA MMLU-Pro: 43.5 (#308/345)&lt;/li&gt;&lt;li&gt;AA LiveCodeBench: 11.6 (#308/343)&lt;/li&gt;&lt;li&gt;AA Long Context Reasoning: 2.0 (#345/411)&lt;/li&gt;&lt;li&gt;AA Humanity's Last Exam: 4.4 (#372/479)&lt;/li&gt;&lt;li&gt;AA IFBench: 23.9 (#382/411)&lt;/li&gt;&lt;li&gt;AA Terminal-Bench Hard: 0.0 (#388/397)&lt;/li&gt;&lt;li&gt;AA TAU-2 Bench: 0.0 (#398/402)&lt;/li&gt;&lt;li&gt;Artificial Analysis Intelligence Index: 10.1 (#407/482)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OLMo 2 7B&lt;/strong&gt; — ELO 958, #787&lt;ul&gt;&lt;li&gt;AA AIME 2025: 0.7 (#263/269)&lt;/li&gt;&lt;li&gt;AA Humanity's Last Exam: 5.5 (#265/479)&lt;/li&gt;&lt;li&gt;AA MMLU-Pro: 28.2 (#334/345)&lt;/li&gt;&lt;li&gt;AA LiveCodeBench: 4.1 (#335/343)&lt;/li&gt;&lt;li&gt;AA IFBench: 24.4 (#381/411)&lt;/li&gt;&lt;li&gt;AA Terminal-Bench Hard: 0.0 (#390/397)&lt;/li&gt;&lt;li&gt;AA Long Context Reasoning: 0.0 (#391/411)&lt;/li&gt;&lt;li&gt;AA TAU-2 Bench: 0.0 (#399/402)&lt;/li&gt;&lt;li&gt;Artificial Analysis Intelligence Index: 9.3 (#423/482)&lt;/li&gt;&lt;li&gt;AA GPQA Diamond: 28.8 (#455/483)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;Top-10 New Scores (7)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Claude Mythos Preview&lt;/strong&gt; on METR Benchmark: 17.41 (#1)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-5.4 (xHigh)&lt;/strong&gt; on OpenClawProBench: 68.0 (#8)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5 (xHigh)&lt;/strong&gt; on OpenClawProBench: 69.3 (#4)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5 (xHigh)&lt;/strong&gt; on Wolfram LLM Benchmarking Project: 68.8 (#6)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5 Pro&lt;/strong&gt; on Epoch AI - ECI: 159.5 (#3)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5 Pro&lt;/strong&gt; on PinchBench: 18.11 (#39)&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5 Pro&lt;/strong&gt; on VoxelBench: 2107.0 (#1)&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;New #1 Leaders (14)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;FoodTruckBench&lt;/strong&gt;: GPT-5.5 (61408.0) beat Claude Opus 4.6 by 11889.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LIBRA - ruBABILongQA2&lt;/strong&gt;: Qwen_Qwen3-30B-A3B-Instruct-2507 (64.72) beat GPT-4o by 28.05&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LIBRA - ruQuALITY&lt;/strong&gt;: 01-ai_Yi-9B-200K (95.9) beat GPT-4o by 12.57&lt;/li&gt;&lt;li&gt;&lt;strong&gt;SEAL - AudioMultiChallenge - Audio Output&lt;/strong&gt;: gpt-realtime-2 (xHigh) (48.45) beat gemini-3.1-flash-live-preview (Thinking) by 12.39&lt;/li&gt;&lt;li&gt;&lt;strong&gt;FrontierSWE&lt;/strong&gt;: GPT-5.5 (83.0) beat Claude Opus 4.7 by 9.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;FrontierMath - Tier 4&lt;/strong&gt;: AI co-mathematician (47.9) beat GPT-5.5 Pro (xhigh) by 8.3&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Story Theory Bench&lt;/strong&gt;: glm-5 (99.6) beat deepseek-v3.2 by 7.4&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Kaggle FACTS Parametric&lt;/strong&gt;: Gemini 3.1 Pro Preview (78.96) beat Gemini 3 Flash Preview by 6.7&lt;/li&gt;&lt;li&gt;&lt;strong&gt;SEAL - SWE Atlas - Codebase QnA&lt;/strong&gt;: GPT 5.5 (Codex) (45.43) beat Gpt 5.4 xHigh (Codex) by 4.63&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LIBRA - ruSciAbstractRetrieval&lt;/strong&gt;: Qwen_Qwen3-30B-A3B-Instruct-2507 (81.5) beat GLM-4 9B Chat by 3.69&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Kaggle FACTS (Google)&lt;/strong&gt;: GPT-5.5 (71.19) beat Gemini 3.1 Pro Preview by 3.48&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LIBRA - ruBABILongQA1&lt;/strong&gt;: Qwen_Qwen3-30B-A3B-Instruct-2507 (80.5) beat GPT-4o by 2.17&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Android Bench&lt;/strong&gt;: GPT 5.5 (74.0) beat GPT-5.4 by 1.6&lt;/li&gt;&lt;li&gt;&lt;strong&gt;ForecastBench&lt;/strong&gt;: green tree (68.2) beat Cassi ensemble_2_crowdadj by 0.4&lt;/li&gt;&lt;/ul&gt;</content><summary>AI Benchmark Digest — 2026-05-10

=== DAILY ===
NEW BENCHMARKS (43)
  - AA Global-MMLU-Lite - Arabic (Accuracy (%)): leader Gemini 3.1 Pro Preview (93.0), 119 models
  - AA Global-MMLU-Lite - Bengali (Accuracy (%)): leader Gemini 3.1 Pro Preview (92.17), 119 models
  - AA Global-MMLU-Lite - German (</summary></entry><entry><title>AI Benchmark Digest — 2026-05-09</title><id>https://aibenchmarks.dev/digest/2026-05-09</id><updated>2026-05-09T07:40:39.118338+00:00</updated><link href="https://aibenchmarks.dev/#/digest" /><author><name>AI Benchmark Hub</name></author><content type="html">&lt;h2&gt;Daily&lt;/h2&gt;
&lt;h3&gt;New Benchmarks (8)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Factory Code Review Benchmark&lt;/strong&gt; (Mean F1 (%)): leader GPT-5.2 (60.5), 13 models&lt;br&gt;&lt;span&gt;Factory benchmark for code review quality, scoring model comments against expected findings with mean F1 across realistic pull request review tasks.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;EuroEval Albanian NLU&lt;/strong&gt; (NLU Average Score (%)): leader gemini-3.1-pro-preview (61.17), 208 models&lt;br&gt;&lt;span&gt;Albanian-language EuroEval natural-language-understanding suite, separating NLU task performance from the broader all-task EuroEval aggregate.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;EuroEval Bosnian NLU&lt;/strong&gt; (NLU Average Score (%)): leader Ministral-3-14B-Reasoning-2512 (66.0), 214 models&lt;br&gt;&lt;span&gt;Bosnian-language EuroEval natural-language-understanding suite, separating NLU task performance from the broader all-task EuroEval aggregate.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;EuroEval Albanian Knowledge&lt;/strong&gt; (Knowledge Average Score (%)): leader gemini-3-flash-preview#thinking (96.46), 167 models&lt;br&gt;&lt;span&gt;EuroEval Albanian knowledge category: language-specific factual or domain-knowledge tasks from EuroEval&amp;#x27;s public albanian_all.csv leaderboard, scored as the average task score for each model.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;EuroEval Albanian Common Sense Reasoning&lt;/strong&gt; (Common Sense Reasoning Average Score (%)): leader gemini-3.1-pro-preview (85.24), 155 models&lt;br&gt;&lt;span&gt;EuroEval Albanian common-sense reasoning category: language-specific commonsense tasks from EuroEval&amp;#x27;s public albanian_all.csv leaderboard, scored as the average task score for each model.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;IMO-Bench&lt;/strong&gt; (Advanced ProofBench Accuracy (%)): leader Aletheia (91.9), 9 models&lt;br&gt;&lt;span&gt;Advanced IMO-ProofBench leaderboard for rigorous mathematical proof writing on olympiad-level problems.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;ChartMuseum&lt;/strong&gt; (Overall Accuracy (%)): leader Gemini-3.1-Pro (80.7), 22 models&lt;br&gt;&lt;span&gt;Chart question-answering benchmark over real-world charts, testing visual, textual, and synthesis reasoning.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;SvelteBench&lt;/strong&gt; (Average pass@1 (%)): leader claude-opus-4-6 (100.0), 123 models&lt;br&gt;&lt;span&gt;Frontend coding benchmark for Svelte component tasks, scored by average pass@1.&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;New Models (1)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Grok 4.3 (Non-reasoning)&lt;/strong&gt; — ELO 1647, #259&lt;ul&gt;&lt;li&gt;AA GDPval: 1306.14 (#52/360)&lt;/li&gt;&lt;li&gt;AA MMMU-Pro: 64.8 (#88/188)&lt;/li&gt;&lt;li&gt;AA Omniscience: -32.3 (#121/388)&lt;/li&gt;&lt;li&gt;Artificial Analysis Intelligence Index: 31.02 (#139/482)&lt;/li&gt;&lt;li&gt;AA SciCode: 37.4 (#146/477)&lt;/li&gt;&lt;li&gt;AA TAU-2 Bench: 65.8 (#148/402)&lt;/li&gt;&lt;li&gt;AA Terminal-Bench Hard: 18.9 (#149/397)&lt;/li&gt;&lt;li&gt;AA IFBench: 47.6 (#165/411)&lt;/li&gt;&lt;li&gt;AA CritPt: 0.0 (#182/388)&lt;/li&gt;&lt;li&gt;AA Humanity's Last Exam: 6.5 (#226/479)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;Top-10 New Scores (1)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5 (xHigh)&lt;/strong&gt; on Wolfram LLM Benchmarking Project: 68.8 (#6)&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;New #1 Leaders (4)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;FrontierMath - Tier 4&lt;/strong&gt;: AI co-mathematician (47.9) beat GPT-5.5 Pro (xhigh) by 8.3&lt;/li&gt;&lt;li&gt;&lt;strong&gt;METR Benchmark&lt;/strong&gt;: claude mythos preview early (17.41) beat claude opus 4 6 by 5.43&lt;/li&gt;&lt;li&gt;&lt;strong&gt;METR Benchmark (80% Horizon)&lt;/strong&gt;: claude mythos preview early (3.1) beat gemini 3 1 pro by 1.6&lt;/li&gt;&lt;li&gt;&lt;strong&gt;ForecastBench&lt;/strong&gt;: green tree (68.2) beat Cassi ensemble_2_crowdadj by 0.4&lt;/li&gt;&lt;/ul&gt;</content><summary>AI Benchmark Digest — 2026-05-09

=== DAILY ===
NEW BENCHMARKS (8)
  - Factory Code Review Benchmark (Mean F1 (%)): leader GPT-5.2 (60.5), 13 models
      Factory benchmark for code review quality, scoring model comments against expected findings with mean F1 across realistic pull request review tas</summary></entry><entry><title>AI Benchmark Digest — 2026-05-08</title><id>https://aibenchmarks.dev/digest/2026-05-08</id><updated>2026-05-08T07:40:34.661988+00:00</updated><link href="https://aibenchmarks.dev/#/digest" /><author><name>AI Benchmark Hub</name></author><content type="html">&lt;h2&gt;Daily&lt;/h2&gt;
&lt;h3&gt;New Benchmarks (8)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;EuroEval Albanian NLU&lt;/strong&gt; (NLU Average Score (%)): leader gemini-3.1-pro-preview (61.17), 208 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;EuroEval Bosnian NLU&lt;/strong&gt; (NLU Average Score (%)): leader Ministral-3-14B-Reasoning-2512 (66.0), 214 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;EuroEval Albanian Knowledge&lt;/strong&gt; (Knowledge Average Score (%)): leader gemini-3-flash-preview#thinking (96.46), 167 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;EuroEval Albanian Common Sense Reasoning&lt;/strong&gt; (Common Sense Reasoning Average Score (%)): leader gemini-3.1-pro-preview (85.24), 155 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;MoNaCo&lt;/strong&gt; (F1): leader o3 (61.18), 15 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;IMO-Bench&lt;/strong&gt; (Advanced ProofBench Accuracy (%)): leader Aletheia (91.9), 9 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;ChartMuseum&lt;/strong&gt; (Overall Accuracy (%)): leader Gemini-3.1-Pro (80.7), 22 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;SvelteBench&lt;/strong&gt; (Average pass@1 (%)): leader claude-opus-4-6 (100.0), 123 models&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;New #1 Leaders (3)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;SEAL - AudioMultiChallenge - Audio Output&lt;/strong&gt;: gpt-realtime-2 (xHigh) (48.45) beat gemini-3.1-flash-live-preview (Thinking) by 12.39&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Story Theory Bench&lt;/strong&gt;: glm-5 (99.6) beat deepseek-v3.2 by 7.4&lt;/li&gt;&lt;li&gt;&lt;strong&gt;SEAL - SWE Atlas - Codebase QnA&lt;/strong&gt;: GPT 5.5 (Codex) (45.43) beat Gpt 5.4 xHigh (Codex) by 4.63&lt;/li&gt;&lt;/ul&gt;</content><summary>AI Benchmark Digest — 2026-05-08

=== DAILY ===
NEW BENCHMARKS (8)
  - EuroEval Albanian NLU (NLU Average Score (%)): leader gemini-3.1-pro-preview (61.17), 208 models
  - EuroEval Bosnian NLU (NLU Average Score (%)): leader Ministral-3-14B-Reasoning-2512 (66.0), 214 models
  - EuroEval Albanian Kno</summary></entry><entry><title>AI Benchmark Digest — 2026-05-07</title><id>https://aibenchmarks.dev/digest/2026-05-07</id><updated>2026-05-07T07:40:24.104745+00:00</updated><link href="https://aibenchmarks.dev/#/digest" /><author><name>AI Benchmark Hub</name></author><content type="html">&lt;h2&gt;Daily&lt;/h2&gt;
&lt;h3&gt;New Benchmarks (19)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;LIBRA - MatreshkaNames *&lt;/strong&gt; (Dataset Total Score (%)): leader Qwen_Qwen3-30B-A3B-Instruct-2507 (81.2), 7 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LIBRA - ruSciPassageCount *&lt;/strong&gt; (Dataset Total Score (%)): leader Qwen_Qwen3-30B-A3B-Instruct-2507 (25.77), 7 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LIBRA - ru2WikiMultihopQA *&lt;/strong&gt; (Dataset Total Score (%)): leader Qwen_Qwen3-30B-A3B-Instruct-2507 (66.63), 7 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LIBRA - LongContextMultiQ *&lt;/strong&gt; (Dataset Total Score (%)): leader 01-ai_Yi-34B-200K (53.14), 7 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LIBRA - LibrusecMHQA *&lt;/strong&gt; (Dataset Total Score (%)): leader Qwen_Qwen3-30B-A3B-Instruct-2507 (51.0), 7 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LIBRA - ruBABILongQA3 *&lt;/strong&gt; (Dataset Total Score (%)): leader Qwen_Qwen3-30B-A3B-Instruct-2507 (38.38), 7 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Kernel Arena - KernelBench HIP&lt;/strong&gt; (Mean Correctness+Speedup): leader GPT-5.2 (15.463), 11 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Kernel Arena - WaferBench NVFP4&lt;/strong&gt; (Mean Correctness+Speedup): leader Gemini 3.1 Pro (2.274), 4 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;MathArena - ARXIV_FALSE April&lt;/strong&gt; (Accuracy (%)): leader GPT-5.5 (xhigh) (72.13), 6 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;MathArena - ARXIV April&lt;/strong&gt; (Accuracy (%)): leader GPT-5.5 (xhigh) (65.48), 6 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;METR Benchmark (80% Horizon)&lt;/strong&gt; (80% Time Horizon (hours)): leader gemini 3 1 pro (1.5), 25 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLM Stats (HealthBench)&lt;/strong&gt; (Score (%)): leader Kimi K2-Thinking-0905 (58.0), 5 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;SCORE Robustness (Accuracy)&lt;/strong&gt; (Average Accuracy (%)): leader Llama-3.1-70B-Instruct (67.02), 6 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;SCORE Robustness (Consistency)&lt;/strong&gt; (Average Consistency Rate (%)): leader Llama-3.1-70B-Instruct (72.39), 6 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Multilingual MMLU Leaderboard&lt;/strong&gt; (Average Accuracy (%)): leader Claude-3.5-Sonnet (77.39), 17 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Pinocchio Italian Leaderboard&lt;/strong&gt; (Average Accuracy (%)): leader gemma-2-27b-it (70.97), 45 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Ukrainian LLM Leaderboard&lt;/strong&gt; (Average Score (%)): leader gemma-4-26B-A4B-it (reasoning) (63.29), 13 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Arabic Broad Leaderboard&lt;/strong&gt; (Average Score (0-10)): leader gemini-3-pro-preview (9.204), 87 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Darija Chatbot Arena&lt;/strong&gt; (Elo Rating): leader GPT-4o (1404.8), 13 models&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;New #1 Leaders (3)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;FoodTruckBench&lt;/strong&gt;: GPT-5.5 (61408.0) beat Claude Opus 4.6 by 11889.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;ASCIIBench&lt;/strong&gt;: claude-opus-4.5 (1656.0) beat claude-opus-4.1 by 5.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Kaggle FACTS Parametric&lt;/strong&gt;: Gemini 3.1 Pro Preview (78.96) beat GPT-5.5 by 0.92&lt;/li&gt;&lt;/ul&gt;</content><summary>AI Benchmark Digest — 2026-05-07

=== DAILY ===
NEW BENCHMARKS (19)
  - LIBRA - MatreshkaNames * (Dataset Total Score (%)): leader Qwen_Qwen3-30B-A3B-Instruct-2507 (81.2), 7 models
  - LIBRA - ruSciPassageCount * (Dataset Total Score (%)): leader Qwen_Qwen3-30B-A3B-Instruct-2507 (25.77), 7 models
  </summary></entry><entry><title>AI Benchmark Digest — 2026-05-04</title><id>https://aibenchmarks.dev/digest/2026-05-04</id><updated>2026-05-04T07:41:09.001799+00:00</updated><link href="https://aibenchmarks.dev/#/digest" /><author><name>AI Benchmark Hub</name></author><content type="html">&lt;h2&gt;Daily&lt;/h2&gt;
&lt;h3&gt;New Models (62)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Doubao Seed Code&lt;/strong&gt; — ELO 1645, #209&lt;/li&gt;&lt;li&gt;&lt;strong&gt;K-EXAONE (Reasoning)&lt;/strong&gt; — ELO 1645, #210&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 2.5 Flash Preview (Sep '25) (Reasoning)&lt;/strong&gt; — ELO 1638, #221&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemma 4 31B (Non-reasoning)&lt;/strong&gt; — ELO 1626, #232&lt;/li&gt;&lt;li&gt;&lt;strong&gt;ERNIE 5.0 Thinking Preview&lt;/strong&gt; — ELO 1622, #240&lt;/li&gt;&lt;li&gt;&lt;strong&gt;EXAONE 4.5 33B&lt;/strong&gt; — ELO 1619, #245&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Nemotron Cascade 2 30B A3B&lt;/strong&gt; — ELO 1591, #288&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 2.5 Flash Preview (Sep '25) (Non-reasoning)&lt;/strong&gt; — ELO 1579, #309&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemma 4 26B A4B (Non-reasoning)&lt;/strong&gt; — ELO 1579, #311&lt;/li&gt;&lt;li&gt;&lt;strong&gt;JT-MINI&lt;/strong&gt; — ELO 1567, #329&lt;/li&gt;&lt;li&gt;&lt;strong&gt;MiniMax M1 40k&lt;/strong&gt; — ELO 1551, #347&lt;/li&gt;&lt;li&gt;&lt;strong&gt;K2 Think V2&lt;/strong&gt; — ELO 1550, #349&lt;/li&gt;&lt;li&gt;&lt;strong&gt;HyperCLOVA X SEED Think (32B)&lt;/strong&gt; — ELO 1546, #354&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Mi:dm K 2.5 Pro&lt;/strong&gt; — ELO 1536, #369&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 2.0 Flash Thinking Experimental (Jan '25)&lt;/strong&gt; — ELO 1534, #373&lt;/li&gt;&lt;li&gt;&lt;strong&gt;K-EXAONE (Non-reasoning)&lt;/strong&gt; — ELO 1527, #380&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Solar Pro 3&lt;/strong&gt; — ELO 1525, #383&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Solar Open 100B (Reasoning)&lt;/strong&gt; — ELO 1521, #388&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Mi:dm K 2.5 Pro Preview&lt;/strong&gt; — ELO 1514, #392&lt;/li&gt;&lt;li&gt;&lt;strong&gt;EXAONE 4.0 32B (Reasoning)&lt;/strong&gt; — ELO 1511, #396&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-4o (ChatGPT)&lt;/strong&gt; — ELO 1493, #416&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-4o (March 2025, chatgpt-4o-latest)&lt;/strong&gt; — ELO 1492, #418&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemma 4 E4B (Reasoning)&lt;/strong&gt; — ELO 1475, #438&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Solar Pro 2 (Preview) (Reasoning)&lt;/strong&gt; — ELO 1474, #441&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Solar Pro 2 (Reasoning)&lt;/strong&gt; — ELO 1456, #463&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Solar Pro 2 (Preview) (Non-reasoning)&lt;/strong&gt; — ELO 1456, #464&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Llama 3.3 Nemotron Super 49B v1 (Reasoning)&lt;/strong&gt; — ELO 1448, #476&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Step3 VL 10B&lt;/strong&gt; — ELO 1446, #479&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Tri-21B-Think&lt;/strong&gt; — ELO 1441, #489&lt;/li&gt;&lt;li&gt;&lt;strong&gt;NVIDIA Nemotron 3 Nano 4B&lt;/strong&gt; — ELO 1436, #496&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 2.0 Flash-Lite (Feb '25)&lt;/strong&gt; — ELO 1433, #498&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Granite 4.1 30B&lt;/strong&gt; — ELO 1431, #501&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Llama 3.1 Tulu3 405B&lt;/strong&gt; — ELO 1425, #509&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemma 4 E2B (Reasoning)&lt;/strong&gt; — ELO 1405, #531&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemma 4 E4B (Non-reasoning)&lt;/strong&gt; — ELO 1405, #532&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Solar Pro 2 (Non-reasoning)&lt;/strong&gt; — ELO 1403, #538&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 1.5 Flash-8B&lt;/strong&gt; — ELO 1400, #540&lt;/li&gt;&lt;li&gt;&lt;strong&gt;QwQ 32B-Preview&lt;/strong&gt; — ELO 1385, #556&lt;/li&gt;&lt;li&gt;&lt;strong&gt;EXAONE 4.0 32B (Non-reasoning)&lt;/strong&gt; — ELO 1382, #559&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Llama 3.3 Nemotron Super 49B v1 (Non-reasoning)&lt;/strong&gt; — ELO 1364, #576&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemma 4 E2B (Non-reasoning)&lt;/strong&gt; — ELO 1355, #586&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Llama 3.1 Nemotron Nano 4B v1.1 (Reasoning)&lt;/strong&gt; — ELO 1347, #593&lt;/li&gt;&lt;li&gt;&lt;strong&gt;DeepHermes 3 - Mistral 24B Preview (Non-reasoning)&lt;/strong&gt; — ELO 1343, #596&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Qwen2.5 Coder Instruct 7B&lt;/strong&gt; — ELO 1312, #622&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Ling-mini-2.0&lt;/strong&gt; — ELO 1305, #627&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemma 3n E4B Instruct Preview (May '25)&lt;/strong&gt; — ELO 1304, #629&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Granite 4.1 3B&lt;/strong&gt; — ELO 1302, #632&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Jamba Reasoning 3B&lt;/strong&gt; — ELO 1275, #641&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LFM 40B&lt;/strong&gt; — ELO 1265, #650&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Exaone 4.0 1.2B (Reasoning)&lt;/strong&gt; — ELO 1259, #655&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Llama 2 Chat 13B&lt;/strong&gt; — ELO 1242, #667&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Exaone 4.0 1.2B (Non-reasoning)&lt;/strong&gt; — ELO 1234, #671&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Granite 4.0 H 1B&lt;/strong&gt; — ELO 1205, #688&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Molmo 7B-D&lt;/strong&gt; — ELO 1204, #690&lt;/li&gt;&lt;li&gt;&lt;strong&gt;DeepHermes 3 - Llama-3.1 8B Preview (Non-reasoning)&lt;/strong&gt; — ELO 1198, #694&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Granite 4.0 1B&lt;/strong&gt; — ELO 1197, #695&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OLMo 2 32B&lt;/strong&gt; — ELO 1141, #728&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Phi-3 Mini Instruct 3.8B&lt;/strong&gt; — ELO 1126, #733&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Granite 4.0 350M&lt;/strong&gt; — ELO 1103, #739&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemma 3 270M&lt;/strong&gt; — ELO 1088, #744&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Granite 4.0 H 350M&lt;/strong&gt; — ELO 1077, #748&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OLMo 2 7B&lt;/strong&gt; — ELO 1071, #752&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;Top-10 New Scores (1)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5 Pro&lt;/strong&gt; on VoxelBench: 2125.0 (#1)&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;New #1 Leaders (1)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;VoxelBench&lt;/strong&gt;: GPT-5.5 Pro (2125.0) beat GPT-5.5 (xHigh) by 103.0&lt;/li&gt;&lt;/ul&gt;</content><summary>AI Benchmark Digest — 2026-05-04

=== DAILY ===
NEW MODELS (62)
  - Doubao Seed Code — ELO 1645, #209/778 (above: Qwen 3 235B A22B 2507 (Reasoning), below: K-EXAONE (Reasoning))
  - K-EXAONE (Reasoning) — ELO 1645, #210/778 (above: Doubao Seed Code, below: O4 Mini (High))
  - Gemini 2.5 Flash Previe</summary></entry><entry><title>AI Benchmark Digest — 2026-05-04</title><id>https://aibenchmarks.dev/digest/2026-05-04</id><updated>2026-05-04T00:19:05.942276+00:00</updated><link href="https://aibenchmarks.dev/#/digest" /><author><name>AI Benchmark Hub</name></author><content type="html">&lt;h2&gt;Daily&lt;/h2&gt;
&lt;h3&gt;New Models (62)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Doubao Seed Code&lt;/strong&gt; — ELO 1645, #209&lt;/li&gt;&lt;li&gt;&lt;strong&gt;K-EXAONE (Reasoning)&lt;/strong&gt; — ELO 1645, #210&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 2.5 Flash Preview (Sep '25) (Reasoning)&lt;/strong&gt; — ELO 1638, #221&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemma 4 31B (Non-reasoning)&lt;/strong&gt; — ELO 1626, #232&lt;/li&gt;&lt;li&gt;&lt;strong&gt;ERNIE 5.0 Thinking Preview&lt;/strong&gt; — ELO 1622, #241&lt;/li&gt;&lt;li&gt;&lt;strong&gt;EXAONE 4.5 33B&lt;/strong&gt; — ELO 1619, #248&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Nemotron Cascade 2 30B A3B&lt;/strong&gt; — ELO 1591, #288&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemma 4 26B A4B (Non-reasoning)&lt;/strong&gt; — ELO 1580, #309&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 2.5 Flash Preview (Sep '25) (Non-reasoning)&lt;/strong&gt; — ELO 1579, #310&lt;/li&gt;&lt;li&gt;&lt;strong&gt;JT-MINI&lt;/strong&gt; — ELO 1567, #329&lt;/li&gt;&lt;li&gt;&lt;strong&gt;MiniMax M1 40k&lt;/strong&gt; — ELO 1551, #347&lt;/li&gt;&lt;li&gt;&lt;strong&gt;K2 Think V2&lt;/strong&gt; — ELO 1550, #349&lt;/li&gt;&lt;li&gt;&lt;strong&gt;HyperCLOVA X SEED Think (32B)&lt;/strong&gt; — ELO 1547, #352&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Mi:dm K 2.5 Pro&lt;/strong&gt; — ELO 1536, #370&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 2.0 Flash Thinking Experimental (Jan '25)&lt;/strong&gt; — ELO 1534, #373&lt;/li&gt;&lt;li&gt;&lt;strong&gt;K-EXAONE (Non-reasoning)&lt;/strong&gt; — ELO 1527, #380&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Solar Pro 3&lt;/strong&gt; — ELO 1525, #384&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Solar Open 100B (Reasoning)&lt;/strong&gt; — ELO 1521, #388&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Mi:dm K 2.5 Pro Preview&lt;/strong&gt; — ELO 1515, #392&lt;/li&gt;&lt;li&gt;&lt;strong&gt;EXAONE 4.0 32B (Reasoning)&lt;/strong&gt; — ELO 1511, #397&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-4o (March 2025, chatgpt-4o-latest)&lt;/strong&gt; — ELO 1493, #416&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GPT-4o (ChatGPT)&lt;/strong&gt; — ELO 1493, #417&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemma 4 E4B (Reasoning)&lt;/strong&gt; — ELO 1475, #438&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Solar Pro 2 (Preview) (Reasoning)&lt;/strong&gt; — ELO 1474, #441&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Solar Pro 2 (Preview) (Non-reasoning)&lt;/strong&gt; — ELO 1457, #463&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Solar Pro 2 (Reasoning)&lt;/strong&gt; — ELO 1456, #466&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Llama 3.3 Nemotron Super 49B v1 (Reasoning)&lt;/strong&gt; — ELO 1448, #476&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Step3 VL 10B&lt;/strong&gt; — ELO 1446, #481&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Tri-21B-Think&lt;/strong&gt; — ELO 1441, #490&lt;/li&gt;&lt;li&gt;&lt;strong&gt;NVIDIA Nemotron 3 Nano 4B&lt;/strong&gt; — ELO 1436, #497&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 2.0 Flash-Lite (Feb '25)&lt;/strong&gt; — ELO 1434, #498&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Granite 4.1 30B&lt;/strong&gt; — ELO 1431, #502&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Llama 3.1 Tulu3 405B&lt;/strong&gt; — ELO 1425, #510&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemma 4 E2B (Reasoning)&lt;/strong&gt; — ELO 1405, #531&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemma 4 E4B (Non-reasoning)&lt;/strong&gt; — ELO 1405, #532&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Solar Pro 2 (Non-reasoning)&lt;/strong&gt; — ELO 1403, #537&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemini 1.5 Flash-8B&lt;/strong&gt; — ELO 1400, #540&lt;/li&gt;&lt;li&gt;&lt;strong&gt;QwQ 32B-Preview&lt;/strong&gt; — ELO 1386, #556&lt;/li&gt;&lt;li&gt;&lt;strong&gt;EXAONE 4.0 32B (Non-reasoning)&lt;/strong&gt; — ELO 1382, #559&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Llama 3.3 Nemotron Super 49B v1 (Non-reasoning)&lt;/strong&gt; — ELO 1365, #575&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemma 4 E2B (Non-reasoning)&lt;/strong&gt; — ELO 1355, #585&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Llama 3.1 Nemotron Nano 4B v1.1 (Reasoning)&lt;/strong&gt; — ELO 1348, #593&lt;/li&gt;&lt;li&gt;&lt;strong&gt;DeepHermes 3 - Mistral 24B Preview (Non-reasoning)&lt;/strong&gt; — ELO 1343, #595&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Qwen2.5 Coder Instruct 7B&lt;/strong&gt; — ELO 1312, #622&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Ling-mini-2.0&lt;/strong&gt; — ELO 1305, #628&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemma 3n E4B Instruct Preview (May '25)&lt;/strong&gt; — ELO 1304, #629&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Granite 4.1 3B&lt;/strong&gt; — ELO 1302, #632&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Jamba Reasoning 3B&lt;/strong&gt; — ELO 1275, #642&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LFM 40B&lt;/strong&gt; — ELO 1265, #650&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Exaone 4.0 1.2B (Reasoning)&lt;/strong&gt; — ELO 1260, #655&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Llama 2 Chat 13B&lt;/strong&gt; — ELO 1242, #666&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Exaone 4.0 1.2B (Non-reasoning)&lt;/strong&gt; — ELO 1234, #671&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Granite 4.0 H 1B&lt;/strong&gt; — ELO 1206, #688&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Molmo 7B-D&lt;/strong&gt; — ELO 1204, #689&lt;/li&gt;&lt;li&gt;&lt;strong&gt;DeepHermes 3 - Llama-3.1 8B Preview (Non-reasoning)&lt;/strong&gt; — ELO 1198, #693&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Granite 4.0 1B&lt;/strong&gt; — ELO 1197, #694&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OLMo 2 32B&lt;/strong&gt; — ELO 1142, #726&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Phi-3 Mini Instruct 3.8B&lt;/strong&gt; — ELO 1126, #732&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Granite 4.0 350M&lt;/strong&gt; — ELO 1103, #738&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Gemma 3 270M&lt;/strong&gt; — ELO 1088, #744&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Granite 4.0 H 350M&lt;/strong&gt; — ELO 1077, #748&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OLMo 2 7B&lt;/strong&gt; — ELO 1072, #752&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;Top-10 New Scores (1)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;GPT-5.5 Pro&lt;/strong&gt; on VoxelBench: 2122.0 (#1)&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;New #1 Leaders (1)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;VoxelBench&lt;/strong&gt;: GPT-5.5 Pro (2122.0) beat GPT-5.5 (xHigh) by 100.0&lt;/li&gt;&lt;/ul&gt;</content><summary>AI Benchmark Digest — 2026-05-04

=== DAILY ===
NEW MODELS (62)
  - Doubao Seed Code — ELO 1645, #209/778 (above: Qwen 3 235B A22B 2507 (Reasoning), below: K-EXAONE (Reasoning))
  - K-EXAONE (Reasoning) — ELO 1645, #210/778 (above: Doubao Seed Code, below: O4 Mini (High))
  - Gemini 2.5 Flash Previe</summary></entry><entry><title>AI Benchmark Digest — 2026-05-03</title><id>https://aibenchmarks.dev/digest/2026-05-03</id><updated>2026-05-03T07:35:30.739888+00:00</updated><link href="https://aibenchmarks.dev/#/digest" /><author><name>AI Benchmark Hub</name></author><content type="html">&lt;h2&gt;Daily&lt;/h2&gt;
&lt;h3&gt;New Benchmarks (9)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Open-R1 Eval Leaderboard&lt;/strong&gt; (Average Accuracy (%)): leader Qwen3-32B (73.74), 37 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;SciEvalKit&lt;/strong&gt; (Scientific Capability Score): leader Gemini-3-Pro (48.74), 10 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLM Benchmarker Suite&lt;/strong&gt; (Average Score (%)): leader LLaMA-2 (70B) (62.53), 8 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;FastEval&lt;/strong&gt; (Total Score): leader GPT-4-0613 (77.78), 33 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LMArena Preference Proxy&lt;/strong&gt; (Evaluator Accuracy (%)): leader gemma-2-9b-it (64.63), 4 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;SeaEval&lt;/strong&gt; (Average Score (%)): leader GPT4o_0513 (72.86), 30 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LLMZSZL Leaderboard&lt;/strong&gt; (Score): leader Qwen2.5-72B-Instruct (69.06), 99 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Swahili LLM Leaderboard&lt;/strong&gt; (Average Score (%)): leader Swahili Gemma (61.32), 5 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;MMLU-by-task Leaderboard&lt;/strong&gt; (MMLU Average (%)): leader FashionGPT-70B-V1.1 (70.99), 1257 models&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;New #1 Leaders (5)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Spider 2.0-DBT&lt;/strong&gt;: Databao Agent (58.82) beat SignalPilot Agent by 7.26&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Design Arena (Video Editing)&lt;/strong&gt;: happy-horse-1.0 (1329.0) beat wan-v2.7-v2v by 7.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Chess Puzzles (Epoch AI)&lt;/strong&gt;: gpt-5.5-pro-pre-release_xhigh (64.0) beat gpt-5.4-pro-2026-03-05_xhigh by 5.4&lt;/li&gt;&lt;li&gt;&lt;strong&gt;WeirdML&lt;/strong&gt;: gpt-5.5 (high) (83.9) beat gpt-5.3-codex (xhigh) by 4.6&lt;/li&gt;&lt;li&gt;&lt;strong&gt;BridgeBench Hallucination&lt;/strong&gt;: Grok 4.3 (79.8) beat Gemini 3.1 Pro by 0.7&lt;/li&gt;&lt;/ul&gt;</content><summary>AI Benchmark Digest — 2026-05-03

=== DAILY ===
NEW BENCHMARKS (9)
  - Open-R1 Eval Leaderboard (Average Accuracy (%)): leader Qwen3-32B (73.74), 37 models
  - SciEvalKit (Scientific Capability Score): leader Gemini-3-Pro (48.74), 10 models
  - LLM Benchmarker Suite (Average Score (%)): leader LLaMA</summary></entry><entry><title>AI Benchmark Digest — 2026-05-01</title><id>https://aibenchmarks.dev/digest/2026-05-01</id><updated>2026-05-01T07:44:31.560358+00:00</updated><link href="https://aibenchmarks.dev/#/digest" /><author><name>AI Benchmark Hub</name></author><content type="html">&lt;h2&gt;Daily&lt;/h2&gt;
&lt;h3&gt;New Benchmarks (56)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;LIBRA - Passkey&lt;/strong&gt; (Dataset Total Score (%)): leader GLM-4 9B Chat (100.0), 17 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LIBRA - MatreshkaYesNo&lt;/strong&gt; (Dataset Total Score (%)): leader GPT-4o (80.0), 17 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LIBRA - MatreshkaNames&lt;/strong&gt; (Dataset Total Score (%)): leader GPT-4o (51.67), 17 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LIBRA - PasskeyWithLibrusec&lt;/strong&gt; (Dataset Total Score (%)): leader GLM-4 9B Chat (100.0), 17 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LIBRA - LibrusecHistory&lt;/strong&gt; (Dataset Total Score (%)): leader GPT-4o (97.5), 17 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LIBRA - ruGSM100&lt;/strong&gt; (Dataset Total Score (%)): leader GPT-4o (100.0), 17 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LIBRA - ruSciPassageCount&lt;/strong&gt; (Dataset Total Score (%)): leader GPT-4o (35.0), 17 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LIBRA - ru2WikiMultihopQA&lt;/strong&gt; (Dataset Total Score (%)): leader GPT-4o (76.67), 17 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LIBRA - LongContextMultiQ&lt;/strong&gt; (Dataset Total Score (%)): leader GPT-4o (36.67), 17 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LIBRA - ruSciAbstractRetrieval&lt;/strong&gt; (Dataset Total Score (%)): leader GLM-4 9B Chat (77.81), 17 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LIBRA - ruTREC&lt;/strong&gt; (Dataset Total Score (%)): leader GPT-4o (75.0), 17 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LIBRA - ruSciFi&lt;/strong&gt; (Dataset Total Score (%)): leader GPT-4o (75.0), 17 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LIBRA - LibrusecMHQA&lt;/strong&gt; (Dataset Total Score (%)): leader GPT-4o (50.0), 17 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LIBRA - ruBABILongQA1&lt;/strong&gt; (Dataset Total Score (%)): leader GPT-4o (78.33), 17 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LIBRA - ruBABILongQA2&lt;/strong&gt; (Dataset Total Score (%)): leader GPT-4o (36.67), 17 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LIBRA - ruBABILongQA3&lt;/strong&gt; (Dataset Total Score (%)): leader Llama 3.1 8B (29.65), 17 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LIBRA - ruBABILongQA4&lt;/strong&gt; (Dataset Total Score (%)): leader GPT-4o (78.95), 17 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LIBRA - ruBABILongQA5&lt;/strong&gt; (Dataset Total Score (%)): leader GPT-4o (90.0), 17 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LIBRA - ruQuALITY&lt;/strong&gt; (Dataset Total Score (%)): leader GPT-4o (83.33), 17 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LIBRA - ruTPO&lt;/strong&gt; (Dataset Total Score (%)): leader GPT-4o (100.0), 17 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;LIBRA - ruQasper&lt;/strong&gt; (Dataset Total Score (%)): leader GPT-4o (31.72), 17 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Wolfram LLM Benchmarking Project&lt;/strong&gt; (Correct Functionality (%)): leader Claude Opus 4.7 thinking on (72.5), 443 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;MathArena - Project Euler 943-970&lt;/strong&gt; (Accuracy (%, direct Project Euler problems 943-970)): leader GPT-5.4 (xhigh) (87.5), 17 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;MathArena - Project Euler 971-984&lt;/strong&gt; (Accuracy (%, direct Project Euler problems 971-984)): leader Claude-Opus-4.6 (High) (92.86), 10 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;MathArena - Project Euler 985-988&lt;/strong&gt; (Accuracy (%, direct Project Euler problems 985-988)): leader Gemini 3.1 Pro Preview (100.0), 5 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenVLM OCRBench&lt;/strong&gt; (Score (normalized)): leader JT-VL-Chat-V3.0 (95.0), 285 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Vals AI Vibe Code Bench&lt;/strong&gt; (Accuracy (%)): leader claude-opus-4-7 (71.0), 41 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;EuroEval Albanian&lt;/strong&gt; (Average Score (%)): leader gemini-3.1-pro-preview (65.43), 209 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;EuroEval Bosnian&lt;/strong&gt; (Average Score (%)): leader gpt-4.1-mini-2025-04-14 (63.93), 218 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;EuroEval Bulgarian&lt;/strong&gt; (Average Score (%)): leader gemini-3-pro-preview (74.47), 219 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;EuroEval Catalan&lt;/strong&gt; (Average Score (%)): leader gemini-2.5-flash#thinking (68.12), 219 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;EuroEval Croatian&lt;/strong&gt; (Average Score (%)): leader gemini-3-pro-preview (69.99), 218 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;EuroEval Czech&lt;/strong&gt; (Average Score (%)): leader gemini-2.5-pro (70.02), 236 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;EuroEval Danish&lt;/strong&gt; (Average Score (%)): leader gpt-5-2025-08-07#high (78.81), 454 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;EuroEval Dutch&lt;/strong&gt; (Average Score (%)): leader Llama-3.1-405B (78.43), 350 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;EuroEval Estonian&lt;/strong&gt; (Average Score (%)): leader gemini-2.5-pro (62.38), 258 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;EuroEval Faroese&lt;/strong&gt; (Average Score (%)): leader gemini-3-pro-preview (70.72), 391 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;EuroEval Finnish&lt;/strong&gt; (Average Score (%)): leader gpt-5-2025-08-07#high (72.92), 382 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;EuroEval French&lt;/strong&gt; (Average Score (%)): leader gemini-3-pro-preview (74.38), 383 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;EuroEval German&lt;/strong&gt; (Average Score (%)): leader Qwen3-235B-A22B-Thinking-2507-FP8 (68.41), 329 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;EuroEval Greek&lt;/strong&gt; (Average Score (%)): leader gpt-5-2025-08-07 (72.28), 209 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;EuroEval Hungarian&lt;/strong&gt; (Average Score (%)): leader gemini-2.5-pro (67.51), 208 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;EuroEval Icelandic&lt;/strong&gt; (Average Score (%)): leader gpt-5-2025-08-07 (70.59), 399 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;EuroEval Italian&lt;/strong&gt; (Average Score (%)): leader Qwen3-235B-A22B-Thinking-2507-FP8 (73.12), 435 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;EuroEval Latvian&lt;/strong&gt; (Average Score (%)): leader gpt-5-2025-08-07 (70.85), 238 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;EuroEval Lithuanian&lt;/strong&gt; (Average Score (%)): leader Qwen3-235B-A22B-Thinking-2507-FP8 (66.49), 235 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;EuroEval Norwegian&lt;/strong&gt; (Average Score (%)): leader gpt-5-2025-08-07 (76.81), 466 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;EuroEval Polish&lt;/strong&gt; (Average Score (%)): leader gpt-5-2025-08-07 (71.84), 241 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;EuroEval Portuguese&lt;/strong&gt; (Average Score (%)): leader Qwen3-235B-A22B-Thinking-2507-FP8 (73.86), 445 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;EuroEval Romanian&lt;/strong&gt; (Average Score (%)): leader gpt-5-2025-08-07 (72.03), 212 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;EuroEval Serbian&lt;/strong&gt; (Average Score (%)): leader gpt-5-2025-08-07 (72.24), 209 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;EuroEval Slovak&lt;/strong&gt; (Average Score (%)): leader gemini-3-pro-preview (68.36), 208 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;EuroEval Slovene&lt;/strong&gt; (Average Score (%)): leader claude-sonnet-4-5-20250929#thinking (67.68), 208 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;EuroEval Spanish&lt;/strong&gt; (Average Score (%)): leader Qwen3-235B-A22B-Thinking-2507-FP8 (68.78), 419 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;EuroEval Swedish&lt;/strong&gt; (Average Score (%)): leader gpt-5-2025-08-07#high (78.64), 410 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;EuroEval Ukrainian&lt;/strong&gt; (Average Score (%)): leader gpt-5-2025-08-07 (67.31), 205 models&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;New Models (3)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Grok 4.3&lt;/strong&gt; — ELO 1826, #104&lt;ul&gt;&lt;li&gt;AA IFBench: 81.3 (#2/409)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Mistral Medium 3.5&lt;/strong&gt; — ELO 1749, #189&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Hy3-preview (Non-reasoning)&lt;/strong&gt; — ELO 1711, #233&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;New #1 Leaders (5)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Design Arena (Data Viz)&lt;/strong&gt;: mimo-v2.5-pro (1375.0) beat claude-sonnet-4-6 by 29.0&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Vals AI CaseLaw v2&lt;/strong&gt;: grok-4.3 (79.31) beat gpt-5.1-2025-11-13 by 5.89&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Vals AI Terminal-Bench 2.0&lt;/strong&gt;: gpt-5.5 (73.2) beat claude-opus-4-7 by 4.66&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenClawProBench&lt;/strong&gt;: qwen3.5-397b-a17b (70.4) beat qwen3.5-plus by 0.3&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Vals AI CorpFin v2&lt;/strong&gt;: grok-4.3 (68.53) beat gpt-5.5 by 0.11&lt;/li&gt;&lt;/ul&gt;</content><summary>AI Benchmark Digest — 2026-05-01

=== DAILY ===
NEW BENCHMARKS (56)
  - LIBRA - Passkey (Dataset Total Score (%)): leader GLM-4 9B Chat (100.0), 17 models
  - LIBRA - MatreshkaYesNo (Dataset Total Score (%)): leader GPT-4o (80.0), 17 models
  - LIBRA - MatreshkaNames (Dataset Total Score (%)): leade</summary></entry><entry><title>AI Benchmark Digest — 2026-04-30</title><id>https://aibenchmarks.dev/digest/2026-04-30</id><updated>2026-04-30T07:34:47.603527+00:00</updated><link href="https://aibenchmarks.dev/#/digest" /><author><name>AI Benchmark Hub</name></author><content type="html">&lt;h2&gt;Daily&lt;/h2&gt;
&lt;h3&gt;New Models (9)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;kimi-k2.6_nitro&lt;/strong&gt; — ELO 1888, #57&lt;ul&gt;&lt;li&gt;GACL - WordMatrix: 66.72 (#2/21)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Kimi K2.6 (Non-reasoning)&lt;/strong&gt; — ELO 1808, #106&lt;/li&gt;&lt;li&gt;&lt;strong&gt;deepseek-v4-flash_nitro&lt;/strong&gt; — ELO 1785, #126&lt;/li&gt;&lt;li&gt;&lt;strong&gt;DeepSeek V4 Pro (Non-reasoning)&lt;/strong&gt; — ELO 1763, #148&lt;/li&gt;&lt;li&gt;&lt;strong&gt;DeepSeek V4 Flash (Non-reasoning)&lt;/strong&gt; — ELO 1733, #187&lt;/li&gt;&lt;li&gt;&lt;strong&gt;MiMo-V2.5-Pro (Non-reasoning)&lt;/strong&gt; — ELO 1729, #190&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Granite 4.1 30B&lt;/strong&gt; — ELO 1502, #485&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Granite 4.1 8B&lt;/strong&gt; — ELO 1445, #600&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Granite 4.1 3B&lt;/strong&gt; — ELO 1372, #742&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;New #1 Leaders (1)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;GACL - Tic-Tac-Toe&lt;/strong&gt;: claude-sonnet-4.6 (83.6) beat claude-opus-4.6 by 20.46&lt;/li&gt;&lt;/ul&gt;</content><summary>AI Benchmark Digest — 2026-04-30

=== DAILY ===
NEW MODELS (9)
  - kimi-k2.6_nitro — ELO 1888, #57/1066 (above: Grok 4.20 0309 (Reasoning), below: Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort))
      GACL - WordMatrix: 66.72 (#2/21)
  - Kimi K2.6 (Non-reasoning) — ELO 1808, #106/1066 (above: GP</summary></entry><entry><title>AI Benchmark Digest — 2026-04-29</title><id>https://aibenchmarks.dev/digest/2026-04-29</id><updated>2026-04-29T07:19:43.831412+00:00</updated><link href="https://aibenchmarks.dev/#/digest" /><author><name>AI Benchmark Hub</name></author><content type="html">&lt;h2&gt;Daily&lt;/h2&gt;
&lt;h3&gt;New Benchmarks (29)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;OpenVLM MME&lt;/strong&gt; (Overall Score): leader InternVL3-78B (2538.6), 235 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenVLM ScienceQA Test&lt;/strong&gt; (Accuracy (%)): leader InternVL2.5-78B-MPO (99.5), 218 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenVLM POPE&lt;/strong&gt; (Overall (%)): leader InternVL2.5-26B-MPO (90.5), 216 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenVLM SEED-Bench 2 Plus&lt;/strong&gt; (Accuracy (%)): leader Qwen2.5-VL-72B (73.8), 211 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenVLM COCO Captions&lt;/strong&gt; (CIDEr): leader Emu2_chat (109.2), 211 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenVLM MMT-Bench&lt;/strong&gt; (Accuracy (%)): leader InternVL3-78B (72.6), 207 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenVLM A-Bench&lt;/strong&gt; (Accuracy (%)): leader Qwen2.5-VL-72B (81.0), 160 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenVLM MTVQA&lt;/strong&gt; (Accuracy (%)): leader GPT-4.1-mini-20250414 (36.8), 157 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenVLM OCR-VQA&lt;/strong&gt; (Accuracy (%)): leader Kimi-VL-A3B-Instruct (82.0), 118 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenVLM SEED-Bench 2&lt;/strong&gt; (Accuracy (%)): leader GPT-4.1-20250414 (76.0), 59 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;OpenVLM VCR&lt;/strong&gt; (Overall Jaccard (%)): leader Qwen2-VL-7B (75.6), 48 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;CLEM Clemscore&lt;/strong&gt; (Clemscore (%)): leader claude-sonnet-4-5-azure-high (90.1), 31 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;CLEM AdventureGame&lt;/strong&gt; (Game Clemscore (%)): leader gpt-5.2-azure-high (99.17), 31 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;CLEM Clean Up&lt;/strong&gt; (Game Clemscore (%)): leader gpt-5.2-azure-high (100.0), 31 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;CLEM Codenames&lt;/strong&gt; (Game Clemscore (%)): leader gpt-5.2-azure-high (87.69), 31 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;CLEM Deal or No Deal&lt;/strong&gt; (Game Clemscore (%)): leader gpt-5.2-azure-high (99.12), 31 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;CLEM GuessWhat&lt;/strong&gt; (Game Clemscore (%)): leader gpt-5.2-azure-high (93.33), 31 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;CLEM Hot Air Balloon&lt;/strong&gt; (Game Clemscore (%)): leader claude-sonnet-4-5-azure-high (95.53), 31 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;CLEM ImageGame&lt;/strong&gt; (Game Clemscore (%)): leader gpt-5.2-2025-12-11 (99.92), 31 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;CLEM MatchIt ASCII&lt;/strong&gt; (Game Clemscore (%)): leader claude-sonnet-4-5-20250929 (100.0), 31 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;CLEM PrivateShared&lt;/strong&gt; (Game Clemscore (%)): leader claude-sonnet-4-5-20250929 (98.7), 31 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;CLEM ReferenceGame&lt;/strong&gt; (Game Clemscore (%)): leader claude-sonnet-4-5-20250929 (100.0), 31 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;CLEM Taboo&lt;/strong&gt; (Game Clemscore (%)): leader claude-sonnet-4-5-azure-low (98.33), 31 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;CLEM TextMapWorld&lt;/strong&gt; (Game Clemscore (%)): leader gemini-3-flash (91.35), 31 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;CLEM TextMapWorld GraphReasoning&lt;/strong&gt; (Game Clemscore (%)): leader claude-sonnet-4-5-azure-low (86.34), 31 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;CLEM TextMapWorld SpecificRoom&lt;/strong&gt; (Game Clemscore (%)): leader Llama-3.1-70B-Instruct (100.0), 31 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;CLEM Wordle&lt;/strong&gt; (Game Clemscore (%)): leader kimi-k2-thinking (73.0), 31 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;CLEM Wordle with Clue&lt;/strong&gt; (Game Clemscore (%)): leader claude-sonnet-4-5-azure-high (82.5), 31 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;CLEM Wordle with Critic&lt;/strong&gt; (Game Clemscore (%)): leader claude-sonnet-4-5-azure-high (86.11), 31 models&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;New Models (1)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;JSL-MedMNX-7B-SFT&lt;/strong&gt; — ELO 1309, #841&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;New #1 Leaders (1)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Epoch AI - ECI&lt;/strong&gt;: GPT-5.5 Pro (xhigh) (158.67) beat GPT-5.4 Pro (xhigh) by 0.38&lt;/li&gt;&lt;/ul&gt;</content><summary>AI Benchmark Digest — 2026-04-29

=== DAILY ===
NEW BENCHMARKS (29)
  - OpenVLM MME (Overall Score): leader InternVL3-78B (2538.6), 235 models
  - OpenVLM ScienceQA Test (Accuracy (%)): leader InternVL2.5-78B-MPO (99.5), 218 models
  - OpenVLM POPE (Overall (%)): leader InternVL2.5-26B-MPO (90.5), 2</summary></entry><entry><title>AI Benchmark Digest — 2026-04-28</title><id>https://aibenchmarks.dev/digest/2026-04-28</id><updated>2026-04-28T07:42:16.928716+00:00</updated><link href="https://aibenchmarks.dev/#/digest" /><author><name>AI Benchmark Hub</name></author><content type="html">&lt;h2&gt;Daily&lt;/h2&gt;
&lt;h3&gt;New Benchmarks (2)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;PredictionArena (Polymarket)&lt;/strong&gt; (Account Value ($)): leader claude-opus-4-6 (77298.59), 10 models&lt;/li&gt;&lt;li&gt;&lt;strong&gt;PredictionArena (Kalshi)&lt;/strong&gt; (Account Value ($)): leader gemini-3.1-pro-preview (15363.0), 10 models&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;New Models (4)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Hy3-preview (Reasoning)&lt;/strong&gt; — ELO 1839, #83&lt;/li&gt;&lt;li&gt;&lt;strong&gt;EXAONE 4.5 33B&lt;/strong&gt; — ELO 1698, #224&lt;/li&gt;&lt;li&gt;&lt;strong&gt;llama3-slerp-med&lt;/strong&gt; — ELO 1338, #770&lt;/li&gt;&lt;li&gt;&lt;strong&gt;BioMistralMerged&lt;/strong&gt; — ELO 1262, #921&lt;/li&gt;&lt;/ul&gt;
&lt;h3&gt;New #1 Leaders (4)&lt;/h3&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;MineBench&lt;/strong&gt;: GPT 5.5 Pro (2080.73) beat GPT 5.4 Pro by 364.29&lt;/li&gt;&lt;li&gt;&lt;strong&gt;GSO-Bench&lt;/strong&gt;: Claude Opus 4.7 (44.12) beat Claude-4.6-Opus by 10.79&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Epoch AI - Apex Agents&lt;/strong&gt;: gpt-5.5_xhigh (38.4) beat gpt-5.4-2026-03-05_xhigh by 2.5&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Design Arena (Game Dev)&lt;/strong&gt;: gpt-5.5 (1360.0) beat claude-opus-4-7 by 2.0&lt;/li&gt;&lt;/ul&gt;</content><summary>AI Benchmark Digest — 2026-04-28

=== DAILY ===
NEW BENCHMARKS (2)
  - PredictionArena (Polymarket) (Account Value ($)): leader claude-opus-4-6 (77298.59), 10 models
  - PredictionArena (Kalshi) (Account Value ($)): leader gemini-3.1-pro-preview (15363.0), 10 models

NEW MODELS (4)
  - Hy3-preview (</summary></entry></feed>