AI Benchmark Digest

AI Benchmark Digest — 2026-07-04

2026-07-04T09:35:58.720045+00:00

Daily

New Benchmarks (292)

HarmVideoBench (Macro Avg. (self-reported)): leader HarmVideoBench (ours) (84.4), 21 models
Large vision-language models (LVLMs) have recently shown immense potential in automated content moderation, sparking growing interest in developing harmful-video benchmarks. However, we identify two primary limitations in existing works: 1) The multi-layered characteristics of harmful videos are overlooked. Existing benchmarks predominantly formulate evaluation as a binary classification task, fai
LibEvoBench (SEUS (self-reported)): leader GPT-5.4 (86.0), 13 models
Large software projects often depend on older versions of libraries, even as APIs continue to evolve across releases. This creates a challenge for LLMs: they must maintain knowledge of multiple API versions, not merely the latest or most common one. However, current LLMs are trained on temporally mixed corpora and lack explicit mechanisms for such version-specific reasoning, leading to anachronist
Age of LLM (Points per match (self-reported)): leader GPT-5.5 (3.0), 15 models
We introduce Age of LLM, a turn-based 1v1 benchmark in which two LLMs face off on a 13x7 grid to destroy the enemy base. Three stressors are deliberate: fog of war, full diplomacy (messages, ceasefires, ultimatums; uranium kept secret), and a reliability dimension where every turn must follow a strict JSON schema and an illegal action is silently discarded. The engine is private and each match use
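The reliability rule described above (every turn must match a strict JSON schema, and an illegal action is silently discarded) could be sketched roughly as follows. The engine is private, so the field names `action`, `x`, `y` and the action set are hypothetical stand-ins; only the 13x7 grid size comes from the abstract.

```python
import json

GRID_WIDTH, GRID_HEIGHT = 13, 7               # grid size stated in the abstract
ALLOWED_ACTIONS = {"move", "attack", "pass"}  # hypothetical action names


def parse_turn(raw):
    """Return a validated action dict, or None if the turn is illegal."""
    try:
        turn = json.loads(raw)
    except json.JSONDecodeError:
        return None  # not valid JSON: discarded
    # Hypothetical schema: exactly these three keys, nothing more or less.
    if not isinstance(turn, dict) or set(turn) != {"action", "x", "y"}:
        return None
    action, x, y = turn["action"], turn["x"], turn["y"]
    if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
        return None
    # bool is a subclass of int in Python, so reject it explicitly.
    if any(isinstance(v, bool) or not isinstance(v, int) for v in (x, y)):
        return None
    if not (0 <= x < GRID_WIDTH and 0 <= y < GRID_HEIGHT):
        return None  # coordinates off the board: discarded
    return turn


def apply_turns(raw_turns):
    """Keep legal turns; illegal ones are dropped without raising or logging."""
    return [t for t in map(parse_turn, raw_turns) if t is not None]
```

Because nothing is reported back for a rejected turn, a model that emits malformed output simply loses the move, which is what makes schema adherence a scored dimension rather than a formatting detail.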
AGORA (Overall (self-reported)): leader Gemini 3.1 Pro Preview (59.39), 8 models
Large language models are increasingly deployed as agents that reason over documents rather than answer from parametric knowledge. We study archive-grounded reasoning: locating sparse evidence across a large, messy collection of workplace files, reconciling inconsistent terminology, units, and time conventions, and computing an answer. Existing benchmarks address only parts of this setting and non
BehaviorBench (Game Behav. Sim. (W) (self-reported)): leader GPT-5.4 (31.4), 20 models
Foundation models have been increasingly applied to behavioral science domains such as psychology, sociology, and economics. While these models show promise in individual tasks such as survey response prediction and human-subject experiment simulation, there remains no systematic understanding of how well they perform across diverse behavioral science tasks, contexts, and populations. We introduce
MedBench v5 (CCR-Agent (self-reported)): leader Claude Opus 4.7 (96.66), 10 models
Existing medical AI benchmarks lack process visibility, atomic skill evaluation, and integrated hallucination detection. We introduce MedBench v5, a redesigned benchmark for clinical multimodal models (language, vision-language, and agent systems) that moves from static QA to dynamic, process-oriented evaluation. MedBench v5 features: (1) a dual-dimensional framework combining Clinical Cognitive R
Qwen-AgentWorld (Avg. (self-reported)): leader Qwen-AgentWorld-397B-A17B (58.71), 18 models
A world model predicts environment dynamics based on current observations and actions, serving as a core cognitive mechanism for reasoning and planning. In this work, we investigate how world modeling based on language models can further push the boundaries of general agents. (i) We first focus on building foundation models for agentic environment simulation. We introduce Qwen-AgentWorld-35B-A3B a
AgentCIBench (Leakage (self-reported)): leader Gemini 3.1 Pro Preview (98.3), 15 models
Computer-use agents (CUAs) now act on a user's behalf across personal applications such as email, calendars, and to-do lists. This cross-application access is useful, but it also creates a privacy risk that has been largely overlooked: when an agent works in one context, it can pull in information from another that is inappropriate in that context. Hence, we introduce AgentCIBench, an evaluation h
GUI vs. CLI (Avg. (self-reported)): leader GPT-5.4 (59.1), 9 models
Computer-use agents can execute software tasks through either graphical interfaces or programmatic command interfaces, but existing evaluations confound interaction modality with differences in tasks, initial states, verifiers, and permitted actions. We introduce a matched execution-layer benchmark of 440 desktop tasks across 18 applications and 12 workflow categories, where screen-only GUI agents
MuPPET (Multi-Party (self-reported)): leader Qwen3 8B (70.37), 7 models
LLM agents are increasingly deployed in multi-party environments, handling sensitive personal data on behalf of individual users, for instance in group chats. When such an agent discloses private information, it reaches every group member at once. This risk is structurally harder to control than in one-to-one settings, as every piece of private information must be appropriate for every recipient i
BLUEX v2 (Score (self-reported)): leader Gemini 3.1 Pro Preview (9.1), 10 models
Although Large Language Models (LLMs) excel in many tasks, their assessment in Portuguese has received less attention, particularly for open-ended, discursive tasks that demand deeper reasoning and generation capabilities. While the original BLUEX benchmark addressed the scarcity of Portuguese evaluation datasets through multiple-choice questions from Brazilian university entrance exams, it did no
MMGist (Macro (self-reported)): leader Gemini 3.1 Pro Preview (66.8), 27 models
We conduct a systematic study of 18 widely used vision-language benchmarks and identify three major issues: 1) many items do not rely on visual cues and therefore fail to effectively measure multimodal understanding; 2) many items are already close to performance saturation for current LVLMs, which limits their discriminative power; 3) a small number of anomalous items affect the reliability of ev
PlanBench-XL (Accuracy (%) (self-reported)): leader Gemini 3.1 Pro Preview (77.06), 10 models
LLM agents increasingly operate in large tool ecosystems, where real-world tasks require discovering relevant tools, inferring implicit sub-goals, and adapting to dynamic environments over long horizons. However, existing benchmarks rarely evaluate planning under retrieval-limited tool visibility. To address this gap, we introduce PlanBench-XL, an interactive benchmark of 327 retail tasks over 1,6
Benchmarking Large Language Models for Graphem (Direct (self-reported)): leader Qwen3 8B (100.15), 37 models
Grapheme-to-phoneme (G2P) conversion is essential for controllable and robust text-to-speech, and large language models (LLMs), with broad linguistic knowledge, offer a promising approach. We benchmarked over 30 LLMs on Japanese G2P, comparing them with conventional morphological analyzers on 3000 manually annotated sentences. We evaluated two prompting strategies: a parse mode, where the LLM perf
Inverse Turing Bench (Accuracy (self-reported)): leader GPTZero-W-only (89.41), 17 models
As AI systems integrate into online spaces, differentiating them from humans in conversations is increasingly important. We present Inverse Turing Bench, a benchmark that evaluates LLMs and other models on their ability to differentiate humans and AI in multi-turn text. The benchmark provides a collection of paired dialogue transcripts, wherein one dialogue is between two humans and the other is b
CheXpercept (Stage 1 (End-to-End) (self-reported)): leader Qwen3.6 27B (92.2), 14 models
The evaluation of vision-language models (VLMs) for chest X-ray (CXR) analysis has largely been limited to disease-presence classification without visual grounding. Such evaluations fail to verify the expert-level lesion perception necessary to ensure the clinical reliability of VLMs. To address these limitations, we introduce CheXpercept, a sequential, multi-level perception benchmark that mirror
CulMind (S (self-reported)): leader Gemini 3 Flash Preview (50.7), 14 models
Evaluating Multimodal Large Language Models (MLLMs) in Chinese Cultural Heritage (CCH) requires fine-grained reasoning over visual, textual, stylistic, and historical clues. However, existing CCH benchmarks mainly emphasize final-answer accuracy, while the accuracy and completeness of reasoning processes remain underexplored. To address this gap, we introduce CulMind and CulMind-R: a high-quality
MedLayXPlain (S (self-reported)): leader Gemini-2.5-Flash +Thinking (70.6), 32 models
Medical Vision-Language Models (Med-VLMs) achieve strong expert-level performance, yet their ability to generate patient-accessible descriptions remains underexplored. With the 21st Century Cures Act now mandating immediate patient access to diagnostic imaging results, evaluating whether Med-VLMs can bridge this Expert-Lay Gap is both urgent and clinically consequential for patient education and s
The Metanym Game (T (self-reported)): leader Claude Opus 4.5 (7.0), 12 models
The metanym game is a competitive word game for LLMs that measures structural intelligence against established cognitive-science constructs. No content is given in advance; the contestants create all of it -- a new kind of analogy test, analogical production falsifiable sentence by sentence, with no fixed test set to leak into training (contamination-resistant by construction). In the council-of-p
Trip+ (Plan Avg. (self-reported)): leader Gemini 3.1 Pro Preview (73.31), 18 models
Interactive travel planning has become a popular use case for language models. Agents are deployed to manage evolving preferences and unexpected disruptions over multiple turns. Such settings require models to make complex, profile-conditioned planning decisions. However, existing benchmarks often evaluate feasibility, personalization, or interaction in relatively isolated settings. We therefore i
BIM-Edit (Final (self-reported)): leader Gemini 3.0 Flash (49.48), 7 models
Large language models (LLMs) are increasingly applied to computer-aided design (CAD) to generate design artifacts from textual instructions. In engineering practice, this requires more than creating new geometry: models must also understand existing scenes, edit them correctly, and preserve semantics and relations. However, many CAD benchmarks focus on creating new models rather than editing exist
CombEval (Avg. (self-reported)): leader GPT-5.5 (93.6), 11 models
We present CombEval, a dynamic benchmark for evaluating combinatorial counting in large language models. CombEval represents each problem as a typed Cofola specification over entities, combinatorial objects, object dependencies, and constraints, enabling controlled generation of natural-language counting problems with exact solver-verified answers. Unlike static collections, CombEval supports syst
JamSet/JamBench (Task 1a SCS (self-reported)): leader GPT-5.4 (46.0), 9 models
Current AI-driven game development has made substantial progress in asset generation, gameplay design, and web-based game coding, yet project-level code engineering on professional game engines remains largely unexplored due to the absence of large-scale datasets and deterministic evaluation methods. We present JamSet and JamBench, the first project-level game code framework dataset and benchmark
ORAgentBench (Pass Rate All (self-reported)): leader GPT-5.4 (35.51), 14 models
Large language models are increasingly deployed as autonomous agents for multi-step tasks in executable environments, yet their ability to perform realistic operations research (OR) work remains unclear. Existing OR evaluations often decouple modeling from solving, rely on pre-formalized or text-only instances, and rarely test the full workflow from operational artifacts to validated decisions. In
ROSE (Avg. (self-reported)): leader Human (98.8), 10 models
Multimodal large language models (MLLMs) are increasingly expected to act on visual information, yet the same scene may require different actions under different task contexts. How reliably can a model turn the same visual evidence into the action required by the current context? To answer this question, we introduce ROSE (Reference-conditioned Oddity and Sy
Are LLMs Ready to Assist Physicians? PhysAssis (en_mrs (self-reported)): leader GLM 5 (69.4), 14 models
The most plausible near-term role of medical LLMs is to assist rather than replace physicians, yet current evaluations often test isolated capabilities: clinical knowledge, EHR system interaction, or patient communication. Physician assistance instead requires coordinating these capabilities within the same interaction, where physicians issue underspecified requests, patients describe symptoms amb
Crosby micro1 RedlineBench (Turn-weighted score (self-reported)): leader GPT-5.5 (50.5), 4 models
Multi-turn contract-redlining benchmark for SaaS MSA negotiations, using document-native redlines, attorney-authored golden responses, and rubric-based turn-level and behavioral evaluation. (Source: benchmarklist.com, self-reported.)
LaViSA (PT-Acc (self-reported)): leader Gemini 3.1 Pro Preview (88.9), 10 models
Structural ambiguity arises when a single sentence admits multiple valid interpretations due to its syntactic structure, posing a fundamental challenge for language understanding. Visual scenes serve as useful cues for resolving such ambiguity, and Vision and Language Models (VLMs) need to be capable of deriving possible semantic interpretations from visual scenes. We introduce Language and Vision
The Wrong Kind of Right (MAR Disability (self-reported)): leader GPT-5.4 (48.8), 26 models
Warning: This paper studies stereotypes and biases, and contains potentially disturbing examples, used for illustration purposes only. Our findings should not be interpreted as an argument against alignment. Instead, this paper highlights the need for principled approaches to more advanced alignment. Alignment aims to ensure that large language models (LLMs) behave safely and reliably, including b
Agentic Skills Evaluation Framework (Overall Score w/ (self-reported)): leader Claude Opus 4.8 (92.7), 19 models
Agent skills -- structured, reusable knowledge artifacts that augment LLM agent capabilities -- have been rapidly adopted in industry, yet their cross-domain impact and use across commercial and open-source models remain under-studied, and no reusable methodology exists for evaluating an individual skill. In this work, we present an evaluation framework that lets a skill author construct realistic
CEO-Bench (Best run API cost (self-reported)): leader GLM 5.1 (0.0), 10 models
Language model agents are becoming proficient executors at isolated, short-horizon tasks such as software engineering and customer service. Yet real-world challenges require a combination of sophisticated skills that remain largely untested in agents: (1) navigating long horizons amid uncertainty; (2) acquiring information in noisy environments; (3) adapting to a changing world; (4) orchestrating
EComAgentBench (Acc. (self-reported)): leader Opus 4.6 (57.1), 7 models
As LLM-based shopping agents enter production, existing benchmarks fail to capture how a shopper's requirements arrive: stated implicitly in the query, recorded in a profile, or revealed only when the right question is asked. Benchmarks that expose full intent upfront and grade only the final choice can neither pose this long-horizon challenge nor explain which requirement an agent missed. To addr
ReproRepo (Issue Match EM@10 (self-reported)): leader GPT-5.5 (25.4), 4 models
Reproducing research results from papers and released code is central to scientific progress. Existing works have introduced benchmarks to evaluate whether LLM agents can assist with reproducibility, but they are difficult to scale due to their reliance on substantial manual effort for data curation and evaluation. We introduce ReproRepo, a scalable framework for reproducibility evaluation that le
GRACE (Avg Step (self-reported)): leader Gemini 3.1 Pro Preview (81.46), 13 models
Many reasoning tasks require models to reason over input context, from document-grounded question answering to rule-based deduction. Chain-of-Thought (CoT) prompting produces traces that appear transparent, yet individual steps can silently deviate from the source evidence, even when the final answer is correct. Existing methods detect hallucinations at the response level but fail to identify wher
SearchGEO (ASR (%) (self-reported)): leader Gemini 3 Flash Preview (31.4), 13 models
Large language model (LLM)-based search agents synthesize open-web content into actionable recommendations on behalf of users, creating a risk that attacker-published pages are transformed into endorsed claims. We introduce SearchGEO, a controlled evaluation framework for measuring endorsement corruption in LLM-based web-search agents, combining a web-evidence manipulation pipeline, a five-mode at
UXBench (Automated Lift (self-reported)): leader GPT-5.4 (21.6), 8 models
Large language models (LLMs) are increasingly deployed as UX judges that inspect interfaces, diagnose usability problems, and propose repairs. Yet no controlled benchmark measures whether the resulting critiques are reliable and actionable across heterogeneous product surfaces. We introduce UXBench, a benchmark for evaluating LLMs as interaction-grounded UX judges. UXBench comprises local-first ru
MedCTA (Outcome Accuracy (self-reported)): leader GPT-5.4 (31.54), 18 models
MedCTA evaluates medical tool agents on clinician-validated, step-implicit tasks grounded in multimodal clinical inputs, including radiology images, pathology slides, and reports. The benchmark contains 107 real-world clinical tasks with clinician-verified executable trajectories over five deployed tools and measures tool selection, argument validity, execution stability, trajectory fidelity, and
FrontierCode (Main Score (self-reported)): leader Claude Fable 5 x-high (46.3), 16 models
Cognition benchmark for production-quality coding agents measuring whether maintainers would merge model PRs. It uses 150 maintainer-authored open source tasks with nested Extended, Main, and Diamond subsets scored by blockers and quality rubrics.
Dr. DocBench (Overall (self-reported)): leader GPT-5.5 (61.94), 12 models
Document parsing and recognition are fundamental capabilities for vision-language models (VLMs) and document processing systems. However, existing Optical Character Recognition (OCR) and document parsing benchmarks are increasingly limited in coverage and difficulty: many focus on common document genres or uniformly sampled pages where modern parsers already perform strongly, while offering limite
SmartHome-Bench (Overall (self-reported)): leader HomeFlow-RL-8B (87.03), 14 models
Large language model agents are moving beyond text-only interaction toward physical-world control, with smart homes as a representative domain. Real domestic interaction requires understanding ambiguous intents, operating in dynamic environments, and performing multi-turn reasoning. However, existing methods struggle to generate high-quality training data for smart home agents. We propose HomeFlow
TukaBench (ASR, African Languages (self-reported)): leader GPT 3.5* (38.1), 13 models
Safety evaluation of Large Language Models (LLMs) remains heavily English-centric, leaving Low-Resource Languages (LRLs), particularly African ones, critically underexplored. We introduce TUKABENCH, a jailbreak benchmark for seven African languages that extends JailbreakBench (JBB) beyond direct translation through four settings: human translation of JBB prompts, English adaptation to African cont
Sandboxed Coding Agents are Competitive Omni-m (OmniGAIA Avg. (self-reported)): leader GPT-5.4 x-high (75.0), 12 models
As multimodal LLMs increasingly target video and audio, it is often assumed that such tasks require native omnimodal models. We show that this is not always the case: coding agents with only text+image access and a sandboxed tool-use interface can match, and in several settings outperform, SOTA native omnimodal models and predefined multimodal agent scaffolds across multiple audio-video benchmarks
When Safe Skills Collide (Full chain (self-reported)): leader Haiku 4.5 (9.0), 9 models
LLM agents increasingly rely on community-contributed skills that expand an agent's operational capability set. We study a core safety problem in agentic AI systems: whether individually safe skills can compose into unsafe installed skill sets. We present SkillReact, a compositional security measurement framework with three components: a deterministic static-composition benchmark, a two-rater LLM-
BilliardPhys-Bench (Total (Weighted) (self-reported)): leader GPT-5.5 (73.58), 13 models
Current multimodal models handle static image recognition well, but intuitive physical reasoning remains a weakness. Predicting how objects will move and interact from a single image is still difficult for these systems. We present BilliardPhys-Bench, a benchmark for physical reasoning in synthetic billiards environments. Its procedural engine generates randomized scenarios with friction and elast
MineExplorer (Overall TSR (self-reported)): leader Claude Opus 4.6 (41.08), 18 models
Multimodal large language models (MLLMs) have shown strong capabilities in perception, reasoning, and action generation. However, their ability to sustain exploration in dynamic open worlds remains unclear. Existing embodied and game-based benchmarks often compress interaction into short-horizon tasks or entangle success with domain-specific game mechanics. In this paper, we introduce MineExplorer
RealityTest (Text disclosure probability (self-reported)): leader Claude Haiku 4.5 (92.3), 17 models
AI systems are increasingly deployed in conversational settings where users may be uncertain whether they are speaking with a human or an AI. Despite mounting regulatory attention to this known safety risk, existing evaluations of AI disclosure are typically English-only, based on machine-generated questions, and restricted to text. We present RealityTest to comprehensively test whether AI systems
SpatialAct (Succ. Rate (self-reported)): leader Gemini 3.1 Pro Preview (20.6), 7 models
Humans can effortlessly perceive spatial layouts, form cognitive representations, reason about spatial relations, and translate such reasoning into actions in everyday 3D environments. Although recent vision-language models (VLMs) have shown promising performance on observation-conditioned spatial perception and reasoning tasks, it remains unclear whether they can build coherent spatial understand
StemBind (F Overall (self-reported)): leader Qwen3.5 Plus 2026-04-20 (42.2), 24 models
Multimodal large language models (MLLMs) often know the rule but pick the wrong answer: on abstract visual reasoning (AVR) tasks, a model can describe what it sees and name the underlying pattern, yet still fail to choose the matching candidate. Existing AVR benchmarks cannot detect this because they collapse perception, rule induction, and answer selection into a single right-or-wrong signal. We
ActTraitBench (G_KD (Global Knowledge-Decision Gap) (self-reported)): leader MiniMax M2.5 (2.17), 15 models
While Large Language Models (LLMs) can convincingly simulate personas in explicit self-reports, they often deviate in implicit behavioral decisions, revealing a substantial Knowledge-Decision Gap (G_KD). Existing benchmarks struggle to measure this asymmetry due to limited construct validity, multi-dimensional entanglement, and distributional biases in LLM-based evaluation. To address
CardioLens (F1 Score (Random) (self-reported)): leader QoQMed-7B (58.72), 24 models
Multimodal Large Language Models (MLLMs) have shown strong performance on public medical benchmarks, yet existing evaluations often remain weak proxies for clinical use, relying on isolated inputs and simplified recognition-style tasks. We introduce CardioLens, a leakage-resistant evaluation testbed for multi-sequence Cardiovascular Magnetic Resonance (CMR), constructed from private hospital archi
Causal Sensitivity Score (CSS) (CSS (self-reported)): leader Grok 4.20 (47.3), 6 models
Two clinical AI systems can score nearly identically on coverage-based rubrics yet behave radically differently when their patient inputs change: one updates its recommendations to match the new clinical signal, while the other produces the same output regardless. We introduce the Causal Sensitivity Score (CSS), a pre-registered interventional metric that mutates oncology tumor-board cases along f
Cookie-Bench (React Overall (self-reported)): leader Opus 4.7 (83.3), 13 models
Front-end web code has become a core product surface for every frontier LLM release, yet evaluating these interactive applications at development speed remains costly because human-judged leaderboards like Arena do not scale. Existing automated proxies typically lean on reference implementations, test suites, or rigid checklists, and tend to miss the reasoned synthesis a human reviewer performs ov
FinVerBench (Accuracy (self-reported)): leader Claude Sonnet 4 (100.0), 15 models
We introduce FinVerBench, a benchmark and validity study for financial statement verification: determining whether a set of corporate financial statements is numerically consistent from the information shown to the model. FinVerBench is built from SEC 10-K XBRL filings for 43 S&P 500 companies and defines a four-category error taxonomy covering arithmetic, cross-statement linkage, year-over-year,
NICE (Weighted All (self-reported)): leader Gemini 3.1 Pro Preview (78.1), 6 models
As large language models (LLMs) are increasingly applied in social contexts such as emotional companionship and customer service, measuring their social intelligence has become critical to the quality and safety of human-AI interaction. However, existing social intelligence benchmarks lack a framework that organizes social abilities into a unified structure, and therefore cannot enable fin
OmniMatBench (Avg. Score (self-reported)): leader Claude Opus 4.7 (37.2), 13 models
As multimodal language models play an increasingly important role in scientific research, materials science offers a critical testbed due to its interdisciplinary, multimodal, and application-driven nature. However, existing materials benchmarks mainly focus on property prediction, knowledge QA, or characterization understanding, leaving the broader reasoning process from materials knowledge to ap
PassBench (AS Score (self-reported)): leader Eager (100.0), 11 models
Modern tensor compilers such as TorchInductor deliver substantial speedups on mainstream models, yet face a systematic performance ceiling on long-tail workloads -- our profiling shows that 43% of real-world subgraphs experience end-to-end slowdowns under default compilation. While LLMs offer a path toward automated optimization, existing efforts focus on standalone kernel generation. We argue tha
Personalized Turn-Level User Conversation Sati (Micro (self-reported)): leader MoonshotAI: Kimi K2.6 (4.74), 7 models
User satisfaction with AI assistants is highly personalized: the same response may satisfy one user but disappoint another depending on what each user expects and what they have asked for before. Existing automatic evaluation methods mostly measure generic response quality, making it difficult to judge whether a response satisfies a user at a specific turn. We study this problem as personalized tu
ResearchClawBench (Overall (self-reported)): leader Claude Code (Claude-Opus-4.6) (21.5), 18 models
ResearchClawBench evaluates model capability on agentic tasks from the linked upstream source with Average Score as the primary reported metric.
WMW (World Models in Words) (Trace→ans. consistency (self-reported)): leader Claude Opus 4.7 (91.0), 7 models
Vision-language models (VLMs) are increasingly used to answer questions about physical scenes, yet most evaluations reduce performance to a final answer. This hides whether the model perceived the right objects, represented the right physical state, predicted a plausible transition, or merely selected the right option for the wrong reasons. We introduce WMW, an evaluation framework for auditing
ATRBench (TSAcc Default (%) (self-reported)): leader DeepSeek V4 Flash (23.7), 8 models
A long-lived LLM agent, such as OpenClaw, earns its value by acting on a user's preferences and constraints across sessions, not just the current request. Yet today's agents keep what a user volunteers but rarely ask for what stays unspoken, leaving a proactivity gap in long-lived LLM agents: an agent cannot act on a preference it never obtained. As users delegate more of their affairs to agents,
Can Large Language Models Handle Discourse Par (Overall Avg (MS) (self-reported)): leader GPT-5 (72.4), 10 models
Discourse particles, such as "well" and "kind of", are crucial components that enable LLMs to "speak" more like humans. They are used to convey emotions, intentions, and interpersonal meanings. However, existing studies have not yet built a comprehensive understanding of LLMs' capabilities in handling discourse particles. Moreover, the limited number of studies focuses primarily
DisasterBench (Exact-match Accuracy (self-reported)): leader Gemini 3.1 Pro Preview (73.39), 12 models
Disasters cause severe societal impacts, demanding rapid coordination of heterogeneous AI tools, from satellite analysis to flood prediction and damage assessment, into coherent multi-step workflows. As LLMs increasingly serve as orchestrators of such pipelines, effective coordination requires more than selecting semantically plausible tools: LLMs must generate executable workflows with correct pa
Do Agents Know What They Can't Do? Evaluating (Avg. (self-reported)): leader GPT-5.5 (61.2), 9 models
Infeasibility-awareness benchmark: tests whether tool-using agents can detect that a task is impossible under a constrained tool environment instead of burning compute. The reported score is the self-reported average across settings.
From Knowing to Doing (Total ret. (self-reported)): leader Qwen3.6 Plus (85.29), 10 models
Evaluating whether large language model (LLM) agents can profit in capital markets is increasingly framed as end-to-end trading: place an agent in a historical market, let it trade, and measure portfolio returns. This setup is vulnerable to two evaluation failures. First, long backtests often overlap with the knowledge cutoffs of frontier LLMs, allowing memorized tickers, dates, prices, and market
HardMTBench (HardMTBench zh-en GEMBA-DA (self-reported)): leader GPT-5.5 (91.4), 20 models
General-purpose machine translation benchmarks such as FLORES-200 have reached a saturation regime on Chinese-English pairs, where modern large language models cluster within a narrow band of high scores. Across 22 systems, FLORES-200 zh-en GEMBA scores fall in a 7.87-point range with a standard deviation of 2.29, which compresses the separation between systems on knowledge-intensive domains such
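The saturation argument above rests on two spread statistics over per-system scores: the range and the standard deviation. A minimal sketch of that arithmetic follows; the score lists are made up for illustration and are not taken from FLORES-200 or HardMTBench, and the use of the sample standard deviation is an assumption, since the abstract does not say which variant was used.

```python
import statistics


def spread(scores):
    """Return (range, sample standard deviation) for a list of scores."""
    return max(scores) - min(scores), statistics.stdev(scores)


# Made-up scores for illustration only (not real benchmark results).
saturated = [88.0, 89.5, 90.0, 91.0, 92.5]    # systems clustered in a narrow band
separated = [60.0, 70.0, 80.0, 90.0, 100.0]   # systems clearly spread apart

for name, scores in (("saturated", saturated), ("separated", separated)):
    score_range, std_dev = spread(scores)
    print(f"{name}: range={score_range:.2f}, std={std_dev:.2f}")
```

A small range and standard deviation relative to the metric's scale mean the benchmark leaves little room to tell systems apart, which is the motivation given for a harder test set.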
IFMTBench (IF_T (self-reported)): leader Gemini 3.1 Pro Preview (89.08), 15 models
Modern translation workflows demand more than semantic equivalence. Users routinely require models to preserve JSON or HTML schemas, honor curated glossaries, disambiguate with provided context, and match prescribed registers, often several at once. Conventional metrics such as BLEU and xCOMET capture semantic fidelity but provide little signal on constraint adherence, while general instruction fo
MUSE (Final Score (self-reported)): leader GPT-5.5 (67.14), 15 models
Large language models (LLMs) have recently advanced text-driven 3D generation, yet Text-to-CAD remains far from supporting industrial product design. Existing benchmarks focus primarily on generating single-part CAD models and evaluate them using geometric similarity metrics that fail to capture functionality, manufacturability, and assemblability. To address this gap, we introduce MUSE, a Text-to
OR-Space (Build Pass@1 (self-reported)): leader Gemini 3.1 Pro Preview (72.0), 19 models
Large language model (LLM) agents are increasingly used to assist with operations research (OR) modeling, yet existing OR-oriented benchmarks often reduce evaluation to one-shot translation from a self-contained problem statement into a mathematical formulation or solver program. Such settings abstract away two characteristics of real industrial OR workflows: persistent multi-artifact workspaces a
PDP-Bench (Macro-F1 (self-reported)): leader Gemini 3.1 Pro Preview (78.53), 7 models
Legal Judgment Prediction (LJP) has become a core benchmark for evaluating AI in the criminal legal domain, but it only sees criminal cases that have already passed prosecutorial review and been formally indicted. As a result, LJP leaves a substantial blind spot in assessing criminal liability, overlooking cases involving insufficient evidence, no criminal liability, or guilt exempted from punishm
Plant, Persist, Trigger (Avg. (self-reported)): leader Gemini 3 Flash Preview (51.1), 7 models
Large Language Model (LLM) agents remain vulnerable to safety threats from the external environment, where attackers inject adversarial content into external observations such as tool-returned data, webpages, or MCP context, causing harmful agentic behaviors such as unsafe actions or incorrect outputs. Existing studies typically focus on single-interaction attacks, where the agent observes adversa
When Context Flips, Safety Breaks (PacifAIst BSR (self-reported)): leader Llama 3.1 (8B) (77.6), 12 models
Safety benchmark scores provide incomplete evidence of deployment readiness: aligned language models often adhere to rigid rules even when a situational update flips which action is safe. We term this failure brittle safety. To diagnose it, we introduce context-flip evaluation, testing 12 models across a safety benchmark (PacifAIst) and two commonsense controls using paired variants where the nomi
JuICE (F1 Score (self-reported)): leader Gemini 3.1 Pro Preview (56.67), 10 models
As large language models (LLMs) are increasingly deployed to users around the world, they are integrated into everyday tasks across diverse cultural contexts, from drafting personal communications to brainstorming creative ideas. These tasks are inherently cultural: they require contextual appropriateness, symbolic resonance, and tacit cultural expectations that native speakers draw on instinctive
LiveK12Bench (Mathematics Acc (self-reported)): leader Gemini 3 (88.3), 12 models
Advanced Large Multimodal Models (LMMs) have demonstrated impressive performance in K-12 reasoning tasks, exhibiting great promise as intelligent tutors. Realizing this potential requires models to navigate real-world examinations effectively, yet most existing benchmarks fail to capture the complexity of authentic testing environments. Specifically, most datasets are static, prone to data contami
Qiskit QuantumKatas (Best (self-reported)): leader GPT-5.5 (83.1), 16 models
We adapt Microsoft's QuantumKatas, a well-established quantum computing curriculum, from Q# to Qiskit, the most widely adopted quantum computing framework, and package it with an evaluation framework for systematic LLM assessment. The resulting benchmark comprises 350 tasks across 26 categories, spanning fundamental gates through advanced algorithms (Grover's, Simon's, Deutsch-Jozsa), error co
Self-Ensembling Vision-Language Models for Cha (ChartQA (self-reported)): leader TinyChart + Self-ens. (ours) (95.28), 14 models
Charts effectively convey quantitative information, but the underlying data are often locked in image form, hindering reuse and analysis. Manually digitizing charts is time-consuming and error-prone, motivating automatic chart-to-table extraction. Recent approaches use specialized vision-language models (VLMs), yet performance still lags on charts with many datapoints or substantial stylistic vari
Verus-SpecBench (Pass@1 (self-reported)): leader Gemini 2.5 Pro (77.8), 6 models
AI coding agents are increasingly used to write real-world software, but ensuring that their outputs are correct remains a fundamental challenge. Formal verification offers a promising path: an agent generates code together with a machine-checked proof, guaranteeing that the code satisfies a formal specification. However, there is no guarantee that the formal spec itself matches the user's intent.
VitaBench 2.0 (Avg@4 Full Context (self-reported)): leader Claude Opus 4.6 (50.3), 20 models
Large language models (LLMs) have evolved into interactive agents that collaborate with users in real-world tasks. Effective collaboration in such settings increasingly depends on understanding the user beyond what is explicitly stated, as user intent is often reflected in fragmented daily interactions and requires both personalized modeling and proactive interaction. However, existing agent bench
Claw-Anything (Pass@1 (self-reported)): leader GPT-5.5 (34.5), 9 models
Large language model agents are increasingly envisioned as always-on personal assistants with access to anything relevant in the user's digital world. Yet current systems operate over only narrow slices of that world, limiting context-sensitive reasoning and effective assistance. Existing benchmarks similarly provide only partial user state and therefore fail to capture performance in such a broad
DiscoverPhysics (Mean Explanation Score (self-reported)): leader Claude Opus 4.7 (61.0), 11 models
Frontier LLMs now perform strongly across a wide range of physics evaluations, but it is hard to disentangle genuine reasoning from recall of established science. We introduce DiscoverPhysics, an interactive benchmark that asks an LLM agent to discover the laws of motion of a simulated world whose physics deliberately deviates from our own. We construct 22 worlds governed by, among others, screened
QUIET (QUIET Total (self-reported)): leader Gemini 3.1 Pro Preview (8.69), 12 models
Large language models (LLMs) face a dual challenge in creative capability evaluation: existing benchmarks (e.g., Story Cloze Test, HellaSwag) measure models' discriminative ability over narrative continuation using multiple-choice recognition paradigms, rather than directly measuring creative generation capability; rubric-based scoring and LLM-as-Judge methods rely on subjective dimension assessme
RepoMirage (Avg. (self-reported)): leader Gemini 3.1 Pro Preview (41.4), 8 models
Code agents now perform well on repository-level software engineering benchmarks, but it remains unclear whether success on end-to-end tasks such as issue resolution truly reflects repository context reasoning: the ability to identify task-relevant information across multiple files and reason over the relations among them. To investigate this question, we introduce Rep
StakeBench (Agg (self-reported)): leader GPT-5.5 (20.4), 15 models
Existing financial NLP benchmarks often rely on labels supplied by outside observers, measuring how language is perceived rather than what speakers have committed to in the market. We introduce StakeBench, an evaluation framework for language understanding grounded in market commitment. StakeBench links 560,876 comments from 2,261 resolved markets to verified position, action, and market-odds reco
VisualNeedle (w/ Tools Acc. (%) (self-reported)): leader Gemini 3.1 Pro Preview (56.01), 9 models
Frontier multimodal large language models (MLLMs) have been reported to achieve over 90% accuracy on fine-grained perception benchmarks. However, such scores do not necessarily imply faithful use of visual evidence. Prior studies have identified three shortcuts that inflate benchmark performance. First, linguistic priors and lexical cues in questions often enable models to infer plausible answers
FrontierOR (Sol. quality (self-reported)): leader Gemini 3.1 Pro Preview (52.0), 7 models
Large language models (LLMs) are increasingly used for optimization modeling and solver-code generation, yet practical operations research and optimization problems often require a harder capability: designing scalable algorithms that exploit problem structure and outperform direct formulation-and-solve baselines. Existing benchmarks are limited to small or simplified examples far below real-world
GlobalDentBench (Macro-Average Score (self-reported)): leader Gemini 3.1 Pro Preview (63.27), 12 models
While large language models (LLMs) hold transformative potential for medicine, their reasoning robustness and safety in real-world clinical scenarios remain critically underexplored, particularly in dentistry. Here we introduce GlobalDentBench, the first multinational dental benchmark, featuring a taxonomy that encompasses 14 dental specialties across 88 countries and regions spanning six continen
EvoCode-Bench (MT@4 (self-reported)): leader Opus 4.7 (54.0), 13 models
Coding agents are increasingly used as iterative development partners, but most benchmarks still evaluate one specification followed by one final assessment. This leaves out a basic question: can an agent keep its own codebase working as requirements change? We introduce EvoCode-Bench, a benchmark of 26 stateful coding tasks and 227 evaluated rounds. Each task preserves the agent's workspace for 5
GENSTRAT (Alpha (chips/game) (self-reported)): leader GPT-5.4 high (85.0), 9 models
Large language models (LLMs) are increasingly deployed as economic agents in marketplaces, auctions, and bidding settings. Anticipating their behavior in any specific deployment is hard. Existing strategic-reasoning benchmarks evaluate models on fixed canonical games. These benchmarks may saturate as the frontier improves, and they do not allow evaluators to generalize with confidence from benchma
OpenSkillEval (Overall avg. (self-reported)): leader Claude Opus 4.6 (4.51), 10 models
Skills, i.e., structured workflow instructions distilled for large language models (LLMs), are becoming an increasingly important mechanism for improving agent performance on real-world downstream tasks. However, as the open-source skill ecosystem rapidly expands, it remains unclear how different models and agent frameworks interact with skills, how to evaluate skill quality, and how users should
ForecastBench-Sim (FBSim) (ECI (self-reported)): leader Claude Opus 4.6 (155.0), 27 models
We document inverse scaling in LLMs on forecasting problems whose underlying time series exhibit superlinear growth and tail risk of regime change, a structure common in finance and epidemiology. On these tasks, more capable models produce worse distributional forecasts. The pattern appears on ForecastBench-Sim (FBSim), a contamination-free, simulated-world benchmark we release, in forecasting syn
Perception or Prejudice (HR (self-reported)): leader Gemini 3 Flash Preview (33.5), 27 models
Multimodal Large Language Models (MLLMs) are increasingly deployed in human-facing roles where personality perception is critical, yet existing benchmarks evaluate this capability solely on numerical Big Five score prediction, leaving open whether models truly perceive personality through behavioral understanding or merely prejudge through superficial pattern matching. We address this gap with thr
SGR-Bench (Item-F1 (self-reported)): leader GPT-5.5 (66.18), 11 models
Recent advances in large language models and tool-using agents have expanded the range of benchmarked web tasks. Yet an important class of specialized retrieval tasks remains undercharacterized. On many specialized data-retrieval websites, answer-bearing evidence becomes accessible only after establishing the correct site-specific retrieval state through filters, views, hierarchies, or scopes. We
SpaceDG (Avg. (self-reported)): leader SpaceDG-SFT (Qwen3-VL-8B-Instruct) (66.1), 29 models
Multimodal Large Language Models (MLLMs) have made rapid progress in spatial intelligence, yet existing spatial reasoning benchmarks largely assume pristine visual inputs and overlook the degradations that commonly occur in real-world deployment, such as motion blur, low light, adverse weather, lens distortion, and compression artifacts. This raises a fundamental question: how robust is the spatia
ArchSIBench (Avg. (self-reported)): leader Human Level(w bg in Arch.) (89.2), 28 models
Architectural spatial intelligence, the ability to recognize and infer architectural space, is fundamental to tasks such as robot navigation, embodied interaction, and 3D scene understanding and generation. Although extensive research has evaluated the basic spatial skills of Vision-Language Models (VLMs) such as relative orientation, distance comparison, and object counting, these tasks cover onl
AttuneBench (Composite (self-reported)): leader Opus 4.6 (54.3), 11 models
Emotional intelligence (EI), the ability to perceive, understand, and respond appropriately to others' emotional states, is central to human communication, and increasingly important to assess as LLMs assume conversational roles in everyday life. Existing EI benchmarks rely on synthetic prompts, single-turn cases, or third-party annotation. These approaches do not directly measure how models infer
DeepWeb-Bench (Overall score (self-reported)): leader GPT-5.5 (33.37), 9 models
Deep research, in which an agent searches the open web, collects evidence, and derives an answer through extended reasoning, is a prominent use case for frontier language models. Frontier deep research products score high on existing benchmarks, making it difficult to distinguish their capabilities from current evaluation data alone. We introduce DeepWeb-Bench, a deep research benchmark that is su
Hack-Verifiable TextArena (Avg HR (self-reported)): leader Grok 4.1 Fast (28.5), 12 models
Aligning autonomous agents with human intent remains a central challenge in modern AI. A key manifestation of this challenge is reward hacking, whereby agents appear successful under the evaluation signal while violating the intended objective. Reward hacking has been observed across a wide range of settings, yet methods for reliably measuring it at scale remain lacking. In this work, we introduce
HIDBench (DARPA E3 CADETS MCC (self-reported)): leader Claude Opus 4.6 (60.2), 9 models
Recent benchmark efforts have advanced the evaluation of large language models (LLMs) in cybersecurity, including tasks such as penetration testing and vulnerability identification. However, a critical cybersecurity task, namely intrusion detection from system logs, remains unexplored. In this work, we present a new benchmark to assess LLMs' capabilities in supporting host-based intrusion detectio
PlanningBench (All-pass (%) (self-reported)): leader GPT-5.4 x-high (63.17), 16 models
Planning is a fundamental capability for large language models (LLMs) because complex tasks require models to coordinate goals, constraints, resources, and long-term consequences into executable and verifiable solutions. Existing planning benchmarks, however, usually treat planning data as fixed collections of instances rather than controllable generation targets. This limits scenario coverag
QuestBench (Pass Rate (%) (self-reported)): leader GPT-5.5 (57.58), 13 models
As AI becomes part of everyday learning, many courses teach students to use it mainly as a productivity tool: how to prompt, search, summarize, write, code, and use tools more efficiently. We argue that AI education also needs a setting in which students learn to test AI and understand their own role in judging machine-produced knowledge. To this end, we introduce a course-based practice that teac
RankJudge (Elo (All) (self-reported)): leader Gemini 3.1 Pro Preview (1959.0), 21 models
As interactive LLM-based applications are created and refined, model developers need to evaluate the quality of generated text along many possible axes. For simpler systems, human evaluation may be practical, but in complicated systems like conversational chatbots, the amount of generated text can overwhelm human annotation resources. Model developers have begun to rely heavily on auto-evaluation,
RefusalBench (Strict Refusal Rate (self-reported)): leader Command A (94.6), 19 models
Frontier large language models are increasingly deployed as orchestration backbones for biological research workflows, yet no shared evidence base exists for comparing their refusal behaviour on legitimate research prompts. RefusalBench, introduced here, is a matched-triple benchmark of 141 prompts in 47 bundles that holds task framing constant while varying only biological risk tier (benign, bord
TempGlitch (Acc. (1 FPS) (self-reported)): leader GPT-5.4 Mini (52.4), 10 models
Vision-language models (VLMs) are increasingly being explored for video game quality assurance, especially gameplay glitch detection. Most existing evaluations, however, treat glitches as static visual anomalies, asking models to detect failures from a single frame. We argue that this framing misses a key distinction: some glitches are spatial and visible in an isolated frame, whereas others are t
WikiVQABench (Accuracy (self-reported)): leader InternVL3-78B (75.6), 15 models
Visual Question Answering (VQA) benchmarks have largely emphasized perception-based tasks that can be solved from visual content alone. In contrast, many real-world scenarios require external knowledge that is not directly observable in the image to answer correctly. We introduce WikiVQABench, a human-curated knowledge-grounded VQA benchmark constructed by systematically combining Wikipedia images
HalluWorld (Overall (self-reported)): leader GPT-4o-mini (28.1), 13 models
Hallucination remains a central failure mode of large language models, but existing benchmarks operationalize it inconsistently across summarization, question answering, retrieval-augmented generation, and agentic interaction. This fragmentation makes it unclear whether a mitigation that works in one setting reduces hallucinations across contexts. Current benchmarks either require human annotation
WildRoadBench (AP50 (self-reported)): leader Gemini 3 (42.1), 25 models
We introduce WildRoadBench, a wild aerial road-damage grounding benchmark that couples direct visual grounding by vision-language models with autonomous research-and-engineering by LLM-driven agents on a single professionally annotated UAV corpus. The same image set and the same per-class AP_50 metric are evaluated under two protocols. The VLM Track measures whether a fixed VLM can localise domain
ChildAgentEval (Total (self-reported)): leader GPT-5.4 (53.0), 6 models
While agentic AI and its core multimodal large language models (MLLMs) have demonstrated remarkable promise in language and visual reasoning across domains ranging from daily life to advanced scientific research, a profound gap remains between artificial and human intelligence. Despite the integration of powerful tools and advanced MLLMs, state-of-the-art AI agents frequently fail at foundational,
STT-Arena (Overall (self-reported)): leader Claude Opus 4.6 (35.39), 23 models
Large language models (LLMs) deployed in real-world agentic applications must be capable of replanning and adapting when mid-task disruptions invalidate their prior decisions. Existing dynamic benchmarks primarily measure whether LLMs can detect temporal changes in a timely manner, leaving the complementary challenge of adaptive replanning under spatio-temporal dynamics largely unexplored. We intr
SVFSearch (Overall Acc. (self-reported)): leader Qwen3.5-27B (95.4), 23 models
Multimodal large language models are increasingly used as agent backbones that understand multimodal inputs, plan retrieval actions, invoke external tools, and reason over retrieved information. Yet existing benchmarks rarely evaluate this ability in short-video applications, where a paused frame is often visually ambiguous and answering requires vertical, long-tail, and fast-evolving domain knowl
Time to REFLECT (Overall (Report Quality) (self-reported)): leader GPT-5.3-Codex (47.5), 14 models
Deep research agents increasingly automate complex information-seeking tasks, producing evidence-grounded reports via multi-step reasoning, tool use, and synthesis. Their growing role demands scalable, reliable evaluation, positioning LLM-as-judge as a supervision paradigm for assessing factual accuracy, evidence use, and reasoning quality. Yet the reliability of these judges for deep research age
ASPI (ASR exec_tool (self-reported)): leader DeepSeek V3.2 (65.4), 10 models
Clarification-seeking behavior is widely regarded as a desirable property of LLM agents, enabling them to resolve ambiguity before acting on underspecified tasks. However, the security implications of this interaction pattern remain unexplored. We investigate whether the transition from standard execution to a clarification-seeking state increases an agent's susceptibility to prompt injection atta
CAM-Bench (Pass@32 (self-reported)): leader DeepSeek V4 Pro (19.67), 5 models
Formal theorem-proving benchmarks enable mechanically verifiable evaluation of mathematical reasoning in large language models. However, existing benchmarks mainly focus on Olympiad-style problems and algebraic domains, leaving computational and applied mathematics underrepresented. We introduce CAM-Bench, a Lean 4 theorem-proving benchmark of 1,000 Lean proof targets in computational and applied
ContractBench (SR% (self-reported)): leader Claude Opus 4.6 (77.8), 20 models
Tool-augmented LLM agents call APIs whose intermediate outputs, such as presigned URLs, session tokens, and OAuth state parameters, are observation contracts: artifacts whose later use is constrained by the external system that produced them. We show that observation contract compliance (preserving the temporal validity and byte-level integrity) is an emergent, regression-prone capability: it is n
ConsumerSimBench (Avg (95% CI) (self-reported)): leader Gemini 3.1 Pro Preview (47.8), 13 models
LLMs are increasingly used as "digital consumers" to simulate public opinion, pre-test marketing decisions, and anticipate audience response. However, existing evaluations rarely ask whether a model can reconstruct the concrete reaction patterns that real consumers surface in public discourse. We introduce ConsumerSimBench, a benchmark built from 1,553 real Chinese social-media topics and 23,122
TOBench (Avg. (self-reported)): leader Qwen3.5 Plus 2026-04-20 (41.0), 18 models
Tool-using agents are increasingly expected to operate across realistic professional workflows, where they must interpret multimodal inputs, coordinate external tools, inspect intermediate artifacts, and revise their actions before producing a final result. Existing benchmarks, however, often evaluate tool use, computer use, and multimodal reasoning in isolation, leaving a gap between benchmark se
RoadmapBench (Resolved (%) (self-reported)): leader Claude Opus 4.7 (39.1), 13 models
Coding agents are increasingly deployed in real software development, where a single version iteration requires months of coordinated work across many files. However, most existing benchmarks focus predominantly on single-issue bug fixes from Python repositories, with coarse pass/fail evaluation outcomes, and thus fail to capture long-horizon, multi-target development at real engineering scale. To
SaaS-Bench (Overall (self-reported)): leader Claude Opus 4.7 (43.9), 15 models
Computer-Using Agents (CUAs) are rapidly extending large language models (LLMs) beyond text-based reasoning toward action execution in more complex environments, such as web browsers and graphical user interfaces (GUIs). However, existing web and GUI agent benchmarks often rely on simplified settings, isolated tasks, or short-horizon interactions, making it difficult to assess capabilities of agen
Are Agents Ready to Teach? A Multi-Stage Bench (Eq. pass (self-reported)): leader GLM 5.1 (63.8), 11 models
Language agents are increasingly deployed in complex professional workflows, with tutoring emerging as a particularly high-stakes capability that remains largely unmeasured in existing benchmarks. Effective tutor agents require more than producing correct answers or executing accurate tool calls: a robust tutor must diagnose learner state, adapt support over time, make pedagogically justified deci
Do Coding Agents Understand Least-Privilege Au (TSR (self-reported)): leader Full-Access (94.0), 11 models
As coding agents gain access to shells, repositories, and user files, least-privilege authorization becomes a prerequisite for safe deployment: an agent should receive enough authority to complete the task, without unnecessary authority that exposes sensitive surfaces. To study whether current models can infer this boundary themselves, we first introduce permission-boundary inference, where a mode
RxEval (F1 (self-reported)): leader Gemini 3.1 Pro Preview (77.1), 17 models
Inpatient medication recommendation requires clinicians to repeatedly select specific medications, doses, and routes as a patient's condition evolves. Existing benchmarks formulate this task as admission-level prediction over coarse drug codes with multi-hot diagnostic and procedure code inputs, failing to capture the per-timepoint, information-rich nature of real prescribing. We propose RxEval, a
SciPaths (F1 (self-reported)): leader Gemini 3.1 Pro Preview (18.9), 10 models
Scientific progress depends on sequences of enabling contributions, yet existing AI4Science benchmarks largely focus on citation prediction, literature retrieval, or idea generation rather than the dependencies that make progress possible. In this paper, we introduce discovery pathway forecasting: given a target scientific contribution and the prior literature available at a specified time, the ta
CiteVQA (Overall SAA (self-reported)): leader Gemini 3.1 Pro Preview (66.0), 22 models
Multimodal Large Language Models (MLLMs) have significantly advanced document understanding, yet current Doc-VQA evaluations score only the final answer and leave the supporting evidence unchecked. This answer-only approach masks a critical failure mode: a model can land on the correct answer while grounding it in the wrong passage, a critical risk in high-stakes domains like law, finance, and m
ClawForge (Strict Acc. (self-reported)): leader Claude Opus 4.6 (45.3), 7 models
Interactive agent benchmarks face a tension between scalable construction and realistic workflow evaluation. Hand-authored tasks are expensive to extend and revise, while static prompt evaluation misses failures that only appear when agents operate over persistent state. Existing interactive benchmarks have advanced agent evaluation significantly, but most initialize tasks from clean state and do
Ego2World (Goal Tasks (self-reported)): leader GLM 5 (183.0), 6 models
Embodied agents in household environments must plan under partial observation: they need to remember objects, track state changes, and recover when actions fail. Existing benchmarks only partially test this ability. Egocentric video datasets capture realistic human activities but remain passive, while interactive simulators support execution but rely on synthetic scenes and hand-crafted dynamics,
PerfCodeBench (CGRE (self-reported)): leader GPT-5 (73.99), 20 models
Large language models (LLMs) can often generate functionally correct code, but their ability to produce efficient implementations for performance-critical systems tasks remains limited. Existing code benchmarks mainly emphasize correctness or algorithmic problem solving, while realistic systems-level optimization is still underexplored. To address this gap, we introduce PerfCodeBench, an executabl
AcuityBench (QA Exact (self-reported)): leader Claude Opus 4.7 (85.3), 12 models
We introduce AcuityBench, a benchmark for evaluating whether language models identify the appropriate urgency of care from user medical presentations. Existing health benchmarks emphasize medical question answering, broad health interactions, or narrow workflow-specific triage tasks, but they do not offer a unified evaluation of acuity identification across these settings. AcuityBench addresses th
Do Enterprise Systems Need Learned World Model (CascadeBench IoU w/ BR (self-reported)): leader Qwen-3.5-27B-LoRA (50.9), 10 models
World models enable agents to anticipate the effects of their actions by internalizing environment dynamics. In enterprise systems, however, these dynamics are often defined by tenant-specific business logic that varies across deployments and evolves over time, making models trained on historical transitions brittle under deployment shift. We ask a question the world-models literature has not addr
Human-Grounded Multimodal Benchmark with 900K (Science MC Accuracy (self-reported)): leader GPT-5 (90.9), 11 models
Authentic school examinations provide a high-validity test bed for evaluating multimodal large language models (MLLMs), yet benchmarks grounded in Japanese K-12 assessments remain scarce. We present a multimodal dataset constructed from Japan's National Assessment of Academic Ability, comprising officially released middle-school items in Science, Mathematics, and Japanese Language. Unlike existing
MMCL-Bench (Overall (self-reported)): leader GPT-5.4 (26.5), 5 models
We introduce MMCL-Bench, a benchmark for multimodal context learning: learning task-local rules, procedures, and empirical patterns from visual or mixed-modality teaching context and applying them to new visual instances. Unlike text-only context learning or standard multimodal question answering, this setting requires models to recover and localize relevant evidence from images, screenshots, manu
SpatialBabel (Three.js (self-reported)): leader Gemini 3 (83.4), 14 models
Vision-language models (VLMs) exhibit a striking paradox: they can generate executable code that reconstructs a 3D scene from geometric primitives with correct object counts, classes, and approximate positions, yet the same models fail at simpler spatial questions on the same image. We show that 3D geometric primitives (cubes, spheres, cylinders, expressed in executable code) serve as a powerful i
Visual Aesthetic Benchmark (VAB) (Overall Top-1 ap@1 (self-reported)): leader Human Expert (77.7), 22 models
Multimodal large language models (MLLMs) are now routinely deployed for visual understanding, generation, and curation. A substantial fraction of these applications require an explicit aesthetic judgment. Most existing solutions reduce this judgment to predicting a scalar score for a single image. We first ask whether such scores faithfully capture comparative preference: in a controlled study wit
Agent-ValueBench (Authority (self-reported)): leader Grok 4.20 (7.8), 14 models
Agent value-alignment benchmark with executable environments and value-conflict tasks, testing whether autonomous agents express stable values across domains, harnesses, and trajectories.
CADBench (IoU Aggregate (self-reported)): leader CADFit (85.9), 11 models
Recovering editable CAD programs from images or 3D observations is central to AI-assisted design, but progress is difficult to measure because existing evaluations are fragmented across datasets, modalities, and metrics. We introduce CADBench, a unified benchmark for multimodal CAD program generation. CADBench contains 18,000 evaluation samples spanning six benchmark families derived from DeepCAD,
Greenland Sovereignty Game (implicitly as a st (Escalation Composite (self-reported)): leader MoonshotAI: Kimi K2.6 (22.0), 8 models
What happens when the strongest alliance member pressures a weaker member over territory and strategic control? We examine the Greenland sovereignty crisis as a stress test for LLM geopolitics, centered on the 2019-2026 U.S. push to acquire Greenland from the Kingdom of Denmark. The crisis nests two collective-action problems: Arctic strategic control and whether NATO can enforce alliance norms ag
gwBenchmarks (Waveform (self-reported)): leader Haiku 4.5 (59.3), 12 models
Modern gravitational wave astronomy relies on modeling tasks that often require months of graduate-level effort, including building fast waveform surrogates from expensive numerical relativity simulations, modeling orbital dynamics of black holes, fitting merger remnant properties and constructing template banks. These problems demand extreme precision to support detection and parameter inference,
IndustryBench (Final (SV) (self-reported)): leader Gemini 3.1 Pro Preview (2.08), 17 models
In industrial procurement, an LLM answer is useful only if it survives a standards check: recommended material must match operating condition, every parameter must respect a regulated threshold, and no procedure may contradict a safety clause. Partial correctness can mask safety-critical contradictions that aggregate LLM benchmarks rarely capture. We introduce IndustryBench, a 2,049-item benchmark
KnotBench (Accuracy (%) (self-reported)): leader Claude Opus 4.7 (54.6), 2 models
A vision-language model can look at a knot diagram and report what it sees, yet fail to act on that structure. KnotBench pairs an 858,318-image corpus from 1,951 prime-knot prototypes (crossing numbers 3 to 19) with a protocol whose answers are checked against Regina's canonical knot signature. Its 14 tasks span four families: equivalence judgment, move prediction, identification, and cross-modal
LITMUS (Attack Success Rate (self-reported)): leader DeepSeek V3.2 (71.51), 6 models
The rapid proliferation of LLM-based autonomous agents in real operating system environments introduces a new category of safety risk beyond content safety: behavior jailbreak, where an adversary induces an agent to execute dangerous OS-level operations with irreversible consequences. Existing benchmarks either evaluate safety at the semantic layer alone, missing physical-layer harms, or fail to i
Metacognitive Probe (Mean(T1,T2,T4,T5) (self-reported)): leader Claude Sonnet 4.6 (82.0), 8 models
The Metacognitive Probe is an exploratory five-task, 15-slot diagnostic that decomposes an LLM's confidence behaviour into five behaviourally-distinct dimensions: confidence calibration (T1-CC), epistemic vigilance (T2-EV), knowledge boundary (T3-KB), calibration range (T4-CR), and reasoning-chain validation (T5-RCV). It is evaluated on N=8 frontier models and N=69 humans. The instrument is motiva
Multi-domain Multi-modal Document Classificati (HF1 (self-reported)): leader Gemini 3.1 Pro Preview (64.98), 15 models
Document classification forms the backbone of modern enterprise content management, yet existing benchmarks remain trapped in oversimplified paradigms -- single domain settings with flat label structures -- that bear little resemblance to the hierarchical, multi-modal, and cross-domain nature of real-world business documents. This gap not only misrepresents practical complexity but also stifles pr
PaperFit-Bench (Compile (self-reported)): leader GPT-5.4 (100.0), 4 models
A LaTeX manuscript that compiles without error is not necessarily publication-ready. The resulting PDFs frequently suffer from misplaced floats, overflowing equations, inconsistent table scaling, widow and orphan lines, and poor page balance, forcing authors into repetitive compile-inspect-edit cycles. Rule-based tools are blind to rendered visuals, operating only on source code and log files. Tex
Polaris-Bench (Overall C (self-reported)): leader Gemini 3.1 Pro Preview (82.6), 15 models
As current Multimodal Large Language Models rapidly saturate canonical visual reasoning benchmarks, a key question emerges: do these strong scores genuinely reflect robust visual understanding? We identify a pervasive vulnerability, the Cartesian Shortcut: visual reasoning benchmarks prevalently build on orthogonal grid-based layouts that can be readily discretized into explicit textual coordinate
StereoTales (Emissions (self-reported)): leader Grok 4 (304.0), 23 models
Multilingual studies of social bias in open-ended LLM generation remain limited: most existing benchmarks are English-centric, template-based, or restricted to recognizing pre-specified stereotypes. We introduce StereoTales, a multilingual dataset and evaluation pipeline for systematically studying the emergence of social bias in open-ended LLM generation. The dataset covers 10 languages and 79 so
Ambig-DS (Full (self-reported)): leader Gemini 3.1 Pro Preview (64.0), 5 models
As data-science agents shift from co-pilots to auto-pilots, silent misframing becomes a critical failure mode. Agents quietly commit to plausible but unintended task framings, producing clean, executable artifacts that hide their incorrect assessment of the task. Existing benchmarks score whether the pipeline runs, ignoring whether the agent recognized the task was underspecified. We introduce Amb
CalBench (Excess (self-reported)): leader GPT-5.4 Mini (149.0), 7 models
Personal AI assistants are beginning to act as delegates with access to calendars, inboxes, and user preferences. Calendar scheduling makes the trust problem concrete: an assistant must coordinate with other assistants while deciding what to reveal about the person it represents. We introduce CalBench, a controlled benchmark for multi-agent calendar scheduling under private information. In each ta
CodeClinic (Overall (self-reported)): leader Claude Sonnet 4.6 (53.1), 8 models
Clinical reasoning agents based on large language models (LLMs) aim to automate tasks such as intensive care unit (ICU) monitoring and patient state tracking from electronic health records (EHRs). Existing systems typically rely on manually curated clinical tools or skills for concepts such as sepsis detection and organ failure assessment. However, maintaining these tool libraries requires substan
SeePhys Pro (Cons4 (self-reported)): leader Human Performance (49.0), 16 models
We introduce SeePhys Pro, a fine-grained modality transfer benchmark that studies whether models preserve the same reasoning capability when critical information is progressively transferred from text to image. Unlike standard vision-essential benchmarks that evaluate a single input form, SeePhys Pro features four semantically aligned variants for each problem with progressively increasing visual
TraceEval (Average F1 (self-reported)): leader Claude Opus 4.6 (72.9), 10 models
Evaluating whether large language models (LLMs) can recover execution-relevant program structure, rather than only produce code that passes tests, remains an open problem. Existing code benchmarks emphasize test-passing outputs, from standalone programming tasks (HumanEval, MBPP, LiveCodeBench) to repository repair (SWE-Bench); this is useful, but offers limited diagnostic signal about which progr
Beyond the All-in-One Agent (avg. (self-reported)): leader DeepSeek V4 Pro (62.0), 12 models
Large language model (LLM) agents are increasingly expected to operate in enterprise environments, where work is distributed across specialized roles, permission-controlled systems, and cross-departmental procedures. However, existing enterprise benchmarks largely evaluate single agents with broad tool access, while existing multi-agent benchmarks rarely capture realistic enterprise constraints su
DiagnosticIQ (Macro % D.IQ (self-reported)): leader Claude Opus 4.6 (73.59), 33 models
Monitoring complex industrial assets relies on engineer-authored symbolic rules that trigger based on sensor conditions and prompt technicians to perform corrective actions. The bottleneck is not detection but response: translating rules into maintenance steps requires asset-specific knowledge gained through years of practice. We investigate whether LLMs can serve as decision support for this rule
DocScope (ACC All (self-reported)): leader Gemini 3.1 Pro Preview (78.9), 23 models
Evaluating whether Multimodal Large Language Models can produce trustworthy, verifiable reasoning over long, visually rich documents requires evaluation beyond end-to-end answer accuracy. We introduce DocScope, a benchmark that formulates long-document QA as a structured reasoning trajectory prediction problem: given a complete PDF document and a question, the model outputs evidence pages, support
Done, But Not Sure (B All (self-reported)): leader Gemini 3.1 Pro Preview (56.4), 20 models
Standard embodied evaluations do not independently score whether an agent correctly commits to task completion at episode closure, a capacity we call terminal commitment. Behaviorally distinct failures--never completing the task, completing it but failing to stop, and reporting success without sufficient evidence--collapse into the same benchmark failure. We introduce VIGIL, an evaluation framewor
FORTIS (Task 1 EM (self-reported)): leader Claude Opus 4.7 (54.8), 10 models
Large language model agents increasingly operate through an intermediate skill layer that mediates between user intent and concrete task execution. This layer is widely treated as an organizational abstraction, but we argue it is also a privilege boundary that current models routinely exceed. We present FORTIS, a benchmark that evaluates over-privilege in agent skills across two stages:
ProactBench (Overall Pass Rate (self-reported)): leader GPT-5.5 (61.5), 16 models
Most LLM benchmarks score how well a model responds to explicit requests. They leave unmeasured a different conversational ability: noticing and acting on needs the user has implied but not said. We call this conversational proactivity. ProactBench decomposes it into three phase-tied types: Emergent, inference from a single disclosed anchor; Critical, synthesis across m
EnvSimBench (Cm Overall (self-reported)): leader Ours (Full-Balance2, 4B) (45.3), 8 models
Scalable AI agent training relies on interactive environments that faithfully simulate the consequences of agent actions. Manually crafted environments are expensive to build, brittle to extend, and fundamentally limited in diversity. A promising direction is to replace manually crafted environments with LLM-simulated counterparts. However, this paradigm hinges on an unexamined core assumption: L
From 0-Order Selection to 2-Order Judgment (H-Comb (self-reported)): leader Human (79.5), 13 models
Multiple-choice reasoning benchmarks face dual challenges: rapid saturation from advancing models and data contamination that undermines static evaluations. Ad-hoc hardening methods (paraphrasing, perturbation) attempt to increase difficulty but sacrifice logical validity for surface complexity, and so fail to challenge advanced reasoning models. We present LogiHard, a formal framework that dete
InterLV-Search (Level 3 +Tool Avg (self-reported)): leader Gemini 3.1 Pro Preview (46.46), 8 models
Existing benchmarks for multimodal agentic search evaluate multimodal search and visual browsing, but visual evidence is either confined to the input or treated as an answer endpoint rather than part of an interleaved search trajectory. We introduce InterLV-Search, a benchmark for Interleaved Language-Vision Agentic Search, in which textual and visual evidence is repeatedly used to condi
MathConstraint (Accuracy (self-reported)): leader GPT-5.5 (66.9), 12 models
We introduce MathConstraint, a hard, adaptive benchmark for evaluating the combinatorial reasoning capabilities of LLMs. We combine constraint satisfaction problems with rigorous solver-based verification and design an adaptive generator to create instances that remain challenging as the LLMs improve in their reasoning capabilities. Unlike existing benchmarks that quickly saturate on fixed dataset
NARRA-Gym for Evaluating Interactive Narrative (StoryQ (self-reported)): leader Claude Sonnet 4.6 (3.9), 9 models
Interactive narrative tasks require LLMs to sustain a coherent, evolving story while adapting to a user over multiple turns. However, suitable benchmarks for this setting are limited: existing evaluations often focus on static prompts, isolated story generations, or post-hoc ratings, and therefore miss whether models can jointly manage story generation, long-context state and pacing, character sim
TeamBench (Solo (self-reported)): leader Claude Opus 4.7 (35.6), 13 models
Agent systems often decompose a task across multiple roles, but these roles are typically specified by prompts rather than enforced by access controls. Without enforcement, a team pass rate can mask whether agents actually coordinated or whether one role effectively did another role's work. We present TeamBench, a benchmark with 851 task templates and 931 seeded instances for evaluating agent coor
VeriContest (End-to-End (end2end) (self-reported)): leader GPT-5.5 (5.29), 10 models
Large language models can generate useful code from natural language, but their outputs come without correctness guarantees. Verifiable code generation offers a path beyond testing by requiring models to produce not only executable code, but also formal specifications and machine-checkable proofs. Progress in this direction, however, is difficult to measure: existing benchmarks are often small, fo
An Empirical Study of Proactive Coding Assista (Pass@1 (self-reported)): leader Claude Sonnet 4.6 (13.57), 7 models
Large language model (LLM)-based coding assistants have made substantial progress, yet most systems remain reactive, requiring developers to explicitly formulate their needs. Proactive coding assistants aim to infer latent developer intent from integrated development environment (IDE) interactions and repository context, thereby reducing interaction overhead and supporting more seamless assistance
Artificial Intelligence Quotient (AIQ) Benchmark (Accuracy (self-reported)): leader Gemini 3.1 Pro Preview (99.46), 6 models
The pursuit of artificial general intelligence necessitates robust methods for evaluating the cognitive capabilities of models beyond narrow task performance. Here, we introduce a psychometric framework to assess the cognitive profiles of generative AI, comparing them to human norms and tracking their evolution across generations. Initial evaluation of leading multimodal models using tasks adapted
Cited but Not Verified (Relevant Content (self-reported)): leader Claude Opus 4.5 (95.7), 14 models
Large language models (LLMs) power deep research agents that synthesize information from hundreds of web sources into cited reports, yet these citations cannot be reliably verified. Current approaches either trust models to self-cite accurately, risking bias, or employ retrieval-augmented generation (RAG) that does not validate source accessibility, relevance, or factual consistency. We introduce
CVerifBench (Total (self-reported)): leader Claude Opus 4.7 (98.3), 14 models
We introduce an evaluation framework of 500 C verification tasks across five property types (memory safety, overflow, termination, reachability, data races) built on SV-COMP 2025, and evaluate 14 models across six families. We find that high overall accuracy masks a critical weakness: while most models reliably confirm properties hold, violation detection varies widely and degrades sharply with pr
IntentGrasp (Overall (All Set) Avg (Std) (self-reported)): leader Gemini 3.1 Pro Preview (59.68), 20 models
Accurately understanding the intent behind speech, conversation, and writing is crucial to the development of helpful Large Language Model (LLM) assistants. This paper introduces IntentGrasp, a comprehensive benchmark for evaluating the intent understanding capability of LLMs. Derived from 49 high-quality, open-licensed corpora spanning 12 diverse domains, IntentGrasp is constructed through source
PinTrace (Ï_U(%) (self-reported)): leader MoonshotAI: Kimi K2.5 (45.78), 10 models
Large language models (LLMs) are now largely involved in software development workflows, and the code they generate routinely includes third-party library (TPL) imports annotated with specific version identifiers. These version choices can carry security and compatibility risks, yet they have not been systematically studied. We present the first large-scale measurement study of version-level risk
SmellBench (Weighted Effectiveness (E) (self-reported)): leader GPT-5.3-Codex (47.8), 9 models
Architectural code smells erode software maintainability and are costly to repair manually, yet unlike localized bugs, they require cross-module reasoning about design intent that challenges both developers and automated tools. While large language model agents excel at bug fixing and code-level refactoring, their ability to repair architectural code smells remains unexplored. We present the first
STALE (Overall (self-reported)): leader CUPMem (Ours) (68.0), 15 models
Large Language Model (LLM) agents are increasingly expected to maintain coherent, long-term personalized memory, yet current benchmarks primarily measure static fact retrieval, overlooking the ability to revise stored beliefs when new evidence emerges. We identify a critical and underexplored failure mode, Implicit Conflict: a later observation invalidates an earlier memory without explicit negati
XL-SafetyBench (Overall ASR (self-reported)): leader Mistral: Mistral Large 3 2512 (98.8), 10 models
Country-grounded cross-cultural safety benchmark with 5,500 test cases across 10 country-language pairs, separating universal jailbreak robustness from culturally embedded sensitivities.
AA-LCR (Score (self-reported)): leader GPT-5.2-Codex (75.7), 331 models
AA-LCR evaluates model capability on long context tasks from the linked upstream source with Score as the primary reported metric.
Creative Writing v3 (Elo score (self-reported)): leader Claude Opus 4.7 (2215.9), 102 models
Creative Writing v3 evaluates model capability on writing tasks from the linked upstream source with Elo score as the primary reported metric.
GeneBench (Mean pass rate (self-reported)): leader GPT-5.5 Pro (33.2), 16 models
Genetics and quantitative-biology benchmark where models analyze noisy scientific data, detect confounders, and implement statistical workflows with minimal guidance.
Harvey Legal Agent Benchmark (All-Pass Task Success (self-reported)): leader Claude Mythos 5 (16.91), 11 models
Legal-agent benchmark for completing realistic legal workflows with all-pass grading, including held-out Harvey tasks and public legal-agent task sets.
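A minimal sketch of how an all-pass success rate could be computed, assuming "all-pass grading" means a task counts as solved only when every one of its rubric checks passes. The function name `all_pass_rate` and the list-of-booleans input format are hypothetical and are not taken from the benchmark itself.

```python
from typing import List


def all_pass_rate(tasks: List[List[bool]]) -> float:
    """Percentage of tasks whose checks ALL passed (assumed reading of all-pass grading).

    `tasks` holds one list of booleans per task, one boolean per rubric check.
    A task with no checks is treated as failed, which is an assumption of this sketch.
    """
    if not tasks:
        raise ValueError("at least one task is required")
    solved = sum(1 for checks in tasks if checks and all(checks))
    return 100.0 * solved / len(tasks)


# Four hypothetical tasks: only the first and third pass every check.
example = [[True, True], [True, False], [True], [False, False]]
rate = all_pass_rate(example)
```

Under this reading a single failed check zeroes out the whole task, which would explain why all-pass scores sit well below per-check pass rates.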
HMMT 2025 (Score (self-reported)): leader GPT-5.2 (100.0), 61 models
MathArena evaluation based on Harvard-MIT Mathematics Tournament 2025 problems, emphasizing olympiad-style high-school contest reasoning.
MathArena Apex (Score (self-reported)): leader GPT-5.5 (80.21), 47 models
MathArena Apex is a challenging math contest benchmark featuring the most difficult mathematical problems designed to test advanced reasoning and problem-solving abilities of AI models. It focuses on olympiad-level mathematics and complex multi-step mathematical reasoning.
OmniDocBench 1.5 (Overall (self-reported)): leader PaddleOCR-VL-1.5 (94.5), 50 models
Document-understanding benchmark covering OCR, layout parsing, tables, formulas, and information extraction across diverse document types.
CC-OCR V2 (Average (self-reported)): leader Qwen3.6 Plus (75.77), 15 models
Large Multimodal Models (LMMs) have recently shown strong performance on Optical Character Recognition (OCR) tasks, demonstrating their promising capability in document literacy. However, their effectiveness in real-world applications remains underexplored, as existing benchmarks adopt task scopes misaligned with practical applications and assume homogeneous acquisition conditions. To address this
MCJudgeBench (CJAR (self-reported)): leader Gemini 3.1 Pro Preview (85.8), 7 models
Multi-constraint instruction following requires verifying whether a response satisfies multiple individual requirements, yet LLM judges are often assessed only through overall-response judgments. We introduce MCJudgeBench, a benchmark for constraint-level judge evaluation in multi-constraint instruction following. Each instance includes an instruction, a candidate response, an explicit constraint
AcademiClaw (Pass Rate (self-reported)): leader Claude Opus 4.6 (55.0), 7 models
Benchmarks within the OpenClaw ecosystem have thus far evaluated exclusively assistant-level tasks, leaving the academic-level capabilities of OpenClaw largely unexamined. We introduce AcademiClaw, a bilingual benchmark of 80 complex, long-horizon tasks sourced directly from university students' real academic workflows -- homework, research projects, competitions, and personal projects -- that the
DataClawBench (Overall Acc. (self-reported)): leader Claude Opus 4.6 (63.4), 8 models
Autonomous data analysis agents are increasingly expected to conduct exploratory analysis with limited human guidance about data. However, existing benchmarks typically evaluate such agents in prior-guided settings, providing selected data sources, explicit data schemas, or cleaned data, thereby understating the exploratory burden. To evaluate this realistic exploratory data analysis task, we intr
MolViBench (Pass@1 rate (self-reported)): leader Claude Opus 4.6 Think (IR) (39.7), 13 models
Molecular Vibe Coding, a paradigm where chemists interact with LLMs to generate executable programs for molecular tasks, has emerged as a flexible alternative to chemical agents with predefined tools, enabling chemists to express arbitrarily complex, customized workflows. Unlike general coding tasks, molecular coding imposes a distinctive challenge that LLMs should jointly equip programming, molec
PhysicianBench (Pass@1 (self-reported)): leader GPT-5.5 (46.3), 12 models
Long-horizon physician workflow benchmark grounded in clinical records, measuring checkpoint and end-to-end task success.
The Compliance Trap (Δ (degradation) (self-reported)): leader Qwen3-80B Thinking (13.6), 10 models
As frontier AI models are deployed in high-stakes decision pipelines, their ability to maintain metacognitive stability (knowing what they do not know, detecting errors, seeking clarification) under adversarial pressure is a critical safety requirement. Current safety evaluations focus on detecting strategic deception (scheming); we investigate a more fundamental failure mode: cognitive collapse.
TSCG (20 Tools (json-text) (self-reported)): leader Qwen3 14B (90.2), 13 models
Production agent frameworks (OpenAI Function Calling, Anthropic Tool Use, MCP) transmit tool schemas as JSON, a format designed for machine parsing, not for interpretation by language models. For small models (4B-14B), this protocol mismatch accounts for the majority of tool-use failure at production catalog sizes. We present TSCG, a deterministic tool-schema compiler that resolves this mismatch a
HealthBench Professional (Score (self-reported)): leader Claude Mythos 5 (66.0), 14 models
HealthBench professional subset for medically challenging, expert-oriented healthcare question answering.
BioMysteryBench Human-Difficult (Accuracy (self-reported)): leader Claude Mythos 5 (46.1), 7 models
Anthropic BioMysteryBench slice covering 23 real-world bioinformatics tasks no human benchmarker solved after QC, evaluated by average accuracy across five trials per problem.
BioMysteryBench Human-Solvable (Accuracy (self-reported)): leader Claude Mythos 5 (83.9), 7 models
Anthropic BioMysteryBench slice covering 76 real-world bioinformatics tasks solved by at least one human benchmarker, evaluated by average accuracy across five trials per problem.
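The "average accuracy across five trials per problem" used by both BioMysteryBench slices can be sketched as below. This assumes each problem's accuracy is the fraction of its trials that succeeded and that the benchmark score is the unweighted mean over problems; the function name and the input format are hypothetical.

```python
from typing import Dict, List


def mean_trial_accuracy(trials: Dict[str, List[bool]]) -> float:
    """Average per-problem accuracy, as a percentage.

    `trials` maps a problem id to its trial outcomes (five per problem in the
    description above, but any non-zero count works in this sketch).
    """
    if not trials:
        raise ValueError("at least one problem is required")
    per_problem = []
    for problem_id, outcomes in trials.items():
        if not outcomes:
            raise ValueError(f"problem {problem_id!r} has no trials")
        per_problem.append(sum(outcomes) / len(outcomes))
    return 100.0 * sum(per_problem) / len(per_problem)


# Two hypothetical problems: one solved in all five trials, one in three of five.
example = {
    "p1": [True, True, True, True, True],
    "p2": [True, True, True, False, False],
}
score = mean_trial_accuracy(example)
```

Averaging per problem first keeps every problem equally weighted even if trial counts were to differ.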
OccuBench (Completion rate (self-reported)): leader Gemini 3.1 Pro Preview (45.3), 15 models
Professional-task benchmark using simulated domain tool environments to evaluate LLM agents across occupation-specific workflows.
EnterpriseArena (Full Survival % (self-reported)): leader Human (60.0), 26 models
EnterpriseArena evaluates LLM agents as CFO-style decision makers in a 132-month FinTech lending simulator. Agents manage liquidity, close books, buy costly signals, and choose equity or debt financing under partial observability, hard resource budgets, delayed consequences, and changing macroeconomic regimes. (Source: benchmarklist.com, self-reported.)
OrgForge-IT (Verdict F1 (self-reported)): leader Claude Opus 4.6 (100.0), 10 models
Synthetic insider-threat detection benchmark built from OrgForge organizational simulation telemetry with triage, verdict, and false-positive scoring.
EnterpriseOps-Gym (Task Success Rate (self-reported)): leader Claude Opus 4.6 (44.6), 22 models
Stateful enterprise operations benchmark for LLM agents performing long-horizon planning, tool use, and policy-governed workflows.
Vibe Code Bench v1.1 (Score (self-reported)): leader Claude Fable 5 max (90.35), 55 models
Vals AI benchmark for vibe-coding agents that build complete applications from product-style prompts and are scored on functional correctness and quality.
Arena AI Document (Arena ELO (self-reported)): leader Claude Opus 4.6 (1526.0), 19 models
Crowdsourced Arena AI pairwise human-preference leaderboard for PDF and document-understanding models.
LABBench2 Clinical Trials (Score (self-reported)): leader Claude Mythos 5 (91.2), 4 models
LABBench2 clinical-trials subset reported in Anthropic's Claude Opus 4.8 system card.
LABBench2 Patent Questions (Score (self-reported)): leader Claude Mythos 5 (79.8), 4 models
LABBench2 patent-question subset reported in Anthropic's Claude Opus 4.8 system card.
DeepSearchQA (Score (self-reported)): leader Claude Mythos Preview (94.4), 5 models
Deep-search question-answering benchmark for agents that must gather, compare, and synthesize evidence across multi-hop web research tasks.
APEX-Agents (Mean Score (ReAct) (self-reported)): leader Gemini 3.5 Flash (66.1), 40 models
The AI Productivity Index for Agents (APEX-Agents) measures whether frontier AI agents can execute long-horizon, cross-application tasks across three jobs in professional services.
APEX-Agents-AA (Pass@1 (self-reported)): leader Gemini 3.5 Flash (47.1), 24 models
Artificial Analysis implementation of APEX-Agents using the Stirrup agent harness for long-horizon, cross-application professional-services tasks.
EnigmaEval (Score (self-reported)): leader GPT-5.4 Pro (23.82), 39 models
EnigmaEval is a benchmark from puzzle hunts, testing AI with complex reasoning, creative problem-solving, and cross-domain knowledge synthesis.
MCP Atlas (Score (self-reported)): leader Gemini 3.5 Flash (83.6), 32 models
Evaluating real-world tool use through the Model Context Protocol (MCP).
MultiNRC (Score (self-reported)): leader GPT-5 Pro (65.2), 39 models
MultiNRC benchmarks LLMs on 1,000+ culturally grounded reasoning questions by native French, Spanish, and Chinese speakers across four reasoning categor...
PRBench Finance (Score (self-reported)): leader Claude Opus 4.6 (53.28), 28 models
Professional Reasoning Bench Finance evaluates frontier LLMs on complex financial reasoning tasks including analysis, modeling, and decision-making.
Professional Reasoning Bench (Score (self-reported)): leader Muse Spark (52.29), 28 models
Professional Reasoning Bench Legal evaluates frontier LLMs on complex legal reasoning tasks drawn from real-world legal practice and case analysis. (Source: benchmarklist.com, self-reported.)
SWE Atlas (Score (self-reported)): leader NexAU (45.4), 16 models
SWE Atlas Codebase QnA evaluates LLMs on deep code comprehension and question answering across real-world software repositories. (Source: benchmarklist.com, self-reported.)
TutorBench (Score (self-reported)): leader Muse Spark (68.55), 23 models
TutorBench evaluates how well LLMs perform common tutoring tasks for high school and AP-level subjects.
Visual-Language Understanding (Score (self-reported)): leader Gemini 2.5 Pro Experimental (March 2025) (54.65), 54 models
Scale's SEAL Leaderboard evaluates top models' visual-language understanding, testing perception, logic, calculation, and common sense.
VTB (Score (self-reported)): leader GPT-5.4 high (29.17), 17 models
Evaluating how LLMs can dynamically interact with and reason about visual information.
NL2Repo (Score (self-reported)): leader Claude Opus 4.8 (69.7), 12 models
NL2Repo evaluates long-horizon coding capabilities including repository-level understanding, where models must generate or modify code across entire repositories from natural language specifications.
Arena AI Code (Arena ELO (self-reported)): leader Claude Opus 4.7 (1570.0), 64 models
Crowdsourced Arena AI pairwise human-preference leaderboard for code generation and coding-assistant models.
LMArena WebDev Arena (Arena rating (self-reported)): leader Claude Opus 4.7 (1567.85), 21 models
LMArena's WebDev Arena leaderboard for model performance on interactive web development tasks judged by human preference.
IMO-AnswerBench (Score (self-reported)): leader DeepSeek V4 Flash (91.1), 14 models
International Mathematical Olympiad answer benchmark evaluating final-answer correctness on high-difficulty olympiad-style mathematical problems.
Toolathlon (Score (self-reported)): leader Claude Fable 5 (61.7), 18 models
Tool-use benchmark spanning many tool categories, testing whether agents can select, sequence, and combine tools to complete realistic tasks.
CritPt (Accuracy (self-reported)): leader GPT-5.5 Pro (30.6), 328 models
Research-level physics reasoning benchmark with composite challenges designed by active physics researchers.
GDPval (Wins/Ties vs Human (self-reported)): leader GPT-5.5 (84.9), 18 models
Real-world, economically valuable knowledge work tasks across 44 occupations and 9 U.S. GDP sectors.
OSWorld-Verified (Score (self-reported)): leader Claude Mythos Preview max (85.4), 20 models
OSWorld-Verified evaluates model capability on agentic tasks from the linked upstream source with Score as the primary reported metric.
CyberGym (Score (self-reported)): leader Claude Mythos 5 (83.8), 8 models
Cybersecurity agent benchmark for discovering, exploiting, and reasoning about vulnerabilities in controlled challenge environments.
LiveSQLBench (Success Rate (self-reported)): leader Gemini 3.1 Pro Preview (43.1), 33 models
Dynamic contamination-free text-to-SQL benchmark for real-world database tasks, including business-intelligence queries, CRUD/management SQL, hierarchical knowledge bases, and large industrial-scale database variants.
HealthBench Hard (Overall score (self-reported)): leader gpt-oss-120b (60.0), 55 models
Hard subset of HealthBench, evaluating difficult clinical and biomedical advice with physician-written rubrics and stricter scoring.
ChartQAPro (With tools score (self-reported)): leader Claude Mythos Preview (73.6), 4 models
Harder chart-understanding evaluation for professional and technical visual-question-answering tasks.
ScreenSpot-Pro (Grounding score (self-reported)): leader Claude Mythos Preview (93.0), 34 models
Professional GUI grounding benchmark requiring agents to identify precise screen locations in high-resolution development, creative, and scientific software.
ITBench-AA (Average Precision at Full Recall (self-reported)): leader Claude Opus 4.7 max (46.7), 23 models
Artificial Analysis implementation of IBM\
MedXpertQA (Score (self-reported)): leader Gemini 3.1 Pro Preview (80.7), 19 models
MedXpertQA evaluates model capability on healthcare & medical tasks from the linked upstream source with Score as the primary reported metric.
MultiChallenge (Score (self-reported)): leader Muse Spark (75.52), 34 models
MultiChallenge evaluates frontier LLMs on realistic multi-turn conversations, assessing instruction retention, inference memory, and self-coherence.
OCRBench v2 (Average (self-reported)): leader KDL Frontier (68.1), 29 models
OCRBench v2 evaluates large multimodal models on bilingual visual text localization and reasoning tasks.
OCRBench-V2 (en) (Score (self-reported)): leader Qwen3.7 Plus (70.7), 26 models
OCRBench v2 English subset: Enhanced benchmark for evaluating Large Multimodal Models on visual text localization and reasoning with English text content.
OCRBench-V2 (zh) (Score (self-reported)): leader Qwen3.7 Plus (67.1), 26 models
OCRBench v2 Chinese subset: Enhanced benchmark for evaluating Large Multimodal Models on visual text localization and reasoning with Chinese text content.
MMMU Pro (Score (self-reported)): leader Claude Fable 5 max (Anthropic) (89.31), 68 models
MMMU Pro evaluates model capability on intelligence & reasoning tasks from the linked upstream source with Score as the primary reported metric.
FigQA (Score (self-reported)): leader Claude Mythos 5 (90.7), 4 models
FigQA evaluates model capability on multimodal tasks from the linked upstream source with Score as the primary reported metric.
LiveBench (LiveBench average (self-reported)): leader GPT-5.5 x-high (81.28), 43 models
Continuously updated benchmark measuring many capabilities: math, coding, reasoning, language, instruction following, and data analysis.
CharXiv-R (Score (self-reported)): leader Claude Mythos 5 (93.5), 12 models
CharXiv-R evaluates model capability on multimodal tasks from the linked upstream source with Score as the primary reported metric.
LVBench (Score (self-reported)): leader GPT-5.4 (77.4), 27 models
LVBench evaluates model capability on multimodal tasks from the linked upstream source with Score as the primary reported metric.
VideoMME w sub. (Score (self-reported)): leader GPT-5.4 (89.5), 56 models
VideoMME w sub. evaluates model capability on multimodal tasks from the linked upstream source with Score as the primary reported metric.
LegalBench (Score (self-reported)): leader Claude Fable 5 max (Anthropic) (88.56), 107 models
Evaluating language models on a wide range of open source legal reasoning tasks.
LMArena Text Arena (Arena rating (self-reported)): leader Claude Opus 4.6 (1500.24), 19 models
Crowdsourced pairwise human-preference leaderboard for text chat models in LMArena, formerly LMSYS Chatbot Arena.
100Q-Hard Net Score (Net score (self-reported)): leader Claude Mythos 5 (42.0), 7 models
Closed-book factuality benchmark reported by Anthropic as net score: correct responses minus incorrect responses, with abstentions scoring zero.
AA-Omniscience Net Score (Net score (self-reported)): leader Claude Mythos 5 (53.0), 7 models
AA-Omniscience factuality results reported by Anthropic as net score: correct responses minus incorrect responses, with abstentions scoring zero.
AIIQ Composite IQ (Composite IQ (self-reported)): leader GPT-5.5 (136.0), 47 models
AIIQ composite estimate that combines abstract, mathematical, programmatic, and academic reasoning benchmark evidence into IQ-like model scores.
ArxivMath (Score (self-reported)): leader Claude Fable 5 (78.6), 10 models
MathArena ArxivMath final-answer research-math benchmark slice from the March and April 2026 releases, as reported in Anthropic's Claude Opus 4.8 system card.
AutoLab (Overall Score (self-reported)): leader Claude Opus 4.6 (68.0), 11 models
AutoLab evaluates AI agents on iterative performance-engineering tasks across model development, puzzle/challenge tasks, and system optimization.
AutoMedBench (Average Overall Score (self-reported)): leader Claude Opus 4.6 (69.69), 7 models
AutoMedBench evaluates model capability on healthcare & medical tasks from the linked upstream source with Average Overall Score as the primary reported metric.
BioPipelineBench Verified (Accuracy (self-reported)): leader Claude Mythos Preview (88.1), 4 models
Verified BioPipelineBench slice for bioinformatics pipeline tasks, reported in Anthropic's Claude Opus 4.8 system card.
BLXBench (Score (self-reported)): leader Grok 4.3 (85.5), 25 models
Community benchmark runner and public leaderboard for AI model performance across coding, debugging, reasoning, hallucination, refactoring, security, and speed slices.
CAIS Risk Index (Risk Index (self-reported)): leader Claude Opus 4.7 (32.9), 37 models
Composite CAIS AI Dashboard risk index averaging VCT refusal risk, HLE miscalibration, MASK risk, Machiavelli, and TextQuests Harm for models with all component scores. Lower is better.
CAIS Text Capabilities Index (Text Capabilities Index (self-reported)): leader GPT-5.5 (54.1), 39 models
Composite CAIS AI Dashboard text index averaging Humanity's Last Exam, ARC-AGI-2, TextQuests, and SWE-bench Pro for models with all component scores.
CAIS Vision Capabilities Index (Vision Capabilities Index (self-reported)): leader Gemini 3.5 Flash (65.7), 28 models
Composite CAIS AI Dashboard vision index averaging EnigmaEval, IntPhys2, ERQA, MindCube, ART, and SpatialViz for models with all component scores.
CaseLaw v2 (Score (self-reported)): leader Grok 4.3 xAI (79.31), 53 models
Private question-answer benchmark over Canadian court cases.
Claw-Eval-Live (Pass Rate (self-reported)): leader Claude Opus 4.6 (66.7), 13 models
Quarterly refreshed enterprise-workflow benchmark grounded in live ClawHub marketplace signals and scored with deterministic checks plus structured judging.
ClawProBench (Final Score (self-reported)): leader gpt-5.5-xhigh x-high (openai) (67.9), 57 models
OpenClaw agent benchmark measuring model performance on reasoning, planning, tool use, reliability, efficiency, and safety across repeated runs.
CorpFin v2 (Score (self-reported)): leader Claude Fable 5 max (Anthropic) (71.83), 101 models
A private benchmark evaluating understanding of long-context credit agreements.
CTFBench (Vulnerability Detection Rate (self-reported)): leader SavantChat Dec 2025 (100.0), 27 models
CTFBench: Measures model robustness, truthfulness, calibration, bias, harmfulness, jailbreak resistance, or alignment-relevant behavior.
DuelLab Overall (Avg score (self-reported)): leader Claude Opus 4.7 Anthropic (74.4), 51 models
DuelLab evaluates model-generated game-playing programs by compiling submitted code and running head-to-head tournaments on hidden abstract strategy games.
EQ-Bench (Normalized Elo (self-reported)): leader Claude Fable 5 (2069.4), 77 models
Emotional intelligence benchmark testing how well models understand and process complex emotional scenarios and nuanced human interactions.
Finance Agent v1.1 (Score (self-reported)): leader Claude Opus 4.7 (64.4), 56 models
Finance Agent v1.1 evaluates model capability on finance tasks from the linked upstream source with Score as the primary reported metric.
Finance Agent v2 (Score (self-reported)): leader Gemini 3.5 Flash (57.9), 33 models
Evaluating agents on core financial analyst tasks using the FAB v2 harness.
Graphwalks BFS 1M F1 (F1 (self-reported)): leader Claude Mythos 5 (79.4), 7 models
Graphwalks breadth-first-search long-context reasoning task reported at 1M context with F1 scoring.
Graphwalks BFS 256k F1 (F1 (self-reported)): leader Claude Mythos 5 (91.1), 7 models
Graphwalks breadth-first-search long-context reasoning task reported at 256k context with F1 scoring.
Graphwalks Parents 1M F1 (F1 (self-reported)): leader Claude Mythos 5 (97.5), 7 models
Graphwalks parent-node long-context reasoning task reported at 1M context with F1 scoring.
Graphwalks Parents 256k F1 (F1 (self-reported)): leader Claude Mythos 5 (99.96), 7 models
Graphwalks parent-node long-context reasoning task reported at 256k context with F1 scoring.
HMMT February 2026 (Score (self-reported)): leader Qwen3.7 Max max (97.1), 24 models
Official Hugging Face benchmark for model performance on the February 2026 Harvard-MIT Mathematics Tournament problem set.
INCLUDE (Score (self-reported)): leader Gemini 3.1 Pro Preview (90.7), 15 models
INCLUDE multilingual evaluation reported in Anthropic's Claude Opus 4.8 system card.
Lech Mazur Writing (Comparison Score (self-reported)): leader GPT-5.5 x-high (3.4), 32 models
Lech Mazur Writing evaluates model capability on writing tasks from the linked upstream source with Comparison Score as the primary reported metric.
MedCode (Score (self-reported)): leader Gemini 3.1 Pro Preview high (Google) (59.06), 54 models
MedCode evaluates model capability on healthcare & medical tasks from the linked upstream source with Score as the primary reported metric.
MedScribe (Score (self-reported)): leader Claude Fable 5 max (Anthropic) (88.52), 53 models
Can models support doctors with their administrative work?
MILU (Score (self-reported)): leader Gemini 3.1 Pro Preview (93.6), 8 models
Multilingual knowledge-and-reasoning evaluation reported in Anthropic's Claude Opus 4.8 system card.
MortgageTax (Score (self-reported)): leader Claude Opus 4.7 max (Anthropic) (70.27), 72 models
Evaluates reading and understanding of tax certificates presented as images.
Multilingual Factual Questions Net Score (Net score (self-reported)): leader Claude Mythos Preview (48.0), 7 models
Closed-book multilingual factuality benchmark reported by Anthropic as net score: correct responses minus incorrect responses, with abstentions scoring zero.
OpenClaw Arena Model Leaderboard (Avg Score (self-reported)): leader claude-opus-4.5 (67.4), 13 models
Personal AI agent benchmark evaluating frontier models across real-world OpenClaw-style tasks.
PlaceboBench (Non-Hallucination Rate (self-reported)): leader Gemini 3 (73.91), 7 models
Medical-domain hallucination benchmark with labeled model answers to pharmaceutical questions grounded in EMA product information.
ProofBench (Score (self-reported)): leader Claude Fable 5 max (Anthropic) (77.0), 37 models
ProofBench evaluates model capability on math tasks from the linked upstream source with Score as the primary reported metric.
ProteinGym Hard (Score (self-reported)): leader Claude Mythos 5 (45.0), 6 models
Hard ProteinGym subset reported in Anthropic's Claude Opus 4.8 system card.
RealWorldQA (RealWorldQA (self-reported)): leader Qwen3.7 Plus (86.9), 12 models
RealWorldQA: Evaluates multimodal understanding across image, text, chart, diagram, or cross-modal reasoning tasks.
Rogo Big Finance Bench (Rubric Score (self-reported)): leader Claude Opus 4.7 (59.0), 10 models
Vendor-reported 928-question finance-agent benchmark spanning vertical-specific skills, metrics, financial-statement analysis, and forecasting workflows.
scBench (Accuracy (self-reported)): leader Claude Mythos 5 (59.3), 21 models
Bioinformatics agent benchmark with verifiable single-cell RNA-seq workflow tasks and deterministic graders.
TaxBench (Mean pass^5 (computed) (self-reported)): leader GPT-5.5 Pro (29.27), 16 models
TaxBench evaluates AI models on real-world tax tasks from Rivet's active tax workflows, spanning tax knowledge and judgment, tax calculations, and agentic data-retrieval question answering.
TaxEval v2 (Score (self-reported)): leader Muse Spark Meta (77.68), 109 models
A Vals-created set of tax questions and responses.
Vals Index (Score (self-reported)): leader Claude Fable 5 max (Anthropic) (75.14), 25 models
Benchmark reporting weighted performance across finance and coding tasks, indicating the potential impact that LLMs can have on the economy.
Vals Multimodal Index (Score (self-reported)): leader Claude Fable 5 max (Anthropic) (74.15), 20 models
Benchmark reporting weighted performance across finance, coding, and education tasks, indicating the potential impact that LLMs can have on the economy.
WildClawBench (Overall Score (self-reported)): leader Claude Opus 4.7 (62.2), 20 models
WildClawBench evaluates model capability on agentic tasks from the linked upstream source with Overall Score as the primary reported metric.
KernelBench Hub - Mega (Best Speedup vs Reference (x)): leader Claude Opus 4.8 (19.4), 8 models
kernelbench.com Mega suite (independent of Stanford KernelBench) — agentic GPU megakernel building; best speedup over a reference megakernel across RTX PRO 6000, H100, and B200 runs.
KernelBench Hub - Hard (Best % of Hardware Roofline): leader GLM-5.2 (26.0), 8 models
kernelbench.com Hard suite (independent of Stanford KernelBench) — agentic CUDA/Triton kernel optimization scored as percent of hardware roofline achieved, best across RTX PRO 6000, H100, and B200 runs.
Benchmarks.bio - BioSecBench-Refusal (Pass Rate (%)): leader Gemini 3.5 Flash (51.5), 10 models
BioSecBench-Refusal (benchmarks.bio) — biosecurity refusal and over-refusal benchmark; red-team prompts that should be refused plus routine biology work that should be answered, scored as combined pass rate.
Senior SWE-Bench (Tasteful Solve Rate (pass@1, %)): leader Claude Opus 4.8 (24.0), 10 models
Senior SWE-Bench (Snorkel) — 100 senior-engineer tasks from real production PRs (design-and-build, investigate-and-fix) with under-specified instructions; tasteful solve rate combines verifier passes, expert rubrics, and a code-quality taste judge.
TerminalWorld (Pass Rate (%, 200 verified tasks)): leader Claude Opus 4.7 (62.5), 8 models
We introduce TerminalWorld, a scalable data engine that automatically reverse-engineers high-fidelity evaluation tasks from \
Business Utility Eval (Business Utility (0-1)): leader Claude Opus 4.8 (0.42), 6 models
deepsense.ai's benchmark for how well LLMs perform realistic analytical business workflows — multi-step tasks over spreadsheets, reports and data where the model must reason to a business-useful answer rather than a single fact. Scored as a Business Utility rate; frontier models still score low (top model ~0.42), making it a hard, discriminating agentic-reasoning benchmark.
Vals AI Excel Modeling (Accuracy (%)): leader claude-opus-4-8 (69.37), 17 models
Vals AI Excel Modeling Benchmark — financial spreadsheet modeling (LBO, DCF, M&A, comps) built from templates and from scratch; accuracy of produced Excel models.
Vals AI CyberBench (Accuracy (%)): leader gpt-5.5 (80.51), 13 models
Vals AI CyberBench — offensive/defensive cybersecurity tasks run agentically; overall accuracy on private cyber task set.
Vals AI Harvey Legal Agent Bench (Accuracy (%)): leader claude-fable-5 (11.25), 15 models
Harvey's Legal Agent Benchmark (Vals AI) — agentic legal work across documents, spreadsheets, presentations, and file-system tools spanning practice areas (M&A, antitrust, capital markets, tax); final score percentage.
Vals AI Legal Research Bench (Accuracy (%)): leader claude-opus-4-8 (43.75), 14 models
Vals AI Legal Research Bench — legal research questions requiring case-law and statute lookup with citations; accuracy on private legal research task set.
Vals AI ProgramBench (Raw Pass Rate (%)): leader claude-fable-5 (76.8), 25 models
ProgramBench (Vals AI run) — rebuild complete programs from binaries and documentation; raw pass rate is the average percent of hidden behavioral tests passed per task.
Vals AI SkillsBench (Accuracy (%)): leader gpt-5.5-codex (62.55), 12 models
Vals AI SkillsBench — evaluates how well models use skill files (procedural instructions plus scripts) to complete office/document tasks; accuracy per task set.
Vals AI Code Migration (Accuracy (%)): leader claude-fable-5 (55.06), 24 models
Vals AI Code Migration — large-scale code migration tasks (framework/language version upgrades) run agentically; accuracy on private migration task set.
Foresight Bench (Forecast Skill (100 - Brier x 100)): leader claude-opus-4.8 (91.61), 13 models
Foresight Bench (Aleatoric AI) — LLM forecasting on real-world open questions; models keep updating predictions as events unfold, scored by mean Brier on resolved questions (reported as 100 − Brier×100).
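
Two of the scoring rules described in the entries above are simple enough to sketch: the net score used by the closed-book factuality entries (correct minus incorrect, abstentions scoring zero) and the Foresight Bench forecast skill (100 − Brier×100). The following is a minimal illustration, not the upstream implementations. It assumes that net score is expressed as a percentage of all questions asked and that forecasts are probabilities on binary questions; the function names are hypothetical.

```python
from typing import Sequence


def net_score(correct: int, incorrect: int, abstained: int) -> float:
    """Correct minus incorrect responses; abstentions contribute zero.

    Assumption: the result is normalized to a percentage of all questions.
    """
    total = correct + incorrect + abstained
    if total == 0:
        raise ValueError("at least one question is required")
    return 100.0 * (correct - incorrect) / total


def forecast_skill(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Return 100 - 100 * mean Brier score over resolved questions.

    Assumption: each question is binary, `probs` holds the predicted
    probability of the positive outcome, and `outcomes` holds 0 or 1.
    """
    if len(probs) != len(outcomes) or len(probs) == 0:
        raise ValueError("probs and outcomes must be non-empty and equal length")
    brier = sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)
    return 100.0 - 100.0 * brier


if __name__ == "__main__":
    # 60 correct, 20 incorrect, 20 abstentions out of 100 questions.
    print(net_score(60, 20, 20))             # 40.0
    # Always answering 0.5 yields a Brier score of 0.25.
    print(forecast_skill([0.5, 0.5], [1, 0]))  # 75.0
```

Under these assumptions, a model that abstains instead of guessing wrongly raises its net score, and a forecaster that always answers 0.5 lands at a skill of 75.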

New Models (1)

Claude Haiku 4.5 (20251001) — ELO 1956, #161
- ForecastBench: 64.5 (#64/223)

Top-10 New Scores (10)

Claude Fable 5 on HalluHard: 0.4327 (#3)
Claude Fable 5 on TrackingAI IQ Test: 93.75 (#2)
Claude Fable 5 on TrackingAI IQ Test (Offline): 81.25 (#2)
Claude Mythos Preview on ExploitBench v8-bench: 78.0 (#1)
Claude Opus 4.8 on SQL Capability - Dialect Conversion: 74.0 (#12)
Claude Opus 4.8 on SQL Capability - SQL Optimization: 67.1 (#4)
Claude Opus 4.8 on SQL Capability - SQL Understanding: 77.1 (#20)
Claude Opus 4.8 on SQL Capability Leaderboard: 72.73 (#12)
Claude Opus 4.8 on Vals AI SWE-bench Verified: 88.6 (#2)
Qwen 3.7 Max on Vals AI SWE-bench Verified: 68.8 (#48)

New #1 Leaders (4)

VoxelBench: Claude Fable 5 (Max) (2233.0) beat GPT-5.5 Pro by 198.0
Vals AI SWE-bench Verified: Claude Fable 5 (95.0) beat GPT-5.5 by 12.4
TrackingAI IQ Test (Vision): Claude Opus 4 (Thinking) (87.5) beat GPT-5 Pro by 5.15
ForecastBench: Cassi-2026-05-10 (69.1) beat Gemini by 0.7
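
The leader-change lines above follow a regular shape ("benchmark: leader (score) beat runner-up by margin"), so they can be read mechanically. The sketch below is one possible parser, with the pattern inferred from the entries in this digest rather than from any published schema. The helper names are hypothetical, and the runner-up calculation assumes a higher-is-better metric where the margin equals the leader's score minus the runner-up's.

```python
import re
from typing import NamedTuple, Optional


class LeaderChange(NamedTuple):
    benchmark: str
    leader: str
    score: float
    runner_up: str
    margin: float


# Pattern inferred from the digest lines (an assumption, not an official format).
# The leader name may itself contain parentheses, e.g. "Claude Opus 4 (Thinking)".
_LINE = re.compile(
    r"^(?P<benchmark>.+?): (?P<leader>.+) \((?P<score>-?\d+(?:\.\d+)?)\)"
    r" beat (?P<runner_up>.+) by (?P<margin>-?\d+(?:\.\d+)?)$"
)


def parse_leader_line(line: str) -> Optional[LeaderChange]:
    """Parse one leader-change line; return None if it does not match."""
    m = _LINE.match(line.strip())
    if m is None:
        return None
    return LeaderChange(
        benchmark=m["benchmark"],
        leader=m["leader"],
        score=float(m["score"]),
        runner_up=m["runner_up"],
        margin=float(m["margin"]),
    )


def implied_runner_up_score(change: LeaderChange) -> float:
    """Leader score minus margin, rounded to two decimals.

    Assumption: higher is better and margin = leader - runner-up.
    """
    return round(change.score - change.margin, 2)


if __name__ == "__main__":
    change = parse_leader_line(
        "ForecastBench: Cassi-2026-05-10 (69.1) beat Gemini by 0.7"
    )
    print(change)
    print(implied_runner_up_score(change))
```

For lower-is-better metrics the subtraction would need its sign flipped, so the implied score is only a convenience under the stated assumption.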

AI Benchmark Digest — 2026-07-02

2026-07-02T07:19:30.237573+00:00

Daily

New Benchmarks (1)

SWE-Together (Jury Score (%)): leader Claude Opus 4.8 (80.06), 7 models
Interactive coding-agent evaluation over 109 real multi-turn software tasks (bugfix, feature, refactor) drawn from 27 open-source repositories across TypeScript, Go, Python and Rust. Agents work through multi-turn user intents in a harness (OpenCode, mini-swe-agent); each attempt is scored by a rubric jury on goal completion, averaged across tasks and replicates.

Top-10 New Scores (20)

Claude Fable 5 on BridgeBench Debugging: 86.2 (#9)
Claude Fable 5 on BridgeBench Hallucination: 75.9 (#15)
Claude Fable 5 on BridgeBench Refactoring: 73.6 (#3)
Claude Fable 5 on BridgeBench Security: 51.1 (#17)
Claude Fable 5 on BridgeBench UI: 79.3 (#7)
Claude Fable 5 on HieroglyphBench: 22.9 (#4)
Claude Opus 4.8 on BridgeBench Debugging: 86.3 (#8)
Claude Opus 4.8 on BridgeBench Hallucination: 74.9 (#18)
Claude Opus 4.8 on BridgeBench Refactoring: 67.1 (#12)
Claude Opus 4.8 on SEAL - Remote Labor Index (RLI): 8.33 (#2)
GPT-5.4 Pro on Epoch AI - Critpt: 30.0 (#2)
GPT-5.5 on BridgeBench UI: 77.0 (#11)
GPT-5.5 on Epoch AI - Critpt: 1.43 (#36)
Gemini 3.1 Pro (Preview) on Epoch AI - Critpt: 17.71 (#11)
Qwen 3.7 Max on BridgeBench Debugging: 86.7 (#3)
Qwen 3.7 Max on BridgeBench Hallucination: 76.6 (#10)
Qwen 3.7 Max on BridgeBench Refactoring: 65.0 (#16)
Qwen 3.7 Max on BridgeBench Security: 85.3 (#3)
Qwen 3.7 Max on BridgeBench UI: 67.5 (#17)
Qwen 3.7 Max on Epoch AI - Critpt: 13.43 (#13)

New #1 Leaders (4)

Epoch AI - Critpt: GPT-5.5 Pro (xHigh) (30.57) beat GPT-5 (High) by 17.97
Epoch AI - Rli: Claude Fable 5 (16.1) beat Claude Opus 4.6 (Unknown) by 11.93
SEAL - Remote Labor Index (RLI): Fable-5 (16.1) beat Claude Opus 4.6 by 11.93
BridgeBench UI: Claude Opus 4.8 (84.5) beat Claude Sonnet 4.6 by 3.0

AI Benchmark Digest — 2026-07-01

2026-07-01T07:29:27.012103+00:00

Daily

New Benchmarks (25)

LLM2014 Logic 2026-07 (Median Score): leader GPT-5.5 (xhigh) (80.47), 42 models
LLM Stats (HealthBench Professional) (Score (%)): leader Claude Fable 5 (66.0), 5 models
Slovenian OCR Benchmark (Score (%)): leader Kimi-K2.6 (98.65), 20 models
Benchmark scoring optical character recognition quality on Slovenian-language document images, comparing LLMs against specialized OCR tools.
ARFBench (Accuracy (%)): leader Toto-1.0-QA-Experimental 32B (TSFM-VLM) (63.9), 16 models
Datadog's Agentic Reasoning & Function-calling benchmark, scoring LLMs on multi-step tool-use trajectories in realistic observability and operations workflows.
SciEval - Overall (Overall (%)): leader GPT-5 (65.59), 19 models
SciEval multimodal scientific-reasoning leaderboard — overall accuracy across exam-style science questions.
SciEval - Physics (Physics (%)): leader Qwen3-VL-235B-A22B (49.93), 19 models
SciEval multimodal scientific-reasoning leaderboard — physics-discipline accuracy.
SciEval - Chemistry (Chemistry (%)): leader Gemini-3-Pro (83.08), 19 models
SciEval multimodal scientific-reasoning leaderboard — chemistry-discipline accuracy.
SciEval - Life Sciences (Life Sciences (%)): leader GPT-o3 (61.57), 19 models
SciEval multimodal scientific-reasoning leaderboard — life-sciences-discipline accuracy.
SciEval - Materials Science (Materials Science (%)): leader GPT-o3 (93.85), 19 models
SciEval multimodal scientific-reasoning leaderboard — materials-science-discipline accuracy.
VANTAGE-Bench (Overall (%)): leader Cosmos3-Super (63.01), 10 models
VANTAGE-Bench evaluating vision-language models on spatial, temporal, and semantic video understanding (overall %).
SciEval - Earth Sciences (Earth Sciences (%)): leader Kimi-k2 (77.77), 19 models
SciEval multimodal scientific-reasoning leaderboard — earth-sciences-discipline accuracy.
SciEval - Astronomy (Astronomy (%)): leader Qwen3-Max (75.64), 19 models
SciEval multimodal scientific-reasoning leaderboard — astronomy-discipline accuracy.
VANTAGE-Bench - Spatial (Spatial (%)): leader Qwen3.5-27B (79.04), 10 models
VANTAGE-Bench vision-language video-understanding leaderboard — spatial accuracy.
VANTAGE-Bench - Temporal (Temporal (%)): leader Gemini 3.1 Pro (40.74), 10 models
VANTAGE-Bench vision-language video-understanding leaderboard — temporal accuracy.
VANTAGE-Bench - Semantic (Semantic (%)): leader Cosmos-Reason2-32B (71.94), 10 models
VANTAGE-Bench vision-language video-understanding leaderboard — semantic accuracy.
VANTAGE-Bench - Video QA (Video QA (%)): leader Gemini 3.1 Pro (71.88), 10 models
VANTAGE-Bench vision-language video-understanding leaderboard — video question-answering accuracy.
VANTAGE-Bench - Event Verification (Event Verification (%)): leader Cosmos-Reason2-32B (73.58), 10 models
VANTAGE-Bench vision-language video-understanding leaderboard — event-verification accuracy.
VANTAGE-Bench - Single Object Tracking (Single Object Tracking (%)): leader Gemini 3.1 Pro (64.35), 10 models
VANTAGE-Bench vision-language video-understanding leaderboard — single-object-tracking accuracy.
SRE Skills Bench (Average (%)): leader Gemini-3.1-pro (98.8), 23 models
SRE Skills Bench — average accuracy across cloud and site-reliability-engineering tasks (AWS services, Kubernetes, networking).
SRE Skills Bench - S3 (Accuracy (%)): leader Gemini-3.1-pro (100.0), 23 models
SRE Skills Bench — accuracy on AWS S3 storage tasks.
SRE Skills Bench - IAM (Accuracy (%)): leader Gemini-3.1-pro (100.0), 23 models
SRE Skills Bench — accuracy on AWS IAM identity-and-access tasks.
SRE Skills Bench - Kubernetes (Accuracy (%)): leader Gemini-3.1-pro (99.3), 23 models
SRE Skills Bench — accuracy on Kubernetes operations tasks.
SRE Skills Bench - GMCQ (Accuracy (%)): leader Gemini-3.1-pro (92.0), 23 models
SRE Skills Bench — accuracy on the general multiple-choice site-reliability questions.
SRE Skills Bench - VPC (Accuracy (%)): leader Gemini-3.1-pro (99.3), 23 models
SRE Skills Bench — accuracy on AWS VPC networking tasks.
SRE Skills Bench - Storage (Accuracy (%)): leader opus-4.7 (100.0), 23 models
SRE Skills Bench — accuracy on cloud storage tasks.

New Models (1)

Claude Sonnet 5 — ELO 2302, #24
- LLM Stats (GDPval-AA): 1618.0 (#3/34)
- LLM Stats (Legal Agent Benchmark): 5.8 (#3/12)
- LLM Stats (OSWorld-Verified): 81.2 (#3/18)
- LLM Stats (CharXiv-R): 88.3 (#4/43)
- LLM Stats (OfficeQA Pro): 59.4 (#4/6)
- LLM Stats (Toolathlon): 54.3 (#5/24)
- LLM Stats (BrowseComp): 84.7 (#7/52)
- Ramp SWE-Bench: 73.75 (#9/20)
- YC-Bench: 1163.5 (#9/28)
- RuneBench: 2058.0 (#15/26)

New #1 Leaders (3)

Design Arena (Video): gemini-omni-flash (1483.0) beat seedance-2.0-mini by 145.0
WorldScore - Motion Smoothness: Inspatio-World (82.71) beat WorldScape-0.1 by 1.2
WorldScore - Dynamic: Inspatio-World (74.1) beat EvoPhys-World by 0.32

AI Benchmark Digest — 2026-06-28

2026-06-28T07:23:16.408329+00:00

Daily

New #1 Leaders (1)

CADGenBench: GPT-5.5 (xHigh) (0.4573) beat Claude Fable 5 by 0.01

Weekly

New Benchmarks (190)

CoffeeBench (Mean net income ($)): leader GPT-5.5 (3109.19), 7 models
CoffeeBench evaluates long-horizon business agents in a simulated coffee supply chain, scoring average net income after 90 days of multi-agent trade.
Pangram AI-Text Detection Hard Set (zero-shot) (Accuracy (%)): leader GPT-5.5 (86.8), 2 models
Pangram AI-text detection hard-set evaluation measuring zero-shot classification of human-written versus AI-generated mirrored passages.
Pangram AI-Text Detection Hard Set (8-shot) (Accuracy (%)): leader GPT-5.5 (96.2), 2 models
Pangram AI-text detection hard-set evaluation measuring whether models classify human-written versus AI-generated mirrored passages after in-context examples.
LiveMedBench (Overall Score (%)): leader GPT-5.2 (39.23), 38 models
Live medical benchmark with time-stamped real-world cases and after-cutoff scoring for measuring medical model robustness over time.
Benchmarks.bio - scBench-Long (Pass Rate (%)): leader Claude Opus 4.8 (28.57), 12 models
Long-form Benchmarks.bio single-cell RNA-seq tasks requiring multi-step biological data analysis, tool use, and synthesis over larger assay contexts.
HieroglyphBench (Sign accuracy (%)): leader Gemini 3.5 Flash (52.5), 11 models
HieroglyphBench evaluates VLM OCR on ancient Egyptian hieroglyph columns, requiring ordered Gardiner sign-code transcription scored by edit-distance accuracy.
NatureBench (Surpass-SOTA (%)): leader Claude Opus 4.7 (17.78), 10 models
NatureBench evaluates coding agents on reproducing published Nature-family paper SOTA results across scientific machine-learning tasks.
EnterpriseClawBench (Claude Code) (Primary score (%)): leader Claude Sonnet 4.6 (64.41), 3 models
EnterpriseClawBench measures Claude Code harness performance on reproducible workplace-session tasks with recovered files, tools, deliverables, and semantic rubrics.
EnterpriseClawBench (Codex) (Primary score (%)): leader GPT-5.5 (66.32), 2 models
EnterpriseClawBench measures Codex harness performance on reproducible workplace-session tasks with recovered files, tools, deliverables, and semantic rubrics.
EnterpriseClawBench (DeepAgents) (Primary score (%)): leader Claude Sonnet 4.6 (63.18), 9 models
EnterpriseClawBench measures DeepAgents harness performance on reproducible workplace-session tasks with recovered files, tools, deliverables, and semantic rubrics.
EnterpriseClawBench (Hermes) (Primary score (%)): leader GPT-5.5 (61.96), 9 models
EnterpriseClawBench measures Hermes harness performance on reproducible workplace-session tasks with recovered files, tools, deliverables, and semantic rubrics.
EnterpriseClawBench (OpenClaw) (Primary score (%)): leader Claude Sonnet 4.6 (62.3), 9 models
EnterpriseClawBench measures OpenClaw harness performance on reproducible workplace-session tasks with recovered files, tools, deliverables, and semantic rubrics.
Evals for Every Language - Language fi (Average Score (%)): leader gemini-3.1-pro-preview (73.62), 71 models
LanguageBench score for BCP-47 language code fi, averaging available language-specific task scores for each model.
Evals for Every Language - Language fil (Average Score (%)): leader gemini-3.1-pro-preview (67.25), 71 models
LanguageBench score for BCP-47 language code fil, averaging available language-specific task scores for each model.
Evals for Every Language - Language fj (Average Score (%)): leader gemini-3.1-pro-preview (67.55), 71 models
LanguageBench score for BCP-47 language code fj, averaging available language-specific task scores for each model.
Evals for Every Language - Language fo (Average Score (%)): leader gemini-2.5-pro (63.87), 71 models
LanguageBench score for BCP-47 language code fo, averaging available language-specific task scores for each model.
Evals for Every Language - Language fr (Average Score (%)): leader gemini-3.1-pro-preview (78.52), 71 models
LanguageBench score for BCP-47 language code fr, averaging available language-specific task scores for each model.
Evals for Every Language - Language fuv (Average Score (%)): leader step-3.7-flash-20260528 (38.18), 71 models
LanguageBench score for BCP-47 language code fuv, averaging available language-specific task scores for each model.
Evals for Every Language - Language ga (Average Score (%)): leader claude-sonnet-4.5 (75.44), 71 models
LanguageBench score for BCP-47 language code ga, averaging available language-specific task scores for each model.
Evals for Every Language - Language gd (Average Score (%)): leader claude-opus-4.8 (72.85), 71 models
LanguageBench score for BCP-47 language code gd, averaging available language-specific task scores for each model.
Evals for Every Language - Language gl (Average Score (%)): leader gemini-3.1-pro-preview (73.21), 71 models
LanguageBench score for BCP-47 language code gl, averaging available language-specific task scores for each model.
Evals for Every Language - Language gn (Average Score (%)): leader gemini-3.1-pro-preview (67.45), 71 models
LanguageBench score for BCP-47 language code gn, averaging available language-specific task scores for each model.
Evals for Every Language - Language gom (Average Score (%)): leader gemini-3.1-pro-preview (67.02), 71 models
LanguageBench score for BCP-47 language code gom, averaging available language-specific task scores for each model.
Evals for Every Language - Language gu (Average Score (%)): leader gemini-3.1-pro-preview (71.19), 71 models
LanguageBench score for BCP-47 language code gu, averaging available language-specific task scores for each model.
Evals for Every Language - Language ha (Average Score (%)): leader gemini-3.1-pro-preview (70.78), 71 models
LanguageBench score for BCP-47 language code ha, averaging available language-specific task scores for each model.
Evals for Every Language - Language he (Average Score (%)): leader claude-sonnet-4.5 (77.0), 71 models
LanguageBench score for BCP-47 language code he, averaging available language-specific task scores for each model.
Evals for Every Language - Language hi (Average Score (%)): leader claude-opus-4.7 (77.18), 71 models
LanguageBench score for BCP-47 language code hi, averaging available language-specific task scores for each model.
Evals for Every Language - Language hr (Average Score (%)): leader gemini-3.1-pro-preview (74.28), 71 models
LanguageBench score for BCP-47 language code hr, averaging available language-specific task scores for each model.
Evals for Every Language - Language ht (Average Score (%)): leader gemini-3.1-pro-preview (74.63), 71 models
LanguageBench score for BCP-47 language code ht, averaging available language-specific task scores for each model.
Evals for Every Language - Language hu (Average Score (%)): leader claude-opus-4.8 (74.93), 71 models
LanguageBench score for BCP-47 language code hu, averaging available language-specific task scores for each model.
Evals for Every Language - Language hy (Average Score (%)): leader claude-sonnet-4.5 (70.28), 71 models
LanguageBench score for BCP-47 language code hy, averaging available language-specific task scores for each model.
Evals for Every Language - Language id (Average Score (%)): leader claude-sonnet-4.5 (74.98), 71 models
LanguageBench score for BCP-47 language code id, averaging available language-specific task scores for each model.
Evals for Every Language - Language ig (Average Score (%)): leader gemini-3.1-pro-preview (74.28), 71 models
LanguageBench score for BCP-47 language code ig, averaging available language-specific task scores for each model.
Evals for Every Language - Language ilo (Average Score (%)): leader step-3.7-flash-20260528 (95.0), 71 models
LanguageBench score for BCP-47 language code ilo, averaging available language-specific task scores for each model.
Evals for Every Language - Language is (Average Score (%)): leader gemini-3.1-flash-lite (73.86), 71 models
LanguageBench score for BCP-47 language code is, averaging available language-specific task scores for each model.
Evals for Every Language - Language it (Average Score (%)): leader claude-opus-4.7 (73.92), 71 models
LanguageBench score for BCP-47 language code it, averaging available language-specific task scores for each model.
Evals for Every Language - Language ja (Average Score (%)): leader gemini-3.1-pro-preview (71.18), 71 models
LanguageBench score for BCP-47 language code ja, averaging available language-specific task scores for each model.
Evals for Every Language - Language jv (Average Score (%)): leader gemini-3.1-pro-preview (75.74), 71 models
LanguageBench score for BCP-47 language code jv, averaging available language-specific task scores for each model.
Evals for Every Language - Language ka (Average Score (%)): leader step-3.7-flash-20260528 (71.13), 71 models
LanguageBench score for BCP-47 language code ka, averaging available language-specific task scores for each model.
Evals for Every Language - Language kk (Average Score (%)): leader gemini-3.1-pro-preview (72.01), 71 models
LanguageBench score for BCP-47 language code kk, averaging available language-specific task scores for each model.
Evals for Every Language - Language km (Average Score (%)): leader step-3.7-flash-20260528 (76.39), 71 models
LanguageBench score for BCP-47 language code km, averaging available language-specific task scores for each model.
Evals for Every Language - Language kn (Average Score (%)): leader gemini-3.1-pro-preview (72.55), 71 models
LanguageBench score for BCP-47 language code kn, averaging available language-specific task scores for each model.
Evals for Every Language - Language ko (Average Score (%)): leader gemini-3.1-pro-preview (69.88), 71 models
LanguageBench score for BCP-47 language code ko, averaging available language-specific task scores for each model.
Evals for Every Language - Language ku (Average Score (%)): leader gemini-3.1-pro-preview (70.01), 71 models
LanguageBench score for BCP-47 language code ku, averaging available language-specific task scores for each model.
Evals for Every Language - Language ky (Average Score (%)): leader gemini-3.1-pro-preview (70.04), 71 models
LanguageBench score for BCP-47 language code ky, averaging available language-specific task scores for each model.
Evals for Every Language - Language lb (Average Score (%)): leader claude-sonnet-4.5 (72.87), 71 models
LanguageBench score for BCP-47 language code lb, averaging available language-specific task scores for each model.
Evals for Every Language - Language lg (Average Score (%)): leader gemini-3.1-pro-preview (66.8), 71 models
LanguageBench score for BCP-47 language code lg, averaging available language-specific task scores for each model.
Evals for Every Language - Language li (Average Score (%)): leader gemini-3.1-pro-preview (73.08), 71 models
LanguageBench score for BCP-47 language code li, averaging available language-specific task scores for each model.
Evals for Every Language - Language lij (Average Score (%)): leader claude-sonnet-4.5 (72.19), 71 models
LanguageBench score for BCP-47 language code lij, averaging available language-specific task scores for each model.
Evals for Every Language - Language lmo (Average Score (%)): leader gemini-3.1-pro-preview (67.8), 71 models
LanguageBench score for BCP-47 language code lmo, averaging available language-specific task scores for each model.
Evals for Every Language - Language ln (Average Score (%)): leader gemini-3.1-pro-preview (66.34), 71 models
LanguageBench score for BCP-47 language code ln, averaging available language-specific task scores for each model.
Evals for Every Language - Language lo (Average Score (%)): leader gemini-3.1-pro-preview (73.65), 71 models
LanguageBench score for BCP-47 language code lo, averaging available language-specific task scores for each model.
Evals for Every Language - Language lt (Average Score (%)): leader gemini-3.1-pro-preview (74.4), 71 models
LanguageBench score for BCP-47 language code lt, averaging available language-specific task scores for each model.
Evals for Every Language - Language ltg (Average Score (%)): leader gemini-3.1-pro-preview (71.23), 71 models
LanguageBench score for BCP-47 language code ltg, averaging available language-specific task scores for each model.
Evals for Every Language - Language luo (Average Score (%)): leader gemini-3.1-pro-preview (64.57), 71 models
LanguageBench score for BCP-47 language code luo, averaging available language-specific task scores for each model.
Evals for Every Language - Language lv (Average Score (%)): leader claude-opus-4.7 (72.54), 71 models
LanguageBench score for BCP-47 language code lv, averaging available language-specific task scores for each model.
Evals for Every Language - Language mai (Average Score (%)): leader gemini-3.1-pro-preview (70.54), 71 models
LanguageBench score for BCP-47 language code mai, averaging available language-specific task scores for each model.
Evals for Every Language - Language mg (Average Score (%)): leader gemini-3.1-pro-preview (71.76), 71 models
LanguageBench score for BCP-47 language code mg, averaging available language-specific task scores for each model.
Evals for Every Language - Language mi (Average Score (%)): leader gemini-3.1-pro-preview (72.69), 71 models
LanguageBench score for BCP-47 language code mi, averaging available language-specific task scores for each model.
Evals for Every Language - Language min (Average Score (%)): leader gemini-3.1-pro-preview (71.94), 71 models
LanguageBench score for BCP-47 language code min, averaging available language-specific task scores for each model.
Evals for Every Language - Language mk (Average Score (%)): leader gemini-3.1-pro-preview (74.29), 71 models
LanguageBench score for BCP-47 language code mk, averaging available language-specific task scores for each model.
Evals for Every Language - Language ml (Average Score (%)): leader gemini-3.1-pro-preview (73.73), 71 models
LanguageBench score for BCP-47 language code ml, averaging available language-specific task scores for each model.
Evals for Every Language - Language mn (Average Score (%)): leader gemini-3.1-pro-preview (72.36), 71 models
LanguageBench score for BCP-47 language code mn, averaging available language-specific task scores for each model.
Evals for Every Language - Language mr (Average Score (%)): leader gemini-3.1-pro-preview (74.38), 71 models
LanguageBench score for BCP-47 language code mr, averaging available language-specific task scores for each model.
Evals for Every Language - Language ms (Average Score (%)): leader gemini-3.1-flash-lite (79.05), 71 models
LanguageBench score for BCP-47 language code ms, averaging available language-specific task scores for each model.
Evals for Every Language - Language mt (Average Score (%)): leader gemini-3.1-pro-preview (80.02), 71 models
LanguageBench score for BCP-47 language code mt, averaging available language-specific task scores for each model.
Evals for Every Language - Language my (Average Score (%)): leader gemini-3.1-pro-preview (71.09), 71 models
LanguageBench score for BCP-47 language code my, averaging available language-specific task scores for each model.
Evals for Every Language - Language nb (Average Score (%)): leader gemini-3.1-pro-preview (80.43), 71 models
LanguageBench score for BCP-47 language code nb, averaging available language-specific task scores for each model.
Evals for Every Language - Language ne (Average Score (%)): leader claude-opus-4.7 (71.29), 71 models
LanguageBench score for BCP-47 language code ne, averaging available language-specific task scores for each model.
Evals for Every Language - Language nl (Average Score (%)): leader gemini-3.1-flash-lite (75.03), 71 models
LanguageBench score for BCP-47 language code nl, averaging available language-specific task scores for each model.
Evals for Every Language - Language nn (Average Score (%)): leader gemini-3.1-pro-preview (64.37), 71 models
LanguageBench score for BCP-47 language code nn, averaging available language-specific task scores for each model.
Evals for Every Language - Language nso (Average Score (%)): leader gemini-3.1-pro-preview (71.26), 71 models
LanguageBench score for BCP-47 language code nso, averaging available language-specific task scores for each model.
Evals for Every Language - Language nus (Average Score (%)): leader claude-opus-4.7 (51.04), 71 models
LanguageBench score for BCP-47 language code nus, averaging available language-specific task scores for each model.
Evals for Every Language - Language ny (Average Score (%)): leader gemini-3.1-pro-preview (69.28), 71 models
LanguageBench score for BCP-47 language code ny, averaging available language-specific task scores for each model.
Evals for Every Language - Language oc (Average Score (%)): leader gemini-3.1-pro-preview (77.59), 71 models
LanguageBench score for BCP-47 language code oc, averaging available language-specific task scores for each model.
Evals for Every Language - Language om (Average Score (%)): leader gemini-3.1-pro-preview (68.18), 71 models
LanguageBench score for BCP-47 language code om, averaging available language-specific task scores for each model.
Evals for Every Language - Language or (Average Score (%)): leader claude-sonnet-4.5 (71.09), 71 models
LanguageBench score for BCP-47 language code or, averaging available language-specific task scores for each model.
Evals for Every Language - Language pa (Average Score (%)): leader gemini-3.1-pro-preview (77.29), 71 models
LanguageBench score for BCP-47 language code pa, averaging available language-specific task scores for each model.
Evals for Every Language - Language pag (Average Score (%)): leader gemini-3.1-pro-preview (72.21), 71 models
LanguageBench score for BCP-47 language code pag, averaging available language-specific task scores for each model.
Evals for Every Language - Language pap (Average Score (%)): leader gemini-3.1-flash-lite (75.13), 71 models
LanguageBench score for BCP-47 language code pap, averaging available language-specific task scores for each model.
Evals for Every Language - Language pl (Average Score (%)): leader gemini-3.1-pro-preview (71.87), 71 models
LanguageBench score for BCP-47 language code pl, averaging available language-specific task scores for each model.
Evals for Every Language - Language pt (Average Score (%)): leader gemini-3.1-flash-lite (77.5), 71 models
LanguageBench score for BCP-47 language code pt, averaging available language-specific task scores for each model.
Evals for Every Language - Language rn (Average Score (%)): leader gemini-3.1-pro-preview (64.86), 71 models
LanguageBench score for BCP-47 language code rn, averaging available language-specific task scores for each model.
Evals for Every Language - Language ro (Average Score (%)): leader deepseek-v4-flash-20260423 (75.52), 71 models
LanguageBench score for BCP-47 language code ro, averaging available language-specific task scores for each model.
Evals for Every Language - Language ru (Average Score (%)): leader gemini-3.1-pro-preview (75.78), 71 models
LanguageBench score for BCP-47 language code ru, averaging available language-specific task scores for each model.
Evals for Every Language - Language rw (Average Score (%)): leader gemini-3.1-pro-preview (67.18), 71 models
LanguageBench score for BCP-47 language code rw, averaging available language-specific task scores for each model.
Evals for Every Language - Language sa (Average Score (%)): leader step-3.7-flash-20260528 (64.4), 71 models
LanguageBench score for BCP-47 language code sa, averaging available language-specific task scores for each model.
Evals for Every Language - Language scn (Average Score (%)): leader gemini-3.1-pro-preview (70.92), 71 models
LanguageBench score for BCP-47 language code scn, averaging available language-specific task scores for each model.
Evals for Every Language - Language sd (Average Score (%)): leader claude-opus-4.7 (71.63), 71 models
LanguageBench score for BCP-47 language code sd, averaging available language-specific task scores for each model.
Evals for Every Language - Language sg (Average Score (%)): leader gemini-3.1-pro-preview (61.63), 71 models
LanguageBench score for BCP-47 language code sg, averaging available language-specific task scores for each model.
Evals for Every Language - Language shn (Average Score (%)): leader claude-opus-4.8 (62.21), 71 models
LanguageBench score for BCP-47 language code shn, averaging available language-specific task scores for each model.
Evals for Every Language - Language si (Average Score (%)): leader gemini-3.1-pro-preview (69.9), 71 models
LanguageBench score for BCP-47 language code si, averaging available language-specific task scores for each model.
Evals for Every Language - Language sk (Average Score (%)): leader claude-opus-4.8 (73.72), 71 models
LanguageBench score for BCP-47 language code sk, averaging available language-specific task scores for each model.
Evals for Every Language - Language sl (Average Score (%)): leader claude-sonnet-4.5 (76.86), 71 models
LanguageBench score for BCP-47 language code sl, averaging available language-specific task scores for each model.
Evals for Every Language - Language sm (Average Score (%)): leader claude-opus-4.8 (72.36), 71 models
LanguageBench score for BCP-47 language code sm, averaging available language-specific task scores for each model.
Evals for Every Language - Language sn (Average Score (%)): leader gemini-3.1-pro-preview (67.47), 71 models
LanguageBench score for BCP-47 language code sn, averaging available language-specific task scores for each model.
Evals for Every Language - Language so (Average Score (%)): leader claude-opus-4.7 (69.48), 71 models
LanguageBench score for BCP-47 language code so, averaging available language-specific task scores for each model.
Evals for Every Language - Language sq (Average Score (%)): leader gemini-3.1-pro-preview (75.41), 71 models
LanguageBench score for BCP-47 language code sq, averaging available language-specific task scores for each model.
Evals for Every Language - Language sr (Average Score (%)): leader gemini-3.1-pro-preview (76.0), 71 models
LanguageBench score for BCP-47 language code sr, averaging available language-specific task scores for each model.
Evals for Every Language - Language ss (Average Score (%)): leader gemini-3.1-pro-preview (69.6), 71 models
LanguageBench score for BCP-47 language code ss, averaging available language-specific task scores for each model.
Evals for Every Language - Language st (Average Score (%)): leader gemini-3.1-pro-preview (70.62), 71 models
LanguageBench score for BCP-47 language code st, averaging available language-specific task scores for each model.
Evals for Every Language - Language su (Average Score (%)): leader claude-opus-4.7 (70.66), 71 models
LanguageBench score for BCP-47 language code su, averaging available language-specific task scores for each model.
Evals for Every Language - Language sv (Average Score (%)): leader gemini-3.1-pro-preview (77.13), 71 models
LanguageBench score for BCP-47 language code sv, averaging available language-specific task scores for each model.
Evals for Every Language - Language sw (Average Score (%)): leader gemini-3.1-pro-preview (76.71), 71 models
LanguageBench score for BCP-47 language code sw, averaging available language-specific task scores for each model.
Evals for Every Language - Language szl (Average Score (%)): leader gpt-5.5 (71.4), 71 models
LanguageBench score for BCP-47 language code szl, averaging available language-specific task scores for each model.
Evals for Every Language - Language ta (Average Score (%)): leader gemini-2.5-pro (73.75), 71 models
LanguageBench score for BCP-47 language code ta, averaging available language-specific task scores for each model.
Evals for Every Language - Language te (Average Score (%)): leader claude-sonnet-4.5 (75.54), 71 models
LanguageBench score for BCP-47 language code te, averaging available language-specific task scores for each model.
Evals for Every Language - Language tg (Average Score (%)): leader step-3.7-flash-20260528 (72.55), 71 models
LanguageBench score for BCP-47 language code tg, averaging available language-specific task scores for each model.
Evals for Every Language - Language th (Average Score (%)): leader gemini-3.1-pro-preview (75.2), 71 models
LanguageBench score for BCP-47 language code th, averaging available language-specific task scores for each model.
Evals for Every Language - Language ti (Average Score (%)): leader gemini-3.1-pro-preview (66.67), 71 models
LanguageBench score for BCP-47 language code ti, averaging available language-specific task scores for each model.
Evals for Every Language - Language tk (Average Score (%)): leader gemini-3.1-pro-preview (70.31), 71 models
LanguageBench score for BCP-47 language code tk, averaging available language-specific task scores for each model.
Evals for Every Language - Language tn (Average Score (%)): leader gemini-3.1-pro-preview (66.73), 71 models
LanguageBench score for BCP-47 language code tn, averaging available language-specific task scores for each model.
Evals for Every Language - Language tr (Average Score (%)): leader gemini-3.1-pro-preview (76.46), 71 models
LanguageBench score for BCP-47 language code tr, averaging available language-specific task scores for each model.
Evals for Every Language - Language ts (Average Score (%)): leader gemini-3.1-pro-preview (71.05), 71 models
LanguageBench score for BCP-47 language code ts, averaging available language-specific task scores for each model.
Evals for Every Language - Language tt (Average Score (%)): leader claude-opus-4.8 (70.77), 71 models
LanguageBench score for BCP-47 language code tt, averaging available language-specific task scores for each model.
Evals for Every Language - Language ug (Average Score (%)): leader gemini-3.1-pro-preview (70.13), 71 models
LanguageBench score for BCP-47 language code ug, averaging available language-specific task scores for each model.
Evals for Every Language - Language uk (Average Score (%)): leader gemini-3.1-pro-preview (73.69), 71 models
LanguageBench score for BCP-47 language code uk, averaging available language-specific task scores for each model.
Evals for Every Language - Language ur (Average Score (%)): leader gemini-3.1-flash-lite (71.55), 71 models
LanguageBench score for BCP-47 language code ur, averaging available language-specific task scores for each model.
Evals for Every Language - Language uz (Average Score (%)): leader gemini-3.1-pro-preview (72.09), 71 models
LanguageBench score for BCP-47 language code uz, averaging available language-specific task scores for each model.
Evals for Every Language - Language vi (Average Score (%)): leader gemini-3.1-pro-preview (76.2), 71 models
LanguageBench score for BCP-47 language code vi, averaging available language-specific task scores for each model.
Evals for Every Language - Language wuu (Average Score (%)): leader gemini-2.5-flash (44.94), 71 models
LanguageBench score for BCP-47 language code wuu, averaging available language-specific task scores for each model.
Evals for Every Language - Language xh (Average Score (%)): leader gemini-3.1-pro-preview (68.89), 71 models
LanguageBench score for BCP-47 language code xh, averaging available language-specific task scores for each model.
Evals for Every Language - Language yi (Average Score (%)): leader gemini-3.1-flash-lite (72.51), 71 models
LanguageBench score for BCP-47 language code yi, averaging available language-specific task scores for each model.
Evals for Every Language - Language yo (Average Score (%)): leader gemini-3.1-pro-preview (67.1), 71 models
LanguageBench score for BCP-47 language code yo, averaging available language-specific task scores for each model.
Evals for Every Language - Language yue (Average Score (%)): leader claude-opus-4.7 (63.48), 71 models
LanguageBench score for BCP-47 language code yue, averaging available language-specific task scores for each model.
Evals for Every Language - Language zh (Average Score (%)): leader gemini-3.1-pro-preview (70.53), 71 models
LanguageBench score for BCP-47 language code zh, averaging available language-specific task scores for each model.
Evals for Every Language - Language zu (Average Score (%)): leader gemini-3.1-pro-preview (72.93), 71 models
LanguageBench score for BCP-47 language code zu, averaging available language-specific task scores for each model.
Evals for Every Language - Language an (Average Score (%)): leader claude-sonnet-4.5 (56.84), 70 models
LanguageBench score for BCP-47 language code an, averaging available language-specific task scores for each model.
Evals for Every Language - Language ast (Average Score (%)): leader gemini-3.1-pro-preview (64.58), 70 models
LanguageBench score for BCP-47 language code ast, averaging available language-specific task scores for each model.
Evals for Every Language - Language bo (Average Score (%)): leader gpt-5 (50.78), 70 models
LanguageBench score for BCP-47 language code bo, averaging available language-specific task scores for each model.
Evals for Every Language - Language brx (Average Score (%)): leader step-3.7-flash-20260528 (90.0), 70 models
LanguageBench score for BCP-47 language code brx, averaging available language-specific task scores for each model.
Evals for Every Language - Language bug (Average Score (%)): leader step-3.7-flash-20260528 (52.78), 70 models
LanguageBench score for BCP-47 language code bug, averaging available language-specific task scores for each model.
Evals for Every Language - Language dar (Average Score (%)): leader claude-opus-4.7 (43.4), 70 models
LanguageBench score for BCP-47 language code dar, averaging available language-specific task scores for each model.
Evals for Every Language - Language dyu (Average Score (%)): leader step-3.7-flash-20260528 (70.0), 70 models
LanguageBench score for BCP-47 language code dyu, averaging available language-specific task scores for each model.
Evals for Every Language - Language fon (Average Score (%)): leader step-3.7-flash-20260528 (60.0), 70 models
LanguageBench score for BCP-47 language code fon, averaging available language-specific task scores for each model.
Evals for Every Language - Language fur (Average Score (%)): leader gemini-2.5-pro (59.36), 70 models
LanguageBench score for BCP-47 language code fur, averaging available language-specific task scores for each model.
Evals for Every Language - Language hne (Average Score (%)): leader claude-opus-4.7 (51.52), 70 models
LanguageBench score for BCP-47 language code hne, averaging available language-specific task scores for each model.
Evals for Every Language - Language kab (Average Score (%)): leader gemini-3.1-pro-preview (49.02), 70 models
LanguageBench score for BCP-47 language code kab, averaging available language-specific task scores for each model.
Evals for Every Language - Language kac (Average Score (%)): leader step-3.7-flash-20260528 (70.0), 70 models
LanguageBench score for BCP-47 language code kac, averaging available language-specific task scores for each model.
Evals for Every Language - Language kam (Average Score (%)): leader gemini-3.1-pro-preview (40.75), 70 models
LanguageBench score for BCP-47 language code kam, averaging available language-specific task scores for each model.
Evals for Every Language - Language kea (Average Score (%)): leader gemini-2.5-pro (62.21), 70 models
LanguageBench score for BCP-47 language code kea, averaging available language-specific task scores for each model.
Evals for Every Language - Language ki (Average Score (%)): leader gemini-2.5-pro (43.55), 70 models
LanguageBench score for BCP-47 language code ki, averaging available language-specific task scores for each model.
Evals for Every Language - Language kmb (Average Score (%)): leader step-3.7-flash-20260528 (80.0), 70 models
LanguageBench score for BCP-47 language code kmb, averaging available language-specific task scores for each model.
Evals for Every Language - Language ks (Average Score (%)): leader gpt-5.5 (45.36), 70 models
LanguageBench score for BCP-47 language code ks, averaging available language-specific task scores for each model.
Evals for Every Language - Language lua (Average Score (%)): leader step-3.7-flash-20260528 (51.75), 70 models
LanguageBench score for BCP-47 language code lua, averaging available language-specific task scores for each model.
Evals for Every Language - Language mag (Average Score (%)): leader gemini-3.1-pro-preview (59.83), 70 models
LanguageBench score for BCP-47 language code mag, averaging available language-specific task scores for each model.
Evals for Every Language - Language mfe (Average Score (%)): leader gemini-3.1-pro-preview (57.05), 70 models
LanguageBench score for BCP-47 language code mfe, averaging available language-specific task scores for each model.
Evals for Every Language - Language mni (Average Score (%)): leader gemini-3.1-pro-preview (49.55), 70 models
LanguageBench score for BCP-47 language code mni, averaging available language-specific task scores for each model.
Evals for Every Language - Language mos (Average Score (%)): leader gemini-3.1-pro-preview (40.3), 70 models
LanguageBench score for BCP-47 language code mos, averaging available language-specific task scores for each model.
Evals for Every Language - Language myv (Average Score (%)): leader gemini-3.1-pro-preview (50.07), 70 models
LanguageBench score for BCP-47 language code myv, averaging available language-specific task scores for each model.
Evals for Every Language - Language nqo (Average Score (%)): leader gemma-4-31b-it-20260402 (66.67), 70 models
LanguageBench score for BCP-47 language code nqo, averaging available language-specific task scores for each model.
Evals for Every Language - Language sat (Average Score (%)): leader gpt-5.5 (44.28), 70 models
LanguageBench score for BCP-47 language code sat, averaging available language-specific task scores for each model.
Evals for Every Language - Language sc (Average Score (%)): leader gemini-3.1-flash-lite (60.0), 70 models
LanguageBench score for BCP-47 language code sc, averaging available language-specific task scores for each model.
Evals for Every Language - Language tpi (Average Score (%)): leader step-3.7-flash-20260528 (51.21), 70 models
LanguageBench score for BCP-47 language code tpi, averaging available language-specific task scores for each model.
Evals for Every Language - Language tum (Average Score (%)): leader gemini-2.5-pro (49.09), 70 models
LanguageBench score for BCP-47 language code tum, averaging available language-specific task scores for each model.
Evals for Every Language - Language tyv (Average Score (%)): leader gemini-2.5-pro (51.21), 70 models
LanguageBench score for BCP-47 language code tyv, averaging available language-specific task scores for each model.
Evals for Every Language - Language umb (Average Score (%)): leader gemini-3.1-pro-preview (42.38), 70 models
LanguageBench score for BCP-47 language code umb, averaging available language-specific task scores for each model.
Evals for Every Language - Language vec (Average Score (%)): leader gemini-2.5-pro (54.06), 70 models
LanguageBench score for BCP-47 language code vec, averaging available language-specific task scores for each model.
Evals for Every Language - Language vmw (Average Score (%)): leader gemini-3.1-pro-preview (43.07), 70 models
LanguageBench score for BCP-47 language code vmw, averaging available language-specific task scores for each model.
Evals for Every Language - Language war (Average Score (%)): leader step-3.7-flash-20260528 (66.36), 70 models
LanguageBench score for BCP-47 language code war, averaging available language-specific task scores for each model.
Evals for Every Language - Language wo (Average Score (%)): leader gemini-3.1-pro-preview (56.45), 70 models
LanguageBench score for BCP-47 language code wo, averaging available language-specific task scores for each model.
Evals for Every Language - Language zgh (Average Score (%)): leader claude-sonnet-4.6 (46.84), 70 models
LanguageBench score for BCP-47 language code zgh, averaging available language-specific task scores for each model.
Evals for Every Language - Language co (Average Score (%)): leader jamba-large-1.7 (100.0), 69 models
LanguageBench score for BCP-47 language code co, averaging available language-specific task scores for each model.
Evals for Every Language - Language crs (Average Score (%)): leader claude-opus-4.7 (100.0), 69 models
LanguageBench score for BCP-47 language code crs, averaging available language-specific task scores for each model.
Evals for Every Language - Language la (Average Score (%)): leader claude-opus-4.7 (100.0), 69 models
LanguageBench score for BCP-47 language code la, averaging available language-specific task scores for each model.
Evals for Every Language - Language ps (Average Score (%)): leader gemini-3.1-flash-lite (100.0), 69 models
LanguageBench score for BCP-47 language code ps, averaging available language-specific task scores for each model.
Evals for Every Language - Language yua (Average Score (%)): leader claude-opus-4.7 (100.0), 69 models
LanguageBench score for BCP-47 language code yua, averaging available language-specific task scores for each model.
Evals for Every Language - Language ach (Average Score (%)): leader trinity-mini-20251201 (100.0), 68 models
LanguageBench score for BCP-47 language code ach, averaging available language-specific task scores for each model.
Evals for Every Language - Language bbc (Average Score (%)): leader claude-opus-4.7 (100.0), 68 models
LanguageBench score for BCP-47 language code bbc, averaging available language-specific task scores for each model.
Evals for Every Language - Language bew (Average Score (%)): leader nova-2-lite-v1 (100.0), 68 models
LanguageBench score for BCP-47 language code bew, averaging available language-specific task scores for each model.
Evals for Every Language - Language bik (Average Score (%)): leader claude-opus-4.8 (100.0), 68 models
LanguageBench score for BCP-47 language code bik, averaging available language-specific task scores for each model.
Evals for Every Language - Language br (Average Score (%)): leader nova-2-lite-v1 (100.0), 68 models
LanguageBench score for BCP-47 language code br, averaging available language-specific task scores for each model.
Evals for Every Language - Language bua (Average Score (%)): leader trinity-mini-20251201 (100.0), 68 models
LanguageBench score for BCP-47 language code bua, averaging available language-specific task scores for each model.
Evals for Every Language - Language dv (Average Score (%)): leader nova-2-lite-v1 (100.0), 68 models
LanguageBench score for BCP-47 language code dv, averaging available language-specific task scores for each model.
Evals for Every Language - Language fy (Average Score (%)): leader nova-2-lite-v1 (100.0), 68 models
LanguageBench score for BCP-47 language code fy, averaging available language-specific task scores for each model.
Evals for Every Language - Language haw (Average Score (%)): leader claude-opus-4.8 (100.0), 68 models
LanguageBench score for BCP-47 language code haw, averaging available language-specific task scores for each model.
Evals for Every Language - Language hil (Average Score (%)): leader claude-opus-4.8 (100.0), 68 models
LanguageBench score for BCP-47 language code hil, averaging available language-specific task scores for each model.
Evals for Every Language - Language kri (Average Score (%)): leader claude-opus-4.8 (100.0), 68 models
LanguageBench score for BCP-47 language code kri, averaging available language-specific task scores for each model.
Evals for Every Language - Language mak (Average Score (%)): leader trinity-mini-20251201 (100.0), 68 models
LanguageBench score for BCP-47 language code mak, averaging available language-specific task scores for each model.
Evals for Every Language - Language pam (Average Score (%)): leader gemma-4-31b-it-20260402 (100.0), 68 models
LanguageBench score for BCP-47 language code pam, averaging available language-specific task scores for each model.
Evals for Every Language - Language qu (Average Score (%)): leader trinity-mini-20251201 (100.0), 68 models
LanguageBench score for BCP-47 language code qu, averaging available language-specific task scores for each model.
Evals for Every Language - Language tet (Average Score (%)): leader claude-sonnet-4.6 (100.0), 68 models
LanguageBench score for BCP-47 language code tet, averaging available language-specific task scores for each model.
Evals for Every Language - Language ab (Average Score (%)): leader claude-opus-4.8 (100.0), 67 models
LanguageBench score for BCP-47 language code ab, averaging available language-specific task scores for each model.
Evals for Every Language - Language cgg (Average Score (%)): leader nova-2-lite-v1 (100.0), 67 models
LanguageBench score for BCP-47 language code cgg, averaging available language-specific task scores for each model.
Evals for Every Language - Language ff (Average Score (%)): leader gemini-3.1-pro-preview (89.63), 67 models
LanguageBench score for BCP-47 language code ff, averaging available language-specific task scores for each model.
Evals for Every Language - Language gaa (Average Score (%)): leader gemini-3.1-pro-preview (92.96), 67 models
LanguageBench score for BCP-47 language code gaa, averaging available language-specific task scores for each model.
Evals for Every Language - Language new (Average Score (%)): leader gemini-3.1-pro-preview (96.3), 67 models
LanguageBench score for BCP-47 language code new, averaging available language-specific task scores for each model.
Evals for Every Language - Language nr (Average Score (%)): leader claude-opus-4.8 (100.0), 67 models
LanguageBench score for BCP-47 language code nr, averaging available language-specific task scores for each model.
Evals for Every Language - Language vai (Average Score (%)): leader gemini-3.1-pro-preview (33.33), 61 models
LanguageBench score for BCP-47 language code vai, averaging available language-specific task scores for each model.
AutomataBench (Weighted pass@1 (%)): leader GPT-5.5 (45.28), 5 models
AutomataBench evaluates models on sparse reversible space-time completion: reconstructing cellular-automaton initial states from partial observations under a fixed time budget.
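
The LanguageBench entries above describe each per-language score as an average over the task scores available for a model. A minimal sketch of that aggregation follows, assuming a missing task is represented as None and is skipped rather than counted as zero. The task names and values are hypothetical and not taken from LanguageBench.

```python
from statistics import mean
from typing import Mapping, Optional


def language_average(task_scores: Mapping[str, Optional[float]]) -> Optional[float]:
    """Average only the task scores that are present (None marks a missing task)."""
    available = [score for score in task_scores.values() if score is not None]
    # With no available task there is nothing to average, so report no score.
    return mean(available) if available else None


# Hypothetical task names and scores for one model on one language.
example = {"translation": 90.0, "classification": 100.0, "qa": None}
print(language_average(example))  # 95.0
```

Under this assumption, a model evaluated on fewer tasks is averaged over fewer values, so a 100.0 average can rest on very little data.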

New Models (82)

Claude Mythos 5 — ELO 4507, #1
- FrontierCode Diamond (Fable/Mythos): 29.3 (#2/4)
- GDPval-AA (Fable/Mythos): 1932.0 (#2/5)
- GDP.pdf (Fable/Mythos): 29.8 (#2/5)
- AutomationBench (Fable/Mythos): 17.4 (#2/6)
- Blueprint-Bench 2 (Fable/Mythos): 38.6 (#2/6)
Claude Fable 5 — ELO 2918, #6
- DeepSWE: 69.9 (#1/10)
- EQ-Bench Creative Writing v3: 2027.9 (#1/106)
Claude Opus 4.8 — ELO 2669, #8
- DeepSWE: 59.0 (#4/10)
- EQ-Bench Creative Writing v3: 1831.1 (#4/106)
- Agents' Last Exam: 15.8 (#7/16)
GPT-5.5 — ELO 2554, #9
- DeepSWE: 64.4 (#3/10)
Qwen 3.7 Max — ELO 2508, #10
- Agents' Last Exam: 11.8 (#10/16)
Gemini 3.1 Pro (Preview) — ELO 2434, #12
- Kernel Arena - WaferBench NVFP4: 2.274 (#1/4)
- PredictionArena (Kalshi): 17425.7 (#1/10)
- JEE/NEET LLM Benchmark - JEE Advanced 2026: 94.72 (#1/7)
- PredictionArena (Polymarket): 28113.52 (#2/10)
- JEE/NEET LLM Benchmark - NEET 2026: 98.61 (#2/14)
- Terminal-Bench 2.1: 70.7 (#5/7)
- Terminal-Bench 2.1 (Terminus 2): 70.3 (#5/6)
- Agents' Last Exam: 15.8 (#8/16)
- DeepSWE: 11.8 (#10/10)
- WolfBench: 59.0 (#10/28)
Claude Opus 4.6 — ELO 2419, #13
- PredictionArena (Polymarket): 138944.64 (#1/10)
- PredictionArena (Kalshi): 7377.51 (#6/10)
- Ramp SWE-Bench: 78.75 (#6/19)
GPT-5.3 Codex — ELO 2383, #14
- WolfBench: 55.0 (#13/28)
Claude Opus 4.7 — ELO 2369, #15
- Agents' Last Exam: 18.4 (#6/16)
GLM-5.2 — ELO 2315, #19
- Tau3-Bench Retail: 85.7 (#1/10)
- Tau3-Bench Telecom: 99.3 (#1/9)
- Tau3-Bench Airline: 87.5 (#1/9)
- SEAL - SWE Atlas - Codebase QnA: 48.12 (#2/17)
- Ramp SWE-Bench: 80.0 (#4/19)
- Tau3-Bench Banking_Knowledge: 29.6 (#4/19)
- WolfBench: 71.0 (#4/28)
- SEAL - SWE Atlas - Test Writing: 41.48 (#5/18)
- DeepSWE: 43.8 (#6/10)
- MathArena - ARXIV March: 61.67 (#6/15)
GPT-5.4 — ELO 2310, #20
- PredictionArena (Polymarket): 15411.83 (#3/10)
- PredictionArena (Kalshi): 10121.56 (#3/10)
GPT-5.2 — ELO 2264, #24
- PredictionArena (Polymarket): 9823.1 (#7/10)
- PredictionArena (Kalshi): 7322.19 (#8/10)
Nex N2 Pro — ELO 2225, #27
- Design Arena (SVG): 1275.0 (#15/96)
- AA GDPval: 1261.92 (#15/71)
- GDPval-AA: 1262.0 (#15/71)
- Artificial Analysis Intelligence Index: 40.96 (#25/510)
- AA CritPt: 8.57 (#26/417)
- AA Omniscience - Software Engineering (SWE) - Rust: 74.0 (#28/416)
- AA GPQA Diamond: 89.19 (#30/511)
- AA Humanity's Last Exam: 32.39 (#30/507)
- AA Omniscience - Science, Engineering & Mathematics: 42.4 (#31/416)
- AA Long Context Reasoning: 67.67 (#41/439)
Qwen 3.7 Plus — ELO 2183, #34
- JEE/NEET LLM Benchmark - JEE Advanced 2026: 94.17 (#3/7)
- Ramp SWE-Bench: 61.25 (#15/19)
- Chatbot Arena (Vision): 1266.0 (#19/131)
- Chatbot Arena (Text): 1464.0 (#32/368)
DeepSeek V4 Pro — ELO 2176, #36
- Agents' Last Exam: 12.4 (#9/16)
- WolfBench: 57.0 (#12/28)
- Epoch AI - ECI: 149.74 (#79/391)
Kimi K2.7 Code — ELO 2174, #38
- Ramp SWE-Bench: 78.75 (#5/19)
- DeepSWE: 30.5 (#8/10)
- WolfBench: 58.0 (#11/28)
- WebDev Arena: 1479.2 (#20/81)
Gemini 3 Pro (Preview) — ELO 2173, #39
- PredictionArena (Polymarket): 9770.22 (#9/10)
- PredictionArena (Kalshi): 6903.08 (#10/10)
Gemini 3 Flash (Preview) — ELO 2166, #41
- JEE/NEET LLM Benchmark - NEET 2026: 99.31 (#1/14)
- WolfBench: 44.0 (#21/28)
Grok 0.1 — ELO 2164, #43
- BoxPwnr CTF Bench: 44.67 (#3/16)
GLM-5 FP8 — ELO 2154, #45
- WolfBench: 47.0 (#19/28)
DeepSeek V4 Flash — ELO 2141, #47
- WolfBench: 51.0 (#15/28)
MiMo-V2.5-Pro — ELO 2136, #48
- WebDev Arena: 1471.33 (#23/81)
Claude Opus 4.5 (20251101) — ELO 2135, #49
- PredictionArena (Polymarket): 9887.5 (#6/10)
- PredictionArena (Kalshi): 7337.75 (#7/10)
MiniMax-M3 — ELO 2120, #52
- JEE/NEET LLM Benchmark - JEE Advanced 2026: 88.61 (#5/7)
- AA MMMU-Pro: 78.55 (#15/201)
- Wolfram LLM Benchmarking Project: 50.1 (#133/493)
Kimi K2.6 — ELO 2112, #56
- ProphetArena: 0.9197 (#5/44)
- Agents' Last Exam: 9.2 (#12/16)
Grok 4.3 — ELO 2097, #60
- Agents' Last Exam: 6.6 (#15/16)
- Epoch AI - ECI: 148.96 (#100/391)
GLM-5.1 — ELO 2062, #73
- Agents' Last Exam: 11.5 (#11/16)
- Ramp SWE-Bench: 71.25 (#12/19)
Gemini 2.5 Pro — ELO 2054, #80
- AA GDPval: 621.73 (#51/71)
- GDPval-AA: 622.0 (#51/71)
O1 — ELO 2053, #81
- CRMArena - Overall: 64.3 (#1/9)
- CRMArena - HTU: 68.5 (#1/9)
- CRMArena - NED: 60.0 (#1/9)
- CRMArena - TII: 99.2 (#1/9)
- CRMArena - BRI: 74.8 (#1/9)
- gg-bench: 44.27 (#1/7)
- CRMArena - KQA: 58.8 (#2/9)
- CRMArena - MTA: 84.6 (#2/9)
- CRMArena - NCR: 70.0 (#3/9)
- CRMArena - TCU: 66.9 (#3/9)
GLM-5 — ELO 2034, #88
- PredictionArena (Kalshi): 15294.97 (#2/10)
- PredictionArena (Polymarket): 7053.46 (#10/10)
Qwen 3.6 Plus — ELO 2029, #93
- Agents' Last Exam: 8.6 (#13/16)
- Ramp SWE-Bench: 65.0 (#14/19)
Kimi K2.5 — ELO 2020, #100
- WolfBench: 48.0 (#17/28)
GLM-4.7 — ELO 2017, #102
- PredictionArena (Kalshi): 8459.6 (#4/10)
- PredictionArena (Polymarket): 9982.59 (#5/10)
- Kernel Arena - KernelBench HIP: 1.307 (#11/11)
Qwen 3.5 27B — ELO 2009, #104
- WebDev Arena: 1356.86 (#60/81)
O4 Mini — ELO 2006, #106
- Epoch AI - ECI: 146.72 (#124/391)
GPT-4o ChatGPT — ELO 1996, #109
- Darija Chatbot Arena: 1400.6 (#1/14)
Gemma 4 31B (IT) — ELO 1989, #111
- JEE/NEET LLM Benchmark - JEE Advanced 2026: 81.11 (#6/7)
- JEE/NEET LLM Benchmark - NEET 2026: 97.22 (#6/14)
MiniMax-M2.7 — ELO 1985, #114
- Appwrite Arena (With Skills): 93.2 (#12/16)
- WolfBench: 52.0 (#14/28)
- Agents' Last Exam: 5.9 (#16/16)
- Appwrite Arena (Without Skills): 85.2 (#16/16)
Qwen 3.5 122B A10B — ELO 1982, #118
- LIBRA - ruSciPassageCount *: 21.38 (#3/13)
- LIBRA - ruBABILongQA1: 66.8 (#3/13)
- LIBRA - ruBABILongQA2: 53.71 (#3/13)
- LIBRA - ruBABILongQA3 *: 31.85 (#3/13)
- LIBRA - MatreshkaNames *: 67.39 (#4/13)
- LIBRA - LibrusecHistory: 79.77 (#4/13)
- LIBRA - ru2WikiMultihopQA *: 55.3 (#4/13)
- LIBRA - ruSciFi: 50.29 (#4/13)
- LIBRA - LibrusecMHQA *: 42.32 (#4/13)
- LIBRA - ruBABILongQA4: 58.91 (#4/13)
Mercury 2 — ELO 1979, #122
- WebDev Arena: 1164.83 (#79/81)
- Design Arena (UI Components): 1024.0 (#111/128)
- Design Arena (3D): 1039.0 (#113/122)
- Design Arena (Game Dev): 1043.0 (#115/134)
- Design Arena (Data Viz): 1001.0 (#116/130)
- Design Arena (Website): 1031.0 (#125/146)
MiMo-V2-Pro — ELO 1972, #129
- WebDev Arena: 1432.18 (#34/81)
GPT-5.4 Mini — ELO 1971, #130
- Epoch AI - ECI: 148.98 (#97/391)
MiMo-V2.5 — ELO 1952, #139
- JEE/NEET LLM Benchmark - JEE Advanced 2026: 78.61 (#7/7)
- Agents' Last Exam: 8.6 (#14/16)
- WebDev Arena: 1432.12 (#35/81)
Gemini 2.5 Pro (Preview 05-06) — ELO 1934, #144
- JEE/NEET LLM Benchmark - JEE Advanced 2025: 89.72 (#1/5)
O3 Mini — ELO 1916, #157
- gg-bench: 30.27 (#2/7)
- Epoch AI - ECI: 141.43 (#214/391)
Gemini 3.1 Flash Lite (Preview) — ELO 1905, #166
- WolfBench: 25.0 (#26/28)
Gemini 2.5 Pro (Preview 03-25) — ELO 1894, #171
- JEE/NEET LLM Benchmark - JEE Advanced 2024: 89.44 (#1/1)
- JEE/NEET LLM Benchmark - NEET 2024: 90.88 (#1/1)
- JEE/NEET LLM Benchmark - NEET 2025: 95.14 (#2/2)
MiMo-V2-Flash — ELO 1892, #174
- WebDev Arena: 1336.8 (#63/81)
GPT-4.1 — ELO 1888, #177
- Ramp SWE-Bench: 15.0 (#19/19)
O1 Mini — ELO 1876, #186
- KataGo-Bench-1K: 27.3 (#4/11)
KAT-Coder-Pro V1 — ELO 1869, #191
- AA GDPval: 880.69 (#41/71)
- GDPval-AA: 881.0 (#41/71)
GPT-5.4 Nano — ELO 1855, #199
- Epoch AI - ECI: 146.74 (#119/391)
nemotron-3-ultra-550B-a55B — ELO 1850, #205
- Design Arena (SVG): 1136.0 (#65/96)
MiniMax-M2.5 — ELO 1832, #215
- WolfBench: 47.0 (#20/28)
Qwen 3.6 35B A3B — ELO 1796, #238
- AI Chess Leaderboard (Reasoning): 1519.0 (#15/282)
- WebDev Arena: 1249.48 (#72/81)
- AI Chess Leaderboard (Continuation): 446.0 (#166/230)
Mistral Large 3 — ELO 1768, #261
- Appwrite Arena (Without Skills): 86.2 (#15/16)
- Appwrite Arena (With Skills): 87.4 (#16/16)
- AA GDPval: 599.9 (#54/71)
- GDPval-AA: 600.0 (#54/71)
DeepSeek V3 — ELO 1755, #273
- Darija Chatbot Arena: 1221.2 (#5/14)
Qwen 2.5 72B Instruct — ELO 1751, #280
- Darija Chatbot Arena: 1188.3 (#7/14)
Gemma 3 27B (IT) — ELO 1738, #290
- GDPval-AA: 141.0 (#67/71)
- AA GDPval: -140.52 (#71/71)
Qwen 3.5 35B A3B — ELO 1729, #297
- LIBRA - MatreshkaNames *: 68.97 (#2/13)
- LIBRA - ruSciPassageCount *: 21.89 (#2/13)
- LIBRA - ruSciFi: 51.47 (#2/13)
- LIBRA - ruBABILongQA1: 68.38 (#2/13)
- LIBRA - ruBABILongQA2: 54.97 (#2/13)
- LIBRA - ruBABILongQA3 *: 32.6 (#2/13)
- LIBRA - LibrusecHistory: 81.65 (#3/13)
- LIBRA - ru2WikiMultihopQA *: 56.6 (#3/13)
- LIBRA - LibrusecMHQA *: 43.32 (#3/13)
- LIBRA - ruBABILongQA4: 60.29 (#3/13)
GPT-4o Mini — ELO 1698, #321
- Darija Chatbot Arena: 1212.6 (#6/14)
- gg-bench: 2.77 (#6/7)
Devstral 2 — ELO 1696, #322
- AA GDPval: 714.11 (#47/71)
- GDPval-AA: 714.0 (#47/71)
Qwen 3.5 9B — ELO 1687, #329
- LIBRA - ruSciPassageCount *: 20.77 (#4/13)
- LIBRA - ruBABILongQA1: 64.88 (#4/13)
- LIBRA - ruBABILongQA2: 52.16 (#4/13)
- LIBRA - ruBABILongQA3 *: 30.94 (#4/13)
- LIBRA - MatreshkaNames *: 65.44 (#5/13)
- LIBRA - LibrusecHistory: 77.47 (#5/13)
- LIBRA - ru2WikiMultihopQA *: 53.7 (#5/13)
- LIBRA - ruSciFi: 48.84 (#5/13)
- LIBRA - LibrusecMHQA *: 41.1 (#5/13)
- LIBRA - ruBABILongQA4: 57.21 (#5/13)
Qwen 3 VL 8B — ELO 1675, #339
- JMMMU-Pro - Overall: 47.273 (#3/14)
- JMMMU-Pro - Culture Agnostic: 47.083 (#3/14)
- JMMMU-Pro - Japanese History: 44.667 (#4/14)
- JMMMU-Pro - World History: 70.667 (#4/14)
- JMMMU-Pro - Culture Specific: 47.5 (#5/14)
- JMMMU-Pro - Japanese Art: 42.667 (#5/14)
- JMMMU-Pro - Japanese Heritage: 37.333 (#5/14)
Qwen 3 30B A3B 2507 Instruct — ELO 1674, #340
- LIBRA - MatreshkaNames *: 81.2 (#1/13)
- LIBRA - PasskeyWithLibrusec: 100.0 (#1/13)
- LIBRA - LibrusecHistory: 96.12 (#1/13)
- LIBRA - ruSciPassageCount *: 25.77 (#1/13)
- LIBRA - ru2WikiMultihopQA *: 66.63 (#1/13)
- LIBRA - ruSciAbstractRetrieval: 81.5 (#1/13)
- LIBRA - ruSciFi: 60.6 (#1/13)
- LIBRA - LibrusecMHQA *: 51.0 (#1/13)
- LIBRA - ruBABILongQA1: 80.5 (#1/13)
- LIBRA - ruBABILongQA2: 64.72 (#1/13)
Ministral 3 14B — ELO 1657, #364
- AA GDPval: 426.41 (#60/71)
- GDPval-AA: 426.0 (#60/71)
Ministral 3 8B — ELO 1651, #369
- AA GDPval: 385.16 (#63/71)
- GDPval-AA: 385.0 (#63/71)
Ministral 3 3B — ELO 1565, #480
- AA GDPval: 212.8 (#66/71)
- GDPval-AA: 213.0 (#66/71)
QwQ 32B-Preview — ELO 1563, #483
- Darija Chatbot Arena: 1106.4 (#12/14)
Claude 3.5 Haiku — ELO 1559, #491
- AA GDPval: 409.64 (#61/71)
- GDPval-AA: 410.0 (#61/71)
Gemma 4 26B A4B — ELO 1520, #583
- Icelandic LLM - ARC-Challenge-IS: 93.09 (#21/87)
- Icelandic LLM - Belebele-IS: 92.11 (#29/87)
- Icelandic LLM - WinoGrande-IS: 88.42 (#30/87)
- Icelandic LLM - GED: 59.0 (#41/87)
- Icelandic LLM Leaderboard - Average: 67.08 (#43/87)
- Icelandic LLM - Inflection: 53.58 (#51/87)
- Icelandic LLM - WikiQA-IS: 16.26 (#57/87)
- WebDev Arena: 1359.32 (#58/81)
Command-R+ (08-2024) — ELO 1510, #600
- Darija Chatbot Arena: 1148.7 (#10/14)
Qwen 3.5 4B — ELO 1503, #607
- LIBRA - ruSciPassageCount *: 19.57 (#5/13)
- LIBRA - ruBABILongQA1: 61.13 (#5/13)
- LIBRA - ruBABILongQA2: 49.14 (#5/13)
- LIBRA - ruBABILongQA3 *: 29.15 (#5/13)
- LIBRA - MatreshkaNames *: 61.66 (#6/13)
- LIBRA - LibrusecHistory: 72.99 (#6/13)
- LIBRA - ru2WikiMultihopQA *: 50.6 (#6/13)
- LIBRA - ruSciFi: 46.02 (#6/13)
- LIBRA - LibrusecMHQA *: 38.73 (#6/13)
- LIBRA - ruBABILongQA4: 53.9 (#6/13)
Llama 3.1 8B Instruct — ELO 1431, #751
- LIBRA - MatreshkaYesNo: 75.0 (#1/13)
- LIBRA - PasskeyWithLibrusec: 100.0 (#2/13)
- LIBRA - LibrusecHistory: 93.0 (#2/13)
- LIBRA - ru2WikiMultihopQA *: 64.77 (#2/13)
- LIBRA - ruSciAbstractRetrieval: 79.92 (#2/13)
- LIBRA - LibrusecMHQA *: 48.7 (#2/13)
- LIBRA - ruBABILongQA4: 66.17 (#2/13)
- LIBRA - ruBABILongQA5: 75.3 (#2/13)
- LIBRA - ruTPO: 93.9 (#2/13)
- LIBRA - MatreshkaNames *: 68.02 (#3/13)
Phi-3.5-mini-instruct — ELO 1408, #836
- LIBRA - MatreshkaYesNo: 73.42 (#2/13)
- LIBRA - ruBABILongQA5: 71.22 (#4/13)
- LIBRA - ruQuALITY: 81.3 (#4/13)
- LIBRA - Passkey: 99.83 (#6/13)
- LIBRA - PasskeyWithLibrusec: 91.25 (#6/13)
- LIBRA - LibrusecHistory: 72.65 (#7/13)
- LIBRA - ruTPO: 74.7 (#7/13)
- LIBRA - LongContextMultiQ *: 34.92 (#8/13)
- LIBRA - ruBABILongQA3 *: 27.75 (#8/13)
- LIBRA - MatreshkaNames *: 37.4 (#9/13)
Qwen 3.5 0.8B — ELO 1356, #969
- LIBRA - ruSciPassageCount *: 17.79 (#7/13)
- LIBRA - ruBABILongQA2: 44.67 (#7/13)
- LIBRA - MatreshkaNames *: 56.05 (#8/13)
- LIBRA - ru2WikiMultihopQA *: 46.0 (#8/13)
- LIBRA - LibrusecMHQA *: 35.21 (#8/13)
- LIBRA - ruBABILongQA1: 55.57 (#8/13)
- LIBRA - ruBABILongQA4: 49.0 (#8/13)
- LIBRA - ruSciAbstractRetrieval: 56.26 (#9/13)
- LIBRA - ruSciFi: 41.83 (#9/13)
- LIBRA - ruBABILongQA3 *: 26.5 (#9/13)
Qwen 3.5 2B — ELO 1263, #1135
- LIBRA - ruSciPassageCount *: 18.72 (#6/13)
- LIBRA - ruBABILongQA2: 47.01 (#6/13)
- LIBRA - ruBABILongQA3 *: 27.88 (#6/13)
- LIBRA - MatreshkaNames *: 58.98 (#7/13)
- LIBRA - ru2WikiMultihopQA *: 48.4 (#7/13)
- LIBRA - ruSciFi: 44.02 (#7/13)
- LIBRA - LibrusecMHQA *: 37.05 (#7/13)
- LIBRA - ruBABILongQA1: 58.48 (#7/13)
- LIBRA - ruBABILongQA4: 51.56 (#7/13)
- LIBRA - LibrusecHistory: 69.83 (#8/13)
aya-vision-8B — ELO 1246, #1153
- JMMMU-Pro - Japanese Heritage: 30.667 (#6/14)
- JMMMU-Pro - Culture Specific: 27.0 (#10/14)
- JMMMU-Pro - World History: 28.667 (#10/14)
- JMMMU-Pro - Japanese History: 26.667 (#12/14)
- JMMMU-Pro - Overall: 26.742 (#13/14)
- JMMMU-Pro - Culture Agnostic: 26.528 (#13/14)
- JMMMU-Pro - Japanese Art: 26.0 (#13/14)
Gemma 4 E4B — ELO 1156, #1238
- SEA-HELM: 61.23 (#14/43)
GPT-3.5 — ELO 1076, #1282
- Nexus Function Calling - VirusTotal: 81.0 (#2/4)
- Nexus Function Calling - Overall: 36.52 (#3/4)
- Nexus Function Calling - OTX: 89.13 (#3/4)
- Nexus Function Calling - CVECPE: 48.0 (#3/4)
- Nexus Function Calling - CVECPE Multi APIs: 7.14 (#3/4)
- Nexus Function Calling - VT Multi Dependency: 2.04 (#3/4)
- Nexus Function Calling - VT Multi Disconnected: 14.29 (#3/4)
- Nexus Function Calling - Climate: 25.53 (#3/4)
- Nexus Function Calling - Places API: 25.0 (#3/4)
Gemma 4 E2B — ELO 1016, #1314
- SEA-HELM: 50.1 (#26/43)
LFM2.5-1.2B Instruct — ELO 986, #1325
- LIBRA - ruQuALITY: 71.4 (#7/13)
- LIBRA - ruTPO: 70.4 (#9/13)
- LIBRA - MatreshkaYesNo: 38.83 (#12/13)
- LIBRA - ru2WikiMultihopQA *: 36.2 (#12/13)
- LIBRA - LibrusecMHQA *: 22.7 (#12/13)
- LIBRA - Passkey: 66.67 (#13/13)
- LIBRA - MatreshkaNames *: 6.72 (#13/13)
- LIBRA - PasskeyWithLibrusec: 66.08 (#13/13)
- LIBRA - LibrusecHistory: 29.7 (#13/13)
- LIBRA - ruSciPassageCount *: 2.93 (#13/13)

Top-10 New Scores (12)

Claude Fable 5 on DeepSWE: 69.9 (#1)
Claude Fable 5 on EQ-Bench Creative Writing v3: 2027.9 (#1)
Claude Mythos 5 on AutomationBench (Fable/Mythos): 17.4 (#2)
Claude Mythos 5 on Blueprint-Bench 2 (Fable/Mythos): 38.6 (#2)
Claude Mythos 5 on FrontierCode Diamond (Fable/Mythos): 29.3 (#2)
Claude Mythos 5 on GDP.pdf (Fable/Mythos): 29.8 (#2)
Claude Mythos 5 on GDPval-AA (Fable/Mythos): 1932.0 (#2)
Claude Opus 4.8 on Agents' Last Exam: 15.8 (#7)
Claude Opus 4.8 on DeepSWE: 59.0 (#4)
Claude Opus 4.8 on EQ-Bench Creative Writing v3: 1831.1 (#4)
GPT-5.5 on DeepSWE: 64.4 (#3)
Qwen 3.7 Max on Agents' Last Exam: 11.8 (#10)

New #1 Leaders (28)

PredictionArena (Polymarket): Claude Opus 4.6 (138944.64) beat Claude Opus 4.6 by 18832.9
WebDev Arena: Claude 5 (1653.93) beat Claude Opus 4.7 (Unknown) by 87.08
LLM Stats (ZEROBench): Seed 2.1 Turbo (57.2) beat Muse Spark by 24.2
WorldCupBench: MiMo-V2.5-Pro (46.0) beat Grok 4.3 by 24.0
SWE-bench Live: AMI Agent + Claude-4.6-Opus (63.0) beat SWE-agent + Claude-4.5-Sonnet by 23.0
LLM Stats (BLINK): Seed 2.1 Pro (81.4) beat Qwen 3 VL 235B A22B Instruct by 10.7
WDCD R2 In-Document Resistance: Grok 4 (100.0) beat Gemini 2.5 Pro by 10.0
ExploitBench v8-bench: Claude Mythos Preview 5 seeds (78.0) beat Claude Mythos Preview by 9.0
Coding Agent Leaderboard - swe-bench-pro--ansible: Opus 4.8 + OpenCode (78.1) beat Opus 4.8 + Claude Code by 8.3
LLM Stats (OSWorld): Seed 2.1 Pro (78.8) beat Claude Opus 4.6 by 6.1
Design Arena (3D): silo (1374.0) beat Claude Fable 5 by 6.0
LLM Stats (CharXiv-D): Seed 2.1 Pro (95.5) beat Qwen 3 VL 32B Instruct by 5.0
LLM Stats (MathVista): Seed 2.1 Pro (90.7) beat O3 by 3.9
Tau3-Bench Airline: GLM-5.2 (87.5) beat Claude Opus 4.5 by 3.5
LLM Stats (BabyVision): Seed 2.1 Pro (73.7) beat Qwen 3.7 Plus by 3.3
Coding Agent Leaderboard: Opus 4.8 + OpenCode (80.8) beat Opus 4.8 + Claude Code by 2.5
LLM Stats (ERQA): Seed 2.1 Pro (72.0) beat Qwen 3.7 Plus by 2.2
Chatbot Arena (Vision): Claude Fable 5 (1311.0) beat Claude Opus 4.7 (Thinking) by 2.0
LLM Stats (LVBench): Seed 2.1 Pro (78.0) beat Qwen 3.7 Plus by 1.8
SpreadsheetBench: Data Analysis Agent (96.5) beat Qingqiu Agent by 1.75
Tau3-Bench Telecom: GLM-5.2 (99.3) beat Qwen 3.5 397B A17B by 1.5
Tau3-Bench Retail: GLM-5.2 (85.7) beat Qwen 3.5 397B A17B by 1.3
LLM Stats (MathVision): Seed 2.1 Pro (94.5) beat Kimi K2.6 by 1.3
LLM Stats (Video-MME): Seed 2.1 Pro (89.2) beat Qwen 3.7 Plus by 1.2
ForecastBench: Gemini (68.4) beat Grok 4.20 by 0.5
LLM Stats (MCP Atlas): Seed 2.1 Pro (83.8) beat Gemini 3.5 Flash by 0.2
AA Omniscience - Health: Grok Build 0.1 0616 (48.9) beat GPT-5.5 (Medium) by 0.1
CADGenBench: GPT-5.5 (xHigh) (0.4573) beat Claude Fable 5 by 0.01

AI Benchmark Digest — 2026-06-27

2026-06-27T07:12:07.142902+00:00

Daily

New Benchmarks (13)

MirrorCode (Solve@100 Rate (%)): leader Claude Opus 4.7 (56.0), 3 models
Epoch AI coding benchmark where models reimplement whole open-source programs from observed behavior rather than copying the original source. Reports per-target mean solve rates across program-reconstruction tasks.
Epoch AI - Scicode (Score): leader claude-fable-5_max (60.19), 37 models
OSWorld 2.0 (Binary Accuracy (%)): leader Claude Opus 4.8 (20.6), 7 models
Long-horizon computer-use benchmark with 108 end-to-end desktop workflows. Tests agents on realistic GUI, file, browser, and application tasks with binary completion as the primary metric.
OSWorld 2.0 Partial (Partial Score (%)): leader Claude Opus 4.8 (54.8), 7 models
Partial-score companion metric for OSWorld 2.0, measuring graded progress on the same long-horizon desktop workflows in addition to binary task completion.
Medical Chronology LLM Benchmark (Composite Score (%)): leader claude-opus-4.6 (92.28), 11 models
Medical chronology extraction benchmark for building structured timelines from synthetic medical-legal records.
LiveSecBench (Overall Score (%)): leader Claude-Haiku-4.5 (91.43), 43 models
Dynamic live safety benchmark for large language models across ethics, legality, privacy, factuality, and psychological health.
RealDataAgentBench (Average DAB Score (%)): leader claude-opus-4-8 (88.89), 14 models
Data-science agent benchmark evaluating whether LLM agents solve real-data analysis tasks correctly and robustly across correctness, code quality, efficiency, and statistical validity.
SecCodeBench (Total Score): leader Claude Opus 4.5 (68.1), 36 models
Security benchmark for AI-generated and AI-repaired code, reporting secure-code repair and generation scores with and without hints.
Gert Labs Rankings (GScore (%)): leader claude-fable-5 (73.87), 76 models
Gert Labs global model ranking across game environments that evaluate agentic coding, one-shot coding, and decision-making performance.
RP-Bench (Combined Score): leader claude opus 4 6 (82.0), 8 models
Roleplay benchmark evaluating character consistency, user agency, lorebook use, temporal reasoning, and interactive writing quality.
RP-Bench - Flaw Hunter (Flaw Hunter Score): leader claude opus 4 6 (72.1), 8 models
RP-Bench - Objective (Objective Score): leader gpt 4 1 (90.0), 8 models
RP-Bench - Elo (Elo): leader claude opus 4 6 (1705.7), 8 models

AI Benchmark Digest — 2026-06-26

2026-06-26T07:15:54.445119+00:00

Daily

New Benchmarks (8)

SEAL - SWE Atlas - Refactoring (Score): leader Fable-5 (Claude Code) xHigh (54.76), 14 models
Scale SEAL SWE Atlas refactoring benchmark measuring software-engineering agents on codebase refactoring tasks.
DRACO (Score (%)): leader Claude Mythos 5 (86.4), 13 models
Perplexity DRACO deep-research agent evaluation reported in Anthropic's Claude Opus 4.8 system card.
HealthAdminBench (Success Rate (%)): leader Claude Mythos 5 (browser-use) (51.9), 11 models
Healthcare administration agent benchmark for prior authorization, appeals, durable medical equipment, payer portals, fax, and EHR-adjacent workflows.
Vals AI AIME (Accuracy (%)): leader gemini-3.1-pro-preview (98.12), 96 models
Vals AI AIME benchmark measuring competition-math problem solving on American Invitational Mathematics Examination-style tasks.
LLM Stats (APEX-Agents) (Score (%)): leader Seed 2.1 Pro (33.8), 5 models
LLM Stats (MathArena Apex) (Score (%)): leader DeepSeek-V4-Pro-Max (90.2), 6 models
LLM Stats (OfficeQA Pro) (Score (%)): leader Seed 2.1 Pro (72.2), 5 models
LLM Stats (Terminal-Bench 2.1) (Score (%)): leader Claude Fable 5 (84.3), 6 models

New #1 Leaders (18)

LLM Stats (ZEROBench): Seed 2.1 Turbo (57.2) beat Muse Spark by 24.2
SWE-bench Live: AMI Agent + Claude-4.6-Opus (63.0) beat SWE-agent + Claude-4.5-Sonnet by 23.0
LLM Stats (BLINK): Seed 2.1 Pro (81.4) beat Qwen 3 VL 235B A22B Instruct by 10.7
Coding Agent Leaderboard - swe-bench-pro--ansible: Opus 4.8 + OpenCode (78.1) beat Opus 4.8 + Claude Code by 8.3
LLM Stats (OSWorld): Seed 2.1 Pro (78.8) beat Claude Opus 4.6 by 6.1
APEX v1 Medicine (MD): Opus 4.6 (Max) (71.6) beat GPT-5 by 5.6
LLM Stats (CharXiv-D): Seed 2.1 Pro (95.5) beat Qwen 3 VL 32B Instruct by 5.0
LLM Stats (MathVista): Seed 2.1 Pro (90.7) beat O3 by 3.9
LLM Stats (BabyVision): Seed 2.1 Pro (73.7) beat Qwen 3.7 Plus by 3.3
APEX v1 Consulting: GPT-5.2 Codex (High) (66.9) beat Gemini 3 Flash by 2.9
Coding Agent Leaderboard: Opus 4.8 + OpenCode (80.8) beat Opus 4.8 + Claude Code by 2.5
LLM Stats (ERQA): Seed 2.1 Pro (72.0) beat Qwen 3.7 Plus by 2.2
LLM Stats (LVBench): Seed 2.1 Pro (78.0) beat Qwen 3.7 Plus by 1.8
LLM Stats (MathVision): Seed 2.1 Pro (94.5) beat Kimi K2.6 by 1.3
LLM Stats (Video-MME): Seed 2.1 Pro (89.2) beat Qwen 3.7 Plus by 1.2
APEX v1 Investment Banking: GPT-5.3 Codex (High) (65.0) beat GPT-5.2 Pro by 1.0
APEX v1: GPT-5.4 (High) (67.2) beat GPT-5 by 0.2
LLM Stats (MCP Atlas): Seed 2.1 Pro (83.8) beat Gemini 3.5 Flash by 0.2
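
The "New #1 Leaders" lines follow a regular shape: benchmark, new leader, leader score in parentheses, previous leader, margin. Below is a hedged sketch of a parser for that shape. The regular expression and the names LeaderChange and parse_leader_line are illustrative assumptions, not part of the digest's tooling. The implied previous score assumes the margin is the leader's score minus the previous best.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumed line shape: "<benchmark>: <leader> (<score>) beat <previous> by <margin>"
LEADER_LINE = re.compile(
    r"^(?P<benchmark>.+?): (?P<leader>.+) \((?P<score>-?\d+(?:\.\d+)?)\) "
    r"beat (?P<previous>.+) by (?P<margin>-?\d+(?:\.\d+)?)$"
)


@dataclass(frozen=True)
class LeaderChange:
    benchmark: str
    leader: str
    score: float
    previous: str
    margin: float

    @property
    def implied_previous_score(self) -> float:
        # Assumption: margin = leader score - previous best score.
        return round(self.score - self.margin, 4)


def parse_leader_line(line: str) -> Optional[LeaderChange]:
    """Return the parsed fields, or None if the line does not match the shape."""
    m = LEADER_LINE.match(line.strip())
    if m is None:
        return None
    return LeaderChange(
        m["benchmark"], m["leader"], float(m["score"]), m["previous"], float(m["margin"])
    )


change = parse_leader_line("APEX v1 Medicine (MD): Opus 4.6 (Max) (71.6) beat GPT-5 by 5.6")
print(change.leader, change.implied_previous_score)  # Opus 4.6 (Max) 66.0
```

The pattern tolerates parentheses inside benchmark and model names because only the last parenthesized number before " beat " is read as the score.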

AI Benchmark Digest — 2026-06-25

2026-06-25T07:23:59.136022+00:00

Daily

New Benchmarks (87)

ParallelKernelBench (Fast1@3 (% of problems)): leader GPT-5.5 (31.03), 6 models
ParallelKernelBench evaluates multi-GPU CUDA kernel generation, asking models to replace PyTorch plus NCCL references with direct NVLink communication kernels.
ParallelKernelBench Pass@3 (Pass@3 (% of problems)): leader GPT-5.5 (41.38), 6 models
ParallelKernelBench pass@3 measures best-of-three correctness on multi-GPU CUDA kernels that replace PyTorch plus NCCL references.
ParallelKernelBench Fast1@1 (Fast1@1 (% of problems)): leader GPT-5.5 (25.29), 6 models
ParallelKernelBench fast1@1 measures single-shot model outputs that are both correct and faster than the PyTorch plus NCCL baseline.
ParallelKernelBench Pass@1 (Pass@1 (% of problems)): leader GPT-5.5 (32.18), 6 models
ParallelKernelBench pass@1 measures single-shot correctness on production-style multi-GPU kernel generation tasks.
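
The four ParallelKernelBench metrics above differ only in the number of attempts considered and in whether speed is required. A minimal sketch of how such percentages could be computed is below, assuming each problem carries an ordered list of attempts, that a speedup above 1.0 means faster than the baseline, and that an @k metric counts a problem if any of its first k attempts qualifies. The Attempt type and the sample data are hypothetical, and the benchmark's own harness may differ.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Attempt:
    correct: bool
    speedup: float  # baseline time / kernel time; > 1.0 means faster than baseline


def pass_at_k(problems: Sequence[Sequence[Attempt]], k: int) -> float:
    """Percent of problems with at least one correct attempt among the first k."""
    solved = sum(any(a.correct for a in attempts[:k]) for attempts in problems)
    return 100.0 * solved / len(problems)


def fast1_at_k(problems: Sequence[Sequence[Attempt]], k: int) -> float:
    """Percent of problems with an attempt in the first k that is correct AND faster."""
    solved = sum(
        any(a.correct and a.speedup > 1.0 for a in attempts[:k]) for attempts in problems
    )
    return 100.0 * solved / len(problems)


# Hypothetical results: four problems, three attempts each.
problems = [
    [Attempt(True, 1.4), Attempt(True, 1.1), Attempt(False, 0.0)],
    [Attempt(False, 0.0), Attempt(True, 0.8), Attempt(True, 1.2)],
    [Attempt(False, 0.0), Attempt(False, 0.0), Attempt(False, 0.0)],
    [Attempt(True, 0.9), Attempt(False, 0.0), Attempt(False, 0.0)],
]
print(pass_at_k(problems, 1), pass_at_k(problems, 3))    # 50.0 75.0
print(fast1_at_k(problems, 1), fast1_at_k(problems, 3))  # 25.0 50.0
```

By construction fast1@k can never exceed pass@k, which matches the ordering of the reported leader scores (25.29 vs 32.18 and 31.03 vs 41.38).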
Surface Evolver Bench (Mean Score (%)): leader gpt-5.5 (high) (89.29), 14 models
Surface Evolver Bench evaluates agentic scientific simulation writing, with models creating liquid-surface physics datafiles and using tool feedback before hidden checks.
Surface Evolver Bench Pass Rate (Pass Rate (%)): leader gpt-5.5 (high) (78.57), 14 models
Surface Evolver Bench pass rate measures fully passing agentic submissions for custom liquid-surface physics simulations.
LLM2014 Code 2025-11 (Multi-round Score): leader Gemini 3 Pro (96.0), 26 models
LLM2014 Code 2025-11 - Python (Score): leader Gemini 3 Pro (10.0), 26 models
LLM2014 Code 2025-11 - TypeScript (Score): leader Gemini 3 Pro (10.0), 26 models
LLM2014 Code 2025-11 - Golang (Score): leader GPT-5 Mini (high) (9.2), 26 models
LLM2014 Code 2025-11 - C# (Score): leader Gemini 3 Pro (10.0), 26 models
LLM2014 Code 2025-11 - Java (Score): leader Gemini 3 Pro (9.67), 26 models
LLM2014 Code 2025-11 - C++ (Score): leader Gemini 3 Pro (10.0), 26 models
LLM2014 Code 2025-12 (Multi-round Score): leader Gemini 3 Pro (96.0), 26 models
LLM2014 Code 2025-12 - Python (Score): leader Gemini 3 Pro (10.0), 26 models
LLM2014 Code 2025-12 - TypeScript (Score): leader Gemini 3 Pro (10.0), 26 models
LLM2014 Code 2025-12 - Golang (Score): leader GPT-5 Mini (high) (9.2), 26 models
LLM2014 Code 2025-12 - C# (Score): leader Gemini 3 Pro (10.0), 26 models
LLM2014 Code 2025-12 - Java (Score): leader Gemini 3 Pro (9.67), 26 models
LLM2014 Code 2025-12 - C++ (Score): leader Gemini 3 Pro (10.0), 26 models
LLM2014 Code 2026-01 (Multi-round Score): leader Gemini 3 Pro (96.0), 23 models
LLM2014 Code 2026-01 - Python (Score): leader Gemini 3 Pro (10.0), 23 models
LLM2014 Code 2026-01 - TypeScript (Score): leader Gemini 3 Pro (10.0), 23 models
LLM2014 Code 2026-01 - Golang (Score): leader GPT-5 Mini (high) (9.2), 23 models
LLM2014 Code 2026-01 - C# (Score): leader Gemini 3 Pro (10.0), 23 models
LLM2014 Code 2026-01 - Java (Score): leader Gemini 3 Pro (9.67), 23 models
LLM2014 Code 2026-01 - C++ (Score): leader Gemini 3 Pro (10.0), 23 models
LLM2014 Code 2026-02 (Multi-round Score): leader Gemini 3 Pro (96.0), 23 models
LLM2014 Code 2026-02 - Python (Score): leader Gemini 3 Pro (10.0), 23 models
LLM2014 Code 2026-02 - TypeScript (Score): leader Gemini 3 Pro (10.0), 23 models
LLM2014 Code 2026-02 - Golang (Score): leader GPT-5 Mini (high) (9.2), 23 models
LLM2014 Code 2026-02 - C# (Score): leader Gemini 3 Pro (10.0), 23 models
LLM2014 Code 2026-02 - Java (Score): leader Gemini 3 Pro (9.67), 23 models
LLM2014 Code 2026-02 - C++ (Score): leader Gemini 3 Pro (10.0), 23 models
LLM2014 Vision 2025-11 (Median Score): leader Gemini 3 Pro (70.47), 20 models
LLM2014 Logic 2024-05 (Score (%)): leader GPT-4 Turbo 0409 (77.05), 22 models
LLM2014 Logic 2024-06 (Score (%)): leader GPT-4 Turbo 0409 (76.53), 30 models
LLM2014 Logic 2024-07 (Score (%)): leader GPT-4 Turbo 0409 (76.65), 27 models
LLM2014 Logic 2024-08 (Score (%)): leader GPT-4 Turbo 0409 (74.86), 25 models
LLM2014 Logic 2024-09 (Score (%)): leader O1 Preview (87.52), 28 models
LLM2014 Logic 2024-10 (Score (%)): leader O1 Preview (86.55), 28 models
LLM2014 Logic 2024-11 (Score (%)): leader O1 Preview (86.55), 29 models
LLM2014 Logic 2025-11 (Median Score): leader GPT-5 (high) (83.75), 53 models
LLM2014 Logic 2025-12 (Median Score): leader GPT-5.2 (high) (81.83), 51 models
LLM2014 Logic 2026-01 (Median Score): leader GPT-5.2 (high) (80.71), 45 models
LLM2014 Logic 2026-02 (Median Score): leader Claude Opus 4.6 (Thinking) (78.02), 46 models
LLM2014 Logic 2026-03 (Median Score): leader GPT-5.4 (high) (78.85), 42 models
LLM2014 Logic 2026-04 (Median Score): leader GPT-5.5 (xhigh) (83.96), 42 models
LLM2014 Logic 2026-05 (Median Score): leader GPT-5.5 (xhigh) (80.47), 43 models
LLM2014 Logic 2026-06 (Median Score): leader GPT-5.5 (xhigh) (80.47), 42 models
HAL GAIA (Accuracy (%)): leader Claude Sonnet 4.5 (September 2025) (74.55), 32 models
Princeton HAL cost-aware agent leaderboard for GAIA multi-step web assistance tasks, reporting overall accuracy.
HAL GAIA Level 1 (Accuracy (%)): leader Claude Sonnet 4.5 (September 2025) (82.07), 32 models
Princeton HAL GAIA level-1 slice, covering the easiest GAIA web assistance tasks.
HAL GAIA Level 2 (Accuracy (%)): leader Claude Sonnet 4.5 High (September 2025) (74.42), 32 models
Princeton HAL GAIA level-2 slice, covering intermediate GAIA web assistance tasks.
HAL GAIA Level 3 (Accuracy (%)): leader Claude Sonnet 4.5 (September 2025) (65.39), 32 models
Princeton HAL GAIA level-3 slice, covering the hardest GAIA web assistance tasks.
HAL SciCode (Accuracy (%)): leader o4-mini Low (April 2025) (9.23), 33 models
Princeton HAL cost-aware agent leaderboard for SciCode scientific programming tasks.
Wordle Arena (Win Rate (%)): leader Gemini 2.5 Pro (100.0), 49 models
Wordle Arena evaluates models on daily Wordle games, measuring lexical deduction and constraint tracking from public game logs.
Fibble Arena (Win Rate (%)): leader Gemini 2.5 Pro (80.0), 47 models
Fibble Arena evaluates Wordle-style play when each clue can contain one lie, testing robust lexical reasoning under corrupted feedback.
Fibble2 Arena (Win Rate (%)): leader Gemini 3.1 Pro (50.0), 46 models
Fibble2 Arena evaluates Wordle-style play with two lies per clue, increasing the need to reason through inconsistent feedback.
Fibble3 Arena (Win Rate (%)): leader DeepSeek-R1 (33.33), 43 models
Fibble3 Arena evaluates Wordle-style play with three lies per clue, testing resilient hypothesis search under noisy constraints.
Fibble4 Arena (Win Rate (%)): leader Gemini 3.1 Pro (60.0), 43 models
Fibble4 Arena evaluates Wordle-style play with four lies per clue, stressing deduction from heavily corrupted feedback.
Fibble5 Arena (Win Rate (%)): leader Gemini 3.1 Pro (58.33), 46 models
Fibble5 Arena evaluates Wordle-style play with every clue position potentially deceptive, testing adversarial constraint reasoning.
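
The Fibble variants above all build on Wordle feedback with a fixed number of deceptive positions per clue. The sketch below shows one way that corruption could work, assuming a lie replaces the true label at a randomly chosen position with a different label. The label names and the add_lies helper are hypothetical, and the arenas' actual rules may differ.

```python
import random
from collections import Counter
from typing import List

LABELS = ("correct", "present", "absent")


def wordle_feedback(guess: str, answer: str) -> List[str]:
    """Standard Wordle feedback, handling repeated letters."""
    feedback = ["absent"] * len(guess)
    remaining = Counter()  # answer letters not matched exactly
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            feedback[i] = "correct"
        else:
            remaining[a] += 1
    for i, g in enumerate(guess):
        if feedback[i] != "correct" and remaining[g] > 0:
            feedback[i] = "present"
            remaining[g] -= 1
    return feedback


def add_lies(feedback: List[str], lies: int, rng: random.Random) -> List[str]:
    """Return a copy of the feedback with exactly `lies` positions made false."""
    noisy = list(feedback)
    for i in rng.sample(range(len(noisy)), lies):
        # Assumption: a lie is any label other than the true one.
        noisy[i] = rng.choice([label for label in LABELS if label != noisy[i]])
    return noisy


truth = wordle_feedback("crane", "caper")
print(truth)  # ['correct', 'present', 'present', 'absent', 'present']
noisy = add_lies(truth, lies=1, rng=random.Random(0))
print(sum(t != n for t, n in zip(truth, noisy)))  # 1
```

A solver facing this feedback cannot eliminate candidates on a single clue alone. It has to keep every answer consistent with the clue after correcting for up to `lies` wrong positions.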
APEX v1 (Score (%)): leader GPT-5 (67.0), 7 models
Mercor APEX v1 evaluates professional task performance across expert-domain work samples.
APEX v1 Consulting (Score (%)): leader Gemini 3 Flash (64.0), 3 models
Mercor APEX v1 consulting slice evaluates model performance on consulting-style professional reasoning tasks.
APEX v1 Investment Banking (Score (%)): leader GPT-5.2 Pro (64.0), 3 models
Mercor APEX v1 investment-banking slice evaluates finance-focused professional work tasks.
APEX v1 Medicine (MD) (Score (%)): leader GPT-5 (66.0), 3 models
Mercor APEX v1 medicine slice evaluates primary-care physician style professional tasks.
BountyBench DetectWorkflow (Success Rate (%)): leader claude-opus-4-6 (13.04), 1 model
BountyBench DetectWorkflow evaluates cybersecurity agents on identifying exploitable bounty-style workflows.
CocoaBench (Accuracy): leader CodeX (45.1), 10 models
CocoaBench evaluates autonomous agents on computer-control tasks, measuring successful completion across released aggregate runs.
GSM-MC (Accuracy (%)): leader DeepSeek-V4-Flash-FP8 (99.47), 68 models
GSM-MC evaluates grade-school math reasoning in a multiple-choice format.
HAL SWE-bench Verified Mini (Score (%)): leader Claude Sonnet 4.5 High (September 2025) (72.0), 18 models
HAL SWE-bench Verified Mini evaluates software issue resolution on a compact SWE-bench Verified subset.
Journalistic Bias Accuracy (Accuracy (%)): leader GPT-4o (44.44), 7 models
Journalistic Bias accuracy evaluates classification of media bias labels in news-style examples.
Journalistic Bias F1-macro (F1-Macro (%)): leader GPT-4o (50.24), 7 models
Journalistic Bias F1-macro evaluates balanced classification quality across media bias categories.
JudgeBench Coding (Accuracy (%)): leader DeepSeek-R1-0528 (97.62), 52 models
JudgeBench coding slice evaluates judge-model accuracy on code-answer comparisons.
JudgeBench Knowledge (Accuracy (%)): leader gemini-3.1-pro-preview (91.88), 52 models
JudgeBench knowledge slice evaluates judge-model accuracy on MMLU-Pro-derived knowledge comparisons.
JudgeBench Math (Accuracy (%)): leader qwen3.6-plus (96.43), 52 models
JudgeBench math slice evaluates judge-model accuracy on mathematical answer comparisons.
JudgeBench Reasoning (Accuracy (%)): leader DeepSeek-V3.2-Speciale (96.94), 52 models
JudgeBench reasoning slice evaluates judge-model accuracy on reasoning comparisons from LiveBench-style tasks.
MATH-MC Level 1 (Accuracy (%)): leader Kimi-K2.5 (99.3), 69 models
MATH-MC Level 1 evaluates multiple-choice mathematical reasoning on the easiest MATH difficulty tier.
MATH-MC Level 2 (Accuracy (%)): leader claude-opus-4.6 (99.66), 69 models
MATH-MC Level 2 evaluates multiple-choice mathematical reasoning on low-intermediate MATH problems.
MATH-MC Level 3 (Accuracy (%)): leader Kimi-K2.5 (99.73), 69 models
MATH-MC Level 3 evaluates multiple-choice mathematical reasoning on intermediate MATH problems.
MATH-MC Level 4 (Accuracy (%)): leader gemini-3.1-pro-preview (99.58), 69 models
MATH-MC Level 4 evaluates multiple-choice mathematical reasoning on advanced MATH problems.
MATH-MC Level 5 (Accuracy (%)): leader Qwen3.5-122B-A10B (99.92), 69 models
MATH-MC Level 5 evaluates multiple-choice mathematical reasoning on the hardest MATH difficulty tier.
RewardBench 2 Factuality (Accuracy (%)): leader gpt-5.5 (88.21), 52 models
RewardBench 2 factuality slice evaluates preference-model accuracy on factual response comparisons.
RewardBench 2 Focus (Accuracy (%)): leader DeepSeek-V4-Flash-FP8 (93.64), 52 models
RewardBench 2 focus slice evaluates preference-model accuracy on responses that must stay on task.
RewardBench 2 Math (Accuracy (%)): leader Qwen3.5-397B-A17B (91.8), 52 models
RewardBench 2 math slice evaluates preference-model accuracy on mathematical response comparisons.
RewardBench 2 Precise IF (Accuracy (%)): leader gemini-3.1-pro-preview (75.78), 52 models
RewardBench 2 precise-instruction-following slice evaluates preference accuracy on tightly constrained instructions.
RewardBench 2 Safety (Accuracy (%)): leader Qwen3-VL-235B-A22B-Thinking-FP8 (96.44), 52 models
RewardBench 2 safety slice evaluates preference-model accuracy on safety-sensitive response comparisons.
ALL Bench LLM (Average Numeric Benchmark Score (%)): leader DeepSeek R2 (85.76), 39 models
Composite LLM leaderboard aggregating cross-verified scores across reasoning, knowledge, coding, and instruction-following evaluations.
ALL Bench Multimodal (Average Numeric VLM Score (%)): leader GPT-5.2 (86.7), 16 models
Composite multimodal leaderboard aggregating model results across VLM, image generation, video generation, and agent-style multimodal evaluations.
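The Journalistic Bias entries above report accuracy and macro-F1 as separate leaderboards. As a rough illustration of why the two numbers can diverge on imbalanced labels, here is a minimal sketch; the label names and toy data are hypothetical, and the benchmark's own scoring code is not shown in this digest.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in gold or predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); a class with no hits and no errors scores 0 here.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy data: a classifier that always answers "left".
gold = ["left", "left", "left", "left", "center", "right"]
pred = ["left", "left", "left", "left", "left", "left"]

acc = accuracy(gold, pred)
f1 = macro_f1(gold, pred)
print(f"accuracy={acc:.4f} macro_f1={f1:.4f}")
```

In this toy case the majority-class guesser keeps a moderate accuracy while macro-F1 drops, because the two classes it never predicts each contribute an F1 of zero to the average.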

AI Benchmark Digest — 2026-06-24

2026-06-24T07:11:03.487710+00:00

Daily

New Benchmarks (7)

You're Absolutely Right! (Average anti-sycophancy score (1-5)): leader Claude Opus 4.8 (4.5), 16 models
Typebulb sycophancy benchmark with eight single- and multi-turn pressure prompts, scored 1-5 by a Gemini judge where higher scores mean the model resisted user-pleasing agreement and stayed truth-following.
OpenAI ChatGPT Pro - AIME 2024 (pass@1 accuracy (%)): leader O1 Pro (86.0), 3 models
OpenAI ChatGPT Pro launch result for AIME 2024 pass@1 accuracy, comparing o1-preview, o1, and o1 pro mode.
OpenAI ChatGPT Pro - Codeforces (pass@1 percentile): leader O1 Pro (90.0), 3 models
OpenAI ChatGPT Pro launch result for Codeforces pass@1 percentile, comparing competitive-programming performance across o1 variants.
OpenAI ChatGPT Pro - GPQA Diamond (pass@1 accuracy (%)): leader O1 Pro (79.0), 3 models
OpenAI ChatGPT Pro launch result for GPQA Diamond pass@1 accuracy, testing graduate-level science reasoning across o1 variants.
OpenAI ChatGPT Pro - AIME 2024 4/4 Reliability (4/4 reliability (%)): leader O1 Pro (80.0), 3 models
OpenAI ChatGPT Pro launch result for AIME 2024 4/4 reliability, where a question counts only if all four model attempts are correct.
OpenAI ChatGPT Pro - Codeforces 4/4 Reliability (4/4 reliability percentile): leader O1 Pro (75.0), 3 models
OpenAI ChatGPT Pro launch result for Codeforces 4/4 reliability percentile, using the worst-performing solution across four model samples.
OpenAI ChatGPT Pro - GPQA Diamond 4/4 Reliability (4/4 reliability (%)): leader O1 Pro (74.0), 3 models
OpenAI ChatGPT Pro launch result for GPQA Diamond 4/4 reliability, where an item counts only if all four attempts are correct.
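The 4/4 reliability entries above count a question only when all four attempts are correct. The sketch below contrasts that with a per-attempt average; treating pass@1 as the mean over all attempts is an assumption made for illustration, and the attempt data is hypothetical.

```python
def mean_attempt_accuracy(attempts):
    """Share of individual attempts that are correct (assumed stand-in for pass@1)."""
    total = sum(len(question) for question in attempts)
    return sum(sum(question) for question in attempts) / total


def all_attempts_reliability(attempts):
    """Share of questions where every attempt is correct (the 4/4 criterion)."""
    return sum(all(question) for question in attempts) / len(attempts)


# Hypothetical results: four questions, four attempts each.
results = [
    [True, True, True, True],
    [True, True, False, True],
    [False, False, False, False],
    [True, True, True, True],
]

per_attempt = mean_attempt_accuracy(results)
four_of_four = all_attempts_reliability(results)
print(f"per-attempt={per_attempt:.4f} 4/4={four_of_four:.4f}")
```

A single miss on the second question leaves the per-attempt figure mostly intact but removes that question from the 4/4 count, which is why the reliability numbers sit below the pass@1 numbers for the same model.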

Top-10 New Scores (1)

GPT-5.5 on CADGenBench: 0.4111 (#3)

New #1 Leaders (1)

AA Omniscience - Health: Grok Build 0.1 0616 (48.9) beat GPT-5.5 (Medium) by 0.1

AI Benchmark Digest — 2026-06-23

2026-06-23T07:11:28.475576+00:00

Daily

New Benchmarks (23)

FutureSearch BTF-3 (Pooled Brier/RPS Score): leader FutureSearch SOTA (0.116), 7 models
FutureSearch Bench to the Future 3 pooled pastcasting score across binary and numeric forecasting questions.
FutureSearch BTF-3 Binary (Binary Brier Score): leader FutureSearch SOTA (0.114), 7 models
FutureSearch Bench to the Future 3 binary-question slice scored by Brier score.
FutureSearch BTF-3 Numeric (Numeric RPS): leader GPT-5.5 (0.122), 7 models
FutureSearch Bench to the Future 3 numeric-question slice scored by ranked probability score.
FutureSearch DRB (Average Score): leader Claude Opus 4.6 (0.553), 22 models
FutureSearch Deep Research Bench aggregate score for open-web research agents across task categories.
FutureSearch DRB - Find Number (Average Score): leader Claude Opus 4.6 (0.738), 22 models
FutureSearch Deep Research Bench task slice for finding specific numeric answers from web evidence.
FutureSearch DRB - Derive Number (Average Score): leader Claude Sonnet 4.6 (0.477), 22 models
FutureSearch Deep Research Bench task slice for deriving numeric answers from gathered evidence.
FutureSearch DRB - Find Dataset (Average Score): leader Claude Opus 4.1 (0.708), 22 models
FutureSearch Deep Research Bench task slice for locating relevant datasets on the web.
FutureSearch DRB - Compile Dataset (Average Score): leader Claude Sonnet 4.6 (0.754), 22 models
FutureSearch Deep Research Bench task slice for compiling structured datasets from web sources.
FutureSearch DRB - Populate Reference Class (Average Score): leader Claude Opus 4.6 (0.398), 22 models
FutureSearch Deep Research Bench task slice for building reference classes from researched examples.
FutureSearch DRB - Gather Evidence (Average Score): leader Claude Opus 4 (0.395), 22 models
FutureSearch Deep Research Bench task slice for gathering supporting evidence from web sources.
FutureSearch DRB - Validate Claim (Average Score): leader Claude Sonnet 4.6 (0.799), 22 models
FutureSearch Deep Research Bench task slice for validating claims against web evidence.
FutureSearch DRB - Find Original Source (Average Score): leader Claude Opus 4.5 (0.591), 22 models
FutureSearch Deep Research Bench task slice for tracing facts back to original sources.
FutureSearch BTF-2 (Brier Score): leader FutureSearch Agent (0.119), 5 models
FutureSearch Bench to the Future 2 pastcasting benchmark scored by Brier score on resolved forecasting questions.
FutureSearch BTF-2 Calibration (Calibration Error): leader FutureSearch Agent (0.002), 5 models
FutureSearch Bench to the Future 2 calibration-error slice for forecasting agents.
FutureSearch BTF-2 Refinement (Refinement): leader FutureSearch Agent (0.081), 5 models
FutureSearch Bench to the Future 2 refinement slice measuring forecast sharpness and discrimination.
Epoch AI - Rli (Score): leader claude-opus-4-6_unknown (4.17), 8 models
Epoch AI - Algotune (Score): leader gpt-5.2-2025-12-11_medium (2.05), 18 models
Epoch AI - Vending Bench 2 (Score): leader gpt-5.5_unknown (10626.96), 45 models
Tau3 Banking (Success Rate (%)): leader GPT-5.5 (xhigh) (31.34), 64 models
Artificial Analysis Tau3 banking-domain customer-service tasks, measuring agent success at policy-grounded banking support workflows.
AA-Omniscience Accuracy (Accuracy (%)): leader Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (61.35), 414 models
Artificial Analysis Omniscience accuracy on factual-recall questions across law, health, business, software engineering, humanities, and STEM.
MathArena Arxiv False (Accuracy (%)): leader GPT-5.5 (xhigh) (50.0), 9 models
MathArena arXiv false-premise mathematics slice testing whether models avoid solving invalid or inconsistent research-style problems.
MathArena Arxiv (Accuracy (%)): leader Claude-Fable-5 (max) (86.67), 9 models
MathArena arXiv mathematics competition slice with research-style final-answer problems from recent arXiv-derived tasks.
HiL-Bench (Combined Pass@3 (%)): leader GPT-5.5 (29.1), 12 models
Scale Human-in-Loop benchmark measuring when agents should ask for help, escalate uncertainty, or continue autonomously.
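The FutureSearch BTF entries above are scored by Brier score for binary questions and ranked probability score (RPS) for numeric ones. A minimal sketch of the textbook forms of both follows; the forecasts are hypothetical, the RPS here is left unnormalized (some definitions divide by the number of bins minus one), and how FutureSearch bins numeric questions or pools the two scores is not stated in this digest.

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)


def ranked_probability_score(probs, outcome_index):
    """RPS for one question over ordered bins.

    Sums squared differences between the cumulative forecast and the
    cumulative outcome (a step from 0 to 1 at the bin that occurred).
    """
    cumulative_forecast = 0.0
    total = 0.0
    for i, p in enumerate(probs):
        cumulative_forecast += p
        cumulative_outcome = 1.0 if i >= outcome_index else 0.0
        total += (cumulative_forecast - cumulative_outcome) ** 2
    return total


# Hypothetical binary forecasts and resolutions.
binary = brier_score([0.9, 0.2, 0.6], [1, 0, 0])

# Hypothetical numeric question with three ordered bins; the middle bin occurred.
near_miss = ranked_probability_score([0.2, 0.5, 0.3], outcome_index=1)
# Same question shape, but most mass sits two bins away from the true bin.
far_miss = ranked_probability_score([0.7, 0.2, 0.1], outcome_index=2)
print(binary, near_miss, far_miss)
```

Under these definitions lower is better for both scores, and RPS penalizes probability mass more heavily the further it sits from the bin that occurred.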

Top-10 New Scores (3)

Claude Fable 5 on DeepSWE: 69.9 (#1)
Claude Opus 4.8 on DeepSWE: 59.0 (#4)
GPT-5.5 on DeepSWE: 64.4 (#3)

New #1 Leaders (1)

WebDev Arena: Claude 5 (1653.93) beat Claude Opus 4.7 (Unknown) by 87.08

AI Benchmark Digest — 2026-06-21

2026-06-21T07:54:51.530368+00:00

Daily

New Benchmarks (19)

Physical AI Bench - Understanding Overall (Overall Score (%)): leader Cosmos-Reason2-32B (70.8), 25 models
Physical AI Bench understanding track evaluating multimodal reasoning about physical scenarios across robotics, autonomous driving, space, time, and physics.
JMMMU-Pro - Overall (Accuracy (%)): leader Gemini 3 Pro (87.045), 14 models
JMMMU-Pro overall multimodal reasoning accuracy on Japanese cultural and academic questions.
JMMMU-Pro - Culture Specific (Accuracy (%)): leader Gemini 3 Pro (95.0), 14 models
JMMMU-Pro culture-specific accuracy on Japanese multimodal questions requiring local cultural knowledge.
JMMMU-Pro - Culture Agnostic (Accuracy (%)): leader Gemini 3 Pro (80.417), 14 models
JMMMU-Pro culture-agnostic accuracy on Japanese multimodal academic questions.
JMMMU-Pro - Japanese Art (Accuracy (%)): leader Gemini 3 Pro (91.333), 14 models
JMMMU-Pro Japanese art category accuracy.
JMMMU-Pro - Japanese Heritage (Accuracy (%)): leader Gemini 3 Pro (96.667), 14 models
JMMMU-Pro Japanese heritage category accuracy.
JMMMU-Pro - Japanese History (Accuracy (%)): leader Gemini 3 Pro (95.333), 14 models
JMMMU-Pro Japanese history category accuracy.
JMMMU-Pro - World History (Accuracy (%)): leader Gemini 3 Pro (96.667), 14 models
JMMMU-Pro world history category accuracy.
MMLongBench-Doc - Accuracy (Accuracy (%)): leader Claude 4.5 Opus (61.9), 19 models
MMLongBench-Doc accuracy for multimodal long-document understanding over lengthy document images and text.
OmniGAIA - Overall (Overall Accuracy (%)): leader Orchestra-o1-GPT-5 (72.8), 18 models
OmniGAIA overall accuracy on multimodal general-assistant questions spanning geography, technology, history, finance, sports, art, movies, science, and food.
OmniGAIA - Geo (Geography Accuracy (%)): leader Orchestra-o1-GPT-5 (72.5), 16 models
OmniGAIA geography category accuracy.
OmniGAIA - Tech (Technology Accuracy (%)): leader Orchestra-o1-GPT-5 (69.4), 16 models
OmniGAIA technology category accuracy.
OmniGAIA - History (History Accuracy (%)): leader Orchestra-o1-GPT-5 (75.8), 16 models
OmniGAIA history category accuracy.
OmniGAIA - Finance (Finance Accuracy (%)): leader Gemini-3-Pro (72.0), 16 models
OmniGAIA finance category accuracy.
OmniGAIA - Sport (Sport Accuracy (%)): leader Orchestra-o1-GPT-5 (83.8), 16 models
OmniGAIA sports category accuracy.
OmniGAIA - Art (Art Accuracy (%)): leader Orchestra-o1-GPT-5 (63.9), 16 models
OmniGAIA art category accuracy.
OmniGAIA - Movie (Movie Accuracy (%)): leader Orchestra-o1-GPT-5 (69.7), 16 models
OmniGAIA movie category accuracy.
OmniGAIA - Science (Science Accuracy (%)): leader Orchestra-o1-GPT-5 (73.1), 16 models
OmniGAIA science category accuracy.
OmniGAIA - Food (Food Accuracy (%)): leader Gemini-3-Pro (88.9), 16 models
OmniGAIA food category accuracy.

Weekly

New Benchmarks (39)

OpenAI GPT-5 System Card - HealthBench (Score (%)): leader GPT-5 (Thinking) (67.2), 7 models
OpenAI GPT-5 system-card benchmark for HealthBench.
OpenAI GPT-5 System Card - HealthBench Hard (Score (%)): leader GPT-5 (Thinking) (46.2), 7 models
OpenAI GPT-5 system-card benchmark for HealthBench Hard.
OpenAI GPT-5 System Card - HealthBench Consensus (Score (%)): leader GPT-5 Mini (Thinking) (96.5), 7 models
OpenAI GPT-5 system-card benchmark for HealthBench Consensus.
OpenAI GPT-5 System Card - MMLU Language Arabic (Accuracy): leader O3 (High) (0.904), 3 models
OpenAI GPT-5 system-card benchmark for MMLU Language Arabic.
OpenAI GPT-5 System Card - MMLU Language Bengali (Accuracy): leader GPT-5 (Thinking) (0.892), 3 models
OpenAI GPT-5 system-card benchmark for MMLU Language Bengali.
OpenAI GPT-5 System Card - MMLU Language Chinese (Accuracy): leader GPT-5 (Thinking) (0.902), 3 models
OpenAI GPT-5 system-card benchmark for MMLU Language Chinese.
OpenAI GPT-5 System Card - MMLU Language French (Accuracy): leader O3 (High) (0.906), 3 models
OpenAI GPT-5 system-card benchmark for MMLU Language French.
OpenAI GPT-5 System Card - MMLU Language German (Accuracy): leader O3 (High) (0.905), 3 models
OpenAI GPT-5 system-card benchmark for MMLU Language German.
OpenAI GPT-5 System Card - MMLU Language Hindi (Accuracy): leader GPT-5 (Thinking) (0.899), 3 models
OpenAI GPT-5 system-card benchmark for MMLU Language Hindi.
OpenAI GPT-5 System Card - MMLU Language Indonesian (Accuracy): leader GPT-5 (Thinking) (0.909), 3 models
OpenAI GPT-5 system-card benchmark for MMLU Language Indonesian.
OpenAI GPT-5 System Card - MMLU Language Italian (Accuracy): leader O3 (High) (0.912), 3 models
OpenAI GPT-5 system-card benchmark for MMLU Language Italian.
OpenAI GPT-5 System Card - MMLU Language Japanese (Accuracy): leader GPT-5 (Thinking) (0.898), 3 models
OpenAI GPT-5 system-card benchmark for MMLU Language Japanese.
OpenAI GPT-5 System Card - MMLU Language Korean (Accuracy): leader GPT-5 (Thinking) (0.896), 3 models
OpenAI GPT-5 system-card benchmark for MMLU Language Korean.
OpenAI GPT-5 System Card - MMLU Language Portuguese (Accuracy): leader GPT-5 (Thinking) (0.91), 3 models
OpenAI GPT-5 system-card benchmark for MMLU Language Portuguese.
OpenAI GPT-5 System Card - MMLU Language Spanish (Accuracy): leader O3 (High) (0.911), 3 models
OpenAI GPT-5 system-card benchmark for MMLU Language Spanish.
OpenAI GPT-5 System Card - MMLU Language Swahili (Accuracy): leader GPT-5 (Thinking) (0.88), 3 models
OpenAI GPT-5 system-card benchmark for MMLU Language Swahili.
OpenAI GPT-5 System Card - MMLU Language Yoruba (Accuracy): leader GPT-5 (Thinking) (0.806), 3 models
OpenAI GPT-5 system-card benchmark for MMLU Language Yoruba.
OpenAI GPT-5 System Card - BBQ Ambiguous (Accuracy): leader GPT-5 (Thinking) (0.93), 3 models
OpenAI GPT-5 system-card benchmark for BBQ Ambiguous.
OpenAI GPT-5 System Card - BBQ Disambiguated (Accuracy): leader GPT-5 (Thinking) (0.88), 3 models
OpenAI GPT-5 system-card benchmark for BBQ Disambiguated.
Physical AI Bench - Understanding Overall (Overall Score (%)): leader Cosmos-Reason2-32B (70.8), 25 models
Physical AI Bench understanding track evaluating multimodal reasoning about physical scenarios across robotics, autonomous driving, space, time, and physics.
JMMMU-Pro - Overall (Accuracy (%)): leader Gemini 3 Pro (87.045), 14 models
JMMMU-Pro overall multimodal reasoning accuracy on Japanese cultural and academic questions.
JMMMU-Pro - Culture Specific (Accuracy (%)): leader Gemini 3 Pro (95.0), 14 models
JMMMU-Pro culture-specific accuracy on Japanese multimodal questions requiring local cultural knowledge.
JMMMU-Pro - Culture Agnostic (Accuracy (%)): leader Gemini 3 Pro (80.417), 14 models
JMMMU-Pro culture-agnostic accuracy on Japanese multimodal academic questions.
JMMMU-Pro - Japanese Art (Accuracy (%)): leader Gemini 3 Pro (91.333), 14 models
JMMMU-Pro Japanese art category accuracy.
JMMMU-Pro - Japanese Heritage (Accuracy (%)): leader Gemini 3 Pro (96.667), 14 models
JMMMU-Pro Japanese heritage category accuracy.
JMMMU-Pro - Japanese History (Accuracy (%)): leader Gemini 3 Pro (95.333), 14 models
JMMMU-Pro Japanese history category accuracy.
JMMMU-Pro - World History (Accuracy (%)): leader Gemini 3 Pro (96.667), 14 models
JMMMU-Pro world history category accuracy.
MMLongBench-Doc - Accuracy (Accuracy (%)): leader Claude 4.5 Opus (61.9), 19 models
MMLongBench-Doc accuracy for multimodal long-document understanding over lengthy document images and text.
LLM Stats (TAU3-Bench) (Score (%)): leader MiMo-V2.5-Pro (72.9), 5 models
LLM Stats aggregate of Tau3-Bench agentic customer-service tasks across retail, telecom, airline, and banking-knowledge domains.
OmniGAIA - Overall (Overall Accuracy (%)): leader Orchestra-o1-GPT-5 (72.8), 18 models
OmniGAIA overall accuracy on multimodal general-assistant questions spanning geography, technology, history, finance, sports, art, movies, science, and food.
OmniGAIA - Geo (Geography Accuracy (%)): leader Orchestra-o1-GPT-5 (72.5), 16 models
OmniGAIA geography category accuracy.
OmniGAIA - Tech (Technology Accuracy (%)): leader Orchestra-o1-GPT-5 (69.4), 16 models
OmniGAIA technology category accuracy.
OmniGAIA - History (History Accuracy (%)): leader Orchestra-o1-GPT-5 (75.8), 16 models
OmniGAIA history category accuracy.
OmniGAIA - Finance (Finance Accuracy (%)): leader Gemini-3-Pro (72.0), 16 models
OmniGAIA finance category accuracy.
OmniGAIA - Sport (Sport Accuracy (%)): leader Orchestra-o1-GPT-5 (83.8), 16 models
OmniGAIA sports category accuracy.
OmniGAIA - Art (Art Accuracy (%)): leader Orchestra-o1-GPT-5 (63.9), 16 models
OmniGAIA art category accuracy.
OmniGAIA - Movie (Movie Accuracy (%)): leader Orchestra-o1-GPT-5 (69.7), 16 models
OmniGAIA movie category accuracy.
OmniGAIA - Science (Science Accuracy (%)): leader Orchestra-o1-GPT-5 (73.1), 16 models
OmniGAIA science category accuracy.
OmniGAIA - Food (Food Accuracy (%)): leader Gemini-3-Pro (88.9), 16 models
OmniGAIA food category accuracy.

New Models (35)

GPT-5.4 Pro (xHigh) — ELO 2980, #6
- FrontierMath - Tiers 1-3 (v2): 82.46 (#4/30)
- FrontierMath - Tier 4 (v2): 58.54 (#5/31)
Claude Fable 5 — ELO 2955, #7
- Chatbot Arena (Search): 1237.0 (#3/31)
- Epoch AI - ECI: 160.87 (#4/381)
Qwen 3.7 Max — ELO 2748, #8
- LLM Stats (GDPval-AA): 1308.0 (#12/33)
Claude Opus 4.8 — ELO 2678, #9
- Vals AI Vibe Code Bench: 82.72 (#2/66)
- Vals AI Terminal-Bench 2.1: 71.91 (#4/35)
- Chatbot Arena (Search): 1203.0 (#11/31)
GPT-5.5 — ELO 2502, #10
- LLM Stats (GDPval-AA): 1135.0 (#23/33)
GPT-5.4 — ELO 2370, #15
- LLM Stats (GDPval-AA): 1429.0 (#6/33)
Nemotron 3 Ultra — ELO 2352, #18
- LLM Stats (IMO-AnswerBench): 92.3 (#1/18)
- LLM Stats (LongBench v2): 61.9 (#3/16)
- LLM Stats (MMLU-ProX): 83.0 (#5/32)
- LLM Stats (Multi-Challenge): 63.8 (#6/29)
- LLM Stats (WMT24++): 83.7 (#6/23)
- LLM Stats (Finance Agent): 53.7 (#8/8)
- LLM Stats (GDPval-AA): 1183.0 (#18/33)
- ZeroEval GPQA Diamond: 87.0 (#34/226)
- LLM Stats (BrowseComp): 44.4 (#40/49)
Claude Opus 4.7 — ELO 2349, #19
- SEAL - SWE Atlas - Codebase QnA: 40.32 (#4/16)
- LLM Stats (GDPval-AA): 1542.0 (#4/33)
- SEAL - SWE Atlas - Test Writing: 38.52 (#7/17)
Gemini 3.5 Flash — ELO 2333, #22
- EQ-Bench Longform Writing: 71.8 (#17/116)
Qwen 3.7 Plus — ELO 2325, #23
- LLM Stats (DeepPlanning): 62.3 (#1/9)
- LLM Stats (ERQA): 69.8 (#1/20)
- LLM Stats (LVBench): 76.2 (#1/21)
- LLM Stats (MLVU): 87.4 (#1/10)
- LLM Stats (MRCR v2): 91.7 (#1/8)
- LLM Stats (RealWorldQA): 86.9 (#1/23)
- LLM Stats (SimpleVQA): 81.7 (#1/11)
- LLM Stats (Video-MME): 88.0 (#1/15)
- LLM Stats (MathVision): 90.3 (#2/29)
- LLM Stats (MAXIFE): 88.8 (#2/11)
GLM-5.2 — ELO 2280, #24
- LLM Stats (AIME 2026): 99.2 (#1/17)
- LLM Stats (NL2Repo): 48.9 (#1/9)
- NYT Connections Older Models: 92.7 (#1/108)
- Vending-Bench 2: 8313.78 (#2/49)
- LLM Stats (IMO-AnswerBench): 91.0 (#2/18)
- LiveBench Python: 90.0 (#3/126)
- FrontierSWE: 74.0 (#3/14)
- LiveBench TypeScript: 65.0 (#4/125)
- LLM Stats (MCP Atlas): 76.8 (#4/25)
- RuneBench: 3230.0 (#4/25)
Qwen Max — ELO 2249, #27
- Epoch AI - ECI: 154.12 (#48/381)
DeepSeek V4 Pro — ELO 2226, #32
- SEAL - SWE Atlas - Codebase QnA: 27.15 (#10/16)
- SEAL - SWE Atlas - Test Writing: 27.05 (#15/17)
Kimi K2.6 — ELO 2179, #40
- Vals AI Multimodal Index: 56.43 (#8/21)
- Vals AI CorpFin v2: 66.74 (#9/116)
- Vals AI LiveCodeBench: 86.77 (#9/122)
- Vals AI Finance Agent v2: 44.9 (#10/28)
- Vals AI (Vals Index): 55.17 (#11/30)
- Vals AI SAGE: 50.22 (#11/61)
- Vals AI MMMU: 86.3 (#11/76)
- Vals AI Finance Agent: 57.06 (#12/51)
- Vals AI Terminal-Bench 2.0: 57.3 (#13/68)
- Vals AI MMLU-Pro: 87.57 (#13/115)
MiniMax-M3 — ELO 2169, #42
- LLM Stats (GDPval-AA): 1431.0 (#5/33)
- Vellum - GPQA: 93.0 (#7/58)
- Vellum - HumanEval: 80.5 (#8/39)
- Vending-Bench 2: 2157.77 (#31/49)
- NYT Connections Extended: 74.2 (#31/85)
kimi-k2.7-code — ELO 2152, #46
- RuneBench: 3099.0 (#6/25)
- Lynchmark: 75.0 (#10/15)
- Agent Arena - Steerability: 7.31 (#12/28)
- Vending-Bench 2: 5082.94 (#15/49)
- AA GDPval: 1198.9 (#18/52)
- GDPval-AA: 1199.0 (#18/52)
- Chatbot Arena (Code): 1478.0 (#19/89)
- AA CritPt: 10.0 (#20/414)
- Agent Arena - Confirmed Success: 3.22 (#21/28)
- AA Omniscience - Science, Engineering & Mathematics: 44.8 (#21/414)
Grok 4.3 — ELO 2101, #62
- LLM Stats (GDPval-AA): 1100.0 (#25/33)
GLM-5.1 — ELO 2097, #64
- Vals AI Finance Agent: 57.66 (#10/51)
- Vals AI Finance Agent v2: 44.79 (#11/28)
- Vals AI SWE-bench Verified: 76.4 (#12/46)
- Vals AI (Vals Index): 52.45 (#13/30)
- LLM Stats (GDPval-AA): 1281.0 (#14/33)
- Vals AI ProofBench: 22.22 (#15/43)
- Vals AI LegalBench: 84.39 (#17/119)
- Vals AI Terminal-Bench 2.0: 53.93 (#17/68)
- Vals AI Terminal-Bench 2.1: 56.93 (#17/35)
- Vals AI MMLU-Pro: 86.9 (#23/115)
Step 3.7 Flash — ELO 2044, #85
- AA APEX-Agents: 14.82 (#15/25)
Qwen 3.6 Plus — ELO 2020, #94
- LLM Stats (GDPval-AA): 1160.0 (#21/33)
Qwen 3.5 397B A17B — ELO 2020, #95
- LLM Stats (GDPval-AA): 961.0 (#29/33)
GPT-5.4 Mini — ELO 2005, #106
- LLM Stats (GDPval-AA): 1190.0 (#17/33)
Qwen 3.6 27B — ELO 1989, #110
- LLM Stats (GDPval-AA): 1158.0 (#22/33)
Qwen 3.5 122B A10B — ELO 1981, #115
- LLM Stats (GDPval-AA): 985.0 (#27/33)
Command A+ — ELO 1876, #171
- LLM Stats (MathVista): 80.6 (#4/36)
- LLM Stats (CharXiv-D): 88.0 (#5/14)
- LLM Stats (WMT24++): 81.0 (#7/23)
- LLM Stats (CharXiv-R): 52.7 (#35/40)
GPT-5.4 Nano — ELO 1860, #178
- LLM Stats (GDPval-AA): 1115.0 (#24/33)
MiniMax-M2.5 — ELO 1837, #199
- Vals AI SWE-bench Verified: 74.2 (#18/46)
- Vals AI Terminal-Bench 2.0: 41.57 (#30/68)
- Vals AI MedQA: 92.53 (#31/95)
- Vals AI IOI: 6.67 (#34/55)
- Vals AI ProofBench: 4.0 (#39/43)
- Vals AI GPQA: 82.07 (#39/116)
- Vals AI CaseLaw v2: 53.48 (#40/54)
- Vals AI Finance Agent: 38.58 (#41/51)
- Vals AI LiveCodeBench: 79.21 (#53/122)
- Vals AI CorpFin v2: 59.6 (#60/116)
Claude Haiku 4.5 — ELO 1831, #203
- LLM Stats (GDPval-AA): 902.0 (#32/33)
Mistral Medium 3.5 — ELO 1793, #228
- LLM Stats (GDPval-AA): 926.0 (#31/33)
Gemma 4 31B — ELO 1736, #266
- LLM Stats (GDPval-AA): 783.0 (#33/33)
Qwen 3.6 35B A3B — ELO 1723, #275
- LLM Stats (GDPval-AA): 1056.0 (#26/33)
nemotron-3-ultra-550B-a55B — ELO 1712, #285
- Design Arena (3D): 1203.0 (#51/120)
- Design Arena (Game Dev): 1200.0 (#67/132)
- Design Arena (UI Components): 1162.0 (#74/126)
- Design Arena (Data Viz): 1139.0 (#90/128)
Laguna XS.2 — ELO 1664, #341
- Chatbot Arena (Code): 1298.0 (#70/89)
Laguna M.1 — ELO 1651, #358
- Chatbot Arena (Code): 1347.0 (#60/89)
Gemma 4 12B — ELO 1400, #813
- Wolfram LLM Benchmarking Project: 22.8 (#380/489)

Top-10 New Scores (9)

Claude Fable 5 on Chatbot Arena (Search): 1237.0 (#3)
Claude Fable 5 on Epoch AI - ECI: 160.87 (#4)
Claude Opus 4.8 on Chatbot Arena (Search): 1203.0 (#11)
Claude Opus 4.8 on Vals AI Terminal-Bench 2.1: 71.91 (#4)
Claude Opus 4.8 on Vals AI Vibe Code Bench: 82.72 (#2)
GPT-5.4 Pro on FrontierMath - Tier 4 (v2): 58.54 (#5)
GPT-5.4 Pro on FrontierMath - Tiers 1-3 (v2): 82.46 (#4)
GPT-5.5 on LLM Stats (GDPval-AA): 1135.0 (#23)
Qwen 3.7 Max on LLM Stats (GDPval-AA): 1308.0 (#12)

New #1 Leaders (24)

WDCD R3 Pressure Integrity: Qwen 3 Max (190.0) beat Claude Opus 4.7 by 90.0
LLM Stats (MRCR v2): Qwen 3.7 Plus (91.7) beat Gemma 4 31B by 25.3
LLM Stats (DeepPlanning): Qwen 3.7 Plus (62.3) beat Qwen 3.6 Plus by 20.8
Coding Agent Leaderboard - swe-bench-pro--ansible: Opus 4.8 + Claude Code (69.8) beat Sonnet 4.6 + Claude Code by 19.8
Coding Agent Leaderboard: Opus 4.8 + Claude Code (78.3) beat Sonnet 4.6 + Claude Code by 13.5
Design Arena (Website): silo (1357.0) beat Claude Fable 5 by 12.0
Coding Agent Leaderboard - swe-bench-verified: Opus 4.8 + Claude Code (86.8) beat Sonnet 4.6 + Claude Code by 7.2
WDCD R2 In-Document Resistance: Gemini 2.5 Pro (90.0) beat Grok 4 by 6.0
Agent Security League - Security Correctness: Claude Fable 5 (29.0) beat GPT-5.5 by 5.0
Terminal-Bench 2.1 (Claude Code): Claude Fable 5 (83.1) beat Claude Opus 4.8 by 4.2
LLM Stats (ERQA): Qwen 3.7 Plus (69.8) beat Qwen 3.6 Plus by 4.1
LLM Stats (SimpleVQA): Qwen 3.7 Plus (81.7) beat GLM-5V Turbo by 3.5
LLM Stats (AIME 2026): GLM-5.2 (99.2) beat Kimi K2.6 by 2.8
LLM Stats (IMO-AnswerBench): Nemotron 3 Ultra (92.3) beat Qwen 3.7 Max by 2.3
Terminal-Bench 2.1 (Terminus 2): Claude Fable 5 (80.4) beat GPT-5.5 by 2.2
Epoch AI - ECI: Claude Fable 5 (Max) (160.87) beat GPT-5.5 Pro (xHigh) by 1.97
LLM Stats (NL2Repo): GLM-5.2 (48.9) beat Qwen 3.7 Max by 1.7
LLM Stats (RealWorldQA): Qwen 3.7 Plus (86.9) beat Qwen 3.6 Plus by 1.5
Wolfram LLM Benchmarking Project: Claude Fable 5 thinking max (73.3) beat Claude Opus 4.7 (Thinking) by 0.8
LLM Stats (LVBench): Qwen 3.7 Plus (76.2) beat Kimi K2.5 by 0.3
LLM Stats (Video-MME): Qwen 3.7 Plus (88.0) beat MiMo-V2.5 by 0.3
NYT Connections Older Models: GLM-5.2 (92.7) beat Sherlock Think Alpha by 0.2
Agent Arena - Tool Hallucination: Grok 4.3 (High) (0.11) beat Grok 4.3 by 0.15
LLM Stats (MLVU): Qwen 3.7 Plus (87.4) beat Qwen 3.5 122B A10B by 0.1

AI Benchmark Digest — 2026-06-20

2026-06-20T07:12:32.939650+00:00

Daily

New Benchmarks (18)

SQL Capability Leaderboard (Average Ability Score): leader SQLShift (83.4), 35 models
Monthly SQL capability leaderboard measuring SQL understanding, optimization, and dialect conversion ability scores for LLMs and SQL-focused applications.
SQL Capability - Dialect Conversion (Ability Score): leader SQLShift (83.4), 34 models
SQL Capability leaderboard slice measuring conversion of SQL queries across database dialects.
SQL Capability - SQL Optimization (Ability Score): leader SQLFlash (72.1), 34 models
SQL Capability leaderboard slice measuring query optimization, index advice, and semantic-preserving SQL improvements.
SQL Capability - SQL Understanding (Ability Score): leader Gemini 3 Pro (86.0), 33 models
SQL Capability leaderboard slice measuring SQL execution understanding, explanation, and syntax-error detection.
Software Engineering Arena - Model Arena (Elo Rating): leader MiMo-V2-Flash (1002.0), 23 models
Software Engineering Arena model-only preference leaderboard for software-engineering task performance.
boogiebench (Elo Rating): leader gemini-3-pro-preview (1683.0), 13 models
Elo-ranked LLM music-composition arena where text models write Strudel JavaScript compositions from music prompts and users vote on the generated tracks.
Lean AI Formalization Leaderboard (Solved Problems): leader Aristotle (Harmonic) (92.0), 29 models
Submission-based Lean formalization leaderboard for hard mathematical problems where accepted solutions must pass automated comparator verification.
WorldCupBench (Quiniela Points): leader MiMo-V2.5-Pro (20.0), 11 models
Live 2026 World Cup forecasting benchmark where frontier models submitted frozen tournament predictions and are scored by match-prediction points and Brier score as results arrive.
WorldCupBench - Brier Skill (100 - Brier Total): leader Gemini-3.5-Flash (92.0625), 11 models
WorldCupBench live tournament forecasting score transformed as 100 minus total Brier score, so higher is better.
SWE-bench Multilingual (Mythos Preview System Card) (Resolved (%)): leader Claude Mythos Preview (87.3), 2 models
Claude Mythos Preview system-card run of SWE-bench Multilingual, measuring multilingual software issue resolution.
SWE-bench Multimodal (Mythos Preview System Card) (Resolved (%)): leader Claude Mythos Preview (59.0), 2 models
Claude Mythos Preview system-card run of SWE-bench Multimodal, measuring software issue resolution when visual context is part of the task.
Terminal-Bench 2.0 (Mythos Preview System Card) (Mean Reward (%)): leader Claude Mythos Preview (82.0), 4 models
Claude Mythos Preview system-card run of Terminal-Bench 2.0, measuring terminal-based agent task completion with the Terminus harness.
USAMO 2026 (Mythos Preview System Card) (Score (%)): leader Claude Mythos Preview (97.6), 4 models
Claude Mythos Preview system-card evaluation on 2026 USAMO proof problems, scored with MathArena-style model judging.
LAB-Bench FigQA (Mythos Preview No Tools) (Accuracy (%)): leader Claude Mythos Preview (79.7), 3 models
Claude Mythos Preview system-card LAB-Bench FigQA result without tools, measuring biology-figure visual reasoning.
LAB-Bench FigQA (Mythos Preview Tools) (Accuracy (%)): leader Claude Mythos Preview (89.0), 3 models
Claude Mythos Preview system-card LAB-Bench FigQA result with Python/image tools, measuring biology-figure visual reasoning.
ScreenSpot-Pro (Mythos Preview No Tools) (Accuracy (%)): leader Claude Mythos Preview (79.5), 3 models
Claude Mythos Preview system-card ScreenSpot-Pro result without tools, measuring GUI grounding in high-resolution professional screenshots.
ScreenSpot-Pro (Mythos Preview Tools) (Accuracy (%)): leader Claude Mythos Preview (92.8), 3 models
Claude Mythos Preview system-card ScreenSpot-Pro result with Python/image tools, measuring GUI grounding in high-resolution professional screenshots.
Google Gemini 3 Deep Think - GPQA Diamond (Score (%)): leader Gemini 3 Deep Think (93.8), 2 models
Google Gemini 3 launch result for GPQA Diamond, reporting graduate-level science reasoning accuracy for Gemini 3 Deep Think and Gemini 3 Pro.
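The WorldCupBench Brier Skill entry above is described as 100 minus the total Brier score. The sketch below shows one way that transform could work for three-outcome match forecasts; the per-match formula, the outcome names, and the sample predictions are all assumptions for illustration, since the digest does not say how the benchmark computes the per-match Brier term.

```python
OUTCOMES = ("home", "draw", "away")


def match_brier(probs, result):
    """Multi-outcome Brier term for one match: squared error summed over outcomes."""
    return sum((probs[o] - (1.0 if o == result else 0.0)) ** 2 for o in OUTCOMES)


def brier_skill(predictions, results):
    """100 minus the Brier total across matches, so higher is better."""
    total = sum(match_brier(p, r) for p, r in zip(predictions, results))
    return 100.0 - total


# Hypothetical frozen predictions for two matches and their results.
predictions = [
    {"home": 0.60, "draw": 0.25, "away": 0.15},
    {"home": 0.20, "draw": 0.30, "away": 0.50},
]
results = ["home", "draw"]

skill = brier_skill(predictions, results)
print(f"{skill:.3f}")
```

Because the total is a sum rather than a mean, a score of this form would drift downward as more matches resolve, so values are only comparable between models scored on the same set of matches.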

New #1 Leaders (1)

NYT Connections Older Models: GLM-5.2 (92.7) beat Sherlock Think Alpha by 0.2

AI Benchmark Digest — 2026-06-19

2026-06-19T07:25:58.875498+00:00

Daily

New Benchmarks (7)

Benchmarks.bio - TxBench-PP (Pass Rate (%)): leader Claude Opus 4.8 (59.33), 11 models
Benchmarks.bio TxBench-PP evaluates agentic perturbation/transcriptomics analysis workflows with deterministic grading over realistic biological data-analysis tasks.
JEE/NEET LLM Benchmark - JEE Advanced 2025 (Exam Score (%)): leader Gemini 2.5 Pro Preview 05-06 (89.72), 5 models
JEE/NEET LLM Benchmark split for 2025 JEE Advanced image questions, reporting exam-score percentage for multimodal models.
JEE/NEET LLM Benchmark - JEE Advanced 2026 (Exam Score (%)): leader Gemini 3.1 Pro Preview (94.72), 6 models
JEE/NEET LLM Benchmark split for 2026 JEE Advanced image questions, reporting exam-score percentage for multimodal models.
JEE/NEET LLM Benchmark - NEET 2026 (Exam Score (%)): leader Gemini 3 Flash Preview (99.31), 14 models
JEE/NEET LLM Benchmark split for 2026 NEET image questions, reporting exam-score percentage for multimodal models.
MyPCBench (Perfect Rate (%)): leader Claude Opus 4.6 (55.4), 6 models
Personal-computer task benchmark evaluating computer-use agents on realistic desktop workflows with screenshots, browser/filesystem actions, shell access, and rubric-based grading.
Opus Magnum Bench (Human-normalized score (%)): leader Claude Fable 5 (High) (60.15), 17 models
Puzzle-solving benchmark based on Opus Magnum campaign levels, scoring models by whether they synthesize valid alchemy-machine solutions and how close those solutions are to human-best efficiency.
OpenAI LifeSciBench (Exact Pass Rate (%)): leader GPT-Rosalind-5.5 (36.1), 2 models
OpenAI life-sciences benchmark of expert-authored research tasks spanning artifact-heavy scientific workflows, rubric-graded for exact pass rate across life-science domains.

Top-10 New Scores (4)

Claude Opus 4.8 on Vals AI Terminal-Bench 2.1: 71.91 (#4)
Claude Opus 4.8 on Vals AI Vibe Code Bench: 82.72 (#2)
GPT-5.5 on Vals AI Terminal-Bench 2.1: 76.4 (#2)
GPT-5.5 on Vals AI Vibe Code Bench: 69.85 (#5)

New #1 Leaders (1)

Wolfram LLM Benchmarking Project: Claude Fable 5 thinking max (73.3) beat Claude Opus 4.7 (Thinking) by 0.8

AI Benchmark Digest — 2026-06-18

2026-06-18T07:17:43.853003+00:00

Daily

New Benchmarks (9)

AISI Cyber Cooling Tower 10M (Avg Steps (/7)): leader Claude Opus 4.6 (0.1), 7 models
AISI cyber range: "Cooling Tower" — a 7-step industrial-control-network attack simulation. Reports average steps completed at a 10M token budget.
AISI Cyber Cooling Tower 100M (Avg Steps (/7)): leader Claude Opus 4.6 (1.4), 5 models
AISI cyber range: "Cooling Tower" — a 7-step industrial-control-network attack simulation. Reports average steps completed at a 100M token budget.
OpenAI CTF (Professional) (pass@12 (%)): leader GPT-5.5 (96.3), 3 models
OpenAI system-card subset of professional capture-the-flag tasks, reporting pass@12 over offensive-security rollouts with a Linux tool harness.
CVE-Bench (pass@1 (%)): leader GPT-5.5 (93.1), 4 models
Cybersecurity benchmark for autonomous web vulnerability exploitation across 40 critical CVEs in zero-day and one-day settings.
OpenAI Cyber Ranges (Combined Pass Rate (%)): leader GPT-5.5 (93.33), 4 models
OpenAI internal cyber-range suite measuring end-to-end cyber operations across realistic emulated networks.
ExploitGym (Successful Intended Exploits (#)): leader Claude Mythos Preview (157.0), 7 models
Real-world cybersecurity agent benchmark measuring whether AI agents can turn known software vulnerabilities into working, intended exploits across userspace, V8, and Linux kernel targets.
CyScenarioBench (Average Success Rate (%)): leader Claude Mythos 5 (36.7), 9 models
Scenario-based offensive security benchmark from Irregular, measuring whether agents can plan and complete full multi-stage attack scenarios in realistic environments.
Lyptus Cyber Time Horizons - InterCode-CTF (pass@1 at 2M tokens (%)): leader Claude Opus 4.6 (100.0), 3 models
Lyptus Research offensive cyber time-horizon run of InterCode-CTF, measuring pass@1 on CTF tasks at a 2M token budget.
Lyptus Cyber Time Horizons - NL2Bash (pass@1 at 2M tokens (%)): leader GPT-5.3 Codex (100.0), 3 models
Lyptus Research offensive cyber time-horizon run of NL2Bash, measuring command-generation success at a 2M token budget.
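Several entries above report pass@k figures (pass@1, pass@12). The digest does not say how each source computes them. The sketch below shows one widely used definition, the unbiased estimator 1 - C(n-c, k)/C(n, k) over n sampled attempts of which c succeeded. The function name pass_at_k and the sample counts are illustrative assumptions and are not taken from any of the listed benchmarks.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one task from n sampled attempts, c of which passed.

    Returns the probability that at least one of k attempts drawn without
    replacement from the n samples is a success.
    """
    if n <= 0 or not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require n > 0, 0 <= c <= n, and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any draw of k includes a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over tasks, given (n, c) pairs per task, as a percentage."""
    return 100.0 * sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Hypothetical per-task (attempts, successes) counts for illustration only.
    tasks = [(12, 0), (12, 1), (12, 6)]
    print(f"pass@1:  {mean_pass_at_k(tasks, 1):.2f}%")
    print(f"pass@12: {mean_pass_at_k(tasks, 12):.2f}%")
```

Under this definition pass@12 can only match or exceed pass@1 for the same samples, so scores reported at different k values are not directly comparable.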

Top-10 New Scores (2)

GPT-5.4 Pro on FrontierMath - Tier 4 (v2): 58.54 (#5)
GPT-5.4 Pro on FrontierMath - Tiers 1-3 (v2): 82.46 (#4)

New #1 Leaders (2)

Terminal-Bench 2.1 (Claude Code): Claude Fable 5 (83.1) beat Claude Opus 4.8 by 4.2
Terminal-Bench 2.1 (Terminus 2): Claude Fable 5 (80.4) beat GPT-5.5 by 2.2

AI Benchmark Digest — 2026-06-17

2026-06-17T07:26:00.903157+00:00

Daily

New Benchmarks (4)

LLM Stats (Finance Agent v2) (Score (%)): leader Gemini 3.5 Flash (57.86), 25 models
LLM Stats (FrontierSWE) (Score (%)): leader Claude Fable 5 (90.0), 13 models
LLM Stats (Legal Agent Benchmark) (Score (%)): leader Claude Fable 5 (13.3), 11 models
LLM Stats (SkillsBench) (Score (%)): leader Qwen3.7 Max (59.2), 5 models

Top-10 New Scores (12)

Claude Fable 5 on SWE-Marathon: 24.0 (#2)
GLM-5.2 on BenchLM: 94.0 (#3)
GLM-5.2 on LLM Stats (HMMT 2025): 94.4 (#9)
GLM-5.2 on LLM Stats (HMMT Feb 26): 92.5 (#6)
GLM-5.2 on LLM Stats (IMO-AnswerBench): 91.0 (#2)
GLM-5.2 on LLM Stats (MCP Atlas): 76.8 (#4)
GLM-5.2 on LLM Stats (Toolathlon): 48.2 (#8)
GLM-5.2 on PinchBench: 87.79 (#18)
GLM-5.2 on RuneBench: 3230.0 (#4)
GLM-5.2 on SWE-Marathon: 13.0 (#4)
GLM-5.2 on ZeroEval GPQA Diamond: 91.2 (#12)
Qwen 3.7 Max on LLM Stats (GDPval-AA): 1308.0 (#12)

New #1 Leaders (15)

LLM Stats (DeepPlanning): Qwen 3.7 Plus (62.3) beat Qwen 3.6 Plus by 20.8
Coding Agent Leaderboard - swe-bench-pro--ansible: Opus 4.8 + Claude Code (69.8) beat Sonnet 4.6 + Claude Code by 19.8
LLM Stats (MRCR v2): Qwen 3.7 Plus (91.7) beat U2 by 15.09
Coding Agent Leaderboard: Opus 4.8 + Claude Code (78.3) beat Sonnet 4.6 + Claude Code by 13.5
Design Arena (Website): silo (1357.0) beat Claude Fable 5 by 12.0
Coding Agent Leaderboard - swe-bench-verified: Opus 4.8 + Claude Code (86.8) beat Sonnet 4.6 + Claude Code by 7.2
LLM Stats (ERQA): Qwen 3.7 Plus (69.8) beat Qwen 3.6 Plus by 4.1
LLM Stats (SimpleVQA): Qwen 3.7 Plus (81.7) beat GLM-5V Turbo by 3.5
LLM Stats (AIME 2026): GLM-5.2 (99.2) beat Kimi K2.6 by 2.8
LLM Stats (IMO-AnswerBench): Nemotron 3 Ultra (550B A55B) (92.3) beat Qwen 3.7 Max by 2.3
LLM Stats (NL2Repo): GLM-5.2 (48.9) beat Qwen 3.7 Max by 1.7
LLM Stats (RealWorldQA): Qwen 3.7 Plus (86.9) beat Qwen 3.6 Plus by 1.5
LLM Stats (LVBench): Qwen 3.7 Plus (76.2) beat Kimi K2.5 by 0.3
LLM Stats (Video-MME): Qwen 3.7 Plus (88.0) beat MiMo-V2.5 by 0.3
LLM Stats (MLVU): Qwen 3.7 Plus (87.4) beat Qwen 3.5 122B A10B by 0.1

AI Benchmark Digest — 2026-06-16

2026-06-16T08:27:51.523101+00:00

Daily

New Benchmarks (7)

SWE-Marathon (Pass@1 (%)): leader Claude Opus 4.8 (26.0), 9 models
Long-horizon software engineering benchmark where coding agents work on realistic repository tasks under marathon-scale time budgets, reporting pass@1 for end-to-end completed tasks.
InferenceBench (Speedup Score): leader Claude Fable 5 (Low) (8.74), 22 models
Benchmark for coding agents optimizing inference workloads. Agents tune serving configurations and implementation choices across latency, throughput, and all-in-one scenarios.
AgenticVBench (Average Success (%)): leader Claude Fable 5 (32.4), 9 models
Agentic video benchmark where autonomous agents perform multi-step video repurposing, sequencing, repair, and assembly tasks, scored by average task success.
TERMS-Bench (Mean Utility): leader GLM 5.1 (11.7), 15 models
Negotiation benchmark for LLM agents bargaining over terms under changing utility, urgency, and no-deal regimes, reporting mean utility and agreement metrics.
Structured Output Benchmark (Overall (%)): leader GPT-5.4 (87.0), 28 models
Structured-output benchmark measuring schema-constrained generation with value accuracy, faithfulness, JSON validity, path recall, type safety, and perfect-output rates.
BenGER (Aggregate Accuracy (%)): leader Gemini 3.1 Pro (77.0), 12 models
German-law benchmark for subsumption-based legal reasoning, evaluating model answers across Benchathon, ZJS, and doctrinal-principles corpora.
BenchLM (Overall Score): leader Claude Mythos 5 (99.0), 123 models
Composite LLM leaderboard aggregating current model performance across agentic, coding, reasoning, grounded multimodal, knowledge, multilingual, instruction-following, and math categories.

Top-10 New Scores (3)

Claude Fable 5 on Chatbot Arena (Search): 1237.0 (#3)
Claude Fable 5 on Epoch AI - ECI: 160.87 (#3)
Claude Opus 4.8 on Chatbot Arena (Search): 1203.0 (#11)

New #1 Leaders (2)

LLM Stats (MRCR v2): U2 (76.61) beat Gemma 4 31B by 10.21
Epoch AI - ECI: Claude Fable 5 (Max) (160.87) beat GPT-5.5 Pro (xHigh) by 1.97

AI Benchmark Digest — 2026-06-15

2026-06-15T08:24:20.247016+00:00

Daily

New Benchmarks (145)

Open LLM Leaderboard - IFEval (Score): leader Llama-3.3-70B-Instruct (89.98), 4576 models
Open LLM Leaderboard - BBH (Score): leader Benchmaxx-Llama-3.2-1B-Instruct (76.7), 4576 models
Open LLM Leaderboard - MATH Level 5 (Score): leader AceMath-72B-Instruct (71.45), 4576 models
Open LLM Leaderboard - GPQA (Score): leader L3.3-MS-Nevoria-70b (29.42), 4576 models
Open LLM Leaderboard - MuSR (Score): leader T3Q-Qwen2.5-14B-Instruct-1M-e3 (38.69), 4576 models
Open LLM Leaderboard - MMLU-Pro (Score): leader calme-3.2-instruct-78b (70.03), 4576 models
AI for Education Pedagogy (Accuracy (%)): leader GPT-5.5 (92.1), 216 models
AI for Education Pedagogy - Maths (Accuracy (%)): leader Gemini-3.1 Pro (94.44), 216 models
AI for Education Pedagogy - Primary (Accuracy (%)): leader GPT-5.5 (96.71), 216 models
AI for Education Pedagogy - Science (Accuracy (%)): leader Qwen3.5 Plus (95.08), 216 models
AI for Education Pedagogy - Secondary (Accuracy (%)): leader GPT-5.5 (91.04), 216 models
AI for Education Pedagogy - Social studies (Accuracy (%)): leader o3 (91.82), 216 models
AI for Education Pedagogy - Technology (Accuracy (%)): leader Kimi K2.5 (89.62), 216 models
AI for Education SEND (Accuracy (%)): leader GPT-5.5 (88.07), 208 models
AI for Education Visual Maths (Accuracy (%)): leader GPT-5.5 (89.87), 61 models
AI for Education Visual Maths - Algebra (Accuracy (%)): leader Gemini-2.5 Pro (100.0), 61 models
AI for Education Visual Maths - Geometry (Accuracy (%)): leader GPT-5.5 (88.46), 61 models
AI for Education Visual Maths - Measurement (Accuracy (%)): leader GPT-5.5 (97.3), 61 models
AI for Education Visual Maths - Number and Operations (Accuracy (%)): leader GPT-5.5 (83.78), 61 models
AI for Education Visual Maths - Statistics and Probability (Accuracy (%)): leader GPT-5.5 (85.71), 61 models
AI for Education Visual Reasoning (Accuracy (%)): leader Gemini-3.5 Flash (86.0), 63 models
AI for Education Visual Reasoning - match (figure) (Accuracy (%)): leader Gemini-3.5 Flash (85.2), 63 models
AI for Education Visual Reasoning - match (process) (Accuracy (%)): leader Gemini-3 Flash (77.8), 63 models
AI for Education Visual Reasoning - odd one out (Accuracy (%)): leader Gemini-3.5 Flash (80.5), 63 models
AI for Education Visual Reasoning - pattern completion (2d) (Accuracy (%)): leader Gemini-3.1 Pro (86.3), 63 models
AI for Education Visual Reasoning - pattern completion (linear) (Accuracy (%)): leader Gemini-3.5 Flash (91.5), 63 models
AI for Education Visual Reasoning - reasoning by analogy (Accuracy (%)): leader Gemini-3.5 Flash (88.8), 63 models
SWE-bench Verified (Opus 4.6 System Card) (Resolved (%)): leader Claude Opus 4.5 (Thinking) (80.9), 5 models
Terminal-Bench 2.0 (Opus 4.6 System Card) (Pass Rate (%)): leader Claude Opus 4.6 (Thinking) (65.4), 5 models
Tau2 Bench Retail (Opus 4.6 System Card) (Score (%)): leader Claude Opus 4.6 (Thinking) (91.9), 5 models
Tau2 Bench Telecom (Opus 4.6 System Card) (Score (%)): leader Claude Opus 4.6 (Thinking) (99.3), 5 models
MCP-Atlas (Opus 4.6 System Card) (Score (%)): leader Claude Opus 4.5 (Thinking) (62.3), 5 models
ARC-AGI-2 Verified (Opus 4.6 System Card) (Score (%)): leader Claude Opus 4.6 (Thinking) (68.8), 5 models
GPQA Diamond (Opus 4.6 System Card) (Accuracy (%)): leader GPT-5.2 (93.2), 5 models
MMMU-Pro No Tools (Opus 4.6 System Card) (Score (%)): leader Gemini 3 Pro (81.0), 5 models
MMMLU (Opus 4.6 System Card) (Accuracy (%)): leader Gemini 3 Pro (91.8), 5 models
SWE-bench Verified (Fable/Mythos) (Resolved (%)): leader Claude Mythos 5 (95.5), 5 models
Terminal-Bench 2.1 (Fable/Mythos) (Mean Reward (%)): leader Claude Mythos 5 (88.0), 5 models
BrowseComp (Fable/Mythos Single-Agent) (Score (%)): leader Claude Mythos 5 (88.0), 4 models
BrowseComp (Fable/Mythos Multi-Agent) (Score (%)): leader Claude Fable 5 (93.3), 2 models
Humanity's Last Exam (Fable/Mythos No Tools) (Score (%)): leader Claude Mythos 5 (59.0), 5 models
Humanity's Last Exam (Fable/Mythos Tools) (Score (%)): leader Claude Mythos Preview (64.7), 5 models
CharXiv Reasoning (Fable/Mythos No Tools) (Score (%)): leader Claude Mythos 5 (88.9), 3 models
CharXiv Reasoning (Fable/Mythos Tools) (Score (%)): leader Claude Mythos 5 (93.5), 3 models
BioMysteryBench Human Solvable (Fable/Mythos) (Score (%)): leader Claude Mythos 5 (83.9), 4 models
BioMysteryBench Human Difficult (Fable/Mythos) (Score (%)): leader Claude Mythos 5 (46.1), 4 models
OSWorld-Verified (Fable/Mythos) (Score (%)): leader Claude Mythos Preview (85.4), 7 models
CritPt (Fable/Mythos) (Score (%)): leader Claude Mythos 5 (28.6), 4 models
ArxivMath (Fable/Mythos) (Score (%)): leader Claude Mythos 5 (78.5), 5 models
RiemannBench (Fable/Mythos) (Score (%)): leader Claude Mythos 5 (55.0), 3 models
GraphWalks BFS 256K (Fable/Mythos) (Score (%)): leader Claude Mythos 5 (91.1), 4 models
GraphWalks Parents 256K (Fable/Mythos) (Score (%)): leader Claude Mythos 5 (99.96), 4 models
FrontierCode Diamond (Fable/Mythos) (Score (%)): leader Claude Fable 5 (29.3), 3 models
GDPval-AA (Fable/Mythos) (Elo): leader Claude Fable 5 (1932.0), 4 models
GDP.pdf (Fable/Mythos) (Strict Pass Rate (%)): leader Claude Fable 5 (29.8), 4 models
AutomationBench (Fable/Mythos) (Score (%)): leader Claude Fable 5 (17.4), 5 models
Blueprint-Bench 2 (Fable/Mythos) (Score (%)): leader Claude Fable 5 (38.6), 5 models
Legal Agent Benchmark Public Set (Fable/Mythos) (All-Pass Rate (%)): leader Claude Mythos 5 (16.9), 3 models
HealthBench (Fable/Mythos) (Score (%)): leader Claude Mythos 5 (62.7), 4 models
HealthBench Professional (Fable/Mythos) (Score (%)): leader Claude Mythos 5 (66.0), 4 models
OpenAI GPT-5.5 Launch - GDPval (wins or ties) (Score (%)): leader GPT-5.5 (84.9), 6 models
OpenAI GPT-5.5 Launch - FinanceAgent v1.1 (Score (%)): leader Claude Opus 4.7 (64.4), 5 models
OpenAI GPT-5.5 Launch - Investment Banking Modeling Tasks (Score (%)): leader GPT-5.5 Pro (88.6), 4 models
OpenAI GPT-5.5 Launch - BrowseComp (Score (%)): leader GPT-5.5 Pro (90.1), 6 models
OpenAI GPT-5.5 Launch - GeneBench (Score (%)): leader GPT-5.5 Pro (33.2), 4 models
OpenAI GPT-5.5 Launch - FrontierMath Tier 1-3 (Score (%)): leader GPT-5.5 Pro (52.4), 6 models
OpenAI GPT-5.5 Launch - FrontierMath Tier 4 (Score (%)): leader GPT-5.5 Pro (39.6), 6 models
OpenAI GPT-5.5 Launch - GPQA Diamond (Score (%)): leader GPT-5.4 Pro (94.4), 5 models
OpenAI GPT-5.5 Launch - Humanity's Last Exam (no tools) (Score (%)): leader Claude Opus 4.7 (46.9), 6 models
OpenAI GPT-5.5 Launch - Humanity's Last Exam (with tools) (Score (%)): leader GPT-5.4 Pro (58.7), 6 models
OpenAI GPT-5.5 Launch - ARC-AGI-1 (Verified) (Score (%)): leader Gemini 3.1 Pro (98.0), 5 models
OpenAI GPT-5.5 Launch - ARC-AGI-2 (Verified) (Score (%)): leader GPT-5.5 (85.0), 5 models
OpenAI GPT-5.4 Launch - GDPval (Score (%)): leader GPT-5.4 (83.0), 5 models
OpenAI GPT-5.4 Launch - FinanceAgent v1.1 (Score (%)): leader GPT-5.4 Pro (61.5), 4 models
OpenAI GPT-5.4 Launch - Investment Banking Modeling Tasks (Score (%)): leader GPT-5.4 (87.3), 5 models
OpenAI GPT-5.4 Launch - BrowseComp (Score (%)): leader GPT-5.4 Pro (89.3), 5 models
OpenAI GPT-5.4 Launch - Frontier Science Research (Score (%)): leader GPT-5.4 Pro (36.7), 3 models
OpenAI GPT-5.4 Launch - FrontierMath Tier 1-3 (Score (%)): leader GPT-5.4 Pro (50.0), 3 models
OpenAI GPT-5.4 Launch - FrontierMath Tier 4 (Score (%)): leader GPT-5.4 Pro (38.0), 4 models
OpenAI GPT-5.4 Launch - GPQA Diamond (Score (%)): leader GPT-5.4 Pro (94.4), 5 models
OpenAI GPT-5.4 Launch - Humanity's Last Exam (no tools) (Score (%)): leader GPT-5.4 Pro (42.7), 4 models
OpenAI GPT-5.4 Launch - Humanity's Last Exam (with tools) (Score (%)): leader GPT-5.4 Pro (58.7), 4 models
OpenAI GPT-5.4 Launch - ARC-AGI-1 (Verified) (Score (%)): leader GPT-5.4 Pro (94.5), 4 models
OpenAI GPT-5.4 Launch - ARC-AGI-2 (Verified) (Score (%)): leader GPT-5.4 Pro (83.3), 4 models
OpenAI GPT-5.5 System Card - Tacit Knowledge and Troubleshooting (Score (%)): leader GPT-5.5 Pro (81.67), 2 models
OpenAI GPT-5.5 System Card - Biochemistry Knowledge Improvement (reward@4 (%)): leader GPT-5.5 Pro (39.26), 3 models
OpenAI GPT-5.5 System Card - Hard Negative Protein Binding Prediction (pass@4 (%)): leader GPT-5.4 (Thinking) (3.46), 3 models
OpenAI GPT-5.5 System Card - DNA Sequence Design for TF Binding (pass@1 (%)): leader GPT-5.5 Pro (16.5), 3 models
OpenAI GPT-Rosalind-5.5 System Card - ProtocolQA Open-Ended (pass@1 (%)): leader GPT-5.5 (37.3), 3 models
OpenAI GPT-Rosalind-5.5 System Card - TroubleshootingBench (pass@1 (%)): leader GPT-Rosalind-5.5 (53.31), 3 models
OpenAI GPT-Rosalind-5.5 System Card - Biorisk Knowledge (cons@32 (%)): leader GPT-5.5 Pro (81.67), 3 models
OpenAI GPT-Rosalind-5.5 System Card - Multi-select Virology Troubleshooting (pass@1 (%)): leader GPT-5.5 Pro (55.34), 3 models
OpenAI GPT-Rosalind-5.5 System Card - Hard Negative Protein Binding Prediction (pass@4 (%)): leader GPT-Rosalind-5.5 (3.13), 3 models
OpenAI GPT-Rosalind-5.5 System Card - DNA Sequence Design for TF Binding (pass@1 (%)): leader GPT-5.5 Pro (16.5), 3 models
Google Gemini 3 Deep Think - ARC-AGI-2 (Score (%)): leader Gemini 3 Deep Think (84.6), 4 models
Google Gemini 3 Deep Think - Humanity's Last Exam (no tools) (Score (%)): leader Gemini 3 Deep Think (48.4), 4 models
Google Gemini 3 Deep Think - Humanity's Last Exam (search and code) (Score (%)): leader Gemini 3 Deep Think (53.4), 4 models
Google Gemini 3 Deep Think - MMMU-Pro (Score (%)): leader Gemini 3 Deep Think (81.5), 4 models
Google Gemini 3 Deep Think - International Math Olympiad 2025 (Score (%)): leader Gemini 3 Deep Think (81.5), 3 models
Google Gemini 3 Deep Think - Codeforces (Elo): leader Gemini 3 Deep Think (3455.0), 3 models
Google Gemini 3 Deep Think - International Physics Olympiad 2025 (theory) (Score (%)): leader Gemini 3 Deep Think (87.7), 4 models
Google Gemini 3 Deep Think - CMT-Benchmark (Pass@8 (%)): leader Gemini 3 Deep Think (50.5), 4 models
Google Gemini 3 Deep Think - International Chemistry Olympiad 2025 (theory) (Score (%)): leader Gemini 3 Deep Think (82.8), 3 models
Qwen3.7 Launch - Terminal Bench 2.0-Terminus (Score (%)): leader Qwen 3.7 Max (69.7), 6 models
Qwen3.7 Launch - SWE-Verified (Resolved (%)): leader Claude Opus 4.6 (Thinking) (80.8), 5 models
Qwen3.7 Launch - SWE-Pro (Resolved (%)): leader Qwen 3.7 Max (60.6), 6 models
Qwen3.7 Launch - SWE-Multilingual (Resolved (%)): leader Qwen 3.7 Max (78.3), 5 models
Qwen3.7 Launch - NL2repo (Score (%)): leader Claude Opus 4.6 (Thinking) (47.6), 6 models
Qwen3.7 Launch - SciCode (Score (%)): leader Qwen 3.7 Max (53.5), 5 models
Qwen3.7 Launch - QwenWebDev (Elo): leader Claude Opus 4.6 (Thinking) (1617.0), 5 models
Qwen3.7 Launch - QwenSVG (Elo): leader Qwen 3.7 Max (1608.0), 6 models
Qwen3.7 Launch - Qwenclaw (Score (%)): leader Claude Opus 4.6 (Thinking) (65.5), 6 models
Qwen3.7 Launch - CoWorkBench (Score (%)): leader Claude Opus 4.6 (Thinking) (68.2), 6 models
Qwen3.7 Launch - ClawEval (Score (%)): leader Claude Opus 4.6 (Thinking) (70.4), 6 models
Qwen3.7 Launch - Skillsbench (Score (%)): leader Qwen 3.7 Max (59.2), 5 models
Qwen3.7 Launch - BFCL-V4 (Score (%)): leader Claude Opus 4.6 (Thinking) (76.7), 6 models
Qwen3.7 Launch - MCP-Mark (Score (%)): leader Qwen 3.7 Max (60.8), 6 models
Qwen3.7 Launch - MCP-Atlas (Score (%)): leader Qwen 3.7 Max (76.4), 6 models
Qwen3.7 Launch - Vitabench (Score (%)): leader DeepSeek V4 Pro (Reasoning, Max Effort) (51.9), 5 models
Qwen3.7 Launch - SpreadSheetBench-v1 (Score (%)): leader Claude Opus 4.6 (Thinking) (89.3), 6 models
Qwen3.7 Launch - Kernel Bench L3 - Median Speedup (Median speedup (x)): leader Claude Opus 4.6 (Thinking) (2.63), 6 models
Qwen3.7 Launch - Kernel Bench L3 - Win Rate (Problems faster than torch.compile (%)): leader Claude Opus 4.6 (Thinking) (98.0), 6 models
Qwen3.7 Launch - Humanity's Last Exam (with tools) (Score (%)): leader Kimi K2.6 (Thinking) (54.0), 6 models
Qwen3.7 Launch - QwenWorldBench (Score (%)): leader Qwen 3.7 Max (57.3), 6 models
Qwen3.7 Launch - GPQA Diamond (Score (%)): leader Qwen 3.7 Max (92.4), 6 models
Qwen3.7 Launch - Humanity's Last Exam (Score (%)): leader Qwen 3.7 Max (41.4), 6 models
Qwen3.7 Launch - LiveCodeBench (Score (%)): leader DeepSeek V4 Pro (Reasoning, Max Effort) (93.5), 5 models
Qwen3.7 Launch - HMMT 2026 Feb (Score (%)): leader Qwen 3.7 Max (97.1), 6 models
Qwen3.7 Launch - IMOAnswerBench (Score (%)): leader Qwen 3.7 Max (90.0), 6 models
Qwen3.7 Launch - CritPT (Score (%)): leader DeepSeek V4 Pro (Reasoning, Max Effort) (12.9), 6 models
Qwen3.7 Launch - Apex (Score (%)): leader Qwen 3.7 Max (44.5), 6 models
Qwen3.7 Launch - MMLU-Pro (Score (%)): leader Claude Opus 4.6 (Thinking) (89.7), 6 models
Qwen3.7 Launch - MMLU-Redux (Score (%)): leader Kimi K2.6 (Thinking) (95.3), 6 models
Qwen3.7 Launch - SuperGPQA (Score (%)): leader Qwen 3.7 Max (73.6), 6 models
Qwen3.7 Launch - IFEval (Score (%)): leader Kimi K2.6 (Thinking) (94.5), 6 models
Qwen3.7 Launch - IFBench (Score (%)): leader Qwen 3.7 Max (79.1), 6 models
Qwen3.7 Launch - MRCR-v2 128k (Accuracy (%)): leader Qwen 3.7 Max (90.4), 6 models
Qwen3.7 Launch - WMT24++ (Score (%)): leader Qwen 3.7 Max (85.8), 6 models
Qwen3.7 Launch - MAXIFE (Score (%)): leader Qwen 3.7 Max (89.2), 6 models
Qwen3.7 Launch - MMMLU (Score (%)): leader Claude Opus 4.6 (Thinking) (90.6), 6 models
Qwen3.7 Launch - MMLU-ProX (Score (%)): leader Qwen 3.7 Max (87.0), 6 models
Qwen3.7 Launch - NOVA-63 (Score (%)): leader Claude Opus 4.6 (Thinking) (59.1), 6 models
Qwen3.7 Launch - INCLUDE (Score (%)): leader Claude Opus 4.6 (Thinking) (87.4), 6 models
Qwen3.7 Launch - Global PIQA (Score (%)): leader Qwen 3.7 Max (91.4), 6 models
Qwen3.7 Launch - PolyMATH (Score (%)): leader Qwen 3.7 Max (86.5), 6 models

AI Benchmark Digest — 2026-06-14

2026-06-14T09:01:01.779177+00:00

Daily

New Benchmarks (75)

Ramp SWE-Bench (Resolved (%)): leader Claude Fable 5 (87.5), 14 models
Ramp Labs benchmark for background coding agents on realistic financial software engineering work, scored by resolved tasks with the mini-SWE-agent harness.
CADGenBench (Aggregate CAD Score): leader Claude Fable 5 (0.4514), 11 models
CAD generation and editing benchmark scoring generated CAD artifacts on aggregate geometric and validity metrics across validated submissions.
FrontierMath - Tier 4 (v2) (Accuracy (%, 41 private v2 problems)): leader Claude Fable 5 (max) (87.8), 27 models
Current v2 private Tier 4 FrontierMath expansion set from Epoch AI, measuring accuracy on the hardest unpublished research-level mathematics problems.
FrontierMath - Tiers 1-3 (v2) (Accuracy (%, 285 private v2 problems)): leader GPT-5.5 Pro (xhigh) (87.72), 26 models
Current v2 private FrontierMath base set from Epoch AI, covering original problems from undergraduate through early-postdoc difficulty across major areas of modern mathematics.
Benchmarks.bio - SpatialBench (Pass Rate (%)): leader GPT-5.5 (69.57), 11 models
LatchBio agentic benchmark on messy real-world spatial transcriptomics data, with models writing and running analysis workflows across assays, platforms, and task categories.
Benchmarks.bio - scBench (Pass Rate (%)): leader Claude Mythos 5 (59.3), 13 models
LatchBio agentic benchmark for single-cell RNA-seq analysis, requiring models to perform realistic data cleaning, clustering, cell typing, and differential-expression workflows.
Benchmarks.bio - SpatialBench-Long (Pass Rate (%)): leader Gemini 3.5 Flash (11.11), 12 models
Long-form Benchmarks.bio spatial transcriptomics tasks that require multi-step biological data analysis, tool use, and synthesis over larger assay contexts.
Benchmarks.bio - EpiBench (Pass Rate (%)): leader GPT-5.5 (44.97), 11 models
Benchmarks.bio epigenomics benchmark covering real assays such as chromatin accessibility, binding, and methylation analyses with deterministic graders.
Agent Arena (Net Improvement (%)): leader Grok 4.3 (18.3), 25 models
Arena.ai agent leaderboard measuring net improvement on real-world tool orchestration sessions with success, steerability, recovery, and hallucination metrics.
Agent Arena - Confirmed Success (Confirmed Success (%)): leader Claude Fable 5 (High) (17.21), 25 models
Agent Arena submetric tracking confirmed successful completion rate on real-world agent sessions.
Agent Arena - Praise vs Complaint (Praise vs Complaint (%)): leader Claude Fable 5 (High) (27.74), 25 models
Agent Arena submetric comparing user praise against complaints across agent sessions.
Agent Arena - Steerability (Steerability (%)): leader Nemotron 3 Ultra (23.87), 25 models
Agent Arena submetric measuring how well models adapt to user steering during tool-use sessions.
Agent Arena - Bash Recovery (Bash Recovery (%)): leader Grok 4.3 (60.23), 25 models
Agent Arena submetric measuring recovery from shell or command-line failures in agent sessions.
Agent Arena - Tool Hallucination (Tool Hallucination (%)): leader Grok 4.3 (0.26), 25 models
Agent Arena submetric measuring tool hallucination rate; lower values indicate fewer invented or invalid tool uses.
Agents' Last Exam (Pass Rate (%)): leader GPT-5.5 (24.0), 18 models
Snorkel benchmark of long-horizon economically valuable agent tasks across many industries, reporting workflow pass rate and score.
WolfBench (Average Score (%)): leader GPT-5.5 (77.0), 27 models
Agent benchmark based on Terminal-Bench 2.0 that compares harnesses and models across repeated terminal task runs using aggregate score statistics.
Appwrite Arena (With Skills) (Overall Score (%)): leader GPT-5.5 (97.7), 16 models
Appwrite Arena evaluation of model knowledge and reasoning about Appwrite development tasks when models can use Appwrite skills.
Appwrite Arena (Without Skills) (Overall Score (%)): leader Claude Fable 5 (97.7), 16 models
Appwrite Arena evaluation of model knowledge and reasoning about Appwrite development tasks without Appwrite skill assistance.
Terminal-Bench 2.1 (Accuracy (%)): leader GPT-5.5 (83.4), 6 models
Official Terminal-Bench 2.1 leaderboard measuring agent success on realistic command-line tasks, using each model's best available harness row.
Terminal-Bench 2.1 (Claude Code) (Accuracy (%)): leader Claude Opus 4.8 (78.9), 3 models
Terminal-Bench 2.1 results for the Claude Code harness, measuring command-line task completion by model.
Terminal-Bench 2.1 (Terminus 2) (Accuracy (%)): leader GPT-5.5 (78.2), 5 models
Terminal-Bench 2.1 results for the Terminus 2 harness, measuring command-line task completion by model.
Vals AI Finance Agent v2 (Accuracy (%)): leader gemini-3.5-flash (57.86), 29 models
Updated Vals AI financial-research agent benchmark over SEC filings and supporting documents, measuring completion accuracy on realistic analyst workflows.
Vals AI Public Benefits Bench (Accuracy (%)): leader claude-fable-5 (71.65), 13 models
SNAP public-benefits guidance benchmark measuring whether models answer benefits questions accurately while following eligibility and documentation rules.
Vals AI Terminal-Bench 2.1 (Accuracy (%)): leader claude-fable-5 (80.52), 30 models
Updated Terminal-Bench 2.1 evaluation from Vals AI, measuring agentic command-line task completion in sandboxed software and systems environments.
Vals AI LiveCodeBench (Accuracy (%)): leader claude-fable-5 (89.78), 121 models
Vals AI run of LiveCodeBench coding problems, measuring pass rates on recent contest-style programming tasks intended to reduce contamination.
Vals AI GPQA (Accuracy (%)): leader gemini-3.1-pro-preview (95.45), 115 models
Vals AI run of GPQA graduate-level science questions, measuring difficult expert-domain reasoning accuracy.
Vals AI MMLU-Pro (Accuracy (%)): leader claude-fable-5 (91.5), 114 models
Vals AI run of MMLU-Pro multitask academic questions, using harder multi-choice problems across STEM, humanities, and professional domains.
Vals AI MMMU (Accuracy (%)): leader claude-fable-5 (89.31), 76 models
Vals AI run of MMMU multimodal college-level subject questions, measuring visual and textual academic reasoning.
Vals AI SWE-bench Verified (Resolved (%)): leader claude-fable-5 (95.0), 57 models
Vals AI SWE-bench Verified leaderboard, measuring the percentage of real GitHub issues resolved by coding agents.
GDP.pdf (Strict Pass Rate (%)): leader Claude Fable 5 (30.0), 12 models
Surge AI document-reasoning benchmark over 100 professional PDF workflows, scored by strict pass rate against expert-written rubrics.
Riemann-bench (Score (%)): leader Claude Fable 5 (55.0), 15 models
Surge AI frontier mathematics benchmark with advanced research-style problems sourced from mathematicians and scored by solution correctness.
SWE-bench Pro (Anthropic Scaffold) (Pass@1 (%)): leader Claude Mythos 5 (80.3), 6 models
Anthropic system-card run of SWE-bench Pro, measuring pass@1 on production software engineering issues using Anthropic scaffold settings.
OfficeQA Pro (Correctness (%)): leader Claude Fable 5 (57.9), 4 models
Hard OfficeQA subset for frontier document agents, requiring grounded search and numerical reasoning over U.S. Treasury Bulletin documents.
Real-World Finance v2 (Elo): leader Claude Fable 5 (1374.0), 4 models
Anthropic long-horizon finance workflow evaluation using pairwise preference grading and Elo ratings over realistic professional deliverables.
Real-World Finance v1 (Score (%)): leader Claude Mythos Preview (70.9), 4 models
Anthropic curated finance benchmark of 53 tasks evaluated against reference answers with a model-based grader.
Legal Agent Benchmark (Harvey Held-Out) (All-Pass Rate (%)): leader Claude Fable 5 (13.3), 5 models
Harvey legal-agent held-out evaluation using closed-universe matter files and expert rubrics, scored by all-pass task success.
Toolathlon (Anthropic Internal Harness) (Pass@1 (%)): leader Claude Fable 5 (61.7), 7 models
Anthropic internal Toolathlon harness over 108 tool-use tasks, reporting pass@1 for agentic workflow completion.
SWE-bench Verified (Anthropic Scaffold) (Resolved (%)): leader Claude Opus 4.8 (88.6), 3 models
Anthropic system-card run of SWE-bench Verified, measuring real GitHub issue resolution with Anthropic scaffold settings.
SWE-bench Multilingual (Anthropic Scaffold) (Resolved (%)): leader Claude Opus 4.8 (84.4), 2 models
Anthropic system-card run of SWE-bench Multilingual, measuring multilingual software issue resolution with Anthropic scaffold settings.
SWE-bench Multimodal (Anthropic Internal Harness) (Resolved (%)): leader Claude Opus 4.8 (38.4), 2 models
Anthropic internal multimodal SWE-bench harness, measuring software issue resolution that requires visual or multimodal context.
Humanity's Last Exam (Anthropic No Tools) (Accuracy (%)): leader Claude Opus 4.8 (49.8), 4 models
Anthropic system-card run of Humanity's Last Exam without tools, covering expert-level academic reasoning across many domains.
Humanity's Last Exam (Anthropic Tools) (Accuracy (%)): leader Claude Opus 4.8 (57.9), 4 models
Anthropic system-card run of Humanity's Last Exam with tools, covering expert-level academic reasoning across many domains.
ChartQAPro (Anthropic No Tools) (Accuracy (%)): leader Claude Opus 4.8 (69.4), 2 models
Anthropic no-tool run of ChartQAPro, testing chart understanding and quantitative visual reasoning.
ChartQAPro (Anthropic Tools) (Accuracy (%)): leader Claude Opus 4.8 (72.3), 2 models
Anthropic tool-enabled run of ChartQAPro, testing chart understanding and quantitative visual reasoning.
ScreenSpot-Pro (Anthropic No Tools) (Accuracy (%)): leader Claude Opus 4.8 (82.3), 2 models
Anthropic no-tool run of ScreenSpot-Pro, evaluating GUI grounding and screen element localization.
ScreenSpot-Pro (Anthropic Tools) (Accuracy (%)): leader Claude Opus 4.8 (87.9), 2 models
Anthropic tool-enabled run of ScreenSpot-Pro, evaluating GUI grounding and screen element localization.
GraphWalks BFS 256K (Anthropic) (F1 Score (%)): leader Claude Opus 4.8 (85.9), 4 models
Anthropic GraphWalks long-context graph traversal evaluation using breadth-first-search tasks at 256K context.
GraphWalks Parents 256K (Anthropic) (F1 Score (%)): leader Claude Opus 4.8 (99.3), 4 models
Anthropic GraphWalks long-context graph traversal evaluation using parent-pointer recovery tasks at 256K context.
USAMO 2026 (Anthropic) (Accuracy (%)): leader Claude Opus 4.8 (96.7), 2 models
Anthropic system-card evaluation on 2026 USAMO-style olympiad math problems, scored by answer correctness.
ArXivMath Mar-Apr 2026 (Anthropic) (Accuracy (%)): leader Claude Opus 4.8 (71.82), 3 models
Anthropic system-card evaluation on recent arXiv mathematics problems from March and April 2026.
OfficeQA (Anthropic Internal Harness) (Exact Match (%)): leader Claude Opus 4.8 (77.6), 2 models
Anthropic internal OfficeQA document-agent benchmark, requiring grounded search and numerical reasoning over office documents.
OfficeQA Pro (Anthropic Internal Harness) (Exact Match (%)): leader Claude Opus 4.8 (66.2), 2 models
Anthropic internal OfficeQA Pro hard subset, requiring grounded search and numerical reasoning over office documents.
ChartMuseum (Anthropic No Tools) (Accuracy (%)): leader Claude Opus 4.8 (75.8), 2 models
Anthropic no-tool run of ChartMuseum, evaluating visual chart interpretation across diverse chart types.
ChartMuseum (Anthropic Tools) (Accuracy (%)): leader Claude Opus 4.8 (89.7), 2 models
Anthropic tool-enabled run of ChartMuseum, evaluating visual chart interpretation across diverse chart types.
LAB-Bench FigQA (Anthropic No Tools) (Accuracy (%)): leader Claude Opus 4.8 (80.4), 2 models
Anthropic no-tool run of LAB-Bench FigQA, testing scientific figure understanding and reasoning.
LAB-Bench FigQA (Anthropic Tools) (Accuracy (%)): leader Claude Opus 4.8 (87.3), 2 models
Anthropic tool-enabled run of LAB-Bench FigQA, testing scientific figure understanding and reasoning.
CharXiv Reasoning (Anthropic No Tools) (Accuracy (%)): leader Claude Opus 4.7 (81.3), 2 models
Anthropic no-tool run of CharXiv Reasoning, evaluating reasoning over scientific charts from arXiv papers.
CharXiv Reasoning (Anthropic Tools) (Accuracy (%)): leader Claude Opus 4.7 (90.1), 2 models
Anthropic tool-enabled run of CharXiv Reasoning, evaluating reasoning over scientific charts from arXiv papers.
HealthBench Professional (Anthropic) (Length-Adjusted Score (%)): leader Claude Opus 4.8 (55.8), 3 models
Anthropic system-card run of HealthBench Professional, measuring clinical and healthcare reasoning with length-adjusted scoring.
GMMLU (Anthropic) (Average Accuracy (%)): leader Gemini 3.1 Pro (92.2), 5 models
Anthropic system-card run of Global MMLU, measuring multilingual academic and professional knowledge.
BioPipelineBench Verified (Anthropic) (Score (%)): leader Claude Mythos Preview (88.1), 4 models
Anthropic system-card run of BioPipelineBench Verified, measuring biological data-analysis workflow completion.
BioMysteryBench Verified - Human Solvable (Anthropic) (Score (%)): leader Claude Mythos Preview (82.6), 4 models
Anthropic system-card run of BioMysteryBench Verified human-solvable tasks, testing biological mystery problem solving.
BioMysteryBench Verified - Human Difficult (Anthropic) (Score (%)): leader Claude Opus 4.8 (40.0), 4 models
Anthropic system-card run of BioMysteryBench Verified human-difficult tasks, testing hard biological mystery problem solving.
LatchBio SpatialBench (Anthropic) (Score (%)): leader Claude Mythos Preview (53.8), 4 models
Anthropic system-card run of LatchBio SpatialBench, measuring spatial transcriptomics analysis workflows.
LatchBio SingleCellBench (Anthropic) (Score (%)): leader Claude Opus 4.8 (58.2), 4 models
Anthropic system-card run of LatchBio SingleCellBench, measuring single-cell RNA-seq analysis workflows.
Structural Biology (Anthropic) (Score (%)): leader Claude Mythos Preview (81.6), 4 models
Anthropic system-card structural biology evaluation, testing biomolecular structure reasoning and analysis.
ProteinGym Hard (Anthropic) (Rank Correlation (%)): leader Claude Mythos Preview (43.1), 4 models
Anthropic system-card run of the hard ProteinGym subset, measuring protein variant effect prediction via rank correlation.
Organic Chemistry (Anthropic) (Score (%)): leader Claude Mythos Preview (86.5), 4 models
Anthropic system-card organic chemistry evaluation, testing reaction and molecule reasoning.
Protocol Troubleshooting (Anthropic) (Score (%)): leader Claude Mythos Preview (69.6), 4 models
Anthropic system-card protocol troubleshooting benchmark, testing diagnosis of laboratory protocol failures.
LABBench2 - Patent Questions (Anthropic) (Score (%)): leader Claude Opus 4.8 (68.8), 3 models
Anthropic system-card LABBench2 patent-question subset, testing life-science document reasoning over patent material.
LABBench2 - Clinical Trial Questions (Anthropic) (Score (%)): leader Claude Mythos Preview (86.3), 3 models
Anthropic system-card LABBench2 clinical-trial subset, testing life-science reasoning over trial documents.
LABBench2 - Table Reading (Anthropic) (Score (%)): leader Claude Opus 4.8 (77.2), 2 models
Anthropic system-card LABBench2 table-reading subset, testing scientific table comprehension.
LABBench2 - Supplementary Materials (Anthropic) (Score (%)): leader Claude Opus 4.8 (58.9), 2 models
Anthropic system-card LABBench2 supplementary-materials subset, testing reasoning over scientific supporting files.
Agent Security League - Functional Correctness (Functional Correctness (%)): leader GPT-5.5 (84.9), 15 models
Endor Labs coding-agent benchmark measuring whether agents functionally complete security-sensitive software tasks.
Agent Security League - Security Correctness (Security Correctness (%)): leader GPT-5.5 (24.0), 15 models
Endor Labs coding-agent benchmark measuring whether completed software tasks avoid introducing or preserving security vulnerabilities.
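Several Anthropic entries above come in "No Tools" / "Tools" pairs, and in each pair the same model leads both variants. A minimal sketch of how the leader's tool uplift could be computed from those pairs follows. The scores are copied from the entries above; the names `PAIRS` and `tool_uplift` are illustrative and not part of any digest tooling.

```python
# (benchmark, no-tools leader score, tools leader score), copied from the
# paired entries in this list. In every pair the same model leads both runs.
PAIRS = [
    ("Humanity's Last Exam", 49.8, 57.9),
    ("ChartQAPro", 69.4, 72.3),
    ("ScreenSpot-Pro", 82.3, 87.9),
    ("ChartMuseum", 75.8, 89.7),
    ("LAB-Bench FigQA", 80.4, 87.3),
    ("CharXiv Reasoning", 81.3, 90.1),
]


def tool_uplift(pairs):
    """Return (benchmark, tools minus no-tools) in points, largest gain first."""
    gains = [(name, round(tools - no_tools, 1)) for name, no_tools, tools in pairs]
    return sorted(gains, key=lambda row: row[1], reverse=True)


for name, gain in tool_uplift(PAIRS):
    print(f"{name}: +{gain} points with tools")
```

The differences are in percentage points of the leader's score only; they say nothing about how other models on each leaderboard respond to tool access.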

New #1 Leaders (1)

OpenClawProBench: GLM-5.2 (81.3) beat intern-s2-preview by 4.6

Weekly

New Benchmarks (86)

FrontierCode Diamond (Score (%)): leader Claude Opus 4.8 (13.4), 12 models
Hardest 50 FrontierCode production-code tasks from Cognition, measuring whether maintainers would merge model PRs using blocker criteria and quality rubrics.
FrontierCode Main (Score (%)): leader Claude Opus 4.8 (34.3), 12 models
100 hardest FrontierCode production-code tasks, including Diamond, scored by maintainer-style mergeability criteria across correctness, tests, scope, style, and codebase standards.
FrontierCode Extended (Score (%)): leader Claude Opus 4.8 (51.8), 12 models
Full 150-task FrontierCode benchmark from Cognition, evaluating production-quality coding agents on maintainer-authored open-source repository work.
Ramp SWE-Bench (Resolved (%)): leader Claude Fable 5 (87.5), 14 models
Ramp Labs benchmark for background coding agents on realistic financial software engineering work, scored by resolved tasks with the mini-SWE-agent harness.
CADGenBench (Aggregate CAD Score): leader Claude Fable 5 (0.4514), 11 models
CAD generation and editing benchmark scoring generated CAD artifacts on aggregate geometric and validity metrics across validated submissions.
FrontierMath - Tier 4 (v2) (Accuracy (%, 41 private v2 problems)): leader Claude Fable 5 (max) (87.8), 27 models
Current v2 private Tier 4 FrontierMath expansion set from Epoch AI, measuring accuracy on the hardest unpublished research-level mathematics problems.
FrontierMath - Tiers 1-3 (v2) (Accuracy (%, 285 private v2 problems)): leader GPT-5.5 Pro (xhigh) (87.72), 26 models
Current v2 private FrontierMath base set from Epoch AI, covering original problems from undergraduate through early-postdoc difficulty across major areas of modern mathematics.
Benchmarks.bio - SpatialBench (Pass Rate (%)): leader GPT-5.5 (69.57), 11 models
LatchBio agentic benchmark on messy real-world spatial transcriptomics data, with models writing and running analysis workflows across assays, platforms, and task categories.
Benchmarks.bio - scBench (Pass Rate (%)): leader Claude Mythos 5 (59.3), 13 models
LatchBio agentic benchmark for single-cell RNA-seq analysis, requiring models to perform realistic data cleaning, clustering, cell typing, and differential-expression workflows.
Benchmarks.bio - SpatialBench-Long (Pass Rate (%)): leader Gemini 3.5 Flash (11.11), 12 models
Long-form Benchmarks.bio spatial transcriptomics tasks that require multi-step biological data analysis, tool use, and synthesis over larger assay contexts.
Benchmarks.bio - EpiBench (Pass Rate (%)): leader GPT-5.5 (44.97), 11 models
Benchmarks.bio epigenomics benchmark covering real assays such as chromatin accessibility, binding, and methylation analyses with deterministic graders.
Agent Arena (Net Improvement (%)): leader Grok 4.3 (18.3), 25 models
Arena.ai agent leaderboard measuring net improvement on real-world tool orchestration sessions with success, steerability, recovery, and hallucination metrics.
Agent Arena - Confirmed Success (Confirmed Success (%)): leader Claude Fable 5 (High) (17.21), 25 models
Agent Arena submetric tracking confirmed successful completion rate on real-world agent sessions.
Agent Arena - Praise vs Complaint (Praise vs Complaint (%)): leader Claude Fable 5 (High) (27.74), 25 models
Agent Arena submetric comparing user praise against complaints across agent sessions.
Agent Arena - Steerability (Steerability (%)): leader Nemotron 3 Ultra (23.87), 25 models
Agent Arena submetric measuring how well models adapt to user steering during tool-use sessions.
Agent Arena - Bash Recovery (Bash Recovery (%)): leader Grok 4.3 (60.23), 25 models
Agent Arena submetric measuring recovery from shell or command-line failures in agent sessions.
Agent Arena - Tool Hallucination (Tool Hallucination (%)): leader Grok 4.3 (0.26), 25 models
Agent Arena submetric measuring tool hallucination rate; lower values indicate fewer invented or invalid tool uses.
Agents' Last Exam (Pass Rate (%)): leader GPT-5.5 (24.0), 18 models
Snorkel benchmark of long-horizon economically valuable agent tasks across many industries, reporting workflow pass rate and score.
WolfBench (Average Score (%)): leader GPT-5.5 (77.0), 27 models
Agent benchmark based on Terminal-Bench 2.0 that compares harnesses and models across repeated terminal task runs using aggregate score statistics.
Appwrite Arena (With Skills) (Overall Score (%)): leader GPT-5.5 (97.7), 16 models
Appwrite Arena evaluation of model knowledge and reasoning about Appwrite development tasks when models can use Appwrite skills.
Appwrite Arena (Without Skills) (Overall Score (%)): leader Claude Fable 5 (97.7), 16 models
Appwrite Arena evaluation of model knowledge and reasoning about Appwrite development tasks without Appwrite skill assistance.
Terminal-Bench 2.1 (Accuracy (%)): leader GPT-5.5 (83.4), 6 models
Official Terminal-Bench 2.1 leaderboard measuring agent success on realistic command-line tasks, using each model's best available harness row.
Terminal-Bench 2.1 (Claude Code) (Accuracy (%)): leader Claude Opus 4.8 (78.9), 3 models
Terminal-Bench 2.1 results for the Claude Code harness, measuring command-line task completion by model.
Terminal-Bench 2.1 (Terminus 2) (Accuracy (%)): leader GPT-5.5 (78.2), 5 models
Terminal-Bench 2.1 results for the Terminus 2 harness, measuring command-line task completion by model.
Vals AI Finance Agent v2 (Accuracy (%)): leader gemini-3.5-flash (57.86), 29 models
Updated Vals AI financial-research agent benchmark over SEC filings and supporting documents, measuring completion accuracy on realistic analyst workflows.
Vals AI Public Benefits Bench (Accuracy (%)): leader claude-fable-5 (71.65), 13 models
SNAP public-benefits guidance benchmark measuring whether models answer benefits questions accurately while following eligibility and documentation rules.
Vals AI Terminal-Bench 2.1 (Accuracy (%)): leader claude-fable-5 (80.52), 30 models
Updated Terminal-Bench 2.1 evaluation from Vals AI, measuring agentic command-line task completion in sandboxed software and systems environments.
Vals AI LiveCodeBench (Accuracy (%)): leader claude-fable-5 (89.78), 121 models
Vals AI run of LiveCodeBench coding problems, measuring pass rates on recent contest-style programming tasks intended to reduce contamination.
Vals AI GPQA (Accuracy (%)): leader gemini-3.1-pro-preview (95.45), 115 models
Vals AI run of GPQA graduate-level science questions, measuring difficult expert-domain reasoning accuracy.
Vals AI MMLU-Pro (Accuracy (%)): leader claude-fable-5 (91.5), 114 models
Vals AI run of MMLU-Pro multitask academic questions, using harder multiple-choice problems across STEM, humanities, and professional domains.
Vals AI MMMU (Accuracy (%)): leader claude-fable-5 (89.31), 76 models
Vals AI run of MMMU multimodal college-level subject questions, measuring visual and textual academic reasoning.
Vals AI SWE-bench Verified (Resolved (%)): leader claude-fable-5 (95.0), 57 models
Vals AI SWE-bench Verified leaderboard, measuring the percentage of real GitHub issues resolved by coding agents.
Icelandic LLM Leaderboard - Average (Average Score (%)): leader Gemini 3.1 Pro Preview (88.54), 86 models
Icelandic LLM leaderboard aggregating WinoGrande-IS, GED, Inflection, Belebele-IS, ARC-Challenge-IS, and WikiQA-IS for Icelandic language understanding and reasoning.
Icelandic LLM - WinoGrande-IS (Score (%)): leader Gemini 3.1 Pro Preview (96.14), 86 models
Icelandic WinoGrande commonsense reasoning score.
Icelandic LLM - GED (Score (%)): leader Claude Fable 5 (91.5), 86 models
Icelandic grammatical error detection score.
Icelandic LLM - Inflection (Score (%)): leader GPT-5.5 (97.96), 86 models
Icelandic morphological inflection score.
Icelandic LLM - Belebele-IS (Score (%)): leader Gemini 3.1 Pro Preview (95.0), 86 models
Icelandic Belebele reading-comprehension score.
Icelandic LLM - ARC-Challenge-IS (Score (%)): leader GPT-5.5 (95.22), 86 models
Icelandic ARC-Challenge science and commonsense reasoning score.
Icelandic LLM - WikiQA-IS (Score (%)): leader Claude Fable 5 (75.39), 86 models
Icelandic WikiQA question-answering score.
GDP.pdf (Strict Pass Rate (%)): leader Claude Fable 5 (30.0), 12 models
Surge AI document-reasoning benchmark over 100 professional PDF workflows, scored by strict pass rate against expert-written rubrics.
Riemann-bench (Score (%)): leader Claude Fable 5 (55.0), 15 models
Surge AI frontier mathematics benchmark with advanced research-style problems sourced from mathematicians and scored by solution correctness.
SWE-bench Pro (Anthropic Scaffold) (Pass@1 (%)): leader Claude Mythos 5 (80.3), 6 models
Anthropic system-card run of SWE-bench Pro, measuring pass@1 on production software engineering issues using Anthropic scaffold settings.
OfficeQA Pro (Correctness (%)): leader Claude Fable 5 (57.9), 4 models
Hard OfficeQA subset for frontier document agents, requiring grounded search and numerical reasoning over U.S. Treasury Bulletin documents.
Real-World Finance v2 (Elo): leader Claude Fable 5 (1374.0), 4 models
Anthropic long-horizon finance workflow evaluation using pairwise preference grading and Elo ratings over realistic professional deliverables.
Real-World Finance v1 (Score (%)): leader Claude Mythos Preview (70.9), 4 models
Anthropic curated finance benchmark of 53 tasks evaluated against reference answers with a model-based grader.
Legal Agent Benchmark (Harvey Held-Out) (All-Pass Rate (%)): leader Claude Fable 5 (13.3), 5 models
Harvey legal-agent held-out evaluation using closed-universe matter files and expert rubrics, scored by all-pass task success.
Toolathlon (Anthropic Internal Harness) (Pass@1 (%)): leader Claude Fable 5 (61.7), 7 models
Anthropic internal Toolathlon harness over 108 tool-use tasks, reporting pass@1 for agentic workflow completion.
SWE-bench Verified (Anthropic Scaffold) (Resolved (%)): leader Claude Opus 4.8 (88.6), 3 models
Anthropic system-card run of SWE-bench Verified, measuring real GitHub issue resolution with Anthropic scaffold settings.
SWE-bench Multilingual (Anthropic Scaffold) (Resolved (%)): leader Claude Opus 4.8 (84.4), 2 models
Anthropic system-card run of SWE-bench Multilingual, measuring multilingual software issue resolution with Anthropic scaffold settings.
SWE-bench Multimodal (Anthropic Internal Harness) (Resolved (%)): leader Claude Opus 4.8 (38.4), 2 models
Anthropic internal multimodal SWE-bench harness, measuring software issue resolution that requires visual or multimodal context.
Humanity's Last Exam (Anthropic No Tools) (Accuracy (%)): leader Claude Opus 4.8 (49.8), 4 models
Anthropic system-card run of Humanity's Last Exam without tools, covering expert-level academic reasoning across many domains.
Humanity's Last Exam (Anthropic Tools) (Accuracy (%)): leader Claude Opus 4.8 (57.9), 4 models
Anthropic system-card run of Humanity's Last Exam with tools, covering expert-level academic reasoning across many domains.
ChartQAPro (Anthropic No Tools) (Accuracy (%)): leader Claude Opus 4.8 (69.4), 2 models
Anthropic no-tool run of ChartQAPro, testing chart understanding and quantitative visual reasoning.
ChartQAPro (Anthropic Tools) (Accuracy (%)): leader Claude Opus 4.8 (72.3), 2 models
Anthropic tool-enabled run of ChartQAPro, testing chart understanding and quantitative visual reasoning.
ScreenSpot-Pro (Anthropic No Tools) (Accuracy (%)): leader Claude Opus 4.8 (82.3), 2 models
Anthropic no-tool run of ScreenSpot-Pro, evaluating GUI grounding and screen element localization.
ScreenSpot-Pro (Anthropic Tools) (Accuracy (%)): leader Claude Opus 4.8 (87.9), 2 models
Anthropic tool-enabled run of ScreenSpot-Pro, evaluating GUI grounding and screen element localization.
GraphWalks BFS 256K (Anthropic) (F1 Score (%)): leader Claude Opus 4.8 (85.9), 4 models
Anthropic GraphWalks long-context graph traversal evaluation using breadth-first-search tasks at 256K context.
GraphWalks Parents 256K (Anthropic) (F1 Score (%)): leader Claude Opus 4.8 (99.3), 4 models
Anthropic GraphWalks long-context graph traversal evaluation using parent-pointer recovery tasks at 256K context.
USAMO 2026 (Anthropic) (Accuracy (%)): leader Claude Opus 4.8 (96.7), 2 models
Anthropic system-card evaluation on 2026 USAMO-style olympiad math problems, scored by answer correctness.
ArXivMath Mar-Apr 2026 (Anthropic) (Accuracy (%)): leader Claude Opus 4.8 (71.82), 3 models
Anthropic system-card evaluation on recent arXiv mathematics problems from March and April 2026.
OfficeQA (Anthropic Internal Harness) (Exact Match (%)): leader Claude Opus 4.8 (77.6), 2 models
Anthropic internal OfficeQA document-agent benchmark, requiring grounded search and numerical reasoning over office documents.
OfficeQA Pro (Anthropic Internal Harness) (Exact Match (%)): leader Claude Opus 4.8 (66.2), 2 models
Anthropic internal OfficeQA Pro hard subset, requiring grounded search and numerical reasoning over office documents.
ChartMuseum (Anthropic No Tools) (Accuracy (%)): leader Claude Opus 4.8 (75.8), 2 models
Anthropic no-tool run of ChartMuseum, evaluating visual chart interpretation across diverse chart types.
ChartMuseum (Anthropic Tools) (Accuracy (%)): leader Claude Opus 4.8 (89.7), 2 models
Anthropic tool-enabled run of ChartMuseum, evaluating visual chart interpretation across diverse chart types.
LAB-Bench FigQA (Anthropic No Tools) (Accuracy (%)): leader Claude Opus 4.8 (80.4), 2 models
Anthropic no-tool run of LAB-Bench FigQA, testing scientific figure understanding and reasoning.
LAB-Bench FigQA (Anthropic Tools) (Accuracy (%)): leader Claude Opus 4.8 (87.3), 2 models
Anthropic tool-enabled run of LAB-Bench FigQA, testing scientific figure understanding and reasoning.
CharXiv Reasoning (Anthropic No Tools) (Accuracy (%)): leader Claude Opus 4.7 (81.3), 2 models
Anthropic no-tool run of CharXiv Reasoning, evaluating reasoning over scientific charts from arXiv papers.
CharXiv Reasoning (Anthropic Tools) (Accuracy (%)): leader Claude Opus 4.7 (90.1), 2 models
Anthropic tool-enabled run of CharXiv Reasoning, evaluating reasoning over scientific charts from arXiv papers.
HealthBench Professional (Anthropic) (Length-Adjusted Score (%)): leader Claude Opus 4.8 (55.8), 3 models
Anthropic system-card run of HealthBench Professional, measuring clinical and healthcare reasoning with length-adjusted scoring.
GMMLU (Anthropic) (Average Accuracy (%)): leader Gemini 3.1 Pro (92.2), 5 models
Anthropic system-card run of Global MMLU, measuring multilingual academic and professional knowledge.
BioPipelineBench Verified (Anthropic) (Score (%)): leader Claude Mythos Preview (88.1), 4 models
Anthropic system-card run of BioPipelineBench Verified, measuring biological data-analysis workflow completion.
BioMysteryBench Verified - Human Solvable (Anthropic) (Score (%)): leader Claude Mythos Preview (82.6), 4 models
Anthropic system-card run of BioMysteryBench Verified human-solvable tasks, testing biological mystery problem solving.
BioMysteryBench Verified - Human Difficult (Anthropic) (Score (%)): leader Claude Opus 4.8 (40.0), 4 models
Anthropic system-card run of BioMysteryBench Verified human-difficult tasks, testing hard biological mystery problem solving.
LatchBio SpatialBench (Anthropic) (Score (%)): leader Claude Mythos Preview (53.8), 4 models
Anthropic system-card run of LatchBio SpatialBench, measuring spatial transcriptomics analysis workflows.
LatchBio SingleCellBench (Anthropic) (Score (%)): leader Claude Opus 4.8 (58.2), 4 models
Anthropic system-card run of LatchBio SingleCellBench, measuring single-cell RNA-seq analysis workflows.
Structural Biology (Anthropic) (Score (%)): leader Claude Mythos Preview (81.6), 4 models
Anthropic system-card structural biology evaluation, testing biomolecular structure reasoning and analysis.
ProteinGym Hard (Anthropic) (Rank Correlation (%)): leader Claude Mythos Preview (43.1), 4 models
Anthropic system-card run of the hard ProteinGym subset, measuring protein variant effect prediction via rank correlation.
Organic Chemistry (Anthropic) (Score (%)): leader Claude Mythos Preview (86.5), 4 models
Anthropic system-card organic chemistry evaluation, testing reaction and molecule reasoning.
Protocol Troubleshooting (Anthropic) (Score (%)): leader Claude Mythos Preview (69.6), 4 models
Anthropic system-card protocol troubleshooting benchmark, testing diagnosis of laboratory protocol failures.
LABBench2 - Patent Questions (Anthropic) (Score (%)): leader Claude Opus 4.8 (68.8), 3 models
Anthropic system-card LABBench2 patent-question subset, testing life-science document reasoning over patent material.
LABBench2 - Clinical Trial Questions (Anthropic) (Score (%)): leader Claude Mythos Preview (86.3), 3 models
Anthropic system-card LABBench2 clinical-trial subset, testing life-science reasoning over trial documents.
LABBench2 - Table Reading (Anthropic) (Score (%)): leader Claude Opus 4.8 (77.2), 2 models
Anthropic system-card LABBench2 table-reading subset, testing scientific table comprehension.
LABBench2 - Supplementary Materials (Anthropic) (Score (%)): leader Claude Opus 4.8 (58.9), 2 models
Anthropic system-card LABBench2 supplementary-materials subset, testing reasoning over scientific supporting files.
BoxPwnr CTF Bench (Average Platform Completion (%)): leader z-ai/glm-5.1 (54.37), 15 models
Aggregated BoxPwnr trace leaderboard over public CTF and security-lab platforms including CyBench, Hack The Box, picoCTF, PortSwigger, TryHackMe, Argus, and XBOW.
Agent Security League - Functional Correctness (Functional Correctness (%)): leader GPT-5.5 (84.9), 15 models
Endor Labs coding-agent benchmark measuring whether agents functionally complete security-sensitive software tasks.
Agent Security League - Security Correctness (Security Correctness (%)): leader GPT-5.5 (24.0), 15 models
Endor Labs coding-agent benchmark measuring whether completed software tasks avoid introducing or preserving security vulnerabilities.
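Each entry line in the lists above follows the pattern "Name (Metric): leader Model (score), N models", with parentheses sometimes nested inside the name, metric, or leader. A minimal sketch of a parser for that pattern follows, assuming the layout seen in this digest; `Entry` and `parse_entry` are hypothetical names, not part of any published tool.

```python
from typing import NamedTuple, Optional


class Entry(NamedTuple):
    benchmark: str
    metric: str
    leader: str
    score: float
    models: int


def parse_entry(line: str) -> Optional[Entry]:
    """Parse one entry line; return None for description lines or other text."""
    head, sep, tail = line.partition(": leader ")
    if not sep or not head.endswith(")"):
        return None
    # Walk backwards to find the "(" that balances the final ")" of the head,
    # so nested parentheses such as "(Accuracy (%))" stay inside the metric.
    depth = 0
    open_idx = -1
    for i in range(len(head) - 1, -1, -1):
        if head[i] == ")":
            depth += 1
        elif head[i] == "(":
            depth -= 1
            if depth == 0:
                open_idx = i
                break
    if open_idx < 0:
        return None
    benchmark = head[:open_idx].strip()
    metric = head[open_idx + 1:-1]
    # The tail looks like "Claude Fable 5 (max) (87.8), 27 models".
    left, sep, count = tail.rpartition("), ")
    if not sep or not count.endswith(" models"):
        return None
    leader, sep, score = left.rpartition(" (")
    if not sep:
        return None
    try:
        return Entry(benchmark, metric, leader, float(score),
                     int(count[: -len(" models")]))
    except ValueError:
        return None


line = ("FrontierMath - Tier 4 (v2) (Accuracy (%, 41 private v2 problems)): "
        "leader Claude Fable 5 (max) (87.8), 27 models")
print(parse_entry(line))
```

Splitting from the right for the score and model count, and matching parentheses from the end of the head, avoids misreading qualifiers like "(v2)" or "(max)" as the metric or the score.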

New Models (67)

Claude Fable 5 — ELO 2697, #4
- Lynchmark: 100.0 (#1/13)
- Design Arena (Website): 1345.0 (#1/143)
- Design Arena (Game Dev): 1382.0 (#1/129)
- Design Arena (UI Components): 1417.0 (#1/123)
- Design Arena (Data Viz): 1381.0 (#1/125)
- Design Arena (3D): 1370.0 (#1/117)
- Design Arena (SVG): 1370.0 (#1/94)
- Chatbot Arena (Text): 1510.0 (#1/366)
- Chatbot Arena (Code): 1665.0 (#1/86)
- Blueprint-Bench 2: 0.386 (#1/14)
Claude Opus 4.8 — ELO 2449, #6
- Evals for Every Language - MGSM: 96.62 (#1/70)
- Evals for Every Language - Language ar: 71.58 (#1/71)
- Evals for Every Language - Language be: 69.43 (#1/71)
- Evals for Every Language - Language ak: 60.02 (#2/71)
- Evals for Every Language - Language bem: 60.25 (#2/71)
- Evals for Every Language - Language bm: 59.47 (#2/71)
- Evals for Every Language - Language chm: 63.17 (#2/71)
- Evals for Every Language - Language ckb: 71.59 (#2/71)
- Evals for Every Language - Language crh: 69.2 (#2/71)
- Evals for Every Language - Language en: 86.15 (#2/71)
GPT-5.5 — ELO 2384, #7
- Blueprint-Bench 2: 0.362 (#2/14)
- GRAB-Lite: 71.8 (#2/38)
- Evals for Every Language - Language ary: 47.34 (#2/71)
- Evals for Every Language - Language doi: 71.32 (#2/71)
- Evals for Every Language - Language et: 72.25 (#3/71)
- Evals for Every Language - ARC: 97.82 (#4/69)
- Evals for Every Language - Language ay: 59.02 (#4/71)
- Evals for Every Language - Language az: 65.39 (#4/71)
- Evals for Every Language - Language bho: 67.61 (#4/71)
- Evals for Every Language - Language bm: 54.72 (#4/71)
Qwen 3.7 Max — ELO 2370, #8
- Position Bias (Lechmazur): 34.8 (#10/36)
- RuneBench: 2222.0 (#11/23)
- Wolfram LLM Benchmarking Project: 67.5 (#14/483)
Claude Opus 4.7 — ELO 2325, #10
- Evals for Every Language - Language chm: 63.6 (#1/71)
- Evals for Every Language - Language cs: 74.38 (#1/71)
- Evals for Every Language - Language doi: 71.84 (#1/71)
- Evals for Every Language: 66.95 (#2/71)
- Evals for Every Language - MGSM: 95.57 (#2/70)
- Evals for Every Language - Language am: 67.86 (#2/71)
- Evals for Every Language - Language ar: 70.69 (#2/71)
- Evals for Every Language - Language arz: 52.06 (#2/71)
- Evals for Every Language - Language as: 68.11 (#2/71)
- Evals for Every Language - Language awa: 68.23 (#2/71)
Nemotron 3 Ultra — ELO 2288, #13
- YC-Bench: 326.9 (#18/26)
- SimpleBench: 41.7 (#37/74)
Claude Opus 4.6 — ELO 2253, #15
- Android Bench: 66.6 (#5/23)
GPT-5.4 — ELO 2242, #16
- Blueprint-Bench 2: 0.271 (#4/14)
Gemini 3.5 Flash — ELO 2219, #18
- ZeroBench: 19.0 (#4/60)
- GRAB-Lite: 63.0 (#4/38)
- Position Bias (Lechmazur): 29.8 (#5/36)
- Android Bench: 63.7 (#6/23)
- YC-Bench: 987.0 (#12/26)
- SWE-rebench: 49.45 (#30/85)
GPT-5 Pro — ELO 2217, #19
- Epoch AI - ECI: 149.85 (#69/374)
Qwen Max — ELO 2148, #23
- SimpleQA Verified: 58.52 (#10/55)
- OTIS Mock AIME 2024-25: 95.0 (#13/145)
- Chess Puzzles (Epoch AI): 22.0 (#22/46)
DeepSeek V4 Pro — ELO 2097, #30
- RuneBench: 2939.0 (#6/23)
- ProphetArena: 0.9061 (#15/46)
- Position Bias (Lechmazur): 43.6 (#19/36)
Qwen 3.6 Plus — ELO 2092, #31
- ProphetArena: 0.9289 (#3/46)
- Evals for Every Language - Language as: 66.13 (#6/71)
- Evals for Every Language - Language bm: 50.12 (#6/71)
- Evals for Every Language - Language chm: 59.82 (#6/71)
- Evals for Every Language - Language ckb: 68.26 (#6/71)
- Evals for Every Language - Language ace: 65.27 (#7/71)
- Evals for Every Language - Language cv: 61.59 (#7/71)
- Evals for Every Language - Language be: 65.74 (#8/71)
- Evals for Every Language - Language bjn: 45.34 (#8/71)
- Evals for Every Language - Language ban: 62.52 (#9/71)
MiMo-V2.5-Pro — ELO 2059, #36
- LLM Stats (CMMLU): 90.2 (#1/6)
- LLM Stats (DROP): 86.3 (#3/29)
- LLM Stats (TriviaQA): 81.3 (#3/18)
- LLM Stats (C-Eval): 91.5 (#5/18)
- LLM Stats (Claw-Eval): 64.0 (#5/11)
- LLM Stats (GDPval-AA): 1581.0 (#6/13)
- Vals AI ProofBench: 24.0 (#13/42)
- LLM Stats (MMLU-Redux): 92.8 (#14/47)
- Vals AI MedScribe: 83.73 (#14/64)
- Vals AI (Vals Index): 50.74 (#16/29)
MiniMax-M3 — ELO 2054, #37
- OSWorld: 75.19 (#6/61)
- WebDev Arena: 1527.75 (#9/70)
- YC-Bench: 999.5 (#11/26)
- Position Bias (Lechmazur): 34.9 (#11/36)
- Sycophancy (Lechmazur): 3.5 (#12/32)
- Design Arena (SVG): 1255.0 (#18/94)
- Design Arena (Game Dev): 1273.0 (#27/129)
- SWE-rebench: 45.64 (#38/85)
O3 — ELO 2049, #39
- GRAB-Lite: 40.8 (#21/38)
Kimi K2.6 — ELO 2048, #41
- RuneBench: 1256.0 (#16/23)
- Position Bias (Lechmazur): 47.3 (#24/36)
GPT-5.1 — ELO 2045, #42
- GRAB-Lite: 44.4 (#17/38)
kimi-k2.7-code — ELO 2040, #45
- LiveBench Python: 90.0 (#2/125)
- LiveBench TypeScript: 65.0 (#3/124)
- OTIS Mock AIME 2024-25: 96.39 (#6/145)
- Design Arena (Website): 1322.0 (#7/143)
- Design Arena (3D): 1328.0 (#11/117)
- LiveBench Logic With Navigation: 74.0 (#14/125)
- LiveBench Zebra Puzzle: 96.0 (#15/124)
- LiveBench Olympiad: 90.3 (#17/125)
- Vals AI Vibe Code Bench: 47.21 (#18/62)
- LiveBench JavaScript: 55.0 (#19/125)
Claude Sonnet 4.6 — ELO 2023, #50
- ZeroBench: 11.0 (#11/60)
- SWE-rebench: 54.49 (#18/85)
- Terminal-Bench 2.0: 53.4 (#21/58)
GLM-5.1 — ELO 2004, #55
- ProphetArena: 0.9253 (#4/46)
- FrontierSWE: 32.0 (#9/13)
Grok 4.3 — ELO 1973, #64
- ProphetArena: 0.9188 (#6/46)
Step 3.7 Flash — ELO 1962, #71
- Design Arena (Game Dev): 1216.0 (#54/129)
Qwen 3.7 Plus — ELO 1960, #72
- Sycophancy (Lechmazur): 5.0 (#18/32)
Qwen 3.5 Plus — ELO 1951, #77
- Epoch AI - Apex Agents: 13.6 (#29/46)
Grok 4.20 — ELO 1936, #82
- Evals for Every Language - Language fa: 70.2 (#5/71)
- Evals for Every Language - MGSM: 87.39 (#7/70)
- Evals for Every Language - Language ak: 56.11 (#7/71)
- Evals for Every Language - Language cy: 77.85 (#7/71)
- Evals for Every Language - Language en: 84.29 (#7/71)
- Evals for Every Language - Language am: 64.62 (#8/71)
- Evals for Every Language - Language ba: 66.75 (#8/71)
- Evals for Every Language - Language ceb: 74.99 (#8/71)
- Evals for Every Language - Language es: 72.75 (#8/71)
- Evals for Every Language - Language ar: 67.76 (#9/71)
GPT-5.4 Mini — ELO 1912, #91
- ZeroBench: 10.0 (#13/60)
Claude Sonnet 4 (20250514) — ELO 1909, #95
- Epoch AI - Apex Agents: 9.3 (#33/46)
Gemini 3.1 Flash Lite — ELO 1905, #97
- Evals for Every Language - Language am: 68.6 (#1/71)
- Evals for Every Language - Language ca: 76.29 (#1/71)
- Evals for Every Language - Language ceb: 78.06 (#1/71)
- Evals for Every Language - Language cy: 82.03 (#1/71)
- Evals for Every Language - Language el: 73.81 (#1/71)
- Evals for Every Language - Language en: 87.28 (#1/71)
- Evals for Every Language - Language es: 76.16 (#1/71)
- Evals for Every Language - Language aeb: 53.18 (#2/71)
- Evals for Every Language - Language az: 67.76 (#2/71)
- Evals for Every Language - Language eo: 76.43 (#2/71)
Qwen 3.5 122B A10B — ELO 1903, #101
- LIBRA - ruSciPassageCount *: 21.38 (#3/13)
- LIBRA - ruBABILongQA1: 66.8 (#3/13)
- LIBRA - ruBABILongQA2: 53.71 (#3/13)
- LIBRA - ruBABILongQA3 *: 31.85 (#3/13)
- LIBRA - MatreshkaNames *: 67.39 (#4/13)
- LIBRA - LibrusecHistory: 79.77 (#4/13)
- LIBRA - ru2WikiMultihopQA *: 55.3 (#4/13)
- LIBRA - ruSciFi: 50.29 (#4/13)
- LIBRA - LibrusecMHQA *: 42.32 (#4/13)
- LIBRA - ruBABILongQA4: 58.91 (#4/13)
MiMo-V2.5 — ELO 1903, #102
- LLM Stats (Video-MME): 87.7 (#1/14)
- LLM Stats (Claw-Eval): 63.2 (#6/11)
- LLM Stats (CharXiv-R): 81.0 (#12/38)
- Vals AI Multimodal Index: 52.77 (#12/21)
- Vals AI (Vals Index): 51.57 (#15/29)
- Vals AI Vibe Code Bench: 42.17 (#21/62)
- Vals AI ProofBench: 16.0 (#22/42)
- Vals AI SAGE: 43.27 (#26/61)
- Vals AI MortgageTax: 59.26 (#49/80)
- Vals AI MedScribe: 72.15 (#50/64)
qwen3.6-flash — ELO 1872, #116
- Evals for Every Language - Language chm: 55.74 (#12/71)
- Evals for Every Language - Language am: 57.31 (#19/71)
- Evals for Every Language - Language ban: 58.91 (#19/71)
- Evals for Every Language - ARC: 91.99 (#20/69)
- Evals for Every Language - Language ckb: 62.38 (#20/71)
- Evals for Every Language - Language dz: 45.14 (#20/71)
- Evals for Every Language - Language en: 79.89 (#20/71)
- Evals for Every Language - Language ace: 57.48 (#21/71)
- Evals for Every Language - Language cv: 53.08 (#21/71)
- Evals for Every Language - Language ee: 41.46 (#21/71)
MiniMax-M2.7 — ELO 1853, #124
- ProphetArena: 0.9215 (#5/46)
O3 Mini — ELO 1850, #127
- FinBen - FNS: 16.95 (#4/21)
- FinBen - FinNum: 20.98 (#5/21)
nemotron-3-ultra-550B-a55B — ELO 1778, #168
- Vals AI ProofBench: 2.0 (#40/42)
- Vals AI Vibe Code Bench: 7.64 (#49/62)
- WeirdML: 43.45 (#63/131)
- Design Arena (Website): 1144.0 (#97/143)
DeepSeek V3.1 — ELO 1763, #176
- Evals for Every Language - Language da: 76.78 (#2/71)
- Evals for Every Language - Language ban: 65.1 (#4/71)
- Evals for Every Language - ARC: 97.4 (#5/69)
- Evals for Every Language - Language ay: 58.91 (#5/71)
- Evals for Every Language - Language ar: 68.89 (#6/71)
- Evals for Every Language - Language ca: 73.25 (#6/71)
- Evals for Every Language - Language bem: 54.51 (#7/71)
- Evals for Every Language - MMLU: 97.67 (#8/69)
- Evals for Every Language - Language el: 70.9 (#10/71)
- Evals for Every Language - Language as: 64.63 (#11/71)
GPT-4o — ELO 1712, #208
- FinBen (Financial LLM): 46.01 (#1/20)
- FinBen - QA: 78.22 (#1/20)
- FinBen - FNS: 25.5 (#3/21)
- FinBen - MultiFin: 59.26 (#4/20)
- FinBen - FinNum: 9.18 (#6/21)
Mistral Medium 3.5 — ELO 1712, #209
- Position Bias (Lechmazur): 72.5 (#36/36)
Mistral-Small-3.2-24B-Instruct-2506 — ELO 1708, #211
- Evals for Every Language - Classification: 89.59 (#24/70)
- Evals for Every Language - Language en: 76.19 (#29/71)
- Evals for Every Language - Language ars: 46.69 (#31/71)
- Evals for Every Language - Language awa: 61.09 (#31/71)
- Evals for Every Language - Language ca: 69.44 (#31/71)
- Evals for Every Language - Language be: 62.6 (#32/71)
- Evals for Every Language - Language cs: 66.53 (#32/71)
- Evals for Every Language - Language doi: 55.87 (#36/71)
- Evals for Every Language - Language eu: 59.54 (#37/71)
- Evals for Every Language - Language az: 58.06 (#39/71)
Qwen 3.5 35B A3B — ELO 1707, #213
- LIBRA - MatreshkaNames *: 68.97 (#2/13)
- LIBRA - ruSciPassageCount *: 21.89 (#2/13)
- LIBRA - ruSciFi: 51.47 (#2/13)
- LIBRA - ruBABILongQA1: 68.38 (#2/13)
- LIBRA - ruBABILongQA2: 54.97 (#2/13)
- LIBRA - ruBABILongQA3 *: 32.6 (#2/13)
- LIBRA - LibrusecHistory: 81.65 (#3/13)
- LIBRA - ru2WikiMultihopQA *: 56.6 (#3/13)
- LIBRA - LibrusecMHQA *: 43.32 (#3/13)
- LIBRA - ruBABILongQA4: 60.29 (#3/13)
DeepSeek V3 — ELO 1706, #215
- FinBen - FNS: 37.72 (#1/21)
- FinBen - MultiFin: 61.11 (#3/20)
- FinBen - FinNum: 7.43 (#7/21)
- FinBen - QA: 50.0 (#7/20)
- FinBen (Financial LLM): 10.2 (#13/20)
GPT-4.1 Mini — ELO 1705, #216
- GRAB-Lite: 18.6 (#32/38)
GPT-4o (2024-11-20) — ELO 1696, #224
- Epoch AI - Apex Agents: 1.1 (#46/46)
GLM 4.5 Air — ELO 1684, #230
- Evals for Every Language - Language chm: 47.52 (#21/71)
- Evals for Every Language - Language et: 66.43 (#21/71)
- Evals for Every Language - Language ckb: 60.22 (#22/71)
- Evals for Every Language - Language as: 60.21 (#23/71)
- Evals for Every Language - Language az: 62.15 (#24/71)
- Evals for Every Language - Language ak: 41.24 (#26/71)
- Evals for Every Language - Language es: 70.3 (#26/71)
- Evals for Every Language - Language ca: 70.13 (#27/71)
- Evals for Every Language - Language bho: 62.66 (#28/71)
- Evals for Every Language - Language ace: 51.96 (#29/71)
Hermes 4 70B — ELO 1674, #239
- Evals for Every Language - MGSM: 77.91 (#24/70)
- Evals for Every Language - MMLU: 88.52 (#26/69)
- Evals for Every Language - ARC: 83.16 (#38/69)
- Evals for Every Language - Language chm: 34.57 (#40/71)
- Evals for Every Language - Language dz: 28.93 (#40/71)
- Evals for Every Language - Language cv: 31.63 (#44/71)
- Evals for Every Language - Language am: 31.51 (#49/71)
- Evals for Every Language - Language ckb: 41.14 (#49/71)
- Evals for Every Language - Language as: 40.36 (#57/71)
- Evals for Every Language - Language ba: 41.82 (#58/71)
jamba-large-1.7 — ELO 1663, #245
- Evals for Every Language - Classification: 91.29 (#18/70)
- Evals for Every Language - Language af: 71.78 (#24/71)
- Evals for Every Language - Language fa: 65.89 (#24/71)
- Evals for Every Language - Language bg: 70.55 (#25/71)
- Evals for Every Language - Language ee: 30.99 (#26/71)
- Evals for Every Language - Language ar: 63.0 (#27/71)
- Evals for Every Language - Language be: 62.93 (#28/71)
- Evals for Every Language - Language de: 70.42 (#30/71)
- Evals for Every Language - Language aeb: 42.65 (#32/71)
- Evals for Every Language - Language doi: 56.88 (#33/71)
Llama 3.1 70B Instruct — ELO 1658, #251
- FinBen - FinNum: 46.34 (#3/21)
- FinBen - QA: 64.44 (#3/20)
- FinBen - FNS: 13.61 (#7/21)
- FinBen - MultiFin: 50.0 (#7/20)
- FinBen (Financial LLM): 14.07 (#8/20)
Ministral 3 8B (2512) — ELO 1640, #263
- Evals for Every Language - Language bm: 29.35 (#28/71)
- Evals for Every Language - Classification: 84.43 (#39/70)
- Evals for Every Language - Language cs: 65.2 (#41/71)
- Evals for Every Language - Language en: 73.09 (#42/71)
- Evals for Every Language - Language bn: 62.11 (#43/71)
- Evals for Every Language - Language es: 67.86 (#44/71)
- Evals for Every Language - Language el: 63.45 (#45/71)
- Evals for Every Language - Language be: 59.13 (#46/71)
- Evals for Every Language - Language ace: 43.57 (#47/71)
- Evals for Every Language - Language chm: 31.67 (#47/71)
Gemma 3 27B (IT) — ELO 1639, #266
- Evals for Every Language - Language el: 72.48 (#3/71)
- FinBen (Financial LLM): 15.74 (#7/20)
- FinBen - FinNum: 0.0 (#10/21)
- FinBen - MultiFin: 38.89 (#10/20)
- Evals for Every Language - Language eo: 73.18 (#10/71)
- Evals for Every Language - Classification: 95.41 (#11/70)
- Evals for Every Language - Language bg: 73.9 (#11/71)
- Evals for Every Language - Language es: 72.28 (#11/71)
- FinBen - QA: 22.67 (#13/20)
- FinBen - FNS: 0.21 (#14/21)
nova-2-lite-v1 — ELO 1635, #268
- Evals for Every Language - MMLU: 95.33 (#12/69)
- Evals for Every Language - Language en: 81.54 (#12/71)
- Evals for Every Language - Language be: 64.22 (#17/71)
- Evals for Every Language - Language chm: 52.92 (#18/71)
- Evals for Every Language - MGSM: 80.9 (#19/70)
- Evals for Every Language - Language bn: 68.34 (#19/71)
- Evals for Every Language - Language ak: 46.55 (#20/71)
- Evals for Every Language - Language cv: 53.69 (#20/71)
- Evals for Every Language - Language da: 71.55 (#21/71)
- Evals for Every Language - Language bm: 36.11 (#22/71)
Qwen 3.5 9B — ELO 1628, #272
- LIBRA - ruSciPassageCount *: 20.77 (#4/13)
- LIBRA - ruBABILongQA1: 64.88 (#4/13)
- LIBRA - ruBABILongQA2: 52.16 (#4/13)
- LIBRA - ruBABILongQA3 *: 30.94 (#4/13)
- LIBRA - MatreshkaNames *: 65.44 (#5/13)
- LIBRA - LibrusecHistory: 77.47 (#5/13)
- LIBRA - ru2WikiMultihopQA *: 53.7 (#5/13)
- LIBRA - ruSciFi: 48.84 (#5/13)
- LIBRA - LibrusecMHQA *: 41.1 (#5/13)
- LIBRA - ruBABILongQA4: 57.21 (#5/13)
Qwen 3 30B A3B 2507 Instruct — ELO 1615, #280
- Evals for Every Language - Language ars: 50.46 (#10/71)
- Evals for Every Language - Language aeb: 45.53 (#18/71)
- Evals for Every Language - Language en: 78.55 (#23/71)
- Evals for Every Language - Language bs: 68.53 (#30/71)
- Evals for Every Language - Language bg: 69.16 (#32/71)
- Evals for Every Language - Language arz: 42.54 (#33/71)
- Evals for Every Language - Language dz: 32.48 (#33/71)
- Evals for Every Language - Language am: 42.2 (#35/71)
- Evals for Every Language - Language bn: 63.67 (#37/71)
- Evals for Every Language - Language ace: 47.99 (#38/71)
Hunyuan A13B-Instruct — ELO 1579, #307
- Evals for Every Language - Language ars: 42.49 (#55/71)
- Evals for Every Language - Language aeb: 35.99 (#56/71)
- Evals for Every Language - Language apc: 40.5 (#56/71)
- Evals for Every Language - Translation From: 22.42 (#57/71)
- Evals for Every Language - Language ary: 33.19 (#57/71)
- Evals for Every Language - Language ak: 25.6 (#58/71)
- Evals for Every Language - Language cv: 24.99 (#58/71)
- Evals for Every Language - Language arz: 36.21 (#59/71)
- Evals for Every Language - Translation To: 18.34 (#60/71)
- Evals for Every Language - Language bjn: 29.79 (#61/71)
GPT-4o Mini — ELO 1543, #345
- GRAB-Lite: 11.4 (#38/38)
Ministral 3 14B (2512) — ELO 1532, #356
- Evals for Every Language - Classification: 88.17 (#31/70)
- Evals for Every Language - Language be: 62.67 (#31/71)
- Evals for Every Language - Language el: 66.63 (#31/71)
- Evals for Every Language - Language bn: 64.34 (#33/71)
- Evals for Every Language - Language az: 59.29 (#34/71)
- Evals for Every Language - Language af: 69.81 (#35/71)
- Evals for Every Language - Language es: 68.9 (#37/71)
- Evals for Every Language - Language en: 73.16 (#41/71)
- Evals for Every Language - Language arz: 40.92 (#42/71)
- Evals for Every Language - Language bg: 67.27 (#42/71)
GPT-OSS-20B — ELO 1515, #371
- Evals for Every Language - Language en: 77.2 (#26/71)
- Evals for Every Language - Language es: 69.42 (#30/71)
- Evals for Every Language - Language awa: 60.89 (#34/71)
- Evals for Every Language - Language bs: 67.0 (#37/71)
- Evals for Every Language - Language da: 69.16 (#37/71)
- Evals for Every Language - Language dz: 30.01 (#37/71)
- Evals for Every Language - Language as: 54.72 (#38/71)
- Evals for Every Language - Language bem: 33.09 (#39/71)
- Evals for Every Language - Language ak: 34.16 (#40/71)
- Evals for Every Language - Language cs: 65.28 (#40/71)
Llama 4 Scout Instruct — ELO 1498, #384
- FinBen - FinNum: 49.12 (#2/21)
- FinBen - QA: 74.22 (#2/20)
- FinBen (Financial LLM): 20.89 (#3/20)
- FinBen - FNS: 16.9 (#5/21)
- FinBen - MultiFin: 55.56 (#5/20)
Laguna M.1 — ELO 1491, #391
- Vals AI (Vals Index): 35.27 (#27/29)
- Vals AI ProofBench: 0.0 (#42/42)
- Vals AI Terminal-Bench 2.0: 31.46 (#43/68)
- Vals AI Vibe Code Bench: 10.94 (#48/62)
- Vals AI MedCode: 25.24 (#64/67)
- Vals AI CorpFin v2: 58.16 (#68/115)
- Vals AI LegalBench: 75.14 (#86/118)
- Vals AI TaxEval v2: 1.64 (#121/121)
granite-4.0-h-micro — ELO 1486, #399
- Evals for Every Language - Classification: 86.11 (#36/70)
- Evals for Every Language - Language ar: 60.19 (#45/71)
- Evals for Every Language - Language cv: 27.5 (#52/71)
- Evals for Every Language - Language ay: 29.15 (#54/71)
- Evals for Every Language - Language bn: 50.72 (#55/71)
- Evals for Every Language - Language ary: 33.31 (#56/71)
- Evals for Every Language - Language bg: 58.7 (#57/71)
- Evals for Every Language - Language eo: 59.1 (#57/71)
- Evals for Every Language - Language ak: 25.2 (#59/71)
- Evals for Every Language - Language da: 56.25 (#60/71)
Laguna XS.2 — ELO 1486, #401
- Vals AI (Vals Index): 29.15 (#28/29)
- Vals AI ProofBench: 1.0 (#41/42)
- Vals AI Terminal-Bench 2.0: 28.09 (#47/68)
- Vals AI Vibe Code Bench: 3.84 (#53/62)
- Vals AI MedCode: 20.7 (#66/67)
- Vals AI CorpFin v2: 56.33 (#72/115)
- Vals AI LegalBench: 71.03 (#91/118)
- Vals AI TaxEval v2: 59.98 (#107/121)
Gemma 3 4B (IT) — ELO 1463, #424
- FinBen (Financial LLM): 12.74 (#9/20)
- FinBen - FinNum: 0.0 (#9/21)
- FinBen - MultiFin: 38.89 (#9/20)
- FinBen - QA: 22.67 (#12/20)
- FinBen - FNS: 0.24 (#13/21)
Phi-4 Mini Instruct — ELO 1451, #434
- Evals for Every Language - Classification: 79.23 (#54/70)
- Evals for Every Language - Language ckb: 31.9 (#56/71)
- Evals for Every Language - Language aeb: 33.95 (#61/71)
- Evals for Every Language - Language ee: 21.7 (#62/71)
- Evals for Every Language - Language en: 62.96 (#62/71)
- Evals for Every Language - Language es: 51.93 (#65/71)
- Evals for Every Language - MGSM: 16.66 (#66/70)
- Evals for Every Language - MMLU: 43.8 (#66/69)
- Evals for Every Language - ARC: 41.91 (#67/69)
- Evals for Every Language - Language doi: 30.09 (#67/71)
Qwen 3.5 4B — ELO 1430, #455
- LIBRA - ruSciPassageCount *: 19.57 (#5/13)
- LIBRA - ruBABILongQA1: 61.13 (#5/13)
- LIBRA - ruBABILongQA2: 49.14 (#5/13)
- LIBRA - ruBABILongQA3 *: 29.15 (#5/13)
- LIBRA - MatreshkaNames *: 61.66 (#6/13)
- LIBRA - LibrusecHistory: 72.99 (#6/13)
- LIBRA - ru2WikiMultihopQA *: 50.6 (#6/13)
- LIBRA - ruSciFi: 46.02 (#6/13)
- LIBRA - LibrusecMHQA *: 38.73 (#6/13)
- LIBRA - ruBABILongQA4: 53.9 (#6/13)
Qwen 3.5 0.8B — ELO 1370, #528
- LIBRA - ruSciPassageCount *: 17.79 (#7/13)
- LIBRA - ruBABILongQA2: 44.67 (#7/13)
- LIBRA - MatreshkaNames *: 56.05 (#8/13)
- LIBRA - ru2WikiMultihopQA *: 46.0 (#8/13)
- LIBRA - LibrusecMHQA *: 35.21 (#8/13)
- LIBRA - ruBABILongQA1: 55.57 (#8/13)
- LIBRA - ruBABILongQA4: 49.0 (#8/13)
- LIBRA - ruSciAbstractRetrieval: 56.26 (#9/13)
- LIBRA - ruSciFi: 41.83 (#9/13)
- LIBRA - ruBABILongQA3 *: 26.5 (#9/13)
Qwen 3.5 2B — ELO 1247, #653
- LIBRA - ruSciPassageCount *: 18.72 (#6/13)
- LIBRA - ruBABILongQA2: 47.01 (#6/13)
- LIBRA - ruBABILongQA3 *: 27.88 (#6/13)
- LIBRA - MatreshkaNames *: 58.98 (#7/13)
- LIBRA - ru2WikiMultihopQA *: 48.4 (#7/13)
- LIBRA - ruSciFi: 44.02 (#7/13)
- LIBRA - LibrusecMHQA *: 37.05 (#7/13)
- LIBRA - ruBABILongQA1: 58.48 (#7/13)
- LIBRA - ruBABILongQA4: 51.56 (#7/13)
- LIBRA - LibrusecHistory: 69.83 (#8/13)
Qwen2.5-Omni-7B — ELO 1227, #667
- FinBen (Financial LLM): 33.53 (#2/20)
- FinBen - FinNum: 0.4 (#8/21)
- FinBen - QA: 48.89 (#8/20)
- FinBen - FNS: 5.6 (#11/21)
- FinBen - MultiFin: 38.89 (#11/20)
Gemma 4 12B — ELO 1100, #731
- LLM Stats (MRCR v2): 43.4 (#3/7)
- LLM Stats (FLEURS): 93.1 (#4/6)
- LLM Stats (MedXpertQA): 48.7 (#8/12)
- LLM Stats (MathVision): 79.7 (#9/28)
- LLM Stats (AIME 2026): 77.5 (#13/16)
- LLM Stats (OmniDocBench 1.5): 16.4 (#13/15)
- LLM Stats (CodeForces): 55.3 (#15/16)
- LLM Stats (MMMLU): 83.4 (#34/48)
- ZeroEval GPQA Diamond: 78.8 (#82/223)
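
For readers who want to process the per-model entries above programmatically, the following is a minimal sketch. It assumes the line shapes seen in this digest: a header of the form "model — ELO rating, #rank" followed by result lines of the form "- benchmark: score (#rank/total)". The names `HEADER`, `RESULT`, and `parse_entries` are hypothetical and are not part of any digest tooling.

```python
import re

# Header line: "<model> — ELO <rating>, #<rank>" (shape inferred from the digest text).
HEADER = re.compile(r"^(?P<model>.+) — ELO (?P<elo>\d+), #(?P<rank>\d+)$")
# Result line: "- <benchmark>: <score> (#<rank>/<total>)".
RESULT = re.compile(
    r"^- (?P<bench>.+): (?P<score>-?\d+(?:\.\d+)?) \(#(?P<rank>\d+)/(?P<total>\d+)\)$"
)


def parse_entries(lines):
    """Group result lines under the most recent model header."""
    models = {}
    current = None
    for line in lines:
        line = line.strip()
        if m := HEADER.match(line):
            current = m["model"]
            models[current] = {
                "elo": int(m["elo"]),
                "rank": int(m["rank"]),
                "results": [],
            }
        elif (m := RESULT.match(line)) and current is not None:
            # Each result is (benchmark, score, rank, number of ranked models).
            models[current]["results"].append(
                (m["bench"], float(m["score"]), int(m["rank"]), int(m["total"]))
            )
    return models


# Sample lines copied from the entries above.
sample = [
    "MiniMax-M2.7 — ELO 1853, #124",
    "- ProphetArena: 0.9215 (#5/46)",
    "O3 Mini — ELO 1850, #127",
    "- FinBen - FNS: 16.95 (#4/21)",
    "- FinBen - FinNum: 20.98 (#5/21)",
]
parsed = parse_entries(sample)
```

Lines that match neither pattern are skipped silently, which is a design choice for this sketch; a stricter parser might raise on unrecognized lines instead.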

Top-10 New Scores (186)

Claude Fable 5 on AI Chess Leaderboard (Continuation): 1092.0 (#30)
Claude Fable 5 on AI Chess Leaderboard (Reasoning): 1711.0 (#8)
Claude Fable 5 on Chatbot Arena (Document): 1495.0 (#5)
Claude Fable 5 on Chatbot Arena (Vision): 1307.0 (#2)
Claude Fable 5 on ClockBench: 35.0 (#4)
Claude Fable 5 on Epoch AI - Apex Agents: 45.0 (#3)
Claude Fable 5 on LLM Stats (GDPval-AA): 1932.0 (#1)
Claude Fable 5 on Lynchmark: 100.0 (#1)
Claude Fable 5 on MineBench: 1790.51 (#4)
Claude Fable 5 on PM-LLM-Benchmark: 35.6 (#13)
Claude Fable 5 on PinchBench: 59.61 (#44)
Claude Fable 5 on React Native Evals: 86.96 (#4)
Claude Fable 5 on SEAL - MCP Atlas: 83.3 (#2)
Claude Fable 5 on Vals AI MedCode: 56.07 (#2)
Claude Fable 5 on Vals AI MortgageTax: 68.92 (#5)
Claude Fable 5 on Vals AI SAGE: 51.89 (#5)
Claude Fable 5 on Vals AI TaxEval v2: 76.94 (#3)
Claude Fable 5 on Vellum - GPQA: 94.1 (#3)
Claude Fable 5 on Vellum - HumanEval: 95.0 (#2)
Claude Fable 5 on Vending-Bench 2: 4529.94 (#18)
Claude Opus 4.7 on Android Bench: 68.7 (#4)
Claude Opus 4.7 on Evals for Every Language: 66.95 (#2)
Claude Opus 4.7 on Evals for Every Language - ARC: 97.23 (#6)
Claude Opus 4.7 on Evals for Every Language - Classification: 95.98 (#7)
Claude Opus 4.7 on Evals for Every Language - Language ace: 69.04 (#3)
Claude Opus 4.7 on Evals for Every Language - Language aeb: 50.61 (#4)
Claude Opus 4.7 on Evals for Every Language - Language af: 76.97 (#9)
Claude Opus 4.7 on Evals for Every Language - Language ak: 59.75 (#3)
Claude Opus 4.7 on Evals for Every Language - Language am: 67.86 (#2)
Claude Opus 4.7 on Evals for Every Language - Language apc: 55.53 (#5)
Claude Opus 4.7 on Evals for Every Language - Language ar: 70.69 (#2)
Claude Opus 4.7 on Evals for Every Language - Language ars: 49.83 (#13)
Claude Opus 4.7 on Evals for Every Language - Language ary: 44.23 (#12)
Claude Opus 4.7 on Evals for Every Language - Language arz: 52.06 (#2)
Claude Opus 4.7 on Evals for Every Language - Language as: 68.11 (#2)
Claude Opus 4.7 on Evals for Every Language - Language awa: 68.23 (#2)
Claude Opus 4.7 on Evals for Every Language - Language ay: 59.38 (#3)
Claude Opus 4.7 on Evals for Every Language - Language az: 65.04 (#8)
Claude Opus 4.7 on Evals for Every Language - Language ba: 67.46 (#6)
Claude Opus 4.7 on Evals for Every Language - Language ban: 65.75 (#3)
Claude Opus 4.7 on Evals for Every Language - Language be: 66.48 (#4)
Claude Opus 4.7 on Evals for Every Language - Language bem: 59.05 (#4)
Claude Opus 4.7 on Evals for Every Language - Language bg: 74.44 (#4)
Claude Opus 4.7 on Evals for Every Language - Language bho: 67.27 (#8)
Claude Opus 4.7 on Evals for Every Language - Language bjn: 48.88 (#4)
Claude Opus 4.7 on Evals for Every Language - Language bm: 58.41 (#3)
Claude Opus 4.7 on Evals for Every Language - Language bn: 72.35 (#4)
Claude Opus 4.7 on Evals for Every Language - Language bs: 70.22 (#24)
Claude Opus 4.7 on Evals for Every Language - Language ca: 72.35 (#14)
Claude Opus 4.7 on Evals for Every Language - Language ceb: 75.18 (#6)
Claude Opus 4.7 on Evals for Every Language - Language ckb: 70.88 (#3)
Claude Opus 4.7 on Evals for Every Language - Language crh: 66.99 (#3)
Claude Opus 4.7 on Evals for Every Language - Language cs: 74.38 (#1)
Claude Opus 4.7 on Evals for Every Language - Language cv: 62.92 (#4)
Claude Opus 4.7 on Evals for Every Language - Language cy: 79.87 (#5)
Claude Opus 4.7 on Evals for Every Language - Language da: 74.47 (#6)
Claude Opus 4.7 on Evals for Every Language - Language de: 75.66 (#6)
Claude Opus 4.7 on Evals for Every Language - Language dz: 59.16 (#3)
Claude Opus 4.7 on Evals for Every Language - Language ee: 60.86 (#2)
Claude Opus 4.7 on Evals for Every Language - Language el: 71.56 (#8)
Claude Opus 4.7 on Evals for Every Language - Language en: 84.79 (#5)
Claude Opus 4.7 on Evals for Every Language - Language eo: 75.16 (#4)
Claude Opus 4.7 on Evals for Every Language - Language es: 70.89 (#19)
Claude Opus 4.7 on Evals for Every Language - Language et: 71.59 (#5)
Claude Opus 4.7 on Evals for Every Language - Language eu: 68.46 (#7)
Claude Opus 4.7 on Evals for Every Language - Language fa: 69.71 (#8)
Claude Opus 4.7 on Evals for Every Language - MGSM: 95.57 (#2)
Claude Opus 4.7 on Evals for Every Language - MMLU: 95.33 (#13)
Claude Opus 4.7 on Evals for Every Language - Translation From: 40.53 (#7)
Claude Opus 4.7 on Evals for Every Language - Translation To: 39.5 (#4)
Claude Opus 4.7 on GRAB-Lite: 58.2 (#10)
Claude Opus 4.8 on Chess Puzzles (Epoch AI): 34.0 (#13)
Claude Opus 4.8 on Design Arena (Game Dev): 1300.0 (#17)
Claude Opus 4.8 on EQ-Bench Longform Writing: 80.8 (#3)
Claude Opus 4.8 on Epoch AI - Apex Agents: 42.5 (#4)
Claude Opus 4.8 on Epoch AI - ECI: 156.34 (#14)
Claude Opus 4.8 on Evals for Every Language: 66.27 (#3)
Claude Opus 4.8 on Evals for Every Language - ARC: 98.0 (#3)
Claude Opus 4.8 on Evals for Every Language - Classification: 90.31 (#21)
Claude Opus 4.8 on Evals for Every Language - Language ace: 66.63 (#6)
Claude Opus 4.8 on Evals for Every Language - Language aeb: 50.53 (#5)
Claude Opus 4.8 on Evals for Every Language - Language af: 78.38 (#4)
Claude Opus 4.8 on Evals for Every Language - Language ak: 60.02 (#2)
Claude Opus 4.8 on Evals for Every Language - Language am: 65.76 (#5)
Claude Opus 4.8 on Evals for Every Language - Language apc: 49.54 (#21)
Claude Opus 4.8 on Evals for Every Language - Language ars: 47.35 (#26)
Claude Opus 4.8 on Evals for Every Language - Language ary: 40.29 (#25)
Claude Opus 4.8 on Evals for Every Language - Language arz: 49.71 (#6)
Claude Opus 4.8 on Evals for Every Language - Language as: 66.93 (#4)
Claude Opus 4.8 on Evals for Every Language - Language awa: 67.71 (#4)
Claude Opus 4.8 on Evals for Every Language - Language ay: 58.4 (#6)
Claude Opus 4.8 on Evals for Every Language - Language az: 65.38 (#5)
Claude Opus 4.8 on Evals for Every Language - Language ba: 67.66 (#5)
Claude Opus 4.8 on Evals for Every Language - Language ban: 63.8 (#5)
Claude Opus 4.8 on Evals for Every Language - Language bem: 60.25 (#2)
Claude Opus 4.8 on Evals for Every Language - Language bg: 74.44 (#5)
Claude Opus 4.8 on Evals for Every Language - Language bho: 67.32 (#7)
Claude Opus 4.8 on Evals for Every Language - Language bjn: 47.35 (#6)
Claude Opus 4.8 on Evals for Every Language - Language bm: 59.47 (#2)
Claude Opus 4.8 on Evals for Every Language - Language bn: 70.4 (#12)
Claude Opus 4.8 on Evals for Every Language - Language bs: 74.0 (#5)
Claude Opus 4.8 on Evals for Every Language - Language ca: 74.29 (#3)
Claude Opus 4.8 on Evals for Every Language - Language ceb: 75.82 (#5)
Claude Opus 4.8 on Evals for Every Language - Language chm: 63.17 (#2)
Claude Opus 4.8 on Evals for Every Language - Language ckb: 71.59 (#2)
Claude Opus 4.8 on Evals for Every Language - Language crh: 69.2 (#2)
Claude Opus 4.8 on Evals for Every Language - Language cs: 73.8 (#3)
Claude Opus 4.8 on Evals for Every Language - Language cv: 64.32 (#3)
Claude Opus 4.8 on Evals for Every Language - Language cy: 79.83 (#6)
Claude Opus 4.8 on Evals for Every Language - Language da: 74.57 (#5)
Claude Opus 4.8 on Evals for Every Language - Language de: 76.71 (#3)
Claude Opus 4.8 on Evals for Every Language - Language doi: 70.16 (#4)
Claude Opus 4.8 on Evals for Every Language - Language dz: 58.51 (#4)
Claude Opus 4.8 on Evals for Every Language - Language ee: 57.06 (#4)
Claude Opus 4.8 on Evals for Every Language - Language el: 70.34 (#13)
Claude Opus 4.8 on Evals for Every Language - Language en: 86.15 (#2)
Claude Opus 4.8 on Evals for Every Language - Language eo: 74.5 (#6)
Claude Opus 4.8 on Evals for Every Language - Language es: 70.97 (#18)
Claude Opus 4.8 on Evals for Every Language - Language et: 70.93 (#7)
Claude Opus 4.8 on Evals for Every Language - Language eu: 66.0 (#19)
Claude Opus 4.8 on Evals for Every Language - Language fa: 69.54 (#9)
Claude Opus 4.8 on Evals for Every Language - MMLU: 98.31 (#4)
Claude Opus 4.8 on Evals for Every Language - Translation From: 39.86 (#9)
Claude Opus 4.8 on Evals for Every Language - Translation To: 38.22 (#7)
Claude Opus 4.8 on GRAB-Lite: 60.6 (#6)
Claude Opus 4.8 on OTIS Mock AIME 2024-25: 98.33 (#4)
Claude Opus 4.8 on SimpleQA Verified: 39.5 (#26)
Claude Opus 4.8 on WebDev Arena: 1545.05 (#6)
Claude Opus 4.8 on Wolfram LLM Benchmarking Project: 65.9 (#18)
Claude Opus 4.8 on ZeroBench: 17.0 (#7)
GPT-5.5 on Blueprint-Bench 2: 0.362 (#2)
GPT-5.5 on Evals for Every Language: 65.09 (#5)
GPT-5.5 on Evals for Every Language - ARC: 97.82 (#4)
GPT-5.5 on Evals for Every Language - Classification: 82.73 (#42)
GPT-5.5 on Evals for Every Language - Language ace: 67.32 (#5)
GPT-5.5 on Evals for Every Language - Language aeb: 44.61 (#22)
GPT-5.5 on Evals for Every Language - Language af: 77.33 (#8)
GPT-5.5 on Evals for Every Language - Language ak: 57.86 (#5)
GPT-5.5 on Evals for Every Language - Language am: 65.01 (#6)
GPT-5.5 on Evals for Every Language - Language apc: 50.92 (#12)
GPT-5.5 on Evals for Every Language - Language ar: 65.19 (#18)
GPT-5.5 on Evals for Every Language - Language ars: 46.47 (#33)
GPT-5.5 on Evals for Every Language - Language ary: 47.34 (#2)
GPT-5.5 on Evals for Every Language - Language arz: 45.23 (#19)
GPT-5.5 on Evals for Every Language - Language as: 66.04 (#8)
GPT-5.5 on Evals for Every Language - Language awa: 66.14 (#8)
GPT-5.5 on Evals for Every Language - Language ay: 59.02 (#4)
GPT-5.5 on Evals for Every Language - Language az: 65.39 (#4)
GPT-5.5 on Evals for Every Language - Language ba: 64.64 (#14)
GPT-5.5 on Evals for Every Language - Language ban: 62.74 (#8)
GPT-5.5 on Evals for Every Language - Language be: 64.63 (#16)
GPT-5.5 on Evals for Every Language - Language bem: 53.46 (#8)
GPT-5.5 on Evals for Every Language - Language bg: 71.22 (#23)
GPT-5.5 on Evals for Every Language - Language bho: 67.61 (#4)
GPT-5.5 on Evals for Every Language - Language bjn: 44.06 (#12)
GPT-5.5 on Evals for Every Language - Language bm: 54.72 (#4)
GPT-5.5 on Evals for Every Language - Language bn: 69.73 (#14)
GPT-5.5 on Evals for Every Language - Language bs: 71.46 (#13)
GPT-5.5 on Evals for Every Language - Language ca: 73.21 (#7)
GPT-5.5 on Evals for Every Language - Language ceb: 74.54 (#10)
GPT-5.5 on Evals for Every Language - Language chm: 58.46 (#9)
GPT-5.5 on Evals for Every Language - Language ckb: 68.48 (#5)
GPT-5.5 on Evals for Every Language - Language crh: 63.78 (#15)
GPT-5.5 on Evals for Every Language - Language cs: 71.8 (#10)
GPT-5.5 on Evals for Every Language - Language cv: 59.68 (#10)
GPT-5.5 on Evals for Every Language - Language cy: 77.61 (#8)
GPT-5.5 on Evals for Every Language - Language da: 71.48 (#23)
GPT-5.5 on Evals for Every Language - Language de: 73.13 (#20)
GPT-5.5 on Evals for Every Language - Language doi: 71.32 (#2)
GPT-5.5 on Evals for Every Language - Language dz: 58.36 (#6)
GPT-5.5 on Evals for Every Language - Language ee: 56.99 (#5)
GPT-5.5 on Evals for Every Language - Language el: 71.64 (#6)
GPT-5.5 on Evals for Every Language - Language en: 85.03 (#4)
GPT-5.5 on Evals for Every Language - Language eo: 72.05 (#13)
GPT-5.5 on Evals for Every Language - Language es: 70.48 (#23)
GPT-5.5 on Evals for Every Language - Language et: 72.25 (#3)
GPT-5.5 on Evals for Every Language - Language eu: 67.59 (#11)
GPT-5.5 on Evals for Every Language - Language fa: 67.54 (#12)
GPT-5.5 on Evals for Every Language - MGSM: 90.21 (#5)
GPT-5.5 on Evals for Every Language - MMLU: 98.21 (#5)
GPT-5.5 on Evals for Every Language - Translation From: 40.95 (#6)
GPT-5.5 on Evals for Every Language - Translation To: 39.31 (#5)
GPT-5.5 on GRAB-Lite: 71.8 (#2)
Qwen 3.7 Max on Position Bias (Lechmazur): 34.8 (#10)
Qwen 3.7 Max on RuneBench: 2222.0 (#11)
Qwen 3.7 Max on Wolfram LLM Benchmarking Project: 67.5 (#14)

New #1 Leaders (92)

YC-Bench: Claude Fable 5 (1977.6) beat Claude Opus 4.7 by 263.1
PACT (Lechmazur): Claude Fable 5 (High) (2171.0) beat GPT-5.5 (High) by 155.0
Chatbot Arena (Code): Claude Fable 5 (1665.0) beat Claude Opus 4.7 (Thinking) by 98.0
Chatbot Arena (Text-to-Video): gemini-omni-flash (1527.0) beat dreamina-seedance-2.0-720p by 64.0
Design Arena (UI Components): Claude Fable 5 (1417.0) beat Claude Opus 4.7 by 57.0
Multi-turn Debate (Lechmazur): Claude Fable 5 (High) (1770.9) beat Claude Opus 4.7 (High) by 53.8
AA GDPval: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (1932.47) beat Claude Opus 4.8 (Adaptive Reasoning, Max Effort) by 42.67
Design Arena (Data Viz): Claude Fable 5 (1381.0) beat Claude Opus 4.7 (Thinking) by 42.0
Design Arena (Game Dev): Claude Fable 5 (1382.0) beat GPT-5.5 by 27.0
GSMA Open-Telco - TeleTables: TelecomGPT (88.0) beat OTel-LLM-8.3B-QnA by 26.2
LLM Stats (MCP-Mark): Kimi K2.7 Code (81.1) beat Qwen 3.7 Max by 20.3
Design Arena (Image): riverflow-2.5-pro (1419.0) beat gpt-image-2 by 17.0
WDCD: Qwen 3 Max (84.38) beat Claude Opus 4.7 by 14.38
Evals for Every Language - Language ay: step-3.7-flash-20260528 (77.14) beat Gemini 3.1 Pro (Preview) by 14.23
SEAL - SWE Atlas - Test Writing: Fable-5 (Claude Code) xHigh (58.52) beat GPT-5.4 (xHigh) by 14.16
LiveBench Python: Claude Fable 5 (xHigh) (95.0) beat Claude Opus 4.5 (Thinking 64K, High) (2025-11-01) by 10.0
LLM Stats (FLEURS): Qwen2.5-Omni-7B (95.9) beat Gemini 1.5 Flash-8B by 9.5
CursorBench 3.1: Claude Fable 5 (Max) (72.9) beat Claude Opus 4.7 by 8.1
AA Omniscience - Software Engineering (SWE) - Dart: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (88.0) beat GPT-5.3 Codex (xHigh) by 8.0
AA Omniscience - Software Engineering (SWE) - R: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (82.0) beat GPT-5.5 (High) by 8.0
AA Omniscience - Software Engineering (SWE) - Swift: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (100.0) beat GPT-5.5 (xHigh) by 8.0
Vals AI Vibe Code Bench: Claude Fable 5 (90.35) beat Claude Opus 4.8 by 7.63
AA Humanity's Last Exam: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (53.34) beat Claude Opus 4.8 (Adaptive Reasoning, Max Effort) by 7.6
AA Omniscience: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (40.15) beat Gemini 3.1 Pro (Preview) by 7.22
FrontierSWE: Claude Fable 5 (90.0) beat Claude Opus 4.8 by 7.0
Vellum - HumanEval: Claude Mythos 5 (95.5) beat Claude Opus 4.8 by 6.9
Vellum - Humanity's Last Exam: Claude Mythos 5 (64.5) beat Claude Opus 4.8 by 6.6
Evals for Every Language - Language crh: step-3.7-flash-20260528 (73.05) beat Gemini 3.1 Pro (Preview) by 6.27
Chatbot Arena (Text): Claude Fable 5 (1510.0) beat Claude Opus 4.6 (Thinking) by 6.0
AA Omniscience - Software Engineering (SWE) - Java: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (79.0) beat GPT-5.3 Codex (xHigh) by 6.0
Vals AI ProofBench: Claude Fable 5 (77.0) beat aristotle by 6.0
AA Omniscience - Business: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (55.0) beat GPT-5.5 (xHigh) by 5.9
FinBen - MultiFin: plutus-8B-instruct (72.22) beat Qwen 2.5 72B Instruct by 5.55
AA Omniscience - Science, Engineering & Mathematics: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (57.1) beat GPT-5.5 (High) by 4.8
Vals AI (Vals Index): Claude Fable 5 (75.14) beat Claude Opus 4.8 by 4.78
OpenClawProBench: GLM-5.2 (81.3) beat intern-s2-preview by 4.6
Vals AI IOI: Claude Fable 5 (72.25) beat GPT-5.4 (2026-03-05) by 4.42
AA Omniscience - Humanities & Social Sciences: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (60.9) beat Gemini 3 Pro (Preview) (High) by 4.3
Design Arena (Website): Claude Fable 5 (1345.0) beat Claude Opus 4.6 by 4.0
AA Omniscience - Software Engineering (SWE) - Go: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (88.0) beat GPT-5.5 (High) by 4.0
MathArena - ARXIV April: Claude Fable 5 (Max) (70.73) beat GPT-5.5 (xHigh) by 3.66
GSMA Open-Telco LLM Leaderboard: TelecomGPT (89.64) beat OTel-LLM-8.3B-QnA by 3.66
FinBen - QA: GPT-4o (78.22) beat GPT-4.5 (Preview) by 3.55
Artificial Analysis Intelligence Index: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (64.88) beat Claude Opus 4.8 (Adaptive Reasoning, Max Effort) by 3.44
Evals for Every Language - Language cv: gemma-4-31B-it-20260402 (69.3) beat Claude Opus 4.5 by 3.39
SEAL - SWE Atlas - Codebase QnA: Opus 4.8 (Claude Code) (48.79) beat GPT-5.5 by 3.36
Vals AI CorpFin v2: Claude Fable 5 (71.83) beat Grok 4.3 by 3.3
Vals AI Multimodal Index: Claude Fable 5 (74.15) beat Claude Opus 4.8 by 3.26
AA Omniscience - Software Engineering (SWE): Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (87.6) beat GPT-5.5 (xHigh) by 3.2
Design Arena (3D): Claude Fable 5 (1370.0) beat Kimi K2.6 by 3.0
GRAB-Lite: Claude Fable 5 (74.0) beat GPT-5.4 by 3.0
WeirdML: Claude Fable 5 (High) (87.85) beat GPT-5.5 (xHigh) by 2.94
BIRD-SQL: Gemini-SQL2 (80.04) beat Gemini-SQL (Multitask SFT + Gemini-2.5-Pro) by 2.9
GSMA Open-Telco - 3GPP: TelecomGPT (84.22) beat OTel-LLM-8.3B-QnA by 2.82
GSMA Open-Telco - TeleLogs: TelecomGPT (98.96) beat OTel-LLM-8.3B-QnA by 2.66
Evals for Every Language - MGSM: Claude Opus 4.8 (96.62) beat Claude Opus 4.6 by 2.36
Evals for Every Language - Language ban: step-3.7-flash-20260528 (69.03) beat Claude Opus 4.5 by 2.32
SimpleBench: Claude Fable 5 (81.9) beat Gemini 3.1 Pro (Preview) by 2.3
AA Terminal-Bench Hard: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (62.88) beat GPT-5.5 (xHigh) by 2.27
Chatbot Arena (Image-to-Video): gemini-omni-flash (1475.0) beat Grok 1.5 by 2.0
LiveBench Plot Unscrambling: Claude Fable 5 (xHigh) (78.09) beat GPT-5.5 (High) by 1.81
UGI - Writing: Claude Fable 5 (Adaptive Reasoning, High Effort) (74.23) beat Gemini 3.5 Flash (Thinking, Medium) by 1.69
GSMA Open-Telco - srsRAN-Bench: TelecomGPT (91.33) beat OTel-LLM-8.3B-QnA by 1.65
LLM Stats (OSWorld-Verified): Claude Fable 5 (85.0) beat Claude Opus 4.8 by 1.6
AA Omniscience - Software Engineering (SWE) - Python: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (92.0) beat GPT-5.5 (xHigh) by 1.5
Evals for Every Language - Language chm: Claude Opus 4.7 (63.6) beat Gemini 3.1 Pro (Preview) by 1.48
Evals for Every Language - Language doi: Claude Opus 4.7 (71.84) beat Gemini 3 Pro (Preview) by 1.46
AA CritPt: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (28.57) beat GPT-5.5 (xHigh) by 1.43
Evals for Every Language - Language es: Gemini 3.1 Flash Lite (76.16) beat Claude Opus 4.6 by 1.42
AA SciCode: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (60.19) beat Gemini 3.1 Pro (Preview) by 1.28
Evals for Every Language - Language ace: step-3.7-flash-20260528 (72.48) beat Gemini 3.1 Pro (Preview) by 1.28
Evals for Every Language - MMLU: intellect-3-20251126 (100.0) beat Claude Sonnet 4.6 by 1.27
Evals for Every Language - ARC: intellect-3-20251126 (100.0) beat Gemini 3.1 Pro (Preview) by 1.26
EQ-Bench Longform Writing: Claude Fable 5 (83.0) beat Claude Opus 4.7 by 1.2
Vals AI LegalBench: Claude Fable 5 (88.56) beat Gemini 3.1 Pro (Preview) by 1.16
Evals for Every Language - Language ca: Gemini 3.1 Flash Lite (76.29) beat Gemini 3 Pro (Preview) by 1.03
Design Arena (SVG): Claude Fable 5 (1370.0) beat prism by 1.0
Opper TaskBench: Claude Fable 5 (96.4) beat Claude Opus 4.7 by 1.0
Evals for Every Language - Language ar: Claude Opus 4.8 (71.58) beat Claude Opus 4.5 by 0.95
Evals for Every Language - Language en: Gemini 3.1 Flash Lite (87.28) beat MiniMax-M2.5 by 0.77
MathArena - HMMT Feb 2026: GPT-5.5 (xHigh) (98.48) beat GPT-5.4 (xHigh) by 0.75
Evals for Every Language - Language cy: Gemini 3.1 Flash Lite (82.03) beat Claude Sonnet 4.5 by 0.65
Evals for Every Language - Language am: Gemini 3.1 Flash Lite (68.6) beat Claude Opus 4.6 by 0.59
Vals AI MedScribe: Claude Fable 5 (88.52) beat GPT-5.1 by 0.43
Evals for Every Language - Language af: Gemini 3.1 Pro (Preview) (79.41) beat Claude Sonnet 4 by 0.43
Evals for Every Language - Language be: Claude Opus 4.8 (69.43) beat Gemini 3.1 Pro (Preview) by 0.32
LLM Stats (Video-MME): MiMo-V2.5 (87.7) beat Kimi K2.5 by 0.3
Evals for Every Language - Language ceb: Gemini 3.1 Flash Lite (78.06) beat Gemini 3.1 Pro (Preview) by 0.29
Evals for Every Language - Language el: Gemini 3.1 Flash Lite (73.81) beat Claude Opus 4.5 by 0.15
LLM Stats (CMMLU): MiMo-V2.5-Pro (90.2) beat Qwen 2 72B Instruct by 0.1
Blueprint-Bench 2: Claude Fable 5 (0.386) beat GPT-5.5 by 0.02
LiveBench Olympiad: Claude Fable 5 (High) (92.18) beat Claude Opus 4.6 (Thinking, High) by 0.01

AI Benchmark Digest — 2026-06-13

2026-06-13T08:02:57.174839+00:00

Daily

Top-10 New Scores (11)

Claude 5 on Chess Puzzles (Epoch AI): 41.0 (#8)
Claude 5 on OTIS Mock AIME 2024-25: 99.72 (#3)
Claude 5 on SimpleQA Verified: 68.3 (#4)
Claude Fable 5 on Epoch AI - Apex Agents: 45.0 (#3)
Claude Fable 5 on Icelandic LLM - ARC-Challenge-IS: 72.95 (#59)
Claude Fable 5 on Icelandic LLM - Belebele-IS: 90.78 (#36)
Claude Fable 5 on Icelandic LLM - Inflection: 97.75 (#2)
Claude Fable 5 on Icelandic LLM - WinoGrande-IS: 96.05 (#2)
Claude Fable 5 on Icelandic LLM Leaderboard - Average: 87.4 (#4)
GPT-5.5 on Blueprint-Bench 2: 0.362 (#2)
Qwen 3.7 Max on Wolfram LLM Benchmarking Project: 67.5 (#14)

New #1 Leaders (6)

Design Arena (Image): riverflow-2.5-pro (1416.0) beat gpt-image-2 by 23.0
LLM Stats (MCP-Mark): Kimi K2.7 Code (81.1) beat Qwen 3.7 Max by 20.3
Icelandic LLM - WikiQA-IS: Claude Fable 5 (75.39) beat Gemini 3.1 Pro (Preview) by 7.65
Icelandic LLM - GED: Claude Fable 5 (91.5) beat Claude Opus 4.7 by 7.0
BIRD-SQL: Gemini-SQL2 (80.04) beat Gemini-SQL (Multitask SFT + Gemini-2.5-Pro) by 2.9
Design Arena (Graphic Design): riverflow-2.5-pro (1474.0) beat gpt-image-2 by 1.0

AI Benchmark Digest — 2026-06-12

2026-06-12T08:17:57.895837+00:00

Daily

New Benchmarks (2)

MathArena - ARXIV_FALSE May (Accuracy (%)): leader GPT-5.5 (xhigh) (50.0), 8 models
MathArena - ARXIV May (Accuracy (%)): leader Claude-Fable-5 (max) (86.67), 8 models

Top-10 New Scores (9)

Claude Fable 5 on Lynchmark: 100.0 (#1)
Claude Fable 5 on MineBench: 1929.84 (#2)
Claude Opus 4.8 on Chess Puzzles (Epoch AI): 34.0 (#12)
Claude Opus 4.8 on Design Arena (Game Dev): 1250.0 (#37)
Claude Opus 4.8 on GRAB-Lite: 60.6 (#6)
Claude Opus 4.8 on OTIS Mock AIME 2024-25: 98.33 (#3)
Claude Opus 4.8 on SimpleQA Verified: 39.5 (#24)
GPT-5.5 on GRAB-Lite: 71.8 (#2)
Qwen 3.7 Max on Position Bias (Lechmazur): 34.8 (#10)

New #1 Leaders (9)

Chatbot Arena (Text-to-Video): gemini-omni-flash (1527.0) beat dreamina-seedance-2.0-720p by 64.0
Design Arena (UI Components): Claude Fable 5 (1411.0) beat Claude Opus 4.7 (Thinking) by 56.0
Design Arena (Game Dev): Claude Fable 5 (1393.0) beat GPT-5.5 by 39.0
Design Arena (SVG): Claude Fable 5 (1384.0) beat prism by 18.0
SEAL - SWE Atlas - Test Writing: Fable-5 (Claude Code) xHigh (58.52) beat Opus 4.8 (Claude Code) by 12.96
MathArena - ARXIV April: Claude 5 (70.73) beat GPT-5.5 (xHigh) by 3.66
GRAB-Lite: Claude Fable 5 (74.0) beat GPT-5.4 by 3.0
WeirdML: Claude Fable 5 (87.85) beat GPT-5.5 (xHigh) by 2.94
Chatbot Arena (Image-to-Video): gemini-omni-flash (1475.0) beat Grok 1.5 by 2.0

AI Benchmark Digest — 2026-06-11

2026-06-11T08:17:01.068404+00:00

Daily

New Benchmarks (1)

GDPval-AA (Elo): leader Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (1932.0), 390 models

Top-10 New Scores (3)

Claude Fable 5 on Chatbot Arena (Document): 1495.0 (#5)
Claude Fable 5 on Chatbot Arena (Vision): 1307.0 (#2)
Claude Fable 5 on React Native Evals: 86.96 (#4)

New #1 Leaders (12)

PACT (Lechmazur): Claude Fable 5 (High) (2171.0) beat GPT-5.5 (High) by 155.0
Chatbot Arena (Code): Claude Fable 5 (1665.0) beat Claude Opus 4.7 (Thinking) by 98.0
Design Arena (Data Viz): Claude Fable 5 (1406.0) beat Claude Opus 4.7 (Thinking) by 68.0
Design Arena (Website): Claude Fable 5 (1364.0) beat Claude Opus 4.6 by 23.0
Design Arena (3D): Claude Fable 5 (1383.0) beat Kimi K2.6 by 17.0
FrontierSWE: Claude Fable 5 (90.0) beat Claude Opus 4.8 by 7.0
Chatbot Arena (Text): Claude Fable 5 (1510.0) beat Claude Opus 4.6 (Thinking) by 6.0
SimpleBench: Claude Fable 5 (81.9) beat Gemini 3.1 Pro (Preview) by 2.3
UGI - Writing: Claude Fable 5 (74.23) beat Gemini 3.5 Flash (Thinking, Medium) by 1.69
EQ-Bench Longform Writing: Claude Fable 5 (83.0) beat Claude Opus 4.7 by 1.2
LLM Stats (Video-MME): MiMo-V2.5 (87.7) beat Kimi K2.5 by 0.3
LLM Stats (CMMLU): MiMo-V2.5-Pro (90.2) beat Qwen 2 72B Instruct by 0.1

AI Benchmark Digest — 2026-06-10

2026-06-10T09:55:36.786616+00:00

Daily

New Benchmarks (1)

SkateBench (Success Rate (%)): leader gemini-3.1-pro-preview (96.92), 28 models
Skateboarding-domain knowledge benchmark ranking models by how well they identify technical skateboard tricks from 390 trick definitions. SkateBench v2 reports success rate, cost, and speed.

New Models (1)

Claude Fable 5 — ELO 1871, #31
- Blueprint-Bench 2: 0.386 (#1/14)
- Opper TaskBench: 96.4 (#1/85)
- LLM Stats (OSWorld-Verified): 85.0 (#1/16)
- YC-Bench: 1977.6 (#1/21)
- Vals AI (Vals Index): 75.14 (#1/25)
- Vals AI Multimodal Index: 74.15 (#1/20)
- Vals AI LegalBench: 88.56 (#1/114)
- Vals AI CorpFin v2: 71.83 (#1/111)
- Vals AI MedScribe: 88.52 (#1/62)
- Vals AI ProofBench: 77.0 (#1/37)

New #1 Leaders (55)

YC-Bench: Claude Fable 5 (1977.6) beat Claude Opus 4.7 by 263.1
Multi-turn Debate (Lechmazur): Claude Fable 5 (High) (1770.9) beat Claude Opus 4.7 (High) by 53.8
AA GDPval: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (1932.47) beat Claude Opus 4.8 (Adaptive Reasoning, Max Effort) by 42.67
Evals for Every Language - Language ay: step-3.7-flash-20260528 (77.14) beat Gemini 3.1 Pro (Preview) by 14.23
LiveBench Python: Claude Fable 5 (xHigh) (95.0) beat Claude Opus 4.5 (Thinking 64K, High) (2025-11-01) by 10.0
CursorBench 3.1: Fable 5 Max (72.9) beat Claude Opus 4.7 by 8.1
AA Omniscience - Software Engineering (SWE) - Dart: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (88.0) beat GPT-5.3 Codex (xHigh) by 8.0
AA Omniscience - Software Engineering (SWE) - R: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (82.0) beat GPT-5.5 (Medium) by 8.0
AA Omniscience - Software Engineering (SWE) - Swift: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (100.0) beat GPT-5.5 (xHigh) by 8.0
Vals AI Vibe Code Bench: Claude Fable 5 (90.35) beat Claude Opus 4.8 by 7.63
AA Humanity's Last Exam: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (53.34) beat Claude Opus 4.8 (Adaptive Reasoning, Max Effort) by 7.6
AA Omniscience: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (40.15) beat Gemini 3.1 Pro (Preview) by 7.22
Vellum - HumanEval: Claude Mythos 5 (95.5) beat Claude Opus 4.8 by 6.9
Vellum - Humanity's Last Exam: Claude Mythos 5 (64.5) beat Claude Opus 4.8 by 6.6
Evals for Every Language - Language crh: step-3.7-flash-20260528 (73.05) beat Gemini 3.1 Pro (Preview) by 6.27
AA Omniscience - Software Engineering (SWE) - Java: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (79.0) beat GPT-5.3 Codex (xHigh) by 6.0
Vals AI ProofBench: Claude Fable 5 (77.0) beat aristotle by 6.0
AA Omniscience - Business: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (55.0) beat GPT-5.5 (xHigh) by 5.9
AA Omniscience - Science, Engineering & Mathematics: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (57.1) beat GPT-5.5 (High) by 4.8
Vals AI (Vals Index): Claude Fable 5 (75.14) beat Claude Opus 4.8 by 4.78
Vals AI IOI: Claude Fable 5 (72.25) beat GPT-5.4 (2026-03-05) by 4.42
AA Omniscience - Humanities & Social Sciences: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (60.9) beat Gemini 3 Pro (Preview) (High) by 4.3
AA Omniscience - Software Engineering (SWE) - Go: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (88.0) beat GPT-5.5 (High) by 4.0
Artificial Analysis Intelligence Index: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (64.88) beat Claude Opus 4.8 (Adaptive Reasoning, Max Effort) by 3.44
Evals for Every Language - Language cv: gemma-4-31B-it-20260402 (69.3) beat Claude Opus 4.5 by 3.39
Vals AI CorpFin v2: Claude Fable 5 (71.83) beat Grok 4.3 by 3.3
Vals AI Multimodal Index: Claude Fable 5 (74.15) beat Claude Opus 4.8 by 3.26
AA Omniscience - Software Engineering (SWE): Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (87.6) beat GPT-5.5 (xHigh) by 3.2
Evals for Every Language - MGSM: Claude Opus 4.8 (96.62) beat Claude Opus 4.6 by 2.36
Evals for Every Language - Language ban: step-3.7-flash-20260528 (69.03) beat Claude Opus 4.5 by 2.32
AA Terminal-Bench Hard: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (62.88) beat GPT-5.5 (xHigh) by 2.27
LiveBench Plot Unscrambling: Claude Fable 5 (xHigh) (78.09) beat GPT-5.5 (High) by 1.81
LLM Stats (OSWorld-Verified): Claude Fable 5 (85.0) beat Claude Opus 4.8 by 1.6
AA Omniscience - Software Engineering (SWE) - Python: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (92.0) beat GPT-5.5 (xHigh) by 1.5
Evals for Every Language - Language chm: Claude Opus 4.7 (63.6) beat Gemini 3.1 Pro (Preview) by 1.48
Evals for Every Language - Language doi: Claude Opus 4.7 (71.84) beat Gemini 3 Pro (Preview) by 1.46
AA CritPt: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (28.57) beat GPT-5.5 (xHigh) by 1.43
Evals for Every Language - Language es: Gemini 3.1 Flash Lite (76.16) beat Claude Opus 4.6 by 1.42
AA SciCode: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (60.19) beat Gemini 3.1 Pro (Preview) by 1.28
Evals for Every Language - Language ace: step-3.7-flash-20260528 (72.48) beat Gemini 3.1 Pro (Preview) by 1.28
Evals for Every Language - MMLU: intellect-3-20251126 (100.0) beat Claude Sonnet 4.6 by 1.27
Evals for Every Language - ARC: intellect-3-20251126 (100.0) beat Gemini 3.1 Pro (Preview) by 1.26
Vals AI LegalBench: Claude Fable 5 (88.56) beat Gemini 3.1 Pro (Preview) by 1.16
Evals for Every Language - Language ca: Gemini 3.1 Flash Lite (76.29) beat Gemini 3 Pro (Preview) by 1.03
Opper TaskBench: Claude Fable 5 (96.4) beat Claude Opus 4.7 by 1.0
Evals for Every Language - Language ar: Claude Opus 4.8 (71.58) beat Claude Opus 4.5 by 0.95
Evals for Every Language - Language en: Gemini 3.1 Flash Lite (87.28) beat MiniMax-M2.5 by 0.77
Evals for Every Language - Language cy: Gemini 3.1 Flash Lite (82.03) beat Claude Sonnet 4.5 by 0.65
Evals for Every Language - Language am: Gemini 3.1 Flash Lite (68.6) beat Claude Opus 4.6 by 0.59
Vals AI MedScribe: Claude Fable 5 (88.52) beat GPT-5.1 by 0.43
Evals for Every Language - Language be: Claude Opus 4.8 (69.43) beat Gemini 3.1 Pro (Preview) by 0.32
Evals for Every Language - Language ceb: Gemini 3.1 Flash Lite (78.06) beat Gemini 3.1 Pro (Preview) by 0.29
Evals for Every Language - Language el: Gemini 3.1 Flash Lite (73.81) beat Claude Opus 4.5 by 0.15
Blueprint-Bench 2: Claude Fable 5 (0.386) beat GPT-5.5 by 0.02
LiveBench Olympiad: Claude Fable 5 (High) (92.18) beat Claude Opus 4.6 (Thinking, High) by 0.01

AI Benchmark Digest — 2026-06-10

2026-06-10T08:06:50.673963+00:00

Daily

New Benchmarks (1)

SkateBench (Success Rate (%)): leader gemini-3.1-pro-preview (96.92), 28 models
Skateboarding-domain knowledge benchmark ranking models by how well they identify technical skateboard tricks from 390 trick definitions. SkateBench v2 reports success rate, cost, and speed.

New Models (2)

Claude 5 — ELO 1904, #22
- LiveBench Olympiad: 92.18 (#1/124)
- LiveBench Plot Unscrambling: 78.09 (#1/124)
- LiveBench Python: 95.0 (#1/124)
- Opper TaskBench: 96.4 (#1/85)
- Vals AI (Vals Index): 75.14 (#1/25)
- Vals AI Multimodal Index: 74.15 (#1/20)
- Vals AI LegalBench: 88.56 (#1/114)
- Vals AI CorpFin v2: 71.83 (#1/111)
- Vals AI MedScribe: 88.52 (#1/62)
- Vals AI ProofBench: 77.0 (#1/37)
Claude Fable 5 — ELO 1901, #23
- Blueprint-Bench 2: 0.386 (#1/14)
- LLM Stats (OSWorld-Verified): 85.0 (#1/16)
- YC-Bench: 1977.6 (#1/21)
- SEAL - MCP Atlas: 83.3 (#2/23)
- Vellum - HumanEval: 95.0 (#2/38)
- Vellum - GPQA: 94.1 (#3/57)
- ClockBench: 35.0 (#4/27)
- LLM Stats (GDPval-AA): 64.4 (#11/12)

New #1 Leaders (55)

YC-Bench: Claude Fable 5 (1977.6) beat Claude Opus 4.7 by 263.1
Multi-turn Debate (Lechmazur): Claude Fable 5 (High) (1770.9) beat Claude Opus 4.7 (High) by 53.8
AA GDPval: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (1932.47) beat Claude Opus 4.8 (Adaptive Reasoning, Max Effort) by 42.67
Evals for Every Language - Language ay: step-3.7-flash-20260528 (77.14) beat Gemini 3.1 Pro (Preview) by 14.23
LiveBench Python: Claude 5 (95.0) beat Claude Opus 4.5 (Thinking 64K, High) (2025-11-01) by 10.0
CursorBench 3.1: Fable 5 Max (72.9) beat Claude Opus 4.7 by 8.1
AA Omniscience - Software Engineering (SWE) - Dart: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (88.0) beat GPT-5.3 Codex (xHigh) by 8.0
AA Omniscience - Software Engineering (SWE) - R: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (82.0) beat GPT-5.5 (Medium) by 8.0
AA Omniscience - Software Engineering (SWE) - Swift: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (100.0) beat GPT-5.5 (xHigh) by 8.0
Vals AI Vibe Code Bench: Claude 5 (90.35) beat Claude Opus 4.8 by 7.63
AA Humanity's Last Exam: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (53.34) beat Claude Opus 4.8 (Adaptive Reasoning, Max Effort) by 7.6
AA Omniscience: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (40.15) beat Gemini 3.1 Pro (Preview) by 7.22
Vellum - HumanEval: Claude Mythos 5 (95.5) beat Claude Opus 4.8 by 6.9
Vellum - Humanity's Last Exam: Claude Mythos 5 (64.5) beat Claude Opus 4.8 by 6.6
Evals for Every Language - Language crh: step-3.7-flash-20260528 (73.05) beat Gemini 3.1 Pro (Preview) by 6.27
AA Omniscience - Software Engineering (SWE) - Java: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (79.0) beat GPT-5.3 Codex (xHigh) by 6.0
Vals AI ProofBench: Claude 5 (77.0) beat aristotle by 6.0
AA Omniscience - Business: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (55.0) beat GPT-5.5 (xHigh) by 5.9
AA Omniscience - Science, Engineering & Mathematics: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (57.1) beat GPT-5.5 (High) by 4.8
Vals AI (Vals Index): Claude 5 (75.14) beat Claude Opus 4.8 by 4.78
Vals AI IOI: Claude 5 (72.25) beat GPT-5.4 (2026-03-05) by 4.42
AA Omniscience - Humanities & Social Sciences: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (60.9) beat Gemini 3 Pro (Preview) (High) by 4.3
AA Omniscience - Software Engineering (SWE) - Go: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (88.0) beat GPT-5.5 (High) by 4.0
Artificial Analysis Intelligence Index: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (64.88) beat Claude Opus 4.8 (Adaptive Reasoning, Max Effort) by 3.44
Evals for Every Language - Language cv: gemma-4-31B-it-20260402 (69.3) beat Claude Opus 4.5 by 3.39
Vals AI CorpFin v2: Claude 5 (71.83) beat Grok 4.3 by 3.3
Vals AI Multimodal Index: Claude 5 (74.15) beat Claude Opus 4.8 by 3.26
AA Omniscience - Software Engineering (SWE): Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (87.6) beat GPT-5.5 (xHigh) by 3.2
Evals for Every Language - MGSM: Claude Opus 4.8 (96.62) beat Claude Opus 4.6 by 2.36
Evals for Every Language - Language ban: step-3.7-flash-20260528 (69.03) beat Claude Opus 4.5 by 2.32
AA Terminal-Bench Hard: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (62.88) beat GPT-5.5 (xHigh) by 2.27
LiveBench Plot Unscrambling: Claude 5 (78.09) beat GPT-5.5 (High) by 1.81
LLM Stats (OSWorld-Verified): Claude Fable 5 (85.0) beat Claude Opus 4.8 by 1.6
AA Omniscience - Software Engineering (SWE) - Python: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (92.0) beat GPT-5.5 (xHigh) by 1.5
Evals for Every Language - Language chm: Claude Opus 4.7 (63.6) beat Gemini 3.1 Pro (Preview) by 1.48
Evals for Every Language - Language doi: Claude Opus 4.7 (71.84) beat Gemini 3 Pro (Preview) by 1.46
AA CritPt: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (28.57) beat GPT-5.5 (xHigh) by 1.43
Evals for Every Language - Language es: Gemini 3.1 Flash Lite (76.16) beat Claude Opus 4.6 by 1.42
AA SciCode: Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback) (60.19) beat Gemini 3.1 Pro (Preview) by 1.28
Evals for Every Language - Language ace: step-3.7-flash-20260528 (72.48) beat Gemini 3.1 Pro (Preview) by 1.28
Evals for Every Language - MMLU: intellect-3-20251126 (100.0) beat Claude Sonnet 4.6 by 1.27
Evals for Every Language - ARC: intellect-3-20251126 (100.0) beat Gemini 3.1 Pro (Preview) by 1.26
Vals AI LegalBench: Claude 5 (88.56) beat Gemini 3.1 Pro (Preview) by 1.16
Evals for Every Language - Language ca: Gemini 3.1 Flash Lite (76.29) beat Gemini 3 Pro (Preview) by 1.03
Opper TaskBench: Claude 5 (96.4) beat Claude Opus 4.7 by 1.0
Evals for Every Language - Language ar: Claude Opus 4.8 (71.58) beat Claude Opus 4.5 by 0.95
Evals for Every Language - Language en: Gemini 3.1 Flash Lite (87.28) beat MiniMax-M2.5 by 0.77
Evals for Every Language - Language cy: Gemini 3.1 Flash Lite (82.03) beat Claude Sonnet 4.5 by 0.65
Evals for Every Language - Language am: Gemini 3.1 Flash Lite (68.6) beat Claude Opus 4.6 by 0.59
Vals AI MedScribe: Claude 5 (88.52) beat GPT-5.1 by 0.43
Evals for Every Language - Language be: Claude Opus 4.8 (69.43) beat Gemini 3.1 Pro (Preview) by 0.32
Evals for Every Language - Language ceb: Gemini 3.1 Flash Lite (78.06) beat Gemini 3.1 Pro (Preview) by 0.29
Evals for Every Language - Language el: Gemini 3.1 Flash Lite (73.81) beat Claude Opus 4.5 by 0.15
Blueprint-Bench 2: Claude Fable 5 (0.386) beat GPT-5.5 by 0.02
LiveBench Olympiad: Claude 5 (92.18) beat Claude Opus 4.6 (Thinking, High) by 0.01

AI Benchmark Digest — 2026-06-09

2026-06-09T07:53:25.528997+00:00

Daily

Top-10 New Scores (2)

GPT-5.5 (xHigh) on SEAL - SWE Atlas - Codebase QnA: 45.43 (#2)
GPT-5.5 (xHigh) on SEAL - SWE Atlas - Test Writing: 42.59 (#3)

New #1 Leaders (7)

GSMA Open-Telco - TeleTables: TelecomGPT (88.0) beat OTel-LLM-8.3B-QnA by 26.2
GSMA Open-Telco LLM Leaderboard: TelecomGPT (89.64) beat OTel-LLM-8.3B-QnA by 3.66
SEAL - SWE Atlas - Codebase QnA: Opus 4.8 (Claude Code) (48.79) beat GPT-5.5 by 3.36
GSMA Open-Telco - 3GPP: TelecomGPT (84.22) beat OTel-LLM-8.3B-QnA by 2.82
GSMA Open-Telco - TeleLogs: TelecomGPT (98.96) beat OTel-LLM-8.3B-QnA by 2.66
GSMA Open-Telco - srsRAN-Bench: TelecomGPT (91.33) beat OTel-LLM-8.3B-QnA by 1.65
SEAL - SWE Atlas - Test Writing: Opus 4.8 (Claude Code) (45.56) beat GPT-5.4 (xHigh) by 1.2
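
Each "New #1 Leaders" entry follows the pattern `Benchmark: Leader (score) beat Runner-up by margin`, so the displaced leader's score can be recovered as score minus margin. The Python sketch below illustrates that arithmetic. The regex, the `LeaderEntry` type, and the `parse_leader` helper are illustrative assumptions rather than part of the digest's own tooling, and the recovered score is only as precise as the rounded margin printed in the entry.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical pattern for lines shaped like
# "Benchmark: Leader (score) beat Runner-up by margin".
# The leader group is greedy so that labels containing parentheses,
# e.g. "Opus 4.8 (Claude Code)", stay attached to the model name.
LEADER_RE = re.compile(
    r"^(?P<benchmark>.+?): (?P<leader>.+) \((?P<score>-?\d+(?:\.\d+)?)\)"
    r" beat (?P<runner_up>.+) by (?P<margin>-?\d+(?:\.\d+)?)$"
)


@dataclass(frozen=True)
class LeaderEntry:
    benchmark: str
    leader: str
    score: float
    runner_up: str
    margin: float

    @property
    def runner_up_score(self) -> float:
        # Implied score of the displaced leader, rounded to two decimals
        # to hide floating-point noise from the subtraction.
        return round(self.score - self.margin, 2)


def parse_leader(line: str) -> Optional[LeaderEntry]:
    """Parse one leader line; return None if it does not fit the pattern."""
    match = LEADER_RE.match(line.strip())
    if match is None:
        return None
    return LeaderEntry(
        benchmark=match["benchmark"],
        leader=match["leader"],
        score=float(match["score"]),
        runner_up=match["runner_up"],
        margin=float(match["margin"]),
    )


if __name__ == "__main__":
    entry = parse_leader(
        "SEAL - SWE Atlas - Codebase QnA: Opus 4.8 (Claude Code) (48.79) "
        "beat GPT-5.5 by 3.36"
    )
    if entry is not None:
        print(entry.runner_up, entry.runner_up_score)
```

For the SEAL - SWE Atlas - Codebase QnA entry above, the implied runner-up score is 45.43, which agrees with the GPT-5.5 (xHigh) score listed under Top-10 New Scores in the same digest.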

AI Benchmark Digest — 2026-06-07

2026-06-07T08:34:58.487719+00:00

Weekly

New Models (2)

MiniMax-M3 — ELO 1762, #83
- LLM Stats (OmniDocBench 1.5): 91.6 (#1/13)
- LLM Stats (Video-MME): 85.4 (#2/13)
- OpenClawProBench: 75.1 (#2/65)
- Vals AI MedScribe: 87.25 (#2/61)
- AA IFBench: 82.86 (#3/429)
- LLM Stats (Claw-Eval): 74.5 (#3/9)
- LLM Stats (NL2Repo): 42.13 (#3/7)
- AA GPQA Diamond: 92.93 (#4/501)
- Vals AI CorpFin v2: 68.1 (#4/110)
- Design Arena (3D): 1348.0 (#5/115)
nemotron-3-ultra-550B-a55B — ELO 1587, #292
- PinchBench: 90.58 (#10/49)
- Vals AI CorpFin v2: 65.46 (#16/110)
- Vals AI (Vals Index): 43.99 (#18/24)
- LiveBench Python: 75.0 (#24/122)
- LiveBench Paraphrase: 61.15 (#33/122)
- Vals AI TaxEval v2: 73.1 (#34/116)
- Bullshit Benchmark: 41.8 (#34/148)
- Vals AI MedCode: 38.62 (#35/62)
- AI Chess Leaderboard (Reasoning): 975.0 (#39/277)
- LiveBench Code Generation: 77.47 (#43/122)
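
The per-model bullets above use the form `- Benchmark: score (#rank/total)`. Because the leaderboards differ widely in size, a raw rank is hard to compare across them. The sketch below converts each placement into a fraction of the field; the regex and the helper names `parse_placement` and `rank_fraction` are assumptions made for illustration only.

```python
import re
from typing import Optional, Tuple

# Hypothetical pattern for bullets shaped like
# "- Benchmark: score (#rank/total)".
PLACEMENT_RE = re.compile(
    r"^- (?P<benchmark>.+): (?P<score>-?\d+(?:\.\d+)?)"
    r" \(#(?P<rank>\d+)/(?P<total>\d+)\)$"
)


def parse_placement(line: str) -> Optional[Tuple[str, float, int, int]]:
    """Return (benchmark, score, rank, total), or None if the line does not fit."""
    match = PLACEMENT_RE.match(line.strip())
    if match is None:
        return None
    return (
        match["benchmark"],
        float(match["score"]),
        int(match["rank"]),
        int(match["total"]),
    )


def rank_fraction(rank: int, total: int) -> float:
    """Fraction of the field at or above this rank (smaller is better)."""
    if total <= 0 or not 1 <= rank <= total:
        raise ValueError("rank must satisfy 1 <= rank <= total")
    return rank / total


if __name__ == "__main__":
    for bullet in (
        "- PinchBench: 90.58 (#10/49)",
        "- Bullshit Benchmark: 41.8 (#34/148)",
    ):
        parsed = parse_placement(bullet)
        if parsed is not None:
            name, score, rank, total = parsed
            print(f"{name}: top {rank_fraction(rank, total):.0%} of {total} models")
```

Read this way, #10/49 on PinchBench and #34/148 on Bullshit Benchmark are closer than the raw ranks suggest: both sit roughly in the top fifth to quarter of their fields.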

Top-10 New Scores (4)

GPT-5.5 (xHigh) on IMO-Bench: 71.9 (#4)
GPT-5.5 Pro on IUMB: 100.0 (#2)
GPT-5.5 Pro (xHigh) on IMO-Bench: 88.1 (#2)
Gemini 3 Deep Think on IUMB: 87.5 (#6)

New #1 Leaders (10)

EQ-Bench Creative Writing v3: Claude Opus 4.7 (2050.8) beat GPT-5.4 by 144.8
Chatbot Arena (Image-to-Video): Grok 1.5 (1473.0) beat dreamina-seedance-2.0-720p by 11.0
LLM Stats (Multi-Challenge): Nova 2 Pro (77.7) beat GPT-5 by 8.1
MathArena - Kangaroo 2025 Levels 11-12: Claude Opus 4.8 (Thinking) (100.0) beat GPT-5.4 (xHigh) by 1.67
MathArena - APEX 2025: Claude Opus 4.8 (Thinking) (81.25) beat GPT-5.5 (xHigh) by 1.04
MathArena - Kangaroo 2025 Levels 7-8: Claude Opus 4.8 (Thinking) (96.67) beat GPT-5.4 (xHigh) by 0.84
MathArena - AIME 2026: Claude Opus 4.8 (Thinking) (100.0) beat GPT-5.4 (xHigh) by 0.83
LLM Stats (OmniDocBench 1.5): MiniMax-M3 (91.6) beat Qwen 3.6 Plus by 0.4
GAIA: CustomGPT.ai Research Lab v44 (93.36) beat Co-Sight Pro v1.0.1 by 0.34
ForecastBench: Grok 4.20 (Beta, D) (68.1) beat green-tree by 0.2

AI Benchmark Digest — 2026-06-06

2026-06-06T07:45:06.870709+00:00

Daily

New Benchmarks (20)

Pencil Puzzle Bench - Yajilin (Direct-ask Success Rate (%)): leader gpt-5.2 (High) (20.0), 51 models
PPBench direct-ask success rate on Yajilin loop-and-shading puzzles from the golden_300 split, testing exact constraint solving from puzz.link grids.
Pencil Puzzle Bench - Slitherlink (Direct-ask Success Rate (%)): leader gpt-5.2 (High) (33.3), 51 models
PPBench direct-ask success rate on Slitherlink loop puzzles, where numbered cells constrain how a single continuous loop surrounds the grid.
Pencil Puzzle Bench - Heyawake (Direct-ask Success Rate (%)): leader claude-opus-4-5-high (0.0), 51 models
PPBench direct-ask success rate on Heyawake room-shading puzzles, testing region constraints, connectivity, and line-of-sight reasoning.
Pencil Puzzle Bench - Mashu (Direct-ask Success Rate (%)): leader gpt-5.2 (xHigh) (60.0), 51 models
PPBench direct-ask success rate on Mashu loop puzzles, where black and white pearls impose turn and straight-line constraints.
Pencil Puzzle Bench - Shakashaka (Direct-ask Success Rate (%)): leader claude-sonnet-4-5 (0.0), 51 models
PPBench direct-ask success rate on Shakashaka triangle-shading puzzles, testing local clue satisfaction and global rectangle formation.
Pencil Puzzle Bench - Nurikabe (Direct-ask Success Rate (%)): leader claude-opus-4-6 (Thinking) (33.3), 51 models
PPBench direct-ask success rate on Nurikabe island puzzles, where numbered islands must be separated by one connected wall region.
Pencil Puzzle Bench - LITS (Direct-ask Success Rate (%)): leader gpt-5.2 (xHigh) (53.3), 51 models
PPBench direct-ask success rate on LITS tetromino-shading puzzles, testing region-wise shape placement and adjacency constraints.
Pencil Puzzle Bench - Light Up (Direct-ask Success Rate (%)): leader claude-opus-4-6 (Thinking) (66.7), 51 models
PPBench direct-ask success rate on Light Up puzzles, where lamps must illuminate every open cell while satisfying numbered black-cell clues.
Pencil Puzzle Bench - Nurimisaki (Direct-ask Success Rate (%)): leader claude-opus-4-6 (Thinking) (33.3), 51 models
PPBench direct-ask success rate on Nurimisaki puzzles, a Nurikabe-family grid task requiring connected-region reasoning around clue cells.
Pencil Puzzle Bench - Shikaku (Direct-ask Success Rate (%)): leader gpt-5.2 (xHigh) (80.0), 51 models
PPBench direct-ask success rate on Shikaku rectangle-partitioning puzzles, where each numbered clue defines one rectangle of matching area.
Pencil Puzzle Bench - Norinori (Direct-ask Success Rate (%)): leader gpt-5.2 (xHigh) (93.3), 51 models
PPBench direct-ask success rate on Norinori shading puzzles, testing room constraints and two-cell adjacency patterns.
Pencil Puzzle Bench - Double Choco (Direct-ask Success Rate (%)): leader gemini-3.1-pro (6.7), 51 models
PPBench direct-ask success rate on Double Choco region-division puzzles, testing balanced partitioning under color and shape constraints.
Pencil Puzzle Bench - Firefly (Direct-ask Success Rate (%)): leader gpt-5.2 (xHigh) (33.3), 51 models
PPBench direct-ask success rate on Firefly line-drawing puzzles, testing path construction from directional clues and grid constraints.
Pencil Puzzle Bench - Sashigane (Direct-ask Success Rate (%)): leader mistral-large-2512 (0.0), 51 models
PPBench direct-ask success rate on Sashigane shape-partitioning puzzles, testing right-angle region construction from numbered and directional clues.
Pencil Puzzle Bench - Sudoku (Direct-ask Success Rate (%)): leader gpt-5.2 (xHigh) (20.0), 51 models
PPBench direct-ask success rate on Sudoku puzzles, testing classic row, column, and box constraint satisfaction through exact move outputs.
Pencil Puzzle Bench - Nurimaze (Direct-ask Success Rate (%)): leader claude-opus-4-6 (Thinking) (26.7), 51 models
PPBench direct-ask success rate on Nurimaze puzzles, testing maze-style path and shading constraints in a connected grid.
Pencil Puzzle Bench - Tapa (Direct-ask Success Rate (%)): leader gpt-5.2 (xHigh) (60.0), 51 models
PPBench direct-ask success rate on Tapa shading puzzles, where clue numbers describe blocks of shaded neighboring cells.
Pencil Puzzle Bench - Kurodoko (Direct-ask Success Rate (%)): leader gpt-5.2 (xHigh) (6.7), 51 models
PPBench direct-ask success rate on Kurodoko visibility puzzles, testing shading, sight-line counts, and connected unshaded cells.
Pencil Puzzle Bench - Country (Direct-ask Success Rate (%)): leader gemini-3.1-pro (6.7), 51 models
PPBench direct-ask success rate on Country region puzzles, testing loop and region constraints over a partitioned grid.
Pencil Puzzle Bench - Hitori (Direct-ask Success Rate (%)): leader claude-opus-4-6 (Thinking) (66.7), 51 models
PPBench direct-ask success rate on Hitori number-grid puzzles, where repeated numbers are shaded while preserving connectivity and non-adjacency constraints.

New #1 Leaders (24)

LLM Stats (Multi-Challenge): Nova 2 Pro (77.7) beat GPT-5 by 8.1
Ukrainian LLM - Global MMLU Full UK World Religions: MamayLM-Gemma-3-27B-IT-v2.0 (87.13) beat gemma-3-12B-pt by 7.6
Ukrainian LLM - Global MMLU Full UK High School US History: MamayLM-Gemma-3-27B-IT-v2.0 (91.67) beat MamayLM-Gemma-3-12B-IT-v1.0 by 5.4
Ukrainian LLM - Global MMLU Full UK Anatomy: MamayLM-Gemma-3-27B-IT-v2.0 (65.19) beat lapa-12B-pt by 5.19
Ukrainian LLM - Global MMLU Full UK Clinical Knowledge: MamayLM-Gemma-3-27B-IT-v2.0 (77.74) beat gemma-3-12B-pt by 4.53
Ukrainian LLM - Global MMLU Full UK Professional Law: MamayLM-Gemma-3-27B-IT-v2.0 (51.5) beat gemma-3-12B-pt by 4.43
Ukrainian LLM - Global MMLU Full UK Humanities: MamayLM-Gemma-3-27B-IT-v2.0 (61.68) beat Qwen3-8B-Base by 4.12
Ukrainian LLM - Global MMLU Full UK Computer Security: MamayLM-Gemma-3-12B-IT-v2.0 (82.0) beat MamayLM-Gemma-3-12B-IT-v1.0-FP8-Static-Nadiia by 4.0
Ukrainian LLM - Global MMLU Full UK Global Facts: MamayLM-Gemma-3-27B-IT-v2.0 (52.0) beat Gemma 3 12B (IT) by 4.0
Ukrainian LLM - Global MMLU Full UK Miscellaneous: MamayLM-Gemma-3-27B-IT-v2.0 (83.52) beat MamayLM-Gemma-3-12B-IT-v1.0-FP8-Static-Nadiia by 3.95
Ukrainian LLM - Global MMLU Full UK Prehistory: MamayLM-Gemma-3-27B-IT-v2.0 (77.78) beat gemma-3-12B-pt by 3.71
Ukrainian LLM - Global MMLU Full UK Other: MamayLM-Gemma-3-27B-IT-v2.0 (74.57) beat gemma-3-12B-pt by 3.41
Ukrainian LLM - Global MMLU Full UK Business Ethics: MamayLM-Gemma-3-12B-IT-v2.0 (77.0) beat MamayLM-Gemma-3-12B-IT-v1.0 by 3.0
Ukrainian LLM - Global MMLU Full UK High School World History: MamayLM-Gemma-3-27B-IT-v2.0 (86.08) beat gemma-3-12B-pt by 1.69
Ukrainian LLM - Global MMLU Full UK High School Microeconomics: MamayLM-Gemma-3-27B-IT-v2.0 (84.45) beat Qwen3-8B-Base by 1.68
Ukrainian LLM - Global MMLU Full UK Marketing: MamayLM-Gemma-3-27B-IT-v2.0 (88.89) beat MamayLM-Gemma-3-12B-IT-v1.0 by 1.28
Ukrainian LLM - Global MMLU Full UK Professional Psychology: MamayLM-Gemma-3-27B-IT-v2.0 (70.1) beat gemma-3-12B-pt by 0.98
Ukrainian LLM - Global MMLU Full UK Public Relations: MamayLM-Gemma-3-12B-IT-v2.0 (68.18) beat lapa-12B-pt by 0.91
Ukrainian LLM - Global MMLU Full UK High School European History: MamayLM-Gemma-3-27B-IT-v2.0 (84.24) beat MamayLM-Gemma-3-12B-IT-v1.0-FP8-Static-Nadiia by 0.6
Ukrainian LLM - Global MMLU Full UK High School Macroeconomics: MamayLM-Gemma-3-27B-IT-v2.0 (76.67) beat gemma-3-12B-pt by 0.52
Ukrainian LLM - Global MMLU Full UK Sociology: MamayLM-Gemma-3-27B-IT-v2.0 (83.08) beat lapa-v0.1.2-instruct by 0.49
LLM Stats (OmniDocBench 1.5): MiniMax-M3 (91.6) beat Qwen 3.6 Plus by 0.4
Ukrainian LLM - Global MMLU Full UK Professional Medicine: MamayLM-Gemma-3-27B-IT-v2.0 (80.15) beat gemma-3-12B-pt by 0.37
ForecastBench: Grok 4.20 (Beta, D) (68.1) beat green-tree by 0.3

AI Benchmark Digest — 2026-06-04

2026-06-04T08:22:19.073162+00:00

Daily

New #1 Leaders (1)

GAIA: CustomGPT.ai Research Lab v44 (93.36) beat Co-Sight Pro v1.0.1 by 0.34

AI Benchmark Digest — 2026-06-03

2026-06-03T08:25:40.519214+00:00

Daily

Top-10 New Scores (2)

GPT-5.5 Pro on IUMB: 100.0 (#2)
Gemini 3 Deep Think on IUMB: 87.5 (#6)

New #1 Leaders (4)

MathArena - Kangaroo 2025 Levels 11-12: Claude Opus 4.8 (Thinking) (100.0) beat GPT-5.4 (xHigh) by 1.67
MathArena - APEX 2025: Claude Opus 4.8 (Thinking) (81.25) beat GPT-5.5 (xHigh) by 1.04
MathArena - Kangaroo 2025 Levels 7-8: Claude Opus 4.8 (Thinking) (96.67) beat GPT-5.4 (xHigh) by 0.84
MathArena - AIME 2026: Claude Opus 4.8 (Thinking) (100.0) beat GPT-5.4 (xHigh) by 0.83

AI Benchmark Digest — 2026-06-02

2026-06-02T08:19:29.198019+00:00

Daily

New Benchmarks (1)

GIM (IRT ability (theta)): leader GPT-5.4 Pro (High) (2.16), 46 models
Grounded Integration Measure from Meta FAIR: 820 multimodal and text-grounded problems testing integrated reasoning across quantitative, spatial, language, world-knowledge, and document tasks. Scores are reported as IRT ability on GIM-820.
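
The entry does not state which IRT model GIM fits. For orientation only, and as an assumption rather than GIM's documented method, the common two-parameter logistic (2PL) form gives the probability that model j with ability \theta_j solves item i, where a_i is the item's discrimination and b_i its difficulty:

```latex
P(x_{ij} = 1 \mid \theta_j) = \frac{1}{1 + \exp\bigl(-a_i(\theta_j - b_i)\bigr)}
```

Under this assumed form, theta is not a percentage and has no fixed 0 to 100 range. It is conventionally scaled so that a reference population has mean 0 and unit variance; if GIM follows that convention, the leader's 2.16 would sit roughly two standard deviations above the reference mean.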

Top-10 New Scores (2)

GPT-5.5 (xHigh) on IMO-Bench: 71.9 (#4)
GPT-5.5 Pro (xHigh) on IMO-Bench: 88.1 (#2)

AI Benchmark Digest — 2026-06-01

2026-06-01T08:29:45.265204+00:00

Daily

New #1 Leaders (3)

EQ-Bench Creative Writing v3: Claude Opus 4.7 (2050.8) beat GPT-5.4 by 144.8
Design Arena (Data Viz): GLM-5.1 (1367.0) beat Claude Opus 4.7 (Thinking) by 23.0
Chatbot Arena (Image-to-Video): Grok 1.5 (1473.0) beat dreamina-seedance-2.0-720p by 11.0

AI Benchmark Digest — 2026-05-30

2026-05-30T07:49:09.779753+00:00

Daily

Top-10 New Scores (5)

Claude Opus 4.8 (Adaptive Reasoning, Max Effort) on UGI - Natural Intelligence: 65.39 (#30)
Claude Opus 4.8 (Adaptive Reasoning, Max Effort) on UGI - Willingness (W/10): 2.2 (#1094)
Claude Opus 4.8 (Adaptive Reasoning, Max Effort) on UGI - Writing: 65.88 (#34)
Claude Opus 4.8 (Adaptive Reasoning, Max Effort) on UGI Leaderboard: 52.64 (#69)
GPT-5.4 (xHigh) on Creative Writing (Lechmazur): 3.4 (#2)

New #1 Leaders (2)

Bullshit Benchmark: Claude Opus 4.8 (96.4) beat Claude Sonnet 4.6 by 1.9
Creative Writing (Lechmazur): GPT-5.5 (xHigh) (3.5) beat GPT-5.5 (Thinking, xHigh) by 0.3