The methodology for judging AI needs realignment
By Mathew C S
When Anthropic released Claude 4 a week ago, the artificial intelligence (AI) company said the models set “new standards for coding, advanced reasoning, and AI agents”, citing leading scores on SWE-bench Verified, a benchmark that measures performance on real software engineering tasks. OpenAI likewise claims that its o3 and o4-mini models post the best scores on certain benchmarks, as does Mistral for its open-source Devstral coding model.