Code Arena

Compare the performance of AI models on agentic coding tasks involving multi-step reasoning and tool use

Feb 9, 2026
143,756 votes
40 models
| Rank | Rank Spread | Organization · License | Score | Votes |
|---|---|---|---|---|
| 1 | 1-2 | Anthropic · Proprietary | 1576 +20/-20 | 1,268 |
| 2 | 1-2 | Anthropic · Proprietary | 1569 +17/-17 | 1,824 |
| 3 | 3-3 | Anthropic | 1503 +9/-9 | 9,409 |
| 4 | 4-7 | OpenAI · Proprietary | 1472 +16/-16 | 1,691 |
| 5 | 4-5 | Anthropic · Proprietary | 1471 +8/-8 | 9,597 |
| 6 | 5-9 | Google · Proprietary | 1449 +8/-8 | 15,589 |
| 7 | 5-10 | Moonshot · Modified MIT | 1446 +12/-12 | 2,495 |
| 8 | 6-10 | Google · Proprietary | 1443 +8/-8 | 11,158 |
| 9 | 6-10 | Z.ai · MIT | 1441 +10/-10 | 5,130 |
| 10 | 7-14 | Moonshot · Modified MIT | 1421 +17/-17 | 1,439 |
| 11 | 10-16 | | 1407 +9/-9 | 7,208 |
| 12 | 10-16 | MiniMax · MIT | 1405 +9/-9 | 8,466 |
| 13 | 10-19 | OpenAI · Proprietary | 1397 +16/-16 | 1,632 |
| 14 | 10-19 | OpenAI · Proprietary | 1394 +12/-12 | 3,926 |
| 15 | 11-19 | Anthropic · Proprietary | 1389 +8/-8 | 8,980 |
| 16 | 11-19 | OpenAI · Proprietary | 1389 +9/-9 | 6,438 |
| 17 | 13-19 | Anthropic | 1388 +7/-7 | 12,743 |
| 18 | 13-19 | Anthropic · Proprietary | 1386 +7/-7 | 14,337 |
| 19 | 13-20 | DeepSeek · MIT | 1375 +10/-10 | 4,765 |
| 20 | 19-22 | Z.ai · MIT | 1357 +8/-8 | 8,744 |
| 21 | 20-23 | OpenAI · Proprietary | 1348 +7/-7 | 11,643 |
| 22 | 20-25 | | 1340 +9/-9 | 5,535 |
| 23 | 22-25 | Moonshot · Modified MIT | 1333 +7/-7 | 11,143 |
| 24 | 21-25 | OpenAI · Proprietary | 1333 +10/-10 | 4,234 |
| 25 | 22-26 | OpenAI · Proprietary | 1329 +9/-9 | 6,502 |
| 26 | 25-28 | MiniMax · Apache 2.0 | 1313 +9/-9 | 8,833 |
| 27 | 26-28 | DeepSeek · MIT | 1310 +9/-9 | 6,028 |
| 28 | 26-29 | Anthropic · Proprietary | 1303 +7/-7 | 12,415 |
| 29 | 28-30 | DeepSeek · MIT | 1287 +10/-10 | 5,131 |
| 30 | 29-31 | Alibaba · Apache 2.0 | 1280 +7/-7 | 12,199 |
| 31 | 30-32 | KwaiKAT · Proprietary | 1259 +15/-15 | 1,953 |
| 32 | 31-34 | OpenAI · Proprietary | 1243 +17/-17 | 1,537 |
| 33 | 32-34 | xAI · Proprietary | 1234 +9/-9 | 6,886 |
| 34 | 32-37 | Mistral · Apache 2.0 | 1223 +20/-20 | 1,037 |
| 35 | 34-37 | Google · Proprietary | 1206 +13/-13 | 3,453 |
| 36 | 34-37 | xAI · Proprietary | 1204 +19/-19 | 1,266 |
| 37 | 34-37 | Mistral · Modified MIT | 1198 +16/-16 | 1,678 |
| 38 | 38-39 | xAI · Proprietary | 1153 +22/-22 | 968 |
| 39 | 38-40 | xAI · Proprietary | 1141 +21/-21 | 1,016 |
| 40 | 39-40 | Mistral · Proprietary | 1099 +22/-22 | 1,021 |

Leaderboard Plots

Fraction of Model A Wins for All Non-tied A vs. B Battles

Confidence Intervals on Model Strength (via Bootstrapping)

Battle Count for Each Combination of Models (without Ties)

Average Win Rate Against All Other Models (Uniform Sampling and No Ties)
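
The first and last plot titles name two quantities that can be derived from a log of pairwise battles: the fraction of non-tied battles one model wins against another, and each model's average win rate when every opponent is weighted equally. The following is a minimal sketch of those two calculations, not the leaderboard's actual pipeline. The battle-log layout, the function names `win_fractions` and `average_win_rates`, and the toy model names `m1` to `m3` are all assumptions made for illustration, and the outcomes are invented.

```python
from collections import defaultdict

# Toy battle log: (model_a, model_b, winner), winner is "a", "b", or "tie".
# Model names and outcomes are invented for illustration only.
battles = [
    ("m1", "m2", "a"),
    ("m1", "m2", "a"),
    ("m1", "m2", "b"),
    ("m1", "m3", "a"),
    ("m3", "m1", "tie"),
    ("m2", "m3", "b"),
    ("m3", "m2", "b"),
    ("m2", "m3", "a"),
    ("m2", "m3", "a"),
]


def win_fractions(battles):
    """Return {(x, y): fraction of non-tied x-vs-y battles won by x}.

    Assumes the two models in a battle are always different.
    """
    wins = defaultdict(int)  # wins[(x, y)] = number of times x beat y
    pairs = set()            # unordered pairs with at least one non-tied battle
    for a, b, winner in battles:
        if winner == "a":
            wins[(a, b)] += 1
        elif winner == "b":
            wins[(b, a)] += 1
        else:
            continue  # ties are excluded
        pairs.add(frozenset((a, b)))

    fractions = {}
    for pair in pairs:
        x, y = tuple(pair)
        total = wins[(x, y)] + wins[(y, x)]
        fractions[(x, y)] = wins[(x, y)] / total
        fractions[(y, x)] = wins[(y, x)] / total
    return fractions


def average_win_rates(fractions):
    """Average each model's win fraction over its opponents.

    Every opponent counts equally, regardless of how many battles were
    played against it (the "uniform sampling" reading of the plot title).
    """
    by_model = defaultdict(list)
    for (x, _opponent), frac in fractions.items():
        by_model[x].append(frac)
    return {model: sum(v) / len(v) for model, v in by_model.items()}


fractions = win_fractions(battles)
averages = average_win_rates(fractions)
for model in sorted(averages, key=averages.get, reverse=True):
    print(f"{model}: {averages[model]:.3f}")
```

Averaging per opponent rather than per battle keeps a heavily sampled matchup from dominating a model's overall win rate, which is one plausible reason to report the uniform-sampling figure alongside raw battle counts.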