Roo Code Evals

Roo Code tests each frontier model against a suite of hundreds of exercises of varying difficulty across 5 programming languages. Use these results to find the right price-to-intelligence ratio for your use case.

Want to see the results for a model we haven't tested yet? Ping us in Discord.

Cost Versus Score
(Note: Very expensive models are excluded from the scatter plot.)
Model		Metrics				Scores
Name	Context Window	Price In / Out	Duration	Tokens In / Out	Cost USD						Total
Claude Sonnet 4	0	$0.00 / $0.00	5h 35m 31s	39M / 644K	$39.61	94%	100%	98%	100%	97%	98%
Gemini 2.5 Pro	0	$0.00 / $0.00	6h 17m 23s	43M / 1M	$57.80	97%	91%	96%	100%	97%	96%
Claude Opus 4	0	$0.00 / $0.00	7h 50m 29s	30M / 485K	$172.29	92%	91%	94%	94%	100%	94%
Claude 3.7 Sonnet	0	$0.00 / $0.00	4h 52m 36s	19M / 603K	$27.16	92%	93%	98%	97%	87%	94%
GPT 4.1	0	$0.00 / $0.00	4h 39m 51s	37M / 624K	$38.64	92%	91%	90%	94%	90%	91%
Gemini 2.5 Flash	0	$0.00 / $0.00	3h 39m 38s	61M / 1M	$14.15	89%	91%	92%	85%	90%	90%
Claude 3.5 Sonnet	0	$0.00 / $0.00	3h 37m 58s	19M / 323K	$24.98	94%	91%	92%	88%	80%	90%
Grok 3	0	$0.00 / $0.00	5h 14m 20s	40M / 890K	$74.40	97%	89%	90%	91%	77%	89%
Qwen 3 Coder	0	$0.00 / $0.00	7h 56m 14s	51M / 828K	$27.63	86%	80%	82%	85%	87%	84%
Kimi K2	0	$0.00 / $0.00	7h 52m 24s	27M / 433K	$12.39	81%	80%	88%	82%	83%	83%
GPT 4.1 Mini	0	$0.00 / $0.00	5h 17m 57s	47M / 715K	$8.81	81%	84%	94%	76%	70%	83%
Qwen3 235B A22B 2507	0	$0.00 / $0.00	8h 3m 37s	44M / 498K	$6.94	69%	84%	82%	79%	80%	79%
o4 Mini (High)	0	$0.00 / $0.00	14h 44m 26s	13M / 3M	$25.70	75%	82%	86%	79%	67%	79%
DeepSeek V3	0	$0.00 / $0.00	7h 12m 41s	30M / 524K	$12.82	83%	76%	82%	76%	67%	77%
o3 Mini (High)	0	$0.00 / $0.00	13h 1m 13s	12M / 2M	$20.36	67%	78%	72%	88%	73%	75%
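
One way to read the table for price-to-intelligence is to compare each model's Cost USD with its Total score. The sketch below is illustrative only and is not Roo Code's own methodology: the helper names `pareto_frontier` and `points_per_dollar` are hypothetical, and "cost" here is assumed to mean the total spend for the eval run as listed in the Cost USD column, not a per-token price. The numbers are copied from the Cost USD and Total columns above.

```python
# (model name, eval run cost in USD, total score in %), copied from the table above.
RESULTS = [
    ("Claude Sonnet 4", 39.61, 98),
    ("Gemini 2.5 Pro", 57.80, 96),
    ("Claude Opus 4", 172.29, 94),
    ("Claude 3.7 Sonnet", 27.16, 94),
    ("GPT 4.1", 38.64, 91),
    ("Gemini 2.5 Flash", 14.15, 90),
    ("Claude 3.5 Sonnet", 24.98, 90),
    ("Grok 3", 74.40, 89),
    ("Qwen 3 Coder", 27.63, 84),
    ("Kimi K2", 12.39, 83),
    ("GPT 4.1 Mini", 8.81, 83),
    ("Qwen3 235B A22B 2507", 6.94, 79),
    ("o4 Mini (High)", 25.70, 79),
    ("DeepSeek V3", 12.82, 77),
    ("o3 Mini (High)", 20.36, 75),
]


def pareto_frontier(results):
    """Keep a model only if no model costing the same or less scores as high or higher.

    Hypothetical helper: this is one possible reading of "price-to-intelligence",
    not a definition taken from the source.
    """
    frontier = []
    best_score = -1
    # Walk from cheapest to most expensive; on equal cost, the higher score comes first.
    for name, cost, score in sorted(results, key=lambda r: (r[1], -r[2])):
        if score > best_score:
            frontier.append((name, cost, score))
            best_score = score
    return frontier


def points_per_dollar(results):
    """Rank models by total score divided by eval cost, highest ratio first."""
    ranked = [(name, score / cost) for name, cost, score in results]
    return sorted(ranked, key=lambda r: r[1], reverse=True)


if __name__ == "__main__":
    for name, cost, score in pareto_frontier(RESULTS):
        print(f"{name}: ${cost:.2f} for {score}%")
```

The frontier view tends to be more useful than a single ratio: a raw points-per-dollar ranking favors the cheapest runs, while the frontier shows what each step up in score costs.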