Leaderboard
Overall Violation Rate
Lower is better. Values report the fraction of assigned design-requirement checks violated by each model.
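As a minimal sketch of how such a value could be computed: the rate is the number of violated checks divided by the number of assigned checks. The function name `violation_rate` and the counts below are hypothetical, chosen only for illustration. How the leaderboard aggregates checks across tasks is not stated here, so this assumes a single pooled count.

```python
def violation_rate(violated: int, assigned: int) -> float:
    """Return the fraction of assigned checks that were violated (0.0 to 1.0)."""
    if assigned <= 0:
        raise ValueError("assigned must be positive")
    if not 0 <= violated <= assigned:
        raise ValueError("violated must be between 0 and assigned")
    return violated / assigned


# Made-up counts for illustration only; these are not leaderboard data.
rate = violation_rate(violated=3, assigned=12)
print(f"{rate:.1%}")  # prints "25.0%"
```

Under this reading, a lower value means the model broke fewer of the design requirements it was given.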
Full Table
Model Rankings
Rank  Model              Family     Violation Rate
1     GPT-5.5            OpenAI     27.2%
2     Claude Opus 4.7    Anthropic  30.7%
3     Claude Opus 4.6    Anthropic  30.8%
4     GPT-5.4            OpenAI     34.5%
5     Grok 4.3           xAI        37.6%
6     GPT-4o             OpenAI     39.9%
7     GPT-4o mini        OpenAI     43.8%
8     Gemini 3.0 Flash   Google     54.7%
9     Qwen3 32B          Alibaba    54.9%
10    Qwen3.6 Max        Alibaba    55.6%
11    Qwen3 14B          Alibaba    57.0%
12    Qwen3 8B           Alibaba    57.1%
13    Claude Sonnet 4    Anthropic  57.2%
14    Gemini 3.1 Pro     Google     57.7%
15    DeepSeek v4 Pro    DeepSeek   57.7%
16    DeepSeek v4 Flash  DeepSeek   61.2%
17    Qwen3 1.7B         Alibaba    61.2%
18    DeepSeek Chat      DeepSeek   61.5%
19    Grok 4             xAI        62.9%
20    Gemini 2.0 Flash   Google     63.9%
21    Qwen3 4B           Alibaba    65.1%
22    Grok 3             xAI        67.3%