How models perform as panelists.
Every model is run against the same hidden conformance suite — real bugs, vulnerabilities, and drift with known ground truth — then scored on what it caught, what it missed, and what it invented.
| # | Model | Accuracy | False+ | Cost / 1K | Latency | Composite |
|---|---|---|---|---|---|---|
| 01 | claude-sonnet-4.6 | 94.2% | 2.1% | $0.003 | 1.2s | |
| 02 | gpt-4o | 92.8% | 3.4% | $0.005 | 0.9s | |
| 03 | gemini-2.5-pro | 90.6% | 4.0% | $0.004 | 1.5s | |
| 04 | llama-3.3-70b (local) | 88.1% | 8.2% | $0.000 | 4.1s | |
| 05 | claude-haiku-4.5 | 85.4% | 5.9% | $0.0003 | 0.4s | |
| 06 | qwen-2.5-coder-32b (local) | 82.0% | 9.7% | $0.000 | 3.3s | |
Illustrative data for the concept mockup — not a published benchmark.
How the suite scores a model
Seat the model
Each model reviews all 500 cases as a single panelist, emitting findings in the standard format.
Cluster & dedupe
The deterministic reconciler clusters findings by location and dedupes by similarity — the same code that runs in production.
Score vs. truth
Clustered findings are matched against known ground truth for accuracy, false-positive rate, and an F1 composite.
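The cluster, dedupe, and score steps above can be sketched roughly as follows. This is a minimal illustration, not the production reconciler: the `Finding` shape, the line-window rule for "same location", the `difflib` similarity threshold, and the definitions of precision, recall, and false-positive rate are all assumptions made for the example.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class Finding:
    # Hypothetical finding shape; the real standard format is not shown here.
    file: str
    line: int
    message: str


def cluster_and_dedupe(findings, window=3, threshold=0.8):
    """Keep one finding per cluster of nearby, similarly worded findings."""
    kept = []
    for f in sorted(findings, key=lambda x: (x.file, x.line)):
        duplicate = any(
            k.file == f.file
            and abs(k.line - f.line) <= window
            and SequenceMatcher(None, k.message, f.message).ratio() >= threshold
            for k in kept
        )
        if not duplicate:
            kept.append(f)
    return kept


def score(findings, truth, window=3):
    """Match findings to ground-truth (file, line) pairs and compute metrics."""
    matched = set()
    false_positives = 0
    for f in findings:
        hit = next(
            (t for t in truth
             if t not in matched and t[0] == f.file and abs(t[1] - f.line) <= window),
            None,
        )
        if hit is None:
            false_positives += 1  # invented: no ground-truth issue nearby
        else:
            matched.add(hit)      # caught: each truth item counts once
    tp = len(matched)
    precision = tp / (tp + false_positives) if findings else 0.0
    recall = tp / len(truth) if truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    fp_rate = false_positives / len(findings) if findings else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "fp_rate": fp_rate}


# Made-up example data, for illustration only.
raw = [
    Finding("auth.py", 10, "SQL injection in login query"),
    Finding("auth.py", 11, "SQL injection in the login query"),
    Finding("util.py", 40, "Unused variable"),
]
truth = [("auth.py", 10), ("db.py", 5)]

deduped = cluster_and_dedupe(raw)
result = score(deduped, truth)
print(len(deduped), result)
```

In this toy run the two `auth.py` findings collapse into one, the `util.py` finding counts as invented, and the `db.py` issue is a miss, so precision, recall, and F1 all come out at 0.5.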
The composite is one input, not a verdict. A cheap, fast model with a higher false-positive rate is still a strong panelist when paired with a slower, pickier one — because the reconciler scores them together. Use the table to staff a panel, not to crown a winner.
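One way to see why staffing matters more than individual rank is to compare panels by what they catch together. The sketch below assumes, purely for illustration, that a panel's catches are the union of its members' catches. The real reconciler's combination logic is not shown here, and the model names and case IDs are made up.

```python
from itertools import combinations


def panel_recall(caught_by_model, panel, total_cases):
    """Fraction of cases caught by at least one panelist (union assumption)."""
    pooled = set().union(*(caught_by_model[m] for m in panel))
    return len(pooled) / total_cases


def best_pair(caught_by_model, total_cases):
    """Return the two-model panel with the highest pooled recall."""
    return max(
        combinations(sorted(caught_by_model), 2),
        key=lambda pair: panel_recall(caught_by_model, pair, total_cases),
    )


# Hypothetical catch sets over a 10-case suite.
caught = {
    "model_a": {1, 2, 3, 4, 5, 6, 7},  # strongest alone
    "model_b": {1, 2, 3, 4, 5, 6},     # second alone, but overlaps model_a
    "model_c": {8, 9, 10},             # weakest alone, but complementary
}

pair = best_pair(caught, total_cases=10)
print(pair, panel_recall(caught, pair, 10))
```

Here the weakest single model is the better partner for the strongest one, because it covers the cases the other misses. The two top-ranked models together catch nothing that `model_a` does not already catch alone.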
Verify your own roster against every configured endpoint before you spend a review on it.