Back to Benchmarks
Multi-Model Variance Benchmark Report
2025-12-30
Model: Simulated variance patterns
Temperature: 0.0
20 runs/question
10 questions
Executive Summary
| Metric |
Raw Prompts |
Structured |
Change |
| Mean Agreement Rate (TARa) |
80.0% |
98.5% |
+18.5 pp |
| Inconsistency Rate |
20.0% |
1.5% |
-18.5 pp |
| Mean Variance Reduction |
- |
- |
38.0% |
Results by Category
Logic (2 questions)
| Raw Agreement |
Structured Agreement |
Improvement |
| 67.5% |
95.0% |
+27.5 pp |
Factual (2 questions)
| Raw Agreement |
Structured Agreement |
Improvement |
| 92.5% |
100.0% |
+7.5 pp |
Math (3 questions)
| Raw Agreement |
Structured Agreement |
Improvement |
| 98.3% |
100.0% |
+1.7 pp |
Decision (2 questions)
| Raw Agreement |
Structured Agreement |
Improvement |
| 65.0% |
97.5% |
+32.5 pp |
Complex (1 question)
| Raw Agreement |
Structured Agreement |
Improvement |
| 55.0% |
100.0% |
+45.0 pp |
Methodology
Based on academic literature:
Protocol
-
Each question run 20 times with identical
parameters
- Temperature: 0.0
- Two conditions: Raw prompts vs 5-step structured reasoning
- Metric: TARa (Total Agreement Rate for parsed answers)
Structured prompting reduced output inconsistency from
20.0% to 1.5%
38% variance reduction