Evaluation evidence
Evals
Each UI generation eval run is immutable and compares raw generated interfaces with JudgmentKit-guided handoff outputs. Treat these reports as evidence, not as broad benchmark claims.
- Latest run: 2026-05-15 / mcp-0.1.0 / run-001
- Claim level: repeated_pair_signal
- Result: 2/2 passed
- Guided wins: 2
Qualitative paired-artifact evidence only; not a statistically powered benchmark.
All runs
| Date | MCP release | Run | Claim level | Result | Reports |
|---|---|---|---|---|---|
| 2026-05-15 | 0.1.0 | run-001 | repeated_pair_signal | 2/2 passed | HTML · JSON |
| 2026-05-13 | 0.1.0 | run-001 | repeated_pair_signal | 2/2 passed | HTML · JSON |
| 2026-05-12 | 0.1.0 | run-005 | repeated_pair_signal | 2/2 passed | HTML · JSON |
| 2026-05-12 | 0.1.0 | run-004 | repeated_pair_signal | 2/2 passed | HTML · JSON |
| 2026-05-12 | 0.1.0 | run-003 | repeated_pair_signal | 2/2 passed | HTML · JSON |
| 2026-05-12 | 0.1.0 | run-002 | repeated_pair_signal | 2/2 passed | HTML · JSON |
| 2026-05-12 | 0.1.0 | run-001 | repeated_pair_signal | 2/2 passed | HTML · JSON |
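
To read the run table programmatically, a minimal Python sketch follows. It assumes the rows are copied as pipe-delimited text with the Reports column dropped. The names `parse_row`, `runs`, and the dictionary keys are hypothetical and not part of JudgmentKit or its JSON reports. The sketch only checks consistency of the listed runs and does not turn them into a benchmark score.

```python
from collections import Counter

# Rows copied from the "All runs" table (Reports column omitted).
ROWS = """\
2026-05-15 | 0.1.0 | run-001 | repeated_pair_signal | 2/2 passed
2026-05-13 | 0.1.0 | run-001 | repeated_pair_signal | 2/2 passed
2026-05-12 | 0.1.0 | run-005 | repeated_pair_signal | 2/2 passed
2026-05-12 | 0.1.0 | run-004 | repeated_pair_signal | 2/2 passed
2026-05-12 | 0.1.0 | run-003 | repeated_pair_signal | 2/2 passed
2026-05-12 | 0.1.0 | run-002 | repeated_pair_signal | 2/2 passed
2026-05-12 | 0.1.0 | run-001 | repeated_pair_signal | 2/2 passed
"""


def parse_row(line):
    """Split one pipe-delimited row into a dict (hypothetical field names)."""
    date, release, run, claim_level, result = [c.strip() for c in line.split("|")]
    # "2/2 passed" -> passed=2, total=2
    passed, total = result.split()[0].split("/")
    return {
        "date": date,
        "release": release,
        "run": run,
        "claim_level": claim_level,
        "passed": int(passed),
        "total": int(total),
    }


runs = [parse_row(line) for line in ROWS.splitlines() if line.strip()]

# Run IDs restart each day, so a run is identified by (date, run), not run alone.
runs_per_date = Counter(r["date"] for r in runs)
all_passed = all(r["passed"] == r["total"] for r in runs)
latest = max(runs, key=lambda r: (r["date"], r["run"]))

print(f"{len(runs)} runs listed; every pair passed: {all_passed}")
print(f"latest: {latest['date']} / mcp-{latest['release']} / {latest['run']}")
```

Keying on the `(date, run)` pair matters here because `run-001` appears on three different dates in the table.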