Evaluation evidence

Evals

Each UI generation eval run is immutable and compares raw generated interfaces with JudgmentKit-guided handoff outputs. Treat these reports as evidence, not as broad benchmark claims.

Latest run: 2026-05-15 / mcp-0.1.0 / run-001
Claim level: repeated_pair_signal
Result: 2/2 passed
Guided wins: 2

Qualitative paired-artifact evidence only; not a statistically powered benchmark.

All runs

Date        MCP release  Run      Claim level           Result      Reports
2026-05-15  0.1.0        run-001  repeated_pair_signal  2/2 passed  HTML · JSON
2026-05-13  0.1.0        run-001  repeated_pair_signal  2/2 passed  HTML · JSON
2026-05-12  0.1.0        run-005  repeated_pair_signal  2/2 passed  HTML · JSON
2026-05-12  0.1.0        run-004  repeated_pair_signal  2/2 passed  HTML · JSON
2026-05-12  0.1.0        run-003  repeated_pair_signal  2/2 passed  HTML · JSON
2026-05-12  0.1.0        run-002  repeated_pair_signal  2/2 passed  HTML · JSON
2026-05-12  0.1.0        run-001  repeated_pair_signal  2/2 passed  HTML · JSON
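
The rows above can be tallied with a short script. This is a minimal sketch, not part of any JudgmentKit tooling: the rows are copied by hand from the table, the names `RUNS`, `parse_result`, and `tally` are hypothetical, and the schema of the linked JSON reports is not assumed. The totals are a convenience summary of the listed runs and carry the same caveat as the runs themselves: qualitative evidence, not a statistically powered benchmark.

```python
# Rows copied by hand from the "All runs" table:
# (date, MCP release, run, claim level, result).
RUNS = [
    ("2026-05-15", "0.1.0", "run-001", "repeated_pair_signal", "2/2 passed"),
    ("2026-05-13", "0.1.0", "run-001", "repeated_pair_signal", "2/2 passed"),
    ("2026-05-12", "0.1.0", "run-005", "repeated_pair_signal", "2/2 passed"),
    ("2026-05-12", "0.1.0", "run-004", "repeated_pair_signal", "2/2 passed"),
    ("2026-05-12", "0.1.0", "run-003", "repeated_pair_signal", "2/2 passed"),
    ("2026-05-12", "0.1.0", "run-002", "repeated_pair_signal", "2/2 passed"),
    ("2026-05-12", "0.1.0", "run-001", "repeated_pair_signal", "2/2 passed"),
]


def parse_result(result):
    """Split a result string such as '2/2 passed' into (passed, total)."""
    fraction = result.split()[0]          # '2/2'
    passed, total = fraction.split("/")   # '2', '2'
    return int(passed), int(total)


def tally(runs):
    """Return (number of runs, total passed, total attempted)."""
    counts = [parse_result(row[4]) for row in runs]
    passed = sum(p for p, _ in counts)
    total = sum(t for _, t in counts)
    return len(runs), passed, total


run_count, passed, total = tally(RUNS)
print(f"{run_count} runs listed, {passed} of {total} passed")
```

Run as written, the script reports 7 runs listed with 14 of 14 passed.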