JudgmentKit UI-Generation Eval
Deterministic paired-artifact scoring for existing standalone comparison apps.
- Eval id: judgmentkit-ui-generation-paired-artifact-v1
- Claim level: repeated_pair_signal
- Run date: 2026-05-12
- MCP release: 0.1.0
- Run: run-002
- Cases: 2
- Passed: 2
- Guided wins: 2
- Baseline wins: 0
Qualitative paired-artifact evidence only; not a statistically powered benchmark.
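As a rough illustration of how the run-level counts above could roll up from per-case results, here is a minimal sketch. The `summarize` helper and the tuple layout are hypothetical and are not taken from JudgmentKit; the case names, winners, and pass status are the ones reported on this page.

```python
from collections import Counter

# (case name, winning artifact, passed) as reported for run-002.
cases = [
    ("Refund triage handoff", "judgmentkit_handoff", True),
    ("Dinner playlist builder", "judgmentkit_handoff", True),
]

def summarize(cases):
    """Hypothetical roll-up of per-case results into run-level counts."""
    wins = Counter(winner for _, winner, _ in cases)
    return {
        "cases": len(cases),
        "passed": sum(1 for _, _, passed in cases if passed),
        # Assumption: "guided" is judgmentkit_handoff, "baseline" is raw_brief_baseline.
        "guided_wins": wins["judgmentkit_handoff"],
        "baseline_wins": wins["raw_brief_baseline"],
    }

print(summarize(cases))
```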
Refund triage handoff
Review the selected refund request and prepare the next handoff.
- Winner: judgmentkit_handoff
- Expected winner: judgmentkit_handoff
- Score delta: 96
- Minimum delta: 20
Rationale
- JudgmentKit-guided artifact scored 96 points above baseline.
- Implementation leakage changed from 8 baseline terms to 0 guided terms.
- Activity-fit evidence changed from 0 matched terms to 5 matched terms.
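The report does not state the pass rule, but the fields above suggest one plausible reading: a case passes when the observed winner matches the expected winner and the score delta is at least the minimum delta. The sketch below encodes that assumption; `CaseResult` and `case_passes` are hypothetical names, not part of JudgmentKit.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CaseResult:
    winner: str
    expected_winner: str
    score_delta: float
    minimum_delta: float

def case_passes(result: CaseResult) -> bool:
    """Assumed rule: right winner AND delta meets the minimum threshold."""
    return (
        result.winner == result.expected_winner
        and result.score_delta >= result.minimum_delta
    )

# Values reported for the refund triage case.
refund = CaseResult("judgmentkit_handoff", "judgmentkit_handoff", 96, 20)
print(case_passes(refund))
```

Under this reading, a guided win by fewer than 20 points would not count as a pass, even though the expected artifact won.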
Expected Outcomes
- The reviewer can identify the selected case.
- The reviewer can choose approve, policy review, or return for evidence.
- The reviewer can complete a handoff with a reason and next owner.
Version A
raw_brief_baseline
examples/comparison/version-a.html
| Metric | Score | Summary | Evidence |
|---|---|---|---|
| activity_fit | 0 | 0 present | Present: None. Missing: (not listed) |
| decision_support | 0 | 0 present | Present: None. Missing: (not listed) |
| disclosure_discipline | 0 | 8 leaks | Implementation leakage: (not listed). Review-packet leakage: None |
| handoff_completeness | 1 | 1 present | Present: (not listed). Missing: (not listed) |
| task_success_support | 0 | 0 present | Present: None. Missing: (not listed) |
| confidence_rework_signals | 0 | 0 present | Present: None. Missing: (not listed) |
Version B
judgmentkit_handoff
examples/comparison/version-b.html
| Metric | Score | Summary | Evidence |
|---|---|---|---|
| activity_fit | 5 | 5 present | Present: (not listed). Missing: None |
| decision_support | 5 | 4 present | Present: (not listed). Missing: None |
| disclosure_discipline | 5 | 0 leaks | Implementation leakage: None. Review-packet leakage: None |
| handoff_completeness | 5 | 5 present | Present: (not listed). Missing: None |
| task_success_support | 5 | 5 present | Present: (not listed). Missing: None |
| confidence_rework_signals | 5 | 3 present | Present: (not listed). Missing: None |
Dinner playlist builder
Build a 10-song dinner playlist that starts mellow, lifts in the middle, avoids disliked artists and explicit tracks, and leaves a sequence note.
- Winner: judgmentkit_handoff
- Expected winner: judgmentkit_handoff
- Score delta: 88.82
- Minimum delta: 20
Rationale
- JudgmentKit-guided artifact scored 88.82 points above baseline.
- Implementation leakage changed from 10 baseline terms to 0 guided terms.
- Activity-fit evidence changed from 1 matched term to 6 matched terms.
Expected Outcomes
- The host can assemble a 10-song playlist.
- The host can catch explicit and disliked-artist conflicts.
- The host can explain the sequence from mellow start through lifted middle to soft close.
Version A
raw_brief_baseline
examples/comparison/music/version-a.html
| Metric | Score | Summary | Evidence |
|---|---|---|---|
| activity_fit | 0.83 | 1 present | Present: (not listed). Missing: (not listed) |
| decision_support | 0 | 0 present | Present: None. Missing: (not listed) |
| disclosure_discipline | 0 | 10 leaks | Implementation leakage: (not listed). Review-packet leakage: None |
| handoff_completeness | 1.25 | 1 present | Present: (not listed). Missing: (not listed) |
| task_success_support | 1.43 | 2 present | Present: (not listed). Missing: (not listed) |
| confidence_rework_signals | 0 | 0 present | Present: None. Missing: (not listed) |
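The fractional scores in this table are consistent with a simple proportional rule: score = 5 × (terms present) / (terms expected), rounded to two decimals. This is an inference from the reported numbers, not a documented formula. The expected-term totals used below (6, 4, and 7) are assumed from the Version B rows for this case, where nothing is missing, and `presence_score` is a hypothetical helper.

```python
def presence_score(present: int, total: int, max_score: float = 5.0) -> float:
    """Assumed rule: scale the share of expected terms found to a 0-5 score."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(max_score * present / total, 2)

# Version A of the dinner playlist case; totals inferred from Version B.
print(presence_score(1, 6))  # activity_fit
print(presence_score(1, 4))  # handoff_completeness
print(presence_score(2, 7))  # task_success_support
```

The leak-based `disclosure_discipline` metric does not fit this pattern and is not covered by the sketch.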
Version B
judgmentkit_handoff
examples/comparison/music/version-b.html
| Metric | Score | Summary | Evidence |
|---|---|---|---|
| activity_fit | 5 | 6 present | Present: (not listed). Missing: None |
| decision_support | 5 | 5 present | Present: (not listed). Missing: None |
| disclosure_discipline | 5 | 0 leaks | Implementation leakage: None. Review-packet leakage: None |
| handoff_completeness | 5 | 4 present | Present: (not listed). Missing: None |
| task_success_support | 5 | 7 present | Present: (not listed). Missing: None |
| confidence_rework_signals | 5 | 3 present | Present: (not listed). Missing: None |