JudgmentKit UI-Generation Eval

Deterministic paired-artifact scoring for existing standalone comparison apps.

Qualitative paired-artifact evidence only; not a statistically powered benchmark.

Refund triage handoff

Review the selected refund request and prepare the next handoff.

Passed

raw_brief_baseline

4/100

examples/comparison/version-a.html

Metric	Score	Summary	Evidence
activity_fit	0	0 present	Present: None; Missing: Daily triage, Refund escalation queue, Customer refund escalation, Evidence checklist, Policy review context
decision_support	0	0 present	Present: None; Missing: Choose a path, Approve refund, Send to policy review, Return for evidence
disclosure_discipline	0	8 leaks	Implementation leakage: database table JSON schema prompt template tool call resource id API endpoint CRUD field; Review-packet leakage: None
handoff_completeness	1	1 present	Present: Handoff; Missing: Next owner, Support agent, Policy reviewer, Handoff reason
task_success_support	0	0 present	Present: None; Missing: Review selected case, Check evidence, Choose refund path, Prepare handoff, Receipt photo is missing
confidence_rework_signals	0	0 present	Present: None; Missing: Policy review context, Evidence checklist, missing receipt photo

judgmentkit_handoff

100/100

examples/comparison/version-b.html

Metric	Score	Summary	Evidence
activity_fit	5	5 present	Present: Daily triage, Refund escalation queue, Customer refund escalation, Evidence checklist, Policy review context; Missing: None
decision_support	5	4 present	Present: Choose a path, Approve refund, Send to policy review, Return for evidence; Missing: None
disclosure_discipline	5	0 leaks	Implementation leakage: None; Review-packet leakage: None
handoff_completeness	5	5 present	Present: Handoff, Next owner, Support agent, Policy reviewer, Handoff reason; Missing: None
task_success_support	5	5 present	Present: Review selected case, Check evidence, Choose refund path, Prepare handoff, Receipt photo is missing; Missing: None
confidence_rework_signals	5	3 present	Present: Policy review context, Evidence checklist, missing receipt photo; Missing: None
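
The present/missing rows above are consistent with a simple coverage ratio scaled to 0-5. The sketch below is an assumption inferred from the reported numbers, not the tool's confirmed implementation; `coverage_score` is a hypothetical name, and the two-decimal rounding is likewise assumed. It does not cover disclosure_discipline, which counts leaks rather than expected terms.

```python
def coverage_score(present: int, missing: int, scale: float = 5.0) -> float:
    """Hypothetical helper: share of expected terms found, scaled to 0..scale."""
    total = present + missing
    if total == 0:
        # Assumption: with no expected terms there is nothing to score.
        return 0.0
    return round(scale * present / total, 2)


# (present, missing) counts taken from the refund triage rows above.
print(coverage_score(0, 5))  # activity_fit, raw_brief_baseline -> 0.0
print(coverage_score(1, 4))  # handoff_completeness, raw_brief_baseline -> 1.0
print(coverage_score(4, 0))  # decision_support, judgmentkit_handoff -> 5.0
```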

Build a 10-song dinner playlist that starts mellow, lifts in the middle, avoids disliked artists and explicit tracks, and leaves a sequence note.

Passed

The host can assemble a 10-song playlist.
The host can catch explicit and disliked-artist conflicts.
The host can explain the sequence from mellow start through lifted middle to soft close.

raw_brief_baseline

11.18/100

examples/comparison/music/version-a.html

Metric	Score	Summary	Evidence
activity_fit	0.83	1 present	Present: Sequence note; Missing: Dinner brief, Guest preferences, Suggested tracks, Playlist sequence, Conflict checks
decision_support	0	0 present	Present: None; Missing: Add to playlist, Move earlier, Move later, Remove track, Mark as conflict
disclosure_discipline	0	10 leaks	Implementation leakage: data model track table field JSON schema prompt template tool call resource id API endpoint CRUD field model; Review-packet leakage: None
handoff_completeness	1.25	1 present	Present: Sequence note; Missing: Save playlist, Share playlist, 10 tracks ready to play
task_success_support	1.43	2 present	Present: explicit track, disliked artist; Missing: genre balance, energy flow, mellow opener, warm middle, closing track
confidence_rework_signals	0	0 present	Present: None; Missing: Fits brief, Conflict checks, ready to play

judgmentkit_handoff

100/100

examples/comparison/music/version-b.html

Metric	Score	Summary	Evidence
activity_fit	5	6 present	Present: Dinner brief, Guest preferences, Suggested tracks, Playlist sequence, Conflict checks, Sequence note; Missing: None
decision_support	5	5 present	Present: Add to playlist, Move earlier, Move later, Remove track, Mark as conflict; Missing: None
disclosure_discipline	5	0 leaks	Implementation leakage: None; Review-packet leakage: None
handoff_completeness	5	4 present	Present: Save playlist, Share playlist, Sequence note, 10 tracks ready to play; Missing: None
task_success_support	5	7 present	Present: explicit track, disliked artist, genre balance, energy flow, mellow opener, warm middle, closing track; Missing: None
confidence_rework_signals	5	3 present	Present: Fits brief, Conflict checks, ready to play; Missing: None
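
The overall scores are not a plain average of the six metric scores. Both baseline totals (4/100 and 11.18/100) can be reproduced if activity_fit and handoff_completeness each carry 20 points and task_success_support carries 10. These weights are an inference from the reported numbers, not documented values. Since an all-5 table totals 100, the other three metrics would share the remaining 50 points, but they score 0 in both baselines, so their split cannot be recovered here. `ASSUMED_WEIGHTS` and `partial_total` are hypothetical names.

```python
# Assumed weights in points out of 100, inferred from the two baseline totals.
# The three omitted metrics score 0 in both baselines and add nothing here.
ASSUMED_WEIGHTS = {
    "activity_fit": 20,
    "handoff_completeness": 20,
    "task_success_support": 10,
}


def partial_total(scores: dict, weights: dict = ASSUMED_WEIGHTS, scale: float = 5.0) -> float:
    """Sum weight * (score / scale) over the metrics with an assumed weight."""
    total = sum(w * scores.get(metric, 0.0) / scale for metric, w in weights.items())
    return round(total, 2)


# Metric scores as reported in the two raw_brief_baseline tables.
refund_baseline = {"activity_fit": 0, "handoff_completeness": 1, "task_success_support": 0}
music_baseline = {"activity_fit": 0.83, "handoff_completeness": 1.25, "task_success_support": 1.43}

print(partial_total(refund_baseline))  # 4.0, matching the reported 4/100
print(partial_total(music_baseline))   # 11.18, matching the reported 11.18/100
```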