UI generation eval report

JudgmentKit UI-Generation Eval

Deterministic paired-artifact scoring for existing standalone comparison apps. Use this report to review winner, delta, leakage, and activity-fit evidence by case.

Latest run outcome

2/2 cases passed

2 guided wins, 0 baseline wins, 0 ties.

Claim level: repeated_pair_signal
Run date: 2026-05-12
MCP release: 0.1.0
Run: run-005
Eval id: judgmentkit-ui-generation-paired-artifact-v1
Metric scale: 0-5

Qualitative paired-artifact evidence only; not a statistically powered benchmark.
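
The run outcome above can be derived from per-case results. The following is a minimal sketch, not the JudgmentKit implementation: the record layout and the winner labels ("guided", "baseline", "tie") are hypothetical names chosen for illustration.

```python
from collections import Counter

# Per-case results as reported in this run.
# The dict layout and the short winner labels are assumptions.
cases = [
    {"name": "Refund triage handoff", "winner": "guided", "passed": True},
    {"name": "Dinner playlist builder", "winner": "guided", "passed": True},
]


def summarize(cases):
    """Build the two summary lines shown under 'Latest run outcome'."""
    wins = Counter(c["winner"] for c in cases)
    passed = sum(1 for c in cases if c["passed"])
    return {
        "passed": f"{passed}/{len(cases)} cases passed",
        "tally": (
            f"{wins['guided']} guided wins, "
            f"{wins['baseline']} baseline wins, "
            f"{wins['tie']} ties."
        ),
    }


summary = summarize(cases)
print(summary["passed"])
print(summary["tally"])
```

A Counter returns 0 for labels that never occur, so the baseline and tie counts need no special handling.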

Case review

Refund triage handoff

Review the selected refund request and prepare the next handoff.

Status: Passed
Winner: JudgmentKit guided
Expected winner: JudgmentKit guided
Score delta: +96 (guided over baseline)
Threshold: 20
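
The report does not state how a case earns "Passed". One plausible reading of the fields above is that the expected side must win by at least the threshold. The sketch below encodes that assumed rule; the function name and signature are hypothetical.

```python
def case_passes(winner, expected_winner, score_delta, threshold):
    """Assumed pass rule: the expected artifact wins and the
    score delta (guided minus baseline, 0-100 scale) meets the threshold."""
    return winner == expected_winner and score_delta >= threshold


# Values reported for the refund triage case.
refund_passed = case_passes(
    winner="JudgmentKit guided",
    expected_winner="JudgmentKit guided",
    score_delta=96,
    threshold=20,
)
print(refund_passed)
```

Under this reading, a guided win with a delta below 20 would be reported as a failure even though the winner matches.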

Visual evidence

Screenshots show the initial viewport of each committed artifact. They are archived with this run and are not scoring inputs.

Metric comparison

Baseline and guided metric scores use the 0-5 scale; case totals are weighted and reported on a 0-100 scale.

Metric                     Baseline                    Guided                      Delta
Activity Fit               0/5 (0 present, 5 missing)  5/5 (5 present, 0 missing)  +5
Decision Support           0/5 (0 present, 4 missing)  5/5 (4 present, 0 missing)  +5
Disclosure Discipline      0/5 (8 leaks)               5/5 (0 leaks)               +5
Handoff Completeness       1/5 (1 present, 4 missing)  5/5 (5 present, 0 missing)  +4
Task Success Support       0/5 (0 present, 5 missing)  5/5 (5 present, 0 missing)  +5
Confidence Rework Signals  0/5 (0 present, 3 missing)  5/5 (3 present, 0 missing)  +5
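
The present/missing counts and the displayed scores are consistent with a simple coverage ratio scaled to 0-5. This is inferred from the reported numbers, not taken from a documented formula, so treat the sketch below as an assumption. It does not cover Disclosure Discipline, which is reported in leaks rather than present/missing counts. The function name is hypothetical.

```python
def coverage_score(present, missing, scale=5):
    """Assumed scoring rule: share of expected items present, scaled to 0-5."""
    total = present + missing
    if total == 0:
        return 0.0  # assumption: a metric with no expected items scores 0
    return round(scale * present / total, 2)


# Handoff Completeness, baseline: 1 present, 4 missing
print(coverage_score(1, 4))
# Handoff Completeness, guided: 5 present, 0 missing
print(coverage_score(5, 0))
```

The same ratio also matches the fractional baseline scores reported for the dinner playlist case, for example 1 present and 5 missing giving 0.83.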

Activity-fit evidence

Matched terms: 0 baseline to 5 guided

Guided output surfaced more of the task vocabulary reviewers need to judge activity fit.

Baseline matched (0) None
Guided matched (5)
  • Daily triage
  • Refund escalation queue
  • Customer refund escalation
  • Evidence checklist
  • Policy review context
Guided missing (0) None
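
How the evaluator matches terms is not described in this report. A minimal sketch, assuming a case-insensitive substring match against the artifact's visible text, could look like the following. The function name and both sample texts are hypothetical; only the expected terms come from the lists above.

```python
# Expected activity-fit terms for the refund triage case (from the report).
EXPECTED_TERMS = [
    "Daily triage",
    "Refund escalation queue",
    "Customer refund escalation",
    "Evidence checklist",
    "Policy review context",
]


def match_terms(artifact_text, expected_terms):
    """Split expected terms into matched and missing lists.
    Assumption: case-insensitive substring matching."""
    text = artifact_text.casefold()
    matched = [t for t in expected_terms if t.casefold() in text]
    missing = [t for t in expected_terms if t.casefold() not in text]
    return matched, missing


# Hypothetical visible text, invented for illustration only.
guided_text = (
    "Daily triage | Refund escalation queue | Customer refund escalation | "
    "Evidence checklist | Policy review context"
)
baseline_text = "Edit the record and submit the form."

guided_matched, guided_missing = match_terms(guided_text, EXPECTED_TERMS)
baseline_matched, baseline_missing = match_terms(baseline_text, EXPECTED_TERMS)
print(len(baseline_matched), "to", len(guided_matched), "matched terms")
```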

Implementation leakage

Leaks: 8 baseline to 0 guided

Leakage findings count terms that make implementation mechanics visible in the primary artifact.

Baseline leakage (8 leaks)
  • database table
  • JSON schema
  • prompt template
  • tool call
  • resource id
  • API endpoint
  • CRUD
  • field
Guided leakage (0 leaks) None

Expected outcomes and rationale

Expected outcomes

  • The reviewer can identify the selected case.
  • The reviewer can choose approve, policy review, or return for evidence.
  • The reviewer can complete a handoff with a reason and next owner.

Rationale

  • JudgmentKit-guided artifact scored 96 points above baseline.
  • Implementation leakage changed from 8 baseline terms to 0 guided terms.
  • Activity-fit evidence changed from 0 matched terms to 5 matched terms.

Case review

Dinner playlist builder

Build a 10-song dinner playlist that starts mellow, lifts in the middle, avoids disliked artists and explicit tracks, and leaves a sequence note.

Status: Passed
Winner: JudgmentKit guided
Expected winner: JudgmentKit guided
Score delta: +88.82 (guided over baseline)
Threshold: 20

Visual evidence

Screenshots show the initial viewport of each committed artifact. They are archived with this run and are not scoring inputs.

Metric comparison

Baseline and guided metric scores use the 0-5 scale; case totals are weighted and reported on a 0-100 scale.

Metric                     Baseline                       Guided                      Delta
Activity Fit               0.83/5 (1 present, 5 missing)  5/5 (6 present, 0 missing)  +4.17
Decision Support           0/5 (0 present, 5 missing)     5/5 (5 present, 0 missing)  +5
Disclosure Discipline      0/5 (10 leaks)                 5/5 (0 leaks)               +5
Handoff Completeness       1.25/5 (1 present, 3 missing)  5/5 (4 present, 0 missing)  +3.75
Task Success Support       1.43/5 (2 present, 5 missing)  5/5 (7 present, 0 missing)  +3.57
Confidence Rework Signals  0/5 (0 present, 3 missing)     5/5 (3 present, 0 missing)  +5
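
The report says totals are weighted on a 0-100 scale but does not publish the weights. The sketch below shows how a weighted total and score delta could be computed from the 0-5 metric scores. The weights are hypothetical. Applied to the rounded scores displayed in this report, they happen to reproduce the reported deltas of +88.82 and +96, which does not confirm that they are the evaluator's real weights.

```python
# Hypothetical weights (sum to 1.0); not published in the report.
WEIGHTS = {
    "activity_fit": 0.20,
    "decision_support": 0.20,
    "disclosure_discipline": 0.20,
    "handoff_completeness": 0.20,
    "task_success_support": 0.10,
    "confidence_rework_signals": 0.10,
}


def weighted_total(scores, scale=5):
    """Map 0-5 metric scores to a 0-100 weighted total."""
    return 100 * sum(WEIGHTS[m] * s / scale for m, s in scores.items())


def score_delta(guided, baseline):
    """Guided total minus baseline total, rounded to two decimals."""
    return round(weighted_total(guided) - weighted_total(baseline), 2)


all_fives = {metric: 5 for metric in WEIGHTS}  # guided scored 5/5 everywhere

# Displayed baseline scores for the dinner playlist case.
playlist_baseline = dict.fromkeys(WEIGHTS, 0) | {
    "activity_fit": 0.83,
    "handoff_completeness": 1.25,
    "task_success_support": 1.43,
}
# Displayed baseline scores for the refund triage case.
refund_baseline = dict.fromkeys(WEIGHTS, 0) | {"handoff_completeness": 1}

playlist_delta = score_delta(all_fives, playlist_baseline)
refund_delta = score_delta(all_fives, refund_baseline)
print(playlist_delta, refund_delta)
```

The dict merge operator `|` requires Python 3.9 or later. Other weightings can reproduce the same two deltas, so the example illustrates the arithmetic only.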

Activity-fit evidence

Matched terms: 1 baseline to 6 guided

Guided output surfaced more of the task vocabulary reviewers need to judge activity fit.

Baseline matched (1)
  • Sequence note
Guided matched (6)
  • Dinner brief
  • Guest preferences
  • Suggested tracks
  • Playlist sequence
  • Conflict checks
  • Sequence note
Guided missing (0) None

Implementation leakage

Leaks: 10 baseline to 0 guided

Leakage findings count terms that make implementation mechanics visible in the primary artifact.

Baseline leakage (10 leaks)
  • data model
  • track table field
  • JSON schema
  • prompt template
  • tool call
  • resource id
  • API endpoint
  • CRUD
  • field
  • model
Guided leakage (0 leaks) None
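
The leakage list above contains overlapping terms ("field" and "track table field", "model" and "data model"). The sketch below shows one way such a term scan could work, assuming whole-word, case-insensitive matching where each listed term counts at most once. The actual detection rule is not documented in this report, and both sample texts are hypothetical.

```python
import re

# Leakage terms listed for the dinner playlist baseline (from the report).
LEAK_TERMS = [
    "data model", "track table field", "JSON schema", "prompt template",
    "tool call", "resource id", "API endpoint", "CRUD", "field", "model",
]


def find_leaks(artifact_text, terms=LEAK_TERMS):
    """Return each term that appears as a whole word or phrase.
    Assumption: case-insensitive, one finding per term."""
    found = []
    for term in terms:
        pattern = r"\b" + re.escape(term) + r"\b"
        if re.search(pattern, artifact_text, flags=re.IGNORECASE):
            found.append(term)
    return found


# Hypothetical artifact text, invented for illustration only.
baseline_text = "CRUD form for the track table field, backed by a data model."
guided_text = "Dinner brief, guest preferences, suggested tracks, sequence note."

baseline_leaks = find_leaks(baseline_text)
guided_leaks = find_leaks(guided_text)
print(len(baseline_leaks), "leaks to", len(guided_leaks), "leaks")
```

With this rule, the phrase "track table field" produces two findings because "field" also matches inside it. Whether the evaluator counts overlapping terms from a single occurrence this way is not stated.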

Expected outcomes and rationale

Expected outcomes

  • The host can assemble a 10-song playlist.
  • The host can catch explicit and disliked-artist conflicts.
  • The host can explain the sequence from mellow start through lifted middle to soft close.

Rationale

  • JudgmentKit-guided artifact scored 88.82 points above baseline.
  • Implementation leakage changed from 10 baseline terms to 0 guided terms.
  • Activity-fit evidence changed from 1 matched term to 6 matched terms.