UI generation eval report

JudgmentKit UI-Generation Eval

Deterministic paired-artifact scoring for existing standalone comparison apps. Use this report to review each case's winner, score delta, implementation leakage, and activity-fit evidence.

Latest run outcome

2/2 cases passed

2 guided wins, 0 baseline wins, 0 ties.
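
The outcome line above is a tally over per-case results. A minimal sketch of that tally follows; the record layout and the field names (name, winner, passed) are assumptions for illustration, not the eval's actual schema.

```python
from collections import Counter

# Hypothetical per-case records mirroring this report; field names are assumed.
cases = [
    {"name": "Refund triage handoff", "winner": "guided", "passed": True},
    {"name": "Dinner playlist builder", "winner": "guided", "passed": True},
]


def summarize(case_records):
    """Return (passed count, total count, wins per side) for the given cases."""
    wins = Counter(case["winner"] for case in case_records)
    passed = sum(1 for case in case_records if case["passed"])
    return passed, len(case_records), wins


passed, total, wins = summarize(cases)
summary = f"{passed}/{total} cases passed"
# Counter returns 0 for sides that never won, so no special-casing is needed.
breakdown = (
    f"{wins['guided']} guided wins, "
    f"{wins['baseline']} baseline wins, "
    f"{wins['tie']} ties."
)
print(summary)
print(breakdown)
```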

Claim level: repeated_pair_signal
Run date: 2026-05-12
MCP release: 0.1.0
Run: run-003
Eval id: judgmentkit-ui-generation-paired-artifact-v1
Metric scale: 0-5

Qualitative paired-artifact evidence only; not a statistically powered benchmark.

Case review

Refund triage handoff

Review the selected refund request and prepare the next handoff.

Passed
Winner: JudgmentKit guided
Expected winner: JudgmentKit guided
Score delta: +96 (guided)
Threshold: 20
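
The report does not spell out the pass rule. One plausible reading, sketched below, is that a case passes when the observed winner matches the expected winner and the score delta meets the threshold. The function name and the choice of "at least" rather than "strictly greater than" are assumptions.

```python
def case_passed(winner, expected_winner, score_delta, threshold):
    """Assumed pass rule: the winner matches and the delta reaches the threshold."""
    return winner == expected_winner and score_delta >= threshold


# Values from this case: guided won as expected with a +96 delta against a threshold of 20.
refund_case = case_passed("JudgmentKit guided", "JudgmentKit guided", 96, 20)

# Hypothetical counter-example: the expected side wins, but by too small a margin.
narrow_case = case_passed("JudgmentKit guided", "JudgmentKit guided", 12, 20)

print(refund_case, narrow_case)
```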

Metric comparison

Baseline and guided scores use the 0-5 metric scale; totals remain 0-100 weighted.

Metric                      Baseline                      Guided                        Delta
Activity Fit                0/5 (0 present, 5 missing)    5/5 (5 present, 0 missing)    +5
Decision Support            0/5 (0 present, 4 missing)    5/5 (4 present, 0 missing)    +5
Disclosure Discipline       0/5 (8 leaks)                 5/5 (0 leaks)                 +5
Handoff Completeness        1/5 (1 present, 4 missing)    5/5 (5 present, 0 missing)    +4
Task Success Support        0/5 (0 present, 5 missing)    5/5 (5 present, 0 missing)    +5
Confidence Rework Signals   0/5 (0 present, 3 missing)    5/5 (3 present, 0 missing)    +5

Activity-fit evidence

0 to 5 matched terms

Guided output surfaced more of the task vocabulary reviewers need to judge activity fit.

Baseline matched (0) None
Guided matched (5)
  • Daily triage
  • Refund escalation queue
  • Customer refund escalation
  • Evidence checklist
  • Policy review context
Guided missing (0) None
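
Matched and missing terms can be thought of as a vocabulary check against the artifact's visible text. The sketch below assumes case-insensitive substring matching; the matching rule, the function name, and both sample texts are made up for illustration and are not taken from the compared apps.

```python
EXPECTED_TERMS = [
    "Daily triage",
    "Refund escalation queue",
    "Customer refund escalation",
    "Evidence checklist",
    "Policy review context",
]


def match_terms(artifact_text, expected_terms):
    """Split expected terms into (matched, missing) by case-insensitive substring."""
    text = artifact_text.casefold()
    matched = [term for term in expected_terms if term.casefold() in text]
    missing = [term for term in expected_terms if term.casefold() not in text]
    return matched, missing


# Invented sample texts standing in for the two artifacts.
guided_text = (
    "Daily triage: refund escalation queue. Customer refund escalation "
    "for the selected case, with an evidence checklist and policy review context."
)
baseline_text = "Edit the record and submit the form."

guided_matched, guided_missing = match_terms(guided_text, EXPECTED_TERMS)
baseline_matched, baseline_missing = match_terms(baseline_text, EXPECTED_TERMS)
print(len(baseline_matched), "to", len(guided_matched), "matched terms")
```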

Implementation leakage

8 leaks to 0 leaks

Leakage findings count terms that make implementation mechanics visible in the primary artifact.

Baseline leakage (8 leaks)
  • database table
  • JSON schema
  • prompt template
  • tool call
  • resource id
  • API endpoint
  • CRUD
  • field
Guided leakage (0 leaks) None
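
Leak counting as described above can be sketched as a scan for implementation terms in the artifact text. The whole-word, case-insensitive matching, the function name, and the sample texts are assumptions; only the term list comes from this report.

```python
import re

LEAK_TERMS = [
    "database table",
    "JSON schema",
    "prompt template",
    "tool call",
    "resource id",
    "API endpoint",
    "CRUD",
    "field",
]


def find_leaks(artifact_text, leak_terms):
    """Return the leak terms that appear as whole words or phrases in the text."""
    found = []
    for term in leak_terms:
        pattern = r"\b" + re.escape(term) + r"\b"
        if re.search(pattern, artifact_text, flags=re.IGNORECASE):
            found.append(term)
    return found


# Invented sample texts standing in for the two artifacts.
baseline_text = "Call the API endpoint, then update each field in the database table."
guided_text = "Review the evidence checklist and choose the next owner."

baseline_leaks = find_leaks(baseline_text, LEAK_TERMS)
guided_leaks = find_leaks(guided_text, LEAK_TERMS)
print(len(baseline_leaks), "leaks to", len(guided_leaks), "leaks")
```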
Expected outcomes and rationale

Expected outcomes

  • The reviewer can identify the selected case.
  • The reviewer can choose approve, policy review, or return for evidence.
  • The reviewer can complete a handoff with a reason and next owner.

Rationale

  • JudgmentKit-guided artifact scored 96 points above baseline.
  • Implementation leakage changed from 8 baseline terms to 0 guided terms.
  • Activity-fit evidence changed from 0 matched terms to 5 matched terms.

Case review

Dinner playlist builder

Build a 10-song dinner playlist that starts mellow, lifts in the middle, avoids disliked artists and explicit tracks, and leaves a sequence note.

Passed
Winner: JudgmentKit guided
Expected winner: JudgmentKit guided
Score delta: +88.82 (guided)
Threshold: 20

Metric comparison

Baseline and guided scores use the 0-5 metric scale; totals remain 0-100 weighted.

Metric                      Baseline                         Guided                        Delta
Activity Fit                0.83/5 (1 present, 5 missing)    5/5 (6 present, 0 missing)    +4.17
Decision Support            0/5 (0 present, 5 missing)       5/5 (5 present, 0 missing)    +5
Disclosure Discipline       0/5 (10 leaks)                   5/5 (0 leaks)                 +5
Handoff Completeness        1.25/5 (1 present, 3 missing)    5/5 (4 present, 0 missing)    +3.75
Task Success Support        1.43/5 (2 present, 5 missing)    5/5 (7 present, 0 missing)    +3.57
Confidence Rework Signals   0/5 (0 present, 3 missing)       5/5 (3 present, 0 missing)    +5
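
The report states that totals are weighted onto a 0-100 range but does not list the weights. The sketch below shows how such a weighted total and delta could be computed from the 0-5 metric scores. The weights are hypothetical, chosen only so that they sum to 100; with them the sketch happens to arrive at this case's reported delta, which does not confirm that the eval uses these weights.

```python
# HYPOTHETICAL weights: not given anywhere in the report. They sum to 100.
WEIGHTS = {
    "Activity Fit": 20,
    "Decision Support": 20,
    "Disclosure Discipline": 20,
    "Handoff Completeness": 20,
    "Task Success Support": 10,
    "Confidence Rework Signals": 10,
}


def weighted_total(scores, weights, scale=5):
    """Map 0-5 metric scores onto a 0-100 total using per-metric weights."""
    return sum(weights[name] * scores[name] / scale for name in weights)


# Metric scores for this case, as listed in the comparison above.
baseline_scores = {
    "Activity Fit": 0.83,
    "Decision Support": 0,
    "Disclosure Discipline": 0,
    "Handoff Completeness": 1.25,
    "Task Success Support": 1.43,
    "Confidence Rework Signals": 0,
}
guided_scores = {name: 5 for name in WEIGHTS}

baseline_total = weighted_total(baseline_scores, WEIGHTS)
guided_total = weighted_total(guided_scores, WEIGHTS)
delta = round(guided_total - baseline_total, 2)
print(f"{delta:+.2f} guided delta")
```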

Activity-fit evidence

1 to 6 matched terms

Guided output surfaced more of the task vocabulary reviewers need to judge activity fit.

Baseline matched (1)
  • Sequence note
Guided matched (6)
  • Dinner brief
  • Guest preferences
  • Suggested tracks
  • Playlist sequence
  • Conflict checks
  • Sequence note
Guided missing (0) None

Implementation leakage

10 leaks to 0 leaks

Leakage findings count terms that make implementation mechanics visible in the primary artifact.

Baseline leakage (10 leaks)
  • data model
  • track table field
  • JSON schema
  • prompt template
  • tool call
  • resource id
  • API endpoint
  • CRUD
  • field
  • model
Guided leakage (0 leaks) None
Expected outcomes and rationale

Expected outcomes

  • The host can assemble a 10-song playlist.
  • The host can catch explicit and disliked-artist conflicts.
  • The host can explain the sequence from mellow start through lifted middle to soft close.

Rationale

  • JudgmentKit-guided artifact scored 88.82 points above baseline.
  • Implementation leakage changed from 10 baseline terms to 0 guided terms.
  • Activity-fit evidence changed from 1 matched term to 6 matched terms.