UI generation eval report

JudgmentKit UI-Generation Eval

Deterministic paired-artifact scoring for existing standalone comparison apps. Use this report to review winner, delta, leakage, and activity-fit evidence by case.

Latest run outcome

2/2 cases passed

2 guided wins, 0 baseline wins, 0 ties.

Claim level: repeated_pair_signal
Run date: 2026-05-12
MCP release: 0.1.0
Run: run-005
Eval id: judgmentkit-ui-generation-paired-artifact-v1
Metric scale: 0-5

Qualitative paired-artifact evidence only; not a statistically powered benchmark.

Case review

Refund triage handoff

Review the selected refund request and prepare the next handoff.

Passed

Winner: JudgmentKit guided
Expected winner: JudgmentKit guided
Score delta: +96
Threshold: 20
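
The report does not state how the pass verdict is derived from these fields. A minimal sketch, assuming a case passes when the observed winner matches the expected winner and the score delta is at least the threshold (the function name and the inclusive comparison are assumptions, not confirmed by the report):

```python
def case_passes(winner: str, expected_winner: str,
                score_delta: float, threshold: float) -> bool:
    """Hypothetical pass rule: winner matches expectation and delta clears the threshold.

    Assumption: the comparison is inclusive (>=); the report does not say.
    """
    return winner == expected_winner and score_delta >= threshold


# Values taken from this case: guided wins as expected, delta +96, threshold 20.
print(case_passes("JudgmentKit guided", "JudgmentKit guided", 96, 20))
```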

Raw baseline

Version A

4/100

+96 guided delta

JudgmentKit guided

Version B

100/100

Visual evidence

Screenshots show the initial viewport of each committed artifact. They are archived with this run and are not scoring inputs.

Raw baseline

Version A Desktop

1365x900

Refund triage handoff Raw baseline Desktop screenshot

JudgmentKit guided

Version B Desktop

1365x900

Refund triage handoff JudgmentKit guided Desktop screenshot

Mobile screenshots

Raw baseline

Version A Mobile

390x844

Refund triage handoff Raw baseline Mobile screenshot

JudgmentKit guided

Version B Mobile

390x844

Refund triage handoff JudgmentKit guided Mobile screenshot

Metric comparison

Baseline and guided scores use the 0-5 metric scale; totals remain 0-100 weighted.

Metric	Baseline	Guided	Delta
Activity Fit	0/5 (0 present, 5 missing)	5/5 (5 present, 0 missing)	+5
Decision Support	0/5 (0 present, 4 missing)	5/5 (4 present, 0 missing)	+5
Disclosure Discipline	0/5 (8 leaks)	5/5 (0 leaks)	+5
Handoff Completeness	1/5 (1 present, 4 missing)	5/5 (5 present, 0 missing)	+4
Task Success Support	0/5 (0 present, 5 missing)	5/5 (5 present, 0 missing)	+5
Confidence Rework Signals	0/5 (0 present, 3 missing)	5/5 (3 present, 0 missing)	+5
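
The present/missing counts in the table appear to map onto the 0-5 scale as a simple coverage ratio. The sketch below is inferred from the reported numbers rather than taken from the scorer itself; the function name and the two-decimal rounding are assumptions. It does not cover Disclosure Discipline, which is reported in leak counts.

```python
def coverage_score(present: int, missing: int, scale: float = 5.0) -> float:
    """Hypothetical mapping from present/missing counts to a 0-5 metric score.

    Assumption: score = scale * present / (present + missing), rounded to 2 places.
    """
    total = present + missing
    if total == 0:
        return 0.0  # nothing expected, so nothing to score
    return round(scale * present / total, 2)


# Handoff Completeness, baseline: 1 present, 4 missing
print(coverage_score(1, 4))
# Handoff Completeness, guided: 5 present, 0 missing
print(coverage_score(5, 0))
```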

Activity-fit evidence

0 to 5 matched terms

Guided output surfaced more of the task vocabulary reviewers need to judge activity fit.

Baseline matched (0)

None

Guided matched (5)

Daily triage
Refund escalation queue
Customer refund escalation
Evidence checklist
Policy review context

Guided missing (0)

None

Implementation leakage

8 leaks to 0 leaks

Leakage findings count terms that make implementation mechanics visible in the primary artifact.

Baseline leakage (8 leaks)

database table
JSON schema
prompt template
tool call
resource id
API endpoint
CRUD
field

Guided leakage (0 leaks)

None
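
As an illustration of how leak terms like those listed above could be counted, here is a minimal sketch. It assumes case-insensitive substring matching against the artifact's visible text; the actual matching rule is not described in this report, and the function name and sample text are hypothetical.

```python
def find_leaks(artifact_text: str, leak_terms: list[str]) -> list[str]:
    """Return the leak terms that occur in the artifact text.

    Assumption: a term counts as a leak if it appears anywhere in the text,
    ignoring case. Each term is counted at most once.
    """
    lowered = artifact_text.lower()
    return [term for term in leak_terms if term.lower() in lowered]


# Leak terms reported for this case's baseline artifact.
terms = ["database table", "JSON schema", "prompt template", "tool call",
         "resource id", "API endpoint", "CRUD", "field"]

# Hypothetical artifact text, not taken from either version.
sample = "Edit the JSON schema, then retry the tool call."
leaks = find_leaks(sample, terms)
print(len(leaks), leaks)
```

With substring matching, a short term such as "field" also matches inside longer phrases, so a real scorer may need word-boundary handling.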

Expected outcomes and rationale

Expected outcomes

The reviewer can identify the selected case.
The reviewer can choose approve, policy review, or return for evidence.
The reviewer can complete a handoff with a reason and next owner.

Rationale

JudgmentKit-guided artifact scored 96 points above baseline.
Implementation leakage changed from 8 baseline terms to 0 guided terms.
Activity-fit evidence changed from 0 matched terms to 5 matched terms.

Case review

Dinner playlist builder

Build a 10-song dinner playlist that starts mellow, lifts in the middle, avoids disliked artists and explicit tracks, and leaves a sequence note.

Passed

Winner: JudgmentKit guided
Expected winner: JudgmentKit guided
Score delta: +88.82
Threshold: 20

Raw baseline

Version A

11.18/100

+88.82 guided delta

JudgmentKit guided

Version B

100/100

Visual evidence

Screenshots show the initial viewport of each committed artifact. They are archived with this run and are not scoring inputs.

Raw baseline

Version A Desktop

1365x900

Dinner playlist builder Raw baseline Desktop screenshot

JudgmentKit guided

Version B Desktop

1365x900

Dinner playlist builder JudgmentKit guided Desktop screenshot

Mobile screenshots

Raw baseline

Version A Mobile

390x844

Dinner playlist builder Raw baseline Mobile screenshot

JudgmentKit guided

Version B Mobile

390x844

Dinner playlist builder JudgmentKit guided Mobile screenshot

Metric comparison

Baseline and guided scores use the 0-5 metric scale; totals remain 0-100 weighted.

Metric	Baseline	Guided	Delta
Activity Fit	0.83/5 (1 present, 5 missing)	5/5 (6 present, 0 missing)	+4.17
Decision Support	0/5 (0 present, 5 missing)	5/5 (5 present, 0 missing)	+5
Disclosure Discipline	0/5 (10 leaks)	5/5 (0 leaks)	+5
Handoff Completeness	1.25/5 (1 present, 3 missing)	5/5 (4 present, 0 missing)	+3.75
Task Success Support	1.43/5 (2 present, 5 missing)	5/5 (7 present, 0 missing)	+3.57
Confidence Rework Signals	0/5 (0 present, 3 missing)	5/5 (3 present, 0 missing)	+5
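
The report says totals are weighted onto a 0-100 scale but does not list the weights. The sketch below shows one way such a total could be computed. The weights are hypothetical: they were chosen only so that they sum to 100 and are consistent with the baseline total reported for this case, and the real weights may differ.

```python
# HYPOTHETICAL weights (sum to 100); the report does not publish the real ones.
HYPOTHETICAL_WEIGHTS = {
    "Activity Fit": 20,
    "Decision Support": 20,
    "Disclosure Discipline": 20,
    "Handoff Completeness": 20,
    "Task Success Support": 10,
    "Confidence Rework Signals": 10,
}


def weighted_total(scores: dict[str, float], weights: dict[str, float],
                   scale: float = 5.0) -> float:
    """Convert 0-5 metric scores into a 0-100 weighted total, rounded to 2 places."""
    return round(sum(scores[metric] / scale * weight
                     for metric, weight in weights.items()), 2)


# Per-metric scores from the table above.
playlist_baseline = {
    "Activity Fit": 0.83,
    "Decision Support": 0.0,
    "Disclosure Discipline": 0.0,
    "Handoff Completeness": 1.25,
    "Task Success Support": 1.43,
    "Confidence Rework Signals": 0.0,
}
playlist_guided = {metric: 5.0 for metric in HYPOTHETICAL_WEIGHTS}

baseline_total = weighted_total(playlist_baseline, HYPOTHETICAL_WEIGHTS)
guided_total = weighted_total(playlist_guided, HYPOTHETICAL_WEIGHTS)
print(baseline_total, guided_total, round(guided_total - baseline_total, 2))
```

Because three of the baseline metrics score 0 in this case, the totals alone do not pin down those metrics' weights; any split of the remaining 50 points would produce the same baseline total.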

Activity-fit evidence

1 to 6 matched terms

Guided output surfaced more of the task vocabulary reviewers need to judge activity fit.

Baseline matched (1)

Sequence note

Guided matched (6)

Dinner brief
Guest preferences
Suggested tracks
Playlist sequence
Conflict checks
Sequence note

Guided missing (0)

None

Implementation leakage

10 leaks to 0 leaks

Leakage findings count terms that make implementation mechanics visible in the primary artifact.

Baseline leakage (10 leaks)

data model
track table field
JSON schema
prompt template
tool call
resource id
API endpoint
CRUD
field
model

Guided leakage (0 leaks)

None

Expected outcomes and rationale

Expected outcomes

The host can assemble a 10-song playlist.
The host can catch explicit and disliked-artist conflicts.
The host can explain the sequence from mellow start through lifted middle to soft close.

Rationale

JudgmentKit-guided artifact scored 88.82 points above baseline.
Implementation leakage changed from 10 baseline terms to 0 guided terms.
Activity-fit evidence changed from 1 matched term to 6 matched terms.