JudgmentKit UI-Generation Eval

Deterministic paired-artifact scoring for existing standalone comparison apps.

Eval id: judgmentkit-ui-generation-paired-artifact-v1
Claim level: repeated_pair_signal
Run date: 2026-05-12
MCP release: 0.1.0
Run: run-002
Cases: 2
Passed: 2
Guided wins: 2
Baseline wins: 0

Qualitative paired-artifact evidence only; not a statistically powered benchmark.

Refund triage handoff

Review the selected refund request and prepare the next handoff.

Passed
Winner: judgmentkit_handoff
Expected winner: judgmentkit_handoff
Score delta: 96
Minimum delta: 20

Rationale

  • JudgmentKit-guided artifact scored 96 points above baseline.
  • Implementation leakage changed from 8 baseline terms to 0 guided terms.
  • Activity-fit evidence changed from 0 matched terms to 5 matched terms.
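The score delta and pass status reported for this case can be reproduced with simple arithmetic. The following is a minimal sketch, not JudgmentKit's implementation. It assumes the delta is the guided total minus the baseline total, and that a case passes when the expected winner wins by at least the minimum delta. `PairResult` and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairResult:
    """Hypothetical container for one paired case; not a JudgmentKit type."""

    baseline_id: str
    baseline_score: float  # Version A total, out of 100
    guided_id: str
    guided_score: float  # Version B total, out of 100
    expected_winner: str
    minimum_delta: float

    @property
    def delta(self) -> float:
        # Assumed: delta is guided total minus baseline total, rounded to 2 places.
        return round(self.guided_score - self.baseline_score, 2)

    @property
    def winner(self) -> str:
        # Assumed: the higher total wins; a tie falls back to the baseline here.
        return self.guided_id if self.delta > 0 else self.baseline_id

    @property
    def passed(self) -> bool:
        # Assumed pass rule: the expected winner wins by at least the minimum delta.
        return self.winner == self.expected_winner and self.delta >= self.minimum_delta


# Totals taken from the refund triage case in this report.
refund = PairResult(
    "raw_brief_baseline", 4, "judgmentkit_handoff", 100, "judgmentkit_handoff", 20
)
print(refund.delta, refund.winner, refund.passed)
```

Under these assumptions the refund case yields a delta of 96, which clears the minimum delta of 20.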

Expected Outcomes

  • The reviewer can identify the selected case.
  • The reviewer can choose approve, policy review, or return for evidence.
  • The reviewer can complete a handoff with a reason and next owner.

Version A

raw_brief_baseline

4/100

examples/comparison/version-a.html

Metric | Score | Summary | Evidence
activity_fit | 0 | 0 present
Present: None
Missing:
  • Daily triage
  • Refund escalation queue
  • Customer refund escalation
  • Evidence checklist
  • Policy review context
decision_support | 0 | 0 present
Present: None
Missing:
  • Choose a path
  • Approve refund
  • Send to policy review
  • Return for evidence
disclosure_discipline | 0 | 8 leaks
Implementation leakage:
  • database table
  • JSON schema
  • prompt template
  • tool call
  • resource id
  • API endpoint
  • CRUD
  • field
Review-packet leakage: None
handoff_completeness | 1 | 1 present
Present:
  • Handoff
Missing:
  • Next owner
  • Support agent
  • Policy reviewer
  • Handoff reason
task_success_support | 0 | 0 present
Present: None
Missing:
  • Review selected case
  • Check evidence
  • Choose refund path
  • Prepare handoff
  • Receipt photo is missing
confidence_rework_signals | 0 | 0 present
Present: None
Missing:
  • Policy review context
  • Evidence checklist
  • missing receipt photo
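The Present, Missing, and leakage lists above read like the output of matching fixed term lists against the artifact text. The sketch below shows one deterministic way to do that. The matching rule (case-insensitive substring search) is an assumption, `match_terms` is a hypothetical helper, and the sample text is invented for illustration rather than taken from version-a.html.

```python
def match_terms(text: str, terms: list[str]) -> tuple[list[str], list[str]]:
    """Split terms into (present, missing) by case-insensitive substring match."""
    haystack = text.casefold()
    present = [t for t in terms if t.casefold() in haystack]
    missing = [t for t in terms if t.casefold() not in haystack]
    return present, missing


# Hypothetical page text, invented for this example.
sample = "Handoff: edit the database table row, then call the API endpoint."

# Term lists copied from the handoff_completeness and disclosure_discipline rows.
handoff_terms = ["Handoff", "Next owner", "Support agent", "Policy reviewer", "Handoff reason"]
leak_terms = ["database table", "JSON schema", "API endpoint", "CRUD"]

present, missing = match_terms(sample, handoff_terms)
leaks, _ = match_terms(sample, leak_terms)
print(f"{len(present)} present, {len(missing)} missing, {len(leaks)} leaks")
```

A plain substring check is the simplest deterministic rule, but it can overcount: a short term such as "field" also matches inside a longer phrase such as "track table field". A stricter scorer might match on word boundaries instead.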

Version B

judgmentkit_handoff

100/100

examples/comparison/version-b.html

Metric | Score | Summary | Evidence
activity_fit | 5 | 5 present
Present:
  • Daily triage
  • Refund escalation queue
  • Customer refund escalation
  • Evidence checklist
  • Policy review context
Missing: None
decision_support | 5 | 4 present
Present:
  • Choose a path
  • Approve refund
  • Send to policy review
  • Return for evidence
Missing: None
disclosure_discipline | 5 | 0 leaks
Implementation leakage: None
Review-packet leakage: None
handoff_completeness | 5 | 5 present
Present:
  • Handoff
  • Next owner
  • Support agent
  • Policy reviewer
  • Handoff reason
Missing: None
task_success_support | 5 | 5 present
Present:
  • Review selected case
  • Check evidence
  • Choose refund path
  • Prepare handoff
  • Receipt photo is missing
Missing: None
confidence_rework_signals | 5 | 3 present
Present:
  • Policy review context
  • Evidence checklist
  • missing receipt photo
Missing: None

Dinner playlist builder

Build a 10-song dinner playlist that starts mellow, lifts in the middle, avoids disliked artists and explicit tracks, and leaves a sequence note.

Passed
Winner: judgmentkit_handoff
Expected winner: judgmentkit_handoff
Score delta: 88.82
Minimum delta: 20

Rationale

  • JudgmentKit-guided artifact scored 88.82 points above baseline.
  • Implementation leakage changed from 10 baseline terms to 0 guided terms.
  • Activity-fit evidence changed from 1 matched term to 6 matched terms.

Expected Outcomes

  • The host can assemble a 10-song playlist.
  • The host can catch explicit and disliked-artist conflicts.
  • The host can explain the sequence from mellow start through lifted middle to soft close.

Version A

raw_brief_baseline

11.18/100

examples/comparison/music/version-a.html

Metric | Score | Summary | Evidence
activity_fit | 0.83 | 1 present
Present:
  • Sequence note
Missing:
  • Dinner brief
  • Guest preferences
  • Suggested tracks
  • Playlist sequence
  • Conflict checks
decision_support | 0 | 0 present
Present: None
Missing:
  • Add to playlist
  • Move earlier
  • Move later
  • Remove track
  • Mark as conflict
disclosure_discipline | 0 | 10 leaks
Implementation leakage:
  • data model
  • track table field
  • JSON schema
  • prompt template
  • tool call
  • resource id
  • API endpoint
  • CRUD
  • field
  • model
Review-packet leakage: None
handoff_completeness | 1.25 | 1 present
Present:
  • Sequence note
Missing:
  • Save playlist
  • Share playlist
  • 10 tracks ready to play
task_success_support | 1.43 | 2 present
Present:
  • explicit track
  • disliked artist
Missing:
  • genre balance
  • energy flow
  • mellow opener
  • warm middle
  • closing track
confidence_rework_signals | 0 | 0 present
Present: None
Missing:
  • Fits brief
  • Conflict checks
  • ready to play

Version B

judgmentkit_handoff

100/100

examples/comparison/music/version-b.html

Metric | Score | Summary | Evidence
activity_fit | 5 | 6 present
Present:
  • Dinner brief
  • Guest preferences
  • Suggested tracks
  • Playlist sequence
  • Conflict checks
  • Sequence note
Missing: None
decision_support | 5 | 5 present
Present:
  • Add to playlist
  • Move earlier
  • Move later
  • Remove track
  • Mark as conflict
Missing: None
disclosure_discipline | 5 | 0 leaks
Implementation leakage: None
Review-packet leakage: None
handoff_completeness | 5 | 4 present
Present:
  • Save playlist
  • Share playlist
  • Sequence note
  • 10 tracks ready to play
Missing: None
task_success_support | 5 | 7 present
Present:
  • explicit track
  • disliked artist
  • genre balance
  • energy flow
  • mellow opener
  • warm middle
  • closing track
Missing: None
confidence_rework_signals | 5 | 3 present
Present:
  • Fits brief
  • Conflict checks
  • ready to play
Missing: None