v2.1 vs v3.0

The two scoring versions answer different questions about the same telemetry. This page lays them out side by side.

| | v2.1 | v3.0 |
|---|---|---|
| Number of scores | Three (Speed · Skill · Efficiency) | One (PRISM Score) |
| What it measures | Prompting behavior | Session outcomes |
| Unit | 0–100 (Skill) · hours/week (Speed) · tokens/hour (Efficiency) | 0–100 (% of sessions crushed) |
| Primary input | LLM rubric on every prompt + behavioral proxies | LLM outcome judge on every session |
| Baseline (B grade) | 7.0–7.9 | Calibrated to median developer |

| Aspect | v2.1 | v3.0 |
|---|---|---|
| Composite | Skill = 100 × (0.45·SSE + 0.20·PES + 0.15·IE + 0.10·CRR + 0.10·FC) | PRISM Score = 100 × crushed_count / total_count |
| Unit of evaluation | Sub-session (one coherent goal) | Session (until /clear or topic shift) |
| Goal detection | Implicit: read from prompt rubric and convergence | Explicit: per-intent outcome judge |
| Intent classes | 1 (generic) | 8 (Question · Investigation · Review · Plan · Small change · Bug fix · Feature · Refactor) |
| Floor for inclusion | All sub-sessions counted | Substance floor (≥3 turns OR ≥10 LOC OR ≥1 mutating tool call) |
| Rework signal | Surfaces inside SSE | First-class fact: fails the session |
| /clear handling | Sub-session boundary | Session boundary, with 10-min anti-fragmentation merge |
| Quality discount | Speed multiplied by QR | None: sessions either crush or don’t |
| Trivial-chat behavior | Drags Skill down | Excluded from both numerator and denominator |
| LLM cost | One rubric call per prompt | One rubric + one outcome call per session |
| Session labels | None: sub-sessions show up by id/timestamp | LLM-generated title + 1–2 sentence summary on every session, user-editable from the dashboard |

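The Composite and Floor for inclusion rows can be sketched in Python as follows. This is a minimal illustration, not the engine's implementation. Only the weights, the floor thresholds, and the two formulas come from the table. The `Session` record, its field names, the function names, the assumption that each v2.1 component is already normalized to the range 0 to 1, and the choice to return `None` when no session passes the floor are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# v2.1 Skill weights, copied from the Composite row.
SKILL_WEIGHTS = {"SSE": 0.45, "PES": 0.20, "IE": 0.15, "CRR": 0.10, "FC": 0.10}


def skill_v21(components: Dict[str, float]) -> float:
    """Weighted v2.1 Skill composite (assumes each component lies in [0, 1])."""
    return 100 * sum(w * components[name] for name, w in SKILL_WEIGHTS.items())


@dataclass
class Session:
    # Hypothetical record shape, for illustration only.
    turns: int
    loc_changed: int
    mutating_tool_calls: int
    crushed: bool  # verdict from the outcome judge


def passes_substance_floor(s: Session) -> bool:
    """Substance floor: >=3 turns OR >=10 LOC OR >=1 mutating tool call."""
    return s.turns >= 3 or s.loc_changed >= 10 or s.mutating_tool_calls >= 1


def prism_score_v30(sessions: List[Session]) -> Optional[float]:
    """100 * crushed_count / total_count over sessions that pass the floor."""
    counted = [s for s in sessions if passes_substance_floor(s)]
    if not counted:
        return None  # assumption: no score when nothing qualifies
    crushed_count = sum(1 for s in counted if s.crushed)
    return 100 * crushed_count / len(counted)


sessions = [
    Session(turns=5, loc_changed=40, mutating_tool_calls=2, crushed=True),
    Session(turns=4, loc_changed=0, mutating_tool_calls=0, crushed=False),
    Session(turns=1, loc_changed=0, mutating_tool_calls=0, crushed=False),  # trivial chat
    Session(turns=2, loc_changed=12, mutating_tool_calls=0, crushed=True),
]

print(round(prism_score_v30(sessions), 1))  # 66.7
print(round(skill_v21({"SSE": 0.8, "PES": 0.6, "IE": 0.7, "CRR": 0.9, "FC": 0.5}), 1))  # 72.5
```

The third session fails every floor condition, so it drops out of both the numerator and the denominator. That leaves two crushed sessions out of three counted, which is how trivial chat stops dragging the score down in v3.0.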
  • Telemetry pipeline — OTel from the Prism plugin → Ingest → NATS → Engine. Unchanged.
  • Letter-grade scale — A+ through F, B = baseline. Same table for both scores.
  • Per-prompt PES rubric — still computed, still visible on the Insight page. v3.0 uses it as one input to the outcome judge.
  • Sub-session reviews: /intelligence/reviews still shows the prompt-level breakdown for any session.

| Area | v2.1 → v3.0 |
|---|---|
| Score hub route | /prism/score-v3 |
| Hero metric | Skill (0–100) |
| Speed page | Hours/week with QR discount |
| Efficiency | Tokens per active hour |
| Outcome attribution | Behavioral inference |
| Failure modes | Many small score deductions |

During the calibration window, both versions run side by side. Use v2.1 if you want to coach prompt behavior turn by turn. Use v3.0 if you want to know how often your sessions actually shipped what you set out to ship.

Toggle between them in the dashboard sidebar — the toggle is labeled PRISM Score v3.

Once v3.0 reaches steady-state agreement against the hand-labeled calibration set (target: ≥85% on the outcome judge, ≥90% on intent classification), v2.1 retires from the default UI. Historical v2.1 scores remain queryable via the API for at least 6 months after retirement.
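The retirement condition above reduces to two agreement percentages checked against their targets. The sketch below shows one way to compute them, assuming the judge output and the hand labels are available as parallel lists of strings. The function names and label values are hypothetical. The page does not define "steady-state", so this checks a single snapshot only.

```python
from typing import List

OUTCOME_TARGET = 85.0  # percent agreement required on the outcome judge
INTENT_TARGET = 90.0   # percent agreement required on intent classification


def agreement(predicted: List[str], hand_labeled: List[str]) -> float:
    """Percent of items where the v3.0 output matches the hand label."""
    if not hand_labeled or len(predicted) != len(hand_labeled):
        raise ValueError("need two non-empty lists of equal length")
    matches = sum(p == h for p, h in zip(predicted, hand_labeled))
    return 100 * matches / len(hand_labeled)


def meets_retirement_targets(outcome_pct: float, intent_pct: float) -> bool:
    """True when both agreement figures reach their targets."""
    return outcome_pct >= OUTCOME_TARGET and intent_pct >= INTENT_TARGET


# Illustrative labels only, not real calibration data.
judge = ["crushed", "failed", "crushed", "crushed"]
hand = ["crushed", "failed", "failed", "crushed"]
print(agreement(judge, hand))  # 75.0
```

A production gate would presumably require the targets to hold over several consecutive evaluation runs before v2.1 leaves the default UI.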

See the algorithm overview for the mechanics behind v3.0.