v2.1 vs v3.0

v2.1 and v3.0 answer different questions about the same telemetry. This page compares them side by side.

The headline difference

Aspect	v2.1	v3.0
Number of scores	Three (Speed · Skill · Efficiency)	One (PRISM Score)
What it measures	Prompting behavior	Session outcomes
Unit	0–100 (Skill) · hours/week (Speed) · tokens/hour (Efficiency)	0–100 (% of sessions crushed)
Primary input	LLM rubric on every prompt + behavioral proxies	LLM outcome judge on every session
Baseline (B grade)	7.0–7.9	Calibrated to median developer

Side-by-side metrics

Aspect	v2.1	v3.0
Composite	`Skill = 100 × (0.45·SSE + 0.20·PES + 0.15·IE + 0.10·CRR + 0.10·FC)`	`PRISM Score = 100 × crushed_count / total_count`
Unit of evaluation	Sub-session (one coherent goal)	Session (until `/clear` or topic shift)
Goal detection	Implicit — read from prompt rubric and convergence	Explicit — per-intent outcome judge
Intent classes	1 (generic)	8 (Question · Investigation · Review · Plan · Small change · Bug fix · Feature · Refactor)
Floor for inclusion	All sub-sessions counted	Substance floor (≥3 turns OR ≥10 LOC OR ≥1 mutating tool call)
Rework signal	Surfaces inside SSE	First-class fact — fails the session
`/clear` handling	Sub-session boundary	Session boundary, with 10-min anti-fragmentation merge
Quality discount	Speed multiplied by QR	None — sessions either crush or don’t
Trivial-chat behavior	Drags Skill down	Excluded from both numerator and denominator
LLM cost	One rubric call per prompt	One rubric + one outcome call per session
Session labels	None — sub-sessions show up by id/timestamp	LLM-generated title + 1–2 sentence summary on every session, user-editable from the dashboard
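
The two composites and the v3.0 substance floor from the table can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Engine's implementation: the `Session` record, its field names, and the function names are hypothetical, and the five v2.1 components are assumed to be normalized to the range 0–1. Only the weights, the floor thresholds, and the crushed/total ratio come from the table.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Weights copied from the v2.1 composite in the table; they sum to 1.0.
SKILL_WEIGHTS = {"SSE": 0.45, "PES": 0.20, "IE": 0.15, "CRR": 0.10, "FC": 0.10}


def skill_v21(components: Dict[str, float]) -> float:
    """v2.1 Skill: weighted sum of components (assumed to lie in [0, 1])."""
    return 100 * sum(w * components[name] for name, w in SKILL_WEIGHTS.items())


@dataclass
class Session:
    """Hypothetical session record; the field names are assumptions."""
    turns: int
    loc_changed: int
    mutating_tool_calls: int
    crushed: bool  # verdict of the outcome judge


def passes_substance_floor(s: Session) -> bool:
    """Floor from the table: >=3 turns OR >=10 LOC OR >=1 mutating tool call."""
    return s.turns >= 3 or s.loc_changed >= 10 or s.mutating_tool_calls >= 1


def prism_score_v30(sessions: List[Session]) -> Optional[float]:
    """v3.0 PRISM Score: percentage of counted sessions that were crushed."""
    counted = [s for s in sessions if passes_substance_floor(s)]
    if not counted:
        # Assumption: with nothing above the floor, the score is undefined.
        return None
    return 100 * sum(s.crushed for s in counted) / len(counted)
```

In this sketch a session below the floor is dropped before the ratio is taken, so trivial chat leaves both the numerator and the denominator instead of lowering the score.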

What stayed the same

Telemetry pipeline — OTel from the Prism plugin → Ingest → NATS → Engine. Unchanged.
Letter-grade scale — A+ through F, B = baseline. Same table for both scores.
Per-prompt PES rubric — still computed, still visible on the Insight page. v3.0 uses it as one input to the outcome judge.
Sub-session reviews — `/intelligence/reviews` still shows the prompt-level breakdown for any session.

What changed

Area	v2.1 → v3.0
Score hub route	`/prism` → `/score-v3`
Hero metric	Skill (0–100)
Speed page	Hours/week with QR discount
Efficiency	Tokens per active hour
Outcome attribution	Behavioral inference
Failure modes	Many small score deductions

Which score should I trust?

During the calibration window, both versions run side by side. Use v2.1 if you want to coach prompt behavior turn-by-turn. Use v3.0 if you want to know how often your sessions actually shipped what you set out to ship.

Switch between them with the toggle labeled PRISM Score v3 in the dashboard sidebar.

When v2.1 retires

Once v3.0 reaches steady-state agreement against the hand-labeled calibration set (target: ≥85% on the outcome judge, ≥90% on intent classification), v2.1 retires from the default UI. Historical v2.1 scores remain queryable via the API for at least 6 months after retirement.
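
The retirement gate can be read as two agreement thresholds checked against the hand-labeled set. The following is a rough sketch only: the function names and the list-of-labels input shape are hypothetical, and it evaluates a single snapshot, whereas how "steady-state" is established over time is not specified here.

```python
from typing import List, Sequence

# Targets quoted in the retirement criteria.
OUTCOME_JUDGE_TARGET = 0.85
INTENT_CLASSIFIER_TARGET = 0.90


def agreement(predicted: Sequence[str], labeled: Sequence[str]) -> float:
    """Fraction of items where the model's label matches the hand label."""
    if not labeled or len(predicted) != len(labeled):
        raise ValueError("need two non-empty sequences of equal length")
    matches = sum(p == h for p, h in zip(predicted, labeled))
    return matches / len(labeled)


def v21_can_retire(
    outcome_pred: List[str],
    outcome_true: List[str],
    intent_pred: List[str],
    intent_true: List[str],
) -> bool:
    """True when both agreement targets are met on this snapshot."""
    return (
        agreement(outcome_pred, outcome_true) >= OUTCOME_JUDGE_TARGET
        and agreement(intent_pred, intent_true) >= INTENT_CLASSIFIER_TARGET
    )
```

Both conditions must hold at once; a strong outcome judge paired with a weak intent classifier would keep v2.1 in the default UI under this reading.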

See the algorithm overview for the mechanics behind v3.0.