v2.1 vs v3.0
The two scores answer different questions about the same telemetry. This page lays them out side by side.
The headline difference
| | v2.1 | v3.0 |
|---|---|---|
| Number of scores | Three (Speed · Skill · Efficiency) | One (PRISM Score) |
| What it measures | Prompting behavior | Session outcomes |
| Unit | 0–100 (Skill) · hours/week (Speed) · tokens/hour (Efficiency) | 0–100 (% of sessions crushed) |
| Primary input | LLM rubric on every prompt + behavioral proxies | LLM outcome judge on every session |
| Baseline (B grade) | 7.0–7.9 | Calibrated to median developer |
Side-by-side metrics
| Aspect | v2.1 | v3.0 |
|---|---|---|
| Composite | Skill = 100 × (0.45·SSE + 0.20·PES + 0.15·IE + 0.10·CRR + 0.10·FC) | PRISM Score = 100 × crushed_count / total_count |
| Unit of evaluation | Sub-session (one coherent goal) | Session (until /clear or topic shift) |
| Goal detection | Implicit — read from prompt rubric and convergence | Explicit — per-intent outcome judge |
| Intent classes | 1 (generic) | 8 (Question · Investigation · Review · Plan · Small change · Bug fix · Feature · Refactor) |
| Floor for inclusion | All sub-sessions counted | Substance floor (≥3 turns OR ≥10 LOC OR ≥1 mutating tool call) |
| Rework signal | Surfaces inside SSE | First-class fact — fails the session |
| /clear handling | Sub-session boundary | Session boundary, with 10-min anti-fragmentation merge |
| Quality discount | Speed multiplied by QR | None — sessions either crush or don’t |
| Trivial-chat behavior | Drags Skill down | Excluded from both numerator and denominator |
| LLM cost | One rubric call per prompt | One rubric + one outcome call per session |
| Session labels | None — sub-sessions show up by id/timestamp | LLM-generated title + 1–2 sentence summary on every session, user-editable from the dashboard |
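As a rough illustration of the two composites in the table, the sketch below computes both in Python. The weights and the substance-floor thresholds are taken from the table. Everything else is an assumption made for illustration and not the engine's actual implementation: the `Session` fields, the function names, the idea that each v2.1 component is normalized to 0–1, and the reading of `total_count` as "sessions that pass the substance floor".

```python
from dataclasses import dataclass
from typing import Optional

# Weights copied from the v2.1 composite row in the table.
SKILL_WEIGHTS = {"SSE": 0.45, "PES": 0.20, "IE": 0.15, "CRR": 0.10, "FC": 0.10}


def skill_v21(components: dict[str, float]) -> float:
    """v2.1 Skill: weighted sum of component scores, scaled to 0-100.

    Assumes every component is already normalized to the range [0, 1].
    """
    return 100 * sum(w * components[name] for name, w in SKILL_WEIGHTS.items())


@dataclass
class Session:
    # Hypothetical field names; the real telemetry schema may differ.
    turns: int
    loc_changed: int
    mutating_tool_calls: int
    crushed: bool  # verdict from the outcome judge


def passes_substance_floor(s: Session) -> bool:
    """Substance floor from the table: >=3 turns OR >=10 LOC OR >=1 mutating tool call."""
    return s.turns >= 3 or s.loc_changed >= 10 or s.mutating_tool_calls >= 1


def prism_score_v30(sessions: list[Session]) -> Optional[float]:
    """v3.0 PRISM Score: 100 * crushed_count / total_count.

    Sessions below the substance floor are dropped from both the numerator
    and the denominator. Returns None when nothing qualifies.
    """
    counted = [s for s in sessions if passes_substance_floor(s)]
    if not counted:
        return None
    crushed_count = sum(s.crushed for s in counted)
    return 100 * crushed_count / len(counted)
```

Under these assumptions, a one-turn chat with no edits and no mutating tool calls never reaches the denominator, so it cannot lower the score. Four qualifying sessions of which three crushed would give a PRISM Score of 75.0.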
What stayed the same
- Telemetry pipeline — OTel from the Prism plugin → Ingest → NATS → Engine. Unchanged.
- Letter-grade scale — A+ through F, B = baseline. Same table for both scores.
- Per-prompt PES rubric — still computed, still visible on the Insight page. v3.0 uses it as one input to the outcome judge.
- Sub-session reviews — /intelligence/reviews still shows the prompt-level breakdown for any session.
What changed
| Area | v2.1 → v3.0 |
|---|---|
| Score hub route | /prism → /score-v3 |
| Hero metric | Skill (0–100) |
| Speed page | Hours/week with QR discount |
| Efficiency | Tokens per active hour |
| Outcome attribution | Behavioral inference |
| Failure modes | Many small score deductions |
Which score should I trust?
During the calibration window, both run. Use v2.1 if you want to coach prompt behavior turn-by-turn. Use v3.0 if you want to know how often your sessions actually shipped what you set out to ship.
Toggle between them in the dashboard sidebar — the toggle is labeled PRISM Score v3.
When v2.1 retires
Once v3.0 reaches steady-state agreement against the hand-labeled calibration set (target: ≥85% on the outcome judge, ≥90% on intent classification), v2.1 retires from the default UI. Historical v2.1 scores remain queryable via the API for at least 6 months after retirement.
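The retirement targets can be read as a simple gate over two agreement rates. The sketch below is one possible way to express that check. The function names and the label strings are hypothetical, and it evaluates a single measurement only: how "steady-state" is established over time is not specified here and is not modeled.

```python
# Targets quoted from the retirement criteria above.
OUTCOME_TARGET = 0.85  # outcome judge agreement
INTENT_TARGET = 0.90   # intent classification agreement


def agreement(predicted: list[str], labeled: list[str]) -> float:
    """Fraction of items where the model's label matches the hand label."""
    if not labeled or len(predicted) != len(labeled):
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(p == h for p, h in zip(predicted, labeled))
    return matches / len(labeled)


def meets_retirement_targets(outcome_agreement: float, intent_agreement: float) -> bool:
    """True when both agreement rates reach their targets in this one measurement."""
    return outcome_agreement >= OUTCOME_TARGET and intent_agreement >= INTENT_TARGET
```

Both conditions must hold together, so strong intent classification does not compensate for an outcome judge that falls short of its target.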
See the algorithm overview for the mechanics behind v3.0.