Algorithm Overview

PRISM Score v3.0 is one number: the percentage of qualifying sessions that crushed their goal.

PRISM Score = 100 × crushed_count / total_count

A session is crushed when four facts all land:

  1. Substance floor passed — ≥3 turns OR ≥10 net lines of code OR ≥1 mutating tool call
  2. Goal complete — the LLM outcome judge says the session landed its goal under per-intent criteria
  3. Not rework — no later session reverts or rewrites the same code
  4. Rubric judge committed to an intent — a clear goal was established

Sessions that fail the substance floor or never establish a goal drop out of both the numerator and the denominator. Trivial chat does not pad or punish the score.
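
As a minimal sketch of the substance floor described above, the three conditions can be OR-ed together. The `SessionStats` type and its field names are hypothetical, chosen only for illustration; how the real detector counts turns, net lines, or mutating tool calls is not specified here.

```python
from dataclasses import dataclass


@dataclass
class SessionStats:
    """Hypothetical per-session counters; names are illustrative only."""
    turns: int
    net_lines_of_code: int
    mutating_tool_calls: int


def substance_floor_passed(stats: SessionStats) -> bool:
    # Any one of the three conditions is enough to pass the floor.
    return (
        stats.turns >= 3
        or stats.net_lines_of_code >= 10
        or stats.mutating_tool_calls >= 1
    )
```

Under this sketch, a two-turn session with nine net lines and no mutating tool call would fall below the floor and drop out of the score entirely.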

The score is built from two independent algorithms feeding one formula:

  1. Session boundary detection — when does one session end and the next begin?
  2. Crushed-session determination — for each closed session, did it crush its goal?

A session closes when any of these fire — never on time alone:

| Signal | Detection |
| --- | --- |
| /clear | OTel event from the plugin; explicit context wipe |
| Topic shift | LLM call comparing consecutive user turns; closes if topic differs materially |
| Git commit (Phase 2) | A commit lands touching session-modified files |

/compact is not a boundary — it preserves the goal in summarized form. Idle and meeting gaps are not boundaries either — they create false closures.
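
The boundary rules above could be sketched as a small lookup. The `Event` enum and `closes_session` function are hypothetical names, and the topic-shift event is assumed to arrive already decided by the LLM call; this is an illustration of the rules, not the engine's actual code.

```python
from enum import Enum


class Event(Enum):
    """Hypothetical event kinds; names are illustrative only."""
    CLEAR = "clear"
    COMPACT = "compact"
    TOPIC_SHIFT = "topic_shift"   # assumed to be the LLM's positive verdict
    GIT_COMMIT = "git_commit"     # Phase 2 signal
    IDLE_GAP = "idle_gap"


# Only these event kinds can ever close a session.
BOUNDARY_EVENTS = {Event.CLEAR, Event.TOPIC_SHIFT, Event.GIT_COMMIT}


def closes_session(event: Event, touches_session_files: bool = False) -> bool:
    # A commit only counts when it touches files the session modified.
    if event is Event.GIT_COMMIT:
        return touches_session_files
    # /compact and idle gaps are never boundaries.
    return event in BOUNDARY_EVENTS
```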

/clear can over-fire when developers clear mid-task. A post-close merger rejoins sessions that:

  • Restart within 10 minutes of the previous close, AND
  • Touch ≥50% of the same files

The merger is the prerequisite for treating /clear as a hard boundary. Without it, /clear-spam would tank the score of diligent developers.
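
A rough sketch of the merge test follows. The text does not say which session the 50% is measured against, so this version assumes the overlap is a share of the files touched by the newer session, and treats both thresholds as inclusive. `should_merge` and its parameters are hypothetical names.

```python
from datetime import datetime, timedelta

MERGE_WINDOW = timedelta(minutes=10)
MIN_FILE_OVERLAP = 0.5


def should_merge(
    prev_closed_at: datetime,
    next_started_at: datetime,
    prev_files: set,
    next_files: set,
) -> bool:
    gap = next_started_at - prev_closed_at
    # Condition 1: the next session restarts within the merge window.
    if gap < timedelta(0) or gap > MERGE_WINDOW:
        return False
    # Condition 2: enough file overlap (assumption: measured against
    # the newer session's file set).
    if not next_files:
        return False
    overlap = len(prev_files & next_files) / len(next_files)
    return overlap >= MIN_FILE_OVERLAP
```

Both conditions must hold, which matches the AND in the list above: a quick restart on unrelated files stays a separate session.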

/clear is the strongest single boundary signal. Here is what happens when the plugin emits one:

  1. Session closes. The current session is sealed and queued for scoring. The next user prompt opens a fresh session.
  2. The closed session is graded as-is. Whatever state the work was in at the /clear moment is what the outcome judge sees. If you /clear before verifying a fix, the session may not crush (see “How to crush a session explicitly”).
  3. Anti-fragmentation merge runs. If the next session starts within 10 minutes and touches ≥50% of the same files, the two sessions are merged back into one. The merged session is re-scored as a single unit, so a mid-task /clear does not split one task into two failed sessions.
  4. /compact is different. It compresses the transcript but keeps the goal — so it is not a session boundary. Use /compact when you want to free context without ending the task; use /clear when you are actually starting something new.

The practical rule: use /clear between tasks, not inside one. If you clear mid-task and immediately keep working, the merger usually catches it, but it is safer not to rely on the merger.

Every closed session writes four facts:

| Fact | Source | What it answers |
| --- | --- | --- |
| Substance floor | Deterministic detector | Did real work happen? (≥3 turns OR ≥10 net lines of code OR ≥1 mutating tool call) |
| Goal complete | LLM outcome judge | Did the session land its goal under per-intent criteria? |
| Tokens used | OTel | How many tokens did this session burn? |
| Rework | Deterministic detector | Did a later session revert or rewrite this one? |

The outcome judge is one LLM call per session. It takes the transcript plus the rubric judge’s intent classification, and decides goal_complete against intent-specific rules:

| Intent | goal_complete = TRUE when… | Silent-completion floor |
| --- | --- | --- |
| Question | User accepted the answer (no follow-up clarification) | 0.50 |
| Investigation | Reached a stated conclusion (root cause, hypothesis confirmed/refuted) | 0.60 |
| Review | Produced an actionable verdict (approve / reject / specific changes) | 0.60 |
| Plan / Spec | Plan accepted in-session (explicit “let’s build it” or implicit acceptance via implementation) | 0.60 |
| Small change | Change applied + no immediate revert | 0.50 |
| Bug fix | Fix applied + verification evidence (test pass, error gone, repro confirmed broken) | 0.75 |
| Feature | Scaffolded matching scope + acceptance + (tests OR explicit “tests later”) | 0.75 |
| Refactor | Behavior-preservation evidence (tests green, lint clean, type check passes) | 0.75 |

High-stakes intents (Bug fix, Feature, Refactor) require stronger evidence to count as crushed when only silent-completion signals are available.

The judge prefers explicit completion signals — you saying “thanks, that works”, a test passing, an error disappearing. When those aren’t there, the judge falls back on implicit evidence: the conversation just ended, you moved on, no follow-up complaint.

The silent-completion floor is the confidence threshold the judge must clear on that implicit evidence alone to mark the session crushed. A higher floor means more circumstantial evidence is required.

  • 0.50 (Question, Small change) — mild confidence is enough; these are low-stakes.
  • 0.60 (Investigation, Review, Plan) — moderate confidence; the conclusion or verdict has to be clearly inferable.
  • 0.75 (Bug fix, Feature, Refactor) — high confidence; without explicit verification, the judge defaults to “not crushed”.

If you fix a bug and never confirm it, the session will most likely not crush — the floor is set high on purpose to keep the score honest.
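
To make the thresholding concrete, here is a small sketch under stated assumptions. The real decision is made by the LLM outcome judge; this only models the floor comparison. It assumes the judge yields a confidence value for implicit evidence and that "clearing" the floor means greater than or equal. The function name and parameters are hypothetical.

```python
# Floors copied from the per-intent table above.
SILENT_COMPLETION_FLOOR = {
    "Question": 0.50,
    "Investigation": 0.60,
    "Review": 0.60,
    "Plan / Spec": 0.60,
    "Small change": 0.50,
    "Bug fix": 0.75,
    "Feature": 0.75,
    "Refactor": 0.75,
}


def goal_complete(intent: str, explicit_signal: bool, implicit_confidence: float) -> bool:
    # Explicit completion signals are preferred and need no floor.
    if explicit_signal:
        return True
    # Otherwise implicit evidence must clear the intent's floor
    # (assumption: the comparison is inclusive).
    return implicit_confidence >= SILENT_COMPLETION_FLOOR[intent]
```

With these numbers, an unverified bug fix judged at 0.70 on implicit evidence stays below its 0.75 floor, while a question at 0.50 passes.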

You don’t have to do anything special — most sessions crush on their own when work actually lands. But the judge is far more confident when you leave explicit signals. A few quick habits raise the hit rate:

| Intent | Easiest explicit signal |
| --- | --- |
| Question | Acknowledge the answer (“got it”, “thanks”) and don’t ask a follow-up clarification. |
| Investigation | State the conclusion in the chat: “root cause is X” / “hypothesis confirmed”. |
| Review | Give a verdict: “approve”, “reject”, or list the specific changes you want. |
| Plan / Spec | Say “let’s build it”, or just start implementing in the same session. |
| Small change | Apply the change; don’t immediately revert it. |
| Bug fix | Run the failing test or repro and confirm it’s green / the error is gone. |
| Feature | Land the scaffolding with tests (or explicitly defer them: “tests later”). |
| Refactor | Run tests, lint, and type check; mention that they pass. |

What to avoid: ending mid-fix without verifying, switching topics with /clear before confirming the work, or trailing off with “I’ll come back to this”.

Any of these in the final 2–3 turns forces goal_complete = FALSE:

| Signal | Pattern |
| --- | --- |
| Unresolved error | Stack trace, Error:, failing test with no subsequent fix |
| User frustration | “ugh”, “still broken”, “never mind”, “this isn’t working” |
| Scope abandoned | “let’s try something else”, “I’ll come back to this”, “skip it” |
| Iteration loop | ≥3 attempts on the same fix with no resolution visible |

crushed = substance_floor_passed
      AND judge_status == "ok"
      AND goal_complete
      AND NOT rework

PRISM Score = 100 × crushed_count / total_count

The denominator, total_count, counts only sessions where substance_floor_passed is TRUE and judge_status is "ok". Trivial sessions and goal-less ambient chat drop out of both the numerator and the denominator.
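
The formula and its denominator rule can be sketched in a few lines. `ScoredSession` and the function names are hypothetical, and returning `None` when no session qualifies is an assumption; the text does not say what the score is in that case.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class ScoredSession:
    """Hypothetical record of a closed, scored session."""
    substance_floor_passed: bool
    judge_status: str
    goal_complete: bool
    rework: bool


def qualifies(s: ScoredSession) -> bool:
    # Only qualifying sessions enter the denominator.
    return s.substance_floor_passed and s.judge_status == "ok"


def is_crushed(s: ScoredSession) -> bool:
    return qualifies(s) and s.goal_complete and not s.rework


def prism_score(sessions: Iterable[ScoredSession]) -> Optional[float]:
    qualifying = [s for s in sessions if qualifies(s)]
    if not qualifying:
        return None  # assumption: score is undefined with no qualifying sessions
    crushed_count = sum(1 for s in qualifying if is_crushed(s))
    return 100 * crushed_count / len(qualifying)


sessions = [
    ScoredSession(True, "ok", True, False),    # crushed
    ScoredSession(True, "ok", True, False),    # crushed
    ScoredSession(True, "ok", True, True),     # later reworked
    ScoredSession(True, "ok", False, False),   # goal not complete
    ScoredSession(False, "ok", False, False),  # trivial, excluded
]
print(prism_score(sessions))  # 50.0
```

The trivial fifth session leaves the result unchanged: two crushed out of four qualifying sessions gives 50.0 whether or not it is present.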

Four LLM agents back v3.0. All share one contract (Haiku 4.5 default, temperature = 0, JSON-only output, judge_version pinned per persisted row):

| # | Agent | Job |
| --- | --- | --- |
| 1 | Sub-Session Tracker | Topic-shift boundary detection |
| 2 | Language Detection | Tags the session’s primary language |
| 3 | Rubric Judge | Intent classification + 7-boolean rubric + letter grade + sub-session title and summary |
| 4 | Outcome Judge | goal_complete against per-intent criteria |

A fifth agent (Task Classifier, for TUR) is deferred until after launch.

Every transcript that leaves the engine goes through the existing redaction path. Secrets are stripped before any LLM sees them. Tenant opt-out is honored by the same code path — no agent reads raw transcripts directly.

For the full per-agent specifications, see the internal design docs in the optra-prism repository (docs/prism-score-v3-*.md).