Algorithm Overview
Summary
PRISM Score v3.0 is one number: the percentage of qualifying sessions that crushed their goal.
PRISM Score = 100 × crushed_count / total_count
A session is crushed when four facts all land:
- Substance floor passed — ≥3 turns OR ≥10 net lines of code OR ≥1 mutating tool call
- Goal complete — the LLM outcome judge says the session landed its goal under per-intent criteria
- Not rework — no later session reverts or rewrites the same code
- Rubric judge committed to an intent — a clear goal was established
Sessions that fail the substance floor or never establish a goal drop out of both the numerator and the denominator. Trivial chat does not pad or punish the score.
The score is built from two independent algorithms feeding one formula:
- Session boundary detection — when does one session end and the next begin?
- Crushed-session determination — for each closed session, did it crush its goal?
Session boundaries
A session closes when any of these fire, never on time alone:
| Signal | Detection |
|---|---|
| /clear | OTel event from the plugin (explicit context wipe) |
| Topic shift | LLM call comparing consecutive user turns; closes if topic differs materially |
| Git commit (Phase 2) | A commit lands touching session-modified files |
/compact is not a boundary — it preserves the goal in summarized form. Idle and meeting gaps are not boundaries either — they create false closures.
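
The boundary rules above can be sketched as a small predicate. This is an illustration under stated assumptions, not the engine's actual code: the `Event` enum, the `Signal` type, and the `touches_session_files` flag are hypothetical names, and the topic-shift decision is assumed to have already been made by the LLM call.

```python
from dataclasses import dataclass
from enum import Enum


class Event(Enum):
    """Hypothetical event kinds; the names are illustrative only."""
    CLEAR = "clear"              # /clear: explicit context wipe
    COMPACT = "compact"          # /compact: summarizes but keeps the goal
    TOPIC_SHIFT = "topic_shift"  # LLM judged consecutive user turns materially different
    GIT_COMMIT = "git_commit"    # Phase 2: a commit landed
    IDLE_GAP = "idle_gap"        # time passed with no activity


@dataclass(frozen=True)
class Signal:
    event: Event
    # Only meaningful for GIT_COMMIT: does the commit touch session-modified files?
    touches_session_files: bool = False


def closes_session(signal: Signal) -> bool:
    """Return True if this signal should close the current session."""
    if signal.event in (Event.CLEAR, Event.TOPIC_SHIFT):
        return True
    if signal.event is Event.GIT_COMMIT:
        return signal.touches_session_files
    # /compact and idle gaps never close a session.
    return False
```

Keeping idle gaps as an explicit event that returns `False` makes the "never on time alone" rule visible in the code rather than implied by omission.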
Anti-fragmentation merge
/clear can over-fire when developers clear mid-task. A post-close merger rejoins sessions that:
- Restart within 10 minutes of the previous close, AND
- Touch ≥50% of the same files
The merger is the prerequisite for treating /clear as a hard boundary. Without it, /clear-spam would tank the score of diligent developers.
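
A minimal sketch of the merge test follows. The 10-minute window and 50% threshold come from the rules above; everything else is an assumption. In particular, the text does not say what the 50% is measured against, so this sketch assumes "shared files divided by the smaller of the two file sets". `ClosedSession`, `file_overlap`, and `should_merge` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

MERGE_WINDOW = timedelta(minutes=10)
MIN_FILE_OVERLAP = 0.5


@dataclass(frozen=True)
class ClosedSession:
    started_at: datetime
    closed_at: datetime
    files: frozenset[str]  # paths touched during the session


def file_overlap(prev: ClosedSession, nxt: ClosedSession) -> float:
    """Shared files as a fraction of the smaller file set (assumed definition)."""
    smaller = min(len(prev.files), len(nxt.files))
    if smaller == 0:
        return 0.0
    return len(prev.files & nxt.files) / smaller


def should_merge(prev: ClosedSession, nxt: ClosedSession) -> bool:
    """Both conditions must hold: quick restart AND enough shared files."""
    gap = nxt.started_at - prev.closed_at
    restarted_quickly = timedelta(0) <= gap <= MERGE_WINDOW
    return restarted_quickly and file_overlap(prev, nxt) >= MIN_FILE_OVERLAP
```

A different overlap definition (for example, Jaccard similarity) would merge fewer sessions at the same threshold, so the choice matters when tuning.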
How /clear is handled
/clear is the strongest single boundary signal. Here is what happens when the plugin emits one:
- Session closes. The current session is sealed and queued for scoring. The next user prompt opens a fresh session.
- The closed session is graded as-is. Whatever state the work was in at the /clear moment is what the outcome judge sees. If you /clear before verifying a fix, the session may not crush; see How to crush a session explicitly.
- Anti-fragmentation merge runs. If the next session starts within 10 minutes and touches ≥50% of the same files, the two sessions are merged back into one. The merged session is re-scored as a single unit, so a mid-task /clear does not split one task into two failed sessions.
- /compact is different. It compresses the transcript but keeps the goal, so it is not a session boundary. Use /compact when you want to free context without ending the task; use /clear when you are actually starting something new.
The practical rule: use /clear between tasks, not inside one. If you clear mid-task and immediately keep working, the merger usually catches it, but it is safer not to rely on the merger.
The four facts
Every closed session writes four facts:
| Fact | Source | What it answers |
|---|---|---|
| Substance floor | Deterministic detector | Did real work happen? (≥3 turns OR ≥10 net lines of code OR ≥1 mutating tool call) |
| Goal complete | LLM outcome judge | Did the session land its goal under per-intent criteria? |
| Tokens used | OTel | How many tokens did this session burn? |
| Rework | Deterministic detector | Did a later session revert or rewrite this one? |
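
The substance floor is a plain OR over three counters. A sketch, assuming the counters have already been extracted from the session (the function and parameter names are hypothetical):

```python
def substance_floor_passed(turns: int, net_lines_of_code: int, mutating_tool_calls: int) -> bool:
    """Any one of the three thresholds is enough to pass the floor."""
    return turns >= 3 or net_lines_of_code >= 10 or mutating_tool_calls >= 1
```

For example, a two-turn session that makes a single file edit passes on the tool-call condition alone, while a two-turn chat with no edits does not pass.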
Per-intent completion criteria
The outcome judge is one LLM call per session. It takes the transcript plus the rubric judge’s intent classification, and decides goal_complete against intent-specific rules:
| Intent | goal_complete = TRUE when… | Silent-completion floor |
|---|---|---|
| Question | User accepted the answer (no follow-up clarification) | 0.50 |
| Investigation | Reached a stated conclusion (root cause, hypothesis confirmed/refuted) | 0.60 |
| Review | Produced an actionable verdict (approve / reject / specific changes) | 0.60 |
| Plan / Spec | Plan accepted in-session (explicit “let’s build it” or implicit acceptance via implementation) | 0.60 |
| Small change | Change applied + no immediate revert | 0.50 |
| Bug fix | Fix applied + verification evidence (test pass, error gone, repro confirmed broken) | 0.75 |
| Feature | Scaffolded matching scope + acceptance + (tests OR explicit “tests later”) | 0.75 |
| Refactor | Behavior-preservation evidence (tests green, lint clean, type check passes) | 0.75 |
High-stakes intents (Bug fix, Feature, Refactor) require stronger evidence to count as crushed when only silent-completion signals are available.
What “Silent-completion floor” means
The judge prefers explicit completion signals: you saying “thanks, that works”, a test passing, an error disappearing. When those aren’t there, the judge falls back on implicit evidence: the conversation just ended, you moved on, no follow-up complaint.
The silent-completion floor is the confidence threshold the judge must clear on that implicit evidence alone to mark the session crushed. A higher floor means more circumstantial evidence is required.
- 0.50 (Question, Small change): mild confidence is enough; these are low-stakes.
- 0.60 (Investigation, Review, Plan): moderate confidence; the conclusion or verdict has to be clearly inferable.
- 0.75 (Bug fix, Feature, Refactor): high confidence; without explicit verification, the judge defaults to “not crushed”.
If you fix a bug and never confirm it, the session will most likely not crush — the floor is set high on purpose to keep the score honest.
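
The thresholding step can be sketched as a lookup plus a comparison. The floor values are taken from the table above; the function name, the `implicit_confidence` input, and the choice of `>=` for "clearing" the floor are assumptions. The sketch covers only the silent-completion path, not explicit signals or failure signals.

```python
# Floors copied from the per-intent criteria table.
SILENT_COMPLETION_FLOOR = {
    "Question": 0.50,
    "Investigation": 0.60,
    "Review": 0.60,
    "Plan / Spec": 0.60,
    "Small change": 0.50,
    "Bug fix": 0.75,
    "Feature": 0.75,
    "Refactor": 0.75,
}


def clears_silent_floor(intent: str, implicit_confidence: float) -> bool:
    """True if confidence from implicit evidence alone meets the intent's floor.

    Assumes the floor is inclusive (>=); the source text does not specify.
    """
    return implicit_confidence >= SILENT_COMPLETION_FLOOR[intent]
```

The same implicit confidence of 0.70 would therefore be enough for a Question or a Review, but not for a Bug fix.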
How to crush a session explicitly
You don’t have to do anything special: most sessions crush on their own when work actually lands. But the judge is far more confident when you leave explicit signals. A few quick habits raise the hit rate:
| Intent | Easiest explicit signal |
|---|---|
| Question | Acknowledge the answer (“got it”, “thanks”) and don’t ask a follow-up clarification. |
| Investigation | State the conclusion in the chat: “root cause is X” / “hypothesis confirmed”. |
| Review | Give a verdict: “approve”, “reject”, or list the specific changes you want. |
| Plan / Spec | Say “let’s build it” — or just start implementing in the same session. |
| Small change | Apply the change; don’t immediately revert it. |
| Bug fix | Run the failing test or repro and confirm it’s green / the error is gone. |
| Feature | Land the scaffolding with tests (or explicitly defer them: “tests later”). |
| Refactor | Run tests, lint, and type check; mention that they pass. |
What to avoid: ending mid-fix without verifying, switching topics with /clear before confirming the work, or trailing off with “I’ll come back to this”.
Failure signals
Any of these in the final 2–3 turns forces goal_complete = FALSE:
| Signal | Pattern |
|---|---|
| Unresolved error | Stack trace, Error:, failing test with no subsequent fix |
| User frustration | “ugh”, “still broken”, “never mind”, “this isn’t working” |
| Scope abandoned | “let’s try something else”, “I’ll come back to this”, “skip it” |
| Iteration loop | ≥3 attempts on the same fix with no resolution visible |
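
To make the two phrase-based rows concrete, here is a rough keyword sketch. It is an assumption-heavy illustration: in v3.0 this judgment is made by the LLM outcome judge, not by pattern matching, and the unresolved-error and iteration-loop rows need context that a regex cannot see. The function name, signal labels, and the choice of a 3-turn tail are hypothetical.

```python
import re

TAIL_TURNS = 3  # assumed upper end of "final 2-3 turns"

# Phrases copied from the table; \b keeps "ugh" from matching inside "though".
FRUSTRATION = re.compile(r"\b(?:ugh|still broken|never mind|this isn't working)\b")
ABANDONED = re.compile(r"\b(?:let's try something else|i'll come back to this|skip it)\b")


def _normalize(text: str) -> str:
    """Lowercase and map curly apostrophes to straight ones before matching."""
    return text.lower().replace("\u2019", "'")


def failure_signals(user_turns: list[str]) -> set[str]:
    """Return the phrase-based failure signals found in the last few user turns."""
    tail = [_normalize(t) for t in user_turns[-TAIL_TURNS:]]
    found: set[str] = set()
    if any(FRUSTRATION.search(t) for t in tail):
        found.add("user_frustration")
    if any(ABANDONED.search(t) for t in tail):
        found.add("scope_abandoned")
    return found
```

Only the tail is scanned, so frustration voiced early in a session that later recovers does not count against it.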
The crushed-session formula
crushed = substance_floor_passed AND judge_status == "ok" AND goal_complete AND NOT rework
PRISM Score = 100 × crushed_count / total_count
The denominator counts only sessions where substance_floor_passed = TRUE AND judge_status = "ok". Trivial sessions and goal-less ambient chat drop out of both sides.
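
The formula translates directly into code. The field names below mirror the formula; the `SessionFacts` type, the `"no_intent"` status used in the usage example, and returning `None` when no session qualifies are assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class SessionFacts:
    substance_floor_passed: bool
    judge_status: str   # "ok" is the only value the formula treats as qualifying
    goal_complete: bool
    rework: bool


def qualifies(s: SessionFacts) -> bool:
    """Denominator membership: real work happened and the judge returned "ok"."""
    return s.substance_floor_passed and s.judge_status == "ok"


def is_crushed(s: SessionFacts) -> bool:
    """Numerator membership: qualifies, landed its goal, and was not reworked."""
    return qualifies(s) and s.goal_complete and not s.rework


def prism_score(sessions: Iterable[SessionFacts]) -> Optional[float]:
    """100 x crushed / qualifying, or None if nothing qualifies (assumed behavior)."""
    qualifying = [s for s in sessions if qualifies(s)]
    if not qualifying:
        return None
    crushed = sum(is_crushed(s) for s in qualifying)
    return 100 * crushed / len(qualifying)
```

A short usage example with six sessions, two of which drop out of both sides:

```python
sessions = [
    SessionFacts(True, "ok", True, False),          # crushed
    SessionFacts(True, "ok", True, False),          # crushed
    SessionFacts(True, "ok", False, False),         # goal not complete
    SessionFacts(True, "ok", True, True),           # later reworked
    SessionFacts(False, "ok", True, False),         # trivial: excluded
    SessionFacts(True, "no_intent", True, False),   # no goal: excluded
]
print(prism_score(sessions))  # 2 crushed of 4 qualifying -> 50.0
```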
The agents
Four LLM agents back v3.0. All share one contract (Haiku 4.5 default, temperature = 0, JSON-only output, judge_version pinned per persisted row):
| # | Agent | Job |
|---|---|---|
| 1 | Sub-Session Tracker | Topic-shift boundary detection |
| 2 | Language Detection | Tags the session’s primary language |
| 3 | Rubric Judge | Intent classification + 7-boolean rubric + letter grade + sub-session title and summary |
| 4 | Outcome Judge | goal_complete against per-intent criteria |
A fifth agent (Task Classifier, for TUR) is deferred post-launch.
Privacy
Every transcript that leaves the engine goes through the existing redaction path. Secrets are stripped before any LLM sees them. Tenant opt-out is honored by the same code path, so no agent reads raw transcripts directly.
For the full per-agent specifications, see the internal design docs in the optra-prism repository (docs/prism-score-v3-*.md).