Algorithm Overview

Summary

PRISM Score v3.0 is one number: the percentage of qualifying sessions that crushed their goal.

PRISM Score = 100 × crushed_count / total_count

A session is crushed when four facts all land:

Substance floor passed — ≥3 turns OR ≥10 net lines of code OR ≥1 mutating tool call
Goal complete — the LLM outcome judge says the session landed its goal under per-intent criteria
Not rework — no later session reverts or rewrites the same code
Rubric judge committed to an intent — a clear goal was established
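
The substance floor in the list above is a plain OR over three counters, which can be sketched as follows. This is a minimal illustration, not the engine's detector: the `SessionActivity` class and its field names are hypothetical, and the source does not say how net lines of code are counted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SessionActivity:
    """Hypothetical container; field names are illustrative only."""
    turns: int
    net_lines_of_code: int
    mutating_tool_calls: int


def passes_substance_floor(activity: SessionActivity) -> bool:
    # Any single condition is enough: >=3 turns OR >=10 net lines OR >=1 mutating tool call.
    return (
        activity.turns >= 3
        or activity.net_lines_of_code >= 10
        or activity.mutating_tool_calls >= 1
    )


if __name__ == "__main__":
    print(passes_substance_floor(SessionActivity(turns=3, net_lines_of_code=0, mutating_tool_calls=0)))
    print(passes_substance_floor(SessionActivity(turns=2, net_lines_of_code=9, mutating_tool_calls=0)))
```

A two-turn session with nine net lines and no mutating tool call falls below every threshold, so it would drop out of the score entirely rather than count as a failure.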

Sessions that fail the substance floor or never establish a goal drop out of both the numerator and the denominator. Trivial chat does not pad or punish the score.

The score is built from two independent algorithms feeding one formula:

Session boundary detection — when does one session end and the next begin?
Crushed-session determination — for each closed session, did it crush its goal?

Session boundaries

A session closes when any of these fire — never on time alone:

Signal	Detection
`/clear`	OTel event from the plugin — explicit context wipe
Topic shift	LLM call comparing consecutive user turns; closes if topic differs materially
Git commit (Phase 2)	A commit lands touching session-modified files

/compact is not a boundary: it preserves the goal in summarized form. Idle and meeting gaps are not boundaries either, because closing a session on a pause alone would produce false closures.
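
The boundary rules above amount to an allow-list of closing signals. The sketch below is one way to express that, under assumptions: the `Event` enum, the `phase2_enabled` flag, and the `commit_touches_session_files` parameter are hypothetical names, and the topic-shift decision is treated as an already-computed event rather than an LLM call.

```python
from enum import Enum


class Event(Enum):
    """Hypothetical event names for illustration."""
    CLEAR = "/clear"
    TOPIC_SHIFT = "topic_shift"
    GIT_COMMIT = "git_commit"
    COMPACT = "/compact"
    IDLE_GAP = "idle_gap"


def closes_session(
    event: Event,
    *,
    phase2_enabled: bool = False,
    commit_touches_session_files: bool = False,
) -> bool:
    # /clear and a detected topic shift always close the session.
    if event in (Event.CLEAR, Event.TOPIC_SHIFT):
        return True
    # A commit only counts in Phase 2, and only if it touches session-modified files.
    if event is Event.GIT_COMMIT:
        return phase2_enabled and commit_touches_session_files
    # /compact and idle or meeting gaps never close a session.
    return False


if __name__ == "__main__":
    for event in Event:
        print(event.value, closes_session(event))
```

Keeping the non-boundaries as explicit enum members makes it visible that they were considered and rejected, rather than simply forgotten.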

Anti-fragmentation merge

/clear can over-fire when developers clear mid-task. A post-close merger rejoins sessions that:

Restart within 10 minutes of the previous close, AND
Touch ≥50% of the same files

The merger is the prerequisite for treating /clear as a hard boundary. Without it, /clear-spam would tank the score of diligent developers.
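
The two merge conditions can be sketched as a single predicate. This is an illustration under stated assumptions: the function names are hypothetical, and the source does not say what the 50% is measured against, so the sketch assumes the share of the new session's files that the previous session also touched.

```python
from datetime import datetime, timedelta

MERGE_WINDOW = timedelta(minutes=10)
MIN_FILE_OVERLAP = 0.5


def file_overlap(prev_files: set[str], next_files: set[str]) -> float:
    # Assumption: overlap is the fraction of the new session's files
    # that the previous session also touched.
    if not next_files:
        return 0.0
    return len(prev_files & next_files) / len(next_files)


def should_merge(
    prev_closed_at: datetime,
    next_started_at: datetime,
    prev_files: set[str],
    next_files: set[str],
) -> bool:
    gap = next_started_at - prev_closed_at
    within_window = timedelta(0) <= gap <= MERGE_WINDOW
    # Both conditions must hold: a quick restart AND enough shared files.
    return within_window and file_overlap(prev_files, next_files) >= MIN_FILE_OVERLAP


if __name__ == "__main__":
    closed = datetime(2025, 1, 1, 12, 0)
    print(should_merge(closed, closed + timedelta(minutes=5), {"a.py", "b.py"}, {"a.py", "c.py"}))
```

Requiring both conditions keeps a quick restart on unrelated files, or a return to the same files hours later, from being merged by accident.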

How `/clear` is handled

/clear is the strongest single boundary signal. Here is what happens when the plugin emits one:

Session closes. The current session is sealed and queued for scoring. The next user prompt opens a fresh session.
The closed session is graded as-is. Whatever state the work was in at the /clear moment is what the outcome judge sees. If you /clear before verifying a fix, the session may not crush — see How to crush a session explicitly.
Anti-fragmentation merge runs. If the next session starts within 10 minutes and touches ≥50% of the same files, the two sessions are merged back into one. The merged session is re-scored as a single unit, so a mid-task /clear does not split one task into two failed sessions.
/compact is different. It compresses the transcript but keeps the goal — so it is not a session boundary. Use /compact when you want to free context without ending the task; use /clear when you are actually starting something new.

The practical rule: use /clear between tasks, not inside one. If you clear mid-task and immediately keep working, the merger usually catches it, but it is safer not to rely on the merger.

The four facts

Every closed session writes four facts. Tokens used is recorded for each session but does not appear in the crushed-session formula:

Fact	Source	What it answers
Substance floor	Deterministic detector	Did real work happen? (≥3 turns OR ≥10 net lines of code OR ≥1 mutating tool call)
Goal complete	LLM outcome judge	Did the session land its goal under per-intent criteria?
Tokens used	OTel	How many tokens did this session burn?
Rework	Deterministic detector	Did a later session revert or rewrite this one?

Per-intent completion criteria

The outcome judge is one LLM call per session. It takes the transcript plus the rubric judge’s intent classification, and decides goal_complete against intent-specific rules:

Intent	`goal_complete = TRUE` when…	Silent-completion floor
Question	User accepted the answer (no follow-up clarification)	0.50
Investigation	Reached a stated conclusion (root cause, hypothesis confirmed/refuted)	0.60
Review	Produced an actionable verdict (approve / reject / specific changes)	0.60
Plan / Spec	Plan accepted in-session (explicit “let’s build it” or implicit acceptance via implementation)	0.60
Small change	Change applied + no immediate revert	0.50
Bug fix	Fix applied + verification evidence (test passes, error gone, repro no longer fails)	0.75
Feature	Scaffolded matching scope + acceptance + (tests OR explicit “tests later”)	0.75
Refactor	Behavior-preservation evidence (tests green, lint clean, type check passes)	0.75

High-stakes intents (Bug fix, Feature, Refactor) require stronger evidence to count as crushed when only silent-completion signals are available.

What “Silent-completion floor” means

The judge prefers explicit completion signals — you saying “thanks, that works”, a test passing, an error disappearing. When those aren’t there, the judge falls back on implicit evidence: the conversation just ended, you moved on, no follow-up complaint.

The silent-completion floor is the confidence threshold the judge must clear on that implicit evidence alone to mark the session crushed. A higher floor means more circumstantial evidence is required.

0.50 (Question, Small change) — mild confidence is enough; these are low-stakes.
0.60 (Investigation, Review, Plan) — moderate confidence; the conclusion or verdict has to be clearly inferable.
0.75 (Bug fix, Feature, Refactor) — high confidence; without explicit verification, the judge defaults to “not crushed”.

If you fix a bug and never confirm it, the session will most likely not crush — the floor is set high on purpose to keep the score honest.
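
The floors read naturally as a per-intent lookup. The sketch below assumes the judge produces a single confidence value in [0, 1] for implicit evidence and that meeting the floor exactly is enough; both are assumptions, and `silent_completion_passes` is a hypothetical name.

```python
# Floors copied from the per-intent table.
SILENT_COMPLETION_FLOOR = {
    "Question": 0.50,
    "Investigation": 0.60,
    "Review": 0.60,
    "Plan / Spec": 0.60,
    "Small change": 0.50,
    "Bug fix": 0.75,
    "Feature": 0.75,
    "Refactor": 0.75,
}


def silent_completion_passes(intent: str, implicit_confidence: float) -> bool:
    # Assumption: the floor is inclusive (confidence >= floor passes).
    return implicit_confidence >= SILENT_COMPLETION_FLOOR[intent]


if __name__ == "__main__":
    # The same implicit confidence passes for a question but not for a bug fix.
    print(silent_completion_passes("Question", 0.70))
    print(silent_completion_passes("Bug fix", 0.70))
```

The same implicit evidence is judged differently by intent, which is why an unverified bug fix tends not to crush while an unacknowledged answer to a question often does.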

How to crush a session explicitly

You don’t have to do anything special — most sessions crush on their own when work actually lands. But the judge is far more confident when you leave explicit signals. A few quick habits raise the hit rate:

Intent	Easiest explicit signal
Question	Acknowledge the answer (“got it”, “thanks”) and don’t ask a follow-up clarification.
Investigation	State the conclusion in the chat: “root cause is X” / “hypothesis confirmed”.
Review	Give a verdict: “approve”, “reject”, or list the specific changes you want.
Plan / Spec	Say “let’s build it” — or just start implementing in the same session.
Small change	Apply the change; don’t immediately revert it.
Bug fix	Run the failing test or repro and confirm it’s green / the error is gone.
Feature	Land the scaffolding with tests (or explicitly defer them: “tests later”).
Refactor	Run tests, lint, and type check; mention that they pass.

What to avoid: ending mid-fix without verifying, switching topics with /clear before confirming the work, or trailing off with “I’ll come back to this”.

Failure signals

Any of these in the final 2–3 turns forces goal_complete = FALSE:

Signal	Pattern
Unresolved error	Stack trace, `Error:`, failing test with no subsequent fix
User frustration	“ugh”, “still broken”, “never mind”, “this isn’t working”
Scope abandoned	“let’s try something else”, “I’ll come back to this”, “skip it”
Iteration loop	≥3 attempts on the same fix with no resolution visible
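
As a rough illustration of how the text-based signals in the table could be checked, the sketch below scans the final turns with regular expressions. It is a deliberate simplification and not the judge's actual logic: the pattern set is hypothetical, it does not check whether an error was later fixed, and it leaves out the iteration-loop signal, which needs more than text matching.

```python
import re

# Hypothetical patterns built from the example phrases in the table.
FAILURE_PATTERNS = {
    "unresolved_error": re.compile(r"Traceback \(most recent call last\)|\w*Error:"),
    "user_frustration": re.compile(
        r"\bugh\b|still broken|never mind|this isn['’]t working", re.IGNORECASE
    ),
    "scope_abandoned": re.compile(
        r"let['’]s try something else|I['’]ll come back to this|skip it", re.IGNORECASE
    ),
}


def failure_signals(turns: list[str], window: int = 3) -> set[str]:
    # Only the last `window` turns are inspected; earlier turns are ignored.
    tail = turns[-window:] if window > 0 else []
    return {
        name
        for name, pattern in FAILURE_PATTERNS.items()
        if any(pattern.search(turn) for turn in tail)
    }


if __name__ == "__main__":
    print(failure_signals(["fix the login bug", "ran the tests", "ugh, still broken"]))
```

An error that appears early in the transcript and is followed by several clean turns falls outside the window, so it does not trip the check.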

The crushed-session formula

crushed = substance_floor_passed
       AND judge_status == "ok"
       AND goal_complete
       AND NOT rework

PRISM Score = 100 × crushed_count / total_count

The denominator counts only sessions where substance_floor_passed = TRUE AND judge_status = "ok". Trivial sessions and goal-less ambient chat drop out of both sides.
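
The formula and its denominator rule can be sketched together in a few lines. The field names mirror the formula above; the `SessionFacts` class, the `"no_intent"` status string, and returning `None` when no session qualifies are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SessionFacts:
    substance_floor_passed: bool
    judge_status: str
    goal_complete: bool
    rework: bool


def qualifies(s: SessionFacts) -> bool:
    # Only qualifying sessions enter the denominator.
    return s.substance_floor_passed and s.judge_status == "ok"


def is_crushed(s: SessionFacts) -> bool:
    return qualifies(s) and s.goal_complete and not s.rework


def prism_score(sessions: list[SessionFacts]) -> Optional[float]:
    qualifying = [s for s in sessions if qualifies(s)]
    if not qualifying:
        return None  # Assumption: the score is undefined with no qualifying sessions.
    crushed_count = sum(is_crushed(s) for s in qualifying)
    return 100 * crushed_count / len(qualifying)


if __name__ == "__main__":
    sessions = [
        SessionFacts(True, "ok", True, False),           # crushed
        SessionFacts(True, "ok", True, False),           # crushed
        SessionFacts(True, "ok", True, True),            # reworked later
        SessionFacts(True, "ok", False, False),          # goal not complete
        SessionFacts(False, "ok", True, False),          # trivial: excluded
        SessionFacts(True, "no_intent", False, False),   # no goal: excluded
    ]
    print(prism_score(sessions))  # 50.0
```

Here two of the four qualifying sessions crush, so the score is 50.0; the trivial and goal-less sessions change neither the numerator nor the denominator.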

The agents

Four LLM agents back v3.0. All share one contract (Haiku 4.5 default, temperature = 0, JSON-only output, judge_version pinned per persisted row):

#	Agent	Job
1	Sub-Session Tracker	Topic-shift boundary detection
2	Language Detection	Tags the session’s primary language
3	Rubric Judge	Intent classification + 7-boolean rubric + letter grade + sub-session title and summary
4	Outcome Judge	`goal_complete` against per-intent criteria

A fifth agent (Task Classifier, for TUR) is deferred post-launch.

Privacy

Every transcript that leaves the engine goes through the existing redaction path. Secrets are stripped before any LLM sees them. Tenant opt-out is honored by the same code path — no agent reads raw transcripts directly.

For the full per-agent specifications, see the internal design docs in the optra-prism repository (docs/prism-score-v3-*.md).