Measurement

The PRISM Score is built from four facts written for every session. This page explains where each fact comes from, how the engine decides when a session starts and ends, and which LLM agents are involved.

For the formula and the per-page breakdown, see PRISM Score. For the full algorithm spec, see Algorithm Overview.

The four facts

Every closed session writes four facts. The score then combines them.

Fact	Source	What it answers
Substance floor	Deterministic detector	Did real work happen? — ≥3 turns OR ≥10 net lines of code OR ≥1 mutating tool call
Goal complete	LLM outcome judge	Did the session land its goal under per-intent criteria?
Rework	Deterministic detector	Did a later session revert or rewrite this one?
Intent established	LLM rubric judge	Did the session commit to a clear class (Question, Bug fix, Feature, etc.)?

A session crushes when all four land in the right state. See PRISM Score → What “crushed” means.

Substance floor

A deterministic filter — no LLM is involved. The floor catches sessions that didn’t do real work and stops them from padding the denominator.

A session passes the floor if any of these is true:

≥3 turns between user and assistant.
≥10 net lines of code added (across all files touched).
≥1 mutating tool call — Edit, Write, Bash with side effects, etc.

Sessions that fail the floor land on the Trivia page. They aren’t penalized — they’re filtered out of both sides of the ratio.
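
As a minimal sketch, the floor can be written as a single predicate over three counters. The `SessionStats` type and its field names are hypothetical, chosen for illustration; only the thresholds come from the list above.

```python
from dataclasses import dataclass


@dataclass
class SessionStats:
    # Hypothetical field names; the engine's real schema is not shown on this page.
    turns: int                # turns between user and assistant
    net_lines_added: int      # net lines of code added across all files touched
    mutating_tool_calls: int  # Edit, Write, Bash with side effects, etc.


def passes_substance_floor(stats: SessionStats) -> bool:
    """Return True if any one of the three floor conditions holds."""
    return (
        stats.turns >= 3
        or stats.net_lines_added >= 10
        or stats.mutating_tool_calls >= 1
    )


# A two-turn chat with no code and no tool calls fails the floor.
quick_chat = SessionStats(turns=2, net_lines_added=0, mutating_tool_calls=0)
# A single mutating tool call is enough on its own.
one_edit = SessionStats(turns=1, net_lines_added=0, mutating_tool_calls=1)

print(passes_substance_floor(quick_chat))  # False
print(passes_substance_floor(one_edit))    # True
```

Because the conditions are joined with OR, a short session that makes one edit still counts as real work.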

Outcome judge — per-intent criteria

The outcome judge makes one LLM call per session. It takes the transcript plus the rubric judge’s intent classification and decides `goal_complete` against intent-specific rules:

Intent	`goal_complete = TRUE` when…	Silent-completion floor
Question	User accepted the answer (no follow-up clarification)	0.50
Investigation	Reached a stated conclusion	0.60
Review	Produced an actionable verdict	0.60
Plan / Spec	Plan accepted in-session	0.60
Small change	Change applied + no immediate revert	0.50
Bug fix	Fix applied + verification evidence (test pass, error gone, repro confirmed)	0.75
Feature	Scaffolded matching scope + acceptance + (tests OR explicit “tests later”)	0.75
Refactor	Behavior-preservation evidence (tests green, lint clean, type check passes)	0.75

The silent-completion floor is the confidence the judge needs from circumstantial evidence alone when no explicit “this worked” signal is present. Bug fixes, features, and refactors require the highest confidence (0.75), so a fix without a verifying test usually won’t crush.
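
The floors in the table can be read as a per-intent lookup. The sketch below covers only the case where no explicit signal is present. The function name is hypothetical, and treating the floor as an inclusive threshold (`>=`) is an assumption; the page does not say how the judge applies it internally.

```python
# Silent-completion floors, copied from the table above.
SILENT_COMPLETION_FLOOR = {
    "Question": 0.50,
    "Investigation": 0.60,
    "Review": 0.60,
    "Plan / Spec": 0.60,
    "Small change": 0.50,
    "Bug fix": 0.75,
    "Feature": 0.75,
    "Refactor": 0.75,
}


def meets_silent_floor(intent: str, confidence: float) -> bool:
    """Hypothetical helper: is circumstantial confidence enough for this intent?

    Only meaningful when the transcript has no explicit "this worked" signal.
    The inclusive comparison is an assumption.
    """
    return confidence >= SILENT_COMPLETION_FLOOR[intent]


# The same confidence clears the bar for a Question but not for a Bug fix.
print(meets_silent_floor("Question", 0.70))  # True
print(meets_silent_floor("Bug fix", 0.70))   # False
```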

Rework detection

A deterministic check that runs after the session has closed. If a later session reverts or rewrites the same code, the earlier session is downgraded — it didn’t land, even if the outcome judge thought it did. This stops “claim it works, fix it tomorrow” from inflating the score.

Session boundaries

A session closes when any of these signals fires, never on elapsed time alone.

Signal	Detection
`/clear`	OTel event from the plugin — explicit context wipe
Topic shift	LLM comparison of consecutive user turns; closes if the topic differs materially
Git commit	A commit lands touching files the session modified

/compact is not a boundary. It compresses the transcript but preserves the goal. Use /compact when you want more context room; use /clear when you’re starting something new.
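
The boundary rules above can be sketched as a small classifier. The `Signal` enum, its member names, and the `closes_session` helper are hypothetical. The topic-shift signal is assumed to arrive already decided by the LLM comparison, and `IDLE_TIMEOUT` is included only to show that time alone never closes a session.

```python
from enum import Enum


class Signal(Enum):
    # Hypothetical names for the events discussed on this page.
    CLEAR = "/clear"
    TOPIC_SHIFT = "topic_shift"    # already judged "materially different" upstream
    GIT_COMMIT = "git_commit"
    COMPACT = "/compact"           # compresses the transcript, keeps the goal
    IDLE_TIMEOUT = "idle_timeout"  # elapsed time alone


BOUNDARY_SIGNALS = {Signal.CLEAR, Signal.TOPIC_SHIFT, Signal.GIT_COMMIT}


def closes_session(signal, commit_files=frozenset(), session_files=frozenset()):
    """Return True if this signal should close the current session."""
    if signal is Signal.GIT_COMMIT:
        # A commit only counts if it touches files the session modified.
        return bool(set(commit_files) & set(session_files))
    return signal in BOUNDARY_SIGNALS


print(closes_session(Signal.CLEAR))         # True
print(closes_session(Signal.COMPACT))       # False
print(closes_session(Signal.IDLE_TIMEOUT))  # False
print(closes_session(Signal.GIT_COMMIT, {"a.py"}, {"a.py", "b.py"}))  # True
```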

Anti-fragmentation merge

/clear can fire mid-task: someone clears to get the context window back, then keeps working. A post-close merger rejoins the two sessions when both of these are true:

The next session starts within 10 minutes of the previous session closing.
The two sessions touch ≥50% of the same files.

The merger is what makes /clear safe to treat as a hard boundary. Without it, /clear-spam would tank the score of diligent developers.
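
A minimal sketch of the merge test, under stated assumptions. The function and constant names are hypothetical. The page does not say how the 50% overlap is measured, so this sketch assumes it is the share of the smaller file set that also appears in the other session, and it treats the 10-minute window as inclusive.

```python
from datetime import datetime, timedelta

MERGE_WINDOW = timedelta(minutes=10)  # from the rule above
MIN_FILE_OVERLAP = 0.5                # from the rule above


def should_merge(closed_at, next_started_at, prev_files, next_files):
    """Return True if two consecutive sessions look like one interrupted task."""
    gap = next_started_at - closed_at
    if gap < timedelta(0) or gap > MERGE_WINDOW:
        return False

    prev, nxt = set(prev_files), set(next_files)
    smaller = min(len(prev), len(nxt))
    if smaller == 0:
        # No files on one side: nothing to overlap on.
        return False

    # Assumption: overlap is measured against the smaller of the two file sets.
    overlap = len(prev & nxt) / smaller
    return overlap >= MIN_FILE_OVERLAP


closed = datetime(2025, 1, 1, 12, 0)
same_task = should_merge(
    closed, closed + timedelta(minutes=5), {"a.py", "b.py"}, {"a.py", "c.py"}
)
too_late = should_merge(
    closed, closed + timedelta(minutes=11), {"a.py", "b.py"}, {"a.py", "b.py"}
)
print(same_task)  # True
print(too_late)   # False
```

Both conditions must hold, so a quick follow-up session on unrelated files stays separate.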

The LLM agents

Four LLM agents back PRISM Score v3. All share one contract: Haiku 4.5 as the default model, temperature = 0, JSON-only output, and a judge version pinned per persisted row.

#	Agent	Job
1	Sub-Session Tracker	Topic-shift boundary detection
2	Language Detection	Tags the session’s primary language
3	Rubric Judge	Intent classification + 7-boolean rubric + letter grade + title and summary
4	Outcome Judge	`goal_complete` against per-intent criteria

Every transcript that leaves the engine goes through the existing redaction path before any agent sees it. Tenant opt-out is honored.

Update cadence

Every page in the PRISM Score v3 group updates within a few minutes of a session closing. The scoring pipeline is event-driven — it runs as soon as the engine sees the boundary fire, not on a timer.

For the full data pipeline (ingest → NATS → S3 → DataFusion → scoring), see Architecture.