Most outcome measurement in routine CBT practice is intake-and-discharge. PHQ-9 and GAD-7 at session 1, PHQ-9 and GAD-7 at the final session, subtract one from the other, write it up. The number that emerges is treated as the outcome data and, in audit terms, it is. It also tells the therapist almost nothing about what was happening while the treatment was running.
This is measurement in the same sense that taking a thermometer reading at the start and end of an illness is measurement. The numbers are technically true. They are not, in any meaningful clinical sense, of use to anyone trying to treat the patient in between.
The field has known this for some time. The continuous-measurement literature has been accumulating steadily since the 1990s, and the Lutz programme of research has by now produced enough evidence that the question is no longer whether session-by-session routine outcome monitoring (ROM) improves outcomes (the feedback meta-analyses indicate that it does, most clearly for clients who are not progressing as expected) but why routine practice has not adopted it at the rate the evidence suggests it should.
What session-by-session ROM actually means
The phrase "routine outcome monitoring" is used loosely enough in the literature that it is worth being precise about what session-by-session ROM is, distinct from the kind of ROM that most services run as standard.
Session-by-session ROM is a brief, repeated, structured measure completed ahead of every session, not just the first and the last. PHQ-9 for depression, GAD-7 for anxiety, WSAS for functional impairment, ORS/SRS in some traditions, PCL-5 for trauma where relevant. The specific instrument matters less than the principle: the same measure, completed before the session opens, scored automatically, and visible to the therapist as the session begins.
The data this generates is structurally different from intake-and-discharge data in a way that matters clinically. Intake-and-discharge gives you two points and a difference. Session-by-session ROM gives you a trajectory — a curve of how the client's symptom severity has moved over time, plotted week by week, against the population trajectory for clients with similar presenting severity. The trajectory is the unit of analysis, not the endpoints.
That difference is not cosmetic. It is the difference between knowing where the client ended up and knowing where the client is heading, while the treatment is still running.
The Lutz programme — what continuous measurement is actually for
Wolfgang Lutz and colleagues have built one of the most consequential programmes of research in psychotherapy measurement over the past twenty years, and a useful synthesis of it appears in Lutz, De Jong, Rubel and Delgadillo's 2022 piece in World Psychiatry. The argument is straightforward when laid out clearly.
If you measure clients repeatedly over the course of treatment, you can generate an expected trajectory for clients with similar baseline severity, presenting problem, and demographic profile. A given client's actual trajectory can then be compared to that expected curve in real time. Clients who are tracking the expected response curve do not require any particular intervention, because the treatment is working as it should. Clients who are improving more slowly than the expected curve predicts, or not improving at all, are showing an early signal that the treatment is not delivering for them, and that signal arrives weeks before the discharge measure would surface the same information.
This is the mechanism by which feedback-informed treatment (FIT) generates its outcome effects. The signal is not the absolute score; it is the deviation from expected trajectory, surfaced to the therapist while there is still time to act on it. Clients identified as "not on track" by these methods, when their therapists are alerted to that fact, show improved outcomes relative to comparison conditions where the same monitoring is happening but the alerts are not fed back. The alert, in other words, is doing the work — not the measurement itself.
What the alert allows is the kind of clinical adjustment that intake-and-discharge measurement cannot prompt. A reformulation. A change in technique emphasis. A conversation with the client about why progress has stalled. A consultation with supervision focused specifically on what has gone wrong with this case rather than a general case discussion. Sometimes a stepped-up referral. The point is that the adjustment becomes possible because the information arrives in time to use it.
This is what continuous measurement is actually for. It is not a reporting overhead. It is a decision-support input.
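The comparison described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method used in the Lutz programme: the `expected_score` curve (a fixed proportional reduction per session), the 4-point `tolerance`, and every function and class name are hypothetical. In practice the expected curve would be estimated from outcome data for clients with a similar baseline profile.

```python
from dataclasses import dataclass


@dataclass
class SessionStatus:
    session: int
    observed: int
    expected: float
    on_track: bool


def expected_score(baseline: int, session: int, decay: float = 0.9) -> float:
    """Toy expected curve: severity shrinks by a fixed proportion per session.

    Assumption for illustration only. A real curve would be fitted to
    data from clients with similar baseline severity and presentation.
    """
    return baseline * decay ** (session - 1)


def track_status(scores: list[int], tolerance: float = 4.0) -> list[SessionStatus]:
    """Compare each observed score with the expected curve.

    scores[0] is the session 1 (baseline) score; higher means more severe.
    A session is flagged not on track when the observed score sits more
    than `tolerance` points above the expected score.
    """
    baseline = scores[0]
    statuses = []
    for index, observed in enumerate(scores):
        session = index + 1
        expected = expected_score(baseline, session)
        on_track = (observed - expected) <= tolerance
        statuses.append(SessionStatus(session, observed, round(expected, 1), on_track))
    return statuses


if __name__ == "__main__":
    for s in track_status([18, 18, 17, 18, 18, 17]):
        flag = "on track" if s.on_track else "NOT ON TRACK"
        print(f"session {s.session}: observed {s.observed}, expected {s.expected} -> {flag}")
```

With a baseline of 18 and scores that barely move, this sketch flags the case from session 4 onward, while there are still sessions left in which to respond. The signal is the gap between observed and expected, not the raw score.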
The Delgadillo evidence — UK Talking Therapies and stratified care
The Lutz argument is general; the UK-specific evidence comes principally from Jaime Delgadillo and colleagues' work in NHS Talking Therapies (formerly IAPT). Delgadillo et al.'s stratified-care research, published across a series of papers culminating in the 2021 stratified-care trial, looked at what happens when ROM data is used not just to track outcomes but to inform stepped-care decisions.
The stratified-care logic is this: high-intensity CBT is more resource-intensive than low-intensity guided self-help, but it is not more effective for all clients. Some clients do as well at the lower intensity; some clients require the higher intensity from the start. ROM data, particularly when combined with baseline predictors, can be used to allocate clients to the intensity their presentation suggests they need rather than allocating everyone to low-intensity first and stepping up only after failure. The Delgadillo trial demonstrated that stratified care, with allocation informed by predictive modelling built on ROM data, produced improved outcomes compared with the standard stepped-care approach.
The implication that travels beyond the Talking Therapies context is the more important one. ROM data is not just for reporting. It is for clinical decision-making — about intensity, about technique, about when to consult, about when to refer, about when the treatment plan needs to change. The Delgadillo work demonstrated this at scale within a national service; the same logic applies to individual practice.
This is the structural argument that intake-and-discharge measurement cannot answer. By the time the discharge PHQ-9 tells you the client did not improve, the treatment is over. The data has audit value; it has no clinical decision-support value for the case it describes. Whatever could have been adjusted cannot now be adjusted.
What intake-and-discharge measurement actually misses
It is worth being concrete about what gets lost when measurement is collapsed to two points.
Early identification of stalling cases. A client whose PHQ-9 has not moved between session 1 and session 6 is, by the Lutz framework, a not-on-track case. The clinical conversation about why — formulation gap? alliance issue? wrong intervention? life event not being attended to? — needs to happen at session 6, not at discharge. Intake-and-discharge measurement makes that conversation impossible to ground in data.
Early identification of deterioration. Some clients get worse during therapy, particularly during trauma-focused work or behavioural activation that has been graded at the wrong level of difficulty. Deterioration is a clinical phenomenon worth catching early; intake-and-discharge measurement catches it, if at all, only when the client either drops out or completes treatment with higher symptom severity than at baseline. Session-by-session ROM catches it at the session where the deterioration appears.
The rupture-engagement signal. A drop in ROM engagement — the client stops completing the measure, completes it perfunctorily, or skips items they previously answered — is one of the more reliable early signals of alliance trouble. This is taken up in detail in the alliance rupture prediction piece; the relevant point here is that the signal does not exist at all without the ongoing measurement to compare against.
The plateau case. A client whose scores improved rapidly in the first four sessions and have plateaued for three sessions running is in a different clinical position from a client whose scores have improved steadily across the same period. Intake-and-discharge measurement cannot distinguish these. Session-by-session ROM does, and the clinical implication — that the plateau case may have hit a maintaining factor that needs explicit work rather than a continuation of the existing intervention — is the kind of decision the therapist needs the data to support.
These are not edge cases. They are common patterns that appear across any reasonable caseload over the course of a year. The data exists; the question is whether it is being captured in a form that can influence the therapy.
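The score patterns above (stalling, deterioration, plateau) can be told apart with very little logic once the full series is available. The sketch below is illustrative only: the function name, the 6-point `reliable_change` threshold (chosen to resemble commonly used PHQ-9 reliable-change conventions), the six-session minimum, and the three-session plateau window are all assumptions, not values taken from the cited studies.

```python
def classify_trajectory(
    scores: list[int],
    reliable_change: int = 6,   # assumed threshold for a meaningful shift
    min_sessions: int = 6,      # assumed minimum before calling a case stalled
    plateau_window: int = 3,    # "three sessions running"
    plateau_band: int = 1,      # max spread still counted as flat
) -> str:
    """Label a score series (session 1 first; higher score = more severe)."""
    if len(scores) < 2:
        return "insufficient data"

    change = scores[0] - scores[-1]  # positive means improvement

    # Deterioration is reported as soon as it appears, at any session.
    if change <= -reliable_change:
        return "deteriorating"

    if len(scores) < min_sessions:
        return "insufficient data"

    # No reliable improvement since baseline.
    if change < reliable_change:
        return "stalled"

    # Reliable improvement overall, but flat across the most recent sessions.
    recent = scores[-plateau_window:]
    if max(recent) - min(recent) <= plateau_band:
        return "plateau after early gain"

    return "improving steadily"


if __name__ == "__main__":
    examples = {
        "A": [16, 16, 15, 16, 15, 16],
        "B": [12, 13, 15, 17, 18, 19],
        "C": [20, 16, 12, 9, 9, 9, 9],
        "D": [20, 18, 16, 15, 13, 11, 9],
    }
    for name, series in examples.items():
        print(name, series[0], "->", series[-1], classify_trajectory(series))
```

Clients C and D both move from 20 to 9, so an intake-and-discharge comparison would report them as identical. Only the session-by-session series separates the plateau from the steady improver.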
Why this is hard to maintain by hand
If session-by-session ROM is so clearly preferable, the next question is why most routine practice does not run it. The answer is structural rather than informational, in the same way the homework problem is structural rather than informational.
Reliable session-by-session ROM requires that the measure be present, completed, scored, and reviewed before every session. Each of these steps is small in isolation; cumulatively across a caseload they are not. The therapist has to remember to send the measure or hand it out. The client has to complete it. The score has to be calculated. The result has to be plotted against the trajectory. The therapist has to see it before the session opens, with enough preparation time to think about what it means.
In practice, this whole sequence collapses under caseload pressure within weeks. The measure does not get sent. When it does get sent, the client completes it irregularly. When the client completes it, the score is calculated in the closing minutes of the previous case or not at all. By the time the session opens, the therapist is working from memory of what the score was four sessions ago and a general impression of how things are going. The data is technically being collected; it is not, functionally, informing the work.
This is the same structural-drift pattern documented in the therapist drift literature. The behaviour the evidence supports is the behaviour that erodes first under pressure, not because the therapist disagrees with the evidence but because the behaviour is high-effort to maintain without supporting infrastructure. Therapists who manage to sustain session-by-session ROM by hand tend to be either obsessional about their administrative systems or carrying less than a full caseload. Neither of these is a scalable solution.
The honest position is that session-by-session ROM as a clinical decision-support tool requires instrumentation. Without instrumentation, what gets measured will be what is easiest to measure — the start and the end — and the trajectory data the evidence keeps pointing to as the clinically useful quantity will not be available when the case needs it.
What good ROM infrastructure looks like
The components of a working session-by-session ROM system are not exotic, but they are precise. Each component fails differently when it is missing.
The measure is delivered to the client between sessions, on whatever channel they use. Web link, app prompt, SMS — the channel matters less than the reliability. The client should not have to remember to complete the measure; the system should prompt them at a consistent time relative to the next session.
The measure is completed before the session opens. This is the constraint that makes the data clinically useful. A PHQ-9 completed in the waiting room arrives too late to inform agenda setting; a PHQ-9 completed three days earlier can be reviewed and woven into the session opening.
The score is calculated automatically and stored against the client's record. Manual scoring is one of the steps that erodes first. Automatic scoring removes the failure mode.
The trajectory is plotted against an expected response curve. The score in isolation matters less than its position relative to where the expected curve predicts it should be. The Lutz framework cannot operate on raw scores alone; it requires the comparison curve.
Flags surface when the client deviates from the expected trajectory. Not-on-track cases need to be visible. A clinician with thirty active cases cannot manually compare each trajectory against its expected curve every week; the system needs to surface the deviations.
Engagement with the measure is tracked alongside the measure itself. A drop in completion rate is itself diagnostic information — about engagement, about the alliance, about whether the client is becoming avoidant of the measure for clinically interesting reasons. Engagement data should be visible at the same time as score data.
These components together turn ROM from an audit overhead into a decision-support layer. The therapist sits down to open a session knowing what the trajectory looks like, whether the client is on track, whether their engagement with the measure has changed, and what the relevant clinical questions are. None of this requires the therapist to remember to do anything; the infrastructure remembers for them.
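Two of the components above, automatic scoring and engagement tracking, are simple enough to sketch directly. The PHQ-9 arithmetic is standard (nine items scored 0 to 3, total 0 to 27, with the usual severity bands); everything else here is an assumption for illustration, including the function names, the data shapes (`None` for an unreturned measure or a skipped item), and the four-prompt window.

```python
from typing import Optional

# Standard PHQ-9 severity bands as (lower cutoff, label).
PHQ9_BANDS = [
    (0, "minimal"),
    (5, "mild"),
    (10, "moderate"),
    (15, "moderately severe"),
    (20, "severe"),
]

Response = Optional[list[Optional[int]]]  # None = measure not returned


def score_phq9(items: list[int]) -> tuple[int, str]:
    """Sum nine 0-3 item responses and return (total, severity band)."""
    if len(items) != 9 or any(i not in (0, 1, 2, 3) for i in items):
        raise ValueError("PHQ-9 needs nine item responses, each scored 0-3")
    total = sum(items)
    # The last band whose cutoff the total reaches is the one that applies.
    band = [label for cutoff, label in PHQ9_BANDS if total >= cutoff][-1]
    return total, band


def is_complete(response: Response) -> bool:
    """A response counts only if it was returned with all nine items answered."""
    return (
        response is not None
        and len(response) == 9
        and all(item is not None for item in response)
    )


def completion_rate(responses: list[Response], window: int = 4) -> float:
    """Share of the most recent `window` prompts that were fully completed."""
    recent = responses[-window:]
    if not recent:
        return 0.0
    return sum(is_complete(r) for r in recent) / len(recent)


if __name__ == "__main__":
    full = [2, 2, 1, 2, 1, 1, 2, 1, 0]
    print(score_phq9(full))
    history = [full, full, None, full, None, None]
    print(f"recent completion rate: {completion_rate(history):.2f}")
```

Keeping the completion rate next to the score means a falling rate shows up in the same pre-session view as the trajectory, rather than being noticed only when a data point is missing.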
The evidence keeps pointing in the same direction
The case for session-by-session ROM is not novel and it is not contested in the methodological literature. The Lutz programme of research has built the evidence base over twenty years. The Delgadillo work has demonstrated it at national-service scale. The feedback-informed treatment literature has shown the mechanism by which the measurement converts into improved outcomes. The implementation literature has, with similar consistency, identified the structural barriers that prevent the evidence from reaching routine practice.
The gap between what the evidence supports and what most therapists actually do is not a knowledge gap. It is an instrumentation gap. The behaviour the evidence recommends is high-effort to maintain by hand and low-effort to maintain with appropriate infrastructure, and the historic absence of that infrastructure at a price point that fits routine practice is most of why the gap persists.
That constraint is changing, and the case for closing the gap is now harder to set aside than it used to be.
Supervisia Companion is the between-session and pre-session layer for ROM.
Companion delivers the measure to the client on whatever channel they use, captures the response, scores it automatically, plots it against the expected trajectory, and surfaces deviations to the therapist before the next session opens. The not-on-track signal becomes a visible flag in the therapist's dashboard rather than something the therapist has to derive from raw scores. The Kazantzis-evidence layer for homework and the Lutz-evidence layer for outcome data sit alongside each other as the pre-session brief — the data the session needs in the form the session can actually use.
References
- Lutz, W., De Jong, K., Rubel, J. A. & Delgadillo, J. (2022). Measuring, predicting, and tracking change in psychotherapy. World Psychiatry, 21(2), 213–214. DOI: 10.1002/wps.20977.
- Delgadillo, J., Ali, S., Fleck, K., Agnew, C., Southgate, A., Parkhouse, L., Cohen, Z. D., DeRubeis, R. J. & Barkham, M. (2021). Stratified care versus stepped care for depression: A cluster randomized clinical trial. JAMA Psychiatry. (UK NHS Talking Therapies stratified-care evidence.)
- Lambert, M. J., Whipple, J. L. & Kleinstäuber, M. (2018). Collecting and delivering progress feedback: A meta-analysis of routine outcome monitoring. Psychotherapy, 55(4), 520–537. DOI: 10.1037/pst0000167.
- Kazantzis, N., Whittington, C., Zelencich, L., Kyrios, M., Norton, P. J. & Hofmann, S. G. (2016). Quantity and quality of homework compliance: A meta-analysis of relations with outcome in cognitive behavior therapy. Behavior Therapy, 47(5), 755–772. DOI: 10.1016/j.beth.2016.05.002. PubMed: 27816086.
- Waller, G. & Turner, H. (2016). Therapist drift redux: Why well-meaning clinicians fail to deliver evidence-based therapy, and how to get back on track. Behaviour Research and Therapy, 77, 129–137. DOI: 10.1016/j.brat.2016.01.007. PubMed: 26752326.
Last updated: May 2026
