
Why CBT trials need continuous fidelity infrastructure, not quarterly tape rating

A trial's reported effect size is a function of two things — the intervention's true efficacy and the fidelity with which it was delivered. The field treats the first as science and the second as logistics. That ordering is backwards, and it quietly degrades a great deal of CBT trial output.

15 January 2026 · 12 min read

A trial's reported effect size is a function of two things. The first is the true efficacy of the intervention as defined in the manual. The second is the fidelity with which the trial therapists actually delivered that intervention to participants. In a clean trial, delivery matches the manual and the reported effect tracks the true one; in a contaminated trial the two pull apart, and the field reads the gap as a smaller-than-expected effect for the intervention itself.

The field treats the first of these as the science and the second as logistics. That ordering is the wrong way around. Fidelity is not the administrative layer beneath the science; it is one of the two factors that determine what the trial actually reports. Treating it as logistics is the design decision that quietly degrades a great deal of CBT trial output.
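To make the two-factor claim concrete, here is a deliberately simple sketch in Python. It assumes a linear dilution model in which sessions delivered off-manual contribute no effect, so the reported effect is the true effect multiplied by the proportion of on-manual delivery. That model, the function name `reported_effect`, and the numbers are illustrative assumptions, not something the fidelity literature specifies.

```python
def reported_effect(true_effect: float, fidelity: float) -> float:
    """Toy dilution model (an assumption for illustration only).

    true_effect: hypothetical standardised effect under perfect delivery.
    fidelity: proportion of delivery that matched the manual, 0 to 1.
    """
    if not 0.0 <= fidelity <= 1.0:
        raise ValueError("fidelity must be between 0 and 1")
    return true_effect * fidelity


true_d = 0.80  # hypothetical true effect size
for f in (1.0, 0.8, 0.6):
    print(f"fidelity={f:.1f} -> reported effect={reported_effect(true_d, f):.2f}")
```

Under this toy model the intervention is identical in all three rows; only the delivery changes, yet the reported effect falls with it.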

The Borrelli framework — and what trials usually leave undone

Belinda Borrelli and colleagues' 2005 paper in the Journal of Consulting and Clinical Psychology set out what has become the standard framework for thinking about treatment fidelity in behavioural and psychological trials. They identified five domains:

Design. Whether the protocol specifies the intervention in sufficient detail for reproducible delivery, whether the comparator is genuinely a comparator rather than a degraded version of the active intervention, whether dose is specified.

Training. Whether the trial therapists were trained to a defined competency standard before delivering the intervention, and whether that training was calibrated against the manual rather than against pre-existing local practice.

Delivery. Whether the intervention as actually delivered to participants in session matches the intervention as specified in the protocol. This is the domain most exposed to therapist drift, supervision variability, and site-level cultural pull.

Receipt. Whether participants understood and engaged with the intervention as delivered. The intervention delivered correctly to a participant who did not engage with it is not, for practical purposes, the intervention.

Enactment. Whether participants used the intervention's skills outside the treatment sessions, in the contexts where it is supposed to produce change.

Their review of ten years of health behaviour trials found that reporting was systematically uneven across these domains. Trials reported well on design and training. They reported poorly on delivery, and worse on receipt and enactment. The pattern was not random. Design and training are activities that happen at the trial team's desk; delivery, receipt, and enactment require infrastructure embedded in the routine running of the trial across every participating site, week after week, throughout the recruitment and treatment phases.

Fig. 1 — Borrelli's five domains

[Figure: what gets reported in trial fidelity papers, by domain. Legend: desk-side activity; embedded in routine.]

The five domains of treatment fidelity, after Borrelli et al. 2005, shaded by how completely each is typically reported in CBT trial papers. Design and training, the work that happens at the trial team's desk, are well covered. Delivery, receipt and enactment require infrastructure embedded across every site, and are reported correspondingly sparsely.

After Borrelli et al. (2005). Journal of Consulting and Clinical Psychology, 73(5), 852–860.

That infrastructure is, in most CBT trials, not present. The standard delivery-fidelity arrangement is a small sample of sessions rated by an independent assessor, often late in the trial, often once the data are largely collected. This is not fidelity monitoring in any meaningful sense. It is fidelity post-mortem. It can confirm whether delivery was broadly aligned with the manual; it cannot influence whether delivery was aligned during the trial when influence was still possible.

What the drift literature implies for trials

The therapist drift literature is usually read as a problem for routine clinical practice. It is also a problem for trials, and the implication has not been fully absorbed by trial design.

Glen Waller's original 2009 paper in Behaviour Research and Therapy documented that qualified therapists working under standard supervision arrangements move away from evidence-based protocols in characteristic ways: under-delivering exposure, softening behavioural experiments, expanding the cognitive components at the expense of behavioural ones, accepting non-engagement with homework as a clinical decision rather than a problem to address. Waller and Turner's 2016 redux paper extended the analysis with the additional finding that these patterns are not driven primarily by knowledge deficits — they are driven by therapist beliefs, emotional avoidance, habit, and the surrounding social context.

The trial-design implication is direct. Trial therapists are qualified therapists. They have beliefs, emotions, habits, and social contexts. They were selected and trained for the trial, but selection and training do not place them outside the drift literature; selection and training are themselves what Waller and Turner identify as insufficient on their own to prevent drift. If routine clinical therapists working under standard supervision drift, then trial therapists working under episodic monitoring drift too. The structural protections in a trial — calibration meetings, a study protocol, the trial therapists' awareness of being studied — slow the drift; they do not eliminate it.

This means therapist drift is not a confound to be acknowledged in the limitations section of a trial paper. It is a systematic threat to internal validity, and the appropriate response is the same response that the drift literature recommends for clinical practice: continuous, structured, externally-instrumented fidelity feedback. Not as a quality-assurance afterthought. As part of the trial's measurement architecture.

The same principle applies, with sharper consequence, to multi-site trials. The companion piece on drift detection in multi-site CBT trials takes that problem in its own right — drift accumulates differently at each site, and without an instrument to see it, site-level drift is read at analysis as treatment failure.

The Bearman finding — and what it does to standard fidelity practice

Sarah Bearman and colleagues' 2022 randomised trial of adherence measurement methods (PubMed 36229116) is the paper that should have changed standard fidelity practice in trials, and has not yet done so. They compared methods of measuring therapist adherence to a manualised CBT protocol, including direct observation of sessions, behavioural rehearsal of techniques outside session, and therapist self-report.

The headline finding is the one that matters. Therapist self-report significantly overestimated adherence relative to direct observation. The gap was not small and was not noise; it was a systematic bias in the same direction across therapists. Behavioural rehearsal, on the other hand, aligned with direct observation as a fidelity measurement approach — meaning that asking a therapist to demonstrate a technique outside of session captured something closer to actual in-session behaviour than asking them to rate themselves.

The implication for trial fidelity instrumentation is uncomfortable. A great deal of trial adherence data is, in practice, collected by therapist self-report — session checklists filled in by the therapist who delivered the session, often immediately afterward, asking whether each manual component was covered. The Bearman finding says that this data measures something other than what the trial team has assumed it measures. It measures the therapist's self-perception of adherence, which is systematically inflated relative to actual delivery.
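A trial team that holds both checklist data and observer ratings for the same sessions can check the size and direction of this gap directly. The sketch below is a minimal illustration in Python; the field names, the function `self_report_bias`, and the data are hypothetical, and adherence is assumed to be expressed as the proportion of manual components covered.

```python
from statistics import mean

# Hypothetical paired adherence scores for the same five sessions:
# proportion of manual components covered, by self-report and by observer.
sessions = [
    {"self_report": 0.90, "observed": 0.70},
    {"self_report": 0.85, "observed": 0.75},
    {"self_report": 1.00, "observed": 0.80},
    {"self_report": 0.80, "observed": 0.80},
    {"self_report": 0.95, "observed": 0.65},
]


def self_report_bias(pairs):
    """Summarise how far self-report sits above observed adherence."""
    diffs = [p["self_report"] - p["observed"] for p in pairs]
    return {
        # Positive mean difference = self-report higher than observation.
        "mean_difference": mean(diffs),
        # Share of sessions where the therapist rated above the observer.
        "share_overestimated": sum(d > 0 for d in diffs) / len(diffs),
    }


print(self_report_bias(sessions))
```

A mean difference near zero with errors in both directions would suggest noise; a consistently positive difference is the systematic inflation described above.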

This is not a Bearman-specific artefact. It connects directly to Walfish and colleagues' 2012 work on clinician self-assessment bias: the finding that mental health practitioners overwhelmingly rate themselves as above-average performers, a pattern that cannot be accurate across a whole sample. The Walfish bias and the Bearman bias are the same phenomenon viewed through different instruments. Therapist self-report, in both routine practice and trial settings, does not give us reliable signal on what is actually happening in session.

A fidelity protocol that depends on therapist self-report is not a fidelity protocol. It is a measure of how the therapists feel about their delivery.

What continuous fidelity infrastructure looks like

The argument so far is that periodic tape-rating sampled late, plus therapist self-report sampled often, is not a fidelity instrument adequate to the threats that drift, self-assessment bias, and multi-site variability pose. What would adequate infrastructure look like?

The components are not exotic. They are unfamiliar mainly because trials have historically been unable to afford the labour to run them at scale, and because the labour has historically required infrastructure that did not exist.

Every treatment session recorded, with consent built into the trial enrolment. Recording cannot be a sample. A sampling rate of 5–10% of sessions, which is what most trials manage, is too sparse to detect drift at the therapist level and far too sparse to detect it at the site level within the trial's timeframe. The recording itself is a low-cost activity; the cost of building the consent into the protocol and the storage architecture is one-off; the value of having the recordings available for systematic rating is the difference between credible fidelity claims and aspirational ones.
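The sparseness of a 5–10% sample is easy to see with a little arithmetic. The Python sketch below assumes each session is sampled independently with the same probability; the caseload figures (40 sessions per therapist over the trial, 8 in a given month) are hypothetical.

```python
def expected_rated(sessions_delivered: int, sampling_rate: float) -> float:
    """Expected number of rated sessions for one therapist."""
    return sessions_delivered * sampling_rate


def prob_none_rated(sessions_in_window: int, sampling_rate: float) -> float:
    """Probability that no session in a window is rated at all,
    assuming sessions are sampled independently."""
    return (1.0 - sampling_rate) ** sessions_in_window


# Hypothetical caseload: 40 sessions across the trial, 8 in one month.
for rate in (0.05, 0.10):
    print(
        f"rate={rate:.0%}: "
        f"expected rated sessions={expected_rated(40, rate):.1f}, "
        f"P(no rated session in a month)={prob_none_rated(8, rate):.2f}"
    )
```

With only a handful of rated sessions per therapist across the whole trial, and a substantial chance that any given month yields none, a trend in delivery cannot be distinguished from session-to-session variation.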

Systematic CTS-R-type sampling by independent raters, throughout the trial rather than at the end. Independent rating against a defined competency framework (the Cognitive Therapy Scale-Revised, CTS-R, is the obvious choice for CBT; other instruments exist for other modalities) is the gold standard against which therapist self-report fails. The constraint is rater capacity. The historical reason this was not done continuously is that it required dedicated rater time on a rolling basis throughout the trial, which most budgets did not accommodate. Modern rater workflow infrastructure (structured queues, calibration cohorts, partial automation of the easier rating steps) substantially lowers the marginal cost of additional rated sessions.

Fig. 2 — Two sampling patterns

[Figure: timeline from trial start to trial end. Top: late-stage sampling (the historical norm). Bottom: continuous sampling throughout the trial, with detection in time to correct.]

The historical norm samples a handful of late sessions and produces a post-mortem rating; continuous sampling distributes the same effort across the trial's lifetime and produces drift signal early enough to act on. The total rating effort is comparable; the timing of the signal is not.

Schematic. Standard trial practice (top) vs continuous-fidelity proposal (bottom).

Feedback timed to influence the next session, not the next quarter. This is the operationally critical point. The purpose of fidelity monitoring within the trial is not retrospective. It is to detect drift early enough that the trial team can intervene — booster training, focused supervision input, re-calibration of the therapist against the manual — before the drift has compounded across a cohort of participants. A fidelity rating that arrives three months after the session it rated has scientific value as historical record; it has no corrective value for the trial as it runs. Continuous infrastructure means the loop closes within days, not months.
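The difference in timing can be sketched with the same rating budget spent two ways. In the Python below, the 48-week trial, the budget of 12 rated sessions, and the drift onset at week 10 are all hypothetical, and the function names are introduced here purely for illustration.

```python
def rating_weeks_late(trial_weeks: int, n_ratings: int) -> list[int]:
    """One rating per week, all packed into the final weeks of the trial."""
    return list(range(trial_weeks - n_ratings + 1, trial_weeks + 1))


def rating_weeks_continuous(trial_weeks: int, n_ratings: int) -> list[int]:
    """The same number of ratings spread evenly across the trial."""
    step = trial_weeks / n_ratings
    return [round(step * (i + 1)) for i in range(n_ratings)]


def weeks_until_seen(drift_onset_week: int, rating_weeks: list[int]):
    """Weeks between drift onset and the first rating at or after it.
    Returns None if no rating falls after the onset."""
    later = [w for w in rating_weeks if w >= drift_onset_week]
    return (min(later) - drift_onset_week) if later else None


late = rating_weeks_late(48, 12)
continuous = rating_weeks_continuous(48, 12)
print("late-stage delay (weeks):", weeks_until_seen(10, late))
print("continuous delay (weeks):", weeks_until_seen(10, continuous))
```

The rating effort is identical in both schedules. What changes is how many weeks of participants are treated by a drifting therapist before anyone is in a position to notice.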

Drift-detection signals visible at therapist and site level. What the trial team needs is not a stack of individual session ratings. It is a dashboard view of how each therapist's technique use is trending, which protocol components are being delivered less often than the manual specifies, and whether one site is drifting in a direction the other sites are not. Drift becomes detectable only when you can see it as a pattern; individual session ratings are the raw input, but pattern visibility is what makes the signal actionable.
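One simple way to turn individual ratings into a pattern is a per-therapist trend line. The Python sketch below fits a least-squares slope to each therapist's rated scores in session order and flags steep declines. The data, the therapist identifiers, and the flagging threshold are hypothetical; a real trial would need to set a threshold against its own rating scale and rater reliability, and would likely use something more robust than a raw slope.

```python
from collections import defaultdict


def slope(scores: list[float]) -> float:
    """Least-squares slope of scores against rating index 0..n-1."""
    n = len(scores)
    if n < 2:
        return 0.0  # a single rating carries no trend information
    x_mean = (n - 1) / 2
    y_mean = sum(scores) / n
    num = sum((i - x_mean) * (y - y_mean) for i, y in enumerate(scores))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den


def flag_drift(ratings, threshold: float = -0.5) -> dict[str, float]:
    """ratings: iterable of (therapist_id, session_order, score).
    Returns therapists whose score trend is at or below the threshold."""
    by_therapist = defaultdict(list)
    for therapist, _order, score in sorted(ratings, key=lambda r: (r[0], r[1])):
        by_therapist[therapist].append(score)
    trends = {t: slope(s) for t, s in by_therapist.items()}
    return {t: s for t, s in trends.items() if s <= threshold}


# Hypothetical total scores from independently rated sessions.
ratings = [
    ("T1", 1, 48), ("T1", 2, 47), ("T1", 3, 49), ("T1", 4, 48),
    ("T2", 1, 50), ("T2", 2, 47), ("T2", 3, 44), ("T2", 4, 41),
]
print(flag_drift(ratings))
```

Grouping by site rather than by therapist in `flag_drift` would give the corresponding site-level view.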

The specific problem of catching site-level drift in multi-site trials before it shows up at analysis as a contaminated effect size is taken up in the drift detection in RCTs piece.

The cost objection, addressed

The standard objection at this point is cost. Continuous fidelity infrastructure is more expensive than the standard sampled approach. The objection is correct but incomplete.

Trials already cost millions of pounds per arm at any meaningful scale. The marginal cost of continuous fidelity infrastructure — recording, rater time, rater-workflow tooling, the dashboard layer that surfaces drift — is small relative to the total trial budget. More importantly, it is small relative to the cost of the alternative.

The alternative is a trial that produces a smaller effect size than the intervention actually warrants because delivery was inconsistent, and which then enters the meta-analytic literature as a data point of ambiguous interpretation. The cost of that outcome — to the trial team's reputation, to the funder's confidence, and most consequentially to the evidence base — is large and durable. A trial that under-reports an intervention's true effect because of fidelity contamination is not just one wasted trial. It contributes to a downward bias in the evidence base that subsequent trials will inherit and that meta-analyses will average over without being able to correct.

Set against that cost, the marginal investment in continuous fidelity infrastructure looks straightforwardly worth making.

On inter-rater reliability

One point needs flagging rather than full treatment here. Continuous fidelity infrastructure presupposes that the raters generating the underlying data are calibrated against each other to an acceptable standard. Inter-rater reliability is its own methodological territory, and it is the territory where naive scaling of rated session volume breaks down: adding more raters without adding rater calibration adds noise, not signal.

Cohen's kappa is the standard headline statistic but it is not, on its own, sufficient to characterise rater calibration on a multi-item observational measure like the CTS-R. The detail of how trial-rater cohorts should be calibrated, monitored, and re-calibrated through the life of the trial is the subject of a planned companion piece on inter-rater reliability in CBT research. For this piece, the relevant point is that the infrastructure being described above must include a rater-calibration layer, not just a rating-throughput layer. Continuous fidelity without continuous rater calibration is continuous noise.
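For readers who have not computed it by hand, the headline statistic looks like this. The Python sketch below implements unweighted Cohen's kappa for two raters making a categorical judgement on the same sessions; the pass/fail labels and the data are hypothetical. Because it treats categories as unordered, it discards the ordinal structure of item-level scores, which is one reason it is not sufficient on its own for a multi-item measure.

```python
from collections import Counter


def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Unweighted Cohen's kappa for two raters on the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must rate the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given the same category.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement, from each rater's marginal category frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical pass/fail judgements on ten sessions.
a = ["pass"] * 6 + ["fail"] * 4
b = ["pass"] * 5 + ["fail"] * 4 + ["pass"]
print(round(cohens_kappa(a, b), 3))
```

In this example the raters agree on 8 of 10 sessions, but chance agreement is 0.52, so kappa is noticeably lower than the raw 80% agreement would suggest.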

The structural argument, restated

The case for continuous fidelity infrastructure in CBT trials is straightforward when laid out together. Therapist drift is a documented phenomenon that applies to trial therapists as well as routine ones. Self-report measures of adherence are biased relative to observation. Multi-site trials amplify these problems because site-level variation accumulates. Sampled, late-stage fidelity rating cannot detect drift early enough to correct it. Continuous, externally-instrumented, dashboarded fidelity monitoring can.

The reason this is not yet the default is historical: the infrastructure to make it tractable did not exist at a price point that fit standard trial budgets. That constraint is changing. The trials that build continuous fidelity infrastructure into their design from the outset will produce evidence with substantially less measurement contamination than the ones that do not. The trials that do not will continue to report effect sizes that quietly under-state the interventions they are testing, and the field will continue to be uncertain about how much of the variance is the intervention and how much is the delivery.

Supervisia Research is the platform layer for this work.

Research provides the rater workflows, calibration cohorts, session-level CTS-R capture, drift dashboards, and bespoke trial training modules that make continuous fidelity infrastructure tractable rather than aspirational. The Borrelli framework's delivery, receipt, and enactment domains move from under-reported to instrumented. The platform is designed to be used from protocol design through to fidelity reporting — including the rater calibration architecture the surface methodology depends on.


References

Borrelli, B., Sepinwall, D., Ernst, D., Bellg, A. J., Czajkowski, S., Breger, R., DeFrancesco, C., Levesque, C., Sharp, D. L., Ogedegbe, G., Resnick, B. & Orwig, D. (2005). A new tool to assess treatment fidelity and evaluation of treatment fidelity across 10 years of health behavior research. Journal of Consulting and Clinical Psychology, 73(5), 852–860. DOI: 10.1037/0022-006X.73.5.852. PubMed: 16287385.
Waller, G. & Turner, H. (2016). Therapist drift redux: Why well-meaning clinicians fail to deliver evidence-based therapy, and how to get back on track. Behaviour Research and Therapy, 77, 129–137. DOI: 10.1016/j.brat.2016.01.007. PubMed: 26752326.
Waller, G. (2009). Evidence-based treatment and therapist drift. Behaviour Research and Therapy, 47(2), 119–127. DOI: 10.1016/j.brat.2008.10.018. PubMed: 19036354.
Bearman, S. K. et al. (2022). A randomized trial to identify accurate measurement methods for adherence to cognitive-behavioral therapy. PubMed: 36229116.
Liness, S. et al. (2019). Clinical supervision in cognitive behavior therapy improves therapists' competence: A single-case experimental pilot study. PubMed: 32213046.
Walfish, S., McAlister, B., O'Donnell, P. & Lambert, M. J. (2012). An investigation of self-assessment bias in mental health providers. Psychological Reports, 110(2), 639–644. DOI: 10.2466/02.07.17.PR0.110.2.639-644.

Last updated: May 2026
