
Catching drift in multi-site CBT trials before it contaminates your effect size

In a multi-site CBT trial, the question is not whether some sites will drift. It is which sites and by how much. If you haven't built an instrument to see it, you will read it as treatment failure.

22 January 2026 · 10 min read

In a multi-site CBT trial of any meaningful size, the question is not whether some sites will drift. It is which sites, by how much, and starting when. If you have not built an instrument to see the drift, you will read it at analysis as a smaller-than-expected pooled effect, attribute it to patient mix, comorbidity load, or the universal "real-world generalisability" caveat, and quietly absorb the contamination into the published number.

This is one of the more durable forms of measurement bias in the CBT trials literature, and it is sustained mostly by the fact that the field has no convenient way to make it visible.

Why multi-site amplifies the drift problem

The single-site drift problem is by now reasonably well documented. Glen Waller's 2009 paper in Behaviour Research and Therapy established that qualified, trained therapists drift away from evidence-based protocols in systematic ways: under-delivering exposure, softening behavioural experiments, replacing the harder behavioural components with the more comfortable cognitive ones, accepting non-engagement with homework rather than addressing it. Waller and Turner's 2016 redux paper added the analytic point that this is not knowledge-driven — it is shaped by therapist beliefs, emotional avoidance, and the surrounding social context.

The surrounding social context is where the multi-site amplification happens.

Each site in a multi-site trial has its own local culture. The lead clinician at the site has a particular practice style; the supervision arrangements vary in quality and frequency; the team's prior exposure to the intervention being tested differs; the local interpretation of ambiguous parts of the manual stabilises around whatever the senior people on site already do. None of these site-level factors are visible from the central trial team's vantage point unless infrastructure is built to make them visible.

The consequence is that drift in a multi-site trial is not uniform. It proceeds at site-specific rates and in site-specific directions. One site under-delivers exposure because the lead clinician has long believed exposure is too distressing for "fragile" patients. Another site over-relies on Socratic dialogue because that is what the team is most fluent in. A third site adheres tightly because the supervisor happens to be particularly committed to manual fidelity. The trial's pooled effect, calculated at analysis, averages across these site-level patterns without being able to attribute the variance to its actual sources.

This matters because the trial team will then attempt to interpret a smaller-than-expected effect. The instinct is to explain it through patient characteristics — sicker sample, more comorbidity, less treatment-naive, more chronic. Sometimes these explanations are correct. Often they are not. The actual story — one or two sites drifted to a degree that pulled the pooled effect down by enough to matter — is invisible because it was never instrumented.

It is, in this sense, a measurement gap masquerading as a clinical finding.

What happens without detection

Take a concrete sketch. A six-site trial of CBT for a presenting condition, two therapists per site, eight participants per therapist over the trial's recruitment window. Ninety-six participants in total in the active arm. Suppose four of the six sites deliver the manual at reasonable fidelity, one site under-delivers a critical component (say, the behavioural experiments) consistently, and one site is broadly fidelity-aligned but is delivering at the lower bound of dose.

At analysis, the active arm under-performs the trial's pre-registered effect size by perhaps a third. The discussion section attempts to account for the gap. The trial team — sensibly and conscientiously — examines patient characteristics, looks at recruitment timing, checks for early-dropout patterns. None of these analyses identify the actual cause, because the actual cause is in two sites' delivery patterns and the trial did not have an instrument that could see delivery patterns.

The trial publishes. It enters the meta-analytic literature with an effect size that under-states the intervention. The next planned trial of the same intervention runs with sample-size calculations powered against the inherited effect size and may itself be under-powered for the intervention's actual effect. Meanwhile, the two drifted sites continue with whatever delivery pattern caused the contamination, and the intervention's reputation in the field is quietly degraded by a measurement problem that was never named.
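The dilution in this sketch can be made concrete with a few lines of arithmetic. The block below is a minimal illustration, not an analysis plan: the per-site effect sizes and the expected effect are hypothetical values chosen for the example, and the pooled figure is a simple sample-size-weighted mean rather than the mixed model a real trial would fit.

```python
# Illustrative per-site standardised effects (hypothetical values, not trial data).
EXPECTED_EFFECT = 0.60  # assumed pre-registered effect size

site_effects = {
    "A": 0.60, "B": 0.60, "C": 0.60, "D": 0.60,  # delivered at reasonable fidelity
    "E": 0.15,  # under-delivers behavioural experiments (assumed value)
    "F": 0.35,  # fidelity-aligned but at the lower bound of dose (assumed value)
}
participants_per_site = 16  # 2 therapists x 8 participants, as in the sketch


def pooled_effect(effects, n_per_site):
    """Sample-size-weighted mean of site effects (a simplification of a pooled analysis)."""
    total_n = n_per_site * len(effects)
    return sum(e * n_per_site for e in effects.values()) / total_n


pooled = pooled_effect(site_effects, participants_per_site)
shortfall = 1 - pooled / EXPECTED_EFFECT

print(f"active-arm participants: {participants_per_site * len(site_effects)}")
print(f"pooled effect: {pooled:.2f}")
print(f"shortfall vs expected: {shortfall:.0%}")
```

With these assumed values, two drifted sites out of six pull the pooled estimate roughly a fifth below the expected effect, and nothing in the pooled number itself says which sites are responsible.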

Fig. 1. When site-level drift hides in the pooled effect

[Figure: per-site effect size for each of six sites, with the pooled effect shown as an orange dashed line.]

A six-site trial where four sites land near the expected effect and two drift below. Without per-site visibility, the trial reports a pooled effect (orange dashed line) that under-states the intervention. The drift is not in the patient mix or in the analysis; it is in two sites' delivery, invisible without instrumentation. Schematic. After Waller & Turner (2016) and Liness et al. (2019) on the trial-design implications of therapist drift.

This is not a hypothetical worry. It is the most plausible single explanation for a meaningful fraction of the "effectiveness gap" between efficacy trials of CBT and the routine-care effectiveness studies that consistently report smaller effects. Some of that gap is genuine — routine care really does have less infrastructure than trials do. Some of it is, almost certainly, drift that nobody had the instrument to detect.

What drift detection looks like in practice

Drift detection is not a single intervention; it is a stack of practices that together allow the trial to see and act on delivery variation. The components are not exotic, and they are easier to describe in isolation than to operationalise together.

Baseline CTS-R distribution established in calibration. Before any participant is randomised, each trial therapist's baseline competence on the trial intervention should be characterised on a recognised competency instrument (the Cognitive Therapy Scale-Revised, CTS-R, is the standard for CBT). This is not a pass/fail gate; it is the reference distribution against which subsequent ratings are interpreted. A therapist whose baseline calibration was at the lower bound of acceptable competence is a different drift risk from a therapist who calibrated comfortably mid-range, and the trial team should know which is which before recruitment opens.
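One way to hold that reference distribution is sketched below. The baseline totals are synthetic, and the competence floor and the "near floor" margin are assumptions for illustration; a real trial would set both in its protocol.

```python
from statistics import mean, stdev

# Synthetic baseline CTS-R totals (12 items scored 0-6, so totals range 0-72).
baseline_totals = {"A1": 46, "A2": 38, "B1": 51, "B2": 37, "C1": 44, "C2": 42}
COMPETENCE_FLOOR = 36   # assumed minimum acceptable total; set per protocol
NEAR_FLOOR_MARGIN = 3   # hypothetical margin defining "at the lower bound"

cohort_mean = mean(baseline_totals.values())
cohort_sd = stdev(baseline_totals.values())

# Therapists who calibrated close to the floor are a different drift risk.
near_floor = sorted(
    therapist
    for therapist, score in baseline_totals.items()
    if score - COMPETENCE_FLOOR <= NEAR_FLOOR_MARGIN
)


def z_against_baseline(score):
    """Position of a later rating within the calibration distribution."""
    return (score - cohort_mean) / cohort_sd


print("calibrated near the floor:", near_floor)
print(f"a later total of 35 sits at z = {z_against_baseline(35):.2f}")
```

The point of the design is that a later rating is read against the calibration cohort rather than against a bare pass mark.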

Ongoing session sampling, not just end-of-trial sampling. This is the structural point taken up in detail in the continuous fidelity infrastructure companion piece. Sampling rates of 5–10% of sessions, the historical norm, are too sparse to detect drift at the therapist level on a useful timescale, let alone at the site level. Higher sampling rates are now tractable in a way they were not a decade ago, and the design choice between low and high sampling is no longer purely a budget constraint — it is a design choice about what the trial will be able to see.
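The sparsity is easy to see with back-of-envelope numbers. The sketch below reuses the trial shape from the earlier six-site example and assumes twelve sessions per participant, a figure that is not given anywhere in this piece and is used purely for illustration.

```python
# Assumed trial shape; sessions_per_participant is hypothetical.
participants_per_therapist = 8
sessions_per_participant = 12
therapists_per_site = 2

sessions_per_therapist = participants_per_therapist * sessions_per_participant


def expected_rated(sessions, sampling_rate):
    """Expected number of sessions that receive an independent rating."""
    return sessions * sampling_rate


for rate in (0.05, 0.10, 0.30):
    per_therapist = expected_rated(sessions_per_therapist, rate)
    per_site = per_therapist * therapists_per_site
    print(f"rate {rate:.0%}: ~{per_therapist:.1f} rated sessions per therapist, "
          f"~{per_site:.1f} per site over the whole trial")
```

Under these assumptions, 5% sampling yields about five rated sessions per therapist across the entire recruitment window, which is too few to estimate a trend for that therapist while there is still time to act on it.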

Drift signals derived from the ratings, not just the ratings themselves. What the trial team needs at the monitoring stage is not a stack of individual session scores. It is signal extracted from those scores: declining technique use over time for a given therapist, lower CTS-R scores on specific behavioural items relative to the manual's expected emphasis, qualitative shifts in supervision audit notes, divergence between sites on the proportion of sessions that include a specific intervention component. Drift becomes detectable when it is visible as a pattern. Pattern visibility is what turns raw session ratings into an instrument the trial team can act on.
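As a minimal sketch of one such signal, the block below fits a least-squares trend to each therapist's scores on a single behavioural CTS-R item and flags a sustained decline. The ratings are synthetic and the slope threshold is a hypothetical value; a real protocol would need to justify its own threshold and account for rater noise.

```python
from statistics import mean


def slope(xs, ys):
    """Ordinary least-squares slope of ys on xs."""
    x_bar, y_bar = mean(xs), mean(ys)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den


# Synthetic ratings: (trial week, score on a behavioural CTS-R item, 0-6 scale).
ratings = {
    "therapist_1": [(1, 4), (4, 4), (8, 5), (12, 4), (16, 4)],
    "therapist_2": [(1, 4), (4, 4), (8, 3), (12, 3), (16, 2)],
}

SLOPE_THRESHOLD = -0.05  # hypothetical: points lost per week that counts as a signal

flagged = []
for therapist, series in ratings.items():
    weeks = [week for week, _ in series]
    scores = [score for _, score in series]
    trend = slope(weeks, scores)
    if trend < SLOPE_THRESHOLD:
        flagged.append(therapist)
    print(f"{therapist}: trend {trend:+.3f} points/week")

print("flagged:", flagged)
```

No single score for the second therapist would look alarming on its own; the signal is in the trend, which is why the monitoring view needs derived patterns rather than a stack of individual ratings.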

Site-level dashboards visible to the trial PI in real time. The PI of a multi-site trial cannot supervise every session at every site; that is what the site leads are for. What the PI needs is a view across sites that surfaces divergence as it develops — a dashboard, in the basic sense, that shows whether site B is delivering the behavioural component at a meaningfully different rate from sites A, C, D, E, and F. This is what makes the trial team's response possible in time to matter, rather than retrospective.

Therapist-level views as well as site-level ones. Site-level drift can be driven by a single therapist within an otherwise compliant site. The PI's view needs to be able to drill in. A site that looks fine at the aggregate may contain one therapist whose delivery is the actual contamination source.
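The site view and the drill-down are the same aggregation at two levels of grouping. The sketch below uses synthetic rated sessions and a hypothetical record layout (site, therapist, whether the session included a behavioural experiment); it is meant to illustrate the idea, not any particular dashboard.

```python
from collections import defaultdict

# Synthetic rated sessions: (site, therapist, session included a behavioural experiment?).
rated_sessions = (
    [("A", "A1", True)] * 9 + [("A", "A1", False)] * 1
    + [("A", "A2", True)] * 8 + [("A", "A2", False)] * 2
    + [("B", "B1", True)] * 10
    + [("B", "B2", True)] * 5 + [("B", "B2", False)] * 5
)


def inclusion_rate(sessions, key_index):
    """Share of rated sessions containing the component, grouped by site (0) or therapist (1)."""
    counts = defaultdict(lambda: [0, 0])  # key -> [included, total]
    for row in sessions:
        counts[row[key_index]][0] += row[2]
        counts[row[key_index]][1] += 1
    return {key: included / total for key, (included, total) in counts.items()}


by_site = inclusion_rate(rated_sessions, 0)
by_therapist = inclusion_rate(rated_sessions, 1)

print("site view:     ", by_site)
print("therapist view:", by_therapist)
```

In this synthetic data the two sites differ only modestly at the aggregate, while one therapist at site B includes the component in half of the rated sessions. The site-level number alone would understate where the problem sits.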

None of these components is in tension with standard trial methodology. They are extensions of what trials already aspire to. The reason they have not been the default is the historical infrastructure cost. The reason they are now feasible is that infrastructure cost has come down sharply.

Detection without intervention is just bookkeeping

One critical point is easy to miss: a trial that detects drift but has no protocol for acting on it has merely improved its post-hoc explanatory capacity. That is worth something. Being able to say in the discussion that two sites drifted, and to quantify the effect on the pooled estimate, is a real scientific contribution. But it is a much weaker contribution than catching the drift early enough to correct it.

The corrective action question is therefore part of the design, not separate from it. The trial protocol needs to specify what happens when drift is detected: a booster training intervention, focused supervision input directed at the specific drifted behaviours, re-calibration of the therapist against the manual, in serious cases removal of the therapist from the trial arm. Each of these has its own design implications — booster training affects the per-protocol analysis, re-calibration mid-trial affects how the data are stratified — but the right place to make those decisions is in the protocol before recruitment opens, not in the discussion section after the data are in.
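One way to make that specification concrete is to write the triggers down as data before recruitment opens. The sketch below is an assumption-laden illustration: the signal name, the thresholds, and the ordering of actions are all hypothetical and would be set by the trial's own protocol.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trigger:
    signal: str       # name of the observed drift signal
    threshold: float  # value at or below which the action fires
    action: str       # corrective step named in the protocol


# Hypothetical rules, ordered from most to least severe; thresholds are illustrative only.
PROTOCOL_TRIGGERS = [
    Trigger("behavioural_component_rate", 0.30, "remove therapist from trial arm"),
    Trigger("behavioural_component_rate", 0.50, "re-calibrate against the manual"),
    Trigger("behavioural_component_rate", 0.70, "focused supervision on drifted behaviours"),
]


def corrective_action(signal, observed_value, triggers=PROTOCOL_TRIGGERS):
    """Return the first (most severe) action whose threshold the observed value meets."""
    for trig in triggers:
        if trig.signal == signal and observed_value <= trig.threshold:
            return trig.action
    return None  # no trigger met: continue routine monitoring


print(corrective_action("behavioural_component_rate", 0.45))
print(corrective_action("behavioural_component_rate", 0.85))
```

Writing the rules as a fixed table has the side benefit that the same table can be cited in the analysis plan, so that any booster training or re-calibration is traceable to a pre-specified trigger rather than to a mid-trial judgment call.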

Liness and colleagues' 2019 single-case experimental study, PubMed 32213046, is directly relevant here. They demonstrated that structured CBT supervision improves CTS-R competence ratings over time, with the gains specific to the supervised skills. The mechanism is exactly what trial drift-correction needs: supervision as a structured, fidelity-referenced feedback loop, not supervision as case discussion. The Liness paper is single-case in design but the mechanism it documents is the same one that drift-correction protocols should be built around.

Bearman and colleagues' 2022 trial, PubMed 36229116, adds an important constraint on which fidelity signals the trial can trust as corrective triggers. Therapist self-report systematically over-states adherence relative to direct observation. A drift-detection protocol that triggers booster training only when therapists rate their own adherence as low will almost never fire. Triggers must be based on observation, not self-report. Behavioural rehearsal aligns more closely with observation than self-report does, which makes it a useful adjunct measurement option when full observation rating is not available for every session.
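The practical consequence for trigger design can be shown with a toy comparison. The adherence scores below are synthetic and the threshold is hypothetical; the only point is what happens when the same rule is fed self-report instead of observation.

```python
ADHERENCE_THRESHOLD = 0.70  # hypothetical trigger level on a 0-1 adherence score

# Synthetic scores per therapist: self-rated vs independently observed adherence.
adherence = {
    "T1": {"self_report": 0.90, "observed": 0.85},
    "T2": {"self_report": 0.88, "observed": 0.55},
    "T3": {"self_report": 0.92, "observed": 0.62},
}


def triggered(scores, source, threshold=ADHERENCE_THRESHOLD):
    """Therapists whose adherence from the given source falls below the threshold."""
    return sorted(t for t, s in scores.items() if s[source] < threshold)


print("self-report triggers:", triggered(adherence, "self_report"))
print("observation triggers:", triggered(adherence, "observed"))
```

If self-report runs consistently high, as in this synthetic data, a self-report trigger stays silent for exactly the therapists the observation-based trigger would flag.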

The strategic argument for funders

There is a parallel argument worth making to the audience that ultimately decides whether trials can afford this infrastructure: research funders.

A funded trial that produces an uninterpretable result is a worse outcome for the funder than a funded trial that produces a slightly more expensive but interpretable one. Uninterpretable in this context means: a smaller effect than expected, with no instrumented explanation of why, and therefore no clear next step for either the science or the policy. Drift-detection infrastructure is the difference between that outcome and one in which the trial team can credibly say either "the intervention works as expected; here is the fidelity-rated delivery data to support that claim" or "the intervention under-performed in our trial and we can characterise where and why the delivery diverged."

Both of those outcomes are scientifically useful. Neither is possible without the instrumentation. The marginal cost of building the instrumentation into the design is small relative to the value of moving from uninterpretable to interpretable trial output.

This is the same argument that gets made for adequate sample size, blinding, and pre-registration. Drift-detection infrastructure belongs in the same category — a methodological investment that improves the interpretability of the trial's output by enough to be worth the cost.

What this means for trial design now

The substantive design implications of taking the drift-detection argument seriously are not difficult to articulate. They are uncomfortable mainly because they impose more methodological discipline on trial design than has historically been the default.

Trials need to budget for continuous fidelity capture from protocol design forward, not as an add-on. They need to plan for independent rater capacity throughout the trial's treatment phase, not just at the end. They need to specify, in the protocol, what observable signals will trigger corrective action and what that action will consist of. They need to commit to site-level transparency on delivery patterns, accepting that some sites' deliveries will look worse than others and that this visibility is the trade-off for being able to correct in time. They need to treat their rater cohort as a methodologically managed group requiring calibration and re-calibration, not as a pool of contractors.

None of this is novel methodology. It is the application of measurement standards we already require in other domains of clinical research to the specific problem of CBT trial fidelity. It is overdue.

Supervisia Research provides the infrastructure layer for drift-detection workflows in multi-site CBT trials.

The platform supports continuous session capture, structured rater queues with calibration management, site- and therapist-level drift dashboards, and the protocol-specified corrective workflows that turn detected drift into actionable feedback to the trial therapists. It is designed to be brought in from protocol design through to fidelity reporting, including bespoke trial training programmes that match the intervention manual rather than generic CBT competencies.


References

Waller, G. (2009). Evidence-based treatment and therapist drift. Behaviour Research and Therapy, 47(2), 119–127. DOI: 10.1016/j.brat.2008.10.018. PubMed: 19036354.
Waller, G. & Turner, H. (2016). Therapist drift redux: Why well-meaning clinicians fail to deliver evidence-based therapy, and how to get back on track. Behaviour Research and Therapy, 77, 129–137. DOI: 10.1016/j.brat.2016.01.007. PubMed: 26752326.
Bearman, S. K. et al. (2022). A randomized trial to identify accurate measurement methods for adherence to cognitive-behavioral therapy. PubMed: 36229116.
Liness, S. et al. (2019). Clinical supervision in cognitive behavior therapy improves therapists' competence: A single-case experimental pilot study. PubMed: 32213046.
Borrelli, B., Sepinwall, D., Ernst, D., Bellg, A. J., Czajkowski, S., Breger, R., DeFrancesco, C., Levesque, C., Sharp, D. L., Ogedegbe, G., Resnick, B. & Orwig, D. (2005). A new tool to assess treatment fidelity and evaluation of treatment fidelity across 10 years of health behavior research. Journal of Consulting and Clinical Psychology, 73(5), 852–860. DOI: 10.1037/0022-006X.73.5.852. PubMed: 16287385.

Last updated: May 2026
