CTS-R: what it is, how it's scored, and why it should be on every CBT therapist's desk

The Cognitive Therapy Scale — Revised is the closest the field has to a working language for observable CBT competence. Most qualified therapists encountered it once in training and quietly retired it. That is a mistake, and the reasons are structural rather than sentimental.

13 March 2025 · 10 min read

The Cognitive Therapy Scale — Revised, almost always referred to in the field as the CTS-R, is one of those instruments that every CBT trainee encounters at some point in training and then quietly forgets after qualification. It appears in the assessment requirements for course portfolios. It gets attached to a placement supervisor's feedback once or twice. And then, for the great majority of qualified therapists, it slides into the same drawer as the rest of training-era paperwork and is not retrieved.

This is a mistake, and not for sentimental reasons. The CTS-R is not sacred and it is not perfect. But it is the only widely used instrument in CBT that describes what good practice looks like in terms a second observer can recognise. Without something like it, the field has no shared vocabulary for distinguishing competent CBT from CBT-flavoured conversation. With it, that vocabulary exists. Treating the CTS-R as a training artefact rather than a working tool is one of the quieter ways the post-qualification skill plateau gets baked in.

What CTS-R is

The CTS-R was developed at Newcastle in the late 1990s by Ivy-Marie Blackburn and colleagues as a revision of the original Cognitive Therapy Scale (CTS), which had been the field's standard observational measure for over a decade. The development team published the psychometric properties of the revised instrument in 2001 in Behavioural and Cognitive Psychotherapy. The motivation for the revision was specific: the original CTS had known limitations around item overlap, inter-rater reliability for some items, and the calibration of the rating anchors. The CTS-R was an attempt to address these without losing the structure that had made the original useful.

The instrument is a 12-item observational rating scale. Each item is rated on a 0–6 scale, with anchored behavioural descriptors at each scale point. The total available score is therefore 72; the standard threshold for "competent" delivery in most accreditation contexts is 36, with higher scores indicating progressively more skilful practice.
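The arithmetic in the paragraph above can be sketched in a few lines. This is an illustrative sketch only: the function names and the example ratings are hypothetical, whole-number ratings are assumed, and the threshold of 36 is simply the figure quoted above rather than a rule taken from any particular accreditation body.

```python
# Illustrative sketch; names and example data are assumptions, not part of the CTS-R.
NUM_ITEMS = 12             # twelve items, as described in the text
MAX_PER_ITEM = 6           # each item is rated 0-6
COMPETENCE_THRESHOLD = 36  # threshold quoted in the text


def total_score(ratings):
    """Sum twelve item ratings, each assumed to be a whole number from 0 to 6."""
    if len(ratings) != NUM_ITEMS:
        raise ValueError(f"expected {NUM_ITEMS} ratings, got {len(ratings)}")
    for r in ratings:
        if not (isinstance(r, int) and 0 <= r <= MAX_PER_ITEM):
            raise ValueError(f"rating out of range: {r!r}")
    return sum(ratings)


def meets_threshold(ratings, threshold=COMPETENCE_THRESHOLD):
    """True if the total reaches the chosen threshold."""
    return total_score(ratings) >= threshold


# Hypothetical session: mostly 3s, slightly weaker on some later items.
example = [3, 3, 4, 3, 4, 3, 3, 2, 2, 3, 3, 3]
print(total_score(example), "out of", NUM_ITEMS * MAX_PER_ITEM)
print("meets threshold:", meets_threshold(example))
```

A total that just reaches the threshold, as in this made-up example, says nothing about which items carried it, which is why the per-item pattern matters.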

The 12 items cover two broad domains. The first six items are generic therapy competencies: skills that any reasonable psychotherapy would expect to see, operationalised in CBT-specific terms. The remaining six are CBT-specific competencies: the techniques and conceptual moves that distinguish CBT from other modalities.

The generic items are: agenda setting and adherence, feedback, collaboration, pacing and efficient use of time, interpersonal effectiveness, and eliciting appropriate emotional expression. The CBT-specific items are: eliciting key cognitions, eliciting behaviours, guided discovery, conceptual integration, application of change methods, and homework setting.

Each item is rated independently. A session can be strong on alliance and pacing whilst being weak on guided discovery and conceptual integration, and the rating must surface that pattern rather than smoothing it into a single global impression.
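One way to keep that pattern visible is to store ratings per item and report domain subtotals alongside the lowest-rated items. The sketch below assumes the item names and grouping given above; the `profile` function and the example ratings are hypothetical.

```python
# Illustrative sketch; the grouping follows the article, the function is hypothetical.
GENERIC = [
    "agenda setting and adherence",
    "feedback",
    "collaboration",
    "pacing and efficient use of time",
    "interpersonal effectiveness",
    "eliciting appropriate emotional expression",
]
CBT_SPECIFIC = [
    "eliciting key cognitions",
    "eliciting behaviours",
    "guided discovery",
    "conceptual integration",
    "application of change methods",
    "homework setting",
]
ALL_ITEMS = GENERIC + CBT_SPECIFIC


def profile(ratings):
    """Return domain subtotals and the three lowest-rated items.

    `ratings` maps each item name to a 0-6 rating.
    """
    missing = [item for item in ALL_ITEMS if item not in ratings]
    if missing:
        raise ValueError(f"missing ratings for: {missing}")
    return {
        "generic": sum(ratings[i] for i in GENERIC),
        "cbt_specific": sum(ratings[i] for i in CBT_SPECIFIC),
        # sorted() is stable, so tied items keep their listed order
        "lowest_items": sorted(ALL_ITEMS, key=lambda i: ratings[i])[:3],
    }


# Hypothetical session: strong generic skills, weaker CBT-specific technique.
session = dict(zip(ALL_ITEMS, [5, 5, 5, 5, 5, 5, 4, 4, 2, 2, 3, 4]))
print(profile(session))
```

Reporting the subtotals and the weakest items next to the total keeps a strong-alliance, weak-technique session from disappearing into a single number.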

Why anchored descriptors are the point

It is worth saying directly what makes the CTS-R do useful work, because the answer is not the existence of twelve items. Twelve items is a structural choice; the actual mechanism that makes the instrument worth using is the behavioural anchoring of the scale points.

A rating scale that uses labels like "good", "very good", and "excellent" measures nothing reliably. Two raters using such a scale will, on average, agree only in the loosest sense, because "very good" is not a description of behaviour; it is a description of the rater's impression. Anchored descriptors solve this by specifying what behaviour the rater should observe in order to assign a given score.

To take a worked example: agenda setting (item 1) is rated 0 if no agenda is set. It is rated 2 if an agenda is set but is incomplete or not negotiated with the client — the therapist sets it, the client passively agrees. It is rated 4 if an agenda is set collaboratively, with the client contributing items and the therapist linking the session content to the agenda as the session progresses. It is rated 6 if the agenda is set collaboratively, items are prioritised explicitly, time is allocated, and the session is actively managed against the agenda throughout, with renegotiation if necessary.

The point of the anchoring is that two raters watching the same session, knowing the descriptors, can be expected to converge on similar ratings. They will not always; inter-rater reliability is a working problem that any continuous use of CTS-R has to address. But they will converge much more closely than two raters operating on global impressions, and the convergence becomes a measurable property of the rating process rather than an aspirational one.
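A rough way to see convergence as a measurable property is to compare two raters' item scores for the same session. The sketch below is a simple descriptive check under assumed data, not the statistic a formal reliability study would report (those typically use measures such as intraclass correlations). The function name and both sets of ratings are hypothetical.

```python
# Illustrative sketch; a descriptive check, not a formal reliability statistic.
def agreement(rater_a, rater_b, flag_gap=2):
    """Compare two raters' item scores for one session.

    Returns the proportion of exact matches, the proportion within one
    scale point, the mean absolute difference, and the 1-based numbers
    of items whose ratings differ by at least `flag_gap` points.
    """
    if not rater_a or len(rater_a) != len(rater_b):
        raise ValueError("both raters must score the same non-empty item list")
    diffs = [abs(a - b) for a, b in zip(rater_a, rater_b)]
    n = len(diffs)
    return {
        "exact": sum(d == 0 for d in diffs) / n,
        "within_one": sum(d <= 1 for d in diffs) / n,
        "mean_abs_diff": sum(diffs) / n,
        "items_to_discuss": [i + 1 for i, d in enumerate(diffs) if d >= flag_gap],
    }


# Hypothetical ratings of the same session by two raters.
rater_a = [4, 3, 4, 3, 5, 3, 4, 3, 2, 3, 3, 4]
rater_b = [4, 4, 4, 3, 4, 3, 3, 3, 4, 3, 3, 4]
print(agreement(rater_a, rater_b))
```

The flagged item numbers are the useful output for calibration: they point the two raters to the specific descriptors they are reading differently.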

Fig. 1. A single CTS-R item, anchored: Item 9 (Guided Discovery), shown with three anchor descriptors. The instrument's value is not the 0–6 scale; it is that each score is grounded in observable behaviour. That is what lets two raters agree on whether a session was a 4 or a 5. "Good" and "very good" describe nothing; "score 4 looks like this; score 6 looks like that" lets two raters disagree in identifiable, calibratable ways. Illustrative; descriptors paraphrased. After Blackburn et al. (2001).

This is why the CTS-R is the spine of the continuous fidelity infrastructure argument for trial work, and why it sits at the centre of competence-focused supervision: it gives the supervision conversation something to converge on other than the supervisor's general sense of how the trainee is doing.

How the items actually score, with examples

Three of the twelve items repay walking through in detail, because they tend to be the ones trainees and qualified therapists find most opaque.

Agenda setting (item 1) is the most commonly misjudged item in self-assessment because therapists tend to assume that "I set an agenda" is the relevant behaviour. The descriptor scaling makes clear that this is not the case. The 2 versus 4 versus 6 distinction is about how the agenda is set, negotiated, and used during the session. A therapist who lists agenda items at the start, gets a polite nod from the client, and proceeds to follow their own intended structure regardless is scoring 2, not 4. A therapist who actively invites client priorities, surfaces tensions between agenda items, and renegotiates timing as the session unfolds is scoring 4 or 5. Agenda setting in CBT is not an administrative formality; it is a collaborative procedure, and the rating reflects whether the procedure was actually carried out or merely gestured at.

Guided discovery (item 9) is the item that most starkly separates confident-sounding sessions from genuinely skilful ones. The relevant behaviour is the use of questions, summaries, and synthesis to help the client arrive at a new understanding, not the use of questions to lead the client toward the understanding the therapist already had. The descriptor scaling makes the distinction explicit. A therapist who asks Socratic-style questions but whose questions are clearly steering the client to a predetermined conclusion is delivering psychoeducation in question form, which scores lower than genuine guided discovery. A therapist whose questioning opens up information the therapist did not already hold, who summarises what the client has said in a way that surfaces patterns the client can recognise, and who allows the client to draw the synthesis is delivering guided discovery in the technical sense.

This is the item where Walfish-style self-assessment bias bites hardest. Almost every CBT therapist believes they do guided discovery well. Observed ratings tend to suggest that a great deal of what is called Socratic dialogue is, on examination, Socratic-shaped delivery of conclusions the therapist had already formed. The CTS-R rater notices this because the descriptors require them to.

Conceptual integration (item 10) is the item that most distinguishes experienced CBT from trainee CBT. It rates the degree to which the therapist links the session's content to the case formulation, makes explicit the connections between cognitions, behaviours, emotions, and the maintaining cycles, and uses the formulation as a live tool during the session rather than as a document filed at the start of treatment. A trainee who has been taught formulation but does not yet integrate it into session work in real time will score around 2–3 on this item even when their generic skills are strong. A therapist whose conceptual integration scores in the 4–6 range is doing something qualitatively different — the formulation is not being consulted, it is being deployed.

The items most exposed to gaming are the items where the descriptor relies on what the therapist says they are doing rather than what an observer can directly verify. The items hardest to game — guided discovery, conceptual integration, application of change methods — are exactly the items that distinguish genuine CBT competence. This is part of why the instrument has held up.

What CTS-R is good for

The CTS-R earns its place on the desk by being useful for several distinct purposes, not all of which are obvious from how the instrument is introduced in training.

Structured supervision feedback. Liness and colleagues' 2019 single-case experimental design (PubMed: 32213046) is the most relevant evidence here. Therapists whose supervision was structured around CTS-R-referenced feedback on observed sessions improved on the CTS-R items being targeted, with the improvements specific to the supervised skills rather than diffusing into a general sense of "doing better." Supervision as case-discussion does not move CTS-R ratings consistently. Supervision as competency-referenced feedback on observed performance does. This is one of the clearer findings in the supervision-effects literature and it has direct implications for how supervision time is best spent.

Accreditation portfolios. Most CBT accreditation bodies require evidence of competence at a defined threshold. The CTS-R is the standard instrument for this evidence. A portfolio that includes externally-rated CTS-R sessions across a range of presentations is the evidence the accreditation route was designed around.

Trial fidelity monitoring. Continuous fidelity infrastructure for CBT trials, argued at more length in the trial fidelity piece, depends on having a defined competency framework against which delivery can be independently rated. The CTS-R is the standard choice for CBT trials precisely because it is the most validated and most widely-calibrated instrument available for the purpose.

Self-assessment, carefully. The standard caveat applies here — therapists who rate themselves on the CTS-R systematically inflate their ratings relative to independent raters, which is the same self-assessment problem documented across the drift literature. But self-rating against CTS-R is not useless; it surfaces the dimensions on which the therapist needs to be paying attention, even if their score is not directly comparable to an independent rater's. The instrument provides the scaffolding for noticing one's own practice.

What CTS-R is not

The CTS-R is the best instrument available for its purpose. It is not a perfect instrument, and pretending otherwise is the wrong way to defend it.

The behavioural-cognitive balance encoded in the items has been argued over. The instrument was designed in the context of CBT for depression and its translation to anxiety disorder protocols — particularly the exposure-heavy ones — is imperfect. An imaginal exposure session for PTSD does not look like a Beck-style depression session, and several of the CTS-R items have to be interpreted somewhat loosely to fit. There is a parallel literature on whether the CTS-R adequately captures third-wave CBT delivery — ACT, compassion-focused work, behaviourally heavy interventions — and the honest answer is that it captures some elements well and others less well.

Item weighting is implicit rather than explicit. All twelve items contribute equally to the total score, which assumes that a session with strong alliance but weak guided discovery is equivalent to a session with weak alliance but strong guided discovery. Whether that assumption is defensible depends on what the rating is for. For accreditation thresholds, equal weighting is probably defensible. For predicting clinical outcome from a single session, it is less clearly so.
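The equal-weighting assumption is easy to see with two contrasting profiles. In the sketch below, both sets of ratings and the alternative weights are hypothetical. The CTS-R itself defines no such weights, and the doubled weighting is shown only to illustrate how the comparison changes once the assumption is dropped.

```python
# Illustrative sketch; the profiles and the alternative weights are hypothetical.
def weighted_total(ratings, weights=None):
    """Weighted sum of item ratings; equal weights of 1 by default."""
    if weights is None:
        weights = [1] * len(ratings)
    if len(weights) != len(ratings):
        raise ValueError("need exactly one weight per item")
    return sum(r * w for r, w in zip(ratings, weights))


# Items 1-6 generic, items 7-12 CBT-specific, in the order listed in the article.
strong_generic = [5, 5, 5, 5, 5, 5, 2, 2, 2, 2, 2, 2]
strong_technique = [2, 2, 2, 2, 2, 2, 5, 5, 5, 5, 5, 5]

# Under the instrument's equal weighting the two sessions are indistinguishable.
print(weighted_total(strong_generic), weighted_total(strong_technique))

# A made-up scheme that doubles the CBT-specific items separates them.
# The result is no longer on the 0-72 scale.
doubled = [1] * 6 + [2] * 6
print(weighted_total(strong_generic, doubled), weighted_total(strong_technique, doubled))
```

Whether any alternative weighting is justified is an empirical question the sketch does not answer; it only shows that the total hides the difference between the two sessions.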

Inter-rater reliability is not automatic. Untrained raters using the CTS-R do not agree closely; trained, calibrated raters do, and the calibration is itself an ongoing piece of work rather than a one-off event. Any use of CTS-R for high-stakes rating — accreditation, trial fidelity — has to include a rater calibration layer, not just a rating throughput layer.

These are real limitations and they should be acknowledged. They are not, individually or collectively, reasons to abandon the instrument. They are reasons to use it carefully and to keep an eye on the methodological literature as it evolves.

Why it belongs on the desk

The case for keeping the CTS-R live as a working tool, rather than retiring it after training, comes back to the structural problem the therapist drift literature documents at length.

Therapists drift. They drift not because they have forgotten their training but because the ongoing structures that would keep their behaviour aligned with the evidence base are usually absent after qualification. Self-monitoring is unreliable; the Walfish self-assessment work and the Bearman 2022 adherence-measurement work converge on the same finding from different angles. What checks the drift is external observation referenced to a competency framework.

The CTS-R is the competency framework. It is the working language in which "I delivered good CBT in that session" can mean something other than the therapist's general feeling about how the session went. Without a framework like the CTS-R, the conversation about CBT competence has no shared vocabulary; it collapses into preferences and impressions. With the CTS-R, the conversation has objects in it — agenda setting, guided discovery, conceptual integration — that two people can examine together.

This is why the instrument should be on every CBT therapist's desk, not because every session needs to be formally rated but because the items shape what the therapist pays attention to. A therapist who has internalised the descriptors of conceptual integration is paying different attention during their sessions than one who has not. A therapist who has been rated on the CTS-R recently against an external rater knows where their actual score sits relative to where they had assumed it sat — which is the corrective information the field has been struggling to deliver post-qualification for thirty years.

The alternative is the implicit clinical judgement of one's own competence that the drift and self-assessment literatures keep showing to be inadequate. The CTS-R is not a magical instrument. It is, however, the best available working language for the kind of competence the field is trying to develop and maintain. That makes it worth keeping.

Supervisia Train is built on the CTS-R.

Every drill the trainee runs is scored against the same twelve items, with the same anchored descriptors, that the supervisor's portfolio rating uses. Trainer commentary identifies the specific items where the practice was strong and where the edge is. Across a programme of drills, the platform surfaces the items that consistently score lower than the rest — which is the diagnostic information the deliberate-practice framework requires and which most post-qualification practice does not have a way of generating. The CTS-R is not a one-off training exercise; it is the working language of the practice, and Train is set up to make it that.

References

Blackburn, I.-M., James, I. A., Milne, D. L., Baker, C., Standart, S., Garland, A. & Reichelt, F. K. (2001). The Revised Cognitive Therapy Scale (CTS-R): Psychometric properties. Behavioural and Cognitive Psychotherapy, 29(4), 431–446. DOI: 10.1017/S1352465801004040.
Liness, S. et al. (2019). Clinical supervision in cognitive behavior therapy improves therapists' competence: A single-case experimental pilot study. PubMed: 32213046.
Bearman, S. K. et al. (2022). A randomized trial to identify accurate measurement methods for adherence to cognitive-behavioral therapy. PubMed: 36229116.
Walfish, S., McAlister, B., O'Donnell, P. & Lambert, M. J. (2012). An investigation of self-assessment bias in mental health providers. Psychological Reports, 110(2), 639–644. DOI: 10.2466/02.07.17.PR0.110.2.639-644.
Waller, G. & Turner, H. (2016). Therapist drift redux: Why well-meaning clinicians fail to deliver evidence-based therapy, and how to get back on track. Behaviour Research and Therapy, 77, 129–137. DOI: 10.1016/j.brat.2016.01.007. PubMed: 26752326.

Last updated: May 2026
