This blog was kindly contributed by Dennis Sherwood. Dennis has been writing for HEPI about A levels and exam marking over the last few years. You can find Dennis on Twitter @noookophile.
Suppose that Mr Sherwood is right, and that, on average, one exam grade in every four is wrong. Many GCSE candidates take exams in eight subjects, and so if what Mr Sherwood is alleging is true, then two grades are likely to be wrong. But they’ll be wrong both ways – most probably one up, one down. So it all comes out in the wash…
I didn’t make that up. It’s real. Those words were said several years ago at a meeting about the (un)reliability of school exam grades. And I sat in silence, and in dismay, as everyone (save for just a very few, also silent) nodded in agreement and the discussion went on to other matters.
I was reminded of that episode when I read this passage in the new book Noise – A Flaw in Human Judgement by Daniel Kahneman, Olivier Sibony and Cass Sunstein:
A frequent misconception about unwanted variability in judgements is that it doesn’t matter, because random errors supposedly cancel each other out. … If two felons who both should be sentenced to five years in prison receive sentences of three and seven years, justice has not, on average, been done. In noisy systems, errors do not cancel out. They add up.
As this extract demonstrates, this book addresses a topic of considerable significance: what the authors call ‘noise’, the ‘unwanted variability of judgements’.
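The arithmetic behind the sentencing example in the extract can be sketched in a few lines of Python. This is only an illustration: the variable names are hypothetical, and using the mean absolute error and the mean squared error as measures of 'how wrong' is an assumption of the sketch, not something taken from the extract.

```python
# Illustrative sketch of the sentencing example quoted above:
# two offenders who should each receive five years are given three and seven.
deserved = [5, 5]   # years each offender "should" receive (from the extract)
given = [3, 7]      # years actually handed down (from the extract)

# Signed error for each sentence: negative is too lenient, positive too harsh.
errors = [g - d for g, d in zip(given, deserved)]

# The signed errors cancel out on average...
mean_error = sum(errors) / len(errors)

# ...but the size of the errors does not: each sentence is still two years out.
mean_abs_error = sum(abs(e) for e in errors) / len(errors)
mean_sq_error = sum(e * e for e in errors) / len(errors)

print(mean_error)      # 0.0
print(mean_abs_error)  # 2.0
print(mean_sq_error)   # 4.0
```

The average signed error is zero, which is the 'it all comes out in the wash' view; yet both individual sentences are wrong by two years, which is the view that matters to whoever is being sentenced.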
‘Judgements’ embrace a broad spectrum of activities, including, for example, those relating to recruitment, predictions such as next week’s weather or next year’s GDP, whether an idea being tabled is a ‘good’ one or not, an employee’s performance review, the interpretation of medical diagnostics, matching fingerprints, the mark given to a student’s essay. In fact, very many aspects of everyday life. The shared feature is some form of human judgement, often that of an ‘expert’, often drawing on evidence that might be incomplete. And with the sting that, in many instances, those judgements have a significant impact on whoever is being judged.
Central to the book are the observations that different people will come to different judgements given the same evidence and that even the same person can come to a different judgement using the same evidence under different circumstances. This ubiquitous ‘variability of judgement’ is, the authors argue, ‘unwanted’, for its existence implies that the outcome is a lottery of whoever happens to be the ‘judge’, and the accident of the circumstances in which that judgement is being made, for example, whether the ‘judge’ happens to be feeling calm or stressed. The book argues – validly, in my view – that such lotteries and accidents are intrinsically bad. Where there is an objective truth, many of the outcomes will be wrong; where there is no objective truth, the outcomes are at best uncertain, at worst unfair, or to quote the authors once more:
Unwanted variability in judgements that should ideally be identical can create rampant injustice, high economic costs, and errors of many kinds.
Importantly, the authors distinguish between bias (a systematic tendency, of which those exercising judgement might be unaware) and noise (random variability), with both contributing to judgement errors. Furthermore, both should be identified, and every effort made to reduce them. The authors note, however, that bias tends to steal the limelight (all that anti-bias training, for example), whereas noise is often ignored, but is at least as important in contributing to errors.
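One way to see how bias and noise both feed into error is the standard statistical decomposition of mean squared error into squared bias plus variance. The sketch below applies it to a made-up case: the 'true' mark and the five markers' marks are invented purely for illustration, and treating the variance of the marks as the measure of noise is an assumption of the sketch.

```python
from statistics import fmean

# Hypothetical data: five markers mark the same script, whose "true" mark
# is assumed (for illustration only) to be 60.
true_mark = 60
marks = [58, 66, 62, 70, 64]

errors = [m - true_mark for m in marks]

# Bias: the average error, i.e. the systematic tendency to over- or under-mark.
bias = fmean(errors)

# Noise: the variability of the errors around that average (here, the variance).
noise_variance = fmean([(e - bias) ** 2 for e in errors])

# Overall error: the mean squared error of the marks.
mse = fmean([e ** 2 for e in errors])

print(bias)            # 4.0
print(noise_variance)  # 16.0
print(mse)             # 32.0, which equals bias**2 + noise_variance
```

In this invented example, bias and noise contribute equally to the overall error, so removing the bias alone would leave half of the error untouched.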
Exploring noise in depth, the book provides extensive, often alarming, evidence of its existence, and the damage it can cause. So, for example, the book reports an American study in which fifty experienced (legal) judges were each asked to recommend sentences for criminals convicted in hypothetical cases. The sentences for the same heroin dealer varied from one year in prison to ten; for the same bank robber, imprisonment from five years to eighteen; for the same extortioner, from three years plus no fine to twenty years plus a fine of $65,000. All these sentences were allowable within the US laws of the time; all show how different, experienced, expert (real!) judges can reach different judgements from the same evidence.
Noise, then, is damaging. All the more so because it is often hidden and unknown. So, to take an example that is happening right now, across the entire country: the marking of school students’ scripts. Many schools are using internal ‘tests’, if not as surrogate exams, then as ‘evidence’ to support the determination of students’ GCSE and A level grades. The vast majority of scripts will be marked by a single teacher. In the language of the book, that mark might be biased, and is certainly noisy – a different teacher, perhaps from a different school, might give the script a different mark, and even the same teacher might give the same script a different mark if the marking were done when the teacher is feeling fresh and energetic rather than weary.
As well as emphasising the importance of measuring noise so that its impact can be recognised and taken into account, the book also offers many suggestions as to how noise can be identified, reduced and eliminated. At this point, the book (unfortunately in my view) lapses into management-consultant-speak: to use the book’s terms, the authors recommend a ‘noise audit’ to measure the noise; the use of ‘decision hygiene’ to prevent noise errors; a ‘decision observer’ to help identify bias, especially in meetings; the avoidance of ‘premature intuitions’ (which readers of Kahneman’s Thinking, Fast and Slow will recognise as ‘System 1’); and the use of a ‘mediating assessments protocol’ that ‘breaks down a complex judgement into multiple fact-based assessments and aims to ensure that each one is evaluated independently of the others’. And (fortunately) the more straightforward suggestions of working in well-balanced teams to avoid over-reliance on the judgements of a single individual, and – as far as possible – to use rankings (such as ‘Chris is better at [this] than Alex’), rather than absolute scores (‘Chris is grade A’, ‘Alex is grade B’).
Once the jargon has been untangled, these suggestions are constructive. But they also reveal what I consider to be a troublesome problem.
I think they are hard to implement. And as a result, the important benefits of understanding noise in any particular context, and, if not eliminating it altogether, then at least reducing its damage, may not be realised.
As an example, to measure the noise associated with marking students’ scripts requires extensive re-marking. In principle, this can be done, but at a potentially high cost and requiring considerable time. Alternatively, the noise could be eliminated by using artificial intelligence for marking, in that the same algorithm would be used for all scripts in exactly the same way and so deliver consistent, noise-free (but not necessarily bias-free) outcomes. But it will be some time before reliable algorithms can be developed, and trusted, to mark A level English Literature. So the reality is that marking is, and will for some time continue to be, noisy, with the attendant unfairness – if not ‘rampant injustice’ – of using just one of those noisy marks, totally randomly selected, to determine a student’s future.
Cost, time and technical feasibility are three important blockers of implementation. But to my mind a much more formidable blocker is the arrogance of ‘how dare you suggest that my judgement is flawed!!!’ – an arrogance not unknown amongst those in authority, those who have the power to address the noise problem, or indeed bury it. At this point, I must note that the authors are more polite than I am: they refer not to ‘arrogance’ but to ‘dignity’!
In my view, the world would indeed be a better place if the wisdom contained in this book (ah! I think my ‘confirmation bias’ is showing…) were to be acted upon. But I fear that those who might have the power to commission the actions required to measure noise, and then do something about it, are precisely those who are most likely to feel that their ‘superior judgement’ is being threatened, and therefore will not wish to. As exemplified by the school exam regulator, Ofqual, who, to their credit, have measured the noise associated with exams, but have as yet failed to take the appropriate action to limit its damage.
I contrast this with another concept, ‘nudge’, as described in the book Nudge – Improving decisions about health, wealth and happiness, published in 2008, and written by Richard Thaler and the same Cass Sunstein who is also a co-author of Noise. The central theme of Nudge is that people can be ‘nudged’ into the ‘right’ behaviours, and away from the ‘wrong’ ones, by subtle cues, such as a form that says ‘unless you explicitly request otherwise, signing this form confirms your consent to [whatever]’ rather than ‘please tick this box if you wish to consent to [whatever]’. Consenting to [whatever] has been deemed ‘good’, and by making that the default option, more people are likely to comply, if only as a result of inertia.
Nudge caught the attention of the powerful, and in 2010, the Cameron-Clegg Coalition Government set up the ‘Behavioural Insights Team’, colloquially known as the ‘Nudge Unit’, within the Cabinet Office. Many ‘nudges’ can be identified by a small number of ‘clever people’, and they can be implemented relatively easily, perhaps at quite modest cost, and often effectively too.
Implementing the suggestions of Noise, however, is, I fear, very much harder, as discussed in the book’s later chapters. But if noise causes ‘rampant injustice, high economic costs, and errors of many kinds’, then striving to reduce and eliminate it must be a ‘good thing’, even if difficult to do.
But how might those who have the authority to make this happen be persuaded to do so? How can the bosses be ‘nudged’ to read Noise?
Very interesting topic and thank you for bringing it to my attention.
My partner is a judge and I am not sure how she will respond.
Having been a Justice of the Peace (magistrate) I saw this “difference of opinion” based on the same facts quite often on fundamental issues such as “Guilt” or “Innocence” as well as on the specific sentence to be imposed in individual cases.
As the article and book imply, in circumstances where one (or several) people are involved in assessing/judging/diagnosing/making decisions about other people and things of importance that will have serious consequences, the chance of “getting it right” (even if there is overwhelming agreement on what “right” is) is a bit of a lottery.