Rob Cuthbert is Emeritus Professor of Higher Education Management at the University of the West of England and Managing Partner of the Practical Academics consultancy. He is the author of the 2020 HEPI blog, ‘A-Levels 2020: What students and parents need to know’. You can find Rob on Twitter @RobCuthbert.
We expect A-Level and GCSE exam grades to be reliable and fair. They aren’t.
In his new book, Missing the Mark: Why So Many School Exam Grades are Wrong – and How to Get Results We Can Trust (Canbury Press), Dennis Sherwood tells us why, how this happened and what we should do about it. His book is a case study of incomprehension, incompetence, arrogance and hubris in an examinations policy that gets it wrong and damages the life chances of hundreds of thousands of young people every year. And it will be the same this year, next year and every year until the policy changes.
Understanding grades is not as easy as ABC, because those ‘actual’ grades could be anywhere from BCD to A*AB. Dennis Sherwood was tackling this issue long before the 2020 exams fiasco. In a January 2019 blog for HEPI, Sherwood wrote that one grade in four was wrong, ‘a fact that has been ignored by the relevant authorities … hence this book’. Nothing has changed. The 2020 debacle brought the issue of unreliability of grades into very sharp focus, and Sherwood’s campaign almost single-handedly forced Ofqual’s Chief Regulator Glenys Stacey to admit to the Education Select Committee in 2020 that the exam system is reliable only ‘to one grade either way’.
Missing the Mark begins with some contextual history. In 2005 an AQA report said that to protect their integrity, marking and grading systems ‘should routinely report the levels of unreliability’. In September 2020, the author of the report, Michelle Meadows, by then a senior Ofqual officer, told the Education Select Committee that she took ‘some solace’ from the fact that 98% of A-level grades and 96% of GCSE grades are accurate plus or minus one grade, which Sherwood rightly scorns as ‘not reliable enough’ (p.20).
The book explains how the examination and grading system works (in England, but with many similarities in other parts of the UK). Marks and grades are too often confused but are not the same – marks are what markers give to scripts while marks are converted into grades by Ofqual. A slow but steady increase in grades before 2010 reflected a criterion referencing approach but thereafter Ofqual clamped down on supposed ‘grade inflation’ by shifting to norm referencing based on the 2012 cohort. However, the proportion of challenges and appeals grew exponentially, and Ofqual in 2016 decided to change the appeal system, removing the right to request a re-mark and confining appeals to cases of procedural error. In 2019 this approach reduced 280,000 ‘challenges’ to just 746 ‘appeals’. Nevertheless, the possibility of ‘appeal’ against a grade continued to feature strongly in official rhetoric, and questions to Ofqual about the reliability of grading have too often been diverted into comments about the reliability of marking, or even the paucity of successful appeals.
Marking is not the problem, but the book shows how Ofqual repeatedly tries to deflect criticism or blame by implying that it is. Reliability is the problem, and Sherwood has used Ofqual’s own research to show – conclusively, to my mind – that ‘Overall, school exam grades have a reliability of 75%’ (p.76, his emphasis). The problem is the ‘fuzziness’ of subjects and Ofqual grading policy. Fuzziness, ‘the book’s central concept’ (p.65), refers to the fact that ‘different, equally-qualified examiners can legitimately give the same script different marks’ (p.65), as Ofqual now admits. Discursive subjects like English and History are fuzzier than Mathematics. Sherwood draws on Ofqual’s updated research to show – once again, I believe, conclusively – that fuzziness inheres in subjects, not their marking, and he is able to quantify fuzziness for each subject. In a very fuzzy subject like History, what Ofqual sometimes calls the ‘definitive’ mark (a questionable concept) for a script might be ‘legitimately’ marked in a wide range, +/-5 or more around the definitive mark. When that +/- range of 11 marks straddles a grade boundary it makes the assigned grade unreliable, and Sherwood’s analysis shows convincingly that because of this the overall reliability of grades is only 75 per cent. Mathematics, the most reliable (least fuzzy) subject, is taken by very large numbers, which helps raise the overall average, but for some subjects like English, reliability is near to 50 per cent. And these are conservative estimates: overall reliability may be only about 67 per cent, according to Sherwood’s analysis. That would mean one in three grades is wrong. If you take three A-Levels, the chances that all three are reliable would on average be only about 30 per cent (eight in 27).
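For readers who want to check the arithmetic, here is a minimal Python sketch. It assumes that each grade is right or wrong independently of the others and that every subject has the same reliability; both are simplifications used only for illustration. The function names and the example grade boundary of 60 are hypothetical and are not taken from the book or from Ofqual.

```python
from fractions import Fraction


def chance_all_grades_right(reliability, n_subjects):
    """Chance that every one of n grades is right.

    Assumes the grades are independent and share one reliability figure.
    """
    return reliability ** n_subjects


def band_width(fuzziness):
    """Number of distinct marks in the range definitive mark +/- fuzziness."""
    return 2 * fuzziness + 1


def straddles_boundary(definitive_mark, fuzziness, boundaries):
    """True if the legitimate mark range falls on both sides of a boundary.

    Each boundary is taken to be the lowest mark that earns the higher grade
    (a hypothetical convention for this sketch).
    """
    low = definitive_mark - fuzziness
    high = definitive_mark + fuzziness
    return any(low < b <= high for b in boundaries)


# One in three grades wrong means a reliability of two thirds per grade.
three_a_levels = chance_all_grades_right(Fraction(2, 3), 3)
print(three_a_levels, float(three_a_levels))

# A fuzziness of +/-5 covers 11 possible marks.
print(band_width(5))

# A script with definitive mark 58 and fuzziness 5, against a boundary at 60.
print(straddles_boundary(58, 5, [60]))
```

Under these assumptions the first calculation gives eight in 27, matching the figure quoted above, and the last line shows a script whose range of legitimate marks sits on both sides of a grade boundary.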
All of this was before the catastrophe of 2020, which has been exhaustively analysed elsewhere. Here, we have the definitive account of ‘The Great Centre Assessment Grades Car Crash’, as Sherwood calls it, which could be viewed in slow motion because it was evident to many observers long before the summer that it could only end in disaster, as many submissions to the Education Select Committee inquiry had shown. In July 2020, Dennis Sherwood and I wrote in a HEPI blog that ‘Ofqual has more to do’, and things went from bad, to worse, to disastrous in a few days in August. Responses to my HEPI blog on A-Level results day made it clear that Ofqual’s policy was still not well understood, even by schoolteachers who had spent long hours agonising over grades and rank ordering every student. Centre Assessed Grades were trumpeted by politicians while being surreptitiously intended to be irrelevant, but circumstances eventually put them centre stage. As Sherwood notes, ultimately, ‘those schools who “obeyed the rules” … stood by and watched those schools who didn’t obey the rules – or deliberately flouted them – benefit’ (p.279). Thus, the outcome was unfair to many students and many schools in 2020 despite a surge in grade inflation, and the repercussions would also seriously affect students and universities in the years to come.
The book goes on to consider the 2021 experience. With COVID persisting, the Government decided early on that teacher-assessed grades (TAGs) would take the place of examinations, in a transparent attempt to ensure the blame for whatever happened would not fall this time on Government and Ofqual. ‘So the scene was set for yet more grade inflation’ (p.306), which, indeed, there was, but Sherwood assesses the data and concludes that teachers behaved with integrity, not even going as far as replicating UCAS predicted grades in their TAGs.
What next? ‘There is every likelihood that in summer 2022, about 1.5 million wrong grades will be awarded yet again’ (p.322). Ofqual has already declared its intention to achieve a return to the grade distribution of 2019, with some of the reductions in 2022 and the rest in 2023. Will we, can we, must we return to the appalling status quo in which one in four grades are wrong and there is no meaningful prospect of appeal?
Sherwood has an alternative. In fact, he offers 14 alternatives. He predicts the eventual use of artificial intelligence on a much larger scale, but in the meantime notes that ‘double marking isn’t a solution to the reliability problem at all’ (p.334), whereas grades perhaps are – but that means looking closely at scripts that are close to grade boundaries, as every university examiner knows. He describes some fundamentally different approaches which involve explicit acceptance of fuzziness and incorporating subject-specific fuzziness in each grade, in a way which would both be fair and achieve better than 99 per cent reliability. All it needs is a change of grading policy: schools, exams and students could go on just as they do now. Proper appeals, involving re-marking, could also be reinstated without fear of overload. Sherwood’s own preferred solution would be to abandon grades and publish marks +/- fuzziness for each subject – fine in principle, but in practice I fear it would simply lead to too many users – journalists, universities, employers, parents, even students – reinventing grades, oversimplifying the complexity of the outcomes. I prefer Sherwood’s Solution #11, which involves taking the mark and adding the subject fuzziness to determine the grade.
Sherwood wants to stimulate debate to lead to a better solution, but the bureaucratic inertia of the system of government, regulator and exam boards may yet stifle the possibilities. The villains of the piece, not just in the 2020-2021 COVID years, are named and shamed: ‘the DfE, Ofsted and especially Ofqual are collectively culpable’ (p.119). For the most part, the author is concerned to let the evidence speak for itself, but his views are clear, and in places expressed in forthright terms: ‘Ofqual’s … mindset … [was] arrogance, detachment, unrealistic, not looking at the data, not thinking things through’ (pp.185-186). Repeatedly, Ofqual has spoken in terms apparently designed to mislead or to deflect criticism – ‘deliberate obfuscation’ (p.116). When that fails, it has resorted to ad hominem attacks, as in Camilla Turner’s Telegraph report of 25 August 2018, where an Ofqual spokesman was quoted as saying: ‘Mr Sherwood’s research is “entirely without merit” and has drawn “incorrect conclusions”’. Within weeks, Ofqual was issuing its 2018 update report on metrics which in effect admitted that, on the contrary, his research was completely sound.
2020 saw numerous casualties. Ofqual Chief Regulator, Sally Collier, was the first to go, on 25 August, followed the next day by the Department for Education’s Permanent Secretary, Jonathan Slater, no less. These two high-profile victims were clearly in the line of fire, but for Slater, at least, the suspicion remains that part of his crime was simply to disagree too often and too fundamentally with Gavin Williamson and his advisers. Nevertheless, they had comparatively soft landings, and that was even more true of the later departures. Ofqual Chair, Roger Taylor, was and continues to be Chair of the Centre for Data Ethics and Innovation and hold other government posts, and really should have known much better than to back a flawed algorithm for so long. He stepped down from Ofqual in December 2020 – free from public criticism perhaps because he knew too much about what the Department for Education had known, and for how long, in the first half of 2020. Michelle Meadows went in September 2021 to an academic post at Oxford University. The long-serving Schools Minister Nick Gibb and Secretary of State Gavin Williamson went in the Government’s September 2021 reshuffle – Williamson, of course, later knighted for his ‘services’.
But 2020 was simply the high-water mark of a flawed system. Ofqual emerges from the book’s devastating account as an organisation so obsessed with preventing grade inflation that it forgot fairness to students, and so lacking in trust for schools and teachers that it persisted with its ridiculous algorithm, despite its manifest unfitness for purpose. Sherwood speculates that one reason might be that the algorithm, which was designed to replicate earlier grade distributions, had only been tested against the prevailing (low) 75 per cent reliability threshold. But the failure was built into the design, which for the first time set student against student, in each school subject cohort, for the limited number of grades available. In 2020, as panic set in, the Government tried to blame Ofqual, Ofqual tried to blame teachers and schools, and the Prime Minister tried to blame a ‘mutant algorithm’ (which never mutated) but there was no excusing the systemic failure and no excuse for the true miscreants.
Data guru Dennis Sherwood is a former partner in Deloittes and Coopers and Lybrand and was an executive director at Goldman Sachs. As a consultant he was once an insider who sat around the table in crucial policy discussions. After privately advising that the emperor had no clothes, but being ignored, he was driven to declare it in public – his reward has been to be vilified, but never refuted.
Nick Hillman said, ‘Everyone in UK education should reflect upon the problems identified in this powerful book – and then decide what to do about them’. Ofqual’s Chief Regulator is now Jo Saxton, a former adviser to Gavin Williamson and appointed by him; the book shows how she, too, continues to confuse marks and grades. Ofqual flouted the Education Select Committee’s firm request to publish its algorithm sooner than results day, despite its promises to the inquiry. The Committee should now look harder at whether Ofqual is failing in its statutory duty to ensure reliability in grading, and to ensure public trust in the system. If Ofqual remains resistant to change, then universities need to act instead, by changing how they select students. For many universities, offers linked to grades have been a convenient way to control access, which seemed fair. Now that we know it isn’t, admissions officers need to rethink their approach. It may be as simple as making an offer based on the ‘actual’ grade, plus or minus one. Higher education institutions are already learning for 2022 admissions how to be more careful in their offers, as UCAS evidence shows. Doing nothing is no longer an option. As Melvyn Roffe, Chair of the Headmasters’ and Headmistresses’ Conference, says: ‘much of what we think we know about school exams is based at best on wishful thinking and at worst on wilful misrepresentation’.
At present, the school examinations system has a fail grade. As an examiner might say, it ‘must do better’. This book shows the way.