A guest blog kindly contributed by Dennis Sherwood, who runs the Silver Bullet Machine consultancy.
Last August, 811,776 A level and 5,470,076 GCSE grades were awarded to candidates in England, Wales and Northern Ireland. What happened next was sometimes jubilation (“B in history! Fantastic! I’m in!”); sometimes despair (a 3 in GCSE maths can close many doors). Grades are important. We all know that.
But perhaps rather less well known is that of those 811,776 A level grades, about 200,000 were wrong; and of those 5,470,076 GCSE grades, more than 1.3 million were wrong. Those numbers aren’t typos – they really are 200,000 and 1.3 million.
Expressed rather differently: on average, across all subjects at both A level and GCSE, about 1 in every 4 grades awarded each year is wrong.
You might not believe that, so my purpose here is to present the evidence. Importantly, please don’t take my word for it. Look at the sources yourself, and discuss this with your colleagues – especially those who feel comfortable with statistics and can scrutinise the details.
The key source document, Marking consistency metrics – An update, was published by Ofqual just a few weeks ago, on 27th November 2018. It reports the results of a comprehensive study, carried out by Ofqual over the last several years, in which a (very!) large number of school exams were blind double-marked. In each mark pair, one mark was given by an examiner drawn from the general pool, and the other by a senior examiner, whose mark was designated ‘definitive’ – or, as you and I might say, ‘right’. Each of the two marks was then assigned the corresponding grade, so enabling the two grades to be compared. The study was cohort-wide across whole subjects, and so not biased towards scripts marked close to grade boundaries, as are statistics based on appeals.
On page 21, you will find this chart:
It’s rather cluttered, but for each of the 14 subjects, the important feature is the heavy vertical black line within the darker blue box. This line answers the question “If the entire subject cohort’s scripts were to be fairly re-marked by a senior examiner, for what percentage of scripts would the grades corresponding to the marks given by both examiners be the same?” The answer to this question is important, for it defines the probability that an originally-awarded grade would be confirmed by a senior examiner. This is, in essence, a definition of the reliability of GCSE, AS and A level grades, and this is the first time that measures of the reliability of school exam grades have ever been published.
You might expect that the reliability of school exam grades should be at, or close to, 100% for all subjects. As the chart shows, this is (almost) the case for (all varieties of) mathematics, at 96% (expressed on the horizontal axis as a probability of 0.96), but increasingly less so for other subjects – for economics, for example, the reliability is about 74%; for history, about 56%.
But if, for economics, about 74% of originally-awarded grades correspond to those that would have been awarded had a senior examiner marked the scripts, and are therefore right, then the remaining 26% of grades must be wrong. Maths is better: 96% right, 4% wrong. But history is worse – about 56% right, 44% wrong. The report does not give an overall average measure of grade reliability across all subjects, but when the numbers shown in this chart are weighted by the corresponding subject cohort sizes, the average comes to about 75% right, 25% wrong.
That’s the evidence that, on average, across all subjects, at GCSE, AS and A level, about 1 grade in every 4 is wrong. But as I’ve already said, please don’t take my word for it. Think it through, and check the sources. Yes, this figure is based on data for only 14 subjects, not including, for example, French, Spanish and art. The 14 subjects chosen, however, represent over 60% of the total number of grades awarded, and even if all the remaining subjects were as reliable as maths (96% right, 4% wrong) – which they are surely not – the overall result is still only about 82% right, 18% wrong. So about 75%/25% across all subjects is, I think, a reasonable estimate.
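That "about 82% right" bound is easy to check with a few lines of arithmetic. The report's exact cohort share for the 14 studied subjects is given only as "over 60%", so in this sketch the share is a parameter; for shares between 60% and 70%, the bound lands at roughly 82–83% right.

```python
# Rough sanity check on the "about 82% right" upper bound.
# Assumptions (from the text): the 14 studied subjects cover a share s of
# all grades ("over 60%") at ~75% reliability, and every remaining subject
# is as reliable as maths (96%) -- a deliberately optimistic bound.

def overall_reliability(share_studied, rel_studied=0.75, rel_rest=0.96):
    """Cohort-weighted average reliability across all subjects."""
    return share_studied * rel_studied + (1 - share_studied) * rel_rest

for s in (0.60, 0.65, 0.70):
    print(f"share {s:.0%}: overall reliability ≈ {overall_reliability(s):.1%}")
```

Even under this best case for the unstudied subjects, roughly 1 grade in 6 would still be wrong.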
And, yes, I know that ‘right’ and ‘wrong’ are emotive words, and that the concept of the ‘right’ grade is slippery. But more fundamentally, the Ofqual research states conclusively – taking economics once more as a concrete example – that, for every 100 scripts, about 74 would be awarded the same grade when marked by two different examiners, whilst about 26 would be awarded different grades. For these 26% of scripts for which different grades are awarded, rather than being distracted by a debate as to which of the two grades might, or might not, be ‘right’ or ‘wrong’, let’s focus on what’s important: the fact that the two grades are different.

For in reality, a candidate’s script is marked by just one examiner (or by one team, if different questions are marked by different people), and awarded the corresponding grade. That’s the grade on the certificate; that’s the grade that determines the candidate’s future. But had that script been marked by someone else, or by a different team, there is, for economics, about a 26% chance that the grade would be different. The originally-awarded grade is therefore unreliable, in that it depends on the lottery of which particular examiner (or team) happened to mark the script.

If the candidate were to appeal, and if that appeal were to result in a fair re-mark by a senior examiner, then this probability, 26%, represents the likelihood that the originally-awarded grade would be changed – in which case I consider it very understandable that the candidate would regard the originally-awarded grade as wrong. So I’ll continue to use ‘wrong’ rather than the much clumsier ‘grade that would be different had the script been marked by another examiner’.
If, on average, about 1 grade in 4 is indeed wrong, there are some important consequences. For example:
- Every candidate sitting 4 A levels is likely to be awarded 1 wrong grade.
- Every candidate sitting 8 GCSEs is likely to be awarded 2 wrong grades.
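The two bullet points above are expected values: at roughly a 1-in-4 chance that any single grade is wrong, the expected number of wrong grades is a quarter of the number of subjects sat. Treating each grade as independent – a simplifying assumption for illustration, since different subjects are marked by different examiners – one can also ask how likely a completely correct certificate is:

```python
# Expected wrong grades, and the chance of an all-correct certificate,
# assuming each grade is independently wrong with probability 0.25
# (a simplifying assumption for illustration).

P_WRONG = 0.25

def expected_wrong(n_subjects, p_wrong=P_WRONG):
    """Expected number of wrong grades on a certificate of n subjects."""
    return n_subjects * p_wrong

def p_all_right(n_subjects, p_wrong=P_WRONG):
    """Probability that every grade on the certificate is right."""
    return (1 - p_wrong) ** n_subjects

print(expected_wrong(4))         # 4 A levels: 1 wrong grade expected
print(expected_wrong(8))         # 8 GCSEs: 2 wrong grades expected
print(f"{p_all_right(4):.0%}")   # chance all 4 A level grades are right
print(f"{p_all_right(8):.0%}")   # chance all 8 GCSE grades are right
```

Under these assumptions, fewer than a third of candidates sitting 4 A levels – and only about one in ten sitting 8 GCSEs – would leave with a fully correct certificate.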
Also, drawing directly on the data shown in the chart:
- For every 100 candidates sitting A level maths, further maths and physics, about 81 receive a certificate on which all 3 grades are right, whilst about 19 are awarded at least 1 wrong grade.
- For every 100 candidates sitting A level English language, English literature and history, about 20 receive a certificate on which all 3 grades are right, whilst about 80 are awarded at least 1 wrong grade.
Given the cohorts that take these combinations, does that suggest a gender bias?
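Those two combination figures are simply the product of the per-subject reliabilities, again assuming independence. Only the maths figure (0.96) is quoted in the text above; the physics, English and history values below are my approximate readings from the chart, so treat them as illustrative rather than exact:

```python
# Chance that all three grades on a certificate are right, assuming the
# three subjects are independent. Only the maths figure (0.96) is given
# in the text; the physics (~0.88), English language (~0.58), English
# literature (~0.61) and history (0.56) values are approximate readings
# from the chart and should be treated as illustrative.

from math import prod

stem = {"maths": 0.96, "further maths": 0.96, "physics": 0.88}
humanities = {
    "English language": 0.58,
    "English literature": 0.61,
    "history": 0.56,
}

for combo in (stem, humanities):
    p = prod(combo.values())
    print(f"{', '.join(combo)}: all 3 grades right ≈ {p:.0%}")
```

With these readings, the STEM combination comes out at about 81% all-correct, and the humanities combination at about 20% – matching the figures in the bullet points.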
And here is one further consequence: towards the bottom of page 4 of the report’s summary, you will find these words:
The probability of receiving the definitive grade or adjacent grade is above 0.95 for all qualifications, with many at or very close to 1.0 (ie suggesting that 100% of candidates receive the definitive or adjacent grade in these qualifications).
Or adjacent grade? In plain English, I think this means “Published grades are OK one grade either way”. So perhaps university offers should be of the form “BBC, please, but CCD is fine”.
Those, then, are the facts. The appeals process does not right all these wrongs – nor should it: the appeals process is the last-chance saloon, and grades should be right first time. Why aren’t they?
The ‘obvious’ answer is because of ‘marking error’. But however ‘obvious’ that might appear, that’s not the reason. Marking is of very high quality, and although marking errors do happen, they are relatively rare, and the exam boards and Ofqual have many quality control procedures to avoid them in the first place, and to correct them when discovered.
The point is not that marking is sloppy or error-prone. The point is that marking is not precise – as a 2016 Ofqual blog explicitly states, different, equally qualified, examiners can give the same script (slightly) different marks. As we all know very well, a script does not have a single, precise, mark of, say, 54; rather, any of the marks 53, 54 and 55 could well be legitimate. But if the 5/6 grade boundary is 55, then a mark of 54 is grade 5 whilst 55 is grade 6 – one mark can, and does, make all the difference. Even when marking is perfect, grading can still be unreliable.
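To see how a fuzzy mark near a boundary becomes a grade lottery, here is a small simulation under illustrative assumptions: each of the marks 53, 54 and 55 is an equally likely legitimate mark for the same script, and the 5/6 boundary sits at 55. Two equally competent examiners, both marking "correctly", can still award different grades:

```python
# Simulation: high-quality but imprecise marking near a grade boundary.
# Illustrative assumptions: any of marks 53, 54, 55 is a legitimate mark
# for this script, each equally likely, and the 5/6 boundary is 55.

import random

random.seed(0)  # reproducible run

BOUNDARY = 55
LEGITIMATE_MARKS = [53, 54, 55]

def grade(mark):
    """Grade 6 at or above the boundary, grade 5 below it."""
    return 6 if mark >= BOUNDARY else 5

trials = 100_000
disagree = sum(
    grade(random.choice(LEGITIMATE_MARKS)) != grade(random.choice(LEGITIMATE_MARKS))
    for _ in range(trials)
)
print(f"two examiners award different grades ~{disagree / trials:.0%} of the time")
```

For this script the exact figure is 2 × (1/3) × (2/3) ≈ 44%: neither examiner has made any error, yet the grade depends on who happened to mark it.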
The fundamental truth is that all marks are ‘fuzzy’, with some subjects being inherently ‘fuzzier’ than others – so explaining the sequence of subjects in the chart. But this ‘fuzziness’ is ignored when GCSE and A level grades are assigned. Most importantly, when the range of ‘fuzzy’ marks straddles a grade boundary, the grade awarded depends on the lottery of who happened to mark the script first. The resulting grade is therefore unreliable, and this is deeply unfair.
Grading, of course, was originally invented (apparently at Yale University in 1785) to solve the problem of ‘fuzzy’ marks: in the ‘old days’, university professors knew that it was impossible to distinguish sensibly between a mark of 54 and 55, so the solution was to give all scripts marked between 50 and 59 a lower second (or whatever). And the professors were also wise enough to know that a script marked 59 might actually be an upper second – so they got together, and carefully reviewed that script to ensure fairness. Grading is a good, pragmatic, idea – but only when two conditions are simultaneously fulfilled: firstly, the number of border-line cases must be relatively small; secondly, every border-line case must be fairly reviewed.
For school exams, however, neither condition holds. The number of border-line cases is huge: about one quarter of scripts have ranges of legitimate marks that straddle grade boundaries. This results in far, far too many border-line cases to be fairly reviewed.
The school exam grading system is broken, no longer fit for purpose, and must be changed. Grades as currently used are not the only way to recognise a candidate’s achievement: there are several other totally practical possibilities, one being to stop awarding grades altogether, and instead to show on the certificate the candidate’s standardised UMS mark, associated – importantly – with a measure of the subject’s ‘fuzziness’. We therefore need to carry out a thoughtful and robust project to identify all the possibilities, to evaluate each wisely, and to determine the best.
A final thought. I have heard it said that “Wrong grades are wrong both ways, so although any GCSE certificate with 8 grades might have two wrong grades, one will be too high and the other too low. It all comes out in the wash. It doesn’t matter.”
I find this distressingly cynical. Whilst examinations play a role in the assessment of young people, I think the reliability of the examination results does matter. Very much. And to me, awarding grades such that “the probability of receiving the definitive grade or adjacent grade is above 0.95 for all qualifications, with many at or very close to 1.0” just isn’t acceptable.
If you agree, please post a comment, discuss this with your colleagues, and share this blog: the more people that know about this, the louder the voice for reform.