
1 school exam grade in 4 is wrong. Does this matter?

  • 15 January 2019
  • By Dennis Sherwood

A guest blog kindly contributed by Dennis Sherwood, who runs the Silver Bullet Machine consultancy.

Last August, 811,776 A level and 5,470,076 GCSE grades were awarded to candidates in England, Wales and Northern Ireland. What happened next was sometimes jubilation (“B in history! Fantastic! I’m in!”); sometimes despair (a 3 in GCSE maths can close many doors). Grades are important. We all know that.

But perhaps rather less well known is that of those 811,776 A level grades, about 200,000 were wrong; and of those 5,470,076 GCSE grades, more than 1.3 million were wrong. Those numbers aren’t typos – they really are 200,000 and 1.3 million.

Expressed rather differently: on average, across all subjects at both A level and GCSE, about 1 in every 4 grades awarded each year is wrong.

You might not believe that, so my purpose here is to present the evidence. Importantly, please don’t take my word for it. Look at the sources yourself, and discuss this with your colleagues – especially those who feel comfortable with statistics and can scrutinise the details.

The key source document, Marking consistency metrics – An update, was published by Ofqual just a few weeks ago, on 27th November 2018. It reports the results of a comprehensive study, carried out by Ofqual over the last several years, in which a (very!) large number of school exams were blind double-marked. In each mark pair, one mark was given by an examiner drawn from the general pool, and the other by a senior examiner, whose mark was designated ‘definitive’ – or, as you and I might say, ‘right’. Each of the two marks was then assigned the corresponding grade, so enabling the two grades to be compared. The study was cohort-wide across whole subjects, and so not biased towards scripts marked close to grade boundaries, as are statistics based on appeals.

On page 21, you will find the key chart.

It’s rather cluttered, but for each of the 14 subjects, the important feature is the heavy vertical black line within the darker blue box. This line answers the question “If the entire subject cohort’s scripts were to be fairly re-marked by a senior examiner, for what percentage of scripts would the grades corresponding to the marks given by both examiners be the same?” The answer to this question is important, for it defines the probability that an originally-awarded grade would be confirmed by a senior examiner. This is, in essence, a definition of the reliability of GCSE, AS and A level grades, and this is the first time that measures of the reliability of school exam grades have ever been published.

You might expect that the reliability of school exam grades should be at, or close to, 100% for all subjects. As the chart shows, this is (almost) the case for (all varieties of) mathematics, at 96% (expressed on the horizontal axis as a probability of 0.96), but increasingly less so for other subjects – for economics, for example, the reliability is about 74%; for history, about 56%.

But if, for economics, about 74% of originally-awarded grades correspond to those that would have been awarded had a senior examiner marked the scripts, and are therefore right, then the remaining 26% of grades must be wrong. Maths is better: 96% right, 4% wrong. But history is worse – about 56% right, 44% wrong. The report does not give an overall average measure of grade reliability across all subjects, but when the numbers shown in this chart are weighted by the corresponding subject cohort sizes, the average comes to about 75% right, 25% wrong.

That’s the evidence that, on average, across all subjects, at GCSE, AS and A level, about 1 grade in every 4 is wrong. But as I’ve already said, please don’t take my word for it. Think it through, and check the sources. Yes, this figure is based on data for only 14 subjects, not including, for example, French, Spanish and art. The 14 subjects chosen, however, represent over 60% of the total number of grades awarded, and even if all the remaining subjects were as reliable as maths (96% right, 4% wrong) – which they are surely not – the overall result is still only about 82% right, 18% wrong. So about 75%/25% across all subjects is, I think, a reasonable estimate.
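
For readers who want to check the arithmetic, here is a minimal sketch of the cohort-weighted average. The reliabilities are approximate readings from the chart, and the cohort weights are round numbers assumed purely for illustration – they are not Ofqual’s figures, so treat the output as indicative only.

    # Cohort-weighted average reliability; weights are assumed, not Ofqual's.
    subjects = {
        # subject: (approx. probability of the definitive grade, assumed weight)
        "maths":     (0.96, 15),
        "physics":   (0.88, 5),
        "economics": (0.74, 4),
        "english":   (0.60, 12),
        "history":   (0.56, 8),
    }

    total = sum(w for _, w in subjects.values())
    average = sum(p * w for p, w in subjects.values()) / total
    print(f"weighted average reliability: {average:.0%}")   # ~76%

    # The 14 charted subjects cover ~62% of all grades; even if the
    # remaining ~38% were as reliable as maths (96%), the overall figure
    # would still be only about 82-83%:
    print(f"upper bound: {0.75 * 0.62 + 0.96 * 0.38:.0%}")  # ~83%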

And, yes, I know that ‘right’ and ‘wrong’ are emotive words, and that the concept of the ‘right’ grade is slippery. But more fundamentally, the Ofqual research states conclusively – taking economics once more as a concrete example – that, for every 100 scripts, about 74 would be awarded the same grade when marked by two different examiners, whilst about 26 would be awarded different grades. For these 26% of scripts, rather than being distracted by a debate as to which of the two grades might, or might not, be ‘right’ or ‘wrong’, let’s focus on what’s important: the fact that the two grades are different.

For in reality, a candidate’s script is marked by just one examiner (or by one team, if different questions are marked by different people), and awarded the corresponding grade. That’s the grade on the certificate; that’s the grade that determines the candidate’s future. But had that script been marked by someone else, or by a different team, there is, for economics, about a 26% chance that the grade would be different. The originally-awarded grade is therefore unreliable, in that it depends on the lottery of which particular examiner (or team) happened to mark the script.

If the candidate were to appeal, and if that appeal were to result in a fair re-mark by a senior examiner, then this probability, 26%, represents the likelihood that the originally-awarded grade would be changed – in which case I consider it very understandable that the candidate would regard the originally-awarded grade as wrong. So I’ll continue to use ‘wrong’ rather than the much clumsier ‘grade that would be different had the script been marked by another examiner’.

If, on average, about 1 grade in 4 is indeed wrong, there are some important consequences. For example:

  • Every candidate sitting 4 A levels is likely to be awarded 1 wrong grade.
  • Every candidate sitting 8 GCSEs is likely to be awarded 2 wrong grades.

Also, drawing directly on the data shown in the chart:

  • For every 100 candidates sitting A level maths, further maths and physics, about 81 receive a certificate on which all 3 grades are right, whilst about 19 are awarded at least 1 wrong grade.
  • For every 100 candidates sitting A level English language, English literature and history, about 20 receive a certificate on which all 3 grades are right, whilst about 80 are awarded at least 1 wrong grade.
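
Both sets of bullets follow from simple probability, assuming each grade is independently right or wrong; the per-subject reliabilities below are my approximate readings from the chart, so the exact outputs are illustrative.

    # Expected number of wrong grades at 1-in-4 (independence assumed):
    print(4 * 0.25)   # 1.0 wrong grade expected across 4 A levels
    print(8 * 0.25)   # 2.0 wrong grades expected across 8 GCSEs

    # The chance that every grade on a certificate is right is the product
    # of the per-subject reliabilities (approximate chart readings):
    print(f"{0.96 * 0.96 * 0.88:.0%}")   # maths, further maths, physics: ~81%
    print(f"{0.61 * 0.58 * 0.56:.0%}")   # English lang, English lit, history: ~20%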

Given the cohorts that take these combinations, does that suggest a gender bias?

And here is one further consequence. Towards the bottom of page 4 of the report’s summary, you will find these words:

The probability of receiving the definitive grade or adjacent grade is above 0.95 for all qualifications, with many at or very close to 1.0 (ie suggesting that 100% of candidates receive the definitive or adjacent grade in these qualifications).

Or adjacent grade? In plain English, I think this means “Published grades are OK one grade either way”. So perhaps university offers should be of the form “BBC, please, but CCD is fine”.

Those, then, are the facts. The appeals process does not right all these wrongs – nor should it: the appeals process is the last-chance saloon, and grades should be right first time. Why aren’t they?

The ‘obvious’ answer is because of ‘marking error’. But however ‘obvious’ that might appear, that’s not the reason. Marking is of very high quality, and although marking errors do happen, they are relatively rare, and the exam boards and Ofqual have many quality control procedures to avoid them in the first place, and to correct them when discovered.

The point is not that marking is sloppy or error-prone. The point is that marking is not precise – as a 2016 Ofqual blog explicitly states, different, equally qualified, examiners can give the same script (slightly) different marks. As we all know very well, a script does not have a single, precise, mark of, say, 54; rather, any of the marks 53, 54 and 55 could well be legitimate. But if the 5/6 grade boundary is 55, then a mark of 54 is grade 5 whilst 55 is grade 6 – one mark can, and does, make all the difference. Even when marking is perfect, grading can still be unreliable.
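
The point is easily demonstrated with a small simulation. The model below – two equally qualified examiners each legitimately giving the same script 53, 54 or 55 with equal probability, against a 5/6 boundary at 55 – is an illustrative assumption, not an Ofqual parameter.

    import random

    random.seed(0)
    BOUNDARY = 55   # marks of 55 and above get grade 6; below, grade 5

    def legitimate_mark():
        # any of the marks 53, 54 and 55 could well be legitimate
        return random.choice([53, 54, 55])

    trials = 100_000
    disagree = sum(
        (legitimate_mark() >= BOUNDARY) != (legitimate_mark() >= BOUNDARY)
        for _ in range(trials)
    )
    print(f"graded differently in {disagree / trials:.0%} of pairs")   # ~44%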

The fundamental truth is that all marks are ‘fuzzy’, with some subjects being inherently ‘fuzzier’ than others – so explaining the sequence of subjects in the chart. But this ‘fuzziness’ is ignored when GCSE and A level grades are assigned. Most importantly, when the range of ‘fuzzy’ marks straddles a grade boundary, the grade awarded depends on the lottery of who happened to mark the script first. The resulting grade is therefore unreliable, and this is deeply unfair.

Grading, of course, was originally invented (apparently at Yale University in 1785) to solve the problem of ‘fuzzy’ marks: in the ‘old days’, university professors knew that it was impossible to distinguish sensibly between a mark of 54 and 55, so the solution was to give all scripts marked between 50 and 59 a lower second (or whatever). And the professors were also wise enough to know that a script marked 59 might actually be an upper second – so they got together, and carefully reviewed that script to ensure fairness. Grading is a good, pragmatic, idea – but only when two conditions are simultaneously fulfilled: firstly, the number of border-line cases must be relatively small; secondly, every border-line case must be fairly reviewed.

For school exams, however, neither condition holds. The number of border-line cases is huge: about one quarter of scripts have ranges of legitimate marks that straddle grade boundaries. This results in far, far too many border-line cases to be fairly reviewed.
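
That ‘one quarter’ is easy to sanity-check with back-of-envelope arithmetic; the boundary spacing and the width of the legitimate-mark range below are my illustrative assumptions, not measured values.

    boundary_spacing = 10   # marks between adjacent grade boundaries (assumed)
    half_width = 1.25       # half-width of the range of legitimate marks (assumed)

    # A script straddles a boundary whenever its 2 * 1.25 = 2.5-mark range
    # contains one; for marks spread evenly, that is 2.5 / 10 of all scripts.
    print(f"{2 * half_width / boundary_spacing:.0%} of scripts straddle a boundary")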

The school exam grading system is broken, no longer fit-for-purpose, and must be changed. Grades as currently used are not the only way to recognise a candidate’s achievement: there are several other totally practical possibilities, one being no longer to award grades, but for certificates to show the candidate’s standardised UMS mark, associated – importantly – with a measure of the subject’s ‘fuzziness’. We therefore need to carry out a thoughtful and robust project to identify all the possibilities, to evaluate each wisely, and to determine the best.

A final thought. I have heard it said that “Wrong grades are wrong both ways, and so although any GCSE certificate with 8 grades might have two wrong grades, one is too high and the other too low. So it all comes out in the wash. It doesn’t matter.”

I find this distressingly cynical. Whilst examinations play a role in the assessment of young people, I think the reliability of the examination results does matter. Very much. And to me, awarding grades such that “the probability of receiving the definitive grade or adjacent grade is above 0.95 for all qualifications, with many at or very close to 1.0” just isn’t acceptable.

If you agree, please post a comment, discuss this with your colleagues, and share this blog: the more people that know about this, the louder the voice for reform.

25 comments

  1. Alice Prochaska says:

    This analysis is disturbing. I have always believed that in this country we rely far too heavily on exam results: at university as well as in schools. A system of well monitored assessment, both continuous during the duration of a course and with some spot-checking by exam, produces fairer results. It also – and this cannot be over-emphasised – reduces the stress associated with exams. In a climate where educators are rightly concerned about the mental health of students at all levels, how do we continue to justify such a stressful method of assessing students’ performance and potential? This article calls into question a policy so long entrenched in British education that it is tough to change. But it would be good to see some systematic calls for change.

    (I was formerly Principal of Somerville College Oxford, and previously worked at Yale University, so I have some experience of two very different systems.)

  2. albert wright says:

    A very interesting article, which needs to be shared as widely as possible.

    Whether anything can or should be done to “improve” accuracy is more debatable.

    Given that we are all different individuals it is not surprising we make different decisions based on the same data. Consider how the 12 members of a jury can differ over guilt and innocence or how interview panels of 3 or more people can be split over candidates applying for University or a job.

    Personally, I think I can live with the fact that the different grades in most cases are only one grade apart.

    I accept that on an individual basis this can have a major impact in terms of the subjects that might go on to be studied after GCSE or the course or university that may be available after A level.

    1. Dennis Sherwood says:

      Thank you, Albert, and, yes, I agree that ‘accuracy’ is very hard to improve. But might I suggest that improving ‘reliability’ is much more feasible – and very valuable too? I find it helpful to distinguish between ‘accuracy’ (which I think is fundamentally about ‘truth’ and ‘rightness’) and ‘reliability’ (to me, the likelihood that an original assessment is unchanged should a script be re-marked), largely because ‘reliability’ can easily be measured: just give, say, 100 scripts originally marked, say, 54, to 100 different markers, and see what the distribution of re-marks looks like – for example, a range of marks from 50 to 58. So rather than giving the candidate grade 5, the certificate might show 54 ± 4. The likelihood that any further re-mark will be less than 50 or greater than 58 is therefore very low (and can be statistically measured), suggesting that the assessment of 54 ± 4 is reliable. My hunch is that the ± 4 is a property of the exam (that’s the ‘fuzziness’ I referred to), rather than the individual script, but that needs to be verified by an appropriate statistical study. Pragmatically, awarding results of the form 54 ± 4 means that a candidate awarded 54 ± 4 is in essence indistinguishable from one awarded 56 ± 4. That’s a huge improvement on the current system, which would assign grade 5 to 54, and grade 6 to 56.
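
      For anyone who would like to see what that measurement might look like, here is a minimal sketch. The normal noise model, and its width of 2 marks, are assumptions for illustration only – a real study would measure the distribution empirically.

          import random
          import statistics

          random.seed(1)
          original_mark = 54
          # give the same script to 100 different markers
          remarks = [round(random.gauss(original_mark, 2)) for _ in range(100)]

          mean = statistics.mean(remarks)
          spread = 2 * statistics.stdev(remarks)   # roughly a 95% band
          print(f"re-marks span {min(remarks)}..{max(remarks)}; "
                f"result: {mean:.0f} ± {spread:.0f}")   # ~54 ± 4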

  3. Jane Artess says:

    Thank you for this summary of a very important issue.

    I tend to agree with you that the view “…it will all come out in the wash” is simply inadequate.

    Young (and older) people’s futures hang upon the achievement of specific grades; the margins for error described here in some subjects are simply too wide to be justifiable in terms of planned-for career trajectories.

    Entry to courses (and apprenticeships) may become yet more dependent on the achievement of specific grades if the current attempt to allocate places post-qualification becomes a reality.

    Well done for airing this.

  4. Neil Sheldon says:

    Quote:

    “If, on average, about 1 grade in 4 is indeed wrong, there are some important consequences: for example

    Every candidate sitting 4 A levels is likely to be awarded 1 wrong grade.
    Every candidate sitting 8 GCSEs is likely to be awarded 2 wrong grades.”

    The figures quoted for the average number of wrong grades per candidate are correct, but the variation from that average is important too.

    –Only 32% of A level candidates sitting 4 subjects will get the correct grade in all 4 subjects
    –42% will get 1 wrong grade
    –And 26% will get 2 or more wrong grades

    –Only 10% of GCSE candidates sitting 8 subjects will get the correct grade in all 8 subjects
    –27% will get 1 wrong grade
    –31% will get 2 wrong grades
    –And 32% will get 3 or more wrong grades
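
    These figures follow from the binomial distribution with p = 0.25; a quick check, assuming each grade is independently wrong with probability 1/4:

        from math import comb

        def p_k_wrong(n, k, p=0.25):
            # binomial probability of exactly k wrong grades out of n
            return comb(n, k) * p**k * (1 - p)**(n - k)

        for n in (4, 8):
            print(f"{n} subjects: all correct {p_k_wrong(n, 0):.0%}, "
                  f"1 wrong {p_k_wrong(n, 1):.0%}, "
                  f"2 wrong {p_k_wrong(n, 2):.0%}, "
                  f"3+ wrong {1 - sum(p_k_wrong(n, k) for k in range(3)):.0%}")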

  5. Bernard Simon Minsky says:

    Dennis

    You make such excellent points. When you consider the volatility of marking and the impact of grade boundaries (the newly published data show how great the volatility is), the risk of mis-grading, and the devil-may-care attitude of the regulator (how difficult, if not impossible, it is to get a re-mark), the situation is a disgrace.

    I would suggest that the current approach is both mean-spirited and unfair to the candidates. I hope that, by stimulating the debate, a fairer system can be developed – one that would give students confidence.

  6. Kathryn says:

    The problem is that subjects with low reliability tend to be subjects which are subjective, so individuals have to interpret an answer. As an A level examiner myself, I know there is a rigorous system to avoid different interpretations; however, there are always going to be some points which seem more valuable to some examiners than to others. This does not make one interpretation wrong and one correct. Every student answer to an essay will be different, and every examiner will have a different view of its value – that is because we are all individuals. The only way you are going to avoid this is to examine only subjects with right and wrong answers, like Maths. Our current GCSE and A level system is testing ability almost purely by examination, even in subjects which allow for individual expression of views and interpretations. Even a very thorough mark scheme cannot cover every possible interpretation, so there has got to be some subjectivity applied by examiners. Imagine asking 2 reviewers to give a film or restaurant a score out of 100 – how often do you think you would get the same score? And if they were different, would you call one of them wrong?

  7. Yes, you are right: as you say, ‘every examiner will have a different view’. The BIG PROBLEM is that the current grading policy doesn’t recognise this. This problem has two types of solution. One type is to constrain the exam and ‘only examine subjects with right and wrong answers’, as might be achieved by replacing all essay-based exams by multiple choice. The other type of solution is to accept that marks are fuzzy, and to embed the existence of fuzziness within the policy by which the assessment is determined from the original, necessarily fuzzy, mark. There are at least six, pragmatic, ways of doing this – I hope to outline some in a future blog.

  8. Donna Chadwick says:

    My son recently completed his GCSE exams and was 1 grade short of the 4 he needed to enter a LVL 3 college course.

    It’s dumbfounding to me that, if this article is correct – and I have no doubt it is – he could have asked for it to be re-marked and possibly got the 4 he needed to get on the LVL 3 course he originally planned to go on.

    Unfortunately, it’s too late now though as he’s had to resit his exams in college and his anxiety has prevented him from progressing due to the pressure put on him by “support tutors”.

  9. Dear Donna

    Thank you for this post; I understand. Which subjects were involved?

    I don’t want to make matters worse, but it is unlikely that you would have been allowed a remark. In 2016, Ofqual changed the rules on appeals (see, for example, https://www.gov.uk/government/news/fairness-at-the-heart-of-proposed-changes-to-marking-reviews-and-appeals-system) so that a remark is allowed only if you can show that a ‘marking error’ has happened.

    My thoughts on this injustice are here: https://www.silverbulletmachine.com/single-post/2018/10/28/Biting-the-poisoned-cherry—why-the-process-for-school-exams-is-so-unfair. And you will find some other materials elsewhere on my website http://www.silverbulletmachine.com.

    Let me assure you, though, that you and your son are not alone. But how can our collective voice be heard?

  10. Donna Chadwick says:

    Hi Dennis,

    Yes the subject was English which he got 2x grade 3 for but yet all his other results were 5 and over.

    My son was gutted, but life will always throw a spanner in the works, won’t it?

    I’ll take a look at the links you’ve sent.

  11. Donna Chadwick says:

    Hi Dennis,

    I’ve just read the other blog. As for how to get our collective voices heard – surely it’s down to those that have requested a re-mark?

    If the grade comes back altered then that’s grounds to make this known surely?

  12. Hi again Donna

    Thank you once more. And as you can see from the chart, English exams are among the most unreliable.

    One of the important issues with all this is that most of the candidates who are awarded a wrong grade don’t know that this has happened – how would they? So they don’t complain or appeal, and remain stuck with their original, wrong, grade.

    Also, very few people know about the problem, so the more people that read this blog – and the two accompanying blogs

    https://www.hepi.ac.uk/2019/02/25/1-school-exam-grade-in-4-is-wrong-thats-the-good-news/

    https://www.hepi.ac.uk/2019/03/04/yes-the-grade-reliability-problem-can-be-solved/

    – the better!

  13. Donna Chadwick says:

    Hi Dennis,

    Yes I agree. Knowledge is power and the faster the word gets out the better. I will surely be sharing this blog and encouraging others to share it too.

  14. Donna Chadwick says:

    You’re very welcome.

  15. Tom Harvey says:

    I think your article slightly misrepresents the parameters of the study.

    It was not a blind remarking of whole scripts, but data collated from the marking of “seed” items within exam scripts. My experience of marking is that it’s not uncommon for these seed items to cause disagreement: indeed, that’s part of their purpose, as it allows ‘drift’ of markers to be checked.

    Also, exams are no longer marked as whole scripts – all exam boards split question papers up into items, with items deliberately being assigned to different markers in order to reduce variability that was caused when whole scripts were marked by examiners that were either “harsh” or “generous”.

    My experience – having been examining since 1997 – is that mark changes during re-marking are small; and smaller now on most questions (except those using levels descriptors) than for previous GCSE / A level iterations.

  16. Hi Tom

    Thank you. Yes, in my blog, I certainly paraphrased Ofqual’s methodology, but I don’t think I have misrepresented the study’s key results – the average reliabilities of the grades awarded for each of 14 subjects. And it’s the final grades, as awarded, that count, for those grades determine the candidates’ futures.

    The chart shows the probability that a candidate sitting a given subject will be awarded the ‘definitive’ grade – where Ofqual use the word ‘definitive’ as a synonym for “as corresponds to marking by a senior examiner” (see, for example, page 4 of their 2018 report Marking Consistency Metrics – An update, https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/759207/Marking_consistency_metrics_-_an_update_-_FINAL64492.pdf)

    Furthermore, on page 20 of Ofqual’s November 2016 report, Marking Consistency Metrics (https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/681625/Marking_consistency_metrics_-_November_2016.pdf), we see the sub-heading “Probability of definitive (‘true’) grade”. This implies that Ofqual are taking the ‘definitive’ grade to be the ‘true’ grade, or (my word) the ‘right’ grade.

    My reading of this chart is that, on average across GCSE, AS and A level, for every 100 candidates sitting Physics, about 85 are awarded the ‘definitive’ = ‘true’ = ‘right’ grade, from which I infer that 15 are awarded the ‘non-definitive’ = ‘false’ = ‘wrong’ grade; for History, about 56 right, 44 wrong.

    Weighting each subject by the corresponding cohort suggests that the overall reliability, across all levels, for the 14 subjects shown, is about 75% right, 25% wrong.

    That’s the evidence. The BIG QUESTION is “does it matter that, on average, about 1 grade in 4 is wrong?”.

    What do you think?

  17. Hi again – a correction, if I may: it’s Biology that is about 85 right, 15 wrong; Physics is about 88/12. Sorry about that!

  18. S. Smith says:

    Hi, read your comments with great interest. Have known for a long time that some exam papers are not marked accurately. A number of times I have challenged the exam board, but they are not interested. Going through another appeal and will see what happens. Ofqual data suggest that there is a 100% chance that grades do not change at A level, so why do they have appeals? Not fair on the students.

  19. Ben Chua says:

    “Wrong grades are wrong both ways, and so although any GCSE certificate with 8 grades might have two wrong grades, one is too high and the other too low. So it all comes out in the wash. It doesn’t matter.”

    Hi, I wanted to say that this comment is problematic because it is statistically speaking the equivalent of utter bullshit. It fails to take into account variability in the distribution of occurrences across students. A student might be lucky, securing one or more (perhaps even all) higher “wrong” grades. Another might be unlucky enough to secure all lower “wrong” grades. Yet another might receive all “accurate” grades. Consider as well the poor sucker who takes History, English Language and Literature, and Economics – the probability that there will be no errors in awarded grades is unacceptably low.
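
    Taking approximate reliabilities from the chart (my own readings, so the exact numbers are illustrative), the chance of an error-free certificate for that combination is roughly:

        # approximate per-subject reliabilities, read off the Ofqual chart
        combo = {"history": 0.56, "English language": 0.61,
                 "English literature": 0.58, "economics": 0.74}

        p_all_right = 1.0
        for p in combo.values():
            p_all_right *= p
        print(f"all four grades right: {p_all_right:.0%}")   # ~15%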

  20. D says:

    Hi Ben

    Thank you for your comment! Yes…and I could have thought of some other ‘equivalents’ too!!!

    As you will see from the blog, I introduced that statement with the phrase “I have heard it said that…”.

    That’s code for “I was at a meeting where someone said that…”.

    The meeting was held under ‘Chatham House’ rules, which I am obliged to honour, so I am unable to disclose whose words they are…. suffice it to say that meetings held under ‘Chatham House’ rules tend to attract interesting people…

  21. Matt says:

    A mostly very good article, but your throwaway line – “Given the cohorts that take these combinations, does that suggest a gender bias?” – rather takes away from the otherwise well reasoned argument. When talking about statistics, it’s best not to make links based on nothing but the impression of a correlation. The far more likely explanation for differences in marking reliability between subjects is the inherent nature of the subjects dictating how easy it is to assign a representative mark. At the simplest level, a Maths question (at A level or GCSE) will have a correct answer, easily identifiable, and the paper is made up of a series of such questions. An English paper will have a series of long-answer questions, each very hard to assign an exact mark to; even very good answers could justifiably be knocked down a couple of marks. What would the gender bias explanation be? Inherent bias against girls means examiners give less certain marks?

    On the issue of marks, though, I would say your last point is important, about where the idea of “grades” came from. Discrete grades should only be used where the people assigning the grade have sight of a large sample of the cohort’s answers and are then able to discuss those close to grade boundaries, as happens at some universities. When what we have is a continuous marks system, I think it makes far more sense to award the UMS score (a mark out of 100), with the uncertainty of that score published. This is already what many universities actually end up looking at anyway, when making decisions on candidates who have just missed their grades.

  22. Hi Matt – thank you; my apologies – I didn’t explain my ‘gender bias throwaway line’ clearly.

    Firstly, you are absolutely right in your assertion that different subjects are associated with different intrinsic degrees of grade reliability, as fundamentally related to the underlying material. That explains the sequence of subjects in the chart, with Maths having the most reliable grades, English and History the least. And within each subject, I do not suggest any gender bias whatsoever.

    The next question I wished to pose was “What is the probability that a student taking three subjects will be awarded three correct grades?”. That depends on the subject combination, with Maths, Further Maths and Physics having a higher probability of three correct grades than English Language, English Literature and History.

    It’s at that point I leap into gender stereotyping – if boys are more likely than girls to take Maths, Further Maths and Physics; and girls more than boys to take English Language, English Literature and History, then those boys are more likely to receive three correct grades than those girls. I wasn’t too sure of my ground, though – which is why, instead of making a bald assertion, I ask the question “does that suggest a gender bias?”.

    There is some data on A level choices: for example, this HEPI blog (https://www.hepi.ac.uk/2018/08/18/perspectives-2018-level-results/) shows that in 2018, some 59,000 boys took Maths, compared to 38,000 girls; about 53,000 girls took A level English, about 19,000 boys. And there’s a report for 2017 from Cambridge Assessment too (https://www.cambridgeassessment.org.uk/Images/518880-uptake-of-gce-a-level-subjects-2017.pdf).

    Once again, apologies for not being clear in the first instance – I do hope this explanation is clearer!
