This piece is the latest in a series by Dennis Sherwood, who has been tracking the developments around Ofqual and public exams this year. You can find Dennis on Twitter @noookophile .
The great grade unreliability cover-up
Do you remember how we all laughed at Donald Trump when he ordered that the US should slow down testing for Covid-19 because too many cases were being discovered? Of course! If you stop testing, you don’t find any cases, so there aren’t any! Problem solved!
With that in mind, here are two charts relating to school exam grades changed as the result of an appeal (the official term is ‘challenge’) in England from 2009 to 2019: on the left, AS and A level; on the right, GCSE.
Grades changed in England, 2009-2019 as % of total of grades awarded
The numbers of grades changed each year, expressed as a percentage of the corresponding total number of grades awarded, are shown in blue; the red line is my extrapolation of the 2009 to 2015 trend.
Very briefly, two points of detail.
- First, because the number of grades awarded has in general been falling (AS and A levels from over 2 million awarded in 2009 to fewer than 900,000 in 2019; GCSE from more than 5.6 million, down to just under 5 million in 2016, and back up to about 5.2 million in 2019), the percentage of grades changed provides a more meaningful year-on-year comparison than the corresponding absolute number.
- Secondly, my extrapolation is not linear, but exponential, for which there is good, but unpublished, evidence: please post a comment if this might be of interest, and I will give further details.
As the two charts show, between 2009 and 2015, the percentage of grades changed rose year-on-year, from rather more than 0.3% to about 1.2%, increasing by a factor of nearly 4 over six years. Then, in 2016, this pattern went into reverse, and ever since, the percentage of grades changed on appeal has remained below the extrapolation of the 2009 to 2015 trend, substantially so by 2019.
What happened in 2016?
That was the year the exam regulator Ofqual changed the rules for appeals.
Before 2016, a candidate who was unhappy with an awarded grade could request a re-mark – so, for example, a script originally marked, say, 59 might be re-marked 59, or perhaps 58 or 60. That’s not because the original examiner made a mistake; it’s a consequence of legitimate differences in academic judgement, as acknowledged by Ofqual in their statement that:
There is often no single, correct mark for a question.
The re-marks of 58, 59, and 60 can therefore all be valid, and if grade C is from 57 to 62 inclusive, the original grade C is confirmed by all three. But if grade C is from 55 to 59, and grade B from 60 to 64, then a re-mark of 60 results in an up-grade to B, implying that the original grade C was unreliable.
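The arithmetic of that worked example can be sketched in a few lines of Python. The boundary values are the illustrative ones from the text; the `grade` helper and the dictionaries are my own names, invented purely for this sketch:

```python
# Hypothetical grade boundaries, for illustration only - real boundaries
# vary by subject, exam board and year.
def grade(mark, boundaries):
    """Return the grade whose (low, high) range contains the mark."""
    for letter, (low, high) in boundaries.items():
        if low <= mark <= high:
            return letter
    return None

wide_c = {"B": (63, 70), "C": (57, 62)}    # C runs from 57 to 62
narrow_c = {"B": (60, 64), "C": (55, 59)}  # C runs 55-59, B starts at 60

remarks = [58, 59, 60]  # three equally legitimate marks for one script

# With the wider C band, all three re-marks confirm the original grade C.
print([grade(m, wide_c) for m in remarks])    # ['C', 'C', 'C']

# With the narrower bands, the same re-marks straddle the C/B boundary,
# so the grade depends on which legitimate mark happens to be given.
print([grade(m, narrow_c) for m in remarks])  # ['C', 'C', 'B']
```

The point of the sketch is that nothing about the marking changes between the two runs; only the position of the boundary relative to the 'fuzziness' of the mark determines whether the grade is reliable.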
But since 2016, appellants have been required to demonstrate that an original mark was ‘unreasonable’, or that there has been a ‘marking error’, such as a failure by an examiner to comply with the mark scheme. ‘Re-marks on request’, in which a second marker might give a script a different mark, perhaps resulting in an up-grade, have explicitly been disallowed. Or, to use Ofqual’s own words:
It is not fair to allow some students to have a second bite of the cherry by giving them a higher mark on review, when the first mark was perfectly appropriate.
As a result, eyes have been diverted away from a very important ball.
That ball? The fact that GCSE, AS and A level grades are unreliable.
Restrict the grounds for appeal, and fewer grades will be changed, and the problem of unreliable grades will go away! Problem solved!
Perhaps Donald Trump would approve.
Exam grades are unreliable – the evidence
As time has passed, this has unravelled. Most importantly, in November 2018, Ofqual published Marking consistency metrics – An update, presenting the findings of an extensive research project in which the reliabilities of the grades of 14 subjects were measured, as shown in the report’s Figure 12, reproduced here:
This chart shows the results of marking whole-subject cohorts twice: once by an ‘ordinary’ examiner, once by a ‘senior’ examiner. The grades corresponding to each of the two marks were then compared (with the senior examiner’s grade deemed ‘definitive’), so answering the question ‘For an entire subject cohort, what percentage of candidates are awarded the same grade by both an ‘ordinary’ examiner and a ‘senior’ examiner?’
You might expect that the answer would be ‘Pretty close to 100% for all subjects’.
Referring to the heavy lines within the darker blue boxes, on average, for Maths (all varieties), about 96% of grades are the same; Psychology, 78%; Geography, 65%; History, 56%.
These numbers are important. If the grades as determined by the mark of an ‘ordinary’ and a ‘senior’ examiner are the same, that grade is reliable. But if they are different, then the originally-awarded grade must be in doubt, for no one knows which examiner marked the script. That implies that about 4% of Maths grades are unreliable; 22% for Psychology; 35% for Geography; 44% for History. In practice, of course, no one knows which specific candidates have been awarded unreliable grades – and the only way to find out is to appeal, and request a re-mark. But since 2016, ‘re-marks on request’ have been disallowed …
Weighting the subject percentages by the corresponding subject cohorts shows that the average grade reliability across these 14 subjects is about 75%. Putting that the other way around, 25% of grades are unreliable. Or as a candidate actually awarded such a grade might say, ‘On average, about 1 grade in 4 is wrong’. More vividly, since about 6 million grades are announced each August, that’s the ‘award’ of about 1.5 million ‘wrong’ grades every year.
This has an important implication: the oft-used statement that only ‘about 1% of all awarded grades are changed’ is true, but perhaps misleading, in that the unwary might infer that the remaining 99% of unchanged grades are therefore ‘right’. This inference, however, is false, for the approximately 1% of grades changed on appeal represent only a small fraction of the true population of unreliable grades – about 25%.
About a year after ‘Figure 12’ was published, Ofqual confirmed that exam grades are significantly unreliable in an announcement posted to their website on 11 August 2019:
… more than one grade could well be a legitimate reflection of a student’s performance …
This statement is both unquantified and unqualified, and so – presumably – applies to all grades in all subjects at all levels. And since there is only one grade on the certificate, how many other ‘legitimate reflections’ might there be? Are any higher? Or lower? So how (un)trustworthy, how (un)reliable, is the one grade that is actually awarded? These are important questions; questions that remain to be answered.
Grades are indeed unreliable, and have been for a long time, as also verified in Ofqual’s announcement of 11 August 2019:
This is not new, the issue has existed as long as qualifications have been marked and graded.
Furthermore, here is an extract from page 70 of a report published by the exam board AQA fifteen years ago in 2005:
However, to not routinely report the levels of unreliability associated with examinations leaves awarding bodies open to suspicion and criticism … because reporting low reliabilities and large margins of error attached to marks or grades would be a source of embarrassment to awarding bodies. Indeed it is unlikely that an awarding body would unilaterally begin reporting reliability estimates or that any individual awarding body would be willing to accept the burden of educating test users in the meanings of those reliability estimates.
Yes, that is a somewhat inelegant split infinitive, and, yes, that does say ‘source of embarrassment to awarding bodies’. In which case, surely it is the duty of the regulator to ‘routinely report the levels of unreliability associated with examinations’, especially since, under Section 22 of the Education Act 2011, Ofqual has a statutory obligation:
…to secure that regulated qualifications give a reliable indication of knowledge, skills and understanding…
Those who draft legislation are exquisitely careful with their choice of words – and the specific word chosen here is ‘reliable’.
As we have seen, however, grades are not reliable, nor does Ofqual ‘secure reliability’. Which is not only a surprise in its own right, but particularly so given that the lead author of AQA’s 2005 report from which that extract has been selected is currently Ofqual’s Executive Director for Strategy, Risk and Research. And let me flag that word ‘risk’ – you’ll see why shortly.
Grades have been unreliable for a long time, and Ofqual knows all about it. Isn’t it about time this problem was fixed?
The key problem – marks are inherently ‘fuzzy’
But what, precisely, is the problem? Why are grades so unreliable?
There are two possible answers.
Firstly, because of ‘marking errors’.
Secondly, because marking is inherently ‘fuzzy’, as expressed most lucidly in a 2016 Ofqual blog:
In long, extended or essay-type questions it is possible for two examiners to give different but appropriate marks to the same answer. There is nothing wrong or unusual about that.
An ‘ordinary’ examiner can therefore give a mark which is legitimate, but different from that given by a ‘senior’ examiner. This is not a problem if both marks correspond to the same grade. But although Ofqual claim that ‘There is nothing wrong or unusual about that’, I think that there is something very wrong indeed if, as already discussed, those two marks are on different sides of a grade boundary, resulting in an unreliable grade. As is the case for about 1 grade in every 4. Ofqual are therefore surely right in saying ‘There is nothing… unusual about that’. 1 in 4 is hardly ‘unusual’.
These two possible causes of grade unreliability are important, for they imply two, very different, solutions.
If grades are unreliable primarily because of ‘marking errors’, then the solution is to blitz the quality of marking.
But if it’s because marking is ‘fuzzy’, the solution is to change the policy by which grades are determined from inevitably ‘fuzzy’ marks, so that ‘fuzziness’ is taken into account and does not penalise the candidate.
Once again, Ofqual’s announcement of 11 August 2019 is very helpful, for it resolves this dilemma in favour of the second explanation. The unreliability of grades primarily results from:
… the implications of there not being a single, right mark for every answer given in every subject …
Certainly, ‘marking errors’ happen, making matters worse, but it is ‘fuzziness’ that has the much greater impact on grade (un)reliability. And even if marking is of the highest possible quality, as long as human beings do the marking, ‘fuzziness’ will always be there, with some subjects being intrinsically ‘fuzzier’ than others – that explains the subject sequence in the ‘Figure 12’ chart.
Looking ahead – grades must be reliable and trustworthy
There are many aspects of the educational system that many believe could – and should – be substantially better, from the curriculum to the scrapping of GCSE, from the nature of assessment to the iniquity of ‘The Forgotten Third’. But whatever the future might hold, it is likely that exams will play some part. In which case, it is imperative that the resulting grades are reliable and can be trusted.
Amid all the anxiety concerning this year’s process for awarding exam-free grades, Ofqual surely has many matters to attend to, not least the recent recommendations of the Education Select Committee, as well as planning robustly for the 2021 exams.
But if there is to be just one good educational outcome of the current disruption, I sincerely hope that it will be to fix the problem of unreliable and untrustworthy grades, and to ensure that exams, in whatever form they might take in the future, have reliable and trustworthy assessments.
And there is an ideal opportunity on the table to do just that. Or rather not quite on the table, but nearly so, for Ofqual’s Corporate Plan 2020-21, published on 17 July, lists the key risks they have identified, and the actions they intend to take to address them.
All risk registers, however, suffer from the same problem: the biggest risk is probably ‘the risk we haven’t thought of’ – to which recent experience might well bear testimony.
The compilers of a risk register might be forgiven for not thinking of every possibility. But rather different is the failure to include a risk that is known, but – presumably deliberately – ignored.
For example, page 5 of the Plan lists Ofqual’s five statutory objectives, of which this is the first:
In brief, they are to:
1. Secure qualifications standards.
Brief indeed. And, as far as it goes, true; albeit an ‘interesting’ paraphrase which just happens to omit any reference to that all-important adjective ‘reliable’, despite that word’s prominence in the legislation.
And on page 9, under the heading ‘Address systemic risks’, the first three (of seven) items are:
1. Review the effectiveness of new security measures proposed or adopted by exam boards to secure delivery of GCSE, AS and A level exam papers.
2. Continue to require improvements to exam boards’ quality of marking, including research into our expectations for the quality of marking and assessing the potential for artificial intelligence to improve the marking process.
3. Continue our review of current practices for moderating teachers’ marking, and potential areas for improvement.
Yes, we can all nod in agreement. Ensuring exam papers aren’t ‘lost’ is sensible, and it’s good to improve the quality of marking.
But what is not mentioned, that might have been?
How about ‘ensure that grades, as awarded, are reliable and trustworthy’?
Ofqual’s Corporate Plan talks of improving marking quality, which is indeed sensible, for the more that ‘marking errors’ are reduced, the better. But that won’t solve the far greater, and more important, problem of grade unreliability – a problem that the plan does not even mention.
That’s a puzzle, for only by ensuring that grades are reliable can Ofqual deliver two of their plan’s five priority outcomes:
— Grades awarded are as fair as possible.
— Students, schools, colleges, higher education institutions and employers are confident in the grades awarded.
Right now, everyone’s attention is focused on the imminent announcement of this year’s exam-free grades, which, despite all the problems with the process, could still be fairer than those resulting from exams.
But looking to the future, we cannot revert to the unreliable past. Perpetuating ‘1 grade in 4 is wrong’ is wrong.
So why isn’t ‘change the policy by which grades are awarded to ensure they are fully reliable and trustworthy’ item no. 1 on Ofqual’s ‘to do’ list?
I can assure you as a school exams officer that appellants haven’t had to demonstrate anything at all to get their paper’s marking reviewed!
The difference is that it is now only changed if the original mark was “out of tolerance”.
Agree about the inherent fuzziness being the problem, but I am not seeing any feasible way to address it. I’ve marked exact subjects and fuzzier ones (projects). The latter tend to have a “levels” or “bands” approach which you might hope would address it – but it is still possible for different markers to put things in entirely different bands.
The issue is, up until 2015, years and years of students have been able to obtain a higher grade when there were a couple of points in it, and why not? If one examiner marked the script a couple of marks higher and it merits the higher grade, I don’t see any issue with that.

My child was a couple of marks out of a higher grade in five subjects at GCSE in 2018. I obtained the scripts, after a review, and you could see where the ‘extra’ marks could have been awarded. We asked the school to appeal but they refused. We went through the school’s appeals process but they still refused. Schools have neither the budgets to draft lengthy appeals documents, nor the staff resources to do it. After the A level debacle of this year, how important might those higher grades have been when a teacher was assessing the likely A level grade? So, it just compounds things. It is also manifestly unfair, when, if my child had been born a few years earlier, I have no doubt we would have higher grades in another four subjects.

What is worse in our situation is we were wrongly advised by the school that my child couldn’t appeal/review the composition element in music, as only the whole cohort could do this and other students may have been marked down, so the school wouldn’t do this (they admitted in writing this advice was incorrect). By the time Edexcel confirmed this information was wrong, the time limit had passed, so my child couldn’t do anything about it, where they could have done had they had the right to lodge an appeal direct. That subject may have gone up by two grades. There are also specific issues where the outcome in this subject is a complete travesty.

It should be a right of any student to lodge their own appeal; it should not be that only Heads of Centre can do this. I don’t know of any other exam system where a direct appeal by an individual is prevented. Ofqual must change this immediately.
> These two possible causes of grade unreliability are important, for they imply two, very different, solutions.
In the linked post you propose “fewer grades” as one possible solution for this year’s shambles.
But why not adopt the “fewer grades” system permanently? The fact that marks are unreliable is a strong indication that the current division is too fine-grained: we’re pretending that we can measure something with a degree of precision that simply isn’t possible in practice.
Students are taught to give answers to an appropriate degree of precision (number of significant figures). Exam boards should surely do the same. For subjects where marking is fuzzy we should reflect the fuzziness with a few broad grade bands: pass/fail would often be sufficient.
Cath – thank you; yes, “out of tolerance” is likely to be evidence of a ‘marking error’, and so comes within the post-2016 rules for a re-mark. To me, though, the BIG PROBLEM arises when “tolerance” straddles grade boundaries, as it does for about 1 in 4 scripts.
And yes, you’re right that the problem-to-solve is how best to address the real issue of fuzziness, recognising that some subjects are intrinsically more fuzzy than others.
Banding is one possibility, as Jeremy points out. If there are just two bands – pass/fail – and if scripts marked close to that boundary are very carefully reviewed, then that can work.
Another possibility which I think is well-worth thinking about is to replace pass/fail or graded exams with a “passport”, describing what a student can do, rather than highlighting what he or she can’t.
This is vividly described in ASCL’s report “The Forgotten Third”, referring to the ⅓ of students who are destined to be ‘awarded’ GCSE grade 3 or lower because that is what “no grade inflation” demands: ⅓ of all students have ‘failed’ before they have even stepped into the exam room. That is terrible.
There’s a link to the ASCL report in the blog; there is also a newly-published book, by Roy Blatchford, which breaks truly new, exciting, ground – https://www.amazon.co.uk/Forgotten-Third-third-thirds-succeed/dp/1913622029.
I write this on 20 August, GCSE results day… thank you all for your comments… this blog might have some relevance in the days, weeks and months to come… I hope so…