- This blog was kindly provided by Mary Curnock Cook CBE. Mary has long experience in exams and assessments including during her time as a Director at the Qualifications & Curriculum Authority in the early 2000s. She is also a Trustee of HEPI. You can find Mary on Twitter/X @MaryCurnockCook.
- Sign up for our free webinar with UCAS Chief Executive Clare Marchant, taking place this morning at 11am.
In this blog, I want to provide some context and challenge to two erroneous statements that are made about exam grades:
- That ‘one in four exam grades is wrong’
- That grades are only reliable to ‘within one grade either way’
For teenagers, exam results days mark major milestones. For most GCSE, A level and BTEC students, the grades they get awarded this summer represent a passport which will either provide open visas to all sorts of opportunities or, in some cases, will restrict their movement up the ladder of opportunity. For admissions professionals in universities, A level results day is critical in determining the number and ‘quality’ of students admitted and the impact this has on a university’s finances and league table rankings. In short, grades are important. It is therefore vital that grades are reliable and trusted by all those who use them – students, universities and colleges, and employers.
The ‘one in four grades is wrong’ claim – a gross misunderstanding
The claim that one in four grades is wrong is derived from a 2018 Ofqual research report on Marking Consistency Metrics. This report uses two descriptors for marks awarded – ‘definitive’ marks and ‘legitimate’ marks. A ‘definitive mark’ is described by Ofqual as ‘the terminology ordinarily used in exam boards for the mark given by the senior examiners at item level’. Such marks are seeded into the mass marking of exam questions to ensure that no markers are routinely marking too leniently or too harshly. In other words, the ‘definitive mark’ is a proxy ‘correct’ mark used for quality assurance processes only.
The probability of being awarded a ‘definitive’ grade at qualification level varies from almost 100% for subjects like mathematics, to nearer 50% for the more subjective judgements used in subjects like English or History. If you were to weight these probabilities across subjects to derive an average, you might arrive at a probability of roughly one in four marks not being the ‘definitive’ mark. Critics have extrapolated this to the idea that one in four grades must therefore be wrong, which is a gross misrepresentation of the position, not least because no-one seriously advocates for an assessment regime where there are only correct or incorrect answers to all exam questions.
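To see how such a weighted average can produce the headline figure, here is a minimal sketch in Python. The subject probabilities and entry shares below are invented for illustration only, not Ofqual’s actual figures:

```python
# Illustrative only: hypothetical probabilities of receiving the
# 'definitive' mark in each subject, weighted by hypothetical shares
# of total exam entries. These numbers are assumptions for the sketch.
subjects = {
    # subject: (P(definitive mark), share of total entries)
    "Mathematics": (0.96, 0.25),
    "Sciences":    (0.85, 0.30),
    "English":     (0.58, 0.25),
    "History":     (0.55, 0.20),
}

weighted_p = sum(p * share for p, share in subjects.values())
print(f"Weighted probability of the definitive mark: {weighted_p:.2f}")
# A figure around 0.75 is how 'one in four marks is not definitive'
# arises - but 'not definitive' is not the same as 'wrong'.
```

The point of the sketch is that the headline number is an average dominated by subjects with extended-response questions, where several different marks can be equally legitimate.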
It is therefore important to understand the concept of a ‘legitimate’ mark which arises because our system uses a variety of assessment approaches for different subjects to ensure that each exam is a valid way of assessing what we expect the candidate to know and be able to do in a particular subject. In many subjects there will be several marks either side of the definitive mark that are equally legitimate. They reflect the reality that even the most expert and experienced examiners in a subject will not always agree on the precise number of marks that an essay or longer answer is worth. But those different marks are not ‘wrong’.
Mathematics assessment, for example, is very reliable at the ‘definitive’ level because there is more likely to be an objectively correct answer to many questions. In contrast, English is less reliable at the definitive level because of the more frequent (and desirable) use of longer, extended response questions to which there is no objectively ‘correct’ answer. As Ofqual’s paper says: “To have one benchmark against which all components from subjects are compared would be a very blunt tool for comparison,” adding that it would “neither set a standard for marking consistency that was achievable […] nor work to improve components in those subjects.” The ‘definitive’ mark was used in Ofqual’s paper because those were the only marks captured by the exam boards that were suitable for this kind of statistical analysis.
We can all accept that there is inherent imprecision in many assessments, including in Higher Education. For this reason, and using the internationally accepted benchmarks for the reliability of assessments, regulators use the concept of a legitimate grade, defined as 95% or more of grades falling within one grade either side of the ‘definitive’ grade. Ofqual’s analysis shows that “the probability of being awarded a grade within one grade of the definitive grade is 1 or nearly 1 (ie 100% or near 100% probability) for nearly all subjects”, as shown in Figure 14 from their report below.
Ofqual’s analysis (see Fig 16 below) shows that our exam system has been able to mark reliably within this internationally accepted definition very consistently over time. We can therefore count our system as amongst the best in the world for reliable assessment.
The ‘reliable to one grade either way’ claim misunderstands the context
This internationally accepted benchmark for exam marking reliability is the context for the former Ofqual Chief Regulator’s comment at an Education Select Committee hearing after the summer exams in 2020 that grades are “reliable to one grade either way” (Q1059). Some commentators have chosen to weaponise this statement in a way that shows poor understanding of the concepts underpinning reliable and valid assessment and risks doing immense damage to students and to public confidence in our exam system.
Public confidence and trust in our exam system is vital for a functioning education system and to support students to progress to further learning or employment. This was sadly confirmed during the two years of the pandemic when exams were cancelled and replaced by teacher-assessed grades. During the UPP Foundation Student Futures Commission which I chaired in 2021, I was dismayed to hear students describe themselves as the ‘cohort with the fake grades’. My own research in 2021 highlighted an unexplained difference between the teacher-assessed A level grades for boys and girls, and numerous press articles argued about perceived unfairness between different schools, different pupil characteristics and different contexts during the years when exams were cancelled. Only a national exam system with a standardised and quality-assured marking system that meets international benchmarks can put assessment on a level playing field for all students.
Without such a national system of assessment, we would be unable to run a fair university admissions process, and visibility of attainment gaps between different groups of students would disappear with the consequent risks to social justice and social mobility.
If it were true that one in four exam grades is ‘wrong’, there would be national outrage. Instead, the vast majority of grades are accepted as legitimate, with only a tiny proportion contested. Last year, out of a total of 846,885 AS and A level grades awarded in England, only 9,910 grades were changed following administrative errors, reviews of marking or reviews of moderation, according to Ofqual. This represents just over 1% of all grades awarded.
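A quick check of the arithmetic, using the Ofqual figures quoted above:

```python
# Ofqual figures for AS and A level grades in England last year,
# as cited in the text above.
total_grades = 846_885    # grades awarded
changed_grades = 9_910    # changed after admin errors or reviews

pct = changed_grades / total_grades * 100
print(f"{pct:.2f}% of grades were changed")  # ≈ 1.17%
```

So ‘just over 1%’ is, if anything, a generous description of how many grades were contested successfully.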
Since schools have visibility of exam scripts and marks after grades are awarded, they have full access to their pupils’ performance in the exam and the way this was judged by markers. Schools and colleges can request a review of marking if they think there is an error in the marking, and if they are not satisfied with the outcome of that, they can appeal on a number of grounds, including academic judgement.
Take a typical sixth form of 200 A level candidates, each taking three A levels, thus 600 grades awarded. If there really were one in four or 150 ‘wrong’ grades awarded in each school like this, I have no doubt that the number of challenges would explode, regardless of the cost. Instead, the reality is that 99% of students, their teachers and their parents, with access to their exam scripts and marks as evidence, accept their grades as a legitimate measure of their performance.
Those who question their grades are often students whose marks put them at the borderline between grade boundaries. The reason that so few are regraded on appeal is that a student’s marks are often derived from multiple components (papers) and multiple individual items (questions) within those components. Since the majority of exams are marked entirely anonymously at item or question level (as opposed to markers marking whole scripts from each student), any harshness or leniency is likely to be levelled out at each stage of aggregation – which is why qualification-level grades are more reliable than component-level or individual item marks.
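This levelling-out effect of aggregation can be illustrated with a toy simulation. The number of items and the size of the per-item marker error below are assumptions for the sketch, not real exam parameters:

```python
import random

random.seed(42)

N_ITEMS = 40     # assumed number of individually marked items per qualification
ERROR_SD = 1.0   # assumed marker error (in marks) on a single item
TRIALS = 10_000

# Compare the error on a single item with the average per-item error
# once all items in a qualification are aggregated.
single_item_errors = []
aggregated_errors = []
for _ in range(TRIALS):
    errors = [random.gauss(0, ERROR_SD) for _ in range(N_ITEMS)]
    single_item_errors.append(abs(errors[0]))
    aggregated_errors.append(abs(sum(errors) / N_ITEMS))

mean_single = sum(single_item_errors) / TRIALS
mean_agg = sum(aggregated_errors) / TRIALS
print(f"Mean error on one item:                  {mean_single:.2f} marks")
print(f"Mean per-item error after aggregation:   {mean_agg:.2f} marks")
# Independent harshness and leniency largely cancel out when marks are
# aggregated, so qualification-level grades are more reliable than
# individual item marks.
```

Under these assumptions the aggregated error is several times smaller than the single-item error, which is the statistical intuition behind the point above.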
Some have called for individual marks to be stated on exam certificates alongside a confidence interval, but assessment outcomes expressed at the level of marks rather than grades simply create more cliff edges and borderlines for students – a particular challenge for highly competitive university admissions. There are enough criticisms of ‘teaching to the test’ already without adding summers filled with mark-harvesting. Broad grade widths are always going to be more reliable than marks, and users accept that a grade is a legitimate categorisation of what a student knows, understands and can do in respect of a particular specification, domain, or subject.
We know this because it works. Students get their grades and are able to use them effectively to support their progression to further learning, training, or employment. Universities have ample evidence about the relative success rates for students with different grades and continue to be confident in setting their course entry requirements accordingly.
None of this is to say that exam grading reliability cannot be improved. Good assessment design is essential, as is markers’ use of the full range of marks for a given question. The rate of ‘seeding’ control items into markers’ caseloads and the tolerances allowed by the exam boards also contribute to the quality of marking. Mass double marking, currently unaffordable in both time and cost, could become standard if artificial intelligence engines were used to provide effective, unbiased second-marking quality assurance for millions of items at a fraction of the cost.
I would also not wish my confidence in the marking of the current exams to be misconstrued as unvarnished support for our current approach to assessment in schools. I’ve been encouraged to read of Ofqual Chief Regulator Dr Jo Saxton’s openness to moving towards more digitally enabled approaches, which would support both innovation in assessment methodology and better data for further improving the reliability and validity of assessment.
The assertion that ‘one in four school exam grades is wrong’ is a gross misrepresentation of reality and represents a naïve understanding of assessment. No one wants an assessment regime that reduces everything assessed to ‘right’ and ‘wrong’ answers and that’s why stakeholders in the education and employment ecosystem continue to accept that legitimate grades really are legitimate and represent meaningful currency for progression to further learning or employment for the vast majority of students.