This blog was kindly contributed by John Craven, Chief Executive of upReach, the award-winning social mobility charity that works in partnership with universities to support the progression of disadvantaged students into highly skilled jobs. You can find John on Twitter @upReachJCraven.
Ofqual were set a nearly impossible task: provide students with the grades they would most likely have achieved had they sat their exams, while maintaining overall national standards relative to prior years and protecting groups from being systematically advantaged or disadvantaged. The methodology arguably failed on all three aims. The situation we're left with is in many ways worse. But could this all have been avoided had the methodology been better designed and communicated?
Pre U-turn
The four biggest issues with their methodology that forced Ofqual to change course can be summarised as follows.
- Firstly, the algorithm’s reliance on historical school data meant that outstanding students at schools where few people had ever achieved top grades were unfairly capped, and missed out on top university places as a direct result. One piece of evidence is that the attainment gap between students eligible for free school meals and their peers widened for A* and A grades, reversing years of hard-won progress by schools.
- Secondly, this reliance on historical grades, together with the mechanism by which prior attainment data were taken into account, seems to have particularly disadvantaged students at schools that had been improving in recent years, that had a particularly strong cohort in a subject, or that had weaker prior attainment data.
- Thirdly, the methodology had a bias that favoured those in small subject cohorts, who were partially or totally excluded from an algorithm that tended to downgrade. This favoured smaller schools and those offering niche subjects. Overall, it favoured private schools, where A* and A grades rose 4.7%, compared to only 0.3% at sixth form colleges. One small private school saw the proportion of A*/A grades increase from 18% to 48%. As widely reported, subjects commonly taught in small cohorts saw the severe grade inflation that Ofqual had been tasked to avoid: A*/A grades rose 16.5% in Music, 13.3% in German and 10.6% in Classical subjects, compared to 2.4% overall and as little as 0.2% in Sociology, where my research showed that in 2019 thirty times as many entries came from sixth form and FE colleges as from private schools. My model estimates that this “small subject” effect alone meant that entries from independent schools were 7-8 times more likely to escape the algorithm than those at sixth form colleges, where I estimate students were 20% more likely to have their centre assessment grades downgraded. The Observer reported my research on how GCSEs might be similarly affected.
- Fourthly, the design of the process meant that the algorithm would be blamed for any disappointing grades, even though teacher rankings were a key input. In her recent blog, Mary Curnock Cook commented that “when a computer model generates those results rather than the pupils themselves, it becomes the target of understandable moral outrage.” As I explain later, this reduced the legitimacy of the results awarded, and could have been avoided. To meet their aims, Ofqual needed the algorithm to be complex, but this made it hard to explain exactly why a particular student had not achieved the grade they had expected. Combining results that often seemed inaccurate or unfair with this complexity made it hard for the public to accept them.
Post U-turn
Once the extent of the unfairness became clear, many organisations, including upReach, the Fair Access Coalition and the Fair Education Alliance, called for centre assessment grades to be awarded wherever they were better than the calculated grades, considering this the least-worst option given the unfairness inherent in the algorithm and the point we had reached. However, there were many adverse consequences:
- The enormous grade inflation, which has seen the proportion of A*/A grades increase by more than half, from 25.2% to 38.1%, destroys the absolute and relative value of the grades for the students who have received them. It makes the grades incomparable to those achieved in other years for all stakeholders. Rightly or wrongly, many graduate employers still look at A Level results to assess applicants, whom they may now struggle to differentiate.
- Inequity – As Ofqual noted, some schools were more conservative than others when submitting their centre assessment grades (CAGs), so switching to them has favoured students at the schools that were most optimistic. Indeed, this is why moderation is normally used in coursework assessment, and why standardisation was used by Ofqual.
- Less-advantaged groups – Furthermore, Annex Q of the Ofqual Technical Report shows that students with higher socio-economic status saw bigger grade improvements versus 2019 than those with lower socio-economic status. Less advantaged students may have benefited less from the grade inflation associated with the centre assessment grades that Ofqual had called ‘over-optimistic’, and hence could lose out in relative terms now that students are awarded the higher of their CAG and their calculated grade. Evidence that high-attaining students from disadvantaged backgrounds are under-predicted relative to others led Ofqual to issue guidance to teachers on avoiding unconscious bias when submitting their grades and rankings. We urgently need Ofqual to release more data showing the resulting grade distribution, broken down by centre type, free school meal eligibility and socio-economic background.
- University admissions have been dramatically affected, with tens of thousands more students meeting their first-choice offer, causing disruption at the most selective universities. Unless financial support is forthcoming, this potentially increases financial pressures on some universities, especially those with lower entry tariffs. Prior to the U-turn, some universities seemed to have outperformed their access targets by accepting disadvantaged students who had narrowly missed a grade ahead of others. Many welcomed this, on the grounds of contextualisation, the unfairness of the algorithm, or ensuring that targets in university Access and Participation Plans were achieved. Now that universities are obliged to accept all applicants who made their grades, this effect is reversed.
- The cohort effect may put those receiving grades in 2021 at a triple disadvantage. Not only are they missing out on learning due to Covid, but they will also be held to a higher standard when grades are awarded than the 2020 cohort and, depending on the number of 2021 places and how many 2020 entrants defer, they may face greater competition for university places.
- Giving students the better of their Calculated Grade and their CAG has resulted in even more grade inflation than if Ofqual had relied on CAGs alone. The proportion receiving an A* increased from 7.7% in 2019 to 8.9% with Ofqual’s Calculated Grades, compared to 13.9% for CAGs alone and 14.3% now that we take the better of the two. Altogether, around 2% of grades appear to have been boosted by Ofqual’s algorithm upgrading the assessed grades submitted by teachers, and anecdotally even the most academic schools benefited from this. The table below combines Tables 9.1 and 9.6 of Ofqual’s interim report with data released on 20th August.
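To make the mechanics explicit, here is a minimal sketch of the ‘better of the two’ rule, using invented grades; it is purely illustrative and not Ofqual’s code.

```python
# Minimal sketch of the "better of the two" rule: the final grade is the higher
# of the centre assessment grade (CAG) and the calculated grade, so algorithmic
# upgrades add to the inflation already present in optimistic CAGs, while
# downgrades are discarded. Grades below are invented for illustration.
GRADE_ORDER = ["U", "E", "D", "C", "B", "A", "A*"]

def final_grade(cag, calculated):
    """Award the better of the CAG and the calculated grade."""
    return max(cag, calculated, key=GRADE_ORDER.index)

print(final_grade("A", "A*"))  # algorithm upgraded the CAG, so the A* stands
print(final_grade("A*", "A"))  # algorithm downgraded, but the CAG (A*) is kept
```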
A better methodology
No standardisation process is perfect, but might Ofqual have been able to create an algorithm that was fairer, as part of a better designed overall methodology that was perceived as legitimate enough to be accepted?
There were five key changes they could have made to minimise the issues associated with the original methodology, increasing fairness and minimising grade inflation.
1) Include the year-on-year change in UCAS predicted grades in the algorithm as a better indicator of cohort strength
To remove the unfairness of outstanding achievers in a low-performing school having their grades capped, Ofqual needed a better way of adjusting historical grade data to reflect the strength of the 2020 cohort. The most recent and least biased indicator of this would be the year-on-year change in predicted grades submitted to UCAS, by subject. While UCAS predicted grades consistently over-predict actual grades – with some schools explicitly asking their teachers to do so – adding the change in predictions in 2020 into the algorithm would have meant outstanding students were no longer capped by historical grades. It seems clear that the way Ofqual used GCSEs as a measure of prior attainment failed to do this adequately. Just as CAGs were considered by Ofqual to be ‘over-optimistic’, so are the predicted grades sent to UCAS. At subject level, Everett & Papageorgiou (2011) found that 41% of grades were over-predicted, 51.7% were accurate and only 6.6% were under-predicted, while more recent research suggests average over-prediction of approximately 0.6 of a grade. But the extent to which UCAS predictions increase or decrease compared to a prior year provides a teacher-based assessment of the relative strength of the cohort. The diagram at the end of the blog shows how this might work in practice. Mark Corver (DataHE) has also suggested that UCAS predicted grades contain important information that could have been used, for example, in appeals.
2) Let schools determine the grades individual students receive, based on grade allocations
Every year, thousands of students are disappointed by their exam results. What was different this year is that rather than being blamed on exam performance (or, as Dennis Sherwood has previously blogged here, inaccurate marking), it was an algorithm, an inhumane black box, that could be blamed for any result below expectation. The design of the algorithm was impossible for most to comprehend, especially as Ofqual could not release details until Results Day (otherwise some schools could have worked out their results early). Despite the key role that teacher rankings played in the grades awarded, the focus on the algorithm calculating grades reduced the legitimacy of the results. Regardless of improvements to the design of the algorithm, without grade inflation there would always be disappointed students, and they would find it hard to accept results they deemed to come from a machine rather than from knowledge of their abilities. This could have been avoided by giving schools a provisional allocation of (maximum) grades by subject, asking them to assign students to a grade, and to give rankings as before. Variations of this have been proposed before.
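As a purely illustrative sketch (not Ofqual’s process, and with invented allocation sizes and student names), the allocation-based approach might look something like this:

```python
# Hypothetical sketch of proposal (2): a school receives a provisional allocation
# of maximum grades for a subject and assigns its own ranked students to them.
# Allocation sizes and student names are invented for illustration.

def assign_grades(ranked_students, grade_allocation):
    """ranked_students: teacher's rank order, best first.
    grade_allocation: {grade: maximum number available}, highest grade first.
    Returns a mapping of student -> provisional grade."""
    assignments = {}
    students = iter(ranked_students)
    for grade, capacity in grade_allocation.items():
        for _ in range(capacity):
            student = next(students, None)
            if student is None:
                return assignments
            assignments[student] = grade
    return assignments

ranking = [f"student_{i}" for i in range(1, 11)]         # 10 students, best first
allocation = {"A*": 1, "A": 2, "B": 3, "C": 3, "D": 1}   # provisional maxima

for student, grade in assign_grades(ranking, allocation).items():
    print(student, grade)
```

The point of the proposal is that the school, rather than a remote model, makes the final call on which student sits where within the allocation.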
3) Build in an early appeals process
Schools could be offered the opportunity to appeal their grade allocations at the point of submission in June or July, but with a high bar. Appeals could be focused on top-performing students blocked by the algorithm from higher grades that they were demonstrably capable of achieving. Given the use of UCAS predicted grades as an input, the onus would be on schools to demonstrate why their students were ‘better’ than a prior-year cohort even though their UCAS predictions were not. Based on the level of appeals, the provisional grades proposed by teachers could then be moderated by exam boards or Ofqual, and final grades issued on Results Day.
4) Reduce or eliminate the small group bias
Given the extra predictive power of using UCAS predicted grades, the opportunity for early appeals, and the desire to reduce the known inequality effect of the small-group exception, the algorithm could have been applied to a greater number of students by lowering the small-group thresholds from a harmonic mean (a complicated statistical term defined on page 276 of Ofqual’s interim report) of 5 and 15 to, say, 3 and 6. The reduction in overall grade inflation this would bring could allow top grades to be more evenly distributed by the algorithm, rather than being concentrated in smaller schools, niche subjects and independent schools.
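For readers unfamiliar with the term, the sketch below shows a standard two-value harmonic mean and how lowering the thresholds would pull more entries inside the statistical model. It assumes, for illustration, that the mean is taken over the current and average historical cohort sizes; the precise definition Ofqual used is on page 276 of the interim report, and the example numbers are invented.

```python
# Illustrative sketch of the small-cohort thresholds in proposal (4). Assumes the
# harmonic mean is taken over the current-year and average historical cohort
# sizes (see p.276 of Ofqual's interim report for the exact definition); the
# lowered thresholds of 3 and 6, and the example cohort, are hypothetical.

def harmonic_mean(a, b):
    """Harmonic mean of two positive numbers: 2 / (1/a + 1/b)."""
    return 2 * a * b / (a + b)

def treatment(n_2020, n_historical, lower=5, upper=15):
    """Below `lower`, CAGs are used unchanged; between the thresholds, a blend;
    above `upper`, the entry is fully subject to the statistical model."""
    h = harmonic_mean(n_2020, n_historical)
    if h < lower:
        return "CAGs used unchanged"
    if h < upper:
        return "blend of CAGs and model"
    return "statistical model only"

cohort = (4, 6)  # e.g. 4 entries this year, 6 on average in the historical years
print("With Ofqual's thresholds (5/15):", treatment(*cohort))       # escapes the model
print("With lowered thresholds (3/6): ", treatment(*cohort, 3, 6))  # pulled into it
```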
5) Add extra information to the grade awarded to show candidates where the algorithm suggested uncertainty
An indicator, such as a ‘+’ or ‘!’, could be added to the grade awarded in any case where a student is ranked top, or in the top 25%, of their school’s cohort in that subject at that grade, indicating those closest to the grade boundary. This extra information would allow stakeholders such as universities to identify candidates ranked highest within a grade, or who may have been disadvantaged by the algorithm. It would be a helpful flag for those in small teaching groups now being subjected to the algorithm, given the greater potential inaccuracy for smaller cohorts.
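A minimal sketch of how such a flag might be attached, assuming the within-grade rank and the size of each grade’s cohort are available; the field names and numbers are invented for illustration.

```python
# Hypothetical sketch of proposal (5): flag students whose within-grade ranking
# suggests they were closest to the boundary with the grade above. Illustrative only.

def flag_uncertain(results):
    """results: list of (student, awarded_grade, rank_within_grade, grade_cohort_size).
    Flags the top-ranked student and anyone in the top 25% of their grade cohort."""
    flagged = {}
    for student, grade, rank, cohort_size in results:
        near_boundary = rank == 1 or rank <= max(1, round(0.25 * cohort_size))
        flagged[student] = grade + "+" if near_boundary else grade
    return flagged

results = [
    ("student_1", "B", 1, 8),  # top of the Bs -> flagged as B+
    ("student_2", "B", 2, 8),  # within the top 25% of 8 -> flagged as B+
    ("student_3", "B", 5, 8),  # unflagged B
]
print(flag_uncertain(results))
```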
The benefits of the approach would have included:
- The additional information provided by the change in UCAS predicted grades would allow the small-group threshold to be reduced, or even removed, eliminating the bias that favoured those in small subject cohorts over those in larger cohorts (and hence the large differences in the increase in top grades awarded between independent schools, sixth form colleges and other state schools).
- Since predicted grades for UCAS applications were made before Covid, the change in these predicted grades from prior years should, in most cases, be an unbiased indicator of the extent to which a school had a strong cohort or had improved during the year.
- Far fewer outstanding students in low performing schools would have been capped.
- Improving schools would not have been capped to the same degree.
- Grades determined by teachers, based on allocations of grades, with moderation, would have greater legitimacy and acceptance than those “awarded by an algorithm”.
- Running an earlier appeals process, based on schools appealing grade allocations, would have reduced issues on the day. Students unhappy with their grades would still have been able to appeal, subject to the support of their school, but such appeals would have been far fewer in number.
- The control on grade inflation would have maintained the longer-term credibility of A-Levels and ensured the number of students achieving their offers at each university was in line with expectations.
What are the possible limitations?
- No algorithm is perfect, and there would still have been students and schools that would have been disadvantaged by it.
- UCAS may not have been willing or able to provide the data. The DfE publishes very detailed databases – could these include more UCAS data in the future?
- Like other variations of the model, this would have needed to be tested. It is unclear, without access to the data, the extent to which UCAS predicted grades by centre correlate with actual results each year. Perhaps others, such as DataHE, have already investigated this.
- Not all of these changes could be applied to GCSEs, but a variation could have been, perhaps with a greater reliance placed on prior attainment data and use of average school-based value-add scores.
Looking forward, it is critical that the Ofqual model is thoroughly evaluated by an independent external body. In the meantime, our attention must shift towards ensuring that 2021 entrants are treated fairly.
Simplified explanation of how UCAS predicted grades could be incorporated into the algorithm:
The table below shows an example of a possible distribution of UCAS predicted grades and actual grades for a given subject in 2019, together with predicted grades in 2020. Around 60% of grades are over-predicted, in line with the evidence. In this example, the school had an outstanding student in 2020 and predicted an A*, unlike the prior year. The final row shows the difference in predicted grades between 2019 and 2020. This could be incorporated into the algorithm to identify strong cohorts, reducing the issue whereby high-attaining students at low-performing schools had their grades ‘capped’ by the historical performance of their school or college.
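A minimal sketch of the calculation that table describes, using invented predicted-grade counts purely for illustration:

```python
# Illustrative recreation of the worked example described above, with invented
# counts: compare a centre's UCAS predicted-grade profile for a subject in 2019
# and 2020, and use the difference as a cohort-strength signal.

predicted_2019 = {"A*": 0, "A": 3, "B": 5, "C": 4, "D": 2}
predicted_2020 = {"A*": 1, "A": 3, "B": 5, "C": 4, "D": 1}

difference = {g: predicted_2020[g] - predicted_2019[g] for g in predicted_2019}
print("Change in predicted grades, 2019 to 2020:", difference)
# {'A*': 1, 'A': 0, 'B': 0, 'C': 0, 'D': -1}
# Feeding this final row into the standardisation model would signal that an A*
# is plausible in 2020 even though the centre's history contains none, easing
# the 'capping' problem for outstanding students at low-performing schools.
```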
Ofqual’s algorithm wasn’t ultimately rejected because of a few subtleties, and it didn’t just need a few tweaks to be acceptable. The whole idea was thoroughly flawed from the beginning.
On average, pupils expect to score higher in exams than they actually do. For this reason, replacing exam grades with fiat grades must either (1) leave huge numbers dissatisfied, or (2) introduce grade inflation. There aren’t any other options.
May I disagree, please, with your opening words “Ofqual were set a nearly impossible task”?
In my view, not only was the task easy, but also Ofqual succeeded, triumphally, in achieving it. They therefore deserve the corresponding accolade of acclaim. Well done, Ofqual!
The task Ofqual was set is defined in these words to be found in the Executive Summary to the recently published specification of the algorithm (https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/909035/6656-2_-_Executive_summary.pdf):
“On 18 March 2020, the Secretary of State announced that the summer 2020 exam series would be cancelled in order to help fight the spread of coronavirus (COVID-19) and that students due to sit the exams would be awarded a grade based on an assessment of the grade they would have been most likely to achieve had exams gone ahead.”
The task is therefore about the grades “most likely to achieve had exams gone ahead”, which must refer to exams as taken in the past.
The most important relevant information about past exams is to be found in Ofqual’s announcement of 11 August 2019 (https://www.gov.uk/government/news/response-to-sunday-times-story-about-a-level-grades), in which these words are especially relevant:
“…more than one grade could well be a legitimate reflection of a student’s performance…”.
Ofqual’s task was therefore to determine grades which were not ‘spot on’, but rather compliant with that somewhat vague yardstick of “more than one”.
Which they duly achieved, for further down in the Executive Summary (https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/909035/6656-2_-_Executive_summary.pdf), we read:
“Across all subjects and all centres, 96.4% of final calculated grades are the same as, or within one grade of the CAG submitted.”
Which is in fact rather better than that rather vague “more than one”. Ofqual’s model hit the target of 96.4% grades being ‘right’ or only one grade adrift! How good is that?
So I would argue that Ofqual achieved their task with flying colours.
The problem, of course, is that they either chose themselves, or were instructed, to achieve the wrong task.
The task should not have been to replicate the 1-grade-in-4-is-wrong (https://www.hepi.ac.uk/2019/01/15/1-school-exam-grade-in-4-is-wrong-does-this-matter/) “standard” as achieved in the past.
The task should have been to deliver fair grades to all students – the grades they each truly deserved.
Unfortunately, many teachers believed that the exercise was about the award of fair grades, and – in my view – were misled into that belief by the wording of many of Ofqual’s documents. Therein lies the tragedy.
And one other quick thing if I may… many thanks for the name check, but may I ask, please, for the source you used for the statement that you attribute to me about “inaccurate marking”? Thank you.
…oops… I think that should be “triumphantly”… sorry about that…
The most absurd thing about this year’s Exceptional Arrangements is not the downgrading of 39% of CAGs. That level of downgrading would be justified if Ofqual had downgraded the correct students. What is more absurd is using the model when you know you will get 25% to 50% of grades wrong (and even more than that, because rankings in practice are not as perfect as in your testing). But what is most absurd is that human beings were not going to be able to appeal against a lazy, simplistic, crude and unreliable model. Since when was humanity subjected to the absolute rule of artificial unintelligence? Surely an organisation with the name “Department for Education” must treat students better than that?
Hi John – yes, it’s been a very busy week, so I can understand you’ve not had time to respond to my question about the source for my alleged statement relating to “inaccurate marking”.
I searched the text of the blog you cited (https://www.hepi.ac.uk/2019/01/15/1-school-exam-grade-in-4-is-wrong-does-this-matter/) – thank you – for the words “accurate” and “inaccurate”. I couldn’t find them there, nor in the text of any other blog, although, as I’ll describe shortly, they do appear in some comments.
In fact, I am very careful not to refer to “accurate” or “inaccurate marking” for, as I note in several blog comments, these words require that there is a “right” mark, which “accurate marking” replicates, and “inaccurate marking” doesn’t.
Except for totally unambiguous multiple choice tests, there is no “right” mark, and so to me the concepts of “accurate” and “inaccurate” marking have no meaning.
That does not refer to what I might call “lazy” or “sloppy” marking, which is just bad quality; rather, it refers to the fact, long acknowledged by Ofqual, that “it is possible for two examiners to give different but appropriate marks to the same answer” (https://ofqual.blog.gov.uk/2016/06/03/gcse-as-and-a-level-marking-reviews-and-appeals-10-things-you-need-to-know/). I refer to this as ‘fuzzy marking’ – all the marks within the ‘fuzzy range’ are equally legitimate, equally right.
May I refer to my replies to comments by Albert Wright on https://www.hepi.ac.uk/2019/01/15/1-school-exam-grade-in-4-is-wrong-does-this-matter/; by Andrew B on https://www.hepi.ac.uk/2019/07/16/students-will-be-given-more-than-1-5-million-wrong-gcse-as-and-a-level-grades-this-summer-here-are-some-potential-solutions-which-do-you-prefer/; and also to pages 49 to 56 of https://docs.wixstatic.com/ugd/7c5491_cf799488bf8b4bc5b9887673794eac09.pdf.