An A* in Reputation Management? Looking back at last summer's results row – and ahead to this summer's coming one

This blog on last year’s results fiasco has been kindly written for HEPI by Dennis Sherwood, whose past articles for HEPI can all be found here.

***DON’T FORGET TO BOOK YOUR PLACE AT THIS YEAR’S HEPI ANNUAL CONFERENCE, TAKING PLACE THIS THURSDAY, WHICH IS KINDLY SPONSORED BY LLOYDS BANK AND UPP AND WHICH WILL INCLUDE THE LAUNCH OF THE HEPI / ADVANCE HE 2021 STUDENT ACADEMIC EXPERIENCE SURVEY***

History, they say, is written by the winners. But occasionally, a loser gets a look in too, usually along the lines of ‘But it wasn’t my fault!’.

Cue the report Is the algorithm working for us? Algorithms, qualifications and fairness, published on 14 June 2021, and written by Roger Taylor, who was Chair of Ofqual, the regulator of school exams in England, before, during, and for a shortish while after, the summer 2020 exam grade fiasco.

This report, without doubt, deserves a ‘teacher assessed grade’ of A* for this summer’s A level in Reputation Management, as awarded in accordance with the grade descriptors defined by the Joint Council for Qualifications for A level History (Ancient) – in particular the requirements to ‘demonstrate relevant and accurate knowledge and understanding’ and to ‘reach reasoned … evidence-based conclusions about historical events’.

So let’s examine some evidence relevant to the report’s Appendix, entitled ‘Explaining the Ofqual decision-making process’. This process, without doubt, is important, and Mr Taylor writes that:

Ofqual put forward two possible ways forward that were consistent with its primary objective: hold exams in a socially-distanced environment or, alternatively, use ‘non-qualification’ leaving certificates to issue grades, while making clear they were not equivalent to A-level grades.
…
The view of the government was that neither approach recommended by Ofqual would command public confidence.

My understanding of these statements is that the Government rejected the two proposals recommended by Ofqual, and that the fatal decision to use the ‘mutant algorithm’ was the Government’s alone. This paints Ofqual, and by implication its Chair, in a benevolent light.

Is this an ‘accurate understanding’ of ‘historical events’?

To answer that, we need to find an original source. Let me suggest looking, in the first instance, at the transcript of the Parliamentary Education Select Committee hearing of 2 September 2020, at which Roger Taylor was a key witness.

In response to Question 984, one of Mr Taylor’s colleagues stated that Ofqual had offered the Department for Education its advice on the options as to how summer 2020’s grades might be awarded. This led the Committee to request sight of what those options were, resulting in the disclosure on 9 September 2020 of an ‘official sensitive’ Ofqual document, dated 16 March (that’s two days before Boris Johnson’s announcement that the summer 2020 school exams would not take place), entitled ‘Contingency planning for Covid-19 – options and risks’.

Study of this primary source shows that Ofqual’s preferred option, Option A, was ‘to continue with business as usual, with the exam timetable operating as published but with additional papers prepared as a contingency in a small number of subjects’, so validating the first part of the statement just quoted.

The second part, however, is more troublesome, for the paper discusses a total of 11 options, of which three were short-listed, and the remaining eight ‘not presented in the main paper because they were less likely to meet our objectives’. Or, in simpler terms, dropped. And at the very end of the reject list is Option K, ‘Issue a standardised leaving certificate’, two of the ‘arguments against’ being:

Schools are also likely to expect a refund of exam fees.
This would call into question the future of GCSEs.

Most prescient. And yes, the option of a leaving certificate had been considered by Ofqual. But discarded. So it is ‘interesting’ that this thrown-in-the-bin possibility is highlighted in Mr Taylor’s report as being ‘recommended by Ofqual’, but rejected by Government.

There was another option that Mr Taylor’s report does not mention at all:

Option C: Issue grades based on teacher estimates which have been statistically moderated at a centre / cohort level to bring them into line with previous years’ results.

That seems to me to be quite a good, if brief, description of what actually happened, and this option was on Ofqual’s short-list of three. Yet Mr Taylor implies that the ‘mutant algorithm’ was an ex cathedra imposition from on high.

It is of course possible that Ofqual had been heavily influenced in the days prior to 16 March, and that Option C originated elsewhere. I don’t know. But even if that’s what did happen, Ofqual’s document of 16 March suggests that the use of an algorithm had Ofqual’s approval and endorsement, even if it was not their first choice of ‘exams as usual’ – which, after all, is not a surprising first choice for an organisation existing solely in the context of exams.

Before any algorithm can be used for any purpose, it is of course essential that there is proof that it delivers trustworthy outcomes. Mr Taylor discusses this in a section entitled ‘The problem with bias and accuracy’, in which he states that:

The problem of accuracy in this much larger number of results was known from the outset. Ofqual raised the problem publicly in its consultation documents in the spring and at its summer symposium in June. It explained why lowering grades through moderation would leave many candidates with lower grades than they would have got in an exam, while others would get higher grades. Unfortunately, there was no way of knowing who they were and so there was nothing that could be done about it.

To me, this reads as if Ofqual not only performed a public service in flagging an important problem, but did so well before the announcement of the A level results on 13 August.

Ofqual’s consultation was indeed published in the spring, on 15 April, but the issue of accuracy is discussed much more in relation to teacher marking and assessment than in relation to the algorithm. And yes, some limited information about the operation of the algorithm was presented at Ofqual’s summer symposium. But this took place not in June but on 21 July 2020, just three weeks before the A level results came out.

I mention that because one of the presentation slides was featured in a HEPI blog dated 26 July. This attracted 86 comments, including several from Huy Duong, whose subsequent analysis of that slide’s data led him to estimate that nearly 40% of A level grades would be down-graded. This prediction was reported in the Guardian on 7 August, and turned out to be correct. If Huy Duong was able to do that on his own initiative from the fragmentary information available to him on 21 July, then surely Ofqual could have done much more, much sooner.

Furthermore, it wasn’t until 13 August that Ofqual revealed that throughout the summer they had been testing the algorithm against historic results they knew to be only 75% reliable. No wonder there ‘was a problem with accuracy’. I am therefore singularly unconvinced by Mr Taylor’s statement that ‘there was nothing that could be done about it’.

I must also take (great) issue with Mr Taylor’s claim that there was:

… broad consensus in advance that it was the right thing to do … The consensus crossed party lines: Labour in Wales, SNP in Scotland, Conservatives in England and the Northern Irish administration all supported the approach. Teachers’ leaders, universities, schools and colleges also supported the approach. Even students, in advance of the results, could understand why it seemed the sensible thing to do. When a misjudgment happens on this scale it warrants reflection. How could quite so many people be so wide of the mark?

So many people were ‘so wide of the mark’ for a very simple reason. There was no ‘mark’.

That is because no one knew what the algorithm was going to do, or how it was going to do it. Myself included. In March, when I read that exams were cancelled and that teachers were being asked to submit grades based on their expert judgement, I assumed there would be a process to check a school’s submission against its history, for which an ‘algorithm’ would, of necessity, be used. This would identify outliers, prompting the exam board to engage in a dialogue with the school accordingly. But by May, I came to realise that my assumption was wrong, for there were hints that Ofqual was using not a simple sense-checker but a much more complex algorithm intended to predict each individual’s grades. It could be argued, validly, that I was part of the original ‘broad consensus’. But that was based on my making a totally false assumption, itself based on minimal information.

That might have been all my fault. But let me point out that I was not alone in the dark. On 11 July 2020, the Education Select Committee published a report presenting their findings to that time as regards ‘the fairness, transparency and accessibility of this year’s exam arrangements’. This is paragraph 28:

Ofqual must be completely transparent about its standardisation model and publish the model immediately to allow time for scrutiny. In addition, Ofqual must publish an explanatory memorandum on decisions and assumptions made during the model’s development. This should include clearly setting out how it has ensured fairness for schools without 3 years of historic data, and for settings with small, variable cohorts.

To me, that’s a clear, and direct, instruction. With which Ofqual refused to comply. Point blank. So we all had to wait for A level results day, 13 August, to see the details of the algorithm. If that information had been made public in March, when the so-called ‘broad consensus’ was allegedly built, my belief is that the more likely outcome would have been an uproarious ‘NO!!!’.

As regards appeals, words fail me, for I just don’t know what to say in response to this statement in Appendix 1:

People are understandably mystified as to why Ofqual allowed some results to be awarded knowing that they would need to be changed on appeal. The reason for this was very strong legal advice that to make changes in advance of the award would quite likely result in the whole approach being rejected by the courts following one of the many judicial reviews that a number of law firms planned to request.

Well, ‘mystified’ might be one word; nor am I enlightened by this legalistic explanation. But the appeals process has been highly problematic ever since 2016, when Ofqual changed the rules to make it harder to appeal, with consequences that could well cause great trouble this August.

One final point.

Towards the end of the paper, Mr Taylor states ‘the key error was misjudging what people would accept’.

Indeed. And in my view the lion’s share of that misjudgement is on Ofqual’s shoulders. If that ‘misjudgement’ happened in the spring and summer of 2020, what confidence can anyone have that a similarly tragic-only-with-hindsight ‘misjudgement’ has not taken place in the spring of 2021, and is unfolding now? And who is being held to account? Mr Taylor, in winning his A* in Reputation Management, is clearly pointing his finger at the Government. My finger is pointing somewhere else.

5 comments

Charlie says:
22nd June 2021 at 08:45
Thank you, enlightening as always.
Huy Duong says:
22nd June 2021 at 10:37
Hi Dennis,
Thank you for your article.
I think you were too generous to Ofqual with this statement “Furthermore, it wasn’t until 13 August that Ofqual revealed that throughout the summer they had been testing the algorithm against historic results they knew to be only 75% reliable.”
After reading Ofqual’s interim report,
https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/909368/6656-1_Awarding_GCSE__AS__A_level__advanced_extension_awards_and_extended_project_qualifications_in_summer_2020_-_interim_report.pdf ,
I wrote this piece for HEPI,
https://www.hepi.ac.uk/2020/08/18/how-bad-were-ofquals-grades-by-huy-duong/ ,
in which I wrote,
“In testing, compared with grades awarded by exams, their best model got A-level Biology grades wrong around 35% of the time and French grades wrong around 50% of the time, while for GCSEs, it awarded around 25% wrong Maths grades and around 45% wrong History grades.”
“Superimposed [on the chart], for comparison, is Ofqual’s estimation of the probability that a 2020 grade awarded by its chosen model is correct. Alarmingly, this probability ranges from 50% to 75%, implying that Ofqual’s 2020 grades had a 25% to 50% chance of being incorrect.”
“For example, 2020 A-level Biology grades awarded to a cohort of 49 students had almost a 35% chance of being incorrect compared to grades awarded by exams, but if the cohort size is just 24, this rose to 45%.”
Ofqual knew that from their own testing long before the results came out. They knew that their algorithm was getting 25% to 50% of grades wrong, but they still wanted to use it to downgrade 40% of grades. This information is vital for decision making, yet they assiduously kept it from the public, so how could the public decide in an informed way? One has to ask if they also kept it from the DfE for the latter’s decision making? If they had shared that information with the public, there would not have been the “broad consensus” that Roger Taylor is referring to, and disaster would have been avoided.
From April, I started to contact Ofqual with warnings such as what I sent to the Education Select Committee,
https://committees.parliament.uk/writtenevidence/8239/html/
“Data from Matthew Arnold School in Oxford (MAS) suggests that for A-levels the Exceptional Arrangements as published so far has virtually no chance of providing grades to the students in a way that satisfies the double criteria of being fair to the individuals and controlling grade inflation nationally. This problem affects every A-level subject at MAS, and is likely to affect most A-level subjects at hundreds of comparable schools across the country. The risk to the students is that fairness to the individuals might be sacrificed.”
but Ofqual would not listen.
It is as if the pilot of a ship that ran aground hid from the passengers, crew and ship owner the chart showing rocks at depths that are too shallow for the ship and showing the course that he has plotted, which goes straight into those rocks, and ignored warnings of rocks from passengers, and then that pilot innocently asks, “Why was there such broad consensus?”, implying, “It was everyone’s fault”.
Roger Taylor suggests that it was his and Ofqual’s naivety in thinking that the public would accept. If they honestly thought that the public would accept, they should have been transparent about the fact that their algorithm was getting 25% to 50% of grades wrong, but they still wanted to use it to downgrade 40% of grades, and let the public decide whether it would accept or not. But they hid that information from the public. That does not look like naivety and thinking that the public would accept.
I think there are deep problems at Ofqual that Roger Taylor is not admitting to.
Dennis Sherwood says:
22nd June 2021 at 12:24
Thank you Charlie – I’m glad you are enlightened rather than mystified!
And thank you Huy for your contribution too. The blog you cite is important, even more so your submission to the Select Committee. Indeed, I suspect that the submissions, which date back to April 2020, are a treasure trove of insights, concerns and commentaries.
Helen Thorne says:
22nd June 2021 at 14:17
Hi Dennis
An insightful analysis as always. One thing which has been overlooked in retrospect is the set of warning signals from Scotland about the risks of using school performance data to moderate calculated grades. Whilst the SQA took a different approach, similar concerns about fairness and accuracy were raised by the Scottish Education and Skills Committee in May 2020.
I’d also take issue with the statement that students thought that using calculated grades was the sensible thing to do. There’s plenty of published data from the Student Room, the Sutton Trust and the Scottish Youth Parliament from April and May 2020 demonstrating that students were deeply concerned about the fairness of calculated grades, particularly about the impact of using school performance data. Of course it’s easy to point the finger with hindsight and we can only hope that the strenuous efforts by teachers, schools and colleges to determine assessed grades this summer will deliver outcomes that sit better with pupils and parents, although the expected grade inflation is going to create a whole set of new issues. Looking forward to your future posts on that.
Dennis Sherwood says:
22nd June 2021 at 18:30
Hi Helen
Thank you too.
Yes, Scotland has often led, with England following behind, notably in this context in cancelling their algorithm. And I see that Scotland has just taken a decision to replace their regulator, the SQA… (https://www.tes.com/news/sqa-be-replaced-education-secretary-reveals).
Your point about student unease is truly important, and thank you for citing those references. Bland, unsubstantiated, assertions about ‘broad consensus’ must be challenged!
