This guest post has been kindly contributed by Huy Duong, who has been described by the Guardian as ‘the father who foresaw A-level algorithm flaws’. Huy thanks Professor George Constantinides, Professor Rob Cuthbert, Professor Mike Larkin and Mr Dennis Sherwood for important discussion and help in writing this article.
Ofqual’s calculated grades, which have been scrapped, were flawed for at least four reasons.
- First, the main input for the grade calculation consisted of the Centre-Assessment Grades and rankings, which inevitably contained some errors.
- The second input was historical performance at the school-subject cohort level, which has only a weak association with the ability of the 2020 cohort and can, furthermore, be volatile.
- The third input was the 2020 cohort’s prior attainment, but Professor George Constantinides found that corrections for prior attainment are made in a dubious way, resulting in anomalous grades.
- Finally, the appeal procedure meant to protect students against these three fundamental flaws was itself flawed.
Notwithstanding those flaws, the bottom-line question was, ‘How confident is Ofqual that its algorithm does not downgrade the wrong students?’, or alternatively, ‘What is the probability that an awarded grade is correct?’ The answers to these questions are vital for public debate and policy making. Yet Ofqual only disclosed them to the public on A-level results day, in Awarding GCSE, AS, A level, advanced extension awards and extended project qualifications in summer 2020: interim report. Did it disclose such information to the Department for Education for its decision making? Strangely, the Education Secretary pleaded ignorance. That seems wrong on many levels.
In testing against grades awarded by exams, Ofqual’s best model got A-level Biology grades wrong around 35% of the time and French grades wrong around 50% of the time, while for GCSEs it awarded around 25% wrong Maths grades and around 45% wrong History grades.
Ofqual’s interim report (Figure 7.25 on p.81) shows the probabilities that 2018 exam grades in different subjects were correct. For example, a 2018 Maths grade had a 96% chance of being correct, compared with 74% for Economics and 61% for English Language – see Dennis Sherwood’s latest HEPI blog for more information. Superimposed on that figure, for comparison, is Ofqual’s estimate of the probability that a 2020 grade awarded by its chosen model is correct. Alarmingly, this probability ranges from 50% to 75%, implying that Ofqual’s 2020 grades had a 25% to 50% chance of being incorrect.
The accuracy of Ofqual’s 2020 grades depended on the subject and the cohort size. For example, 2020 A-level Biology grades awarded to a cohort of 49 students had almost a 35% chance of being incorrect compared to grades awarded by exams, but if the cohort size was just 24, this rose to 45%.
Therefore even Ofqual’s best model significantly worsened grade accuracy for most A-level subjects when the cohort size was below 50, which is common (almost 62% of cohorts in 2019). For GCSEs, even with larger cohorts, the best model would have worsened grade accuracy for Maths and the Sciences. Even a very conservative figure of 25% wrong grades would have amounted to around 180,000 wrong A-level grades and 1.25 million wrong GCSE grades.
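To make the scale concrete, here is a minimal back-of-the-envelope sketch in Python. The entry totals are rough assumptions implied by the figures above (roughly 720,000 A-level and 5 million GCSE grade entries), not official statistics:

```python
# Back-of-the-envelope scale check. The entry totals below are
# assumptions implied by the article's figures (180,000 wrong
# A-level grades and 1.25 million wrong GCSE grades at a 25%
# error rate), not official statistics.

A_LEVEL_ENTRIES = 720_000   # assumed total A-level grade entries
GCSE_ENTRIES = 5_000_000    # assumed total GCSE grade entries
ERROR_RATE = 0.25           # the "very conservative" share of wrong grades

print(f"Wrong A-level grades: {A_LEVEL_ENTRIES * ERROR_RATE:,.0f}")  # 180,000
print(f"Wrong GCSE grades:    {GCSE_ENTRIES * ERROR_RATE:,.0f}")     # 1,250,000
```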
With so many wrong grades awarded in 2020, Ofqual was never going to maintain the currency and integrity of grades anyway. Reducing grade inflation (for example, from 12% to 2% for A-levels) would have meant very little against the backdrop of such grade inaccuracy. Ofqual’s claim to be making grades consistent between schools and between years is also questionable when so many grades would have been wrong. Very little would have been achieved, at the cost of injustice, disruption and distress for hundreds of thousands of students, and of teachers’ loss of faith in Ofqual and the Department for Education.
In the end, the ill-advised ‘standardisation’ left two legacies.
- The first, ironically, is an increase in grade inflation due to the upgrading that took place during its operation.
- The second is a set of Centre-Assessment Grades that are wrong – for example, because some teachers consciously or subconsciously tried to make the 2020 grade distribution similar to historical data – and which the students may not be able to appeal against.
There have always been two extremes for this year’s grading: either aggressively keeping grade inflation down to a few per cent, at high risk of injustice, or using CAGs, which entails higher grade inflation but a lower risk of injustice. There should have been a rational debate to find a compromise point and to devise a safety net for those failed by that compromise. Instead, the Department for Education’s and Ofqual’s dogmatic insistence on the first extreme, their bluster and their lack of transparency made that debate impossible and led the country into the crisis.
After some inept handling of that crisis, they collapsed and lurched to the second extreme, while still leaving students without a well-designed appeal process as a safety net against that extreme’s limitations.
Great article Huy, thanks.
The HEPI blog has been a lovely place to come for clear, informed and thoughtful discussions during this sad exams fiasco.
Another excellent article. HEPI seem to have such good writers on the subject: everything is so clear and concise and helps you understand all the nuances of the issue. I have found your comments on other blog posts add considerably to my understanding in a really straightforward way. Thank you! As the parent of one of the forgotten students without appeal due to the historical data/strong cohort issue, I thank you for raising that too.
Thanks again to HEPI for help with better understanding of key educational news items and policies.
On second thoughts … couldn’t teachers and schools advocate with courage and honesty by appealing their own CAGs, on the basis that if they had not been asked to mirror previous cohort data they would have submitted different results? Ones that recognised only each child’s ability. Instead, in many of the schools that did pre-moderate down to fit the three-year average, the bureaucratic machinery has ground into gear and into defence mode. I feel so sad for this group of disillusioned and let-down students, many of whom are wondering whether it is worth working so hard when you cannot trust the system to recognise and reward achievement.
Thank you, HEPI, for all your insight. It has been very helpful for me, also in the same ‘forgotten’ position others have described. My child’s school tried to ‘second guess’ the Ofqual model (to reduce the risk to the school of an overall ‘blanket downgrade’ in grades), applying its own moderation to the pupils’ grades before submission to Ofqual. This moderation was largely done by averaging the last three years’ results by subject and overlaying this year’s rankings. This resulted in the downgrading of able students before submission to Ofqual. These students are therefore stuck with lower grades which they do not deserve (i.e. determined erroneously by their school, not by Ofqual), which now look even lower compared to their cohort across the country and are erroneously assumed by the world at large (and universities/future employers) to be a fair reflection of what their teachers believed they would have achieved. There MUST be a way for these students to seek redress.
Thank you, everyone, for your kind comments. I have learnt a lot from the blog articles on HEPI, especially Dennis’ ones and from your comments.
Alison, I came across this observation, which has been sent to some schools, “And there does not appear to be any intention to allow students to appeal on the grounds of their school or college having internally moderated teachers’ assessments more robustly than other schools or colleges, or for centres to resubmit their CAGs.”
I don’t know how accurate that is, but it seems likely. It seems that for an appeal to succeed, the school would need to accept that a CAG is wrong and then provide evidence to the exam board to prove that something went wrong. So the situation looks very difficult under the current rules.
The alternative is to have a change in the appeal rules. It’s unlikely that Ofqual will make such a change without massive pressure. Given what it took to pressure Ofqual to change what is so obviously wrong and unjust, I am not optimistic.
Thank you for this article. Is there any data on how this model has impacted those from a widening participation background? I’d like to know whether the inaccuracies have been equally distributed or whether those from a WP background have been further disproportionately disadvantaged. (I say further as research indicates that those from private schools often have inflated predicted grades compared to those from schools in disadvantaged areas.) Many thanks.
Hi Rachel,
I’ve not seen such data, but clearly the more variable a school’s level of attainment in public exams is, the more unreliable the model is going to be for that school.
Take two real schools. Westminster College: between 2017 and 2019, its overall A* rate varied by a factor of 1.1, so you would expect the model to give it a relatively reliable number of A*s. Matthew Arnold School in Oxford: over the same period, its A* rate varied by a factor of 3, so the model is not going to be as reliable for it, and students at MAS are more likely to be given wrong grades. “Wrong” here could mean up or down.
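For illustration, the “factor” here is just the ratio of the school’s highest to lowest yearly A* rate. A minimal sketch, using made-up placeholder rates rather than either school’s actual published figures:

```python
# Hypothetical illustration: the volatility "factor" is the ratio of
# the highest to the lowest yearly A* rate over 2017-2019.
# The rates below are placeholders, not real published figures.

def variability_factor(yearly_rates):
    """Ratio of best to worst year; 1.0 means perfectly stable."""
    return max(yearly_rates) / min(yearly_rates)

stable_school = [0.50, 0.52, 0.55]    # A* rates varying by ~1.1x
volatile_school = [0.05, 0.10, 0.15]  # A* rates varying by 3x

print(round(variability_factor(stable_school), 2))    # 1.1
print(round(variability_factor(volatile_school), 2))  # 3.0
```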
https://www.theguardian.com/education/2020/aug/20/ofqual-chief-to-face-mps-over-exams-fiasco-and-botched-algorithm-grading-system
Huy – is there any information on who, precisely, designed the overall method, who wrote the specification for the algorithm, who did the coding, and who did the testing? Who was the overall project manager?
Was it all done by Ofqual’s own staff? Or perhaps by one or more of the exam boards? And if the boards, were they working independently or collectively? Do Ofqual and the boards have the appropriately skilled resources, and in sufficient quantity?
Were any contractors used? Was any of this ‘outsourced’? If so, to whom?
And if ‘outsiders’ were involved in any capacity, how were they selected? Was there an open tender? What were they paid?
I was wondering if you, or any others, might have some answers to these questions – questions which, in the light of what has now happened, seem to me to be quite important…
And perhaps those questions are more “interesting” than I had previously thought.
For a long time, Ofqual’s communications have been carried out by Richard Garrett, Director, Policy and Strategic Relationships (https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/731163/OrgChart2018_final.pdf) – he featured, for example, on some of the early briefing sessions run by The Student Room.
So I was surprised to read this article in the Guardian, https://www.theguardian.com/education/2020/aug/20/firm-linked-to-gove-and-cummings-hired-to-work-with-ofqual-on-a-levels, about the non-competitive appointment of Public First, apparently run by what the Guardian refers to as “associates of Michael Gove and Dominic Cummings”, and who have been “working behind the scenes…to assist Ofqual with communicating its A-level and GCSE results plan to help secure public confidence in the strategy”.
“Public confidence”.
They did a pretty good job at that, then…
Back to the algorithm…
Hi Dennis,
I think it was either Priti Patel or Williamson who first specified the concept that the 2020 grade distributions should be similar to previous years’. That concept is not inherently wrong, but the key is in the interpretation of “similar”.
Interpreting it as “grade inflation within 2%” is clearly stupid and is the root of all the injustice and chaos. So, to me, any enquiry should find out where, and from whom, that interpretation came.
I think who designed the DCP approach is not important (maybe I’m naive). I think it is just a simplistic, lazy, crude approach that many people could come up with. For example, I think the Scottish method is much better. What’s more important is why it was chosen. Ofqual’s documents claim that it was chosen because it gave the best results. However, there is a BBC article which said that it just happened to be the approach that was best at “cheating” in Ofqual’s tests.
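To see why a rank-based approach like this is so crude, here is a minimal sketch of the general idea (an illustration only, not Ofqual’s actual DCP implementation): students keep their teacher-assigned rank order, and grades are dealt out top-down so that the cohort’s distribution matches a historical one, regardless of any individual student’s ability.

```python
# Minimal sketch of rank-based standardisation (illustrative only,
# not Ofqual's actual DCP code). Grades depend solely on a student's
# rank and the historical grade distribution, not on the CAG itself.

def standardise(ranked_students, historical_counts):
    """ranked_students: names ordered best first.
    historical_counts: (grade, count) pairs, best grade first."""
    grades = {}
    position = 0
    for grade, count in historical_counts:
        for _ in range(count):
            if position < len(ranked_students):
                grades[ranked_students[position]] = grade
                position += 1
    return grades

cohort = ["Ann", "Bo", "Cy", "Dee", "Ed"]        # teacher's rank order
history = [("A", 1), ("B", 2), ("C", 2)]         # historical pattern
print(standardise(cohort, history))
# {'Ann': 'A', 'Bo': 'B', 'Cy': 'B', 'Dee': 'C', 'Ed': 'C'}
```

Under this sketch, even if every student in the 2020 cohort deserved an A, only one A would be awarded.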
I don’t think who did the coding is important (maybe I’m naive here).
I think the other important question is: who knew that the model performed so miserably in testing?
Who understood that, with that kind of miserable trustworthiness, haughty claims and justifications such as “maintaining the currency and integrity of grades” and “making grades consistent and fair between centres and between years” were just a load of rubbish?
For the last question, surely Sally Collier and Roger Taylor must have known about the miserable level of trustworthiness. So the question is: why were they still feeding the public rubbish such as “maintaining the currency and integrity of grades” and “making grades consistent and fair between centres and between years”? And why did Williamson and Gibb do the same right up to the end?
Hi Alison, Clare,
In case you don’t know, in the comments on Dennis’ blog here, https://www.hepi.ac.uk/2020/08/18/cags-rule-ok/, there is an interesting discussion on CAGs being pre-standardised.
Hi Dennis
Your HEPI blog has (almost) kept me sane!
Do you think there will be any further changes? Either to the appeals process or grades submitted?
David Blow’s article in Schools Week made a good case for schools resubmitting unmoderated teachers’ grades, and no one has yet explained why UCAS predictions are of any less value than the CAGs we currently have. Some universities have chosen their own assessments (based on GCSEs, personal statement and UCAS predictions) as stronger indicators of an applicant’s ability than the CAGs.
My daughter is an ‘outlier’, as you call it: UCAS predictions AAB, told by her teachers to ‘go for AAA’. Her CAGs are BBC – not what her teachers wanted to give her, apparently, but they have confirmed there was no input error and that the only appeal route would be malpractice. So zero support from school at this horrendous time. Bristol are holding her place open until 7 September if she can submit CAGs that match her offer by then.
Is the resignation of Ofqual Chief Sally Collier a door being left ajar for another change? And is there any evidence schools are fighting this? Our school has told us nothing and I really want to know that they are fighting.
Melissa and Melissa’s stories above are beyond sad. It’s devastating to watch our young people go through this on top of everything else the pandemic has done to them. Thanks. Jac
Hi Jac… thank you. I’ve just seen this article by Geoff Barton of ASCL, which is asking exactly that question – and he has influence: https://schoolsweek.co.uk/ascl-writes-to-ofqual-over-deflated-grades-fears/.
I see that as a good sign.
But I find those comments at the end absolutely despicable:
An Ofqual spokesperson said schools and colleges “were asked to provide holistic, evidence-based judgments of the grade they believe a student would have achieved if teaching, learning and assessments had gone ahead as planned”.
“We provided guidance on the process and heads of centre were asked to sign a declaration to confirm that the grades submitted honestly and fairly represent what the students would have been most likely to achieve.”
How dare Ofqual blame teachers for the disaster for which they alone are responsible?
Thanks Dennis. Great article. Let’s keep everything crossed that fairness will be a consideration.
Unfortunately,
https://schoolsweek.co.uk/schools-that-deflated-own-grades-cant-appeal-says-ofqual/
So there is no redress for the students who have been disadvantaged by the principles behind the “mutant algorithm” (https://www.theguardian.com/politics/2020/aug/26/boris-johnson-blames-mutant-algorithm-for-exams-fiasco).
I wonder if centres could argue they had submitted the “wrong” data if the veracity of that data had been influenced by a third party? There is a sentence in one of their blog posts that alludes to this when it says, “Your judgements are important, and should always be your own, including where external organisations may be offering to help you to produce centre assessment grades for your students.” If the pre-standardisation was carried out with help from a third party (such as FFT), could centre heads appeal on the basis that, on reflection, the final judgement was not theirs alone?

Another possibility for those that used “external” data sources such as FFT Aspire to effect a pre-standardisation: the guidelines state, “a centre assessment grade for each student – the judgement submitted to the exam board by the Head of Centre about the grade that each student is most likely to have achieved if they had sat their exams. This professional judgement is derived from evidence held *within* the centre and which has been reviewed by subject teachers and relevant heads of department/subject leads”. It could be argued here that, by relying on external data to moderate the original CAGs, the centre had unwittingly submitted results based on data not fully held within the centre.

I guess the proof is in the appeals themselves and what gets upheld. It will only take one successful appeal and the floodgates will open. Hence OFQUAL’s aggressive and unsympathetic stance in relation to a problem of its own making.
Also, submitting confidential CAG data to a third party such as FFT Aspire could be in breach of GDPR. I noticed that the OCR advisory on CAGs says, “Do not share centre assessment grades or rank orders with students, parents, carers, or anyone outside the centre.” If over half of state schools did just that, what are the implications?
Oh dear, that is devastating. So… no appeal, and these grades have been submitted under two different processes? A failure of guidance by the DfE. And the reasoning that schools which submitted deflated grades can’t appeal, even though not everyone did it, is – wait for it – “it was not an error”. They followed the guidance and are penalised for it. What sort of logic is that? Have I woken up in Communist China?
Jac. It creates a moral hazard where those that took the greatest risk in inflating their grades against OFQUAL’s own explicit guidance are rewarded at the expense of those that conscientiously worked within that guidance. As it stands, OFQUAL have failed their number one goal: “Goal 1: Regulate for the validity and safe delivery of general qualifications”. They need to sort this out or the select committee will.
Mark – I hope you are right. It’s a beautiful ideal that OFQUAL will remember their stated goals. But do you think they will sort it out for this year’s cohort? I don’t think the MPs have a firm enough grasp of what happened. My MP said that my school should not have moderated grades. Inconsistent, to say the least.
When evidence of the role the FFT Aspire system played in systematically downgrading the GCSE CAGs of over 2,000 schools is made clear to MPs on the select committee, hopefully they will act. FFT Aspire is the backbone of much of the DfE’s and OFQUAL’s oversight of education, so for it to have been used unwittingly to “punish” so many schools, by pre-standardising submitted CAGs before OFQUAL withdrew the principle of standardisation of grades, casts a shadow over its use for prediction and analysis going forward. We will see.
Hi Mark – mmm… thank you… for people like me who are rather less aware, please tell us more about “FFT Aspire is the backbone”…
Schools were required to submit their CAGs between 1 and 12 June; immediately afterwards, on 15 June, FFT published their report on draft GCSE submissions (https://ffteducationdatalab.org.uk/2020/06/gcse-results-2020-a-look-at-the-grades-proposed-by-schools/).
When I read that, I felt there was something missing – information that they had but hadn’t declared. So, as you’ll see, on 15 June, I asked:
“1. Will you be publishing stats for A level?
2. For GCSE, and supposing that the boards intervene to place the grade boundaries so that the 2020 distribution matches 2019, is it possible to estimate the % of centre assessed grades that would still be confirmed, the % down-graded, and (if any) the % upgraded?
3. You’re very straight about what you did and how you did it, and of the limitations, and you make quite clear that the only comparison is 2019. Do you have any feel for what, if anything, might be different if the boards were also to take 2018 into account, for those subjects graded 9, 8, 7…?”
To which, on the same day, they replied:
“A few quick responses 1- no, we’ve not done a similar exercise for A-level. 2- This is tricky. If you were to start at grade 9 and then lower approx 37% of the 2020 grades we collected you would get something close to the 2019 distribution across all subjects. But we’d expect the Ofqual statistical moderation process to be more complex than this and so might end up raising some grades in some subjects in some schools and lowering others. We’ve not got a feel for how this might work in any real detail so can’t really comment any further. The same would apply to taking account of 2018 results as well.”
As early as 15 June, they were indicating that the level of down-grading might be somewhere in the range 35% to 40%…
…and it was this answer that triggered my writing “Have teachers been set up to fail?”, posted on 18 June (https://www.hepi.ac.uk/2020/06/18/have-teachers-been-set-up-to-fail/), in which, to play safe, I refer to “more than 25%”.
Does any of that provide any more clues as to “who knew what, when”?
Hi Dennis,
That is interesting. I originally came across FFT through the blog post you mentioned, but Aspire is the widely used front-end for reporting and data that, when data exchange is enabled, allows a relatively seamless integration of DfE data with a school’s MIS. So by “backbone” I meant for schools.

The FFT GCSE Statistical Subject Moderation was carried out throughout May, and this is where they got all the stats about grade inflation in predicted grades (as 1,900 schools uploaded their “raw” CAG data in CSV files). FFT crunched it against 2019 and all the prior-attainment KS2 data they have access to from their NPD concession, so it was basically a stripped-down version of the OFQUAL algorithm, and the data crunch was to all intents and purposes “smoke testing” what was going to happen in June. It seems like an extraordinary amount of trouble to go to in order to produce a few blog posts about how teachers can’t predict grades along historical lines, so I do wonder. Also, the context at the time was all about how teachers were getting it “wrong”, and little was said about the reports the algorithm generated and returned to schools.

This was described as “validation and benchmarking” in the FFT blurb, but undoubtedly many schools used it as the ASCL stage 7 moderation/standardisation. I found a few letters online from Heads of Centre to parents explaining they had done this, referring to it as “external moderation”. Ultimately the effect was to “hard-bake” the OFQUAL standardisation into the CAGs before they left the school, with the result that on 20 August, students at FFT-moderated centres which had applied the report recommendations found their CAGs and OFQUAL-moderated grades identical. Because, to all intents and purposes, they were.

People who mention OFQUAL’s headline increase in final GCSE grades this year need to remember that this includes all those CAGs from independents, free schools and the 50% of state schools that didn’t do the FFT moderation exercise. The FFT headlines from May/June will have been for schools which subsequently received a report highlighting where their predictions were out of step with historical and national data, and by how much. How many of those schools altered their CAGs on that advice, affecting their cohorts when those metrics were discarded on 17 August, remains to be seen.
Hi again Mark – I’m amazed: I had no idea that FFT were so integrated into ‘the system’ – I had assumed they were simply offering a service to calculate three-year averages to any schools that wanted them, or to check a school’s drafts against those averages. You tell a deeper, and murkier, story…
…especially about “students at FFT-moderated centres who had applied the report recommendations, found their CAGs and OFQUAL Moderated grades identical”.
Have you talked to any journalists about that?
I have written to FFT on Twitter but have yet to receive a reply. My interest in their approach was that they used the NPD, which is what OFQUAL would have used for their algorithm, so they would have access to the same centre-level data (prior attainment at KS2, national historic data, previous local cohort data). Therefore, with the raw CAGs for the centre, they could simulate a possible standardisation. Certainly the report pages they sent back could have been used to implement ASCL stage 7 moderation, and using language like “validation and benchmarking” would appeal to Heads of Centre struggling to “do the right thing” by their students and OFQUAL at a time when there were no clear guidelines. I am not for a minute suggesting that FFT did anything untoward. At the time of the moderation exercise, OFQUAL were still fully intending to do a similar national moderation using similar metrics, so at that point what they were offering was of real value to schools. When OFQUAL pulled post-standardisation, however, doing the “right thing” suddenly became a major disadvantage, creating a national cohort split between moderated and unmoderated submitted CAGs. There is a useful summary of the FFT moderation service, as presented to LAs, at pages 5-10 of this PDF from Derby City Council: https://schoolsportal.derby.gov.uk/media/schoolsinformationportal/contentassets/documents/wsc/10may/FFT%20LA%20Partner%20Meeting.pdf
Thank you – if you have a moment, please drop me a note: [email protected].