This blog was kindly contributed by Dennis Sherwood. Dennis has been reporting for HEPI on this year's A-level exams since March – and on the A-level system for long before that!
It’s not all over yet
This summer’s exam fiasco is rapidly fading from the headlines, and the deadline, 17 September 2020, for appeals – which remain limited to very narrow technical grounds – is fast approaching. This year’s process will, I am sure, never be repeated, and I could well imagine that those who have been in very hot seats indeed are holding their breath, hoping it will all go away, and thanking their lucky stars that they have not – at least as yet – suffered the fate of former Chief Regulator, Sally Collier, and former Permanent Secretary, Jonathan Slater.
But it hasn’t all gone away quite yet, not least for those young people who are still trying to come to terms with unfair consequences. And at the political level, this year’s problems continue to be debated in Parliament, and on 16 September the Education Select Committee will be holding an ‘accountability hearing’ at which the Secretary of State for Education, Gavin Williamson, will be accompanied by two officials, Susan Acland-Hood (Jonathan Slater’s successor) and Michelle Dyson (Director for Qualifications, Curriculum and Extra-Curricular), so adding to the evidence provided on 2 September by Ofqual’s Roger Taylor, Dr Michelle Meadows, Julie Swan and Dame Glenys Stacey.
Four significant documents have been published since 2 September:
- the transcript of the Select Committee meeting;
- a letter from Ofqual to Robert Halfon MP dated 8 September;
- a submission from Cambridge Assessment (the owner of the OCR exam board) to the Select Committee, published on 9 September;
- an Ofqual document dated 16 March 2020, entitled Summer 2020 GCSE and A/AS level exam series – Contingency planning for Covid-19 – options and risks.
These provide the context for this blog.
The Cambridge Assessment bombshell
The submission from Cambridge Assessment contains many ‘behind the scenes’ revelations, but to me the most important is this chart:
The left-hand column shows the ‘Centre Assessment Grades’ (or ‘CAGs’) submitted by this particular college, and the second column the corresponding three-year historical average. As can be seen, the two match exactly – the college had faithfully followed the ‘rules’.
The two right-hand columns show the results of two Ofqual-imposed ‘adjustments’: one for ‘prior attainment’, the second, a ‘national correction’. The consequence was to down-grade the college’s results from about 72.5% A*, A and B to about 68.5%.
At this point, I can only quote the submission verbatim:
20. Our analysis, which we discussed with Ofqual as we did it, suggested that all centres, of whatever size, were affected by the application of the national standards correction. In large centres, the effect was particularly obvious given the large numbers of students close to the grade thresholds before the correction was applied. The consistency of the effect suggested that the model was behaving as predicted. But the size of the effect on results suggested that this input to the model was the cause of the unexpectedly depressed results of some centres.
21. The impact of the national standards correction was not immediately obvious to many schools and colleges because exam boards had not been required to provide them with details of it when distributing results, only the historic averages and prior attainment data employed – in effect, Chart 1 without the far right hand information. This made it very challenging for some schools and colleges to understand their results and damaged public confidence.
These down-grades are not the result of ‘game playing’, ‘over-optimism’, or even rounding errors. Rather, they are a direct consequence of the algorithm, the results of which not even the most conscientious and rule-abiding schools could ever have anticipated.
The original options
Algorithms, of course, don’t ‘go mad’ all by themselves, so the origins of the algorithm, and the thought-patterns of its authors, are of great importance. Some further light has been thrown on these by Ofqual’s ‘options and risks’ document, dated 16 March (that’s two days before Boris Johnson’s announcement that the 2020 exams would be cancelled), the existence of which was revealed in Julie Swan’s response to Question 983 at the meeting of 2 September.
Ofqual’s options paper tables 11 possibilities for this year’s process, of which three were short-listed:
- ‘Additional papers’, in which the exams take place as usual, at the originally-scheduled times, with ‘additional papers’ available for students who were unable to sit the ‘normal’ exams.
- ‘Delay’, whereby the exams would be scheduled for a suitable later date.
- ‘Estimate grades’, this being the award of ‘grades based on teacher estimates which have been statistically moderated at centre / cohort level to bring them, as far as possible, into line with previous years’ results’.
Ofqual’s document explicitly states that their preferred option is the first, so at some time between 16 March and 18 March, someone somewhere over-ruled that preference in favour of the third. That mystery was resolved by Roger Taylor, in his answer to Question 947:
It was the Secretary of State who then subsequently took the decision and announced, without further consultation with Ofqual, that exams were to be cancelled and a system of calculated grades was to be implemented.
The description of the third, selected, option – so few words – clearly states what we all now know: ‘statistically moderated at centre / cohort level to bring [teachers’ estimated grades], as far as possible, into line with previous years’ results’. What a profound pity it didn’t also say ‘and allowing teachers to provide robust evidence of any “outliers” that do not conform to the historical pattern’. That would have made a huge difference.
Notice too that there is a short phrase that does not appear in this description, nor in the statement made on 18 March by Gavin Williamson, but is in his announcement of 20 March, and also in Roger Taylor’s reply to Question 947: ‘calculated grades’. That word ‘calculated’ is, I think, significant, for ‘a system of calculated grades’ is surely substantially different from ‘teachers’ estimated grades moderated according to previous years’ results’.
So perhaps there is an opportunity for some further clarification by the Secretary of State at the meeting on 16 September. What actually happened between 16 March and 18 March? And what were the origins of the idea to devise a ‘system of calculated grades’?
And let me also mention the lost opportunity of one of the options that didn’t make it to the short list, but was alluded to by Roger Taylor, also in his answer to Question 947:
Issue a standardised leaving certificate detailing teachers’ estimates of grades with brief commentary – this would enable students to progress to further education or HE.
As I read that, I was immediately reminded of the ‘Passport in English’ advocated by Roy Blatchford as a solution to the socially catastrophic problem of ‘The Forgotten Third’. To me, this is a very good idea. ‘Good’, though, is judgemental and evaluative, from which my values, beliefs, biases and prejudices are easy to discern; as are those of whoever rejected this option and who presumably placed rather less weight than I would on this bullet point to be found in Ofqual’s ‘arguments for’:
- Will be seen by some as showing trust in the teaching profession during these difficult times.
and rather more on these ‘arguments against’:
- Schools are also likely to expect a refund of exam fees.
- This would call into question the future of GCSEs.
Indeed so. Some might regard these last two not as ‘arguments against’, but rather as ‘arguments for’ – and I wonder what might have happened had this option been on ‘offer’ in Ofqual’s April consultation…
The Select Committee meeting of 2 September
I watched the meeting of 2 September on television, and good viewing it was too. Robert Halfon was a lively, decisive and probing Chair, the members of the Committee asked relevant and often perceptive questions, and the Ofqual responses were appropriately brief and for the most part informative.
So, for example, Questions 946 to 955 unravelled the events of Saturday 15 August, when Ofqual posted an announcement on its website about the use of mock exam results in appeals, only to withdraw it a few hours later, so leading directly to the decision to abandon the algorithm and award all grades as determined by teachers’ Centre Assessment Grades.
On a lighter note, Question 990 asked: ‘At what point did the algorithm mutate?’, to which Dr Meadows replied: ‘I don’t believe that the algorithm ever mutated.’
One recurrent theme concerned transparency and the disclosure of all communications and minutes between Ofqual and the Department for Education. This is proving somewhat troublesome, with questions over whether disclosure falls to Ofqual or to the Department, and a motion to achieve full disclosure, tabled by Kate Green, Labour’s Shadow Secretary of State for Education, on 9 September, was rejected in a division which followed party lines exactly: 325 Conservative MPs and one independent voting ‘no’; 236 MPs, none of whom were Conservative, voting ‘aye’.
But one particular disclosure issue appears still to be at a loose end: in reply to Question 985, Roger Taylor committed to release all submitted school data to a ‘trusted third party … for a deep forensic analysis’.
This is not mentioned in the letter from Dame Glenys Stacey to Robert Halfon dated 8 September, nor was it within the parliamentary motion rejected on 9 September. So, presumably, that still needs to happen – which is a ‘good thing’, for it will answer some important questions: Why were so many Centre Assessment Grades over-ruled by the algorithm? Which schools were ‘over-optimistic’? Which ‘over-bids’ are more likely to be attributable to year-on-year variability or the need to round fractions to whole numbers? How many down-grades were attributable to Ofqual’s ‘adjustments’ for ‘prior attainment’ and ‘national correction’?
Another recurrent theme concerned the accuracy of the algorithm. In response to Question 974, Dr Meadows stated that:
Every year, we publish marking consistency metrics that report the extent to which grades would change if a different senior examiner had looked at the work. In fact, we looked at that work this year and took some comfort from it, in the sense that the levels of accuracy that we were seeing from the standardisation model were very similar to those levels of accuracy that we see each year through the marking process.
That, ‘comfortingly’, pegs the accuracy of the algorithm to the accuracy of marking. The mention of ‘marking consistency metrics’ is, presumably, a reference to Ofqual’s November 2018 publication Marking Consistency Metrics – An update, Figure 12 of which presents the measures of the reliability of the grades awarded for each of 14 subjects (my preference being for ‘reliability’ rather than ‘accuracy’). To my knowledge, this is the only document that reports, for qualifications as awarded, ‘the extent to which grades would change if a different senior examiner had looked at the work’, so those initial words ‘every year’ are to me something of a puzzle: perhaps Dr Meadows will be kind enough to direct me to the publications I have missed.
The ‘accuracy of marking’ was subsequently quantified by Dr Meadows in her reply to Question 996:
There is a benchmark that is used in assessment evidence that any assessment should be accurate for 90% of students plus or minus one grade. That is a standard benchmark. On average, the subjects were doing much better than that. For A-level we were looking at 98%; for GCSE we were looking at 96%, so we did take some solace from that.
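To see how a high ‘plus or minus one grade’ figure can sit alongside much lower exact-grade reliability, here is a minimal toy simulation – my own construction for illustration, not Ofqual’s model. It assumes marks from 0 to 100, grade boundaries every 10 marks, and a second examiner whose mark differs from the first by up to 5 marks either way; all of those numbers are assumptions.

```python
import random

random.seed(42)

# Hedged sketch, not Ofqual's actual model: marks run 0-100, grade
# boundaries fall every 10 marks, and a second examiner's mark differs
# from the first by up to 5 marks either way (all assumed values).
def grade(mark):
    """Map a mark to a grade band (0 = lowest, 9 = highest)."""
    return min(mark // 10, 9)

trials = 10_000
exact_count = 0       # both examiners award the same grade
within_one_count = 0  # grades differ by at most one band

for _ in range(trials):
    first = random.randint(0, 100)
    # hypothetical re-mark within +/- 5 marks of the first examiner
    second = min(100, max(0, first + random.randint(-5, 5)))
    g1, g2 = grade(first), grade(second)
    if g1 == g2:
        exact_count += 1
    if abs(g1 - g2) <= 1:
        within_one_count += 1

print(f"exact-grade agreement: {exact_count / trials:.0%}")
print(f"within one grade:      {within_one_count / trials:.0%}")
```

Because the assumed re-marking noise (5 marks) is smaller than the band width (10 marks), agreement ‘within one grade either way’ is 100% by construction – yet exact-grade agreement comes out at only around three-quarters. A reassuring headline figure and a much less reassuring underlying one can describe the very same process.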
This was confirmed in the answer to the very last question (1058) given by Dame Glenys Stacey, who, referring back to the response from Dr Meadows, acknowledged that grades ‘are reliable to one grade either way’.
That exam grades ‘are reliable to one grade either way’ has two consequences. The first concerns the reliability of the algorithm. As Dr Meadows stated, the ‘accuracy’ of the algorithm was tested against a benchmark of grades that are ‘reliable to one grade either way’ – hardly a criterion that would build trust and confidence in the outcomes. That alone would, to me, have been a very good reason to scrap the algorithm.
Back to ‘exams as usual’
The algorithm, however, is dead and gone, never to return.
But exams are back, and many students unhappy with this year’s non-exam grades will soon sit this autumn’s ‘real’ exams.
I think these exams are a bad idea.
Not because those students don’t deserve redress. They do.
Not because of the inequities in opportunity for teaching and learning over the last several months, which are profound.
But because – as so explicitly stated by Dr Michelle Meadows and confirmed by Dame Glenys Stacey – the resulting grades will be fundamentally unreliable. Not that they use those words. Rather, their words are the much more reassuring: ‘reliable to one grade either way’ or ‘accurate plus or minus one grade’; even more so when associated with ‘take comfort’ and ‘take solace’.
But what does ‘reliable to one grade either way’ mean?
Suppose that a certificate, resulting from a ‘real’ exam, shows grade B. Might that grade, legitimately, be an A?
Likewise the student ‘awarded’ grade 3 for GCSE English or Maths, who has ‘failed’, who is forced to re-sit, on whom many doors have slammed shut, and whose self-confidence has been shattered. Suppose, just suppose, that the grade should be a 4.
Of course, those who enjoy lotteries are welcome to take their chances. But those who believe that exam grades should be reliable and trustworthy – that grade [X] really is grade [X] and is a fair assessment – might wish to bear in mind that for the last several years the average reliability of GCSE, AS and A-level exam grades has been about 75%.
That means that only 3 grades in every 4, as awarded, truly reflect the candidate’s ability. And 1 in every 4 doesn’t. So of the approximately 6 million grades awarded following each year’s exams, some 4.5 million have been reliable; about 1.5 million, not. But no one knows which, specific, 1.5 million grades, or which, specific, candidates.
Furthermore, the average reliability, 75%, masks a wide variability by subject, and by mark within subject – for example, a script marked at, or very close to, any grade boundary in any subject has a probability of being awarded the right grade of about 50%. The exam board might as well toss a coin.
So I find it deeply disturbing that Ofqual’s Executive Director for Strategy, Risk and Research, Dr Michelle Meadows, ‘takes solace’ that 98% of A-level grades, and 96% of GCSE grades, are ‘accurate plus or minus one grade’.
What confidence can there be in a regulator who ‘takes solace’ in presiding over a process which, year on year on year, has delivered, and will continue to deliver, such unreliable and untrustworthy outcomes?
This unreliability has remained hidden because it is only if an awarded grade is challenged – which is both cumbersome and costly – that there might be a chance of a ‘second opinion’. And in 2016, Ofqual changed the rules to make it harder to discover grading errors, so relatively few challenges are made, and few grading errors discovered.
What happened this year was that Ofqual – presumably inadvertently – created an entire population of ‘second opinions’, the teachers’ Centre Assessment Grades, which, for A-level, could be compared to the results of the algorithm. And if there was a discrepancy, people could ask ‘why?’ Hence this year’s explosion.
The explosion has caused much damage.
Yet there has also been one good consequence – to blow the cover off grades being ‘reliable to one grade either way’.
But exam grades should not be ‘reliable to one grade either way’.
They should be ‘reliable’. Full stop.
For until they are, there is, to me, no point in doing exams at all. Not in a few weeks’ time. Not next summer. Not until exams are truly fit-for-purpose. Not until assessments are fully reliable and trustworthy.
Unfortunately, Ofqual don’t seem to think that delivering reliable grades is important. The need to do this does not feature in Ofqual’s Corporate Plan 2020-2021, nor is it mentioned in Ofqual’s follow-up letter to the meeting of 2 September.
Perhaps, however, the Select Committee might take a different view – especially bearing in mind Ofqual’s statutory obligation under Section 22 of the Education Act 2011 to:
secure that regulated qualifications give a reliable indication of knowledge, skills and understanding.