This blog was kindly contributed by Dennis Sherwood. Dennis has been reporting for HEPI on this year's A-level exams since March – and on the A-level system for long before that!
It’s not all over yet
This summer’s exam fiasco is rapidly fading from the headlines, and the deadline, 17 September 2020, for appeals – which remain limited to very narrow technical grounds – is fast approaching. This year’s process will, I am sure, never be repeated, and I could well imagine that those who have been in very hot seats indeed are holding their breath, hoping it will all go away, and thanking their lucky stars that they have not – at least as yet – suffered the fate of former Chief Regulator, Sally Collier, and former Permanent Secretary, Jonathan Slater.
But it hasn’t all gone away quite yet, not least for those young people who are still trying to come to terms with unfair consequences. And at the political level, this year’s problems continue to be debated in Parliament, and on 16 September the Education Select Committee will be holding an ‘accountability hearing’ at which the Secretary of State for Education, Gavin Williamson, will be accompanied by two officials, Susan Acland-Hood (Jonathan Slater’s successor) and Michelle Dyson (Director for Qualifications, Curriculum and Extra-Curricular), so adding to the evidence provided on 2 September by Ofqual’s Roger Taylor, Dr Michelle Meadows, Julie Swan and Dame Glenys Stacey.
Four significant documents have been published since 2 September:
- the transcript of the Select Committee meeting;
- a letter from Ofqual to Robert Halfon MP dated 8 September;
- a submission from Cambridge Assessment (the owner of the OCR exam board) to the Select Committee, published on 9 September;
- an Ofqual document dated 16 March 2020, entitled Summer 2020 GCSE and A/AS level exam series – Contingency planning for Covid-19 – options and risks.
These provide the context for this blog.
The Cambridge Assessment bombshell
The submission from Cambridge Assessment contains many ‘behind the scenes’ revelations, but to me the most important is this chart:
The left-hand column shows the ‘Centre Assessment Grades’ (or ‘CAGs’) submitted by this particular college, and the second column the corresponding three-year historical average. As can be seen, the two match exactly – the college had faithfully followed the ‘rules’.
The two right-hand columns show the results of two Ofqual-imposed ‘adjustments’: one for ‘prior attainment’, the second, a ‘national correction’. The consequence was to down-grade the college’s results from about 72.5% A*, A and B to about 68.5%.
At this point, I can only quote the submission verbatim:
20. Our analysis, which we discussed with Ofqual as we did it, suggested that all centres, of whatever size, were affected by the application of the national standards correction. In large centres, the effect was particularly obvious given the large numbers of students close to the grade thresholds before the correction was applied. The consistency of the effect suggested that the model was behaving as predicted. But the size of the effect on results suggested that this input to the model was the cause of the unexpectedly depressed results of some centres.
21. The impact of the national standards correction was not immediately obvious to many schools and colleges because exam boards had not been required to provide them with details of it when distributing results, only the historic averages and prior attainment data employed – in effect, Chart 1 without the far right hand information. This made it very challenging for some schools and colleges to understand their results and damaged public confidence.
These down-grades are not the result of ‘game playing’, ‘over-optimism’, or even rounding errors. Rather, they are a direct consequence of the algorithm, the results of which not even the most conscientious and rule-abiding schools could ever have anticipated.
The original options
Algorithms, of course, don’t ‘go mad’ all by themselves, so the origins of the algorithm, and the thought-patterns of its authors, are of great importance. Some further light has been thrown on these by Ofqual’s ‘options and risks’ document, dated 16 March (that’s two days before Boris Johnson’s announcement that the 2020 exams would be cancelled), the existence of which was revealed in Julie Swan’s response to Question 983 at the meeting of 2 September.
Ofqual’s options paper tables 11 possibilities for this year’s process, of which three were short-listed:
- ‘Additional papers’, in which the exams take place as usual, at the originally-scheduled times, with ‘additional papers’ available for students who were unable to sit the ‘normal’ exams.
- ‘Delay’, whereby the exams would be scheduled for a suitable later date.
- ‘Estimate grades’, this being the award of ‘grades based on teacher estimates which have been statistically moderated at centre / cohort level to bring them, as far as possible, into line with previous years’ results’.
Ofqual’s document explicitly states that their preferred option is the first, so at some time between 16 March and 18 March, someone somewhere over-ruled that preference in favour of the third. That mystery was resolved by Roger Taylor, in his answer to Question 947:
It was the Secretary of State who then subsequently took the decision and announced, without further consultation with Ofqual, that exams were to be cancelled and a system of calculated grades was to be implemented.
The description of the third, selected, option – so few words – clearly states what we all now know: ‘statistically moderated at centre / cohort level to bring [teachers’ estimated grades], as far as possible, into line with previous years’ results’. What a profound pity it didn’t also say ‘and allowing teachers to provide robust evidence of any “outliers” that do not conform to the historical pattern’. That would have made a huge difference.
Notice too that there is a short phrase that does not appear in this description, nor in the statement made on 18 March by Gavin Williamson, but is in his announcement of 20 March, and also in Roger Taylor’s reply to Question 947: ‘calculated grades’. That word ‘calculated’ is, I think, significant, for ‘a system of calculated grades’ is surely substantially different from ‘teachers’ estimated grades moderated according to previous years’ results’.
So perhaps there is an opportunity for some further clarification by the Secretary of State at the meeting on 16 September. What actually happened between 16 March and 18 March? And what were the origins of the idea to devise a ‘system of calculated grades’?
And let me also mention the lost opportunity of one of the options that didn’t make it to the short list, but was alluded to by Roger Taylor, also in his answer to Question 947:
Issue a standardised leaving certificate detailing teachers’ estimates of grades with brief commentary – this would enable students to progress to further education or HE.
As I read that, I was immediately reminded of the ‘Passport in English’ advocated by Roy Blatchford as a solution to the socially catastrophic problem of ‘The Forgotten Third’. To me, this is a very good idea. ‘Good’, though, is judgemental and evaluative, from which my values, beliefs, biases and prejudices are easy to discern; as are those of whoever rejected this option and who presumably placed rather less weight than I would on this bullet point to be found in Ofqual’s ‘arguments for’:
- Will be seen by some as showing trust in the teaching profession during these difficult times.
and rather more on these ‘arguments against’:
- Schools are also likely to expect a refund of exam fees.
- This would call into question the future of GCSEs.
Indeed so. Some might regard these last two not as ‘arguments against’, but rather as ‘arguments for’ – and I wonder what might have happened had this option been on ‘offer’ in Ofqual’s April consultation…
The Select Committee meeting of 2 September
I watched the meeting of 2 September on television, and good viewing it was too. Robert Halfon was a lively, decisive and probing Chair, the members of the Committee asked relevant and often perceptive questions, and the Ofqual responses were appropriately brief and for the most part informative.
So, for example, Questions 946 to 955 unravelled the events of Saturday 15 August, when Ofqual posted an announcement on its website about the use of mock exam results in appeals, only to withdraw it a few hours later, so leading directly to the decision to abandon the algorithm and award all grades as determined by teachers’ Centre Assessment Grades.
On a lighter note, Question 990 asked: ‘At what point did the algorithm mutate?’, to which Dr Meadows replied: ‘I don’t believe that the algorithm ever mutated.’
One recurrent theme concerned transparency and the disclosure of all communications and minutes between Ofqual and the Department for Education. This is proving somewhat troublesome, with questions over whether disclosure falls to Ofqual or to the Department, and a motion to achieve full disclosure, tabled by Kate Green, Labour’s Shadow Secretary of State for Education, on 9 September, was rejected in a division which followed party lines exactly: 325 Conservative MPs and one independent voting ‘no’; 236 MPs, none of whom were Conservative, voting ‘aye’.
But one particular disclosure issue appears still to be at a loose end: in reply to Question 985, Roger Taylor committed to release all submitted school data to a ‘trusted third party … for a deep forensic analysis’.
This is not mentioned in the letter from Dame Glenys Stacey to Robert Halfon dated 8 September, nor was it within the parliamentary motion rejected on 9 September. So, presumably, that still needs to happen – which is a ‘good thing’, for it will answer some important questions: Why were so many Centre Assessment Grades over-ruled by the algorithm? Which schools were ‘over-optimistic’? Which ‘over-bids’ are more likely to be attributable to year-on-year variability or the need to round fractions to whole numbers? How many down-grades were attributable to Ofqual’s ‘adjustments’ for ‘prior attainment’ and ‘national correction’?
Another recurrent theme concerned the accuracy of the algorithm. In response to Question 974, Dr Meadows stated that:
Every year, we publish marking consistency metrics that report the extent to which grades would change if a different senior examiner had looked at the work. In fact, we looked at that work this year and took some comfort from it, in the sense that the levels of accuracy that we were seeing from the standardisation model were very similar to those levels of accuracy that we see each year through the marking process.
That, ‘comfortingly’, pegs the accuracy of the algorithm to the accuracy of marking. The mention of ‘marking consistency metrics’ is, presumably, a reference to Ofqual’s November 2018 publication Marking Consistency Metrics – An update, Figure 12 of which presents the measures of the reliability of the grades awarded for each of 14 subjects (my preference being for ‘reliability’ rather than ‘accuracy’). To my knowledge, this is the only document that reports, for qualifications as awarded, ‘the extent to which grades would change if a different senior examiner had looked at the work’, so those initial words ‘every year’ are to me something of a puzzle: perhaps Dr Meadows will be kind enough to direct me to the publications I have missed.
The ‘accuracy of marking’ was subsequently quantified by Dr Meadows in her reply to Question 996:
There is a benchmark that is used in assessment evidence that any assessment should be accurate for 90% of students plus or minus one grade. That is a standard benchmark. On average, the subjects were doing much better than that. For A-level we were looking at 98%; for GCSE we were looking at 96%, so we did take some solace from that.
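To see how a high ‘plus or minus one grade’ figure can sit alongside much lower exact-grade reliability, here is a minimal toy simulation – my own construction for illustration, not Ofqual’s model. It assumes marks from 0 to 100, grade boundaries every 10 marks, and a second examiner whose mark differs from the first by up to 5 marks either way; all of those numbers are assumptions.

```python
import random

random.seed(42)

# Hedged sketch, not Ofqual's actual model: marks run 0-100, grade
# boundaries fall every 10 marks, and a second examiner's mark differs
# from the first by up to 5 marks either way (all assumed values).
def grade(mark):
    """Map a mark to a grade band (0 = lowest, 9 = highest)."""
    return min(mark // 10, 9)

trials = 10_000
exact_count = 0       # both examiners award the same grade
within_one_count = 0  # grades differ by at most one band

for _ in range(trials):
    first = random.randint(0, 100)
    # hypothetical re-mark within +/- 5 marks of the first examiner
    second = min(100, max(0, first + random.randint(-5, 5)))
    g1, g2 = grade(first), grade(second)
    if g1 == g2:
        exact_count += 1
    if abs(g1 - g2) <= 1:
        within_one_count += 1

print(f"exact-grade agreement: {exact_count / trials:.0%}")
print(f"within one grade:      {within_one_count / trials:.0%}")
```

Because the assumed re-marking noise (5 marks) is smaller than the band width (10 marks), agreement ‘within one grade either way’ is 100% by construction – yet exact-grade agreement comes out at only around three-quarters. A reassuring headline figure and a much less reassuring underlying one can describe the very same process.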
This was confirmed in the answer to the very last question (1058) given by Dame Glenys Stacey, who, referring back to the response from Dr Meadows, acknowledged that grades ‘are reliable to one grade either way’.
That exam grades ‘are reliable to one grade either way’ has two consequences. The first concerns the reliability of the algorithm. As Dr Meadows stated, the ‘accuracy’ of the algorithm was tested against a benchmark of grades that are ‘reliable to one grade either way’ – hardly a criterion that would build trust and confidence in the outcomes. That alone would, to me, have been a very good reason to scrap the algorithm.
Back to ‘exams as usual’
The algorithm, however, is dead and gone, never to return.
But exams are back, and many students unhappy with this year’s non-exam grades will soon sit this autumn’s ‘real’ exams.
I think these exams are a bad idea.
Not because those students don’t deserve redress. They do.
Not because of the inequities in opportunity for teaching and learning over the last several months, which are profound.
But because – as so explicitly stated by Dr Michelle Meadows and confirmed by Dame Glenys Stacey – the resulting grades will be fundamentally unreliable. Not that they use those words. Rather, their words are the much more reassuring: ‘reliable to one grade either way’ or ‘accurate plus or minus one grade’; even more so when associated with ‘take comfort’ and ‘take solace’.
But what does ‘reliable to one grade either way’ mean?
Suppose that a certificate, resulting from a ‘real’ exam, shows grade B. Might that grade, legitimately, be an A?
Likewise the student ‘awarded’ grade 3 for GCSE English or Maths, who has ‘failed’, who is forced to re-sit, on whom many doors have slammed shut, and whose self-confidence has been shattered. Suppose, just suppose, that the grade should be a 4.
Of course, those who enjoy lotteries are welcome to take their chances. But those who believe that exam grades should be reliable and trustworthy – that grade [X] really is grade [X] and is a fair assessment – might wish to bear in mind that for the last several years the average reliability of GCSE, AS and A-level exam grades has been about 75%.
That means that only 3 grades in every 4, as awarded, truly reflect the candidate’s ability. And 1 in every 4 doesn’t. So of the approximately 6 million grades awarded following each year’s exams, some 4.5 million have been reliable; about 1.5 million, not. But no one knows which, specific, 1.5 million grades, or which, specific, candidates.
Furthermore, the average reliability, 75%, masks a wide variability by subject, and by mark within subject – for example, a script marked at, or very close to, any grade boundary in any subject has a probability of being awarded the right grade of about 50%. The exam board might as well toss a coin.
So I find it deeply disturbing that Ofqual’s Executive Director for Strategy, Risk and Research, Dr Michelle Meadows, ‘takes solace’ that 98% of A-level grades, and 96% of GCSE grades, are ‘accurate plus or minus one grade’.
What confidence can there be in a regulator who ‘takes solace’ in presiding over a process which, year on year on year, has delivered, and will continue to deliver, such unreliable and untrustworthy outcomes?
This unreliability has remained hidden because it is only if an awarded grade is challenged – which is both cumbersome and costly – that there might be a chance of a ‘second opinion’. And in 2016, Ofqual changed the rules to make it harder to discover grading errors, so relatively few challenges are made, and few grading errors discovered.
What happened this year was that Ofqual – presumably inadvertently – created an entire population of ‘second opinions’, the teachers’ Centre Assessment Grades, which, for A-level, could be compared to the results of the algorithm. And if there was a discrepancy, people could ask ‘why?’ Hence this year’s explosion.
The explosion has caused much damage.
Yet there has also been one good consequence – to blow the cover off grades being ‘reliable to one grade either way’.
But exam grades should not be ‘reliable to one grade either way’.
They should be ‘reliable’. Full stop.
For until they are, there is, to me, no point in doing exams at all. Not in a few weeks’ time. Not next summer. Not until exams are truly fit-for-purpose. Not until assessments are fully reliable and trustworthy.
Unfortunately, Ofqual don’t seem to think that delivering reliable grades is important. The need to do this does not feature in Ofqual’s Corporate Plan 2020-2021, nor is it mentioned in Ofqual’s follow-up letter to the meeting of 2 September.
Perhaps, however, the Select Committee might take a different view – especially bearing in mind Ofqual’s statutory obligation under Section 22 of the Education Act 2011 to:
secure that regulated qualifications give a reliable indication of knowledge, skills and understanding.