This blog by Dennis Sherwood considers the latest twists in school and college pupils’ assessments for 2021.
Ofqual’s newly-appointed Chief Regulator, Simon Lebus, recently stated his commitment to ‘supporting “innovation” in assessment’. Now that school exams have been cancelled in England, there is an opportunity for him to do just that.
The details for this year’s process remain to be confirmed, but as signalled in the exchange of letters between Gavin Williamson and Mr Lebus on 13 January 2021, and as confirmed in Ofqual’s consultation document of 15 January, a central proposal is that ‘a student’s grade in a subject will be based on their teacher’s assessment of the standard at which they are performing’. In which case, an interesting innovation would be for these teacher assessments to be structured not as 7 A level grades, 6 AS grades and 10 GCSE grades, but as four bands. In higher education, of course, this has been the case for decades. For schools, though, this idea could well be dismissed as stupid, impractical, revolutionary. But might it in fact be sensible?
The key question is this: can a teacher of, say, A level Geography distinguish between students reliably enough to assess one candidate as grade C, and another as grade B? Or for GCSE English, grade 6 and grade 5? Yes, most teachers can probably identify the true high-flyers who merit a high A* or 9, and those who unfortunately are at the opposite end of the ability spectrum. But in the middle, it is far, far harder to make these distinctions.
Perhaps the exam system can do better. Given the much-repeated statement that ‘exams are the fairest way of assessing what a student knows, understands and can do’, surely exams must be the gold standard, delivering finely-divided grades that are fully reliable and trustworthy.
But are they?
No, they are not. At the Select Committee hearing of 2 September 2020, Ofqual’s then Chief Regulator stated that exam grades are ‘reliable to one grade either way’. That’s rather vague, so to make those words real, this chart shows the results, based on Ofqual’s own research, of the author’s simulation (details of which are available on request) of the grades awarded to the 31,768 candidates who sat A level Geography in England in 2019.
The horizontal axis shows the mark given to any script, on a standardised scale from 0 to 100, and for each mark, the blue line answers the question ‘what is the probability that a script given that mark would receive the same grade, had that script been marked by another examiner?’
This question recognises that the examiner who marks any script (or the composition of the team who collectively mark a single script) is in essence determined by a lottery, and that any script might be marked by another examiner (or team). And since ‘it is possible for two examiners to give different but appropriate marks to the same answer’, it is possible that the resulting grade will be different too. The blue line is therefore a good measure of the reliability of the awarded grade.
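The logic of that question can be made concrete with a small Monte Carlo sketch. This is the author of this edit’s illustration only, not Ofqual’s actual method: the grade boundaries and the size of the marker-to-marker spread below are hypothetical, chosen simply to show why mid-grade marks are safe and near-boundary marks are not.

```python
import random

# Hypothetical grade boundaries on the standardised 0-100 mark scale
# (illustrative values only, not Ofqual's actual boundaries).
BOUNDARIES = [(91, "A*"), (77, "A"), (63, "B"), (49, "C"), (35, "D"), (21, "E"), (0, "U")]

def grade(mark):
    """Map a mark to a grade using the hypothetical boundaries."""
    for cutoff, g in BOUNDARIES:
        if mark >= cutoff:
            return g
    return "U"

def same_grade_probability(mark, spread=5.0, trials=20_000, seed=1):
    """Estimate the chance that a second examiner, whose 'different but
    appropriate' mark is modelled as the first mark plus zero-mean
    Gaussian noise, would award the same grade."""
    rng = random.Random(seed)
    original = grade(mark)
    same = 0
    for _ in range(trials):
        remark = min(100, max(0, rng.gauss(mark, spread)))
        if grade(remark) == original:
            same += 1
    return same / trials

# A mark in the middle of a grade is far safer than one near a boundary:
print(same_grade_probability(70))  # mid-grade B: relatively reliable
print(same_grade_probability(64))  # just above the B boundary: much less so
```

Running this shows the shape of the blue line: the probability of keeping the same grade is high in the middle of each grade and dips sharply at every boundary, which is exactly the bouncing pattern described below.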
The results might be surprising. As can be seen, very high A*s and poor Us are 100% reliable, implying that those grades would be awarded no matter who marked the script. But look at grades A, B, C, D and E, for which the grade reliability bounces around between about 60% and 40%. That means that, in 2019, the nearly 30,000 students ‘awarded’ grades A to E in A level Geography (that’s more than 90% of the total cohort) each had about a 50% chance of being awarded a different grade had their scripts been marked by another examiner. What does that say about the reliability and trustworthiness of the grade on the certificate?
For GCSEs, because there are more, and necessarily narrower, grades, matters are even worse, as exemplified by this chart showing the results of a simulation of the grades ‘awarded’ to the 707,059 students in England who sat GCSE English in 2019.
As can be seen, grades 4, 5, 6, 7 and 8 are less than 50% reliable, with grades 1, 2 and 3 being somewhat better – but even the 70% reliability reached by scripts marked at the middle of grade 3 is hardly an accolade.
Yes, grades for Maths and Physics are in general more reliable than those for Geography and English, but even these dip towards 50% at the grade boundaries. The overall message, however, is clear: the ‘gold standard’ of reliability isn’t quite as golden as one might have thought, let alone wished. And if the exam system can’t reliably distinguish between a grade C and a grade B, or a grade 5 and a grade 6, why should teachers be asked to make such spurious – and correspondingly unjustifiable – distinctions?
I don’t think they should. To do so implies a wisdom that I don’t think any human being can demonstrate. But the pressures to do so are strong – not just the comforting allure of the familiar but the pragmatic fact that UCAS ‘predictions’, and the corresponding admissions offers for the next academic year, are expressed in terms of conventional A level grades. Those grades, though, are only ‘reliable to one grade either way’, implying that a certificate showing AAB really means ‘any grades from A*A*A to BBC, but no one knows which’. What does that say about an offer of AAB, and the fairness of adhering to it?
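The AAB example can be checked mechanically. This toy sketch (the author of this edit’s illustration, assuming a simple A* to E ladder and ignoring U) enumerates the best and worst certificates consistent with each grade being ‘reliable to one grade either way’:

```python
# A level grade ladder, best to worst (illustrative; U omitted for simplicity).
LADDER = ["A*", "A", "B", "C", "D", "E"]

def one_grade_either_way(grades):
    """For a set of awarded grades, return the best and worst certificates
    consistent with each grade being 'reliable to one grade either way'."""
    best = [LADDER[max(0, LADDER.index(g) - 1)] for g in grades]
    worst = [LADDER[min(len(LADDER) - 1, LADDER.index(g) + 1)] for g in grades]
    return best, worst

best, worst = one_grade_either_way(["A", "A", "B"])
print(best)   # ['A*', 'A*', 'A']
print(worst)  # ['B', 'B', 'C']
```

The printed range, A*A*A down to BBC, is precisely the spread the paragraph above describes for a certificate showing AAB.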
For many years, we have all believed that there is a real difference between a grade B and a grade C, between a grade 5 and a grade 6. We now know this to be a myth. So although this idea does not appear in Ofqual’s consultation, surely now is an ideal opportunity to bust the myth and to ‘support “innovation” in assessment’: by asking teachers to submit grades in, say, four bands – being especially careful about candidates close to the grade boundaries – and by inviting universities to solve the problem of how to reconcile their offers with this more realistic, and honest, assessment structure.