I write this having just read an article in The Guardian about doctors’ condemnation of the ‘shroud of secrecy’ thrown by ministers and Public Health England over the number of coronavirus tests giving wrong results. Do you remember Matt Hancock’s declaration that ‘no test is better than a bad test’? The article suggests that some 25 per cent to 29 per cent of Covid-19 tests are wrongly declaring people to be virus-free, with the result that they are free, inadvertently, to infect others. Which is bad news indeed.
To determine the reliability of a diagnostic test, medics measure the number of ‘false negatives’ (tests that say the condition is absent when it is in fact present) and ‘false positives’ (tests that say the condition is present when it is in fact absent), often representing the overall picture as a ‘contingency chart’, such as that shown here:
The area of each ‘bubble’ is proportional to the likelihood of each of the four possible test outcomes. A perfect test would be associated with a chart which does not show any blue or yellow bubbles at all, the only feature being two large green bubbles. An unreliable test would be associated with a chart showing large blue and yellow bubbles and small green ones. The test shown here, with small blue and yellow bubbles, indicates that 10 tests in every 1,000 give a false result. This test is good, for it has a high likelihood (99 per cent) of determining the truth. The test results are therefore trustworthy.
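To make that arithmetic concrete, here is a minimal sketch in Python, using made-up counts chosen only to match the ‘10 false results in every 1,000 tests’ example above:

```python
# Reliability metrics for a diagnostic test, using made-up counts
# consistent with the '10 false results in every 1,000 tests' example.
true_positives = 495    # test says 'yes', condition present (green bubble)
true_negatives = 495    # test says 'no',  condition absent  (green bubble)
false_positives = 5     # test says 'yes', condition absent  (yellow bubble)
false_negatives = 5     # test says 'no',  condition present (blue bubble)

total = true_positives + true_negatives + false_positives + false_negatives
accuracy = (true_positives + true_negatives) / total       # 990 / 1,000
error_rate = (false_positives + false_negatives) / total   # 10 / 1,000

print(f"right: {accuracy:.0%}, wrong: {error_rate:.0%}")   # right: 99%, wrong: 1%
```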
A school exam is a test, and the outcomes, the grades. But is an exam a ‘good’ or a ‘bad’ test? It all depends on the numbers of ‘false negatives’ and ‘false positives’. But what do these mean in the context of exams?
To answer that question, consider the Summer 2019 GCSE English exam, taken by 707,059 candidates in England, of whom about 123,000 were awarded grade 4 and about 174,000, grade 3. All grade boundaries are important, but this one is particularly so, for it is the implied pass/fail boundary: those awarded grade 3 are denied many opportunities, and have to re-sit. We all tend to assume that all those awarded grade 3 merit grade 3, and likewise grade 4. But is this assumption necessarily true? Might it be possible that some of the 174,000 candidates awarded grade 3 should have been awarded grade 4, and so were actually awarded a grade lower than they merited? And if so, how many candidates might have been ‘disadvantaged’ accordingly?
These questions can be answered, for in November 2018 Ofqual published a landmark document quantifying the reliability of school exam grades. The research was of the highest quality and – very briefly – studied entire cohorts for 14 subjects, with all scripts marked twice: one mark being given by an ‘ordinary’ examiner (as normally happens for all exams), and the other by a senior examiner (as might happen, for example, on appeal). The two marks were then mapped onto the same grade scale, resulting in the ‘ordinary’ examiner’s grade, and the senior examiner’s grade, which was designated ‘definitive’. These two grades were then compared. You might have expected the comparison to be a 100 per cent match. Not so. For English Language, for example, only 61 per cent of the ‘ordinary’ examiner’s grades were the same as the senior examiner’s ‘definitive’ grade, and the remaining 39 per cent were different – about one half of these being higher grades, and the other half, lower.
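The comparison at the heart of the study can be expressed in a few lines. This toy sketch uses entirely hypothetical grade lists, not Ofqual’s data:

```python
# The comparison at the heart of the Ofqual study, sketched with entirely
# hypothetical data: what fraction of 'ordinary' examiners' grades match
# the senior examiner's 'definitive' grade for the same scripts?
ordinary   = [4, 3, 5, 4, 3, 6, 4, 3, 5, 4]
definitive = [4, 4, 5, 3, 3, 6, 5, 3, 5, 4]

matches = sum(o == d for o, d in zip(ordinary, definitive))
print(f"{matches / len(ordinary):.0%} of grades match")  # 70% for this toy data
```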
To explore this matter further, I have developed some computer simulations that replicate actual grade distributions as well as Ofqual’s findings concerning grade reliabilities. One of the outputs is a diagram such as this representation of the 4/3 grade boundary for 2019 GCSE English candidates in England:
According to my simulation (details of which are available on request), of the approximately 174,000 candidates awarded grade 3 in the exam, some 126,000 (rather more than 70 per cent) would be awarded the same grade by both an ‘ordinary’ and a senior examiner. But about 26,000 candidates (nearly 15 per cent) awarded grade 3 as the result of an ‘ordinary’ examiner’s mark would have received grade 4 had their scripts been marked by a senior examiner. These candidates were therefore awarded a lower grade than the senior examiner’s ‘definitive’ grade, and so might be regarded as ‘disadvantaged’. That leaves about another 22,000 candidates originally awarded grade 3: these candidates would have been awarded grade 2 (about 20,000) or grade 5 (about 2,000) by a senior examiner, and are not shown on the chart (but can be shown on a larger chart with more grades).
Similarly, of the approximately 123,000 candidates in England awarded grade 4, my simulation suggests that about 65,000 (rather more than 50 per cent) would have that grade confirmed by a senior examiner. But about 29,000 (nearly 25 per cent) would be down-graded to 3. These candidates were actually awarded a grade higher than the senior examiner’s definitive grade. They were ‘lucky’. And, as before, there are some other categories not shown on this chart – approximately 27,000 candidates would be awarded grade 5 by a senior examiner, and 2,000 grade 6, both these groups being ‘disadvantaged’, the latter students doubly so.
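For readers who like to see the mechanism rather than just the outputs, the sketch below illustrates the general idea of such a simulation. It is emphatically not my actual model (whose details, as noted, are available on request): the mark scale, grade boundaries and marking-error spread are all hypothetical, chosen only to show how marking noise near a grade boundary turns into ‘confirmed’, ‘disadvantaged’ and ‘lucky’ candidates.

```python
# A minimal Monte Carlo sketch of the mechanism -- NOT the author's actual
# model. Mark scale, grade boundaries and error spreads are hypothetical.
import random

random.seed(1)

BOUNDARIES = {2: 40, 3: 50, 4: 60, 5: 70}  # hypothetical lowest mark per grade

def grade(mark):
    g = 1
    for grd, lowest in BOUNDARIES.items():
        if mark >= lowest:
            g = grd
    return g

confirmed = disadvantaged = lucky = 0
for _ in range(100_000):
    definitive_mark = random.gauss(55, 10)        # senior examiner's mark,
                                                  # treated as 'definitive'
    ordinary_mark = definitive_mark + random.gauss(0, 4)   # plus marking noise
    g_def, g_ord = grade(definitive_mark), grade(ordinary_mark)
    if g_ord < g_def:
        disadvantaged += 1   # awarded lower than the 'definitive' grade
    elif g_ord > g_def:
        lucky += 1           # awarded higher than the 'definitive' grade
    else:
        confirmed += 1

print(f"confirmed: {confirmed:,}  disadvantaged: {disadvantaged:,}  lucky: {lucky:,}")
```

The larger the marking noise relative to the width of a grade, the larger the blue and yellow bubbles become: that, in essence, is why ‘fuzzier’ subjects produce less reliable grades.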
If the examination system were to award totally reliable grades, all the grades as actually awarded would be confirmed by a senior examiner. The chart would therefore show only green bubbles, and there would be no ‘disadvantaged’ candidates as shown by the blue bubble, and no ‘lucky’ candidates as shown by the yellow bubble. But for 2019 GCSE English Language, the blue and yellow bubbles are there to be seen. And of the approximately 297,000 candidates actually awarded either grade 3 or grade 4, only about 191,000 were awarded the grade they truly merited. So for this test, the overall outcome is about 64 per cent right, 36 per cent wrong.
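The headline figure follows directly from the bubble counts quoted above; a few lines reproduce the arithmetic:

```python
# Reproducing the headline arithmetic from the bubble counts quoted above.
confirmed_grade_3 = 126_000   # grade 3 confirmed by a senior examiner
confirmed_grade_4 = 65_000    # grade 4 confirmed by a senior examiner
total_3_or_4 = 174_000 + 123_000   # all candidates awarded grade 3 or 4

right = (confirmed_grade_3 + confirmed_grade_4) / total_3_or_4
print(f"right: {right:.0%}, wrong: {1 - right:.0%}")  # right: 64%, wrong: 36%
```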
The analogy between the medical contingency table and the diagrams from my simulation is striking. In the medical chart, the ‘test’ result is either positive or negative, and this is used as an indication of the unknown ‘truth’ as to whether a particular medical condition is present or absent. For exams, the unknown ‘truth’ is the ‘definitive’ grade as would be awarded by a senior examiner, and the ‘test’ result is the actually awarded exam grade as determined by the mark given by whichever examiner happens to mark the script.
As can be seen from my chart, for 2019 GCSE English, there are many false negatives (‘disadvantaged’ candidates, awarded a grade lower than they merit) and false positives (‘lucky’ candidates, who are awarded a grade higher than they merit). Similar charts can be drawn for all the other grade boundaries, and for all the other subjects: in general, the less ‘fuzzy’ the subject (Maths, Physics…), the lower the incidence of false negatives and positives; the more fuzzy the subject (History, English…), the greater the incidence.
In medicine, false negatives correspond to people who test negative but actually have a particular condition, and so may inadvertently infect others or miss out on treatment, whilst false positives may result in unnecessary interventions. For the exam system, false negatives (‘disadvantaged’ candidates) are denied opportunities and held back; false positives (‘lucky’ candidates) may find it difficult to cope at the next stage, or may make mistakes (I do hope my doctor wasn’t ‘lucky’ in surgery finals! – and the plumber too…).
Which takes me back to the slogan ‘no test is better than a bad test’. You may judge for yourself whether my charts are evidence that school exams are ‘good’ or ‘bad’ tests. But this year, the ‘no test’ option is the reality. I have argued elsewhere that this year’s ‘no test’ can potentially deliver fairer results than GCSE and A-level exams, but if this is to happen, then teachers must act with integrity and not attempt to ‘game’ the system, and parents must not pressure teachers.
Ofqual’s recent confirmation that the ‘statistical model’ to be used by the boards will be based on actual results from the last three years is very helpful. Once Ofqual confirm the details, this process can easily be replicated at each school, so enabling schools to ensure that the centre assessed grades they submit already comply with the rules, and so will have a high likelihood of being confirmed, rather than being over-ruled. A key indicator is therefore the ratio of confirmed centre assessed grades to the total number submitted: the closer this is to unity, the more successful the process. This ratio should be measured at every school, for each subject, for each board, and for the process overall.
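Computing that indicator is straightforward once both the submitted and the final grades are known. A sketch, in which the record layout and names are hypothetical:

```python
# A sketch of the proposed indicator: the ratio of confirmed centre assessed
# grades to grades submitted, per (school, subject, board). The record
# layout is hypothetical.
from collections import defaultdict

def confirmation_ratios(records):
    """records: iterable of (school, subject, board, submitted, final) tuples."""
    submitted = defaultdict(int)
    confirmed = defaultdict(int)
    for school, subject, board, cag, final in records:
        key = (school, subject, board)
        submitted[key] += 1
        if final == cag:
            confirmed[key] += 1
    return {k: confirmed[k] / submitted[k] for k in submitted}

sample = [
    ("School A", "English", "AQA", 4, 4),  # confirmed
    ("School A", "English", "AQA", 5, 4),  # over-ruled
    ("School A", "English", "AQA", 3, 3),  # confirmed
]
print(confirmation_ratios(sample))  # 2 of 3 confirmed, so the ratio is ~0.67
```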
Unfortunately, the situation in Scotland is very different, for, in a recent letter from the SQA’s Chief Executive and Chief Examiner to the Scottish Parliament Education and Skills Committee, the answer given to the question ‘Will the methodology for moderation be published in advance of teachers submitting estimates to the SQA?’ was ‘no’.
Scottish teachers are therefore throwing darts at a moving dartboard. In the dark. And a dartboard with 13 grades – which, in my opinion, is far too many. I fear that teachers are being set up to fail, which is in no one’s interests. If it is the ‘statistical moderation’ that will ultimately determine the grades, why ask schools to submit grades at all? All they need submit is the rank order – which is what determines fairness – and then the ‘statistical moderation’ process can determine whatever grades it likes.
Overall, though, I still believe that for school grades, it is possible that no test could be better than a bad test. But it is not inevitable. So let’s do everything we can to ensure that this summer, ‘no test’ gives more reliable, and fairer, grades than exams.