I write this having just read an article in The Guardian about doctors’ condemnation of the ‘shroud of secrecy’ thrown by ministers and Public Health England over the number of coronavirus tests giving wrong results. Do you remember Matt Hancock’s declaration that ‘no test is better than a bad test’? The article suggests that some 25 per cent to 29 per cent of Covid-19 tests are wrongly declaring people to be virus-free, with the result that they are free, inadvertently, to infect others. Which is bad news indeed.
To determine the reliability of a diagnostic test, medics measure the number of ‘false negatives’ (tests that say the condition is absent when it is in fact present) and ‘false positives’ (tests that say the condition is present when it is in fact absent), often representing the overall picture as a ‘contingency chart’, such as that shown here:
The area of each ‘bubble’ is proportional to the likelihood of each of the four possible test outcomes. A perfect test would be associated with a chart which does not show any blue or yellow bubbles at all, the only feature being two large green bubbles. An unreliable test would be associated with a chart showing large blue and yellow bubbles and small green ones. The test shown here, with small blue and yellow bubbles, indicates that 10 tests in every 1,000 give a false result. This test is good, for it has a high likelihood (99 per cent) of determining the truth. The test results are therefore trustworthy.
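To make that arithmetic concrete, here is a minimal sketch in Python, using made-up counts chosen only to match the ‘10 false results in every 1,000 tests’ example above:

```python
# Reliability metrics for a diagnostic test, using made-up counts
# consistent with the '10 false results in every 1,000 tests' example.
true_positives = 495    # test says 'yes', condition present (green bubble)
true_negatives = 495    # test says 'no',  condition absent  (green bubble)
false_positives = 5     # test says 'yes', condition absent  (yellow bubble)
false_negatives = 5     # test says 'no',  condition present (blue bubble)

total = true_positives + true_negatives + false_positives + false_negatives
accuracy = (true_positives + true_negatives) / total       # 990 / 1,000
error_rate = (false_positives + false_negatives) / total   # 10 / 1,000

print(f"right: {accuracy:.0%}, wrong: {error_rate:.0%}")   # right: 99%, wrong: 1%
```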
A school exam is a test, and the outcomes, the grades. But is an exam a ‘good’ or a ‘bad’ test? It all depends on the numbers of ‘false negatives’ and ‘false positives’. But what do these mean in the context of exams?
To answer that question, consider the Summer 2019 GCSE English exam, taken by 707,059 candidates in England, of whom about 123,000 were awarded grade 4 and about 174,000, grade 3. All grade boundaries are important, but this one is particularly so, for it is the implied pass/fail boundary: those awarded grade 3 are denied many opportunities, and have to re-sit. We all tend to assume that all those awarded grade 3 merit grade 3, and likewise grade 4. But is this assumption necessarily true? Might it be possible that some of the 174,000 candidates awarded grade 3 should have been awarded grade 4, and so were actually awarded a grade lower than they merited? And if so, how many candidates might have been ‘disadvantaged’ accordingly?
These questions can be answered, for in November 2018 Ofqual published a landmark document quantifying the reliability of school exam grades. The research was of the highest quality and – very briefly – studied entire cohorts for 14 subjects, with all scripts marked twice: one mark being given by an ‘ordinary’ examiner (as normally happens for all exams), and the other by a senior examiner (as might happen, for example, on appeal). The two marks were then mapped onto the same grade scale, resulting in the ‘ordinary’ examiner’s grade, and the senior examiner’s grade, which was designated ‘definitive’. These two grades were then compared. You might have expected the comparison to be a 100 per cent match. Not so. For English Language, for example, only 61 per cent of the ‘ordinary’ examiner’s grades were the same as the senior examiner’s ‘definitive’ grade, and the remaining 39 per cent were different – about one half of these being higher grades, and the other half, lower.
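The comparison at the heart of the study can be expressed in a few lines. This toy sketch uses entirely hypothetical grade lists, not Ofqual’s data:

```python
# The comparison at the heart of the Ofqual study, sketched with entirely
# hypothetical data: what fraction of 'ordinary' examiners' grades match
# the senior examiner's 'definitive' grade for the same scripts?
ordinary   = [4, 3, 5, 4, 3, 6, 4, 3, 5, 4]
definitive = [4, 4, 5, 3, 3, 6, 5, 3, 5, 4]

matches = sum(o == d for o, d in zip(ordinary, definitive))
print(f"{matches / len(ordinary):.0%} of grades match")  # 70% for this toy data
```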
To explore this matter further, I have developed some computer simulations that replicate actual grade distributions as well as Ofqual’s findings concerning grade reliabilities. One of the outputs is a diagram such as this representation of the 4/3 grade boundary for 2019 GCSE English candidates in England:
According to my simulation (details of which are available on request), of the approximately 174,000 candidates awarded grade 3 in the exam, some 126,000 (rather more than 70 per cent) would be awarded the same grade by both an ‘ordinary’ and a senior examiner. But about 26,000 candidates (nearly 15 per cent) awarded grade 3 as the result of an ‘ordinary’ examiner’s mark would have received grade 4 had their scripts been marked by a senior examiner. These candidates were therefore awarded a lower grade than the senior examiner’s ‘definitive’ grade, and so might be regarded as ‘disadvantaged’. That leaves about another 22,000 candidates originally awarded grade 3: these candidates would have been awarded grade 2 (about 20,000) or grade 5 (about 2,000) by a senior examiner, and are not shown on the chart (but can be shown on a larger chart with more grades).
Similarly, of the approximately 123,000 candidates in England awarded grade 4, my simulation suggests that about 65,000 (rather more than 50 per cent) would have that grade confirmed by a senior examiner. But about 29,000 (nearly 25 per cent) would be down-graded to 3. These candidates were actually awarded a grade higher than the senior examiner’s definitive grade. They were ‘lucky’. And, as before, there are some other categories not shown on this chart – approximately 27,000 candidates would be awarded grade 5 by a senior examiner, and 2,000 grade 6, both these groups being ‘disadvantaged’, the latter students doubly so.
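For readers who like to see the mechanism rather than just the outputs, the sketch below illustrates the general idea of such a simulation. It is emphatically not my actual model (whose details, as noted, are available on request): the mark scale, grade boundaries and marking-error spread are all hypothetical, chosen only to show how marking noise near a grade boundary turns into ‘confirmed’, ‘disadvantaged’ and ‘lucky’ candidates.

```python
# A minimal Monte Carlo sketch of the mechanism -- NOT the author's actual
# model. Mark scale, grade boundaries and error spreads are hypothetical.
import random

random.seed(1)

BOUNDARIES = {2: 40, 3: 50, 4: 60, 5: 70}  # hypothetical lowest mark per grade

def grade(mark):
    g = 1
    for grd, lowest in BOUNDARIES.items():
        if mark >= lowest:
            g = grd
    return g

confirmed = disadvantaged = lucky = 0
for _ in range(100_000):
    definitive_mark = random.gauss(55, 10)        # senior examiner's mark,
                                                  # treated as 'definitive'
    ordinary_mark = definitive_mark + random.gauss(0, 4)   # plus marking noise
    g_def, g_ord = grade(definitive_mark), grade(ordinary_mark)
    if g_ord < g_def:
        disadvantaged += 1   # awarded lower than the 'definitive' grade
    elif g_ord > g_def:
        lucky += 1           # awarded higher than the 'definitive' grade
    else:
        confirmed += 1

print(f"confirmed: {confirmed:,}  disadvantaged: {disadvantaged:,}  lucky: {lucky:,}")
```

The larger the marking noise relative to the width of a grade, the larger the blue and yellow bubbles become: that, in essence, is why ‘fuzzier’ subjects produce less reliable grades.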
If the examination system were to award totally reliable grades, all the grades as actually awarded would be confirmed by a senior examiner. The chart would therefore show only green bubbles, and there would be no ‘disadvantaged’ candidates as shown by the blue bubble, and no ‘lucky’ candidates as shown by the yellow bubble. But for 2019 GCSE English Language, the blue and yellow bubbles are there to be seen. And of the approximately 297,000 candidates actually awarded either grade 3 or grade 4, only about 191,000 were awarded the grade they truly merited. So for this test, the overall outcome is about 64 per cent right, 36 per cent wrong.
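The headline figure follows directly from the bubble counts quoted above; a few lines reproduce the arithmetic:

```python
# Reproducing the headline arithmetic from the bubble counts quoted above.
confirmed_grade_3 = 126_000   # grade 3 confirmed by a senior examiner
confirmed_grade_4 = 65_000    # grade 4 confirmed by a senior examiner
total_3_or_4 = 174_000 + 123_000   # all candidates awarded grade 3 or 4

right = (confirmed_grade_3 + confirmed_grade_4) / total_3_or_4
print(f"right: {right:.0%}, wrong: {1 - right:.0%}")  # right: 64%, wrong: 36%
```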
The analogy between the medical contingency table and the diagrams from my simulation is striking. In the medical chart, the ‘test’ result is either positive or negative, and this is used as an indication of the unknown ‘truth’ as to whether a particular medical condition is present or absent. For exams, the unknown ‘truth’ is the ‘definitive’ grade as would be awarded by a senior examiner, and the ‘test’ result is the actually awarded exam grade as determined by the mark given by whichever examiner happens to mark the script.
As can be seen from my chart, for 2019 GCSE English, there are many false negatives (‘disadvantaged’ candidates, awarded a grade lower than they merit) and false positives (‘lucky’ candidates, who are awarded a grade higher than they merit). Similar charts can be drawn for all the other grade boundaries, and for all the other subjects: in general, the less ‘fuzzy’ the subject (Maths, Physics…), the lower the incidence of false negatives and positives; the more fuzzy the subject (History, English…), the greater the incidence.
In medicine, false negatives correspond to people who test negative but actually have a particular condition, and so may inadvertently infect others or miss out on treatment, whilst false positives may result in unnecessary interventions. For the exam system, false negatives (‘disadvantaged’ candidates) are denied opportunities and held back; false positives (‘lucky’ candidates) may find it difficult to cope at the next stage, or may make mistakes (I do hope my doctor wasn’t ‘lucky’ in surgery finals! – and the plumber too…).
Which takes me back to the slogan ‘no test is better than a bad test’. You may judge for yourself whether my charts are evidence that school exams are ‘good’ or ‘bad’ tests. But this year, the ‘no test’ option is the reality. I have argued elsewhere that this year’s ‘no test’ can potentially deliver fairer results than GCSE and A-level exams, but if this is to happen, then teachers must act with integrity and not attempt to ‘game’ the system, and parents must not pressure teachers.
Ofqual’s recent confirmation that the ‘statistical model’ to be used by the boards will be based on actual results from the last three years is very helpful. Once Ofqual confirm the details, this process can easily be replicated at each school, so enabling schools to ensure that the centre assessed grades they submit already comply with the rules, and so will have a high likelihood of being confirmed, rather than being over-ruled. A key indicator is therefore the ratio of confirmed centre assessed grades to the total number submitted: the closer this is to unity, the more successful the process. This ratio should be measured at every school, for each subject, for each board, and for the process overall.
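Computing that indicator is straightforward once both the submitted and the final grades are known. A sketch, in which the record layout and names are hypothetical:

```python
# A sketch of the proposed indicator: the ratio of confirmed centre assessed
# grades to grades submitted, per (school, subject, board). The record
# layout is hypothetical.
from collections import defaultdict

def confirmation_ratios(records):
    """records: iterable of (school, subject, board, submitted, final) tuples."""
    submitted = defaultdict(int)
    confirmed = defaultdict(int)
    for school, subject, board, cag, final in records:
        key = (school, subject, board)
        submitted[key] += 1
        if final == cag:
            confirmed[key] += 1
    return {k: confirmed[k] / submitted[k] for k in submitted}

sample = [
    ("School A", "English", "AQA", 4, 4),  # confirmed
    ("School A", "English", "AQA", 5, 4),  # over-ruled
    ("School A", "English", "AQA", 3, 3),  # confirmed
]
print(confirmation_ratios(sample))  # 2 of 3 confirmed, so the ratio is ~0.67
```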
Unfortunately, the situation in Scotland is very different, for, in a recent letter from the SQA’s Chief Executive and Chief Examiner to the Scottish Parliament Education and Skills Committee, the answer given to the question ‘Will the methodology for moderation be published in advance of teachers submitting estimates to the SQA?’ was ‘no’.
Scottish teachers are therefore throwing darts at a moving dartboard. In the dark. And a dartboard with 13 grades – which, in my opinion, is far too many. I fear that teachers are being set up to fail, which is in no one’s interests. If it is the ‘statistical moderation’ that will ultimately determine the grades, why ask schools to submit grades at all? All they need submit is the rank order – which is what determines fairness – and then the ‘statistical moderation’ process can determine whatever grades it likes.
Overall, though, I still believe that for school grades, it is possible that no test could be better than a bad test. But it is not inevitable. So let’s do everything we can to ensure that this summer, ‘no test’ gives more reliable, and fairer, grades than exams.