On Friday, the regulator Ofqual released its 133-page report into what happened with GCSE English this summer. I skim-read it then, but couldn’t quite believe my eyes at some of what I skimmed, so read it more thoroughly over the weekend. I now believe my eyes; it’s Ofqual I don’t believe.
Ofqual argue, in essence, that thousands of students got lower than their predicted GCSE English grade this summer, not because grading boundaries were radically and unfairly revised between January and June 2012, but because teachers tend to “overmark”. This is the key graph Ofqual uses for that argument (this one is for the AQA exam board, but the ones provided for the other boards are similar):
The sharp spike just to the East of the assumed pass mark (based on January grading), says Ofqual, shows that teachers were deliberately “overmarking” assessments to ensure that their little darlings hovering around the C/D borderline got over that line. This assertion has been translated into teachers “cheating” in the popular press, or “massaging marks” in the more restrained version offered up by Chris Cook at the FT (who also uses this graph as the starting point for his analysis).
Unfortunately for Ofqual, the graph shows nothing of the sort, because there is a perfectly valid alternative explanation for the spike East of C: very good teaching.
Thus, as I noted in my comment on Chris Cook’s piece:
The reader [of Chris's piece] may be left with the impression that teachers in England did actually cheat (or “massage marks”). There is simply no evidence for this, and there is an obvious counter-explanation for the rapid rise to the East of the C line in the graph. That is that teachers are getting very good at using relatively new tracking systems to identify, before the controlled assessments, where each ‘borderline’ pupil needs to get to in order to get the C (or B, or whatever), and then work hard – using 1:1 TA resources etc – to identify what is needed to get them over that line. This is not ‘massaging figures’; it’s massaging CA performance.
I know that’s what happens in the school where I’m English link governor – perfectly honest and actually very commendable, though, as you suggest, all part of a crazy system.
In fact, this is acknowledged in the Ofqual report (para 6.12):
In many schools the prediction process is supported by data analysis which is updated frequently. In a typical school marks from student class work and mock exams in Years 10 and 11 were fed into the tracking system every six weeks, and senior management met every two weeks to discuss them. When students were not making enough progress to achieve their target outcomes, interventions were arranged. This process of targeting teaching and learning support to secure target grades has in many schools been successful in previous years, which added to schools’ surprise when this year’s results were not as predicted. Teachers commonly said that they were used to their predictions being “spot-on”.
Yet, despite the acceptance that such targeted interventions have been “successful in previous years”, they are then totally ignored as a possible explanation for students’ performance “spiking” just after the point where teachers thought they’d gain a C.
Ofqual seeks to back up this overmarking assertion with two other key pieces of data.
First, it compares the ‘East of C’ spike referred to above with a similar spike in graphed data from the Year 1 phonics check first used in the summer of 2012:
The “pass” mark of 32 was made available to teachers before they administered the check. The graph shows how teacher scoring was affected by knowledge of the threshold. The graph…shows how many pupils were given each score: fewer than 10,000 were given a score of 31 compared with over 40,000 being given a score of 32 (paras. 6.73-6.74).
If this “spiking” happens even in such “low-stakes assessment”, says Ofqual, then it’s proof that teachers pre-warned about grade boundaries will always overmark to ensure pass rates are higher. This is, again, nonsense. A perfectly valid alternative explanation for the spiking in the phonics case is that these are indeed “low-stakes assessments”, seen as bureaucratic nonsense imposed by the government and at worst (as Michael Rosen has argued powerfully) simply invalid as tests of children’s reading ability. In such circumstances, the urge to simply add a mark or two and move on to the real job of teaching must surely have been tempting – an expression of professionalism in the face of DfE unprofessionalism, indeed – and it really tells us nothing about how secondary school teachers operate in very different, high-stakes circumstances.
The second set of data used by Ofqual to corroborate its key “teachers overmarking at the root of all our problems” claim comes in the form of sample data from AQA (the biggest exam board for English) showing how teachers’ marking differed from moderators’ marks within the well-established 6% “tolerance” (within which limits schools’ marking will be accepted). The two key tables are below:
As regards the data, it’s important to see it in the context of an important note in the Ofqual report’s supporting information (Appendix 2, p.115):
[T]he tolerance limit is normally no higher than 6 per cent of the total raw mark for the unit/component, rounded to the next whole number above (for example, 4.1 and 4.8 are both rounded to 5).
Now, anyone with a basic knowledge of maths knows that this is not the way we normally round: normally, anything above 0.5 goes up, and anything below goes down. This is not to say that the way the exam boards agreed to do the rounding back in 2000 is incorrect, as the intention was probably to add an extra bit of ‘rigour’ into the system rather than to prepare the data for later analysis. Nevertheless, using rounding in this way for the core raw data on which the analysis is built has significant repercussions. This is most easily shown by readjusting the data on the assumption that half the mean scores in the graphs above should have been in the column to the right (a proxy given the lack of raw data in the report, but one which probably underestimates the real change given the shape of the assumed distribution curve).
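The difference between the boards’ “always round up” convention and ordinary rounding is easy to see in a few lines of code (a minimal sketch; the raw marks used here are illustrative, not taken from the report):

```python
import math

def tolerance_ceiling(total_raw_mark: int, pct: float = 0.06) -> int:
    """The boards' rule: 6% of the total raw mark, rounded UP to the
    next whole number (so 4.1 and 4.8 both become 5)."""
    return math.ceil(total_raw_mark * pct)

def tolerance_standard(total_raw_mark: int, pct: float = 0.06) -> int:
    """The same 6% under conventional rounding (0.5 and above goes up)."""
    return math.floor(total_raw_mark * pct + 0.5)

# For a unit marked out of 80, 6% is 4.8 marks: both rules give 5.
print(tolerance_ceiling(80))   # 5
print(tolerance_standard(80))  # 5

# For a unit marked out of 68, 6% is 4.08 marks: the rules diverge.
print(tolerance_ceiling(68))   # 5 - rounded up under the boards' rule
print(tolerance_standard(68))  # 4 - would round down conventionally
```

In other words, whenever the true 6% figure falls just above a whole number, the boards’ convention grants a whole extra mark of tolerance compared with rounding as normally understood – which matters when the same figures are later reused as the basis of a statistical analysis.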
If you do this exercise for the top graph, you get a very different result from the one shown. For percentage variances in the way they are normally understood, we suddenly find a much smaller skew towards ‘overmarking’, with 20% of schools actually undermarking, and 70% of schools marking lower than, equal to, or at only 1% above the moderator (rising to 86% up to 2% above, and 96% up to 3% above).
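The proxy readjustment itself can be sketched mechanically (the counts below are hypothetical, since the report publishes no raw data; the function just implements the assumption stated above, that half of each column belongs one band to the right):

```python
from collections import Counter

def shift_half_right(counts: dict) -> dict:
    """Proxy readjustment: assume half of the schools in each variance
    band really belong one band (1 percentage point) to the right,
    approximating what the distribution would look like without the
    'always round up' tolerance convention.

    `counts` maps a variance band (teacher mark minus moderator mark,
    in %) to a number of schools."""
    adjusted = Counter()
    for band, n in counts.items():
        adjusted[band] += n / 2      # half of the column stays put
        adjusted[band + 1] += n / 2  # half moves one band to the right
    return dict(adjusted)

# Hypothetical distribution: variance band -> number of schools
raw = {-2: 5, -1: 15, 0: 40, 1: 25, 2: 15}
adjusted = shift_half_right(raw)
```

This is only a rough proxy, as the post says: with the real raw marks you could redo the rounding exactly, and the skew of the underlying distribution means the true shift is probably larger than halving each column suggests.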
Standing back and looking at this again, you get a quite different picture; nearly all schools mark within 3% of the moderator, even though the tolerance level – established in the first place because of the inherent subjectivity of marking subjects like this – is twice that. That’s actually quite impressive, and teachers should be congratulated, not condemned.
Then there’s the question of who’s getting the marking ‘wrong’. There is an assumption within the report that if teacher and moderator marking is out of kilter, then it is the teacher who is wrong. The report is careful not to blame teachers personally, focusing instead on the pressures they come under, but pressures operate both ways. There is no consideration at all in the report of potential pressure on moderators to ‘undermark’, even though we know perfectly well that exam boards themselves are under pressure to be seen as ‘hard markers’, or lose out in the coming tenders for single exam boards for each subject. It is at least arguable that the pressure on moderators to undermark might be as great as or greater than any pressure on teachers to overmark. This isn’t to say that no teachers at all knowingly overmark – it would be as foolish to assert that as it would be to assert that most moderators undermark – but to take only one side of the potential pressures to do either is not acceptable.
Ultimately, this is a cleverly worded whitewash of a report (and it is clever). All the talk of poor practice in Controlled Assessments, for example (which I won’t cover here for relative brevity’s sake), is little more than a smokescreen, and ignores the fact that the legal action now being taken against Ofqual relates not to a Controlled Assessment, but specifically to AQA’s ENG1F, which saw a 10% rise in the C/D grading boundary between January and June; this has a written paper, not a Controlled Assessment (see p.120 of the Ofqual report).
Perhaps most important of all – and the biggest weakness – is that the Ofqual report fails to address in a satisfactory way the key question of whether ‘comparable outcomes’ are an acceptable mechanism for running the whole exam system. As Chris Cook has also shown, Ofqual have largely ignored the possibility that this process, which effectively limits the number of pupils who can achieve each grade, might be unfair because it has failed to take account of real improvements in learning and teaching in schools over the last few years, even though they acknowledge that it is an important variable. Yet in this report, Ofqual simply rejects any notion that they might have some responsibility in this area:
There are also suggestions that this approach does not recognise genuine improvements in national performance in a subject. That is not the case. If exam boards had evidence that the level of performance was at odds with the statistical predictions (either because performance was better or worse than expected) we would expect them to be able to provide evidence for us to consider. In the time we have been using the comparable outcomes approach, there have been some examples of this – for example in AS level World Development and A level Critical Thinking.
And that’s it! Two examples of changes in two relatively obscure AS/A levels, and they feel their job is done. No sense of how or why exam boards, beset by the kind of pressures to downgrade that I’ve referred to above, might be even vaguely interested in upsetting the GCSE apple cart by acknowledging that children might be better taught than they were a few years ago. Does Ofqual really expect the exam boards to be pro-active? Of course they don’t, but it gets Ofqual off the hook.
Frankly, this is a disgraceful abdication of responsibility, and coming as it does at the very end of the report, it acts as a handy summary of what the report is really about – blaming everyone else for what’s gone wrong.
And of course, behind the Ofqual screen of supposed independence, sits Gove, smiling at the latest success in his masterplan.