Tuesday, May 15, 2018

A testing glitch on STAAR--and a response to a question

Texas experienced some computer glitches today while administering the state testing program (the system shut down for a little more than an hour). A superintendent friend wrote me a note asking about the potential impact on the reliability of the results. I'm posting below what I wrote to him.

Reliability refers to (among other things) the kids doing about the same on parallel tests—but that assumes similar conditions for each test. The conditions for this test compared to another would be different given the interruption—therefore it is reasonable to suspect that the results would differ.

For example, kids who tried hard before the break may conclude that the grownups don’t care enough to create a system that works, and then not take the part after the break seriously. You could check for that by comparing scores before and after the break to see if effort decreased. If it did, then that definitely affects the reliability of the scores, since those kids would very likely perform differently on a parallel form of the test. There are lots of similar issues that could be considered in the same vein: increased stress after the break, kids seeing their teachers' stress during the break, possible sharing of test items among students (unless the kids were made to sit with their heads on their desks for an hour and a half and not say a word to each other), etc.

If you were testing as a research activity with no pressure on the kids or the schools and this happened (you're not—but bear with me):
  1. You would declare the data suspect until you could perform additional analysis.
  2. You would likely do a study after the fact to determine if the gap had an impact and the degree of the impact.
  3. If you could determine that the impact was either consistent (say, 2 points for every student) or nonexistent, you might decide to include the data, maybe with some adjustments, but certainly with a big footnote.
  4. If the impact was all over the place and no patterns could be found, you may need to declare the data corrupt, toss it, and figure out a plan B.
  5. In the end, assuming the data were going to be used, a researcher would probably want to repeat at least a sample of the study as a point of comparison—just to be sure. No good researcher is likely to be entirely confident in the results until they could confirm that their conclusions would be the same regardless of whether or not the gap in testing time had occurred.
The point is that as a research activity this would present a mess, but there are tools that can try to make sense of what happened and adjust. Still, this would give a researcher fits and be far from ideal.
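
To make steps 3 and 4 concrete, here is a minimal Python sketch of the first thing I might look at: each student's change in score from before the break to after it. Everything in it is an assumption for illustration: the function name `break_effect`, the made-up scores, and the idea that the two parts of the test are similar enough in difficulty to compare directly (a real analysis would have to check that, for instance against students who tested without an interruption).

```python
from statistics import mean, stdev


def break_effect(before, after):
    """Return (mean_shift, spread) of per-student score changes across the break.

    before, after: percent-correct scores for the same students, in the same
    order, on the items answered before and after the interruption.
    """
    if len(before) != len(after) or len(before) < 2:
        raise ValueError("need paired scores for at least two students")
    # Positive shift = the student did better after the break.
    shifts = [a - b for b, a in zip(before, after)]
    return mean(shifts), stdev(shifts)


# Hypothetical data: every student drops by 2 points (a consistent impact).
consistent = break_effect([70, 80, 90, 60], [68, 78, 88, 58])

# Hypothetical data: the changes are all over the place.
scattered = break_effect([70, 80, 90, 60], [50, 85, 70, 75])

print("consistent:", consistent)
print("scattered:", scattered)
```

A spread near zero points toward a consistent impact that could perhaps be adjusted for, with the big footnote. A large spread points toward the "all over the place" case, where no single adjustment would be fair to every student.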

State testing is not a study or a research activity, but a high stakes event. If a researcher isn’t going to trust their conclusions when something like this occurs without a whole lot of checking, the same must be true for a state test. What differs, however, is that while any adjustments or manipulations made during research affect data, any adjustments made to the state test scores affect both students and their schools. That cannot be resolved with a footnote.

This is what happens when you ask a screwdriver to do the work of a hammer. Even the most reliable of these tests are not designed as instruments to judge schools or kids, but rather as useful but limited analytical tools. They should never be used as accountability tools, the way states use them now. The argument that the results from today's tests are probably going to be less reliable than they should be is likely true, but a more reliable result wouldn't solve the underlying problem that these tests are being wrongly used.

Finally, and with feeling, even when a reliable test score is asked to serve as a judge of school quality or student performance, the number of false positives and false negatives in the judgments will be ridiculous. A student who has historically struggled may be slightly below the “passing” score as a result of great teaching, while another may be slightly above it because both parents have PhDs and the student is coasting through school. A declaration of failure for the first student or the school is as wrong as the declaration of success for the second student, and most certainly for the school.

Analyzing both students and their schools is in the design of such tests. Judging either is not.
