Wednesday, May 16, 2018

Why state tests make lousy instructional tools

Items are selected for tests according to the test's purpose. If a teacher is building a unit test from scratch, they will write items that reflect what they need students to learn, and the expectation would be that most students who paid attention at all would answer the majority of them at least partially correctly: it would be rare for a student who was at least partially present to score a zero. We say that those items signal the learned/not-learned moment. The statistics behind those items are not particularly important: an important item that all students answered correctly would signal good teaching and learning, while one they all answered incorrectly would signal the opposite. The point here is to assess learning.

Researchers interested in analyzing people have long known that if you can order human beings on a trait or characteristic, you can detect patterns worth exploring. In the case of negative patterns, such as an ordering that shows women generally making less than men for the same job, we may want to disrupt that pattern and attempt to do so. A later ordering can then show whether such efforts are working or whether the pattern persists.

Many years ago researchers wanted to order students in terms of their literacy and numeracy attainment; they wanted to see what patterns existed so they could disrupt those they found to be negative. To do this they invented the basic methodology of standardized testing, which, just like a teacher-developed test, required items specific to the purpose. Only this time the purpose was not the learned/not-learned split but the above/below-average split. To the naked eye this can appear to be mere semantics, given that both sets of items contain content from a domain, and yet they are very different, to the point that they should not be substituted for each other.

Items used in standardized tests are selected for their ability to sort students into above-average and below-average piles and, with enough items, to sort the students into an ordering from the student furthest below average to the student furthest above average. The only items that can do this are those that about half the students will answer correctly and half incorrectly (or, in the case of a four-response multiple-choice item, roughly 63% correct, since some of the kids will guess correctly and statisticians want to take that into account; note that every state test I've ever reviewed follows this pattern).
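
To make those percentages concrete, here is a minimal sketch in Python (my own illustration, not anything drawn from a state testing program) of the arithmetic behind the ~50% and ~63% targets: a right/wrong item separates test takers best when its variance, p(1 - p), is at its maximum, which happens at p = 0.5, and random guessing on a four-option item shifts the observed percent correct at that point up to about 63%.

    # A minimal sketch of the arithmetic behind the ~50% / ~63% targets
    # mentioned above (my own illustration, not from any state program).

    def item_variance(p: float) -> float:
        """Variance of a single right/wrong item with proportion correct p."""
        return p * (1 - p)

    def guessing_adjusted_target(n_choices: int) -> float:
        """Observed proportion correct when half the group knows the answer
        and the other half guesses at random among n_choices options."""
        chance = 1 / n_choices
        return 0.5 + 0.5 * chance  # equivalently chance + (1 - chance) / 2

    print(item_variance(0.50))           # 0.25, the maximum sorting power
    print(item_variance(0.95))           # 0.0475, nearly everyone correct, little sorting power
    print(guessing_adjusted_target(4))   # 0.625, i.e. ~63% for a four-option item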

The first item sorts kids into two piles (above or below average). The second item sorts kids into three piles (above average, average, and below average), and then each subsequent item creates more piles until you get enough to be useful (it generally takes around 35-40 items). Each item needs to contribute to this sorting to a finer and finer degree. This is why an item that all students answer correctly or incorrectly at the field trial stage will be eliminated when the test is constructed—such items may well reflect on learning, but they show all the students as being the same, and thus don't contribute to the sorting. Only items that exhibit a very specific pattern of responses are useful for this purpose.
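
A rough simulation of that pile-making is below, under a deliberately crude assumption: every item sits near the 50% mark and responses are independent, so it illustrates only the counting argument, not how real students behave.

    # A rough simulation of the pile-making described above. Assumption:
    # every item is near 50% difficulty and responses are independent.

    import random

    random.seed(0)

    def simulate_total_scores(n_students: int, n_items: int, p_correct: float = 0.5) -> list[int]:
        """Total scores when each item is answered correctly with probability p_correct."""
        return [sum(random.random() < p_correct for _ in range(n_items))
                for _ in range(n_students)]

    for n_items in (1, 2, 10, 40):
        scores = simulate_total_scores(1000, n_items)
        print(f"{n_items:>2} items -> {len(set(scores))} distinct score piles")

With a single item there are only two possible piles; by the time you reach the 35-40 items mentioned above, the distinct totals are spread finely enough to order an entire cohort.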

It is this narrow statistical limitation that renders these items inappropriate for informing instruction. Let's say as a teacher I want to know whether students can multiply two-digit numbers: I could ask them to do so with a sheet of problems and check their responses. But as a standardized test developer I would pick only the items that separated the students into an above-average and a below-average pile (i.e., items from that very narrow statistical range). Out of ten such items on my teacher-developed test, perhaps only one would fall into that range (multiplying some numbers is trickier than multiplying others).

Now consider the two different views of a student who completes the ten problems in class and answers, say, eight of them correctly, but misses the one problem selected for the standardized test. The truth is that the student is actually doing pretty well on two-digit multiplication and maybe needs a little coaching, but that will be entirely missed if the instructional inference is made only from the item that fits the standardized-test requirements. That inference would be that the student doesn't know two-digit multiplication, which is false. The instructional responses would be very different, with one being appropriate to the student's needs and the other not. When I said earlier that trying to make instructional inferences from standardized testing can lead a teacher down the wrong path, this is exactly what I meant.
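
Here is a toy version of those two views. The item labels and responses are made up purely to mirror the 8-of-10 example above; item "g" stands in for the one problem that happened to land in the statistical range a standardized test requires.

    # A toy version of the two views above, using made-up responses.
    # Item "g" is the one problem that survived standardized-test screening.

    classroom_results = {
        "a": 1, "b": 1, "c": 1, "d": 1, "e": 0,
        "f": 1, "g": 0, "h": 1, "i": 1, "j": 1,   # 1 = correct, 0 = incorrect
    }

    classroom_share = sum(classroom_results.values()) / len(classroom_results)
    retained_item = classroom_results["g"]

    # Classroom inference: 8 of 10 correct, so mostly there, needs a little coaching.
    print(f"Classroom view: {classroom_share:.0%} correct on two-digit multiplication")

    # Standardized inference: the lone retained item was missed, so the student
    # "doesn't know two-digit multiplication" -- which the fuller picture contradicts.
    print(f"Standardized view: {'missed' if retained_item == 0 else 'answered'} the one retained item")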

This principle applies to the whole of standardized testing.

The selection of standardized tests as an accountability tool was surprising at the outset: these are limited analytical tools, not judgment tools. Nevertheless, they produce consistent results given their design and purpose, and that consistency appealed to policy makers even though it has never meant what they think it means. What really surprised me was when policy makers declared such tests useful instructional tools. That would be like me declaring a screwdriver a hammer: no amount of trying will make it up to the task. And yet that is where we find ourselves: in a bit of a confusing mess.

Tuesday, May 15, 2018

A testing glitch on STAAR--and a response to a question

Texas experienced some computer glitches today administering the state testing program (the system shut down for a little more than an hour). A superintendent friend wrote me a note asking about the potential impact on the reliability of the results. I'm posting below what I wrote to him.
--------------

Reliability refers to (among other things) the kids doing about the same on parallel tests—but that assumes similar conditions for each test. The conditions for this test compared to another would be different given the interruption—therefore it is reasonable to suspect that the results would differ.

For example, kids who tried hard before the break may decide that the grownups don't care enough to build a system that works and not take the part after the break seriously; you could check that by comparing scores before and after the break to see whether effort decreased. If it did, then that definitely affects the reliability of the scores, since those students would very likely perform differently on a parallel form of the test. There are lots of similar issues that could be considered in the same vein: increased stress after the break, students seeing their teachers' stress during the break, possible exposure to items between students (unless the kids were made to sit with their heads on their desks for an hour and a half and not say a word to each other), and so on.
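
If you had the item-level data, that effort question could be checked with something as simple as a paired comparison of each student's accuracy before and after the interruption. The numbers below are invented purely to show the shape of the check; I have no access to the actual response data.

    # A hypothetical check for the effort question raised above.
    # The numbers are invented; they only illustrate the analysis.

    from scipy import stats

    # Each student's proportion correct on items answered before and after the gap.
    pre_break  = [0.72, 0.65, 0.80, 0.55, 0.90, 0.60, 0.75, 0.68]
    post_break = [0.66, 0.60, 0.78, 0.41, 0.88, 0.50, 0.70, 0.61]

    diffs = [post - pre for pre, post in zip(pre_break, post_break)]
    print(f"Mean change after the break: {sum(diffs) / len(diffs):+.3f}")

    # Paired t-test: is the drop larger than chance alone would suggest?
    result = stats.ttest_rel(pre_break, post_break)
    print(f"t = {result.statistic:.2f}, p = {result.pvalue:.3f}")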

If you were testing as a research activity with no pressure on the kids or the schools and this happened (you're not—but bear with me):
  1. You would declare the data suspect until you could perform additional analysis.
  2. You would likely do a study after the fact to determine if the gap had an impact and the degree of the impact.
  3. If you could determine that the impact was either consistent (say 2 points for every student) or there was no impact you may decide to include the data and maybe make some adjustments, but certainly with a big footnote.
  4. If the impact was all over the place and no patterns could be found, you may need to declare the data corrupt, toss it, and figure out a plan B.
  5. In the end, assuming the data were going to be used, a researcher would probably want to repeat at least a sample of the study as a point of comparison—just to be sure. No good researcher is likely to be entirely confident in the results until they could confirm that their conclusions would be the same regardless of whether or not the gap in testing time had occurred.
The point is that as a research activity this would present a mess, but there are tools that can try to make sense of what happened and adjust (a rough sketch of that decision follows below). Still, this would give a researcher fits and be far from ideal.
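
As a sketch of the decision in steps 3 and 4, in my own framing and with hypothetical thresholds and units: if the estimated per-student impact of the interruption is roughly uniform, an adjustment plus a footnote might salvage the data; if it is all over the place, there is no single correction to apply.

    # A rough sketch of steps 3 and 4 above (my framing, hypothetical
    # thresholds and units, not an actual testing-program procedure).

    import statistics

    def assess_impact(per_student_shift: list[float], spread_threshold: float = 2.0) -> str:
        """per_student_shift: estimated score change attributable to the gap,
        one value per student, in (hypothetical) scale-score points."""
        mean_shift = statistics.mean(per_student_shift)
        spread = statistics.stdev(per_student_shift)
        if abs(mean_shift) < 0.5 and spread < spread_threshold:
            return "no meaningful impact detected: keep the data, note the event"
        if spread < spread_threshold:
            return f"consistent impact (~{mean_shift:+.1f} points): adjust, keep, add a big footnote"
        return "impact varies too much to correct: treat the data as corrupt and find a plan B"

    print(assess_impact([-2.1, -1.8, -2.3, -1.9, -2.0]))   # roughly uniform shift
    print(assess_impact([-6.0, 0.5, -3.2, 2.1, -8.4]))     # scattered, no clean fix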

State testing is not a study or a research activity, but a high stakes event. If a researcher isn’t going to trust their conclusions when something like this occurs without a whole lot of checking, the same must be true for a state test. What differs, however, is that while any adjustments or manipulations made during research affect data, any adjustments made to the state test scores affect both students and their schools. That cannot be resolved with a footnote.

This is what happens when you ask a screwdriver to do the work of a hammer: even the most reliable of these tests are not designed as instruments to judge schools or kids, but as useful yet limited analytical tools. They should never be used the way states now use them, as accountability tools. The argument that the results from today's tests are probably going to be less reliable than they should be is likely true, but a more reliable result wouldn't solve the underlying problem that these tests are being wrongly used.

Finally, and with feeling, even when a reliable test score is asked to serve as a judge of school quality or student performance, the number of false positives and false negatives in the judgments will be ridiculous. A student who struggles historically may be slightly below the “passing” score as a result of great teaching, while another may be slightly above it because both parents have a PhD and the student is coasting through school. A declaration of failure for the first student or the school is as wrong as the declaration of success for the second student and most certainly the school.

Analyzing both students and their schools is in the design of such tests. Judging either is not.