Wednesday, May 16, 2018

Why state tests make lousy instructional tools

Items are selected for tests according to the test’s purpose. A teacher building a unit test from scratch will write items that reflect what the students needed to learn, with the expectation that most students will answer the majority of them at least partially correctly; it would be rare for a student who was even partially engaged to score a zero. We say that those items signal the learned/not-learned moment. The statistics behind those items are not particularly important: an important item that all students answered correctly would signal good teaching and learning, while one they all answered incorrectly would signal the opposite. The point here is to assess learning.

Researchers interested in studying people have long known that if you can order human beings on a trait or characteristic, you can detect patterns worth exploring. In the case of a negative pattern, such as an ordering that shows women generally making less than men for the same job, we may want to disrupt that pattern and attempt to do so. A later ordering can then show whether those efforts worked or the pattern persists.

Many years ago researchers wanted to order students in terms of their literacy and numeracy attainment; they wanted to see what patterns existed so they could disrupt the negative ones. To do this they invented the basic methodology of standardized testing, which, just like a teacher-developed test, required items specific to the purpose. Only this purpose was not the learned/not-learned split but the above/below-average split. To the naked eye this can appear to be mere semantics, given that both sets of items contain content from the same domain, and yet the two are so different that they should not be substituted for each other.

Items used in standardized tests are selected for their ability to sort students into above- and below-average piles, and, with enough items, to sort the students into an ordering from the student furthest below average to the student furthest above average. The only items that can do this are those that about half the students will answer correctly and half incorrectly (or, in the case of a four-response multiple-choice item, ones that about 63% will answer correctly, since some of the kids will guess right and statisticians want to take that into account; note that every state test I've ever reviewed follows this pattern).
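That ~63% figure falls out of a simple guessing adjustment. Here is a minimal sketch of the arithmetic, assuming half the students genuinely know the item and the other half guess at random among the four options:

    # Assumption: half the students truly know the answer; the rest
    # guess at random among the four response options.
    p_know = 0.5             # proportion who genuinely know the item
    n_options = 4            # four-response multiple choice
    p_guess = 1 / n_options  # chance of a correct guess

    # Expected proportion answering correctly: those who know it,
    # plus the lucky guessers among those who don't.
    p_correct = p_know + (1 - p_know) * p_guess
    print(f"{p_correct:.1%}")  # 62.5%, i.e. the ~63% figure above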

The first item sorts kids into two piles (above or below average). The second item sorts kids into three piles (above average, average, and below average), and each subsequent item creates more piles until you have enough to be useful (it generally takes around 35-40 items). Each item needs to contribute to this sorting to a finer and finer degree. This is why an item that all students answer correctly or incorrectly at the field-trial stage will be eliminated when the test is constructed: such items may well reflect learning, but they show all the students as being the same and thus contribute nothing to the sorting. Only items that exhibit a very specific pattern of responses are useful for this purpose.
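As a toy illustration (not any state's actual procedure), here is a sketch of that field-trial filter under assumed numbers: student abilities and item difficulties are simulated, and any item that nearly everyone answers correctly or incorrectly is discarded because it does nothing for the sort:

    import random

    random.seed(1)
    N_STUDENTS, N_ITEMS = 1000, 60

    # Simulated student abilities and field-trialed item difficulties.
    abilities = [random.gauss(0, 1) for _ in range(N_STUDENTS)]
    difficulties = [random.uniform(-3, 3) for _ in range(N_ITEMS)]

    def p_value(difficulty):
        """Proportion of students answering an item correctly
        (a crude deterministic response model, for illustration only)."""
        return sum(a > difficulty for a in abilities) / N_STUDENTS

    # Items nearly everyone gets right or wrong say nothing about the
    # ordering, so only those near the 50% mark survive.
    kept = [d for d in difficulties if 0.4 <= p_value(d) <= 0.6]
    print(f"kept {len(kept)} of {N_ITEMS} field-trialed items")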

It is this narrow statistical limitation that renders these items inappropriate for informing instruction. Let's say that as a teacher I wanted to know whether students can multiply two-digit numbers: I could ask them to do so with a sheet of problems and check their responses. But as a standardized-test developer I would pick only the items that separated the students into an above-average and a below-average pile (i.e., items from that very narrow statistical range). Out of ten such items on my teacher-developed test, perhaps only one would fall into that narrow range (multiplying some numbers is trickier than others).
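To make that selection concrete, here is a hedged sketch with invented p-values (the proportion of students answering each problem correctly); the specific problems and numbers are assumptions for illustration, not real field-trial data:

    # Hypothetical p-values for the ten problems on the classroom
    # sheet; every value here is invented for illustration.
    problems = {
        "12 x 11": 0.92, "20 x 30": 0.95, "25 x 25": 0.71, "47 x 68": 0.52,
        "31 x 13": 0.88, "50 x 14": 0.81, "76 x 89": 0.34, "22 x 40": 0.90,
        "15 x 15": 0.77, "10 x 99": 0.93,
    }

    # The teacher keeps all ten; the test developer keeps only items
    # near the 50% mark that split students into above/below piles.
    selected = {p: v for p, v in problems.items() if 0.45 <= v <= 0.60}
    print(selected)  # {'47 x 68': 0.52} -- perhaps only one survives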

Now consider the two different views of a student who completes the ten problems in class and answers, say, eight of them correctly, but misses the one problem selected for the standardized test. The truth is that the student is doing pretty well on two-digit multiplication and maybe needs a little coaching, but that will be missed entirely if the instructional inference is made only from the item that fits the standardized-test requirements. That inference would be that the student doesn't know two-digit multiplication, which is false. The two instructional responses would be very different: one appropriate to the student's needs, the other not. When I said earlier that trying to make instructional inferences from standardized testing can lead a teacher down the wrong path, this is exactly what I meant.
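The two inferences can be written out in a few lines. The numbers (eight of ten correct, with the one narrowly-selected item missed) come from the example above; the 70% coaching threshold is an assumption made purely for illustration:

    # Results from the worked example: 8 of 10 classroom problems
    # correct; the one item in the narrow statistical band is missed.
    classroom_results = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
    state_test_item = classroom_results[3]  # the selected item; missed

    # Teacher's inference: proportion correct across a broad item set.
    if sum(classroom_results) / len(classroom_results) >= 0.7:  # assumed threshold
        print("Classroom view: mostly proficient; offer a little coaching.")

    # Standardized-test inference: a single narrowly-selected item.
    if state_test_item == 0:
        print("State-test view: 'doesn't know two-digit multiplication' (false).")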

This principle applies to the whole of standardized testing.

The selection of standardized tests as an accountability tool was surprising at the outset: these are limited analytical tools, not judgment tools. Nevertheless, they produce consistent results given their design and purpose, and that consistency appealed to policy makers even though it has never meant what they think it means. What really surprised me was when policy makers declared such tests useful instructional tools. That would be like me declaring a screwdriver a hammer: no amount of trying will make it up to the task. And yet that is where we find ourselves: in a bit of a confusing mess.
