Sunday, January 5, 2020

How standardized tests do what they do (which isn’t what most people think)

“Standardized test” is the name most people give to the tests used in state accountability systems, commercially available norm-referenced tests, and college admissions tests such as the ACT and SAT. I have long encouraged folks to drop the term “standardized,” since it refers merely to the conditions under which tests are administered, rather than to what this narrow family of tests is and does.

Instead, I prefer to call them predictive tests. This describes what they are intended to do.

I have also strongly encouraged a more critical use of vocabulary regarding predictive testing. This is because of the massive confusion that results from the plethora of terms now applied to testing that don’t mean what most people think they mean, such as standards-based or criterion-referenced.

What sets a predictive test apart from all other forms of testing is its ability to produce predictive scores. Simply (and crudely) put, if I am slightly above average this year you can predict that I will probably be slightly above average next year. If I am not, if I am well above or well below average, you can note it and begin the search for causes. Perhaps there are lessons to be learned or perhaps not, but as a signal for where to look, such test scores have some use.

Confusion is created when people presume that their names for testing, such as standards-based or criterion-referenced, describe forms of testing that sit alongside predictive testing as alternatives. This is inaccurate. If the tests produce consistent results across administrations, they are first and foremost predictive tests. You may have drawn the content from a state’s written standards and labeled it a standards-based test, or drawn a line in the sand and assigned it a label, in which case you created a criterion (as you have assigned a score a meaning that is external to the test). Or you may have conducted a comparative study after the fact that allowed you to apply norms. Regardless, the style of testing in which you are operating is predictive.

And, by the way, creating this narrow sort of instrument requires real specialization and training, as the sorting function will only occur in a consistent fashion with test items that perform within a narrow set of statistical criteria, and that combine to create a specific effect. This is a far cry from a teacher building a test to understand the effectiveness of their teaching or whether students learned a lesson; the two are not even in the same ballpark. The last thing a teacher should care about regarding learning is whether their items sort kids into a curve, yet that sorting is the first requirement for a predictive test to work.

The greatest mistake people make with a predictive test is to presume that the consistency in the results has more meaning than it does, when the fact is that the meaning is surprisingly limited.

The consistency is created by first finding the average and then calculating how far from the average each test taker is. Since averages are reasonably consistent over time, as is a student’s relationship to the average, the results will be as well.
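To make that mechanism concrete, here is a minimal sketch in Python. The student names and scores are invented for illustration, and expressing distance from average in standard deviation units is my assumption about the simplest way to do it, not a description of how any particular testing company scales its scores.

```python
from statistics import mean, pstdev

def distance_from_average(scores):
    """Return each test taker's distance from the group average,
    expressed in standard deviation units (assumes scores are not all equal)."""
    values = list(scores.values())
    avg = mean(values)
    spread = pstdev(values)
    return {name: (score - avg) / spread for name, score in scores.items()}

# Hypothetical raw scores for the same five students in two consecutive years.
year_one = {"Ana": 52, "Ben": 61, "Cal": 70, "Dee": 79, "Eli": 88}
year_two = {"Ana": 58, "Ben": 66, "Cal": 77, "Dee": 84, "Eli": 95}

z1 = distance_from_average(year_one)
z2 = distance_from_average(year_two)

for name in year_one:
    print(f"{name}: year one {z1[name]:+.2f}, year two {z2[name]:+.2f}")
```

The raw scores rise from one year to the next in this made-up data, but each student’s position relative to the average barely moves, and that position is the only thing the method reports.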

The usefulness in this is that a student’s position is predictive as described above, and movement can be explored for potential lessons. The resulting orderings are also useful in that they show broad patterns behind them, often regarding socioeconomics, gender, race, etc. As researchers identify these and policies and procedures are put in place, future parallel instruments can be used to understand the effectiveness of those policies and procedures by noting whether or not negative patterns dissipate.

A perfect ordering on an entire domain is simply not possible—that would result in a test that was thousands of items long. Instead, test makers locate a few items that will order students about the same as if the ordering were done on the entire domain. This makes the test a proxy for the domain, and still useful in spite of the fact that it is not a statistically representative sample of it. So long as the ordering on the limited selection of content will be roughly the same as on the entire body of content it is still useful in the hands of a thoughtful researcher who understands how the tested content was derived.
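The proxy idea can be sketched with a small simulation. Everything here is an assumption made for illustration: the simulated students, the 1,000-item “domain,” the simple logistic response model, and the crude screening rule that keeps the 40 items closest to 50 percent correct. Real item selection is far more involved; the sketch only shows that a short, screened set of items can order students much as the full set does.

```python
import math
import random

def average_ranks(values):
    """Rank values from low to high; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def rank_agreement(xs, ys):
    """Correlation of the two orderings (Spearman-style, with tie handling)."""
    return pearson(average_ranks(xs), average_ranks(ys))

rng = random.Random(2020)  # fixed seed so the sketch is repeatable
abilities = [rng.gauss(0, 1) for _ in range(300)]      # simulated students
difficulties = [rng.gauss(0, 1) for _ in range(1000)]  # simulated "entire domain"

# answers[s][i] is True when simulated student s answers item i correctly.
answers = [[rng.random() < 1 / (1 + math.exp(d - a)) for d in difficulties]
           for a in abilities]
domain_scores = [sum(row) for row in answers]

# Crude screening: keep the 40 items whose proportion correct is nearest 0.5.
p_correct = [sum(row[i] for row in answers) / len(answers)
             for i in range(len(difficulties))]
chosen = sorted(range(len(difficulties)), key=lambda i: abs(p_correct[i] - 0.5))[:40]
proxy_scores = [sum(row[i] for i in chosen) for row in answers]

agreement = rank_agreement(domain_scores, proxy_scores)
print(f"rank agreement between proxy and full domain: {agreement:.2f}")
```

A value near 1 means the 40-item ordering closely tracks the 1,000-item ordering. Note what the sketch does not show: the 40 items say nothing about which parts of the domain any one student has or has not learned.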

The fact that such tests are proxies for the larger domain adds another limitation to the scores: they are estimates only, with some amount of imprecision in each. That just means that while a majority of the time students taking similar tests on consecutive days will score similarly, some will not, and some will have scores that differ a great deal. Again, in the hands of a researcher who understands these limitations and that the scores are simply a broad signal for where to look for patterns and causes, these limitations don’t render the results useless. While they are limited, they can be useful so long as that use can tolerate the fact that scores are estimates based on a proxy and nothing more.
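The imprecision can be sketched the same way. This is an illustration under assumptions of my own: 500 simulated students, a 40-item test taken on two consecutive days with no change in what anyone knows, a simple logistic response model, and arbitrary cutoffs of 3 points for “similar” and 8 points for “a great deal.”

```python
import math
import random

rng = random.Random(5)  # fixed seed so the sketch is repeatable

def take_test(ability, difficulties):
    """Number of items answered correctly in one sitting of the test."""
    return sum(rng.random() < 1 / (1 + math.exp(d - ability)) for d in difficulties)

abilities = [rng.gauss(0, 1) for _ in range(500)]  # simulated students
items = [rng.gauss(0, 1) for _ in range(40)]       # item difficulties for the test

# Same students, same underlying ability, two sittings on consecutive days.
day_one = [take_test(a, items) for a in abilities]
day_two = [take_test(a, items) for a in abilities]
gaps = [abs(x - y) for x, y in zip(day_one, day_two)]

close = sum(g <= 3 for g in gaps)  # scores within 3 points of each other
far = sum(g >= 8 for g in gaps)    # scores 8 or more points apart

print(f"within 3 points: {close} of {len(gaps)} students")
print(f"8 or more points apart: {far} of {len(gaps)} students")
```

Nothing about the simulated students changes between the two days, so every gap is noise from the estimate itself. That is the imprecision any use of the scores has to tolerate.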

The primary confusion comes because the predictive test methodology produces reasonably consistent scores over time even though the test is based on a proxy for the entire domain. The resulting estimates (scores) are still sufficiently consistent over time to allow for researchers to find some value in them. But that doesn’t magically transform them into something they are not, opening up a world of uses beyond their design. Any use that assumed so would be silly.

Which is why the use of state test scores can rightfully be called silly. They are derived from the predictive test methodology yet are treated not as proxies, but as representative of an entire domain, worthy of teaching to and guiding learning when that cannot be the case. They are treated not as estimates useful for research, but as absolutes on which to base judgments. And worst of all, they are treated as signals of quality when that was never in their design.

This last point has been particularly disastrous for schools that serve students from historically marginalized communities. It is a fact that if you order students as of a day on a domain such as literacy—whether via proxy or a more complete measure—and some aspect of society contributes heavily to students’ ability to acquire knowledge within that domain, the ordering will reflect that. But as of that moment no judgment is available to be made. Some set of students may be behind because of real failure in their efforts or those of the school, in which case remedies for failure should be available and applied. But they may just as well be behind due to a lack of opportunity. In that case a failure judgment and remedy would be wrong, even unethical, as it would be the wrong remedy.

Rather, a different remedy should be applied that addresses the issue of being behind as being behind, but not failure. Mislabeling the problem would be a huge mistake as it would create perceptions that may not be real, force actions that run counter to need, and justify historical biases. Even worse, labeling being behind as failure risks converting being behind to failure, in which case the current system of test-based accountability could be said to have been a contributing cause to the further suffering of those who can least afford it, to the detriment of our nation as a whole.

In short, every role educational policy asks predictive tests to play is outside and beyond their design, and the bad assumptions behind those roles produce a great many ill effects. Predictive tests cannot be used to judge quality or effectiveness, guide or drive instruction, or indicate the effectiveness of policy.

So, there you have it: predictive tests work by being predictive, but in order to be predictive they can’t be much else, and they certainly cannot be used as the primary tool in school accountability. The sooner we all realize that fact the better.
