Monday, March 26, 2018

The Fallacy in Commissioner Morath’s Argument that All Kids Can Pass STAAR

Last week the Texas Commissioner of Education, Mike Morath, again stated his belief that all students can pass each STAAR test and that therefore all students and all schools can be successful within the accountability program he is designing. His argument is this: STAAR is a criterion-referenced test, not a norm-referenced test, and thus all kids can pass it. When a friend of mine in attendance questioned this, Commissioner Morath acknowledged some superintendents did not believe this was the case and declared it a “difference of opinion.”

When it comes to the world of educational testing and educational accountability, I’m something of an expert. I’ve worked in that world for the better part of my career. I’ve written a book and a number of articles specifically about what such tests were designed to do (which is far more limited than most people think). I’ve read a ton by others way smarter than me on the subject whose work has helped inform my understandings, and I’d like to think that my work may have helped inform theirs.

Commissioner Morath’s statement is false. This is not a matter of differing opinions, but one of fact versus fiction. The fact that he continues using his flawed understanding of both criterion- and norm-referenced testing as the basis for his version of educational accountability requires it be called out directly. His claim, that all schools and students can be successful within the system he is designing, is simply not possible. That he wants it to be true is admirable, but if he truly wants it, he needs to select tools not designed to prevent it.

To see the issue clearly requires a basic understanding of several parts of standardized testing: ordering, statistical processes, scaling and norming, cut scores, and criterion-referenced tests. I promise, understanding them is not as difficult (nor as boring) as most would think.

The methodology behind standardized tests like those used in state testing programs was invented well over a hundred years ago as a means to order students. That ordering is from the student furthest below to the student furthest above average, based on the relative differences between students. Such tests have the advantage of being fairly stable over time, because orderings based on single traits change little from year to year. For a researcher trying to study human characteristics this is research gold: it provides a stable basis for research that would otherwise be impossible. It is that stability of scores that made such tests attractive to policy makers as well, their limitations notwithstanding.

One purpose for such orderings is that they allow researchers insights into populations of students they would not otherwise have. Most notably: ordering reveals patterns. For example, if I laid out a ruler with salaries on it, starting with a dollar at one end and a million dollars at the other, and then asked everyone in a single profession to stand on their salary, I would have a huge amount of information I could explore. I would need to go to each point in the ordering and see who is there and why, and what meaningful patterns exist (likely in this case that men tend to make more than women for the same job). Think of that as a two-step process: first, I create an ordering, and second, I look in the ordering for causes.

The design of standardized tests allows for this same sort of ordering of students in educational domains. An educator is then able to look for patterns that could otherwise not be seen, and then for causes as to what to do next. That they are now rarely used in that fashion is the fault of educational accountability, and a discussion for another time.

Test professionals must apply a host of statistical criteria to the items that make up such a test, without which the results would be useless from a research perspective. For example, if a student is slightly above average on last year’s test we would anticipate that the student would be slightly above average on this year’s test. If we saw that the student is now well-above or well-below average, that change needs to be shown to be meaningful for the test to be useful. Changes would not be meaningful to a researcher if they were random or haphazard or unpredictable. The statistical criteria allow test professionals to create tests capable of signaling when a change is meaningful. A researcher who observes meaningful changes through such tests would then conduct a further investigation to answer the how or the why.
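A rough sketch of that idea, with entirely invented numbers (the scores and the standard error below are illustrative, not STAAR’s): a researcher would flag a year-over-year change as meaningful only when it exceeds the noise inherent in the two measurements.

```python
import math

# Hypothetical illustration: deciding whether a year-over-year change
# in a student's scale score is "meaningful" rather than noise.
# The scores and SEM value are invented for this sketch, not STAAR's.

def is_meaningful_change(score_y1, score_y2, sem, z=1.96):
    """Treat a change as meaningful only if it exceeds the noise band
    implied by the standard error of measurement (SEM) of both scores."""
    noise_band = z * math.sqrt(sem**2 + sem**2)  # standard error of a difference
    return abs(score_y2 - score_y1) > noise_band

print(is_meaningful_change(350, 355, sem=10))  # small shift: within noise
print(is_meaningful_change(350, 420, sem=10))  # large shift: worth investigating
```

Note that even a “meaningful” change only tells the researcher that something happened; the how and the why still require further investigation, as the paragraph above describes.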

Scores from standardized tests can be difficult to interpret. For example, if two parallel test forms exist, and one is slightly more difficult than another (this is common), then a score of 30 out of 40 items means different things—the student taking the more difficult test could be said to have actually scored a little higher than the student taking the easier test, but because the scores are both 30 out of 40 that can be tricky to see. Test makers therefore convert raw scores to a scale of some sort (the SAT and ACT, for example, both do this for this very reason). This allows for the 30 on the easy test to be converted, say to a score of 350, and the 30 on the more difficult test to be converted to a score, say, of 360. Those scale scores are therefore more precise estimates of what each student did than the raw scores.
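The mechanics of that conversion can be sketched with a pair of hypothetical lookup tables (the raw-to-scale mappings below are invented for illustration; real equating tables come from statistical research on each form):

```python
# Hypothetical sketch of form equating: the same raw score on two
# test forms of different difficulty maps to different scale scores.
# These conversion tables are invented for illustration only.

easy_form = {29: 345, 30: 350, 31: 355}  # raw score -> scale score, easier form
hard_form = {29: 355, 30: 360, 31: 365}  # raw score -> scale score, harder form

raw = 30
print(easy_form[raw])  # 350
print(hard_form[raw])  # 360: same raw score, slightly higher estimate of ability
```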

That scale can also have norms applied to it: the conversion from raw to scale scores is adjusted so that the scale scores distribute students in a bell-shaped pattern. While this is a highly technical process, the end result is that scores can be more easily compared, for example, across grades and even across subjects. Given the complexity and expense of the process, normed tests have been largely limited to commercially available published tests, such as the Iowa Test of Basic Skills or the Stanford Achievement Test Series. They were often disliked by educators because someone would always have to be in the 1st percentile, and someone in the 99th. However useful to a researcher, that never sat well with teachers.
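Why someone must always land in the 1st and 99th percentiles falls out of the math. A minimal sketch, assuming a norm group with an invented mean and standard deviation (the parameters below are illustrative, not from any real test):

```python
from statistics import NormalDist

# Hypothetical sketch of norming: once a norm group's mean and standard
# deviation are established, any scale score converts to a percentile
# rank on the bell curve. The parameters are invented for illustration.

norms = NormalDist(mu=350, sigma=40)  # assumed norm-group mean and SD

def percentile(scale_score):
    """Percentile rank of a scale score within the norm group."""
    return round(100 * norms.cdf(scale_score))

print(percentile(350))  # 50: the average student sits at the 50th percentile
print(percentile(430))  # 98: two SDs above average
```

By construction, the distribution always has tails: whoever scores lowest in the norm group occupies the bottom percentile, no matter how well the group as a whole performed.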

Either a raw score or a scale score can have a line drawn at that point in the ordering and be declared a passing or a cut score. Medical and nursing exams, for example, do this as a protection to society. They work from the assumption that the applicant to a profession who scores at the bottom should not be allowed into the profession, but the applicant at the top should. They then work to find a point somewhere in between as the dividing line. Any passing score they land on suffers from the logic mentioned above in that test takers at that point in the ordering will be there for a host of reasons. However, the test maker can perform research so that for the most part those above that line are indeed qualified. It is far from a perfect system and is always somewhat arbitrary, but it serves a useful purpose and absent an alternative persists.

Commissioner Morath has declared that because STAAR is a criterion-referenced, not a norm-referenced test, all students can climb over an established cut score, and all schools can succeed. This is not a small argument, but the basis and justification behind his thinking. If he is wrong, his system falls apart.

He is wrong, and the misunderstanding is massive.

“Criterion-referenced” is a phrase that refers to the content of the test being drawn from a blueprint explicitly designed to represent an educational subject, with some set of criteria established for a passing score. This is most common in tests teachers use, which are not designed to sort students but to check for learning (all students can indeed pass that sort of test if they learned the material). However, in the past decade or two standardized test-makers (and those who make state tests) made an attempt to move their content in this direction as well. Ordering is a rational research tool, so the logic went, and the closer the content to what was taught, the more meaningful the ordering would be. Thus, both tests teachers give in the classroom, and those that order students, can be criterion-referenced. In fact, it is entirely possible to have a criterion-referenced test that is also normed.

The accurate, factual description of STAAR is this: STAAR is a criterion-referenced testing program because it draws its content from a blueprint that reflects state standards documents. It is a test designed to order students against that content, from the student furthest below to the student furthest above related to that content. The state then follows the model of nursing and medical exams and draws a passing line in the sand at a point in the ordering. STAAR is not a normed test, as no one performs the research that would allow for normed results. As a result, comparisons from one year to the next should be done cautiously, and certainly no comparisons are possible between subject areas.

We can state all this unequivocally thanks to the fact that Texas runs a very transparent system and publishes a technical manual annually that shows the application of all the statistical processes necessary to build a test that orders students. Those processes would not be present if the goal was something other than ordering students.

Commissioner Morath’s argument can be summarized as follows:
  1. STAAR is a criterion-referenced test.
  2. Because STAAR is not normed, it does not order students.
  3. All students can get past the cut score if schools would do their jobs.
  4. All schools can therefore succeed in the accountability program.
Only one of those is true: STAAR is a criterion-referenced test. The others are all false. STAAR orders students, which means that for all students to pass all students would have to get past some point on a test designed to order them. That would be the equivalent of all students being above average—it cannot and will never happen. As a result, to say that all schools can succeed in his plan is deeply disingenuous. It is instead a guarantee that such a thing will not happen.
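The core point can be reduced to a few lines. A minimal sketch, with invented scores: once a cut score is drawn at a point inside an ordering, some test takers sit below it by construction.

```python
# Minimal sketch of the core point: a cut score drawn at a point
# inside an ordering guarantees, by construction, that some test
# takers fall below it. The scores here are invented for illustration.

scores = sorted([310, 325, 340, 350, 355, 365, 380, 410])
cut = scores[len(scores) // 2]  # a cut drawn partway up the ordering

below = [s for s in scores if s < cut]
print(len(below))  # someone is always below a cut drawn inside the ordering
```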

If I want to build something an engineer tells me will fall down, that is not a difference of opinion. It is the difference between the position of a professional who understands, and someone who does not. Not heeding the position of the professional in the case of a building would be foolhardy since the thing is unlikely to stand.

Not heeding the position of professionals in the case of educational accountability is worse. It risks direct damage to the children the Commissioner is constitutionally required to serve by damaging the institutions that serve them. It is beyond foolhardy. STAAR orders students from the furthest below to the furthest above average. That is a verifiable fact. Not an opinion.