Monday, March 26, 2018

The Fallacy in Commissioner Morath’s Argument that All Kids Can Pass STAAR

Last week Texas Commissioner of Education, Mike Morath, again stated his belief that all students can pass each STAAR test and therefore all students and all schools can be successful within the accountability program he is designing. His argument is this: STAAR is a criterion-referenced test, not a norm-referenced test, and thus all kids can pass it. When a friend of mine in attendance questioned this, Commissioner Morath acknowledged some superintendents did not believe this was the case and declared it a “difference of opinion.”

When it comes to the world of educational testing and educational accountability, I’m something of a testing and accountability expert. I’ve worked in that world for the better part of my career. I’ve written a book and a number of articles specifically about what such tests were designed to do (which is far more limited than most people think). I’ve read a ton by others way smarter than me on the subject whose work has helped inform my understandings, and I’d like to think that my work may have helped inform theirs.

Commissioner Morath’s statement is false. This is not a matter of differing opinions, but one of fact versus fiction. The fact that he continues using his flawed understanding of both criterion- and norm-referenced testing as the basis for his version of educational accountability requires that it be called out directly. What he claims, that all schools and students can be successful within the system he is designing, is simply not possible. That he wants it to be true is admirable, but if he truly wants it he needs to select tools not designed to prevent it.

To see the issue clearly requires a basic understanding of several parts of standardized testing: ordering, statistical processes, scaling and norming, cut scores, and criterion-referenced tests. I promise, understanding them is not as difficult (nor as boring) as most would think.

The methodology behind standardized tests like those used in state testing programs was invented well over a hundred years ago as a means to order students. That ordering runs from the student furthest below average to the student furthest above average, based on the relative differences between students. Such tests have the advantage of being fairly stable over time because orderings based on single traits are fairly stable over time. For a researcher trying to study human characteristics this is research gold: it provides a stable basis for research that would otherwise be impossible. It is that stability of scores that made such tests attractive to policy makers as well, their limitations notwithstanding.

One purpose for such orderings is that they allow researchers insights into populations of students they would not otherwise have. Most notably: ordering reveals patterns. For example, if I laid out a ruler with salaries on it, starting with a dollar at one end and a million dollars on the other, and then asked everyone in a single profession to stand on their salary, I now have a huge amount of information I can explore. I would need to go to each point in the ordering and see who is there and why, and what meaningful patterns exist (likely in this case that men tend to make more than women for the same job). Think of that as a two-step process: first, I create an ordering, and second, I look in the ordering for causes.

The design of standardized tests allows for this same sort of ordering of students in educational domains. An educator is then able to look for patterns that could otherwise not be seen, and then for causes that suggest what to do next. That they are now rarely used in that fashion is the fault of educational accountability, and a discussion for another time.

Test professionals must apply a host of statistical criteria to the items that make up such a test, without which the results would be useless from a research perspective. For example, if a student is slightly above average on last year’s test we would anticipate that the student would be slightly above average on this year’s test. If we saw that the student is now well above or well below average, that change needs to be shown to be meaningful for the test to be useful. Changes would not be meaningful to a researcher if they were random or haphazard or unpredictable. The statistical criteria allow test professionals to create tests capable of signaling when a change is meaningful. A researcher who observes meaningful changes through such tests would then conduct a further investigation to answer the how or the why.

Scores from standardized tests can be difficult to interpret. For example, if two parallel test forms exist, and one is slightly more difficult than another (this is common), then a score of 30 out of 40 items means different things—the student taking the more difficult test could be said to have actually scored a little higher than the student taking the easier test, but because the scores are both 30 out of 40 that can be tricky to see. Test makers therefore convert raw scores to a scale of some sort (the SAT and ACT, for example, both do this for this very reason). This allows for the 30 on the easy test to be converted, say to a score of 350, and the 30 on the more difficult test to be converted to a score, say, of 360. Those scale scores are therefore more precise estimates of what each student did than the raw scores.
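
As a rough illustration of that conversion, here is a minimal Python sketch. The lookup tables and the function name are hypothetical; only the 30-to-350 and 30-to-360 pairs come from the example above, and real testing programs derive such tables through statistical equating rather than by hand.

```python
# Hypothetical raw-to-scale lookup tables for two parallel forms.
# Only the entries for a raw score of 30 come from the text;
# the neighboring entries are invented to make the tables look plausible.
EASY_FORM = {29: 340, 30: 350, 31: 360}
HARD_FORM = {29: 350, 30: 360, 31: 370}


def to_scale_score(raw_score, conversion_table):
    """Look up the scale score that corresponds to a raw score on one form."""
    return conversion_table[raw_score]


# The same raw score lands at different points on the common scale,
# depending on which form the student took.
easy_result = to_scale_score(30, EASY_FORM)
hard_result = to_scale_score(30, HARD_FORM)
print(easy_result, hard_result)
```

The point of the sketch is only that the form matters: identical raw scores need not mean identical performance.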

That scale can have norms applied to it. Norming adjusts the conversion from raw to scale scores so that the scale scores distribute students in a bell-shaped pattern. While this is a highly technical process, the end result is that scores can be more easily compared, for example, across grades and even across subjects. Given the complexity and expense of the process, normed tests have been largely limited to commercially available published tests, such as the Iowa Test of Basic Skills or the Stanford Achievement Test Series. They were often disliked by educators because someone would always have to be in the 1st percentile, and someone in the 99th. However useful to a researcher, that never sat well with teachers.
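
One simplified way to picture norming is sketched below in Python. It assumes a small, made-up norm group, and the 500/100 scale and the function names are hypothetical choices; the procedures test publishers actually use are considerably more involved.

```python
from statistics import NormalDist


def percentile_rank(score, norm_group):
    """Share of the norm group scoring below `score`, counting ties as half."""
    below = sum(1 for s in norm_group if s < score)
    equal = sum(1 for s in norm_group if s == score)
    return (below + 0.5 * equal) / len(norm_group)


def normed_scale_score(score, norm_group, mean=500.0, sd=100.0):
    """Place a raw score on a bell-shaped scale via its percentile rank.

    Assumes `score` appears in `norm_group`, so the percentile rank is
    strictly between 0 and 1 (inv_cdf rejects exactly 0 or 1).
    """
    z = NormalDist().inv_cdf(percentile_rank(score, norm_group))
    return mean + sd * z


# Hypothetical norm group of raw scores.
norm_group = [10, 20, 30, 40, 50]
middle = normed_scale_score(30, norm_group)
higher = normed_scale_score(40, norm_group)
print(round(middle), round(higher))
```

Because the scale is built from ranks within the norm group, someone always occupies the bottom of it and someone the top, which is the feature educators found hard to accept.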

Either a raw score or a scale score can have a line drawn at that point in the ordering and be declared a passing or a cut score. Medical and nursing exams, for example, do this as a protection to society. They work from the assumption that the applicant to a profession who scores at the bottom should not be allowed into the profession, but the applicant at the top should. They then work to find a point somewhere in between as the dividing line. Any passing score they land on suffers from the logic mentioned above in that test takers at that point in the ordering will be there for a host of reasons. However, the test maker can perform research so that for the most part those above that line are indeed qualified. It is far from a perfect system and is always somewhat arbitrary, but it serves a useful purpose and absent an alternative persists.
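
Drawing such a line is mechanically simple, as this hypothetical Python sketch suggests. The scale scores and the cut score of 355 are invented for illustration; deciding where the line belongs is the hard part, and it is not something code can settle.

```python
def passing_rate(scale_scores, cut_score):
    """Fraction of test takers at or above the cut score."""
    passed = [s for s in scale_scores if s >= cut_score]
    return len(passed) / len(scale_scores)


# Invented scale scores for four test takers, already in order.
scores = [340, 350, 360, 370]
rate = passing_rate(scores, cut_score=355)
print(rate)
```

Moving the cut score up or down changes who passes without changing the ordering underneath it.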

Commissioner Morath has declared that because STAAR is a criterion-referenced, not a norm-referenced test, all students can climb over an established cut score, and all schools can succeed. This is not a small argument, but the basis and justification behind his thinking. If he is wrong, his system falls apart.

He is wrong, and the misunderstanding is massive.

“Criterion-referenced” is a phrase that refers to the content of the test being drawn from a blueprint explicitly designed to represent an educational subject, with some set of criteria established for a passing score. This is most common in tests teachers use, which are not designed to sort students but to check for learning (all students can, indeed, pass that sort of test if they learned the material). However, in the past decade or two standardized test-makers (and those who make state tests) made an attempt to move their content in this direction as well. Ordering is a rational research tool, so the logic went, and the closer the content to what was taught, the more meaningful the ordering will be. Thus, both the tests teachers give in the classroom and those that order students can be criterion-referenced. In fact, it is entirely possible to have a criterion-referenced test that is also normed.

The accurate, factual description of STAAR is this: STAAR is a criterion-referenced testing program because it draws its content from a state blueprint that reflects standards documents. It is a test designed to order students against that content, from the student furthest below average to the student furthest above. The state then follows the model of nursing and medical exams and draws a passing line in the sand at a point in the ordering. STAAR is not a normed test, as no one performs the research that would allow for normed results. As a result, comparisons from one year to the next should be done cautiously, and certainly no comparisons are possible between subject areas.

We can state all this unequivocally thanks to the fact that Texas runs a very transparent system and publishes a technical manual annually that shows the application of all the statistical processes necessary to build a test that orders students. Those processes would not be present if the goal was something other than ordering students.

Commissioner Morath’s argument can be summarized as follows:
  1. STAAR is a criterion-referenced test
  2. Because STAAR is not normed, it does not order students.
  3. All students can get past the cut score if schools would do their jobs.
  4. All schools can therefore succeed in the accountability program.
Only one of those is true: STAAR is a criterion-referenced test. The others are all false. STAAR orders students, which means that for all students to pass all students would have to get past some point on a test designed to order them. That would be the equivalent of all students being above average—it cannot and will never happen. As a result, to say that all schools can succeed in his plan is deeply disingenuous. It is instead a guarantee that such a thing will not happen.

If I want to build something an engineer tells me will fall down, that is not a difference of opinion. It is the difference between the position of a professional who understands, and someone who does not. Not heeding the position of the professional in the case of a building would be foolhardy since the thing is unlikely to stand.

Not heeding the position of professionals in the case of educational accountability is worse. It risks direct damage to the children the Commissioner is constitutionally required to serve by damaging the institutions that serve them. It is beyond foolhardy. STAAR orders students from the furthest below to the furthest above average. That is a verifiable fact. Not an opinion.


  1. As a teacher of English Language Arts, I'd give Commissioner Mike Morath an A+ in fiction; however, in the genre of informational expository text, he fails. Facts matter, and obviously, he is oblivious to the STAAR assessment as a standardized test. Thanks to John Tanner for pointing out the critical truths that educators of tested subjects know, but many outside the arena of public education may lack understanding. Mr. Morath, you need to attend tutorials. Perhaps you can contact Mr. Tanner.

    1. I spent two hours on the phone one morning with the Commissioner last spring and was unsuccessful. He insists his position is accurate, and since everything in his system falls apart if it is not, he stands to lose a great deal should he admit the truth. It is interesting to note that Texas now has a law on the books that insists on a school accountability system in which all schools can succeed--however, that is not the system he is building, and it never can be with tests like STAAR as the backbone. That clause in the law--introduced last session--may be a lever for people to start exploring.

  2. State passing scores on various STAAR tests seem to miraculously fall in the 70-80% passing region. This is extremely suspicious; very likely norms are being used on historical data on the back end to set criteria that produce the desired result. This approach will always produce high success rates but poorly educated students. Mr. Morath seems to be more concerned with quantity and not quality. TEA has yet to demonstrate that quality and quantity are not inversely related. As soon as a realistic and true standard is imposed, the failure rate soars...

  3. I agree with Mr. Morath. As a person who was in the education field and who then had two children with learning disabilities, I know that the STAAR is something that all children can pass. I am a parent in favor of the STAAR because it holds our schools and teachers accountable to all. Children can learn and pass it. You are told what to teach in TEKS for mastery. We all need to stop making excuses and start finding solutions on how to make those children who show no interest in doing what you are trying to program or instill in the others as the norm in education to those students who find no significance in going with the flow of others. Teaching needs to be looked at from a different perspective. Children now have technology at their fingertips and can find answers to questions a lot faster than when we were educated. Children are questioning what and how and why are we learning this? When I was growing up we didn't do that. We were programmed differently and because of that we learned the way we were programmed to. Education and its teaching have changed. Teachers are now programmed to teach around STAAR, and children find it less interesting, not significant, and not all have bought into this education program that is being instilled or programmed into them.

    1. Were you "in the educational field" as a teacher who administered the STAAR test to all students? How can you believe All students are capable of passing? What about students whose IQ is borderline MR with emotional disturbances and is ADHD? How can this student with few or no accommodations compete and pass on the very same test as students with average to high academic and intellectual abilities and who are not emotionally disturbed or hindered by focusing issues? What about students who bubble in answers to finish quickly? What about students who are dyslexic, complicated with little to no short or long term memory skills? What about non English language learners who don't care and neither do their parents? When it comes to testing, the fact is one size does not fit all. Never has. Never will. When you "worked in the education field," you were actually a classroom teacher?

    2. I completely agree. I have a child who has ADHD and a processing disorder. He receives a few accommodations, such as having the test read aloud, and he has taken the English 1 test 6 times and still can't pass; he has taken the English 2 EOC twice and failed both times. I have done the school's mandatory tutorials along with private tutors, etc. Nothing has worked yet, and chances are nothing ever will. He passes all of his classes every grading period and has never failed a grade, but these STAAR tests are just not passable for every student in school. It's totally not fair for him to do all his schoolwork, homework, projects, and reports, pass all year, and yet be told after 4 years of high school that he will not be able to graduate because of these ridiculous tests. It's very sad and I feel helpless as a parent, but he can't help that he can't get it. It's not his fault at all. These tests dumb these kids down more and more every time they take them and cause anxiety, depression, and feelings of worthlessness.

    3. I would like to know how my student who is functioning at about a 9 month old's ability level can pass a 6th grade STAAR test? He is nonverbal and does not communicate other than localizing with ahhh. There is no way to test him. Eye gaze doesn't work with any level of consistency. Has he learned this year, I believe so. But his accomplishment of sitting up independently or being more mobile aren't on the test. Get real.

  4. If you plot the STAAR Rasch model scale score of the raw scores versus the state percentile rank of the raw scores, the correlation is 0.98. The scale score is a percentile rank in disguise. By definition, a certain rank is defined as mastery. Morath is incorrect.

    1. That was: plot scale score versus state percentile rank to get the .98 correlation.

  5. Let's be realistic. Texas teaches to the STAAR test. All they are concerned with is data. As an educator who taught struggling readers, I had students who NEVER passed the STAAR. They showed gains, but because their reading levels were below the level of the test they were given, they never passed. Some districts do not have the money to set up classes for such students.

  6. THANK YOU for illuminating this fundamental flaw in our accountability system. What do you think about parents opting out to withhold our children’s scores from this broken system? If Commissioner Morath can’t acknowledge (or understand) how STAAR is used to sort students, teachers, schools, and school districts into “winners” and “losers”, where the “losers” are disproportionately poor, Black, Latinx, English language learners, then what happens if we deprive the system of the data it needs to sort? (Maybe 5%, 10%, or more students and subpopulations opting out?)

  7. The only hope I see in the testing swamp is to simply report scores and go on. So, if you want to go to college A, they can look at your STAAR scores, SAT scores, grades, AP classes, etc. and decide. You never pass or fail, you simply get what you get. When you graduate from high school, you receive a document that says, this is what you have done, accomplished and scored. If you want to be a soccer player, artist or chef, maybe your ability to pass English 4 is not all that important. If you want to go to Harvard, those scores and tests are going to be very important. Instead of expecting every student to meet an arbitrary bar, look for ways for every student to show success in some way, in some field. But be totally transparent, meaning that parents and students should not be misled, as they are now, into believing that a high school diploma means acceptance into the college of your choice.

  8. The reading tests have readability scores that are above grade level. If it is a test they all can pass, then why make a fourth grader read a story on 8th grade level?! Our student would have a chance to pass if the passages were actually on grade level.