What this year's Texas accountability ratings really mean
Here’s why everyone in the state of Texas should take the most recent accountability ratings provided (or about to be provided) by the Texas Education Agency and drop them in the trash.
The state of Texas changed its testing program from 2022 to 2023. It added more items, it added different kinds of items, it appears to have added some extraordinarily difficult items, and it moved the administration entirely online.
Whatever your feelings about standardized testing (mine are quite strong, as anyone who knows me is aware), the guidelines for how you make this sort of change are crystal clear: you start over.
The whole (and only legitimate) point of standardized testing is to create comparability of students via a test instrument, both at a moment in time and over time. Since the students (and the world around them) are obviously going to change and grow, detecting changes and growth is only possible if the design of the instrument stands still. If the instrument shifts at the same time as the students, the myriad moving parts make it impossible to know what any differences might mean.
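To put rough numbers on that idea, here is a minimal sketch in Python. Everything in it is hypothetical: the figures and the names (`observed_change`, `true_growth`, `instrument_shift`) are invented purely for illustration and are not drawn from any TEA data, and treating the two effects as simply additive is itself a simplifying assumption.

```python
# Hypothetical numbers, chosen only to illustrate the problem.
# Assumption: the change we see in average scores is the sum of how much
# students actually changed and how much the instrument itself moved.

def observed_change(true_growth, instrument_shift):
    """Year-over-year change in the average score, as it appears in the data."""
    return true_growth + instrument_shift

# Three very different stories about what happened to the students...
scenarios = [
    {"true_growth": 10, "instrument_shift": -25},   # students grew
    {"true_growth": 0, "instrument_shift": -15},    # students stood still
    {"true_growth": -10, "instrument_shift": -5},   # students lost ground
]

# ...all of which produce exactly the same change in the reported scores.
changes = [observed_change(**s) for s in scenarios]
print(changes)
```

All three scenarios report the same difference, so the score change alone cannot say which story is true. Only when the instrument stands still (a shift of zero) does the observed change equal the growth.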
This isn’t my opinion. Rather, it is contained in the standards that define what a standardized test is and how to use one so as not to make inappropriate or invalid inferences. You can buy a copy of the standards here if you’d like to read them for yourself: https://www.apa.org/science/programs/testing/standards. This point about changing tests is crystal clear.
Only TEA isn’t starting over. They’ve maintained the name of the test from the old to the new, and they have made attempts to align the various cut scores from the old to the new in the hopes that the meaning of the labels will be comparable, even though the tests are not.
They have even made an attempt to place the new test onto the same scale as the old, which again is something the standards advise be done hesitantly as the scales may be as different as a ruler based on centimeters versus one based on inches.
Researchers will frequently conduct studies when changing testing programs in an effort to broadly understand the differences in their tests, in their labels and cut scores, and in the available interpretations, but in doing so they readily acknowledge that their findings should be taken with a grain of salt as the amount of error across such a transition is likely to be significant.
Most importantly, the researchers will make it clear that they will never be able to explain away all the differences between the two instruments. What they know is there will be differences that escape interpretation. Hence, the need to start over.
What all this means is that whatever differences you happen to see between the ratings of a school last year and the ratings of a school this year need to be perceived as manufactured. Made up. Perhaps even the result of politicization. And thus tossed in the trash.
If you imagine defining goodness in society one year via the number of kind acts performed, and the next via the number of philanthropic donations to charities, you’ll see what I mean by manufactured. Both definitions may prove to be legitimate views of goodness, but they can’t be compared. So trying to say we had x amount of goodness last year, and a different amount this year isn’t possible. You changed instruments and thus also the definition of what it means to be good.
If someone declares a change in goodness from the available data, either they don’t understand comparability (the kind version of things) or they have an agenda (the cynical version of things) that can be easily detected in whether the resulting message is positive or negative.
So what is the message being relayed via the non-comparable test results from 2022 to 2023? It’s still early, but it appears that the message is that schools serving children in more challenging environments failed to grow their students, while schools serving more privileged populations grew them quite nicely. I can think of lots of politicized reasons for both of those (my cynical view).
Regardless, the only way to arrive at inferences that jump from one test design to an entirely new test design is to manufacture them. And since that is the case here (not an accusation, by the way, but a fact; TEA has said as much, though not in so many words), we are left with only one conclusion: from among the various formulas available, each of which would have produced a different result, this was the result they wanted.
One final note I’d like to make here is that TEA’s defense of all this will likely include the words valid and reliable in reference to both the old 2022 test design and the new 2023 test design. There is no doubt in my mind that both test instruments meet the requirements (again, see the standards) for what it means to have a valid and reliable test. But that doesn’t mean that the validity and reliability of the instrument can extend to whatever else TEA might want to do with the results.
Using that argument to justify the machinations undertaken here would be like telling someone who has never seen a car before that passing the state inspection renders that car fit to run in the Daytona 500. What is a valid understanding regarding windshield wipers, brake lights, and a working horn cannot be extended any further. So too with the highly technical considerations of test validity and reliability.
What TEA has done can only be seen as a manufactured effort because that is all that is possible in the switch to a new testing program. And the results of that manufactured effort need to be seen as intentional, since TEA could just as easily have selected a different manufacturing process with a different result and did not.
Thanks, John! As always, your perspective is spot on.
Perfectly captured. Thank you!
John, thank you for sharing your thoughts. I did not realize TEA had made base changes to the STAAR test. Fortunately, I'm out of the STAAR testing grind.
Thank you, John. https://vimeo.com/745603388