Wednesday, January 15, 2020

The gross misunderstanding in educational accountability

For a word used with ease in educational policy circles, accountability is a term that is surprisingly misunderstood and misused.

Seeing this is relatively simple. Ask an audience to brainstorm a list of terms they associate with accountability and a pattern will quickly emerge. Many of the words will be positive, such as:
  • Transparency
  • Effectiveness
  • Responsibility
  • Outcomes
And many of the words and phrases will be negative, such as:
  • Feet to the fire
  • Testing due to lack of trust
  • Blame
  • Shame
If you list these words in two columns on a sheet of paper, what you will be observing are the two sides of accountability.

The negative terms represent what happens when an organization refuses to be accountable and/or is perceived as failing. In that case, accountability is something imposed on that organization by outside stakeholders for the purpose of bringing the organization in line. Such an accountability focuses the organization on failure prevention at the expense of everything else.

The positive terms represent what happens in effective organizations. These are organizations that internalize the principles behind these terms and attempt to exemplify them in their efforts.

This type of accountability focuses the organization on how best to sustain itself long-term, and how best to communicate its effort to its stakeholders.

Both types of accountability are perfectly valid depending on the circumstance.

What should be clear is that the objective for any organization should be an accountability focused on long-term sustainable excellence. This properly aligns the organization with its long-term goals and the idea of continuous improvement.

What should also be clear is that when stakeholders impose an accountability of failure prevention, they must do so thoughtfully. Its intent is not long-term sustainable excellence, but just the opposite: an immediate, short-term failure correction. The intent of an imposed accountability is to focus the organization and its resources on correcting the failure at the earliest possible moment, because otherwise the organization’s existence may well be at risk.

An imposed accountability’s purpose is thus temporary: to force an immediate correction after which the organization can turn its focus towards long-term sustainable excellence. When an organization is having its feet held to the fire its job is not long-term sustainable excellence but something else. The sooner it can correct its errors and turn its attention towards long-term sustainable excellence, the sooner it can return to a state of effectiveness.

It would be deeply illogical, and harmful, to require any organization to operate in the perpetual shadow of an imposed accountability when the goal is long-term effectiveness. The reason for this is simple: it would make failure prevention the formal focus of the organization, and thus attempts at long-term effectiveness would be perceived as secondary.

Even if the organization’s leaders recognized they were in an illogical system and attempted to focus stakeholders on their long-term approach, the imposed accountability would likely triumph, because it exists at the behest of stakeholders while the long-term approach does not. At best this would cause any positive message to be diluted, and at worst to be ignored or not believed.

Getting the balance right is always a challenge, as organizations consist of lots of moving parts and it will regularly be the case that some of those parts are deserving of an imposed accountability. So long as such accountabilities are temporary, that part of the organization can correct itself and return to a focus on the long term. In that case the overall accountability system will be seen as contributing to the overall well-being of the organization.

The objective must be for any organization to spend the majority of its existence in an accountability focused on long-term sustainable excellence, and as little time as possible under the pressure of an imposed accountability. Only then will it be in a position to deliver effectively for its stakeholders.

Sunday, January 5, 2020

How standardized tests do what they do (which isn’t what most people think)

Standardized test is the name most people assign to the tests used in state accountability systems, commercially available norm-referenced tests, and college admissions tests such as the ACT and SAT. I have long encouraged folks to drop the term “standardized,” since that merely refers to the conditions under which tests can be administered, rather than what this narrow family of tests is and does.

Instead, I prefer to call them predictive tests. This describes what they are intended to do.

I have also strongly encouraged a more critical use of vocabulary regarding predictive testing. This is because of the massive confusion that results from the plethora of terms now applied to testing that don’t mean what most people think, such as standards-based, or criterion-referenced.

What sets a predictive test apart from all other forms of testing is its ability to produce predictive scores. Simply (and crudely) put, if I am slightly above average this year you can predict that I will probably be slightly above average next year. If I am not, if I am well-above or below average, you can note it and begin the search for causes. Perhaps there are lessons to be learned or perhaps not, but as a signal for where to look such test scores have some use.

Confusion is created when people presume that the kinds of testing they name, such as standards-based or criterion-referenced, are alternatives that sit alongside predictive testing. This is inaccurate. If the tests produce consistent results across administrations, they are first and foremost predictive tests. You may have drawn the content from a state’s written standards and labeled it a standards-based test, or drawn a line in the sand and assigned it a label, in which case you created a criterion (as you have assigned a score meaning that is external to the test). Or you may have conducted a comparative study after the fact that allowed you to apply norms. Regardless, the style of testing in which you are operating is predictive.

And, by the way, creating this narrow sort of instrument requires real specialization and training, as the sorting function will only occur in a consistent fashion with test items that perform within a narrow set of statistical criteria, and that combine to create a specific effect. This is a far cry from a teacher building a test to understand the effectiveness of their teaching or whether students learned a lesson—that isn’t even in the same ballpark. The last thing a teacher should care about regarding learning is whether their items sort kids into a curve, while that concern is first and foremost in order for a predictive test to work.

The greatest mistake people make with a predictive test is to presume that the consistency in the results has more meaning than it does, when the fact is that the meaning is surprisingly limited.

The consistency is created by first finding the average and then calculating how far from the average each test taker is. Since averages are reasonably consistent over time, as is a student’s relationship to the average, the results will be as well.
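For readers who like to see the arithmetic, here is a minimal sketch of that idea. It assumes that "how far from average" is expressed in standard deviation units (a z-score), which is one common convention and not necessarily what any particular test vendor does. The function name and the scores are made up for illustration.

```python
from statistics import mean, pstdev

def distance_from_average(scores):
    """Express each score as its distance from the group average,
    measured in standard deviation units (an assumed convention)."""
    avg = mean(scores)
    spread = pstdev(scores)
    if spread == 0:
        # Everyone scored the same, so nobody is above or below average.
        return [0.0 for _ in scores]
    return [(s - avg) / spread for s in scores]

# Hypothetical raw scores for the same five students in two years.
# Every raw score rises by five points, so the average rises too.
year_one = [10, 20, 30, 40, 50]
year_two = [15, 25, 35, 45, 55]

z_year_one = distance_from_average(year_one)
z_year_two = distance_from_average(year_two)
```

Although every raw score changed, each student's position relative to the average is the same in both years, which is the sort of consistency described above.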

The usefulness in this is that a student’s position is predictive as described above, and movement can be explored for potential lessons. The resulting orderings are also useful in that they show broad patterns behind them, often regarding socioeconomics, gender, race, etc. As researchers identify these and policies and procedures are put in place, future parallel instruments can be used to understand the effectiveness of those policies and procedures by noting whether or not negative patterns dissipate.

A perfect ordering on an entire domain is simply not possible—that would result in a test that was thousands of items long. Instead, test makers locate a few items that will order students about the same as if the ordering were done on the entire domain. This makes the test a proxy for the domain, and still useful in spite of the fact that it is not a statistically representative sample of it. So long as the ordering on the limited selection of content will be roughly the same as on the entire body of content it is still useful in the hands of a thoughtful researcher who understands how the tested content was derived.
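A toy illustration of the proxy idea follows. It is only a sketch under stated assumptions: the eight-item "domain," the four-item "proxy," and the responses are all invented, and real item selection relies on statistical criteria that are not modeled here. The point is simply that a well-chosen subset of items can order students the same way the full set would.

```python
def rank_order(totals):
    """Return student indices ordered from highest total to lowest."""
    return sorted(range(len(totals)), key=lambda i: totals[i], reverse=True)

# Hypothetical responses: rows are students, columns are items
# (1 = correct, 0 = incorrect). Eight items stand in for the full domain.
responses = [
    [1, 1, 1, 1, 0, 0, 0, 0],  # student 0
    [1, 1, 1, 1, 1, 1, 1, 1],  # student 1
    [0, 0, 0, 0, 0, 0, 0, 0],  # student 2
    [1, 1, 1, 1, 1, 1, 0, 0],  # student 3
    [1, 1, 0, 0, 0, 0, 0, 0],  # student 4
]

# Hypothetical proxy: only four of the eight items are administered.
proxy_items = [1, 3, 5, 7]

full_totals = [sum(row) for row in responses]
proxy_totals = [sum(row[i] for i in proxy_items) for row in responses]

full_order = rank_order(full_totals)
proxy_order = rank_order(proxy_totals)
same_ordering = full_order == proxy_order
```

Here the four-item proxy orders the students exactly as the eight-item version does, even though it covers only half the content. With real data the match would be approximate, not exact.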

The fact that such tests are proxies for the larger domain adds another limitation to the scores: they are estimates only, with some amount of imprecision in each. That just means that while a majority of the time students taking similar tests on consecutive days will score similarly, some will not, and some will have scores that differ a great deal. Again, in the hands of a researcher who understands these limitations and that the scores are simply a broad signal for where to look for patterns and causes, these limitations don’t render the results useless. While they are limited, they can be useful so long as that use can tolerate the fact that scores are estimates based on a proxy and nothing more.
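One way to make "estimates only" concrete is the standard error of measurement from classical test theory, computed as the score standard deviation times the square root of one minus the test's reliability. That formula is not mentioned above and is brought in here only as an illustrative assumption. The standard deviation, reliability, and scores below are hypothetical.

```python
import math

def standard_error(score_sd, reliability):
    """Classical test theory standard error of measurement:
    SD * sqrt(1 - reliability)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    return score_sd * math.sqrt(1.0 - reliability)

def score_band(score, sem):
    """Return the (low, high) band one standard error around a score."""
    return (score - sem, score + sem)

def bands_overlap(band_a, band_b):
    """True when two score bands share at least one point."""
    return band_a[0] <= band_b[1] and band_b[0] <= band_a[1]

# Hypothetical test: score standard deviation of 10, reliability of 0.91.
sem = standard_error(score_sd=10.0, reliability=0.91)

# Hypothetical student scoring 50 one day and 54 the next.
first_day = score_band(50.0, sem)
second_day = score_band(54.0, sem)
same_within_error = bands_overlap(first_day, second_day)
```

With these made-up numbers the standard error is about 3 points, so the bands around 50 and 54 overlap and the four-point difference is not clearly distinguishable from imprecision in the estimate.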

The primary confusion comes because the predictive test methodology produces reasonably consistent scores over time even though the test is based on a proxy for the entire domain. The resulting estimates (scores) are still sufficiently consistent over time to allow for researchers to find some value in them. But that doesn’t magically transform them into something they are not, opening up a world of uses beyond their design. Any use that assumed so would be silly.

Which is why the use of state test scores can rightfully be called silly. They are derived from the predictive test methodology, yet they are treated not as proxies but as representative of an entire domain, worthy of teaching to and of guiding learning, when that cannot be the case. They are treated not as estimates useful for research, but as absolutes on which to base judgments. And worst, they are treated as signals of quality when that was never in their design.

This last point has been particularly disastrous for schools that serve students from historically marginalized communities. It is a fact that if you order students as of a day on a domain such as literacy—whether via proxy or a more complete measure—and some aspect of society contributes heavily to students’ ability to acquire knowledge within that domain, the ordering will reflect that. But as of that moment no judgment is available to be made. Some set of students may be behind because of real failure in their efforts or those of the school, in which case remedies for failure should be available and applied. But they may just as well be behind due to a lack of opportunity. In that case a failure judgment and remedy would be wrong, even unethical, as it would be the wrong remedy.

Rather, a different remedy should be applied that addresses the issue of being behind as being behind, but not failure. Mislabeling the problem would be a huge mistake as it would create perceptions that may not be real, force actions that run counter to need, and justify historical biases. Even worse, labeling being behind as failure risks converting being behind to failure, in which case the current system of test-based accountability could be said to have been a contributing cause to the further suffering of those who can least afford it, to the detriment of our nation as a whole.

In short, every role educational policy asks predictive tests to play is outside and beyond their design, with a profound number of ill effects that come from their bad assumptions. Predictive tests cannot be used to judge quality or effectiveness, guide or drive instruction, or indicate the effectiveness of policy.

So, there you have it: predictive tests work by being predictive, but in order to be predictive they can’t be much else, and they certainly cannot be used as the primary tool in school accountability. The sooner we all realize that fact the better.