Monday, November 25, 2019

Response to a common set of questions on how best to use tests in an accountability system

I received a note the other day with an inquiry. It contained four questions. I took the opportunity to craft a response, which I’ll share below, since I get these sorts of questions a lot.

Here were the questions:

1. How can a standards based adaptive assessment used throughout the year be one tool used for accountability purposes?

2. If an assessment covering a set range of standards is used throughout the school year, what other factors need to be considered to more effectively determine if students are reaching developmentally appropriate learning targets?

3. Content mastery and student progress on state standards measure student proficiency towards specific items. How should student work samples, portfolios, or other student level artifacts be used as an indication of a school’s ability to develop independent young adults?

4. In terms of accountability, what value is there in communities creating annual measurable goals aligned to a 5 year strategic plan and progress towards those goals being the basis of accountability?

These questions are similar to those I get almost every day from people understandably trying to fit square pegs into round holes. There are multiple layers to a response.

First, accountability over the years has become synonymous with test scores and objective data. When trying to gather information about learning, proficiency, or progress, test scores are presumed to be the best, and often the only, source for answers. Even when other sources are considered, test scores tend to occupy the primary position in the conversation.

Second, coverage is now the dominant paradigm in learning. Coverage is now a common goal regarding a state’s content standards, and most other educational targets such as development, mastery, and progress are presumed to relate to the amount of content consumed. This is due almost entirely to the fact that the tests are said to cover a broad swath of content, and given that success is in those tests, success and coverage are presumed to be one and the same.

“Success” in such a system is in fact anything but, due entirely to the design of that system. Consider that tests that produce predictive results over time result in far less interpretive information than state accountability systems presume. The assumption on the part of the state is that a predictive test score is capable on its own of signaling success or failure, both of the student and the school. But that assumption belies the design. Predictive tests produce scores that indicate where a thoughtful educator or researcher may want to explore further, but they cannot contain within them the causes behind the indicator. In fact, the ability to make direct causal connections is removed during the design process in order to create stability in the results over time.

Once a cause is understood it may be worthy of judgment, but until it is explored any judgments (whether good or bad) are premature, made without evidence. Any judgments made prior to an exploration of causes will make an organization less, not more effective, because absent an understanding of cause any change is a shot in the dark at best. If an effect does occur it will be presumed the shot in the dark actions caused it, and those actions will be repeated or discarded without understanding if they did or did not contribute to the effect.

Any accountability that fails to allow for the identification of causes prior to judgments will do this. I know of no other field with an accountability that commits such an egregious mistake, as it is a recipe for confusion and inefficiency.

And please know, what I describe above is baked into the current design of educational accountability, which is why the questions you pose are so common. Underlying each question is a deep desire for effective teaching, deep learning, and preparing children for their lives, as well as the need to build long-term sustainable solutions. But that isn’t what the current system was designed to do, which is where the misfit comes from.

The best way to see this is to recognize that there are two sides to accountability. The first is easily understood if an audience is asked to list all of the terms they associate with accountability. Most will offer up things such as responsibility, transparency, effectiveness, outcomes, and success against mission. These are all positive and any effective leader includes all of them in their leadership practice.

But there is another side to accountability: the kind we impose upon organizations that refuse to be accountable. When it is necessary to impose an accountability, the positive terms are presumed to be absent and it becomes necessary to hold people to account, to motivate through blame and shame, to test claims due to mistrust, and to inflict punishment or sanctions when compliance does not occur.

The objectives in these two accountabilities are different. In the first the goal is a long-term sustainable effect. In the latter the goal is failure-avoidance. If the goal is failure avoidance there isn’t time to think about long-term sustainable effects as you aren’t yet there. First you need to prove you can avoid failure, then you can think about doing great things.

This is why in every case other than education, imposed accountabilities are temporary, meant to resolve a crisis in the short term so the organization can get back to long-term sustainable thinking. It would be folly to think that an imposed accountability can focus on long-term excellence as that is not in its purpose nor in its design.

It is this difference that defines the tension in the questions you pose. Those questions each contain the desire for long-term effectiveness, and yet they are being asked from within an imposed accountability environment designed to promote failure avoidance (the coverage paradigm of our current standards environment is a perfect example, as it is about control in support of failure avoidance, not long-term excellence). Our policies use language that aligns with long-term effectiveness while imposing a system designed as a short-term response to failure.

All of this is exacerbated by the selection of a predictive testing methodology that is assumed to do things for which it was never designed, most notably to signal on its own the success of a school or the quality of a student’s performance without actually knowing the cause.

With that as a context let me now start to address the questions you pose a bit more directly.

Any test score, be it a predictive test score with its underlying psychometrics, or a classroom quiz, is a form of evidence. But in order to serve as evidence for a thing you must first have a sense of what that thing is. Evidence is necessary to answer critical questions such as: who is learning? What are they learning? Who is not learning? What is preventing learning from happening?

None of these on their own are answerable through a single evidentiary source, and each question requires sources other than test data to create a sufficient understanding regarding what to do next. Any action that attempts to treat any data source as absolute risks a decision based on incomplete evidence, which makes the decision invalid, even if by luck it happens to be the right decision. In any case it makes the organization less, not more effective, by creating dissonance between the effects that can be observed and their causes. This in turn risks promoting the wrong causes for observed effects, which is never a good thing.

Finally, accountability in effective organizations occurs at a level that both the technical experts within a profession and the amateurs outside it can relate to and understand. Think about a visit to the doctor and you'll understand what I mean. Those of us who are not doctors can stare at a battery of test results for hours and still not understand what they mean. We may go on WebMD and attempt to view each indicator in isolation, but a meaningful interpretation requires a doctor with a much broader and deeper understanding than those of us who have not been through medical school possess.

The doctor does not start by taking us one by one through each of the dozens of tests, but rather starts at a level that both the amateur and the professional can relate to: the relative health of the patient. From there, the doctor can take a patient into the weeds for a deeper conversation where technical understanding is required, but through a lens appropriate to those of us without medical training.

The same is true for any profession that requires technical understanding: engineering, mechanics, computer programming, education, etc. In each of these there exists a level at which professionals and amateurs can have meaningful conversations about the work, and it is at that level that organizational accountability must occur.

It would be difficult, if not impossible, for outsiders to engage in a meaningful way with the technical part of an organization. The nature of technical information is such that the further into it you go the more likely you are to identify contradictions, counterintuitive thinking, and a lack of absolutes, which requires a technical understanding to work through and still be effective. Someone without that technical understanding is at risk of seeing the contradictions, counterintuitive thinking, and lack of absolutes as negative, as evidence of something other than what they had hoped to see.

It would be naïve to think that the non-technical person could dictate a response based on their limited understanding that would be meaningful, which is why it isn't done: it would make the organization less, not more effective. I don’t argue with an engineer over how far his or her beams can be cantilevered over an open space, but rather start at a point we both understand (what I want the building to look like) and let the professional then do their job.

Test scores represent technical information, especially predictive test scores with their psychometric underpinnings. As such they require technicians to interpret them properly given that those interpretations will often run counter to what an untrained eye might see. For example, an untrained eye may equate a low test score with failure and insist a school act accordingly. But a technician who understands such scores would first look to causes and other evidence before arriving at any conclusion.

It may be that the evidence suggests some amount of genuine failure exists, in which case the remedies for overcoming failure should be applied. But it may also be the case that the evidence suggests the student is simply behind his or her peers given that their exposure to academic content outside of school is limited. In that case the remedies for being behind should be applied, which are very different than the remedies for failure. To apply the wrong remedy would make the school and the student less, not more effective.

Starting with test scores as the basis for any accountability absent a technical interpretative lens creates this very risk. Test scores, contrary to popular opinion, are not simple to understand, do not produce immediately actionable results, and should not be interpreted bluntly. They are always in the weeds of an organization, part of the technical environment in which professionals work. While we should never be afraid of sharing them broadly, it is imperative that we take our outside stakeholders into them through an interpretive lens appropriate to both the amateur and the professional. The failure to do this will result in misunderstandings and frustration on all parts.

The answer to all four questions that started this off is this: educational decisions require a rich evidentiary environment that goes well beyond traditional data sources to understand the educational progress of a child. Tests can certainly be a part of that evidentiary environment, and better tests and assessments are useful in that regard and we should encourage their production. But better tests or better assessment vehicles do not solve the accountability problem.

That problem is only solved once we can shift from an imposed accountability focused on failure avoidance to a true accountability focused on long-term sustained excellence. Continuing to treat testing as our primary accountability source mires us in the technical weeds and as a result is highly likely to create misunderstandings regarding school effectiveness.

My advice: ask the right questions, treat test scores as one evidentiary source but never the only evidentiary source, question the interpretations alongside other professionals so that the best possible conclusions can be reached, and define success in any long-term plan by answering the question: what is it we hope to accomplish? rather than: what should we measure? That latter question will tie you up in knots as what is measurable empirically represents only a small percentage of what matters in a child's life and to a school.

Evidence is the proper term, as we can gather evidence on anything we need to accomplish so long as we can observe it. Focus at that level and you’ll arrive at meaningful answers to each of your questions.