Friday, August 24, 2018

Do you support sending our best teachers to our most challenging school environments?

The question in the title to this blog has been posed to me twice in the last week, which I think is due to states releasing their accountability judgments of schools just before the kids all come back.

Common sense might suggest that only an idiot would say no. I'm not an idiot (according to most people I meet) but I'm here to say that we'll do more harm than good if we oversimplify our responses and just say, "sure."

First, we need to identify what we mean by a best teacher. I can do that easily. A best teacher is one who can maximize an educational benefit for the children in his or her classroom. We could get even more specific if we wanted and say that the best teacher for a child is the one who can maximize the benefit for that specific child, but for the sake of the argument here, let's keep it general: a best teacher is one who can maximize the educational benefit for children.

Second, we need to identify what we mean by our most challenging school environments. I can also do that easily. Those would include places that have historically had poor school leadership, or teachers not committed to their profession, or that serve communities of children who, through no fault of their own, find themselves in environments that make learning a real challenge and who would benefit from additional help and support.

To take those best teachers and ask them to serve in our most challenging environments as described above would, by any measure, be a good use of a valuable resource.

However, there are huge issues with how we identify best teachers and our most challenging environments, given that state test scores tend to be the main mechanism for doing both. Schools with high test scores are presumed to be quality schools, while schools with low scores are presumed to be bad schools, and so too with the teachers in the building. Thus a simplistic approach to selecting the best teachers and placing them into our most challenging environments would be to take teachers from high-scoring places and put them into low-scoring places.

This would likely do more harm than good for two basic reasons: first, state test scores on their own (contrary to popular belief) were never designed to identify the quality of a teacher. Test scores of the type that produce consistent results over time are useful to researchers in that they can signal where a researcher ought to take a deeper look, but whether that researcher finds something worthy of a positive or negative judgment is a different issue entirely.

For example, a group of students from a low-scoring high school classroom may all still be in school due to a dedicated teacher who prevented them from dropping out—that is not failure on the part of the teacher, but rather, evidence that we might want that teacher in our most challenging environment.

Or consider a group of students in a high scoring school who are where they are entirely because of stable homes, highly educated parents, and a desire for higher education, and their performance can be shown to be the same regardless of which teacher was placed before them. That is not evidence of success on the part of those teachers, but of something else. And since we have no real evidence of their effectiveness, we have no way of knowing if placing them in our most challenging environment will have the desired effect.

If we assume that state test scores on their own identify the best teachers, we risk sending the wrong teacher to our most challenging environments, and the entire system of education would be less, not more, efficient as a result.

The exact same is true when trying to define our most challenging environments. If we rely on test scores as the signal, we risk disrupting schools in which effective work is taking place and not sending the best teachers to where they can be most effective. Replacing the teacher mentioned above who has proven capable of keeping at-risk students in school risks putting a less capable teacher in his or her place. And if you try to judge leadership through test scores without looking at underlying effects, you may replace an effective school leader, one whose strong community ties can be leveraged to support at-risk students, with a less effective one.

The risk is this: by relying on test scores to identify the best teachers and our most challenging environments, we risk sending unqualified teachers to replace qualified teachers, and we risk sending them to places where they are not needed as opposed to places where they are.

You do that and you'll hurt children far more than you'll help them--and mostly those who really need us. I'm all for sending the best teachers to our most challenging environments, but only once we have a valid means for identifying both.


Monday, August 13, 2018

School grades as snake oil that is good for no one

Just because someone offers you a snake oil cure for how to improve the quality of public schools doesn’t mean you have to swallow it. In fact, you shouldn’t.

The latest snake oil cure in Texas is school accountability via school grades. I know a good bit about school accountability—I make a living from the topic and have a deep-seated belief that true accountability is both necessary and achievable. And that snake oil isn’t the answer.

It’s easy to see the snake oil for what it is if you back up and ask a simple question: how does accountability work in successful organizations? I’ve explored the answer for years, written a book and a bunch of articles on the topic, and now work with schools to put in place what I’ve discovered. The answers to the question reveal the difference between a false accountability that will miss every policy goal it claims to support, and a true accountability that can move an organization in a desired direction.

I’ll mention three principles of true accountability to make my point.

Principle number one is that true accountability requires a complete, not a partial accounting. You wouldn’t invest in a company that provided one month’s worth of records and insisted it represented the entire year. You wouldn’t know the meaning behind a set of financial records without the shareholder’s report that explained the company’s performance and its plans for the future. Nor would you trust a non-profit that claimed to help people but refused to disclose how your donations were being spent.

Partial accountings that attempt to substitute for a full accounting must always be considered invalid. They can tell a story, but it will never be a full story, and risks being a wrong story. Any action taken on a partial accounting risks being a wrong action that makes things worse.

Principle number two is that true accountability must account for the mission of the organization, not just what is convenient to see or measure. The example I use all the time is the mission of the light bulb versus its measures. I can measure a great deal about a bulb. In fact, light bulbs have measurable standards down to the tiniest detail so that any light bulb will fit into any socket and be as bright as any other 60-watt bulb no matter who manufactured it.

But if I focus only on what can be measured I miss the mission. The Louvre at night, the lit stage at the Metropolitan Opera, or a city skyline just after dusk represent the mission of the light bulb. If you only care about what can be measured, you risk that mission never being realized. And most of what matters in life and in organizations is at the mission, not the measurement level.

Principle number three is that true accountability demands contextualization to be accurately understood. Raw data showing one company’s profits at 30% and another’s at 1% isn’t comparable absent a context. Grocery store chains build hugely successful business models at very low margins, while a tech company needs a much greater margin to keep up with a constantly changing world. Some companies considered successful haven’t yet turned a profit and don’t plan to for years.

Absent the context for each, no judgments can be made. The grocery business cannot be judged as more or less successful than the others absent the context. The business that has not yet turned a profit may in fact be the most successful of them all. It is the context that reveals the truth.

Note that context never equals excuse.

Make no mistake about carrying out these principles: some leaders do so far better than others, which in a quality organization necessitates changes. The most productive change occurs when individuals learn and improve, and while most can do just that, some either cannot or will not and for the good of the organization are asked to work elsewhere.

The snake oil of school grades violates each of these principles (and, just for the record, they violate each of the other principles not mentioned here as well) to the point that to call it accountability is to misname it.

First, school grades are by definition a partial accounting. They combine reading and math scores from end-of-year tests (which, given the design of the tests, don’t mean what most people think) with several other annualized variables to produce a grade. This is in fact a partial accounting of the few things being analyzed, which in turn are a partial accounting of what happens in schools. Therefore the stories that result risk being wrong.

Second, school grades occur at the moment of the measure and never account for the mission of schooling. The mission of schooling should be to maximize the educational benefit for each child in the finite amount of time we have them in the educational system, so they are well-prepared to tackle life. By stopping at the measure, we put that mission in jeopardy, which harms kids.

And third, school grades are presented absent a contextualization. What was the focus of the school for the year? What were the issues unique to the student population and what was done to properly serve those needs? What are the hopes and dreams of the parents for their children, and to what degree is the school making those a reality? To what degree was the school effective? Understanding the context may well reveal that a school with fairly low test scores is serving its students, their parents, and the community effectively, while a school with fairly high test scores is not. That truth would be useful and actionable. The potential falsehood presented through decontextualized test scores and their resulting decontextualized grades would not.

A better way is possible. Asking the main accountability question (for what and to whom?) offers anyone willing to ask it an insight into the mission of schooling and what it must attempt in order to properly serve students. A true accountability to that mission is a higher accountability than anything represented in the snake oil of school grades, and far more demanding of educators. A true accountability incents the truth, demands continuous improvement, and puts benefitting students front and center. It will reveal that some leaders do all of this better than others and insist that those who lag behind their peers learn and grow themselves, with consequences when they do not or cannot.

The school grading system in Texas needs to be recognized for the snake oil it is. Don’t buy into its false promises of being clear and meaningful or offering a true path to improvement. Its failure to align with even one of the accountability principles reveals it for the charlatan that it is, and if you pretend otherwise it is the children of Texas who will bear the brunt.

We can and must do better.

Wednesday, August 8, 2018

The most bogus claim I've heard in months: that school grades are fair

Education Commissioner Morath in the great state of Texas is about to release grades for schools. His quote:

"The idea that the design of the system was meant to highlight both high levels of student achievement and high levels of educator impact makes this essentially the fairest system in the history of the state of Texas." (Article by Julie Chang in the Austin American Statesman, August 7, 2018--see it here--italics are mine.)

The claim in italics is bogus. And it is easy for anyone to see why.

Think of what it means to assess a student. You can do that by imagining the full range of assessment done to create understanding regarding a student or a school as a large sphere, with many layers to it. Trying to understand all the complexity to properly assess student or school needs and assign appropriate judgments is a constant, ongoing thing. It requires trained teachers, lots of effort and energy, and proximity to the students being assessed.

Tests are, by design and definition, a focused, limited form of assessment. You can think of each test as focused on a small portion of the surface of the sphere. But that's it. No test exists that can provide a complete assessment, standardized tests aren't even designed to go below the surface, and even all the tests in the world can't do the full job of assessment.

When doctors use tests to guide their assessment of a patient, they generally do lots of tests--why? Because tests frequently produce contradictory or inconclusive results. It is then the doctor's job as the chief assessor to interpret the results from the various tests to the benefit of their patient. It is the job of educators to do the same. Should either not do that, their conclusions risk being dead wrong for either the patient or the student, with serious consequences either way.

Doctors understand the complete lack of validity in making a complex prognosis from a single test--it would be unethical to do so. Educators understand the complete lack of validity in extending the results from a single, narrow test to a broader judgment that ignores the more critical assessment sphere--and that it would also be unethical to do so.

It is imperative that if judgments about schools are going to be made they must address the entire assessment sphere and get to the level of understanding. That would be the definition of fairness. Anything short of that is, by definition, unfair. Commissioner Morath, who uses a single test as if it can assess the whole of a student and a school, would be wrong to suggest that what he has done is fair or that his judgments are accurate. The only thing standing in the way of the Commissioner seeing this is the lack of a simple understanding of assessment, of what a standardized test is, how such tests work, and the limits in their design.

Tuesday, July 10, 2018

On standards standardizing

Formal standards standardize something. It is useful to standardize some things, such as electrical outlets, allowable car emissions, and the minimum requirements to become a doctor, lawyer, or nurse. Standardizing the outlet means that any electrical device with a compliant plug will fit, regardless of who manufactured either of them. Standardizing allowable car emissions helps keep the air clean. Standardizing minimum requirements to enter certain professions is intended to ensure a basic level of quality and protect citizens from those selling snake oil.

It should be noted that the standards that exist in the world have a profound impact on each of our daily lives. I can go to any gas station and know that the gas nozzle will fit in my car, rather than having to find a Honda nozzle for my Honda car. I can buy a car and trust it will meet federal standards regarding emissions. I can go to the doctor and know that at the very least they met the compliance requirements to be a doctor, and because I know those requirements to be fairly robust, have some confidence in that person despite never having met them.

Standards are met through compliance. When the plug fits the socket, the level of emissions is below a threshold, or the prospective doctor completes the last of the requirements to be a doctor, it can be said that in each case compliance with the standard was accomplished.

Standards are about control. An ideal standard is one that controls only what it must to achieve some end. We control the dimensions of a plug, which creates an infinite number of possibilities for whatever is powered at the end of that plug. We eliminate the opportunity for car manufacturers to make a profit off of cars that are bad for the environment. We control who can and cannot become a doctor for the sake of protecting citizens from harm.

A terrible standard standardizes the wrong things, which in turn creates inefficiencies and/or frustration. Standardizing the length of the cord attached to the plug to one or fifty feet would be silly as it would not match with the needs of both consumers and their selected product. A standard that required every car to have a gas tank in a safe place (think Ford Pinto), an internal combustion engine, and a quality exhaust system would eliminate the creativity required to build electric cars. A “desire to help” as the standard for entering the medical profession would doom a great many patients.

In schools we have standardized the length of the cord and the internal parts to schooling, as well as removed any standard for who can make educational decisions or even run a school. In short, the wrong things.

We standardize who is in what grade by age, regardless of need, life experiences, or the distance to some set of meaningful goals. We standardize the inputs through lists of controlled content (appropriately called standards given that they control what is to be taught) aligned to standardized tests that will be administered as the output, with the goal that all students will leave standardized around a particular test score (which those of you who read my writings know is not actually possible), not to mention the bureaucratic requirements to standardize teacher actions throughout a school day.

And if you can fog a mirror but have never set foot in a school since you were a student, you are all but qualified to apply for and run a charter school in most states.

We have to standardize some things in a school, just as any organization does. But what if we had approached schools like those who standardized the ordinary three-pronged plug did and standardized only what was absolutely necessary? What would those things be?

It isn’t a simple question. Outcome-based education was one attempt to standardize outcomes and let the inputs vary according to need, but it ran into a political buzz saw (and for legit reasons: there are a million ways to get to a successful adulthood and standardizing one or two of those belies that fact). The standards movement tried to standardize teaching and learning, and now we have lots of bored kids not learning to the depths they should, and teachers tired of not being able to focus on what students actually need. The test-based accountability movement tried to standardize an outcome each year on a date certain, by which all kids would have learned the year’s worth of material and demonstrated that learning via a test score. The silliness in that is just embarrassing (read my book, or anything I’ve ever written if you want the full argument as to why).

We should think about that question: what could we standardize in education that would help us maximize the educational benefit we can provide to each and every student within the limited resources available to us? We need to identify those things. And then we should figure out how to get there.

Quick note: I posted something similar to this before and someone asked for sources. See chapter four in Pitfalls of Reform for what is still one of the more detailed descriptions of the problem I've written. I'll tackle this issue at length when I get around to finishing the book I've been working on since Pitfalls was published.

Thursday, July 5, 2018

On confirmation bias and test scores

It is a natural thing to seek out messages that confirm our preexisting beliefs or ideas about the world and other people. It is also fairly common to interpret unclear messages in a manner favorable to the beliefs or ideas we hold. However, just because it is natural or common does not mean it is also good or right. It is not—in fact, just the opposite.

The tendency to seek out confirming messages is called confirmation bias. The trouble with confirmation bias is that it always risks replacing the truth with what we might want to hear. Any resulting action then risks being the wrong action when the truth is considered, while appearing to be the right action given the bias. It allows any judgment and subsequent action to appear and feel appropriate, while it may be entirely wrong given the underlying realities.

Our preexisting beliefs or ideas can arise from a dizzying array of sources, can be subtle or blatant, and can be sexist or racist. They may stem from surface resentments, or from deep within the subconscious mind. Regardless, if all we do is seek out messages that match those biases, we risk living in a fictional bubble that assumes the supremacy and validity of our beliefs, while presuming, quite illogically, that it is everyone else that is illogical.

Consider the bias that exists generally in American society that schools in wealthy neighborhoods are better than those in poorer neighborhoods. It is fairly simple to see that as a bias through a simple example: consider that a school in a wealthier neighborhood is full of wealthy students whose families support and contribute to their children’s education while the school leaders coast on demographics and offer little by way of support. If so, the school deserves to be called out. Or consider that school leaders in a poor part of the community can be shown to spend each day making up for the effects of poverty and that because of the excellent decisions being made each of their students now has a meaningful chance at a good life. If so, those school leaders deserve our applause and our support.

Or it could be the opposite. Or anything in between. The point is, you have to go look. Unless you do, you risk an erroneous judgment and actions that are themselves illogical. It would be illogical and harmful to tell the poor school above that it is failing and must change everything when that is blatantly untrue, just as it would be to tell the wealthy school it is a wonderful success and to keep up the good work. Both are lies that help neither.

But assume that you really believe your bias that schools in poorer neighborhoods are bad, and your goal is to find evidence to that end. You might consider lining students up as of a particular day each year according to their literacy or numeracy attainment. Because numeracy and literacy attainment are heavily influenced by the quality of non-school experiences, and because students from poorer environments tend to have fewer quality experiences than their wealthier peers, we would expect to see that reflected in any such ordering performed on a given day. In other words, students who are poorer will tend to be in the lower portion of the ordering, while students who are wealthier will tend to be more towards the top.

From an objective perspective such an ordering offers nothing to judge, and certainly nothing to act on. It reveals patterns we can analyze and attempt to disrupt, but until we go look, no judgments can be made, or actions assigned. If we were ignorant enough to assign a judgment and subsequent actions to the schools mentioned above based on such an ordering, we risk judging both wrongly at the direct expense of the students.

Just for the record, state test scores are based upon a testing methodology that orders students from the student furthest below to the student furthest above average as of a date certain each year on reading and math. Their design limits their interpretive range to revealing patterns for exploration, but because they match a preconceived bias they have been ignorantly presumed to mean so much more.

A person who falls victim to their bias that a rich school is just better than a poor school risks seeing the test scores as a confirmation of that bias, elevating that bias in their mind to a truth: they believed something and now they have evidence for their belief (even though their belief was wrong, and they have no real evidence confirming it). Policy makers and society at large have long held just this sort of bias, and when they saw standardized test scores they inappropriately declared, “Aha! We told you so. We now must hold those terrible schools accountable, and reading and math test scores will be just the thing.”

Here’s how you can know once and for all this is just a stupid bias: we could just as easily have rank-ordered students on creative output, only that would not confirm the bias that schools in wealthy neighborhoods are better than schools in poor neighborhoods because the ordering would not follow socioeconomic lines. Like any ordering, that ordering might reveal patterns for us to explore, but that’s it.

And here’s the danger of a confirmation bias against poor schools: if you confirm that bias through a tool that will always put students with fewer quality numeracy and literacy experiences outside of schooling at the bottom, you condemn those schools to perpetual failure without ever considering the actual evidence. And the longer you do that, the more you insist that those schools change everything, the more inefficient you make those schools, which risks converting your bias to the truth.

In short, what was a bias grounded in fiction risks becoming a truth because you forced bad judgments and inappropriate decisions on an at-risk population that in turn serves to keep them in their marginalized place. Shame on any of us that ever thought that was an acceptable thing to do.

Thursday, June 28, 2018

Why we have to have a new and better accountability

True Accountability is arguably the most important concept for every public school to embrace if the purpose of schooling is to maximally benefit each and every student. True Accountability represents a significant mind shift from what currently passes for educational accountability, which is anything but a meaningful accountability. True Accountability is far richer, far deeper, and far more robust. It demands real leadership, it places student benefit front and center, and is about moving a school closer to the goal of maximally benefitting each and every student. That goal represents an ideal to strive for that requires constant effort.

A truly accountable leader regularly offers up an objective accounting regarding their area of responsibility and is able to move his or her organization from that point in a desirable direction. In the case of schools, that direction is towards an organization more capable of maximizing student benefit. A truly accountable leader understands that accountability to processes is a compliance function that is unlikely to move the organization in a desirable direction. A truly accountable leader understands as well that slavish attention to metrics risks a corrupt process that will fail to benefit students. Being accountable for maximizing student benefit is the best way to ensure that an accountability is properly placed.

True Accountability is based upon a set of principles derived from organizations capable of delivering on the promise of maximum benefit, principles that need to be internalized by school leaders who want True Accountability for their schools and students. It requires school leaders to develop a set of capacities that can serve as the bedrock for True Accountability. And it requires a set of structures that can provide the discipline to make the accountability work. The most recent (and the most promising) form True Accountability has taken is in the creation of Community-Based Accountability Systems.

The biggest obstacle to True Accountability has long been that the school accountability conversation has been co-opted by state actors with a biased view of public education as a failed institution. Those who criticize the system risk being labeled as unapologetic supporters of that failed institution, especially if they work within those institutions. Those who criticize from within the institution are viewed as upset that their substandard efforts are being brought to light.

The only way to overcome such a system is with an accountability that gets at the truth. And the truth is that schools are complex institutions with daily successes and failures that great leaders manage and work through to the benefit of the students they serve. True Accountability reveals the truth within a school and the quality of a leader’s decisions. In the hands of a thoughtful superintendent it becomes the means for each school to be great and each school leader to be effective. Done properly it becomes the basis for a community and a school board to understand their schools and the work being performed there. Done well it can eclipse the unhelpful policies that currently pass for educational accountability and replace them with something that actually works.

If you or your school system are interested in joining a movement, please get in touch with me.

Friday, June 8, 2018

Why judgments based on a rank position are stupid (technical term)

Several people forwarded a Seth Godin blog on forced rankings. See it here. I’ll add two cents to the conversation.

A forced ranking in education: 1) is useful for some types of limited analysis; 2) has no capacity to offer a valid judgment; and 3) if used as a judgment tool, almost always falls into the trap of confirming an existing bias rather than reflecting the truth. Ironically, forced rankings are rarely used for analysis. Their most common usage in education is to confirm what people already (often wrongly) believe.

First, its limited usefulness: a ranking (or ordering) can be useful for detecting patterns in single traits where very little information is actually available (when lots of information is available it is far better to use other, more nuanced tools). Once a researcher forces a ranking he/she can search for patterns within that single trait. Some of those patterns may need to be undone. A salary differential between men and women, for example, would be one such pattern. At a later point in time, the ranking can be performed again to see if the pattern was successfully disrupted.

Note that I said single trait. You can order people on the relative differences in height or hair length, but not both at the same time. This represents an extraordinary limitation, one that must be considered, or misinterpretations will occur. You have to stick to a single trait or your analysis will be useless. Any forced ranking that first combines unlike things (which the rankings of colleges do ad nauseam, for example) makes the ordering meaningless. I could use some elegant mathematics to combine hair length and height, force it into a ranking, and then hide behind the math and declare it meaningful, but that doesn’t make it so.
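
For readers who like to see the arithmetic, here is a minimal sketch of the height-and-hair-length point. Everything in it is made up for illustration: the three people, their measurements, the weights, and the function name rank_by_composite are all hypothetical. It shows only that once two unlike traits are blended into one score, the resulting order depends on whatever weights the person doing the ranking happened to pick.

```python
# Hypothetical people and measurements, invented purely for illustration.
people = {
    "Ana": {"height_cm": 180, "hair_cm": 5},
    "Ben": {"height_cm": 165, "hair_cm": 40},
    "Cy": {"height_cm": 172, "hair_cm": 20},
}


def rank_by_composite(people, height_weight, hair_weight):
    """Order names from highest to lowest weighted sum of the two traits."""

    def score(name):
        traits = people[name]
        return height_weight * traits["height_cm"] + hair_weight * traits["hair_cm"]

    return sorted(people, key=score, reverse=True)


# Two equally "elegant" but arbitrary weightings of the same data.
ranking_a = rank_by_composite(people, height_weight=1.0, hair_weight=0.1)
ranking_b = rank_by_composite(people, height_weight=0.1, hair_weight=1.0)

print(ranking_a)  # ['Ana', 'Cy', 'Ben']
print(ranking_b)  # ['Ben', 'Cy', 'Ana']
```

Same people, same measurements, opposite orders. Neither weighting is more correct than the other, which is why a ranking built on combined traits tells you about the weights rather than about the people.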

Second, forced rankings are stupid judgment tools (stupid should be a technical term; I can think of no other word that captures the absurdity of judgments forced on a place in a ranking). A place in a ranking can be high or low or somewhere in the middle. To place a judgment at any point is irrational, because it assigns the same judgment to every person/event/college at that point prior to knowing the cause, which is the only thing that should be judged. In the case of colleges, one college may be at a point in a ranking due to great leadership and a committed faculty, while another may be at that same point due to its ability to attract a certain student demographic. Placing the same judgment on both would be silly: they are profoundly different institutions and their differences deserve to be seen accurately.

Third, those in any sort of position of privilege love forced rankings because of a thing called confirmation bias, which is another way of declaring an addiction to messages that confirm what people in a position of privilege want to believe: that they truly are the best. That bias is powerful. Would they accept a forced ranking that placed Harvard dead last? Or that didn’t list the attractive schools in wealthy neighborhoods ahead of the disheveled looking schools in the poor part of town? They would not.

I say “they” to make a point—the bias confirmed through a ranking occurs from a viewpoint. This is simple to see given all the ways to rank that aren’t selected that would show a very different order to things.

For example, student musicality, creative output, success at 4H competitions, and fluency in more than one language, to name but a few, would offer rankings that would not automatically discriminate against poorer communities, nor automatically put wealthier ones on top. If the preferred viewpoint is that schools in wealthier neighborhoods are the best schools, an ordering based on numeracy and literacy will help confirm that bias, while an ordering based on 4H programs would not. The two rankings would be profoundly different.

In both cases an analysis could be performed, and meaningful patterns identified, but any judgment made at any point in either ranking is a judgment made absent the underlying facts. But because those living in wealthier neighborhoods got to pick the ordering that confirmed their bias, they went with the ranking with the best fit.

That isn’t to say that numeracy and literacy are unimportant, or that serious discrepancies don’t exist that need to be addressed, but rather, that judgments based on a place in an ordering are highly likely to be false, specious, and dead wrong.

The biased selection of the ranking that fits your limited world view and its use to judge others does not make that view correct, but merely biased. It sacrifices the truth for a good feeling at the expense of others. It is always wrong to do.

Wednesday, May 16, 2018

Why state tests make lousy instructional tools

Items are selected for tests according to the test’s purpose. If a teacher is building a unit test from scratch, he/she will build items that reflect what he/she needed students to learn, and the expectation would be that most of the students would answer the majority of them at least partially correctly if they paid attention at all: it would be rare that a student who was at least partially present would score a zero. We say that those items signal the learned/not learned moment. The statistics behind those items are not particularly important: an important item that all students answered correctly would signal good teaching and learning, while one they all answered incorrectly would signal the opposite. The point here is to assess learning.

Researchers interested in analyzing people have long known that if you can order human beings on a human trait or characteristic you can detect patterns to explore. In the case of negative patterns, such as an ordering that shows women generally making less than men for the same job, we may want to disrupt that pattern and attempt to do so. Additional orderings can show whether or not such efforts are working. The researcher could perform a future ordering to see if the pattern had dissipated or still existed.

Many years ago researchers wanted to order students in terms of their literacy and numeracy attainment—they wanted to see what patterns existed so they could disrupt those they found to be negative. To do this they invented the basic methodology of standardized testing, which just like a teacher-developed test required items specific to the purpose. Only this purpose was not against the learned/not learned split, but the above/below average split. This can appear to the naked eye to be mere semantics given that both sets of items contain content from a domain, and yet they are very different to the point that they should not be substituted for each other.

Items used in standardized tests are selected for their ability to sort students into above and below average piles, and with enough items sort the students into an ordering from the student furthest below average to the student furthest above average. The only items that can do this are those that about half the students will answer correctly and half incorrectly (or in the case of a four-response multiple choice item, that number is ~63%, since some of the kids will guess it correctly and statisticians want to take that into account; note that every state test I've ever reviewed follows this pattern).
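The ~63% figure can be reproduced with a common rule of thumb: aim for a difficulty halfway between the chance of guessing correctly and 100%. The sketch below is only an illustration of that arithmetic; the function name is an assumption of mine and is not taken from any state's technical manual.

```python
def target_difficulty(chance: float) -> float:
    """Midpoint between the chance-level score and a perfect score.

    `chance` is the probability of guessing the item correctly
    (0.0 for an open-response item, 0.25 for four answer choices).
    """
    return (1.0 + chance) / 2.0


# Open-response item: about half the students should answer correctly.
print(target_difficulty(0.0))   # 0.5

# Four-response multiple choice item: roughly 63%.
print(target_difficulty(0.25))  # 0.625
```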

The first item sorts kids into two piles (above or below average). The second item sorts kids into three piles (above average, average, and below average), and then each subsequent item creates more piles until you get enough to be useful (it generally takes around 35-40 items). Each item needs to contribute to this sorting to a finer and finer degree. This is why an item that all students answer correctly or incorrectly at the field trial stage will be eliminated when the test is constructed—such items may well reflect on learning, but they show all the students as being the same, and thus don't contribute to the sorting. Only items that exhibit a very specific pattern of responses are useful for this purpose.
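A rough way to see why an item that everyone answers correctly (or incorrectly) gets dropped: for a right/wrong item answered correctly by a proportion p of students, the variance of the item score is p(1 - p), which is zero when p is 0 or 1 and largest when p is 0.5. The sketch below ignores guessing, and the function names are illustrative assumptions.

```python
def item_variance(p: float) -> float:
    """Variance of a right/wrong item that a proportion p answers correctly."""
    return p * (1.0 - p)


def possible_piles(num_items: int) -> int:
    """Number of distinct raw scores (piles) on a test of right/wrong items."""
    return num_items + 1


for p in (0.0, 0.5, 1.0):
    # An item with zero variance treats every student the same,
    # so it cannot contribute to the sorting.
    print(p, item_variance(p))

# One item gives two piles, two items give three, and so on.
print(possible_piles(1), possible_piles(2), possible_piles(40))
```

An item with zero variance may still say something about learning, which is exactly the point of the paragraph above: it is dropped for statistical reasons, not instructional ones.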

It is this narrow statistical limitation that renders these items inappropriate for informing instruction. Let's say as a teacher I wanted to know if students can multiply two-digit numbers: I could ask them to do so with a sheet of problems and check their responses. But as a standardized test developer I would only pick the items that separated the students into an above and below average pile (i.e., from that very narrow statistical range). Out of ten such items on my teacher-developed test, perhaps only one would fall into that narrow range (multiplying some numbers is trickier than others).

Now consider the two different views of a student who completes the ten problems in class and answers, say, eight of them correctly, but misses the one problem selected for the standardized test. The truth is that the student is actually doing pretty well on two-digit multiplication and maybe needs a little coaching, but that will be entirely missed if the instructional inference is only made from the item that fits the standardized test requirements. That inference would be that the student doesn't know two-digit multiplication, which is false. The instructional responses would be very different, with one being appropriate to the student's needs, and the other being inappropriate. When I said earlier that trying to make instructional inferences from standardized testing can lead a teacher down the wrong path, this is exactly what I meant.
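The two views of that student can be put side by side in a few lines. The answer pattern and the position of the selected item below are made up for illustration; only the eight-out-of-ten total and the missed item come from the example above.

```python
# Hypothetical results for one student on ten two-digit multiplication
# problems (True = correct). Eight of the ten are correct.
answers = [True, True, True, True, True, True, False, True, False, True]

# Assumption: the problem at index 6 is the only one that falls in the
# narrow statistical range a standardized test would select from.
SELECTED_ITEM = 6

classroom_view = sum(answers) / len(answers)  # share of all ten correct
standardized_view = answers[SELECTED_ITEM]    # the single selected item

print(f"Classroom view: {classroom_view:.0%} correct")
print(f"Standardized-item view: {'correct' if standardized_view else 'incorrect'}")
```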

This principle applies to the whole of standardized testing.

The selection of standardized tests as an accountability tool was surprising at the outset—these are limited analytical tools, not judgment tools. Nevertheless, they produce consistent results given their design and purpose, and that consistency appealed to policy makers even though it has never meant what they think. What really surprised me was when policy makers declared such tests useful instructional tools. That would be like me declaring a screwdriver a hammer—no amount of trying will allow it to be up to the task. Nevertheless, that is where we find ourselves: in a bit of a confusing mess.

Tuesday, May 15, 2018

A testing glitch on STAAR--and a response to a question

Texas experienced some computer glitches today administering the state testing program (the system shut down for a little more than an hour). A superintendent friend wrote me a note asking about the potential impact on the reliability of the results. I'm posting below what I wrote to him.
--------------

Reliability refers to (among other things) the kids doing about the same on parallel tests—but that assumes similar conditions for each test. The conditions for this test compared to another would be different given the interruption—therefore it is reasonable to suspect that the results would differ.

For example, kids who tried hard before the break may think that the grownups don’t care enough to create a system that works and not take the part after the break seriously—you could see that by comparing scores before and after to see if effort decreased. If it did, then that definitely affects the reliability of the scores, since they would very likely perform differently on a parallel form of the test. There are lots of similar issues that could be considered in the same vein: increased stress after the break, teacher stress being seen during the break, possible exposure to items between students (unless the kids were made to sit with their heads on their desks for an hour and a half and not say a word to each other), etc.

If you were testing as a research activity with no pressure on the kids or the schools and this happened (you're not—but bear with me):
  1. You would declare the data suspect until you could perform additional analysis.
  2. You would likely do a study after the fact to determine if the gap had an impact and the degree of the impact.
  3. If you could determine that the impact was either consistent (say, 2 points for every student) or absent, you may decide to include the data and maybe make some adjustments, but certainly with a big footnote.
  4. If the impact was all over the place and no patterns could be found, you may need to declare the data corrupt, toss it, and figure out a plan B.
  5. In the end, assuming the data were going to be used, a researcher would probably want to repeat at least a sample of the study as a point of comparison—just to be sure. No good researcher is likely to be entirely confident in the results until they could confirm that their conclusions would be the same regardless of whether or not the gap in testing time had occurred.
The point is that as a research activity this would present a mess, but there are tools that can try to make sense of what happened and adjust. Still, this would give a researcher fits and be far from ideal.

State testing is not a study or a research activity, but a high stakes event. If a researcher isn’t going to trust their conclusions when something like this occurs without a whole lot of checking, the same must be true for a state test. What differs, however, is that while any adjustments or manipulations made during research affect data, any adjustments made to the state test scores affect both students and their schools. That cannot be resolved with a footnote.

This is what happens when you ask a screwdriver to do the work of a hammer: even the most reliable of these tests are not designed as instruments to judge schools or kids, but rather as useful but limited analytical tools. They should never be used as they are by states as accountability tools. The argument that the results from today's tests are probably going to be less reliable than they should be is likely true, but a more reliable result wouldn't solve the underlying problem that these tests are being wrongly used.

Finally, and with feeling, even when a reliable test score is asked to serve as a judge of school quality or student performance, the number of false positives and false negatives in the judgments will be ridiculous. A student who struggles historically may be slightly below the “passing” score as a result of great teaching, while another may be slightly above it because both parents have a PhD and the student is coasting through school. A declaration of failure for the first student or the school is as wrong as the declaration of success for the second student and most certainly the school.

Analyzing both students and their schools is in the design of such tests. Judging either is not.

Thursday, April 19, 2018

Response to a great question on school accountability

Kristi Hassett, a trustee in Flower Mound, Texas, posed the following via Twitter: Many ed reformers want schools to run like businesses. @testsensejt, I wonder what an Industrial Engineer would say about our current testing & accountability regime. What would they look at to determine value? What would they conclude?

My response is a smidge long for a tweet, but the question is a good one and the answer goes right to the heart of the matter:

People in organizations are generally accountable for the quality and efficacy of their decisions, as judged by a supervisor. Organizations are generally held accountable via market forces, and failure via the market generally signals bad decisions by its people. This is true whether you’re a non profit or an engineering firm.

Current school accountability pretends to have invented a competitive market by which to judge schools and then presumes that the position of a school in the market signals the quality of decisions. This is flawed in the extreme.

First, it presumes schools all start with similar raw material and therefore any differentiation in the output can be attributed to those in the school. When it acknowledges that schools do not start with similar raw material, it further presumes that the failure to match the outcomes of those at the highest positions (or at least grow towards them) can also be attributed to those in the school. Having done so, it then presumes that some sort of market force (or public humiliation) can serve as a corrective tool.

But the definition of a quality outcome for a school differs a great deal. The poorest school in America should focus on all of its students walking across a graduation stage with the grit and determination to succeed in life, even though their circumstances may have left them somewhat behind their wealthier peers in academics. That academic deficiency need not have a crippling effect if other skills essential to a successful life can be instilled. That is not to say that a deficiency in academics is irrelevant, but rather, it is a school’s job to make up to the extent possible the gap between what life throws at a student and what he or she needs to thrive. To the degree a school can do that, those deficiencies can be overcome.

And the richest? Its position in the market as defined by school accountability signals almost nothing regarding the quality of the school. It would have a difficult time not being declared successful. And yet that would not signal if the school is a quality place to learn, if it placed a special emphasis on its small population of less affluent students to ensure they weren’t marginalized, or if the decisions being made were in the best interest of the students, parents, and the community.

Which brings me to my second point: school accountability has adopted a market-based accountability that imposes a judgment without ever considering the quality and efficacy of the decisions being made. We know this because had they done so we would not see the high level of correlation between a “quality” school and the relative wealth of the community. Rather, we would have high quality poor schools and low quality rich schools, which is extremely rare in what currently passes for school accountability.

Finally, the signal for a market force chosen for school accountability was a standardized test score. Standardized test scores have the distinct advantage of confirming the bias most people have that rich schools are good and poor ones bad. However, consider this: what if the "market" had been defined as creative output, not standardized test scores? Creativity in students doesn’t follow socioeconomic lines, and so you couldn’t predict from the census data which schools would and would not succeed. Rather, success of the school would be largely dependent on the quality of the decisions made by those who work there.

What the industrial engineer would say, I think, is that educational accountability suffers from a misunderstanding of what accountability is and how it works, an inability to actually recognize quality, and an inability to recognize in the judgments a confirmation bias rather than the truth.

Tuesday, April 3, 2018

How charter and choice starve public schools

Policy makers continue to set forth choice and charters as the cure-all for what ails education. I can argue the fallacies behind that thinking until I’m blue in the face. However, in this blurb all I want to make clear are the simple economics of the thing. The economic argument alone, I believe, is enough to cause us to rethink the entire charter enterprise.

Imagine within a community it costs five dollars a year to educate each general education student. That would be an average. Some students would cost more, and some would cost less, but it would be difficult, if not impossible, to identify the actual costs for a particular student.

Now imagine you are someone motivated by profiting from public school dollars and you open a charter to do that. Like the public school, you would be given five dollars for each student who comes to your school.

If all the kids who come to your school cost more than five dollars to educate, your business would fail—you would either spend what was necessary to educate those kids and lose money, or you would serve those kids poorly and make your profit, but the odds are in that case you would lose your charter pretty quickly. A charter full of kids who cost more than five bucks each is doomed.

It is also true that if you took kids such that they mirror the population, you would need an average of five dollars per student to do the work, which would suck up the profit you’d hoped to earn (presuming you were true to your word and properly served those students). Again, as a business with a profit motive that makes for a bad model.

The reality is that the best way to maximize profit and ensure your success is to find as many kids as possible who cost less than five dollars a year to educate. That way you have sufficient resources to do the work with some left over.
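The three scenarios above come down to simple arithmetic. The sketch below uses the five-dollar figure from the example; the per-student costs are invented purely for illustration, since (as noted) actual costs for a particular student are hard to identify.

```python
FUNDING_PER_STUDENT = 5.0  # dollars per student per year, from the example


def surplus(costs):
    """Funding received minus what it actually costs to educate these students."""
    return FUNDING_PER_STUDENT * len(costs) - sum(costs)


# Hypothetical per-student costs for three enrollment patterns.
all_above_average = [6.0, 7.0, 6.5, 7.5]  # every student costs more than $5
mirrors_community = [3.0, 4.0, 6.0, 7.0]  # costs average exactly $5
all_below_average = [3.0, 3.5, 4.0, 4.5]  # every student costs less than $5

print(surplus(all_above_average))  # -7.0: the school loses money
print(surplus(mirrors_community))  # 0.0: nothing left over as profit
print(surplus(all_below_average))  # 5.0: money left over
```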

This is where the idea of choice is the perfect beard. While seemingly democratic, it is anything but. It confuses the invitation to participate in a democratic process with actual participation. The simple truth is that only some segments of the population have the capacity to actually choose, which tends to be parents whose children are the least expensive to educate. Parents with few resources, multiple low-wage jobs struggling to put food on the table, or who lack transportation, often have children who are more expensive to educate given their lack of opportunity outside school. This is not a criticism, but a fact, one public schools accept as part of their ethical responsibility of being a public school.

The idea of choice is the perfect vehicle for finding the least expensive students to educate and pulling them out of the public schools. This creates an economic disparity that furthers the gap between the haves and the have nots: the charter school will have more than sufficient resources to educate the children that attend it, while the public school will be left underfunded given the needs of the children who remain behind.

The only thing fair or equitable about this is nothing. Arguments about quality have to be pushed to the side: if you were running an over-funded school the odds of success would be relatively high, just as they would be relatively low in an under-funded school. The fact that much recent research suggests charters in general fail to outperform publics should be an even bigger economic argument against them: that means they are using their resources poorly while making a profit on the backs of our most vulnerable students.

On the surface, what appears to be happening is a simple exercise in democratic choice and market forces. What is actually happening is the undermining of our public schools at the expense of those who most need the benefits a quality education can afford.

Monday, March 26, 2018

The Fallacy in Commissioner Morath’s Argument that All Kids Can Pass STAAR

Last week Texas Commissioner of Education, Mike Morath, again stated his belief that all students can pass each STAAR test and therefore all students and all schools can be successful within the accountability program he is designing. His argument is this: STAAR is a criterion-referenced test, not a norm-referenced test, and thus all kids can pass it. When a friend of mine in attendance questioned this, Commissioner Morath acknowledged some superintendents did not believe this was the case and declared it a “difference of opinion.”

When it comes to the world of educational testing and educational accountability, I’m something of a testing and accountability expert. I’ve worked in that world for the better part of my career. I’ve written a book and a number of articles specifically about what such tests were designed to do (which is far more limited than most people think). I’ve read a ton by others way smarter than me on the subject whose work has helped inform my understandings, and I’d like to think that my work may have helped inform theirs.

Commissioner Morath’s statement is false. This is not a matter of differing opinions, but one of fact versus fiction. The fact that he continues using his flawed understanding of both criterion and norm-referenced testing as the basis for his version of educational accountability requires it be called out directly. His claim, that all schools and students can be successful within the system he is designing, is simply not possible. That he wants it to be true is admirable, but if he truly wants it he needs to select tools not designed to prevent it.

To see the issue clearly requires a basic understanding of several parts of standardized testing: ordering, statistical processes, scaling and norming, cut scores, and criterion-referenced tests. I promise, understanding them is not as difficult (nor as boring) as most would think.

The methodology behind standardized tests like those used in state testing programs was invented well over a hundred years ago as a means to order students. That ordering is from the student furthest below to the student furthest above average based on the relative differences between students. Such tests have the advantage of being fairly stable over time because of the fact that orderings based on single traits are fairly stable over time. For a researcher trying to study human characteristics this is research gold: it provides a stable basis for research that would otherwise be impossible. It is that stability of scores that made such tests attractive to policy makers as well, their limitations notwithstanding.

One purpose for such orderings is that they allow researchers insights into populations of students they would not otherwise have. Most notably: ordering reveals patterns. For example, if I laid out a ruler with salaries on it, starting with a dollar at one end and a million dollars on the other, and then asked everyone in a single profession to stand on their salary, I now have a huge amount of information I can explore. I would need to go to each point in the ordering and see who is there and why, and what meaningful patterns exist (likely in this case that men tend to make more than women for the same job). Think of that as a two-step process: first, I create an ordering, and second, I look in the ordering for causes.

The design of standardized tests allows for this same sort of ordering of students in educational domains. An educator is then able to look for patterns that could otherwise not be seen, and then for causes that suggest what to do next. That they are now rarely used in that fashion is the fault of educational accountability, and a discussion for another time.

Test professionals must apply a host of statistical criteria to the items that make up such a test, without which the results would be useless from a research perspective. For example, if a student is slightly above average on last year’s test we would anticipate that the student would be slightly above average on this year’s test. If we saw that the student is now well-above or well-below average, that change needs to be shown to be meaningful for the test to be useful. Changes would not be meaningful to a researcher if they were random or haphazard or unpredictable. The statistical criteria allow test professionals to create tests capable of signaling when a change is meaningful. A researcher who observes meaningful changes through such tests would then conduct a further investigation to answer the how or the why.

Scores from standardized tests can be difficult to interpret. For example, if two parallel test forms exist, and one is slightly more difficult than another (this is common), then a score of 30 out of 40 items means different things—the student taking the more difficult test could be said to have actually scored a little higher than the student taking the easier test, but because the scores are both 30 out of 40 that can be tricky to see. Test makers therefore convert raw scores to a scale of some sort (the SAT and ACT, for example, both do this for this very reason). This allows for the 30 on the easy test to be converted, say to a score of 350, and the 30 on the more difficult test to be converted to a score, say, of 360. Those scale scores are therefore more precise estimates of what each student did than the raw scores.
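As a minimal sketch of the idea, the conversion can be pictured as a separate raw-to-scale rule for each form. The linear rules below are hypothetical, chosen only so that a raw score of 30 lands on the 350 and 360 used in the example; operational testing programs derive their conversions through statistical equating, which need not be linear.

```python
# Hypothetical (slope, intercept) pairs, one per test form. These are
# illustrative assumptions, not values from any real testing program.
CONVERSIONS = {
    "easier": (10, 50),
    "harder": (10, 60),
}


def scale_score(raw: int, form: str) -> int:
    """Convert a raw score on the given form to the common scale."""
    slope, intercept = CONVERSIONS[form]
    return slope * raw + intercept


# The same raw score of 30 out of 40 maps to different scale scores.
print(scale_score(30, "easier"))  # 350
print(scale_score(30, "harder"))  # 360
```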

That scale can have norms applied to it. This refers to the conversion from raw to scale scores. It adjusts the conversion so that the scale scores distribute students in a bell-shaped pattern. While this is a highly technical process, the end result is that scores can be more easily compared, for example, across grades and even across subjects. Given the complexity and expense of the process, normed tests have been largely limited to commercially available published tests, such as the Iowa Test of Basic Skills or the Stanford Achievement Test Series. They were often disliked by educators because someone would always have to be in the 1st percentile, and someone in the 99th. However useful to a researcher, that never sat well with teachers.

Either a raw score or a scale score can have a line drawn at that point in the ordering and be declared a passing or a cut score. Medical and nursing exams, for example, do this as a protection to society. They work from the assumption that the applicant to a profession who scores at the bottom should not be allowed into the profession, but the applicant at the top should. They then work to find a point somewhere in between as the dividing line. Any passing score they land on suffers from the logic mentioned above in that test takers at that point in the ordering will be there for a host of reasons. However, the test maker can perform research so that for the most part those above that line are indeed qualified. It is far from a perfect system and is always somewhat arbitrary, but it serves a useful purpose and absent an alternative persists.

Commissioner Morath has declared that because STAAR is a criterion-referenced, not a norm-referenced test, all students can climb over an established cut score, and all schools can succeed. This is not a small argument, but the basis and justification behind his thinking. If he is wrong, his system falls apart.

He is wrong, and the misunderstanding is massive.

“Criterion-referenced” is a phrase that refers to the content of the test being drawn from a blueprint explicitly designed to represent an educational subject, with some set of criteria established for a passing score. This is most common in tests teachers use, which are not designed to sort students but check for learning (all students can, indeed, pass that sort of test if they learned the material). However, in the past decade or two standardized test-makers (and those who make state tests) made an attempt to move their content in this direction as well. Ordering is a rational research tool, so the logic went, and the closer the content to what was taught, the more meaningful the ordering will be. Thus, both tests teachers give in the classroom, and those that order students, can be criterion-referenced. In fact, it is entirely possible to have a criterion-referenced test that is also normed.

The accurate, factual description of STAAR is this: STAAR is a criterion-referenced testing program because it draws its content from a state blueprint that reflects standards documents. It is a test designed to order students against that content from the student furthest below to the student furthest above related to that content. The state then follows the model of nursing and medical exams and draws a passing line in the sand at a point in the ordering. STAAR is not a normed test, as no one performs the research that would allow for normed results. As a result, comparisons from one year to the next should be done cautiously, and certainly no comparisons are possible between subject areas.

We can state all this unequivocally thanks to the fact that Texas runs a very transparent system and publishes a technical manual annually that shows the application of all the statistical processes necessary to build a test that orders students. Those processes would not be present if the goal was something other than ordering students.

Commissioner Morath’s argument can be summarized as follows:
  1. STAAR is a criterion-referenced test.
  2. Because STAAR is not normed, it does not order students.
  3. All students can get past the cut score if schools would do their jobs.
  4. All schools can therefore succeed in the accountability program.
Only one of those is true: STAAR is a criterion-referenced test. The others are all false. STAAR orders students, which means that for all students to pass all students would have to get past some point on a test designed to order them. That would be the equivalent of all students being above average—it cannot and will never happen. As a result, to say that all schools can succeed in his plan is deeply disingenuous. It is instead a guarantee that such a thing will not happen.

If I want to build something an engineer tells me will fall down, that is not a difference of opinion. It is the difference between the position of a professional who understands, and someone who does not. Not heeding the position of the professional in the case of a building would be foolhardy since the thing is unlikely to stand.

Not heeding the position of professionals in the case of educational accountability is worse. It risks direct damage to the children the Commissioner is constitutionally required to serve by damaging the institutions that serve them. It is beyond foolhardy. STAAR orders students from the furthest below to the furthest above average. That is a verifiable fact. Not an opinion.