Friday, September 27, 2013

Data-less decisions

(Reading the previous post first may help—this one follows from it)

A data-less decision in education is just that: a decision made absent supporting data. Data-less decisions are bad for the simple reason that whatever decisions are made tend to be in support of an existing bias. Such bias can be positive or negative, very fair and objective or extremely unfair and subjective. Sometimes the bias is based in what is actually true, but just as often it is based on an untruth or a stereotype. All this is why the mantra of data-driven decision-making has been established as a proper goal for educators.

The problem is that if I look at a student in a particular situation and I possess no meaningful data, I am highly likely to let any number of my biases enter into my view of the student. This can include but is certainly not limited to my views on gender, race, socioeconomic status, whether the school is in an urban, suburban, or rural setting, and perhaps the quality of the football team or the student’s status as a star athlete.

For example, the data are quite clear that suburban schools tend to outperform urban schools by a large margin for any number of reasons. Thus if the only thing I know about a student is that the school of record is in an urban setting, it would be a fairly natural thing to presume that from an achievement perspective the student might be expected to under-perform against suburban peers. If I acted upon such a supposition when assigning that student to classes I would be making a data-less decision heavily colored by that bias.

Having done so I may be guilty of promulgating a status quo I dislike. It may be that the student has the capacity to be the next Einstein, and yet having assigned the student to remedial classes I helped preserve a stereotype rather than shut it down. Most of the time data-less decisions are made against a huge combination of different types of bias that are manifest in far subtler ways, but the pattern almost always seems to be toward the preservation of the status quo and not the other way around.

This isn’t to suggest that people are by nature racist or evil or mean. Many data-less decisions are made with the best of intentions. The point is that we are each products of a history that is anything but neutral, and the simple truth is that much of our bias has some basis in fact—e.g., urban schools as of this particular moment in time do in fact under-perform against their suburban counterparts—or stems from a historical precedent that can be tough to shake. The promise of data-driven decisions is that such bias can be removed from our decisions.

The most dangerous data-less decisions are those that appear to be supported by data. Those decisions risk reinforcing whatever societal norms exist under the false pretense that the data suggested that as the proper thing to do. The data, in that situation, act as a Rorschach blot, allowing you to think you see something that puts forth an argument for your approach when the empirical reality may be otherwise. Data-less decisions that appear to have the support of data risk justifying a bias that need not in fact exist. Such decisions help solidify such bias rather than disrupt it.

Nowhere is the data-less decision more prevalent than in the use of test data in schools. Standardized test data have a very limited range of potential uses by design. Included in that design is the ability to compare schools and students to each other, as an aid to identifying which schools have working solutions that can then be applied elsewhere.

Not included in that design is pretty much everything else. Those comparisons are silent as to their cause, so any assumptions about the practices that produced the comparisons have to come from a place outside of the test data. So too with judgments regarding the quality of the school, the nature of the curriculum, and whether or not a teacher did or did not do his or her job properly. (Policy makers continue to assume otherwise much to the detriment of our students and schools, but bad policy cannot make test scores magically perform an act for which they were never designed.)

The majority of judgments being made about schools and teachers from test data are then data-less judgments, and any decisions made from such judgments are themselves data-less.

Due to the nature of test data, however, the data-less-ness is almost impossible to see. Remember that a set of standardized test data offers a statistical representation for how things are at the moment, and the test itself is designed to show where each student and school falls within that overall representation at that moment in time. Test data, then, actually reflect whatever biases happen to exist as of a given moment in time. Test data are neutral when it comes to what those biases are, in that they don’t care what biases exist. Test data will reflect them regardless.

That means it is a very tempting thing to look at the rank ordering of schools and conclude that those that rank near the bottom are lousy schools and those at the top are great schools, because in many cases that may well be true. However, any such judgment from the test data alone is a data-less judgment, since test data are silent as to their cause and are not designed to make judgments regarding quality.

That is so hard to see through. If we take a slice-in-time measure, it will show the effect of whatever bias exists in the world. If at that same moment we add several additional slice-in-time measures designed to answer the additional questions we have, we may be able to make some fairly accurate statements regarding the quality of a school and those working inside it.

But having taken those slice-in-time measures our goal must be to take a set of actions that remedy the shortcomings and advance the cause of education. If those remedies are successful, then a new set of slice-in-time measures should provide evidence of our progress.

Instead, what we now do is make one of those original slice-in-time measures—a test, which by definition is very limited and incapable of commenting on quality—the basis for our remedies. We take an instrument selected and created for its ability to show us what a rank ordering looked like yesterday and use it as the basis for defining tomorrow.

Here is where we need to really pause and ask ourselves a very tricky question: if we are basing tomorrow upon a measure designed to show yesterday's rank ordering, might we in fact be guilty of preserving the bias that existed when that original rank ordering took place? Rather than allowing education to progress and designing a new instrument that showed us the results of that progress, by using the original instrument over and over and placing the quality determination squarely within it, might we in fact be guilty of further entrenching ourselves in an old status quo when all we really want to do is escape it?

As we heap data-less judgments regarding the quality of the teacher and the school onto the system, two very clear and very contradictory messages emerge. The first is the altruistic insistence that teachers and schools advance the cause of education for their students and serve them well. The second is the very pragmatic demand that success is about teaching to a test designed to reveal the biases of yesterday. Accountability will be measured not by the altruistic message, but by how well you perform against a definition of reality that should now be out of date.

Lots of metaphors apply here: running on ice, running in circles, shooting yourself in the foot, you name it.

Having operated under such a scenario, should we really be surprised that our test-based culture has failed to produce the transformations it promised? Or should we finally realize that believing the false promise of a test-based culture to magically transform the status quo is perhaps one of the greatest barriers to seeing that actually happen?

Saturday, September 21, 2013

What reliability doesn’t say

A standardized test is at its simplest a data collection tool. It only works if the data collected meet a certain standard in terms of statistical reliability. Reliability is all about the consistency of the measure or observation, and generating a sufficient level of reliability to allow for reasonable inferences to be made requires both skill and planning. In the process of achieving that reliability, however, you impose a whole series of limitations on what the resulting data can say in the name of allowing it to say a few things well.

To make the idea of reliability more concrete, imagine that you have two observers of a rat moving through a maze and you ask each observer to record their observations by writing down what the rat is doing during the experiment, with no other tools than a pen and a piece of paper. Odds are the two observers will offer related but very different narratives, which means that from a research perspective the observations would be of limited use.

The reason for the limited use is that some of the inferences drawn from one observation risk being refuted or not supported by the other, and so a researcher making any inference risks that inference not being supported by data.

Statistical reliability can be obtained if a certain amount of discipline is introduced to the observations. Adding a stopwatch would of course help because both observers would be likely to agree upon the amount of time it took. So too would limiting the observations being recorded to a list that included only the salient points of the hypothesis being studied.

Under this second scenario the agreement would offer an indication of the reliability of the judgments, and with a high degree of reliability a researcher has an increased level of confidence that their inferences can be based on good data.
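The agreement described above can actually be quantified. Here is a minimal sketch of my own (the behavior categories and recorded codes are invented for illustration, not taken from any real study): Cohen's kappa compares the raw agreement between two observers working from the same checklist against the agreement you'd expect by chance alone.

```python
# Toy illustration: inter-rater agreement between two observers coding a
# rat's behavior at fixed intervals from a shared checklist of categories.
# Cohen's kappa corrects raw agreement for agreement expected by chance.
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two raters coding the same observation points."""
    assert len(coder_a) == len(coder_b)
    n = len(coder_a)
    # Observed agreement: fraction of intervals where both raters matched.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    freq_a = Counter(coder_a)
    freq_b = Counter(coder_b)
    # Chance agreement: probability both raters pick the same category
    # if each coded at random according to their own category frequencies.
    expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Each observer records one category per interval (hypothetical data).
a = ["turn", "turn", "pause", "run", "run", "pause", "turn", "run"]
b = ["turn", "turn", "pause", "run", "pause", "pause", "turn", "run"]
print(round(cohens_kappa(a, b), 3))  # → 0.814
```

Note what the number does and doesn't say: a kappa this high tells us the two observers are recording the same thing, but only for the categories on the checklist. Everything off the checklist remains unobserved, which is exactly the trade the next paragraphs describe.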

What must be remembered, however, is that while such reliability enables a great deal in terms of making valid inferences, it also conceals a great deal in what goes unobserved or unrecorded. If, for example, our focus is on the number of wrong turns the rat takes under a number of different scenarios, we would make note of those particular phenomena, and not of whether the rat was brown or white or tended to make its wrong turns more often to the right or to the left. It isn’t that those things are unimportant or irrelevant in a broader context, but rather, they are not a part of the question being asked and thus are not included as part of the observational milieu.

No set of reliable observations is therefore ever complete—the fact that they can be made to be reliable is an artifact of the manner in which the observations are controlled. They aren’t controlled for some nefarious purpose, but in order to create a limited number of powerful observations that can lead to increased understanding. Once those controls are put in place, however, a researcher, by definition, draws a very firm line in the sand that limits the range of possible inferences. Having done so, the vast majority of the universe of inferences is removed as a possibility, leaving only the few that are the focus of the research.

The price of reliability, then, is that it must pick and choose, and in doing so always leaves most of the universe outside of its gaze.

Standardized tests are a type of observation, with the data being collected in the form of a student’s responses. What that means is that the test itself is necessarily limited in its scope and the vast majority of the universe in which the tested content is contained is external to the tested material.

For the purposes behind a standardized test those limitations don’t pose a problem, because the purpose is pretty straightforward: show the rank ordering of students in a manner such that students can be compared to each other, and one group of students can be compared to another group of students. To do that you need only test items that behave in a very narrow way: roughly half the students need to answer each of them correctly, and half incorrectly. Those are the ideal items for such a question, since 25-30 of them generally provide enough data to show where each student and group of students ranks. Items within that narrow range do a very nice job of spreading students across every possible number of correct responses.
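The arithmetic behind that narrow range can be sketched in a few lines (this is my illustration, not something from the post): an item answered correctly by a proportion p of students contributes a variance of p(1 - p) to the total score, and that quantity peaks when p = 0.5.

```python
# Toy sketch: why items that roughly half of students answer correctly do
# the best job of spreading scores out. A single item with proportion-correct
# p contributes variance p * (1 - p) to the total score distribution.
def item_variance(p):
    """Score variance contributed by a single item with difficulty p."""
    return p * (1 - p)

for p in [0.1, 0.3, 0.5, 0.7, 0.9]:
    print(f"p = {p}: variance contributed = {item_variance(p):.2f}")

# An item everyone answers correctly (p = 1.0) or everyone misses (p = 0.0)
# contributes nothing at all to the spread, which is why such items are
# excluded from the instrument even when the content they test matters.
```

The design consequence is the one the post draws: the easiest and hardest material taught all year is precisely the material the test cannot afford to include.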

Within the universe of a domain such as reading, the items needed to show the rank ordering of students represent a tiny sliver of the domain.

Reliability is built into the instruments so that they allow for a high level of confidence in the inference regarding where a student ranks against the tested material. What they don’t allow for is any additional inference that was not a part of what was observed. That’s just plain logic: if you choose to narrow the focus of your observation to achieve a reliable set of data, what would make you think that inferences outside and beyond what was observed are suddenly available?

So now let’s think about what is outside the “observations” being made by a standardized test. That would include everything that isn’t in those items. It includes the most challenging of the material that was taught and the most basic, since none of that is on the test. That content consists of material that by the end of the year every student would have answered correctly, or perhaps that the vast majority would still answer incorrectly, and thus it fails to help answer the question regarding rank ordering. You could include it, but it wouldn’t contribute to the reliability of the measure—in fact, it would hurt the reliability because it contributes nothing to the purpose of the measure.

Such tests don’t include any observations as to the quality of the teacher or the school. This comes as a surprise to most people, since the entire world now seems content to assign a quality judgment to a school based on test scores. But where within the limits of the observational lens is any question as to school or teacher quality? Rank orderings are good for the purpose of comparisons—in fact, for the purpose of offering up meaningful comparisons, they are ideal—but the placement of a student or a school within a ranking says absolutely nothing as to what caused a student or school to land at that point.

Filling in the silence beyond the student or school ranking with statements as to the quality of the school or the teachers may seem on the surface to be justified—and such judgments may in fact correlate somewhat with the reality regarding quality schools—but such statements are themselves entirely unsupported by the data. Inferences about quality made from any standardized test are in the same category as speculating that the faster mice in the maze ate a better breakfast than their counterparts without having a speck of data as to whether or not that is true.

Finally, the answers that students provide offer no advice as to what should be changed in the curriculum to support better instruction the following year, and yet policy requires that state test scores be returned in order for schools to use them for just this purpose. This is perhaps the greatest crime we commit with test scores, and takes us so far beyond the observational lens offered by a test score that it really is both shameful and laughable at the same time.

The best way I can show the paucity of instructional value from such a test is to point out the types of observations that would be needed in order to identify the candidates for improvement from one year to the next. Consider the following as a representative yet incomplete list of the kinds of questions that need to be answered, and compare that to the only question a standardized test is designed to answer regarding the rank ordering of students:
  • Did a teacher teach a rich curriculum and teach it well? 
  • Did a teacher differentiate instruction according to the needs of individual students?
  • Were students underprepared coming into school this year and thus in need of extra support to catch up to their peers?
  • When learning failed to occur, what was the cause? Discipline issues? Personality conflicts? Novice teachers?
  • Did teachers re-teach concepts and ideas that were not understood in ways that represented a “once more, but louder” approach, or did they attempt to teach the same thing but through a different lens or approach?

What should be obvious is that the data-gathering efforts that would help provide answers to these and other questions like them would require an entirely different set of observations than those required to identify the rank ordering of students. What should be just as obvious is that no matter how well we answer the question regarding how students and schools rank we cannot suddenly ask the observations to extend their reach into areas that the test doesn’t cover.

We need reliability in social science research and certainly in testing, but we need to understand that imposing it upon our observations—whatever their form—means that our opportunity to make inferences narrows to the observational target. If we think otherwise we are quite likely making inferences that lack any actual support.

If I am right in this last point—and I would argue I am—then much of what currently passes for data-driven decision-making is actually data-less decision-making that we pretend is valid.

Tuesday, September 17, 2013

Why I don’t hate the Common Core

Multiple sources have accused me on multiple occasions of hating the Common Core and thereby the Common Core assessments. This is understandable. I am one of the few people to criticize the overall selection of behavioral statements as the paradigm for what we call an educational standard, and the Common Core follows that trend.

By “behavioral statement” I mean that our standards in education tell students what behaviors they should engage in: understand this, comprehend that, multiply two-digit numbers, etc. As a curriculum guide such statements are extremely useful, since that is precisely what a teacher attempts to do every day: get students to behave in ways that further a student’s learning.

As the basis for a standardized test such statements are also more than appropriate, since such instruments are designed to allow for inferences about student performance relative to such behaviors and to other tested students—when used properly, which is another issue for another time.

But should such statements be allowed to serve at the elevated level of a standard? And, if we answer that question in the affirmative—which we have for nearly twenty years now—what are the consequences for having done so?

The easiest way to answer this question is to ask about the purpose of standards in industry and government, since their ability to transform broken industries, improve the quality of our air and water, create real and meaningful competition, and reduce the price of goods and services to the end consumer was the reason education decided it too could experience a similar transformational benefit from a rich set of standards.

But industry standards are most notably precise. They have to do with making the world more efficient (the size of gas nozzles), safer (clean air standards), cheaper (35 mm film as the standard that helped make cameras affordable for everyone), or even just better (minimum highway mileage for new cars). These standards care little about the behaviors that cause them to be met, leaving that up to those with the expertise to achieve such things. Rather, the standard allows for an infinite number of behaviors to lead to the standard being met, with the benefit to follow.

It is interesting that education chose the behaviors as the standards system. Had we chosen a standards system more in line with those that had created the types of changes we had hoped for, we might well have seen the types of transformations that industry and government experienced when they adopted such standards. Instead, we chose behavioral standards and yet we expect them to produce the exact same result as the more precise industry standards.

And what—to close out this entry—might a precise educational standard look like? Here are just a few examples:
  • All students must write well at least once in order to matriculate to the next grade, with the difference being the level of effort and time, but not the expectation.
  • In each year of schooling, a student’s teachers will select an assignment, project, or area of study that a student struggled with and reassign the work, with the additional requirement that the work be completed to an A standard, and provide the supports and scaffolding necessary to see that happen.
  • Over each three-year period of service teachers must present a paper, a research project, or a content-area project (such as a play or a novel) to their peers.
For a school to do any of these would require a whole range of behaviors that would differ by student and school. Differentiated curricular and instructional decisions would have to be made against need, and the system of schooling would have to organize itself very differently than at present for these types of standards to be met. Instead, we now have a system that attempts to align the behaviors and then generate a similar outcome for everyone.

In education (and in virtually any field or endeavor) you have to pick: align the behaviors and you all but guarantee a differentiated outcome, since students will respond differently to those behaviors, or align the outcomes and allow the behaviors to differentiate against need.

The behavioral statements we position as educational standards bear no resemblance to the standards that created the desired level of transformation elsewhere. Education now aligns the behaviors and demands a similar outcome, when transformational standards define an outcome and leave the behavioral piece to those who truly understand how to achieve the outcome.

That is why I don’t hate the Common Core. As a guide to generate a rich curriculum it may be more than adequate or even amazing—I am not a curriculum person with the ability to make that determination and so I won’t try. But as a guide the opportunity still exists to differentiate instruction in anticipation of a similar outcome; as a standard the message is for instruction to standardize but then the guarantee is that the outcomes will differ.

I am disappointed in our inability to see what we have done: to repeat, we attempted to replicate a standards environment with real transformational power but then failed to adopt the type of standards that had actually produced such a transformation. The result is that we now standardize the wrong pieces.

We are left with the expectation of transformation when we failed to include any transformational tools anywhere in the educational package.

Monday, September 9, 2013

Pictures of "rigor"

If one Googles "rigor" and selects the image option, the result offers clear evidence that the term has a huge variety of meanings in education--many of which are incommensurate with each other--strengthening my argument that its use as an educational term is more about the creation of a community around a set of terms than about serving as a useful adjudicator for what should comprise a quality education. Consider the following:

1. The dictionary definition clearly has not yet caught up with the current use of the term as "the adjudicator for everything that is to be valued in a quality education." The definition below is from a recent unabridged dictionary, and even dictionaries on the internet have not yet posted anything resembling the new meanings. Perhaps that is because nothing approaching actual agreement exists in the term's current usage. More likely it is the severity of the semantic offense: these new meanings are so far removed from the old denotative meanings that they are unrecognizable as having anything to do with the historical term.

2. See below for evidence of the claim in #1 that it's all about rigor.

3. The quote below suggests that the "rigor" in the term "rigor" and the "rigor" in the Latin "rigor mortis" are not the same thing, even though every dictionary I have so far found says that they are. It's a folksy sentiment, but that doesn't make it accurate.

4. There is a head banger band called Rigor Mortis (at least they look like head bangers), and now, apparently, a clown.

5. I've seen this Venn diagram many times and find it hard to reconcile which of the new meanings of rigor are being referenced here, and not surprisingly none of those contained in the dictionary seems to fit either. Not even "a sudden feeling of cold with shivering accompanied by a rise in temperature, often with copious sweating."

6. Rigor, of course, must be applied in an equitable fashion. These students all seem to be smiling a lot considering that the education they are advocating for will be one that is severe, strict, and unyielding in its harshness.

7. I find it remarkable in the next image how the terms "flexible" and "rigor" are used in the same sentence, given that they are listed in nearly every thesaurus I can find as antonyms.

In attempting a bit of humor here, my point is this: I am often accused of splitting hairs on this topic, but I assure you that every image above (except the clown) was created by someone who had something meaningful to say, and their message now fails because of imprecise language. They chose a popular term with a meaning far removed from whatever message they were actually trying to send, and now the actual meaning is lost, or at best confused.

That is what I object to: educating 55,000,000 children each year is tough enough without putting a confusing set of vocabulary in the way. It makes us look like we don't actually know what we're talking about, or we're trying to hide something through language that obfuscates. Neither is good.

A contrarian article and talk

In October 2011 an article I wrote appeared in Educational Leadership in which I compared teaching to the test in schools to studying for an eye exam. You can read it (and see a horrible picture that seems to have deteriorated with time) here.

The same comparison will be made in a book I have coming out this fall on the pitfalls of education reform, and Peg Tyre used it in her book The Good School.

I also spoke at the AASA conference the following spring and you can see an article on the talk here and see a brief interview here.

Skills assessment

Lots of attention is now being given to the notion that a skills-based education is a good thing and in our test-obsessed culture many are starting to look for assessments that indicate the presence or absence of such skills. That’s likely to lead to a whole bunch of inauthentic behaviors if we aren’t careful.

Consider that researchers are quite adept at finding ways to identify the presence or absence of such things under the guise of research, but in order to do so must take a fairly circuitous route. Questions, observations, and a host of other data gathering efforts that distill information into usable chunks are extremely valuable in allowing a researcher to make statements regarding skill attainment in schools, but the data elements almost always represent a correlation that in turn enables the inference.

In order for the inference to be valid, however, the correlation must be to the desired behavior. The instant any sort of accountability is tied to the correlation the correlate will substitute for the desired behavior—the system will follow the formal definition of success, even if it differs from what is actually desired.

If our history with standardized testing offers any lessons, policy makers—who have a long history in education of confusing correlations with cause and effect—will at some point legislate success on the substitutes. They then run the risk of presuming that as the substitute measure climbs and falls it represents something real and meaningful, when odds are, like a test score in the current system, it is as much an indicator of the degree of manipulation the system has undergone as a measure of the presence or absence of a desired behavior.
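The substitution effect described above can be made concrete with a toy simulation (the numbers and variable names are entirely hypothetical, invented for illustration; this is not data from any school system): before accountability, a proxy score is just the underlying skill plus measurement noise and tracks it well; once the proxy itself is targeted, a "manipulation" component is added to the proxy but not to the skill, and the correlation degrades.

```python
# Hypothetical simulation of a proxy measure decoupling from the skill it
# once indicated, after accountability pressure targets the proxy itself.
import random

random.seed(0)  # fixed seed so the run is reproducible

def correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# 500 simulated students with an unobserved "true skill".
skill = [random.gauss(50, 10) for _ in range(500)]
# Before accountability: the proxy is skill plus measurement noise.
before = [s + random.gauss(0, 5) for s in skill]
# After accountability: test-prep effort inflates the proxy independently
# of skill, so the proxy rises while meaning less.
gaming = [random.gauss(20, 15) for _ in range(500)]
after = [s + random.gauss(0, 5) + g for s, g in zip(skill, gaming)]

print(f"correlation before accountability: {correlation(skill, before):.2f}")
print(f"correlation after accountability:  {correlation(skill, after):.2f}")
```

The average proxy score goes up in the "after" condition, which is exactly what makes the decoupling so hard to see from the scores alone.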

The fact that systems sway when accountability is placed on some component of them should come as no surprise—that is the intent. Educational policy makers have up to now shown not one bit of hesitancy in placing accountability on the thing that was used by researchers as the correlate to something larger and more important, rather than the thing itself, and in then deluding themselves into thinking that the correlate and the thing are one and the same.

The trouble for the skills movement is that the correlates often don't even resemble the desired skill. A simple commonsense look should suggest that the patterns of responses in a researcher's survey, or the presence of certain traits in an observer's checklist, that may correlate to the presence or absence of a desired skill are not themselves a representation of the skill. Such things are easily manipulated by anyone who realizes that a right answer exists, and even those looking to answer honestly will skew towards any right answers they know to exist. Just knowing that a right answer exists will have that effect when the results will be used as a basis for judgment.

Accounting for a skills-based environment is critical and doable, but we are going to have to do so using a very different set of tools than anything in the current educational toolbox.