Meaningless measurements remain meaningless

One of the Education Reform Movement's most touted innovations is the implementation of value-added methods of teacher evaluation. Under these methods, teachers are not evaluated by the absolute skill level their students show on standardized tests; instead, they are evaluated by how much their students improve during their time with that teacher.

This all sounds great, except of course that this evaluation method doesn't work. If the evaluation actually captured something meaningful about teacher performance, you would expect a given teacher's results to be steady from year to year. Instead, teachers have dramatically different scores from year to year:

One study found that across five large urban districts, among teachers who were ranked in the top 20% of effectiveness in the first year, fewer than a third were in that top group the next year, and another third moved all the way down to the bottom 40%. Another found that teachers’ effectiveness ratings in one year could only predict from 4% to 16% of the variation in such ratings in the following year. Thus, a teacher who appears to be very ineffective in one year might have a dramatically different result the following year. The same dramatic fluctuations were found for teachers ranked at the bottom in the first year of analysis. This runs counter to most people’s notions that the true quality of a teacher is likely to change very little over time and raises questions about whether what is measured is largely a “teacher effect” or the effect of a wide variety of other factors.
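To get a feel for what "4% to 16% of the variation" means, here is a rough simulation sketch. It rests on assumptions that are mine, not the studies': that the quoted figure is an R-squared from a simple linear relationship (so the year-to-year correlation is its square root, roughly 0.2 to 0.4), and that ratings in both years are normally distributed. The function name and parameters are made up for illustration.

```python
import math
import random


def top_group_persistence(r_squared, n_teachers=20000, top_share=0.2, seed=0):
    """Estimate the share of year-one top-group teachers who stay in the
    top group in year two, for a given year-to-year R-squared.

    Assumption: ratings are standard normal in both years and related
    linearly, so correlation = sqrt(R-squared).
    """
    rng = random.Random(seed)
    rho = math.sqrt(r_squared)
    residual_scale = math.sqrt(1.0 - r_squared)

    year1, year2 = [], []
    for _ in range(n_teachers):
        first = rng.gauss(0, 1)
        unrelated = rng.gauss(0, 1)
        year1.append(first)
        # Year-two rating: part carried over from year one, part unrelated.
        year2.append(rho * first + residual_scale * unrelated)

    top_count = int(n_teachers * top_share)

    def top_set(scores):
        # Indices of the teachers with the highest scores.
        ranked = sorted(range(n_teachers), key=lambda i: scores[i], reverse=True)
        return set(ranked[:top_count])

    return len(top_set(year1) & top_set(year2)) / top_count


if __name__ == "__main__":
    for r2 in (0.0, 0.04, 0.16, 1.0):
        share = top_group_persistence(r2)
        print(f"R^2 = {r2:.2f}: {share:.0%} of the top 20% stay in the top 20%")
```

An R-squared of 0 is the names-out-of-a-hat case, where about 20% stay in the top group by pure chance, and an R-squared of 1 is a perfectly stable rating. The interesting question is how close the 4% and 16% rows land to the chance row.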

Nearly every time this problem is raised, education reformers point out that this is not the only metric being used, and that in-class evaluations and other measures count as well. This might seem like a reasonable rebuttal at first glance, but on further inspection it is completely incoherent. If the value-added evaluation method is totally meaningless and does not appear to track teacher quality in any reliable way, mixing it into a cocktail of other evaluations does nothing to make it less meaningless.

Imagine that instead of using value-added methods to evaluate teachers, we just drew names out of a hat to set teacher rankings. Someone reasonably objects that this is totally meaningless and stupid. So in response, we add more evaluation methods on top of the hat-drawing method. Does that make any sense at all? Does the hat-drawing method become any less arbitrary because we have added more sensible methods alongside it? Obviously not.

If a measurement is meaningless, you get rid of it. Mixing it with more meaningful measurements does not make it any better; all it does is dilute the usefulness of the other measurements you are using.
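The dilution point can be checked with a toy simulation. This is only a sketch under assumptions of my own: each teacher has a fixed "true quality", the sensible measurement is that quality plus some noise, and the meaningless measurement is a pure random draw (the hat). All names and the noise levels are hypothetical.

```python
import math
import random


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def dilution_demo(weight_on_hat=0.5, n_teachers=20000, seed=0):
    """Return (correlation of the sensible measure with true quality,
    correlation of the blended score with true quality)."""
    rng = random.Random(seed)

    # Assumption: true quality is standard normal.
    quality = [rng.gauss(0, 1) for _ in range(n_teachers)]
    # A sensible but imperfect measurement: quality plus noise.
    observation = [q + rng.gauss(0, 1) for q in quality]
    # The meaningless measurement: unrelated to quality entirely.
    hat_draw = [rng.gauss(0, 1) for _ in range(n_teachers)]

    # Blend the two, as a multi-metric evaluation system would.
    blended = [
        (1 - weight_on_hat) * obs + weight_on_hat * hat
        for obs, hat in zip(observation, hat_draw)
    ]
    return pearson(observation, quality), pearson(blended, quality)


if __name__ == "__main__":
    alone, blended = dilution_demo()
    print(f"sensible measure alone:   {alone:.3f}")
    print(f"blended with hat drawing: {blended:.3f}")
```

In this setup the blended score tracks true quality worse than the sensible measurement does on its own, and the gap grows as `weight_on_hat` goes up.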