Ceiling, Floor Effects Distort Teacher Evaluation Metric
Tuesday, February 28th, 2012
New York City has done a bold thing with the evaluations of its public school teachers: it has put the scores online for anyone to view. Though this action may be a source of distress for the city’s teachers, it is useful for the public, especially as a way to critique not education quality but the metric itself. Specifically, I am curious to know how teachers at high and low performing schools are evaluated by the city’s metrics. This question is interesting because New York City uses a “value-added” approach to teacher evaluation. The city predicts how well the students in a teacher’s class will perform and then attributes any difference between expected and actual performance to the teacher. This is intended to balance out the effects of poverty, parent education, past test scores, etc.
The problem is that there is good reason to suspect that teachers in high and low performing schools are subject to increased error in their evaluations. And if the size of this error depends on the score being measured, then a key assumption of testing theory is violated. Think of it this way. Let’s say you are trying to use a set of addition problems to measure how well some students have learned math. You believe that the score on the math test is a decent measure of the students’ actual learning, as long as you also allow for some error. That error could be caused by a host of factors; say, for example, one student didn’t sleep well the night before while another ate a great breakfast and is feeling alert. As long as this error is random, we won’t be concerned about it distorting scores. But now imagine that half the class is blind and the test is offered only in print. In this case, the test score should explain much less of the actual knowledge (and the error should be much bigger) for the blind students than for the sighted students. Thus, we conclude that our test is not an accurate measure of student learning.
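The assumption in question can be written compactly in the notation of classical test theory. The symbols below (X for the observed score, T for the true score, E for the error) are introduced here for illustration and do not come from the city’s reports.

```latex
\[
\begin{aligned}
X &= T + E, \qquad \mathbb{E}[E] = 0, \qquad \operatorname{Cov}(T, E) = 0 \\
\text{reliability} &= \frac{\sigma_T^2}{\sigma_X^2} = \frac{\sigma_T^2}{\sigma_T^2 + \sigma_E^2}
\end{aligned}
\]
```

In the example of the printed test, the error variance is much larger for the blind students, so the reliability of the score is much lower for that group. A hard upper limit on scores causes a related problem: the error tends to be negative exactly when the true score is high, so the error and the true score are no longer uncorrelated.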
The same thing may be occurring for teachers in high performing schools. Consider this New York Times article, which discusses 73 cases of teachers whose students produced exceptionally high scores (at or above the 84th percentile) but who were rated as below average. This occurred because the city’s metric expected the students to perform even better.
What may be occurring here is called a “ceiling effect,” a case in which the measurement can go no higher and thus limits the upward expansion of the data. In this case, the students are expected to perform so well that the teacher’s own efforts have little room to show up in the scores. Any teacher could do well with students whose parents are highly involved, perhaps by working through schoolwork with their children each night or paying for additional tutoring after school. In the most extreme case, the city’s metric might predict that all students in a given classroom perform at the 99th percentile on the test. In this case, it is by definition impossible for the teacher to be rated above average, since the students have no room to beat the prediction.
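A small simulation can make the ceiling effect concrete. This is a minimal sketch under invented assumptions, not the city’s actual model: the 100-point test, the function name `simulate_value_added`, the 5-point teacher effect, and the noise level are all hypothetical. It gives the same teacher two classrooms, one predicted to score 70 and one predicted to score 99, and computes the average gap between observed and predicted scores.

```python
import random
import statistics

MAX_SCORE = 100.0  # hypothetical ceiling: the test cannot record anything higher


def simulate_value_added(predicted_mean, teacher_effect,
                         n_students=1000, noise_sd=5.0, seed=0):
    """Return the mean (observed - predicted) score for one simulated classroom.

    All parameters are illustrative assumptions, not values from the city's metric.
    """
    rng = random.Random(seed)
    residuals = []
    for _ in range(n_students):
        # What the student would score if the test had no upper limit.
        latent = predicted_mean + teacher_effect + rng.gauss(0.0, noise_sd)
        # The recorded score is capped at the ceiling.
        observed = min(latent, MAX_SCORE)
        residuals.append(observed - predicted_mean)
    return statistics.mean(residuals)


if __name__ == "__main__":
    for predicted in (70.0, 99.0):
        value_added = simulate_value_added(predicted, teacher_effect=5.0)
        print(f"predicted {predicted:5.1f} -> measured value added {value_added:+.2f}")
```

With a prediction of 99 on a 100-point test, no student can beat the prediction by more than one point, so the measured value added can never exceed 1 no matter how large the teacher’s true effect is. The same teacher in the classroom predicted to score 70 is measured at close to the full 5 points.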
The same process may occur as a “floor effect,” whereby the lowest performing students can do no worse. In these cases, it should be much easier for teachers to receive at least an average rating, given how low the expectations are for their students. That’s not to suggest that these teachers can show up and do nothing all day. But it does indicate a problem with the evaluation system. For teachers with either a very high or a very low bar, the feedback given by the performance evaluations is much less useful than it is for a teacher whose students have average expectations.
The article points out that the city has found the metric to be “too sensitive,” and Sean Corcoran, associate professor of educational economics at New York University, has also noted that, even among the city’s best students, teachers will still have stratified ratings. This recognition is valuable, and the city should be credited with acknowledging problems in the system.
The trouble is that acknowledgement doesn’t go nearly far enough. The ratings do not meet basic assumptions in testing theory, so we should question what they actually mean. Yet they have been released to the public and will (likely) be used in making decisions about teacher performance. These ratings should instead serve as a dry run for changes that must be made before they can be used to make administrative decisions. Only through careful adjustment and fine-tuning, including removing the correlation between error and measured score, can the city’s metric be applied fairly. As it stands now, it is only doing a disservice to teachers, students, parents, and administrators.