One of the dangers of using test scores uncritically arises when we use a single statistical indicator to describe test results for accountability purposes. Even properly constructed, valid, and reliable tests present significant challenges as accountability devices. As the NRC report explains:
Incentives for educators are rarely attached directly to individual test scores; rather, they are usually attached to an indicator that combines and summarizes those scores in some way. Attaching consequences to different indicators created from the same test scores can produce dramatically different incentives. For example, an indicator constructed from average test scores or average test score gains will be sensitive to changes at all levels of achievement. In contrast, an indicator constructed from the percentage of students who meet a performance standard will be affected only by changes in the achievement of the students near the cut score defining the performance standard.

The same data taken from the same test scores yield significantly different results, depending upon the statistical measure used. The table below presents three ersatz classes of seven students each, for illustration. Imagine that the scores are the results of a math test, for which a score of 75 is deemed proficient. (This idea of setting a precise score as the cutoff for proficiency is a device that permeates our NCLB-based education environment.) At the bottom of each column, you will find the average class score (48, 51, and 54, respectively), the median class score (40, 51, and 54, respectively), and the percentage of students proficient (29, 14, and 42 percent, respectively).
Notice that the average scores of the three classes are within a few points of one another. Class C has no student doing exceptionally well, but its class average is marginally higher. In addition, if its teacher is to be evaluated by the proficiency rate, Class C would seem to be significantly superior to the other two. Do you agree that Class C is doing three times as well as Class B (even though the class averages are within three points of each other)? Notice that Class B has two kids on the "bubble," that is, just marginally below proficiency. If the teacher responsible for Class B raises her students' scores by 10 points, her class proficiency rating will rise from 14 percent to 42 percent, and she will be rated a superstar. If the teacher of Class A raises her students' scores by 10 points, she will still have only 29 percent of her students rated proficient, and we'll be saying that she made no appreciable difference in class proficiency.
|               | Class A | Class B | Class C |
|---------------|---------|---------|---------|
| Average score | 48      | 51      | 54      |
| Median score  | 40      | 51      | 54      |
| % proficient  | 29%     | 14%     | 42%     |
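To make the sensitivity of the proficiency-rate indicator concrete, here is a small Python sketch. The individual student scores are hypothetical, chosen only to reproduce the summary statistics quoted above; the source does not supply the student-level data. (With whole-number rounding, 3 of 7 students appears as 43 percent rather than the 42 reported above.)

```python
from statistics import mean, median

CUT = 75  # the proficiency cut score used in the text

# Hypothetical individual scores, invented to match the summary statistics
# given above (averages 48/51/54, medians 40/51/54, roughly 29%/14%/42%
# proficient); the original table's student-level scores are not reproduced.
classes = {
    "A": [20, 30, 35, 40, 45, 80, 86],
    "B": [20, 30, 40, 51, 66, 70, 80],  # two "bubble" kids at 66 and 70
    "C": [20, 30, 40, 54, 76, 78, 80],
}

def pct_proficient(scores, cut=CUT):
    """Percent of students at or above the cut, rounded to the nearest point."""
    return round(100 * sum(s >= cut for s in scores) / len(scores))

for name, scores in classes.items():
    print(f"Class {name}: avg={mean(scores):.0f}, median={median(scores)}, "
          f"proficient={pct_proficient(scores)}%")

# A uniform 10-point gain: Class B's rate roughly triples
# (both bubble kids cross the cut), while Class A's rate does not move.
for name in ("A", "B"):
    before = pct_proficient(classes[name])
    after = pct_proficient([s + 10 for s in classes[name]])
    print(f"Class {name} after +10 points: {before}% -> {after}%")
```

The point the script makes is the one in the text: the proficiency-rate indicator responds only to students near the cut score, so identical effort (a uniform 10-point gain) is rewarded dramatically for one teacher and not at all for another.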
When our local newspaper reports test score performance, and likely when yours does, it is reporting state measures that focus heavily on a very narrow range of measurements. Often, the results are heavily weighted toward the performance of disadvantaged students in order to highlight efforts to close the achievement gap. The NRC report tells us:
Given the broad outcomes that are the goals for education, the necessarily limited coverage of tests, and the ways that indicators constructed from tests focus on particular types of information, it is prudent to consider designing an incentive system that uses multiple performance measures. Incentive systems in other sectors have evolved toward using increasing numbers of performance measures on the basis of their experience with the limitations of particular performance measures. Over time, organizations look for a set of performance measures that better covers the full range of desired outcomes and also monitors behavior that would merely inflate the measures without improving outcomes.