Op-Ed: The DOE’s Teacher Evaluation System Has Obvious Flaws That Ought to Be Corrected Before Initial Implementation
The Department of Education states that the goal of its proposed teacher evaluation system, dubbed “AchieveNJ,” is to move “from a compliance-based, low-impact, and mostly perfunctory [evaluation system] to focus on educators as career professionals who receive meaningful feedback and opportunities for growth.”
Who could be against changes that move us in that direction?
Two natural questions to ask are: How well does the current system work? How much better is the proposed system?
Most of the explanations I’ve heard from the department were of the form, “Colorado does it this way.” As a child, when I used such an “explanation,” my mother rejected it perfunctorily with, “If Freddy jumped off a cliff, would that make it right?”
The road to reform is always difficult, but support for reform can be gathered, if, at each step along the way, evidence is gathered about the efficacy of the reform. Because the department has largely skipped these steps, the proposed system ignores the well-accepted paradigms of science.
One measure of teacher effectiveness will be student growth percentiles (SGPs). This seems like one of those ideas that only sound good if you say them fast. Dealing with percentiles means it is a zero-sum game; if one student goes up it must be counter-balanced by another who goes down. Such a system is hardly calculated to build a cooperative learning environment.
Moreover, it does not allow us to see overall trends, only comparative ones. If we measured children's heights, some would grow faster than others. The percentile changes would reveal this, but they would miss the big fact that all are growing. Both aspects -- absolute growth and comparative growth -- are important, but the focus on just percentile growth diminishes the more important aspect of absolute growth.
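The height example can be made concrete with a toy calculation. The sketch below uses five made-up heights and growth amounts (all of the numbers, and the simple "percent of the group that is shorter" definition of percentile rank, are assumptions for illustration, not the department's SGP formula):

```python
# Hypothetical heights in centimeters for five children (made-up numbers).
heights_before = [100, 105, 110, 115, 120]
growth = [12, 3, 6, 2, 5]  # every child grows, some faster than others
heights_after = [h + g for h, g in zip(heights_before, growth)]


def percentile_ranks(values):
    """Percent of the group that each value is strictly greater than."""
    n = len(values)
    return [100.0 * sum(other < v for other in values) / n for v in values]


ranks_before = percentile_ranks(heights_before)
ranks_after = percentile_ranks(heights_after)

# Change in percentile rank for each child, in percentile points.
rank_change = [after - before for before, after in zip(ranks_before, ranks_after)]

print("absolute growth (cm):", growth)
print("percentile change:   ", rank_change)
print("sum of percentile changes:", sum(rank_change))
```

With no ties in the data, one child's gain in rank is exactly offset by another's loss, so the percentile changes sum to zero even though every child grew.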
The department has assured teachers that, “Students at the highest end of proficiency can also show growth -- so no educator is ever ‘penalized’ for teaching students at any achievement level.”
How can this be true? If a student gets a score of 40 percent, there is a lot of room to improve. With a score of 95 percent there isn’t.
And the use of percentiles instead of percentages makes things worse. Score distributions have lots of people in the middle and get sparse at the extremes, so a small gain in the middle of the distribution jumps you over a lot of people. For example, a 10-point gain on the SAT from 790 to 800 jumps over about 4,000 to 5,000 people (out of the 1.5 million who took it), whereas a 10-point gain in the middle of the distribution, from, say, 500 to 510, jumps over 54,000 other people. By these figures, the gain in percentile rank is more than ten times greater in the middle than in the tail.
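Taking the figures quoted above at face value, the arithmetic can be checked in a few lines. This is only a back-of-the-envelope sketch; the function name is hypothetical, and the counts are the rounded ones from the text:

```python
TEST_TAKERS = 1_500_000  # approximate number of SAT takers cited in the text


def percentile_gain(people_passed, total=TEST_TAKERS):
    """Percentile points gained by moving past `people_passed` test takers."""
    return 100.0 * people_passed / total


# A 10-point gain at the top (790 to 800) passes roughly 4,000 to 5,000 people.
tail_low = percentile_gain(4_000)
tail_high = percentile_gain(5_000)

# A 10-point gain in the middle (500 to 510) passes roughly 54,000 people.
middle = percentile_gain(54_000)

print(f"tail gain:   {tail_low:.2f} to {tail_high:.2f} percentile points")
print(f"middle gain: {middle:.2f} percentile points")
print(f"ratio:       {middle / tail_high:.1f}x to {middle / tail_low:.1f}x")
```

The same 10 scale points are worth a fraction of a percentile point at the top and several percentile points in the middle.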
Using student growth objectives (SGOs) -- another metric called for by the department -- is much easier said than done. Goals must be stated, typically, as claims; then evidence must be described that would serve to support those claims; finally, tasks must be developed that generate that evidence.
But one must be very specific. Otherwise you can have two classes that may have the same claim but the tests that generate the evidence to support that claim could be very different in difficulty.
Teachers make up tests all the time, but usually the test has two purposes: to prod the students into studying and to guide further study/instruction. When you add a further purpose -- the formal evaluation of the teacher and the principal -- the test score must carry a much heavier load.
In a study of teacher-developed tests, McGill University's renowned psychometrician, J. O. Ramsay, found that in no case were such tests reliable enough for even rudimentary use in such sensitive measurement tasks as these. And if teachers are to be compared on the results that their students have on such tests, we must have a common metric. That means that the tests given by one teacher must be formally equated to those given by another. How can this be done?
These are weighty issues that must be addressed. I had thought that this was the purpose of the pilot programs, but have seen nothing at all that analyzes those pilots in any rigorous way. Indeed, from what I have heard, the program will go into effect before such analyses are completed. This is akin to the old Western line, “You’ll have a fair trial and then we’ll hang you.” Except, in this instance it is, “We’ll hang you and then you’ll have a fair trial.”
The department has suggested that portfolios be used to measure student growth, but this is another idea that only sounds sensible if you say it fast. When portfolios were used as part of a statewide testing program in Vermont about 15 years ago, the effort was a colossal failure. The results were so unreliable and unpredictable, and the program so fantastically expensive, that state officials abandoned it.
This illustrates a deeper lesson: Some measurement methods that work acceptably well at a classroom level do not scale. A folder of a student’s work that is produced for a parent-teacher conference illustrates what is going on and directs the discussion. But when the folder is reified as a “Portfolio Assessment,” we begin to get into trouble, for it is at this stage that we ask more from it than it can provide. Research shows that portfolios are well suited for one purpose but not the other. What would make New Jersey’s use different?
The goals of AchieveNJ are laudable, but these proposals should be viewed as tentative first steps. Before anything is implemented, a great deal more thought, resources, and time must be allocated to assess how well each innovation works.
I understand that the pressure of funding from “Race to the Top” was overwhelming, but perhaps even U.S. Secretary of Education Arne Duncan might be convinced that it would be worth funding at least one state that had a rigorous evaluation plan as part of its proposal.
Without one, I fear that Christopher Hitchens’ pithy observation will unfortunately hold true in New Jersey: “That which is asserted without evidence, can be dismissed without evidence.”
That would be too bad.