An inventory: 11 issues with value-added studies (evaluations based on student test scores)


Even though most standardized tests for K-12 students were designed with the individual student’s learning in mind, we often gravitate towards the use of such scores when we seek to trace the success of a teacher, program, or school. They seem so objective, so unambiguous, so well suited to the task. And often our impulse is to reward teachers whose students score highest, to demand more of the rest, and perhaps to direct more resources their way.

Now at some point it may become apparent to folks that schools and districts with the highest average scores are also those with the most affluent student populations. (Systematic studies consistently find that income accounts for 60-80% of the variation in test scores among different groups.) Realizing this, many will call instead for tests at the start and end of the school year and advocate rewarding the schools and teachers whose students show the greatest improvement. This is an eminently natural position to take. But it turns out to be fraught with an extraordinary number of challenges.

Statistical models for assessing contributions to student learning (value-added models, if you can stomach that awkward yet adhesive term) have received intense scrutiny among educational researchers in the past five years. I became attuned to this literature after I ran into some serious roadblocks in an effort to isolate contributors to reading and math improvement for children in a school touched by the WIDE World program. I began to see that gathering more complete data and using more sophisticated methods (hierarchical linear models, propensity scores) could solve only some of my problems. And I began to collect my own and more distinguished researchers’ impressions of the hurdles one might need to overcome to develop a sound explanatory model of test score change. What follows is a list of these issues. While no study is likely to involve all of them, most will bump up against quite a few.

1. Studies can easily conflate effects from the individual student; from being among certain other students; from the teacher; from the intervention; and from the school. To what should we compare a certain result — to the results that would have occurred if the student(s) had not been in school at all? If they had been in school, but had stayed in the previous grade? If they had been in a different school? With a different set of classmates? With different teachers? Questions such as these are too often neglected, detracting from the soundness of research claims. (1), (5)

2. Few students study under just one teacher, making it perilous to try to attribute gains or losses to an individual teacher.

3. What students in some classes learn may spill over and reach students in other classes (“contamination”).

4. Groups of students do not often stay together long-term, so while students may exert effects on one another — which are difficult enough to measure — these are extremely difficult to track longitudinally. (1), (5)

5. It’s difficult to separate past effects (of any of these types — from teacher, school, or set of classmates) from more recent ones. (1)

6. Variations in the policies by which schools assign students to special education or ESL programs can distort results, as can any pattern of biased exclusion of students from testing. Students who are not promoted are left out of any calculation of year-to-year change, even though including them would lower the group score. (According to Walt Haney, this is one source of the spurious “Texas Education Miracle” of the late 1990s.) (1), (2)

7. Inclusion of different-enough schools in a study means one must extrapolate to a point beyond the reasonable. E.g., suppose that, within schools with 0%-30% limited-English-proficient (LEP) students, each difference of 10 percentage points in LEP is linked with an average test-score difference of 3 points. That would mean a 3-point score difference between a 0% LEP school and a 10% one, and a 6-point difference between a 0% school and a 20% one. However, for a school with a % LEP far outside that range, such as 60%, that relationship may not hold at all. The slope might get much flatter or much steeper. In such cases trying to adjust or control for % LEP would yield misleading results. (1)
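
The arithmetic of this extrapolation trap can be made concrete. The sketch below simply encodes the hypothetical slope from the example (3 score points per 10 percentage points of LEP, observed only within the 0%-30% range); the function name and numbers are illustrative, not from any actual study.

```python
# Hypothetical linear adjustment from the example above: 3 score points
# per 10 percentage points of LEP, estimated only on 0%-30% LEP schools.
def predicted_score_gap(lep_diff_pct, slope_per_10pts=3.0):
    """Score gap implied by a linear adjustment for % LEP."""
    return slope_per_10pts * lep_diff_pct / 10.0

print(predicted_score_gap(10))  # 3.0 -- inside the observed 0%-30% range
print(predicted_score_gap(20))  # 6.0 -- still inside the range
# Applying the same slope to a 60% LEP school reaches far outside the
# data; the relationship might flatten or steepen there, so this number
# is an artifact of the model, not evidence.
print(predicted_score_gap(60))  # 18.0
```

The danger is that a regression adjustment happily produces the 18-point figure without any warning that it rests on an unsupported linearity assumption.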

8. Thomas Kane and Douglas Staiger have shown that 50-85% of year-to-year variation in group test scores can be attributed simply to yearly fluctuations in the academic levels of incoming student cohorts. In other words, to noise: to something that has nothing to do with the teacher’s or program’s effectiveness. Differences between student groups within a year figure to be subject to noise as well. The authors also convincingly show that, because group averages fluctuate much more for small groups, it is the smaller schools that are more apt to suddenly rise to the top or sink to the bottom, netting them undeserved rewards or penalties. Such outstanding schools almost always end up closer to the middle of the pack the following year, demonstrating the principle of regression to the mean. Their exceptional standing is due not to anything noteworthy, such as an instructional change, but only to chance. (3)
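
The small-school volatility point is easy to demonstrate by simulation. In the sketch below (entirely hypothetical numbers), every cohort at every school is drawn from the same score distribution, so any year-to-year movement in a school’s average is pure sampling noise; the small school’s average swings roughly four times as much as the large school’s (sigma/sqrt(25) = 2.0 vs. sigma/sqrt(400) = 0.5).

```python
import random
import statistics

random.seed(0)

def cohort_mean(n_students, mu=50.0, sigma=10.0):
    """Average score of one randomly drawn cohort of n_students."""
    return statistics.mean(random.gauss(mu, sigma) for _ in range(n_students))

# Simulate many "years" of incoming cohorts for a small and a large school,
# drawn from the SAME underlying distribution (no real quality difference).
small_school = [cohort_mean(25) for _ in range(1000)]   # 25 students/year
large_school = [cohort_mean(400) for _ in range(1000)]  # 400 students/year

# The small school's yearly average is far more volatile, so it is far
# more likely to land at the top or bottom of any given year's ranking.
print(statistics.stdev(small_school))
print(statistics.stdev(large_school))
```

A ranking scheme that rewards the year’s extremes will therefore systematically single out small schools, for reasons that have nothing to do with instruction.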

9. Student performance in different subjects must be assessed via different instruments. It would be pointless to try to use a single instrument such as the SAT whether testing reading, world history, or Advanced Placement physics. And different tests vary in their propensity to show change, either because of differences in the relative difficulty of the pre- and post-test versions or because of differences in either version’s validity and reliability. This fact complicates any study involving multiple subject areas or multiple grades.

10. Since virtually all (all?) standardized tests in education rely to some degree on students’ reading ability, value-added research results in all subjects other than reading will be compromised unless all students have achieved a certain reading level. One’s ability to think effectively with social science, math, or science material will not be picked up by a test unless that test is properly matched to the student’s reading ability. Moreover, group comparisons are potentially invalidated if some groups are more affected by this problem than others.

11. It is often desirable to try to relate student outcomes to some kind of indicator of baseline teaching effectiveness. Some examples are years of teaching experience; type of certification or teacher preparation program; educational degree; professional development points; and experts’/administrators’ ratings. Unfortunately, the first two of these have been fairly conclusively shown to be largely unrelated to test score outcomes, based on a recent, very large-scale study in New York City. (4) The other three variables seem unpromising based on WIDE’s recent evaluation work, including an unpublished urban school study involving about 25 teachers and 300 students. This is not to say that teacher quality itself does not matter. Indeed, Kane et al.’s recent paper makes evident the very great need for some usable measure that can serve as a proxy for teacher quality.

Rubin, Stuart, and Zanutto (1) and Damian Betebenner (5) make several suggestions that I find to be key for thoughtful research using value-added models. Three seem to be the most important:

  • Randomize to the extent possible.
  • Collect data on as many relevant variables as possible; statistical control of these, while far inferior to equalizing through randomization, is still useful.
  • Be very careful to think through, and make explicit, your assumptions. The best analytical method for a particular study and research question will depend on these assumptions. Example: Is it reasonable to expect that no improvement would occur absent a certain intervention? If so, it makes sense to analyze gain scores, as with analysis of variance. Is it instead reasonable to think that all students would improve to some degree even without the intervention, and that their posttest score could be predicted as a linear function of their pretest score? If so, analysis of covariance would make sense.

I suppose it is clear by now that I am pessimistic about the prospects of modeling standardized test scores, or changes therein, as a way of isolating the contributing factors in student achievement/improvement. Rubin et al. take a stronger stand (p. 18):

[… We] do not think that [most value-added] analyses are estimating causal quantities, except under extreme and unrealistic assumptions. We argue that models such as these should not be seen as estimating causal effects of teachers or schools, but rather as providing descriptive measures. It is the reward structures based on such value-added models that should be the objects of assessment, since they can actually be (and are being) implemented.

(1) Donald B. Rubin, Elizabeth A. Stuart, and Elaine L. Zanutto (2003). A potential outcomes view of value-added assessment in education.
(2) Walt Haney (2000). The myth of the Texas miracle in education.
(3) Thomas Kane and Douglas Staiger (2002). Volatility in school test scores: Implications for test-based accountability systems.
(4) Thomas Kane, Jonah Rockoff, and Douglas Staiger (2006). What does certification tell us about teacher effectiveness? Evidence from New York City.
(5) Damian Betebenner (2006). Lord’s Paradox with Three Statisticians (Presented at 2006 AERA Annual Meeting in San Francisco; seems to be temporarily unavailable on the Internet.)


Are WIDE World surveys representative?


WIDE World’s evaluation efforts rely quite a bit on surveys. Since response rates have been about 55% for our end-of-course evaluations and about 20% for our one-year follow-up surveys, it is natural to ask what gives us confidence that the fraction who respond constitute a representative sample of course participants. Our web page, Course Evaluations, briefly discusses the ways we try to address the issue (both at the top and bottom of the page). The current piece will fill in some gaps and add some detail.

We regularly check to see whether survey respondents have, on average, characteristics similar to those of nonrespondents. For example, we check whether our findings are applicable regardless of region, teaching subject, level of experience, or educational degree. Generally we find that they are, and the one glaring exception turns out to be less important than one might think. Respondents tend to have amassed more participation points than nonrespondents have. But level of participation shows such low correlations (r ~ 0.0 to 0.2) with important outcomes such as course satisfaction, or appraisal of the course’s effects, that for these outcomes nonrepresentativeness in participation is scarcely an issue. And to reiterate what is described on the page linked above, in 2005 our special effort to individually reach nonrespondents (especially those with low participation) revealed surprisingly few differences between their opinions of the courses and those of respondents.
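
A rough way to see why a weak correlation limits the damage: under a simple linear model, if respondents’ participation averages d standard deviations above nonrespondents’, and participation correlates r with an outcome, the respondent outcome mean is shifted by roughly r × d outcome standard deviations. The sketch below is a back-of-envelope calculation with hypothetical numbers, not an analysis of our actual data.

```python
# Approximate nonresponse bias in an outcome, in outcome SDs, under a
# simple linear model: bias ~= r * (participation gap in SDs).
def implied_bias_in_sds(r, participation_gap_sds):
    """Rough outcome bias implied by a participation imbalance."""
    return r * participation_gap_sds

# Hypothetical gap: respondents 0.5 SD higher in participation.
print(implied_bias_in_sds(0.2, 0.5))  # 0.1 SD -- at the top of our r range
print(implied_bias_in_sds(0.0, 0.5))  # 0.0 SD -- at the bottom of it
```

Even at the high end of the observed correlations, the implied distortion in mean satisfaction or appraisal is on the order of a tenth of a standard deviation.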

As for our recent one-year follow-up survey, it is true that respondents were more likely than others to have gone on to become coaches in our program. However, the numbers in this group were small, and adjusting for this imbalance would have exerted only a slight effect on the findings reported at One Year Follow-up. Our paper entitled Beyond Self-Report, p. 12, provides another examination of such factors as they pertain to a follow-up survey from the previous year.


Inquiry-based Learning and Teaching for Understanding


The TfU framework has been used widely at very different educational levels (K-12 classrooms, higher education, teacher education, professional development) to nurture inquiry-based learning and teacher learning about inquiry. In many ways, Teaching for Understanding provides a useful scaffold for inquiry approaches. Both share a performance view of learning, which is at the heart of an inquiry approach (learning through active, reflective doing). TfU goes further in providing structure by asking teachers to:

  • think hard about what they want students to learn from their inquiry experience (UGs), then
  • think even harder about where to aim inquiry in the service of getting at those learning goals (Performances of Understanding), and
  • think even harder still about how teacher and student will both gauge that learning as it’s taking place in ways that deepen and extend that learning (Ongoing Assessment).

Moreover, TfU’s take on performances of understanding helps teachers discern different kinds of performances appropriate to different stages of inquiry (messing about, guided inquiry, culminating).

Useful sources: The Exploratorium’s Institute for Inquiry in San Francisco has a page on Inquiry Education Research. Four of the 18 articles listed for further reading about inquiry learning are from the Teaching for Understanding Project. These include:

Brandt, Ron. “On Teaching for Understanding: A Conversation with Howard Gardner.” Educational Leadership, v50 n7, April 1993.

Perkins, David, and Tina Blythe. “Teaching for Understanding: Putting Understanding Up Front.” Educational Leadership, v51 n5, February 1994.

Perkins, David. “Teaching for Understanding.” American Educator: The Professional Journal of the American Federation of Teachers, v17 n3, pp. 8, 28-35, Fall 1993.

Wiske, Martha Stone. “How Teaching for Understanding Changes the Rules in the Classroom.” Educational Leadership, v51 n5, pp. 19-21, February 1994.

Examples of Inquiry w/ TfU…

“Try Science,” designed with Teaching for Understanding: Best Practices Applied in Online Science Teacher Education, by June Talvitie-Siple, ABD, University of North Carolina-Chapel Hill, USA…

Treagust, D. F., Jacobowitz, R., Gallagher, J. L., & Parker, J. (2001). Using assessment as a guide in teaching for understanding: A case study of a middle school science class learning about sound. Science Education, 85, 137-157.

Inquiry and the National Science Education Standards: A Guide for Teaching and Learning (National Research Council, 2000) provides an overview of inquiry-based teaching strategies and examples of what inquiry teaching and learning look like, both inside and outside the classroom. Underlying all of this work is the notion that teaching for understanding requires more in-depth treatment of a smaller number of key ideas, which is sometimes summarized by the slogan “less is more.”


Research about Teaching and Learning with Technology and Student Outcomes


A WIDE Team in TSNT recently asked their coach for information comparing “the costs of using new technology and their benefits in terms of educational outcomes with the cost of more traditional programs and their respective educational outcomes.” I (DES) had to stare at this question a bit. It really involves three steps:

(a) the relationship between teaching and learning with technology and student outcomes;

(b) comparing student outcomes for teaching and learning with technology and for teaching and learning without technology;

(c) comparing benefits (in terms of student outcomes) with costs, for t&l with tech and for t&l without.

When you get to (b) above, comparing t&l with and without tech., things become a bit more confusing. Clear-headed researchers would say, just like the Vermont farmer, “can’t get theah from heah,” because merely introducing an innovation changes the equation. So you have to spend a lot of time justifying why the comparison groups are equivalent, and that already puts you on shaky ground. I’m not ducking (c), the comparative cost-benefit question, by claiming that (b), the comparison, is a bridge too far; it’s just that I wouldn’t trust any study that got to (c) without some firm footing in (b). So that’s why I’m sticking with (a)!

The most robust research has to do with (a) anyway. NCREL has done some good work in this arena. See, in particular, the study “A Meta-Analysis of the Effectiveness of Teaching and Learning with Technology on Student Outcomes” by Hersh Waxman (2003).

WestEd published the following study: Ringstaff, C. & Kelley, L. (2002). The learning return on our educational technology investment. San Francisco, CA: WestEd.

A bit more dated and less rigorous, but still helpful, meta-analysis was done under the aegis of the Milken Foundation and was also published in a peer-reviewed journal: Schacter, J., & Fagnano, C. (1999). Does computer technology improve student learning and achievement? How, when, and under what conditions? Journal of Educational Computing Research, 20(4). These articles all point to distinct, situation-specific, context-dependent advantages, all of which make addressing the comparison question, (b), trickier and trickier.


Does Teaching for Understanding Improve Student Achievement?


No wide-scale, experimental studies have been conducted of the Teaching for Understanding framework, but several wide-scale studies of constructivist teaching have made connections between instructional practices and student gains on standardized tests. Among the most recent and rigorous is that of Martin Abbott and Jeffrey Fouts of the Washington School Research Center. In their study, Constructivist Teaching and Student Achievement, Abbott and Fouts conducted observations of 669 classrooms in 2000-2001 and found, after controlling for income, that constructivist teaching was clearly correlated with higher student achievement on standardized tests in reading, writing, and math. While analyses such as this are seldom straightforward, these researchers felt that the benefit of constructivist teaching would have appeared even greater had certain study variables been aggregated differently. We’d add that more precise measurement of what constitutes constructivist teaching would have strengthened the correlations as well. In the National Research Council’s synthesis of research on student learning in history, math, and science, How Students Learn, the authors write the following:

. . . in some cases there is evidence that teaching for understanding can increase scores on standardized measures (e.g., Resnick et al., 1991); in other cases, scores on standardized tests are unaffected, but the students show sizable advantages on assessments that are sensitive to their comprehension and understanding rather than reflecting sheer memorization (e.g., Carpenter et al., 1996; Secules et al., 1997). (p. 177)

Relevant here is the statement by leading educational evaluator Jim Popham, of UCLA, that “the vast majority of [state standardized tests] are instructionally insensitive—that is, they’re unable to detect even striking instructional improvements when such improvements occur” [1]. Since arguably the most prominent trend in educational testing over the last five years has been greater demand for tests that tap into higher-order thinking, we may soon start to see research results that show greater effects from instructional changes such as the adoption of the Teaching for Understanding approach. Other reports that have received attention in linking constructivist teaching to student outcomes include:


[1] Popham, W. James (2006). “Assessment for Learning: An Endangered Species?” Educational Leadership, 63(5), pp. 82-83.


Roots of Teaching for Understanding


The Teaching for Understanding framework has been widely used to plan, conduct, and assess teaching aimed at developing learners’ capacities to apply their understanding flexibly in varied situations. The framework derives from a multi-year collaborative research program that synthesized contemporary theories of pedagogy with analyses of effective classroom practice [1]. Both strands of research (theoretical and practice-based) draw on over three decades of studies of understanding, learning, and teaching conducted by researchers at the Harvard Graduate School of Education’s Project Zero. Broadly, Teaching for Understanding, or TfU, is rooted in constructivist notions of the nature of knowledge, student cognition, and instruction. Talbert, McLaughlin, and Rowan [2] describe constructivist teaching in the following way:

The constructivist view of effective classroom instruction is often called ‘teaching for understanding,’ and research on this topic has become a priority for educational policy makers. The importance of this form of teaching lies in its potential to enhance the kinds of cognitive outcomes for students that the American educational system has heretofore been notoriously ineffective at producing. While American schools have been relatively successful in engendering basic-skills achievement, they have not done well in promoting students’ success in tasks variously described as problem solving, critical analysis, higher-order thinking, or flexible understanding of academic subject matter - learning outcomes associated with teaching for understanding. (p. 47)


[1] Published works include:

Blythe, T. (1998). The teaching for understanding guide. San Francisco: Jossey-Bass.

Wiske, M. S. (Ed.). (1998). Teaching for understanding. San Francisco: Jossey-Bass.

Wiske, M. S., Rennebohm Franz, K., & Breit, L. (2005). Teaching for understanding with new technologies. San Francisco: Jossey-Bass.

Wiske, M. S., & Perkins, D. N. (2005). Dewey goes digital: The wide world of online professional development. In C. Dede, J. P. Honan & L. C. Peters (Eds.), Scaling up success: Lessons learned from technology-based educational innovation. San Francisco: Jossey-Bass.

[2] Talbert, Joan E., McLaughlin, Milbrey W., and Rowan, Brian. “Understanding Context Effects on Secondary School Teaching.” Teachers College Record (Fall 1993), 45-68. As quoted in Teaching for Understanding: Educating Students for Performance by Ken Kickbusch.




A compendium of musings from the research staff of WIDE World, Harvard Graduate School of Education.
