
Testing: How the Sausage is Made

By Alain Jehlen

How do standards become test scores?
What do 'proficient' and 'grade level' really mean?

How are high-stakes tests dumbing down education?

These aren’t questions you’ll see on any test, but the answers might surprise you.

Little Jimmy opens his test booklet and reads:

What number goes in the box to make this number sentence true?
11 − ? = 2

Your whole year's work has come down to this. If he gets the right answer, your school is on its way to the modern Holy Grail: Adequate Yearly Progress. If not, you're a failure.

But how did that question get in front of him?

Each year, hundreds of millions of test questions are developed, answered, and scored. Some 45 million tests required by the Elementary and Secondary Education Act, better known as No Child Left Behind (NCLB), will be administered this year.

The industry rakes in more than half a billion dollars a year for these tests, but its spokespeople insist their profit margins are tight because of fierce competition and ultra-tight deadlines. To you, it may seem slow if you get your scores in six weeks, but for them, the pace is frantic.

These one-size-fits-all instruments increasingly dictate what educators do for most of the rest of the year. (That, incidentally, is what “standardized” means: the same test taken under the same conditions by all those wildly different students.)

The big question: Has the focus on tests produced students who are better educated or more competitive in the world economy? Probably not. High national test scores don't correlate with healthy economies, according to a study by researcher Keith Baker (see What's in a Score?), and all the intense, high-stakes testing hasn’t had any visible impact on national test scores anyway.

That is, while scores on the high-stakes state tests go up as teachers focus on them, students do no better on other, broader achievement tests.

What's the alternative? We could make more use of a testing apparatus that adjusts to every child, evaluates results quickly, and immediately makes appropriate changes in instruction: the human educator. That's why NEA and more than 100 education, civil rights, minority, religious, and other organizations have signed a joint manifesto urging Congress to change NCLB so schools will no longer live or die according to test scores.

And you can lend your voice, too: Visit NEA's Legislative Action Center, where you can contact your own lawmakers about NCLB, and the NEA Fund for Children & Public Education, which helps candidates who support public education.

A Dictionary of Confusion: Words like 'proficient' and 'grade level' sound good, but mean different things in different places.

You would think educators could reach consensus on the meaning of some of the most important words in their discipline—is that too much to ask?


Take the phrase "grade level." Education Secretary Margaret Spellings, President Bush, and other Administration leaders say the purpose of ESEA/NCLB is to bring all children up to "grade level."

Some experts like NCLB critic Gerald Bracey point out that technically, "grade level" means the skill level that divides students in half: 50 percent are above grade level, and 50 percent are below. So it's impossible to have all students at or above grade level.
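
The arithmetic behind that technical definition can be sketched in a few lines. This is only an illustration, assuming "grade level" is taken to be the median score; the scores and the function name are made up for the example and do not come from any real test.

```python
from statistics import median

def share_below_grade_level(scores):
    """Return the fraction of scores strictly below the median ('grade level')."""
    grade_level = median(scores)
    below = sum(1 for s in scores if s < grade_level)
    return below / len(scores)

# Hypothetical scores for ten students (illustrative only).
scores = [41, 48, 52, 55, 60, 63, 67, 70, 78, 85]
print(share_below_grade_level(scores))

# Raising every score by 20 points moves the median too,
# so the share of students below "grade level" does not change.
raised = [s + 20 for s in scores]
print(share_below_grade_level(raised))
```

Under this definition, improving every student's score leaves half the group below grade level, which is the point Bracey is making.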

But many educators use the term more loosely, and clearly so does Secretary Spellings. Ask teachers what “on grade level” means, and they’re likely to say it’s a level that most students—maybe 75 or 90 percent—can reach with good teaching and good home support. (Of course, the home situation is not something a teacher can necessarily control.)

Even 90 percent is still a far cry from 100 percent. But despite political rhetoric, NCLB’s “adequate yearly progress” provisions don’t actually say anything about “grade level.” Instead, they say all students must be “proficient” by 2014.

So what’s “proficient”? Like Humpty Dumpty, for whom a word meant whatever he chose it to mean, NCLB lets each state define “proficient” in its own way. And states have taken full advantage of their discretion. An Education Week study several years ago found enormous geographic variation in how skillful a student must be to rate “proficient.”

Ironically, the fastest way for a sub-standard student in Massachusetts, a state where the bar is set high, to become “proficient” is to move to a state where that word means something quite different.

Just the Facts? How NCLB’s high-stakes tests actually push standards down.

Here’s a high-stakes question for you:

Class A will see this test item on its state test:
Which of the following is equal to 6(x + 6)?

     A) x + 12     B) 6x + 6     C) 6x + 12      
     D) 6x + 36     E) 6x + 66

Class B will see this item:
3 pineapples
1 serving = 1/2 pineapple

Given the information above, write a mathematics word problem for which 3 ÷ 1/2 would be the method of solution.

Both questions are from the National Assessment of Educational Progress for eighth-grade math, conducted by the U.S. Education Department. More students got Class A’s question right than Class B’s question.

But the real high-stakes question is this:

How would you prepare Class A to answer their question? And how would you prepare Class B?

Test experts point out that tests don’t just record information. They also drive what students learn and how they are taught. Teachers are under strong pressure to teach to the test. If they don’t, students, teachers, and the school will likely suffer for it.

Educators and business leaders generally agree that a well-educated person is one who can solve problems and apply knowledge to new situations—one who could do well on the question posed to Class B. But many educators report that NCLB is driving a trend toward more factual recall or formula test items like the one for Class A.

To be sure, multiple-choice questions are much cheaper and faster to score than open-ended questions. In 2005, as states got ready for the enormous increase in testing required by NCLB, Education Week reported that 15 states with 42 percent of the nation’s students had chosen to make their reading and math tests entirely multiple-choice.

“What upsets me,” says test consultant Scott Marion, “is that a lot of states were moving to richer tasks [in their tests] in the 1990s, but that’s slipping away. And teachers tend to model the kinds of questions they see on the state test.” Marion is vice president of the National Center for the Improvement of Educational Assessment, a non-profit that advises 15 states on their tests.

Not everyone agrees that multiple-choice is the biggest problem. H.D. Hoover, principal author of the Iowa Tests of Basic Skills for 40 years (and recently retired), thinks multiple-choice questions can be every bit as thought-provoking as open-ended ones.

Case in Point

The full moon rises at midnight…
a) always    b) usually    c) rarely    d) never.

That question stumps a lot of adults—and it’s too tough to put on a state test. Figuring it out depends on really understanding why the moon has phases, and then applying that understanding—not bad for a 10-word item. (For the answer and an explanation, click here).

While Hoover defends the potential of multiple-choice questions, he agrees with Marion and others that NCLB is dumbing down tests and hurting classroom instruction. The problem, he says, is the whole emphasis on high-stakes tests.

The testing regime forces states to create huge numbers of items and score the answers very fast. It’s much easier and quicker to write items that test knowledge of facts or formulas than to write questions that get at deep understanding. And for open-ended questions, it’s easier to score those that ask for facts than those that test a student’s ability to apply what’s been taught in a real-world scenario.

The result: fewer questions like Class B’s, and more “cloned” items—similar questions used year after year.

The extreme level of detail in new state content standards also contributes to the problem, Hoover adds. Questions have to be written almost in the same words as the standards—otherwise, the test-maker won’t get credit for aligning the test with them. But that means the question won’t measure how well the student can use his or her understanding beyond the immediate situation called for in the standard.

On top of that, Hoover says NCLB’s dictum that every student be “proficient” by 2014 is impossible to carry out by any reasonable definition of “proficient.”

What’s the solution? “The only way is either to change the law or to build a test so simple-minded that everyone can be ‘proficient,’” Hoover says.


Connecticut Goes to Court

Can NCLB make states abandon “instructionally sound” test practices? That’s the question now before a federal court.

Connecticut was using expensive, sophisticated tests with performance tasks and other open-ended questions to assess all students every other year. The federal Department of Education insisted that NCLB requires annual testing. Connecticut said it couldn’t afford to give its elaborate tests every year.

The federal response: Use inferior tests if you have to, but do it every year, or do without federal funds. “[S]ome of the costs of the system are attributable to state decisions,” wrote Education Secretary Margaret Spellings. “While these decisions are instructionally sound, they do go beyond what was contemplated by NCLB.”

The case is pending. Meanwhile, the state is compromising, testing yearly with tests that use complex, probing questions, but not as many of them.


The Mirage of Rising Test Scores: State test scores are going up, but student achievement may not match.

Does all the pain that students and teachers are going through because of high-stakes testing serve a greater purpose? Is it raising student achievement?

The answer appears to be no.

In October 2005, the National Assessment of Educational Progress (NAEP), which is the only national testing program, reported the first results in which a significant impact from the NCLB testing program should have shown up. There wasn’t any. If you hadn’t known something big was happening in the nation’s schools, you never would have guessed it from the new numbers.

Reading achievement stayed about the same. Math scores, which have been rising for many years, rose some more, but more slowly than before NCLB took effect.

Yet many states are boasting of big gains on state tests. Why do state scores jump while national scores are level or rising slowly? Harvard University testing expert Daniel Koretz has seen this before, and he has a one-word explanation: “Coaching.”

On a test, he points out, “we can’t test kids on everything they’ve learned in math, so we just test them on 45 questions.” If the pressure to raise scores becomes great enough, teachers focus on the particular elements they know will be tested.

“In the last few years, the situation has become absolutely egregious,” Koretz says. “They’re bringing in people from the outside to tell teachers what to skip, what to throw out.”

But isn’t that just focusing on the most important topics? No, says Koretz, because the NAEP scores show that all this effort is not improving real achievement. “The people who get cheated are the kids,” he says.

It’s not a new problem. In the 1990s, Kentucky adopted a high-stakes fourth-grade reading test, which served as a model for NCLB. “They had a staggeringly large increase in scores over two years,” Koretz says. “But on the NAEP, their scores didn’t change at all.” This pattern, he says, has been repeated over and over: higher scores on a state high-stakes test, but no improvement or only a small improvement on the national test for the same subject.

Koretz studied this phenomenon early in his career, in a district whose scores on a commercial test averaged about half a year above grade level. At that time, in the 1980s, the stakes were “laughably low compared with today’s, but teachers felt pressure from the administration to raise scores. When the district bought a new test, scores dropped like a rock,” he recalls. Four years later, scores on the new test were back up. But when Koretz and his colleagues gave the old test to a random sample of classes, those classes scored lower on it.

In other words, the four years of test score gains reflected teaching to the new test, not real achievement.

So what can be done to make tests reflect reality?

First, educators should set realistic expectations for student scores, Koretz says. Tests should also be written so they’re harder to coach for. If the subset of topics found on the test changes from year to year, schools will be less likely to narrow their teaching. “But that would produce smaller gains—more meaningful, but smaller,” he adds. “And nobody wants that.”

Koretz also echoes the growing call, supported by NEA and more than 100 other organizations, for assessments based on multiple measures of student performance.

“That’s not to say tests shouldn’t be in the mix,” he says, “but there’s a lot more going on in classrooms than we can capture with a single test.”

The Chimp, the Chump, and You: Can a dumb machine help students write smart essays?

"It is with chimpanzee greatest esteem and confidence that I write to support Risk of physical injury Employers ..."

Thus began an “essay” cooked up by University of California-Davis writing instructor Andy Jones, which earned a stellar 6 out of 6 rating from the Educational Testing Service (ETS). Not from a reader, but from software called Criterion, a leader in the field of computerized essay scoring.

UC-Davis was considering using Criterion to decide which students should be allowed to skip a writing course, and Jones thought that was a bad idea. So he took a letter of recommendation he had written, replaced the student’s name with a few words from a Criterion writing prompt, and substituted “chimpanzee” for every “the.” Criterion loved the result, calling it “cogent” and “well-articulated.”
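
The word swap Jones describes is simple enough to reproduce. The sketch below is an assumption about the mechanics, not his actual procedure: it replaces each standalone "the" with "chimpanzee" and leaves words such as "other" alone. The function name is hypothetical.

```python
import re

def chimp_substitute(text):
    """Replace every standalone 'the' (any capitalization) with 'chimpanzee'."""
    # \b word boundaries keep words like "other" or "there" untouched.
    return re.sub(r"\bthe\b", "chimpanzee", text, flags=re.IGNORECASE)

sample = "It is with the greatest esteem that I write on the other matter."
print(chimp_substitute(sample))
```

A scorer that looks only at surface patterns has little reason to object to the result, since sentence length, vocabulary, and transitions are all still in place.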

The university shelved the plan. Yet computers are rating millions of essays per year in thousands of districts. Their main use is not for making high-stakes decisions, but to help teachers teach. Their big advantage: speed. The student gets feedback in seconds.

“It’s allowed me to triple my writing assignments,” says Irvine, California, middle school teacher Pat Thornton, who uses Criterion with her 157 writing students. “It doesn’t analyze the content of an essay—that’s still our job. But it gives students a heads-up on the mechanics.” Thornton assigns five essays a month, and grades one herself in depth. “I use it because the more students write, the better they get,” she says.

At ETS, researcher Jill Burstein says she’s heard from many teachers that they tell students, “Give me your essay when you’ve got all the spelling and grammar mistakes fixed.”

Does the program’s conventional approach damp down creativity, leading students to stick with predictable writing structures? “It could, but we don’t allow that to happen,” says Thornton. “And with kids who are really struggling with writing, that’s not all bad. At least they will pass the high school exit exam.”

Criterion costs $11 to $15 per student per year, depending on the number of students. Competing programs from Vantage Learning and Pearson Knowledge Technologies cost a bit more.

Although their main use is as a teacher aid, essay-rating programs do sometimes score high-stakes tests. Vantage’s program scores the West Virginia seventh-grade writing assessment and wrested the essay portion of the Graduate Management Admission Test (GMAT), used by business schools, away from ETS. West Virginia uses the computer alone to score, but there’s also a human scorer on the GMAT.

Vantage says its program is more accurate than a person. That is, when a group of expert human scorers and the Vantage program score the same essays, Vantage consistently comes closer to the group’s average score than any human.
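
To make that notion of accuracy concrete, here is a minimal sketch of the comparison as the paragraph describes it. Vantage's actual method is not public, so everything here is an assumption: the function name, the use of a mean absolute gap, and the scores, which are invented for illustration.

```python
from statistics import mean

def distance_from_consensus(candidate_scores, panel_scores):
    """Mean absolute gap between one scorer and the panel's average, per essay."""
    gaps = []
    for i, candidate in enumerate(candidate_scores):
        consensus = mean(rater[i] for rater in panel_scores)
        gaps.append(abs(candidate - consensus))
    return mean(gaps)

# Hypothetical 1-6 scores on four essays (illustrative, not real data).
panel = [
    [4, 3, 5, 2],  # human rater A
    [5, 3, 4, 2],  # human rater B
    [4, 2, 5, 3],  # human rater C
]
machine = [4, 3, 5, 2]  # hypothetical program scores

machine_gap = distance_from_consensus(machine, panel)
human_gaps = [distance_from_consensus(rater, panel) for rater in panel]
print(machine_gap, human_gaps)
```

Note that in this sketch each human's own scores are part of the panel average, which flatters the humans slightly; a stricter version would leave each rater out of the consensus they are compared against. Either way, the measure rewards agreement with the group, which is not the same thing as judging whether an essay makes sense.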

But how can a computer judge an essay? It doesn’t scratch its head, laugh at students’ jokes, or, as the UC-Davis case proved, roll its eyes at illogical sentences. But even a word processor can find misspellings and grammatical errors. Essay-scoring programs go much further. Vantage Vice President Harry Barfoot says the company’s program can tell whether an essay is “engaging,” “insightful,” and has “a clear sense of audience.” But he won’t divulge exactly how because that’s proprietary information.

Burstein, the ETS researcher, says its program uses words and word patterns to detect the essay’s structural elements such as introduction, main point, supporting evidence, and conclusion. Some clues are obvious, such as “For example,” and “In conclusion.”

But most of the detective work is subtle, she says. The program doesn’t just evaluate isolated words. It looks for their position in the essay, what comes before and after, and many other contextual clues to calculate the probability that a set of words is a supporting argument.

Can the computer tell whether the argument is sound? No. Criterion development director Linda Reitzel of ETS likes to say that “man bites dog” makes just as much sense to Criterion as “dog bites man.”

And how about the chimps that made a chump of Criterion? “We work with good-faith writing,” says Burstein. "It’s not that you can’t fool the system—you can. That’s why we always want to keep a human in the loop."

Read the perfectly scored chimp essay.



How do you write great questions?

Here are some tips.