Norms vs. Standards


I’ve found myself trying to explain the difference between norm referencing and standards referencing multiple times in the last few weeks, which means it’s time to write about it. A lot of people get this distinction– but a lot of people don’t. I’m going to try to do this in plain(ish) English, so those of you who are testing experts, please forgive the lack of correct technical terminology.

A standards-referenced (or criterion-referenced) test is the easiest one to understand and, I am learning, what many, many people think we’re talking about when we talk about tests in general and standardized tests in particular.

With standards referencing, we can set a solid immovable line between different levels of achievement, and we can do it before the test is even given. This week I’m giving a spelling test consisting of twenty words. Before I even give the test, I can tell my class that if they get eighteen or more correct, they get an A, if they get sixteen correct, they did okay, and if they get thirteen or fewer correct, they fail.
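For readers who think better in code, here is a rough sketch of that spelling test. The function name `grade_spelling` is made up for illustration, and treating every score between the fail line and the A line as "okay" is my assumption, since the example above only names sixteen. The point is that the cut lines are written down before a single paper is corrected.

```python
# Cut lines fixed BEFORE the test is given (standards-referenced).
A_CUT = 18      # eighteen or more correct is an A
FAIL_CUT = 13   # thirteen or fewer correct is a fail


def grade_spelling(correct: int) -> str:
    """Grade one twenty-word spelling test against the fixed cut lines."""
    if not 0 <= correct <= 20:
        raise ValueError("a twenty-word test has between 0 and 20 correct answers")
    if correct >= A_CUT:
        return "A"
    if correct <= FAIL_CUT:
        return "fail"
    # Assumption: everything between the two lines counts as "okay".
    return "okay"


# Nothing stops a whole class from earning an A.
whole_class = [20, 19, 18, 20, 18]
grades = [grade_spelling(score) for score in whole_class]
```

Notice that `grade_spelling` never looks at anyone else's paper. Each student is compared only to the line.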

A driver’s license test is also standards-referenced. If I complete the minimum number of driving tasks correctly, I get a license. If I don’t, I don’t.

One feature of a standards-referenced test is that while we might tend to expect a bell-shaped curve of results (a few failures, a few top scores, and most in the middle), such a curve is not required or enforced. Every student in my class can get an A on the spelling test. Everyone can get a driver’s license. With standards-referenced testing, clustering is itself a piece of useful data; if all of my students score less than ten on my twenty-word test, then something is wrong.

With a standards-referenced test, it should be possible for every test taker to get top marks.

A norm-referenced measure is harder to understand but, unfortunately, far more prevalent these days.

A standards-referenced test compares every student to the standard set by the test giver. A norm-referenced test compares every student to every other student. The lines between different levels of achievement will be set after the test has been taken and corrected. Then the results are laid out, and the lines between levels (cut scores) are set.

If I norm-reference my twenty-word spelling test, I can’t set the grade levels until I correct it. Depending on the results, I may “discover” that an A is anything over a fifteen, twelve is Doing Okay, and anything under nine is failing. Or I may find that twenty is an A, nineteen is okay, and eighteen or less is failing. If you have ever been in a class where grades are curved, you were in a class that used norm referencing.
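Here is the same spelling test graded on a curve, as a hedged sketch. The names `curve_cut_scores` and `curve_grade` are invented, and the rule "top fifth gets an A, bottom fifth fails" is purely my assumption for illustration; real testing outfits use far fancier methods. What matters is that the cut lines cannot be computed until the scores are in.

```python
def curve_cut_scores(scores, top_share=0.2, bottom_share=0.2):
    """Find the cut lines AFTER the test, from the results themselves.

    Assumption: the top fifth earns an A and the bottom fifth fails.
    """
    ranked = sorted(scores)
    n = len(ranked)
    n_top = max(1, round(n * top_share))
    n_bottom = max(1, round(n * bottom_share))
    a_cut = ranked[n - n_top]         # lowest score still in the top group
    fail_cut = ranked[n_bottom - 1]   # highest score in the bottom group
    return a_cut, fail_cut


def curve_grade(score, a_cut, fail_cut):
    """Grade one student against cut lines drawn from the whole class."""
    if score >= a_cut:
        return "A"
    if score <= fail_cut:
        return "fail"
    return "okay"


# Two classes, same test, very different cut lines.
strong_class = [20, 20, 19, 19, 19, 19, 19, 19, 18, 18]
weak_class = [16, 15, 12, 12, 11, 11, 10, 9, 8, 7]

strong_cuts = curve_cut_scores(strong_class)
weak_cuts = curve_cut_scores(weak_class)
```

With these made-up classes, an eighteen fails in the strong class while a sixteen earns an A in the weak one. The grade tells you where a student stands in the room, not how many words the student can spell.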

Other well-known norm-referenced tests are the SAT and the IQ test. Norm referencing is why, even in this day and age, you can’t just take the SAT on a computer and have your score the instant you click on the final answer– the SAT folks can’t figure out your score until they have collected and crunched all the results. And in the case of the IQ test, 100 is always set to be “normal.”

There are several important implications and limitations for norm referencing. One is that norm-referenced measures are lousy for showing growth, or lack thereof. 100 will be and has always been “normal” aka “smack dab in the middle of the results” for IQ tests. Have people gotten smarter or dumber over time? Hard to say– big time testers like the IQ folks have all sorts of techniques for tying years of results together, but at the end of the day “normal” just means “about in the middle compared to everyone else whose results are in the bucket with mine.” With norm referencing, we have no way of knowing whether this particular bucket is overall smarter or dumber than the other buckets. All of our data is about comparing the different fish in the same bucket, and none of it is useful for comparing one bucket to another (and that includes buckets from other years– as all this implies, norm referencing is not so great at showing growth over time).

Norm referencing also gets us into the Lake Wobegon Effect. Can the human race ever develop and grow to the point that every person has an IQ over 100? No– because 100 will always be the average normal right-in-the-middle score, and the entire human race cannot be above average (unless that is also accompanied by above-average innumeracy). No matter how smart the human race gets, there will always be people with IQs less than 100.
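A small sketch makes the "100 is always the middle" point concrete. This is not how any actual IQ test is scored; it is a plain rescaling I am assuming for illustration, with the function name `normed_scores` made up and a spread of 15 points borrowed from the usual IQ convention. It pins the bucket's average to 100 and expresses everyone else relative to that.

```python
import statistics


def normed_scores(raw, center=100.0, spread=15.0):
    """Rescale raw scores so the bucket's average lands on `center`.

    Assumption: a simple mean-and-standard-deviation rescaling, used
    here only to illustrate norm referencing.
    """
    mean = statistics.mean(raw)
    sd = statistics.pstdev(raw)
    if sd == 0:
        raise ValueError("everyone scored the same, so there is nothing to rank")
    return [center + spread * (x - mean) / sd for x in raw]


# One bucket of fish, and the same bucket after everyone gets "smarter".
this_year = [40, 50, 60, 70, 80]
smarter_year = [x + 30 for x in this_year]

normed_this_year = normed_scores(this_year)
normed_smarter_year = normed_scores(smarter_year)
```

Both years come out with the same normed scores, the average sits at 100 both times, and somebody is still below 100 both times. The thirty points of real improvement vanish, because the scale only ever describes position inside the bucket.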

On a standards-referenced test, it is possible for everyone to get an A. On a norm-referenced test, it is not possible for everyone to get an A. Nobody has to flunk a standards-referenced test. Somebody has to flunk a norm-referenced test.

What are some of the examples we live with in education?

How about “reading on grade level”? At the end of the day, there are only two ways to determine what third grade “grade level” is– you can either look at all the third graders you can get data for and say, “Well, it looks like most of them get up to about here” or you can say “I personally believe that all third graders should be able to get to right about here” and just set it yourself based on your own personal opinion.

While lots of people have taken a shot at setting “grade level” in a variety of ways, it boils down to those two techniques, and mostly we rely on the first, which is norm referencing. Which means that there will always be people who read below grade level– always. The only way to show that some, more or all students are reading above grade level is to screw around with where we draw the “grade level” line on the big bell curve. But other than doing that kind of cheating with the data analysis, there is no way to get all students reading above grade level. If all third graders can read and comprehend Crime and Punishment, then Crime and Punishment is a third grade reading level book, and the kid in your class who has trouble grasping the full significance of Raskolnikov’s dream of the whipped mare as a symbol of gratification through punishment and abasement is the third grader who gets a D on her paper.

And of course, there are the federally mandated Big Standardized Tests, the PARCC, SBA, PSSA, WTF, MOUSE, ETC or whatever else you’re taking in your state.

First, understand why the feds and test fanatics wanted so badly for pretty much the same test to be given everywhere and for every last student to take it. Think back to our buckets of fish. Remember, with norming we can only make comparisons between the fish in the same bucket, so the idea was that we would have a nation-sized bucket with every single fish in it. Now, sadly, we have about forty buckets, and only some of them have a full sampling of fish. The more buckets and the fewer fish, the less meaningful our comparisons.

The samples are still big enough to generate a pretty reliably bell-shaped curve, but then we get our next problem, which is figuring out where on that bell curve to draw the cut score, the line that says, “Oh yeah, everyone above this score is super-duper, and everyone below it is not.” This process turns out (shocker) to be political as all get out (here’s an example of how it works in PA) because it’s a norm-referenced test and that means somebody has to flunk and some bunch of bureaucrats and testocrats have to figure out how many flunkers there are going to be.

There are other norm-referencing questions floating out there. The SAT bucket has always included all the fish intending to go to college– what will happen to the comparisons if the bucket contains all the fish, including the non-college-bound ones? Does that mean that students who used to be the bottom of the pack will now be lifted to the middle?

This is also why using the SAT or the PARCC as a graduation exam is nuts– because that absolutely guarantees that a certain number of high school seniors will not get diplomas, because these are norm-referenced tests and somebody has to land on the bottom. And that means that some bureaucrats and testocrats are going to sit in a room and decide how many students don’t get to graduate this year.

It’s also worth remembering that norm referencing is lousy at providing an actual measure of how good or bad students are at something. As followers of high school sports well know, “champion of our division” doesn’t mean much if you don’t know anything about the division. Saying that Pat was the tallest kid in class doesn’t tell us much about how tall Pat actually is. And with these normed measures, you have no way of knowing if the team is better than championship teams from other years, or if Pat is taller or shorter than last year’s tallest kid in class.

Norm referencing, in short, is great if you want to sort students, teachers and schools into winners and losers. It is lousy if you want to know how well students, teachers and schools are actually doing. Ed reform has placed most of its bets on norm referencing, and that in itself tells us a lot about what reformsters are really interested in. That is not a very useful bucket of fish.
