
Source: http://www.stat.psu.edu/

Creating a normal probability plot of residuals


Here's the basic idea behind any normal probability plot: if the data follow a normal distribution
with mean μ and variance σ², then a plot of the theoretical percentiles of the normal
distribution versus the observed sample percentiles should be approximately linear. Since we
are concerned about the normality of the error terms, we create a normal probability plot of the
residuals. If the resulting plot is approximately linear, we proceed assuming that the error terms
are normally distributed.
The theoretical p-th percentile of any normal distribution is the value such that p% of the
measurements fall below the value.

[Figure: a theoretical p-th percentile]

The problem is that to determine the percentile value of a normal distribution, you need to know
the mean μ and the variance σ². And, of course, the parameters μ and σ² are typically unknown.
Statistical theory says it's okay just to assume that μ = 0 and σ² = 1. Once you do that,
determining the percentiles of the standard normal curve is straightforward. The p-th percentile
value reduces to just a "Z-score" (or "normal score").

[Figure: how the p-th percentile value reduces to just a normal score]
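
One way to convince yourself that assuming μ = 0 and σ² = 1 costs nothing: the p-th percentile of any normal distribution is just μ + σ times the normal score, so the two differ only by a shift and a scale. Here's a minimal Python sketch using the standard library's statistics.NormalDist. The values of mu, sigma and p are made up for illustration and do not come from the example in this section.

```python
from statistics import NormalDist

# Hypothetical illustration values (not taken from the data in this section).
mu, sigma = 50.0, 4.0
p = 0.27  # the 27th percentile, written as a proportion

# Normal score: the p-th percentile of the standard normal (mean 0, variance 1).
z = NormalDist().inv_cdf(p)

# p-th percentile of the normal distribution with mean mu and std. dev. sigma.
x = NormalDist(mu, sigma).inv_cdf(p)

# x equals mu + sigma * z, so plotting against z is linear
# exactly when plotting against x is linear.
print(z, x, mu + sigma * z)
```

Because the relationship is linear, a straight-line pattern against normal scores implies a straight-line pattern against the percentiles of whatever normal distribution the data actually follow.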

The sample p-th percentile of any data set is, roughly speaking, the value such that p% of the
measurements fall below the value. Think about it. The median is just a special name for the
50th percentile. If you are asked to determine the median of a set of data, you punch the data into
your calculator ... oops, I mean ... you order the data, you take the value in the middle, and call it
the median. (If you have an even number of data points, so there is no value in the middle, you
average the two middle values.) The median is the value so that 50%, or half, of your
measurements fall below the value. Now, if you are asked to determine the 27th percentile, you
take your ordered data set, and you determine the value so that 27% of the data points in your
data set fall below the value. And so on.
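
The order-then-take-the-middle recipe for the median can be sketched in a few lines of Python. The function name median_by_hand and the two small data sets are made up for illustration; the standard library's statistics.median is used only as a cross-check.

```python
from statistics import median

def median_by_hand(values):
    """Order the data, then take the middle value (or average the two middle values)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd number of points: there is a single middle value.
        return ordered[mid]
    # Even number of points: average the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2

# Hypothetical data, not from the example in this section.
odd_data = [7, 3, 9, 1, 5]
even_data = [7, 3, 9, 1]

print(median_by_hand(odd_data))   # 5
print(median_by_hand(even_data))  # 5.0
print(median_by_hand(odd_data) == median(odd_data))
```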
We have sufficient background now to get back to the point of this section, namely, to create a
normal probability plot of residuals. Let's take a look at the example in the following table. The
first two columns of the table contain data for some predictor x and some response y. The third
column labeled i is just an index variable which keeps track of the data points (incidentally, n =
9). The fourth column labeled RESIDS contains the residuals corresponding to each of the data
points that are obtained when regressing the response y on the predictor x. Note that the residuals
appear in increasing order. That is, the data were entered into the table so that the residuals
would appear in increasing order. This, of course, facilitates finding the sample percentiles of the
residuals.

i    RESIDS      PCT    MTB_PCT     NSCORE
1    -2.70103    0.1    0.060976    -1.54664
2    -1.04639    0.2    0.158537    -1.00049
3    -1.01031    0.3    0.256098    -0.65542
4    -0.39175    0.4    0.353659    -0.37546
5     0.29897    0.5    0.451220    -0.12258
6     0.60825    0.6    0.548780     0.12258
7     0.64433    0.7    0.646341     0.37546
8     1.60825    0.8    0.743902     0.65542
9     1.98969    0.9    0.841463     1.00049

What is the sample median of the residuals? Well, it's the middle value, 0.29897, of the
ordered residuals. If you look at the column labeled PCT (for "percent"), you see that 0.29897
sure enough corresponds to the 50th percentile. By that reasoning, -1.04639 is the 20th
percentile, 1.60825 is the 80th percentile, and so on. Incidentally, the percentages in the PCT
column are calculated using the formula i/(n+1). So, the i-th residual in the RESIDS column is the
sample 100i/(n+1)-th percentile. That is, the values in the RESIDS column are the sample
percentiles we'll use in our normal probability plot.
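
Here's a short Python sketch of how the PCT column can be reproduced from the ordered residuals, assuming the formula i/(n+1) with n = 9. The variable names are made up for illustration.

```python
# Ordered residuals from the RESIDS column (n = 9).
resids = [-2.70103, -1.04639, -1.01031, -0.39175, 0.29897,
          0.60825, 0.64433, 1.60825, 1.98969]
n = len(resids)

# PCT for the i-th ordered residual, i = 1, ..., n.
pct = [i / (n + 1) for i in range(1, n + 1)]

# Print index, residual, and its sample percentile as a proportion.
for i, (r, p) in enumerate(zip(resids, pct), start=1):
    print(f"{i}  {r:9.5f}  {p:.1f}")
```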
Now, we just have to find the theoretical percentiles that correspond to the sample percentiles.
There's just one slight adjustment we have to make. Minitab, as well as most other statistical
software, doesn't consider i/(n+1) to be the appropriate formula for determining the percentages.
Minitab uses (i - 3/8)/(n + 1/4) instead. These adjusted percentages appear in the column labeled
MTB_PCT.

The DOX text by Montgomery uses (i-0.5)/n.
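
The three percentage formulas can be compared side by side. What follows is a sketch, not Minitab's own code: it assumes the Minitab adjustment is (i - 3/8)/(n + 1/4), and the function names are made up for illustration. One thing worth flagging: the MTB_PCT values printed in the table agree with that formula when n is taken to be 10 (for example, 0.625/10.25 = 0.060976), whereas n = 9 would give 0.625/9.25 = 0.067568 for the first point. The code checks this rather than taking it on faith.

```python
def pct_simple(i, n):
    """Unadjusted percentage, i/(n+1)."""
    return i / (n + 1)

def pct_minitab(i, n):
    """Adjusted percentage, assumed here to be (i - 3/8)/(n + 1/4)."""
    return (i - 3 / 8) / (n + 1 / 4)

def pct_montgomery(i, n):
    """Percentage used in Montgomery's DOX text, (i - 0.5)/n."""
    return (i - 0.5) / n

# MTB_PCT column exactly as printed in the table.
mtb_pct_table = [0.060976, 0.158537, 0.256098, 0.353659, 0.451220,
                 0.548780, 0.646341, 0.743902, 0.841463]

# Side-by-side comparison for n = 9.
n = 9
for i in range(1, n + 1):
    print(i, round(pct_simple(i, n), 6),
          round(pct_minitab(i, n), 6), round(pct_montgomery(i, n), 6))

# The printed MTB_PCT values match the adjusted formula with n = 10, not n = 9.
matches_n10 = all(abs(pct_minitab(i, 10) - v) < 1e-6
                  for i, v in enumerate(mtb_pct_table, start=1))
matches_n9 = all(abs(pct_minitab(i, 9) - v) < 1e-6
                 for i, v in enumerate(mtb_pct_table, start=1))
print(matches_n10, matches_n9)
```

All three formulas are symmetric about 0.5 for a given n and differ mainly in how far they push the smallest and largest points toward 0 and 1.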


Now, we can use the standard (μ = 0, σ² = 1) normal distribution to find the theoretical
percentiles, which appear in the column labeled NSCORE.

[Figure: determination of three of the NSCORE values appearing in the table]

In Excel, the normal score (or z-score) can be found using this command: norminv(prob., 0, 1),
where the prob. value corresponds to MTB_PCT in this document; e.g., (i-0.5)/n.
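
If you'd rather not use Excel, the same inverse-normal lookup is available in Python's standard library as statistics.NormalDist().inv_cdf. The sketch below feeds it the MTB_PCT values as printed in the table and compares the results with the NSCORE column; the variable names are made up for illustration.

```python
from statistics import NormalDist

# MTB_PCT and NSCORE columns exactly as printed in the table.
mtb_pct = [0.060976, 0.158537, 0.256098, 0.353659, 0.451220,
           0.548780, 0.646341, 0.743902, 0.841463]
nscore_table = [-1.54664, -1.00049, -0.65542, -0.37546, -0.12258,
                0.12258, 0.37546, 0.65542, 1.00049]

# Standard normal distribution: mean 0, standard deviation 1.
std_normal = NormalDist(0, 1)

# Normal score = percentile of the standard normal at each adjusted percentage.
nscore = [std_normal.inv_cdf(p) for p in mtb_pct]

for p, z, z_tab in zip(mtb_pct, nscore, nscore_table):
    print(f"{p:.6f}  {z:9.5f}  (table: {z_tab:9.5f})")
```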
Finally, we create the normal probability plot of the residuals by creating a scatter plot with the
theoretical percentiles (in the NSCORE column) on the y axis and the sample percentiles (in
the RESIDS column) on the x axis:

[Figure: normal probability plot of the residuals]

Note that the relationship between the theoretical percentiles and the sample percentiles is
approximately linear. Therefore, the normal probability plot of the residuals suggests that the
error terms are indeed normally distributed.
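
If you want a number to go along with the eyeball check, one rough option is the correlation between the RESIDS and NSCORE columns. This is a sketch of that idea in plain Python, not part of the procedure described above, and no formal cutoff is implied. The function name pearson_r is made up for illustration.

```python
from math import sqrt

# RESIDS and NSCORE columns exactly as printed in the table.
resids = [-2.70103, -1.04639, -1.01031, -0.39175, 0.29897,
          0.60825, 0.64433, 1.60825, 1.98969]
nscore = [-1.54664, -1.00049, -0.65542, -0.37546, -0.12258,
          0.12258, 0.37546, 0.65542, 1.00049]

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

r = pearson_r(resids, nscore)
print(f"correlation between RESIDS and NSCORE: {r:.3f}")
```

A value near 1 is consistent with the roughly straight-line pattern in the plot, but the plot itself remains the primary diagnostic.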
