
FACULTY OF AGRICULTURAL AND

APPLIED BIOLOGICAL SCIENCES


Department of Applied Mathematics, Biometrics and Process Control
APPLIED STATISTICS FOR THE
FOOD SCIENCES
Dr. ir. O. Thas
Chapter 1
Introduction
1.1 Structure and aim of the course
These course notes are only a part of the course. Since the aim of this course is mainly to give you insight into the interpretation of statistical procedures and how they can be applied, rather than into the statistical theory itself, the theory classes will often make use of a series of applets that are meant to dynamically and interactively illustrate statistical concepts. This course is not a manual for these applets. Instead, this text contains a brief introduction and summary of the methods that are discussed in the classes. Some of the examples will be included as well.
The applets can be found at
http://tbw2.rug.ac.be/ILO
The aim of the course is that you can interpret statistical analyses, such that you are able to understand the statistical considerations in the literature. A second aim is that you are able to perform a statistical analysis yourself.
1.2 Outline of the course
The following topics will be discussed in this course:
- random variables and distributions; populations and samples (week 1)
- the sample mean as an example of a statistic (week 2)
- hypothesis testing (applied to the mean) (weeks 3, 4)
- methods concerning proportions (week 5)
- ANOVA (analysis of variance) (weeks 6, 7, 8)
- regression (weeks 9, 10)
- logistic regression (week 11)
1.3 What is Statistics
Statistics is the science that aims to formalize the process of induction.
In many sciences where experiments play a central role, the scientist builds his theories by means of induction, i.e. from the results of an experiment (observations), he tries to generalize these specific observations to a theory that holds more generally. Let's take a simple example: suppose you want to assess the effect of a particular diet on the weight of people. Then, you will typically take (sample) a group of people, and next you will split them into two groups (randomize them). One group will get no specific diet, and the other gets the diet. After some weeks you will weigh all those people so that you know the weight reduction. The problem to be solved is: what is the difference in weight reduction between the two groups of people, or, equivalently, what is the effect of the diet in terms of weight reduction. Of course, once you have the data, you can simply calculate the average weight reduction within the two treatment groups, and you might be willing to conclude that the diet is effective as soon as the average weight reduction is larger in the diet group as compared to the other group. This process of reasoning is induction. You want to generalize the results from a small experiment to all future people that will take this diet. Now, statistics is the science that says how you have to make these calculations and to what extent you may generalize the observations. In particular, statistics uses probability theory such that these generalized conclusions can be accompanied by probability statements, which allows you to formulate the conclusions of an experiment with a degree of certainty. In the above example, statistics can also be used the other way around: it may help you in determining a priori (i.e. before the start of the experiment) how large the samples (the two groups of people) must be in order to make the conclusions at the end with a prespecified degree of certainty.
The statistical process of induction is also known as inference. Statistics helps with inferring from the sample.
Yet another important methodological aspect of theory building in experimental sciences is falsification, which was originally introduced by Popper. This means that if you have postulated a theory, i.e. you have a hypothesis, then you must try to falsify the hypothesis and not only try to confirm it. Such an experiment must be designed accordingly. As we will see later in this course, statistical hypothesis testing agrees with this philosophy.
Chapter 2
Basic Concepts
In this chapter some of the basic concepts of probability theory and statistics will be briefly explained in a non-mathematical way. We will start with explaining the difference and the relationship between a population and a sample. From there, we go to the important concept of a random variable, which is characterized by a distribution. Finally, we will also say something about the different types of data (continuous, discrete, factors, ...).
Many of the concepts will be illustrated in the classroom by means of the applets. In this text, we will only briefly refer to these, though actually they form the backbone of this course!
2.1 Populations and Samples
The difference between a population and a sample is most easily explained with an example.
Example 2.1. Consider again the example given in the Introduction: we wish to assess the effect of a particular diet on the weight reduction of people. Up to now we have very loosely used the term people. We expect, though, that not all people react equally to a certain diet. E.g. there may be a difference between men and women, differences between younger and elderly people, differences between people that do a lot of physical labour as compared to people having to sit all day, ...
In this example we could define the population as the group of people for which we want the conclusions of the experiment to hold. Thus, if at the end of the experiment we want to recommend the diet to all people of all sexes and of any age, then we define the population as such. If, on the other hand, we aim only at elderly women, then we restrict the definition of the population accordingly.
In the above example, people constitute the basic elements of the population. In general we sometimes use the term experimental unit or just element.
We will often assume that the population is infinitely large. In practice this only means that it is at least approximately so, or that the population is large compared to the number of elements that will be sampled.
Since the concept of a sample is closely related to the population, we will first discuss the sample before giving the more general definitions.
Example 2.2. Example 1 continued. Once the population is defined, e.g. all people, then we should determine how to take a sample. A sample is a subset of the population that is subject to the experiment. Throughout this course we will always assume that every element of the population has the same chance of being selected in the sample. Thus, in the example, every human being on earth must have the same chance of being selected in the sample. You see immediately that this implies a problem: it will not be practically feasible to select all people with the same probability. E.g. very young babies cannot be selected at all (they only drink milk!), and, very likely, the scientist is not willing to perform a worldwide experiment (unless he has found sufficient funds to finance this expensive experiment). Thus, these practical restrictions imply that the definition of the population must be restricted to, e.g., people older than 12 years, living in Belgium.
Once a realistic population is defined, the sampling can start. Or, at least, the design of the sampling plan. As mentioned before, every element in the population must have the same probability of being sampled. Suppose, for simplicity, that you have a list of all people in the population, and their addresses. Typically, you will also have determined a priori how many people must be sampled. Then, you only have to sample completely at random to ensure the equal selection probability.
At this point it is already important to realize that the procedure described above implies that every time you repeat the whole procedure (i.e. the sampling), other persons may be selected in the sample. This characteristic, which is due to the randomness of the sampling process, will be referred to as repeated sampling.
From the above example we have learnt that in practice the definition of the population and of the sample go hand in hand. It is time now to give the general definitions.
Definition 2.1 (population): the population is the set of elements (units) which all have an equal chance of being selected in the sample. Furthermore, the population is the set of elements about which you want the conclusions of the experiment to hold.
Definition 2.2 (sample): the sample is a subset of the population. The elements of the sample are subject to the experiment.
Before we continue with explaining randomization, we will discuss the role of populations and samples in the induction process. In the Introduction it was already mentioned that statistics is a formalism of induction. Population and sample are important concepts in this respect. If the sample is taken in a correct way from a well-defined population (and the calculations are performed correctly), the induction is valid. Thus, the conclusions derived from the sample may be generalized to the whole population.
2.2 Randomization
Randomization is a very important concept in statistics. In the previous
section we have actually already come across it once. There, we have said that
a sample has to be taken at random such that every unit in the population has
the same chance of being selected. This is one of the forms of randomization.
The other type of randomization is explained in the next example.
Example 2.3. Example 1 continued. Once a sample is taken from the population, it has to be decided which people should have the diet and which should not have the diet (control group). Thus we have to split the sample into two groups. Suppose that we have decided that both groups are equally large.
In order to make valid conclusions (induction), the sample has to be split completely at random into the two groups, i.e. every element in the sample should have equal probability of being assigned to each of the groups. Here we have two groups, thus each person in the sample should have probability 1/2 to be in the diet group and probability 1/2 to be in the control group, with the additional assumption that both groups must be equally large.
An easy counterexample illustrates the importance. Suppose that instead of splitting the sample completely at random, we would select only overweight people for the diet group. It may (a priori) be expected that these people react better to the diet as compared to other, normal-weight people. Thus, it may be expected (i.e. it has a high probability) that the calculations based on the sample indicate that the diet has very good results. It is obvious that this conclusion is not representative for the whole population. Indeed, we wanted the conclusions to hold for every individual in the population, and not only for the overweight people, and then even only as compared to normal-weight people. We say that such a procedure leads to biased results, i.e. the estimation of the effect of the diet is biased. If, on the other hand, the randomization had been performed completely at random, the estimated diet effect would have been unbiased.
The above example not only illustrates the meaning of randomization, but it also shows what is meant by bias. Later, we will see a more formal definition of bias, but already at this point we can intuitively see that bias means the difference between the real diet effect (= what we would measure if we were able to take the whole population as a sample, split completely at random into two groups) and the expected estimated diet effect in the sample when the sample is split into two groups according to some specific procedure. In particular, when this procedure is completely at random, then the bias is zero, and when this procedure is the one of the example (selecting overweight persons for the diet group), then there will be a substantial bias.
Thus, in conclusion, randomization is crucial in order to make valid (unbiased) conclusions. Also here, the repeated sampling concept comes into play. You can imagine that an experiment can be performed repeatedly (e.g. over time, or at least as a thought-experiment). Since the splitting of the sample into groups is a random process, the groups will be different every time you do the experiment. Thus, the results of the experiment will also very probably be different every time. We will see later that this is exactly the basis of the understanding of statistical methods, and the basis of the validity of the induction process.
Yet another way to understand the necessity of randomization is the following example.
Example 2.4. Example 1 continued. Suppose that the diet effect actually depends on some genetic characteristics of the people in the sample. Since it is not feasible in practice to do a genome analysis on everyone, in order to eventually assess the relation between some genetic markers and the diet effect, it is best to distribute the people, w.r.t. their genome, as equally as possible over the two groups. In this way the possible influence of the genome on the diet effect will be averaged out from the sample. Since we don't have information on the genomes of the people in the sample, the only way to guarantee an equal distribution of genomes over the two groups is to randomize the people in the sample. Then, at least on average, the effect of the genome on the diet will be eliminated. (Note that on average refers here to repeated sampling.)
Note also that the result of such a randomization guarantees that the two groups resemble the whole population as closely as possible w.r.t. the genome distribution. Hence, the results from the study will very probably be representative for the whole population (this is important in the induction process, of course). Also, if the two groups, on average, do not differ w.r.t. any characteristic that might have an influence on the diet effect, then any substantial difference in weight reduction that is estimated from the sample may be attributed to the diet, since the diet is the only difference between the two groups! This is a very important reasoning; it is a cornerstone in establishing a causal relationship!
Finally, we mention that the two groups in the example are in general often referred to as the two treatment arms. Typical characteristics of the elements in the sample (e.g. the genome in the previous example) that might possibly have an influence on the results of the experiment are called baseline characteristics or baseline variables.
2.3 Stratified Sampling
From the reasoning presented in the two previous sections, we have learnt that randomization and sampling completely at random are important steps in order to guarantee that the resulting treatment groups in the sample are as representative as possible for the complete population, and that the treatment groups do not differ on average w.r.t. any baseline characteristic.
Example 2.5. Consider an extreme example, where we take a sample of
only 4 persons. The sample is taken completely at random from a population
with 50% men and 50% women, but since the sample size is extremely small, we end up with a sample of 4 men. We could now say that this sample is not representative of the population, although the sampling has been performed correctly. Such an example may however occur, especially with small samples. Stratified sampling may reduce such problems.
If we know a priori that the population consists of 50% men and 50% women, and that gender may have an influence on the experiment, then we could stratify for gender. This means that we will perform the sampling such that in every treatment arm there will be 50% men and 50% women, while still sampling as randomly as possible.
Suppose the total sample size is 40. Since the process of sampling completely at random and the randomization over the treatment groups are perfectly consistent, the stratified sampling can be established by first sampling 10 men completely at random from the population. These 10 men form the 50% men in the diet group. Next, again completely at random, 10 women are selected from the population. These women are also put into the diet group. Finally, independently of the sampling process that formed the diet group, 10 men and 10 women are sampled from the population to constitute the control group.
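To make the procedure concrete, here is a minimal sketch of stratified sampling in Python. The sampling frames, identifiers and group sizes are all hypothetical; the sketch only illustrates the idea of drawing the men and the women for each treatment arm completely at random.

import random

random.seed(1)
# hypothetical sampling frames (identifier lists), one per stratum; names and sizes are made up
men = ["man_%d" % i for i in range(5000)]
women = ["woman_%d" % i for i in range(5000)]

# draw 20 men and 20 women completely at random (without replacement)
men_drawn = random.sample(men, 20)
women_drawn = random.sample(women, 20)

# split each stratum over the two treatment arms: 10 + 10
diet_group = men_drawn[:10] + women_drawn[:10]
control_group = men_drawn[10:] + women_drawn[10:]

print(len(diet_group), len(control_group))   # 20 and 20, each with 50% men and 50% women

Because random.sample returns the drawn elements in random order, splitting each drawn list in two is, for practical purposes, equivalent to sampling the two arms separately as described in the text.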

2.4 Random Variables


A random variable is one of the most important concepts in statistics. It is the basis for the calculation of the probability statements that come along with statistical inference.
We will first illustrate a random variable based on Example 1, but suppose now that the response of the experiment is not the weight reduction of a person, measured in kilograms, but simply a binary response, say response = 1 if there is a weight reduction for the person, and response = 0 if there is no weight reduction. We could also say that the binary response indicates a success or no success of the diet. More formally, we say that the response variable Y is defined as
Y = \begin{cases} 1 & \text{if success} \\ 0 & \text{if no success} \end{cases}  (2.1)
We could define such a response variable for each person in the sample.
Therefore, we introduce an index i, referring to person i. Thus,
Y_i = \begin{cases} 1 & \text{if success for person } i \\ 0 & \text{if no success for person } i \end{cases}  (2.2)
where i = 1, \ldots, n (n persons in the sample).
It is very clear what Y means as soon as the experiment is finished, for then the responses are measured. But let us think about the behaviour of Y before the experiment, or how Y would behave under repeated sampling. For simplicity, we will first suppose that there is only 1 person in the sample.
Here is what may happen:
- We sample 1 person completely at random, give him the diet, and after 1 month we measure his weight, which results in Y = 1 or Y = 0. This is not deterministic! The result is not only determined by the true effect of the diet. As mentioned before, it may partly be determined by e.g. his genetic information, which we do not know and which is different among all people. Moreover, this one person is sampled completely at random from the population, which means that the sampling process is independent of e.g. his genetic information (in fact of any baseline characteristic). Thus, Y is also random. Say, e.g., that in our experiment the result is a success of the diet, thus Y = 1.
- Repeated sampling means that, at least in our minds, we repeat the whole experiment, independently of all previous experiments. Thus, in our simple experiment, we sample again 1 person completely at random from the population. Since we assumed the population to be (approximately) infinitely large, it will very probably be another person now. Moreover, this person will be independent of the previously sampled persons. Again we give him the diet and measure his weight after one month. Again, the response will be partly influenced by the diet itself, and partly by (unmeasurable) baseline characteristics of this person. Suppose that this time there was no weight reduction, thus now Y = 0.
- We could repeat this experiment over and over ... in our minds we could do it an infinite number of times.
Y is a random variable. It is not deterministic in the sense that in a repeated
sampling experiment (the thought-experiment given above) under constant
conditions (i.e. every time the same diet is given), the response, Y , is not
constant, and it cannot be predicted without (random) error. Fortunately there is probability theory, which allows us to study random variables and to state at least some of their properties. In our example, Y can only take two different values, 0 or 1, and is therefore called a Bernoulli random variable. Or, we say that Y is distributed as a Bernoulli random variable, or simply, Y is Bernoulli distributed. Typically distributions are determined by some parameters (see later for a more precise description). In our example, the only parameter describing the random process is a probability, say \pi. This is the probability of success, and it may be directly interpreted in terms of the repeated sampling experiment: suppose the experiment is repeated an infinite number of times; then, at the end, we could compute the fraction of experiments that resulted in a success (i.e. the number of successes divided by the total number of experiments). This fraction is by definition exactly the probability of success \pi. We write
Y \sim B(\pi).  (2.3)
This reasoning can be put the other way around: if we know that Y is a Bernoulli variable with e.g. \pi = 0.80, then this means that if we would do the experiment (approximately) an infinite number of times, we could expect that 80% of all experiments result in a success. However, this direction of the reasoning will not often happen in statistics. It is rather the other way around. Based on an experiment we want to estimate the probability of success \pi. At this point, you can already feel that the more experiments (or the larger the sample size), the more accurately \pi can be estimated.
A probability statement as given above is denoted by
P\{Y = 1\} = \pi.  (2.4)
Since a Bernoulli variable can take only two distinct values, we can very easily calculate the probability of no success,
P\{Y = 0\} = 1 - P\{Y = 1\} = 1 - \pi.  (2.5)
This is simply based on a basic property of probabilities, namely that they must sum to one, since Y must be either 1 or 0.
Applet. Applets 1b and 3b illustrate the sampling from a Bernoulli distribution. The applet can be used to illustrate the sampling and the meaning of the parameter \pi.
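The repeated sampling interpretation of \pi can also be imitated on a computer. The following minimal Python sketch (with an arbitrarily chosen value of \pi, which in a real study would be unknown) draws a large number of independent Bernoulli responses and shows that the fraction of successes is close to \pi.

import numpy as np

rng = np.random.default_rng(0)
pi = 0.30                                 # assumed true success probability (unknown in practice)
y = rng.binomial(1, pi, size=100_000)     # 100 000 repeated experiments, 1 person each
print(y[:10])                             # a few individual Bernoulli responses (0 or 1)
print(y.mean())                           # fraction of successes, close to pi = 0.30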
Back to the example. Suppose now that not only one person is sampled in
the experiment, but that more generally n persons are sampled completely
at random. Then we could define as response variable the total number of successes, which will of course be between 0 and n. Again you see that a similar repeated sampling experiment can be performed. But now, every repetition of the experiment results in a number between 0 and n. Again we call the response Y, which is now distributed as a binomial random variable, i.e.
Y \sim Bin(n, \pi),  (2.6)
where n and \pi are the parameters. Again \pi has the interpretation of the probability of success for a single person. In particular, this probability applies to all n persons in the sample.
When n > 1 the probability \pi may even be estimated from one sample, by simply dividing the number of successes by n, i.e. Y/n. Later we will discuss this in more detail. If, in the repeated sampling experiment, we calculated this estimate for every repetition, then, by definition, the parameter \pi would be the average of all the computed Y/n.
Note that there is a simple relation between a Bernoulli (Y_i) and a binomial (Y) random variable: Y = \sum_{i=1}^{n} Y_i, both with the same probability \pi = P\{Y_i = 1\}.
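This relation is easy to check by simulation. The sketch below (with an arbitrary \pi and n chosen only for illustration) builds binomial counts as sums of Bernoulli responses, and compares the empirical mean with n\pi and one empirical probability with the theoretical binomial probability.

import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(1)
n, pi, reps = 20, 0.30, 50_000
bern = rng.binomial(1, pi, size=(reps, n))   # n Bernoulli responses per repeated sample
y = bern.sum(axis=1)                         # Y = sum of the Y_i: one binomial count per sample

print(y.mean(), n * pi)                      # empirical mean vs. E{Y} = n*pi
print(np.mean(y == 6), binom.pmf(6, n, pi))  # empirical vs. theoretical P{Y = 6}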
Up to now we have only discussed so-called discrete random variables, in the sense that the variable can take only values in a finite set S. For the Bernoulli variable S = {0, 1}, and for the binomial variable S = {0, 1, . . . , n}. If, however, we go back to the original formulation of Example 1, then the response variable is the weight reduction in kilograms. For such a variable the set S of possible values is, at least theoretically, infinitely large, for the measurement is continuous. If the response variable Y is indeed defined as such, then we say that Y is a continuous random variable.
It is fairly simple to see that in theory we cannot use probability statements such as P\{Y = 7.4\}. The reason: if there are an infinite number of possible outcomes for a continuous random variable Y, then every probability of the above form is exactly zero. We will need another type of characterization for continuous random variables: distribution functions.
2.5 Distribution Functions
2.5.1 Some General Definitions
In the previous section we have argued that for continuous random variables Y, probability statements of the form P\{Y = 7.4\} cannot be used, for they are all zero. Nevertheless, we can use expressions of the form P\{Y \leq 7.4\}, i.e. the probability that Y is smaller than or equal to 7.4 kilograms. What does this mean in a repeated sampling experiment? It means that if the experiment (with one person in a sample) were repeated an infinite number of times, and every time the weight reduction of the person in the sample were measured, then the number of experiments that resulted in a weight reduction \leq 7.4 can be counted, and this number divided by the total number of experiments is, by definition, exactly equal to the probability P\{Y \leq 7.4\}. (Essential for the possibility of computing probabilities is that the event of which you want to know the probability must be countable in a repeated sampling experiment.)
Definition 2.3 (Distribution function): the distribution function of a random variable Y is defined as
F(x) = P\{Y \leq x\}.  (2.7)
Sometimes the distribution function is called the cumulative distribution function.
We will now give some mathematical properties of this function. They are illustrated in Applet 1b for the most common continuous distributions (e.g. the normal distribution).
Note that F(x) is a non-decreasing function in x. We will further assume that F(x) is a (right-)continuous function. Moreover, we will assume that its first derivative exists,
f(x) = \frac{d}{dx} F(x),  (2.8)
which is called the density function. It is a non-negative continuous function.
Once we know the relation between f and F, we can also define the distribution function as
F(x) = \int_{-\infty}^{x} f(y)\,dy,  (2.9)
which may be interpreted as the area under f between -\infty and x.
Based on basic probability calculations, we can also calculate
P\{Y > x\} = 1 - P\{Y \leq x\} = 1 - F(x) = \int_{x}^{+\infty} f(y)\,dy,  (2.10)
which is the area under f between x and +\infty. We often refer to such a probability as a tail probability. For a random variable Y with distribution function F we define the quantile y_\alpha implicitly as
P\{Y > y_\alpha\} = \alpha,  (2.11)
i.e. y_\alpha is such that the probability that Y is larger than y_\alpha is \alpha.
Also probabilities of the type P\{x_1 < Y \leq x_2\} can be calculated:
P\{x_1 < Y \leq x_2\} = F(x_2) - F(x_1) = \int_{x_1}^{x_2} f(y)\,dy,  (2.12)
i.e. it is the area under f between x_1 and x_2.
Applet. Applet 1b is used to illustrate the above concepts.
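In practice these quantities are rarely computed by hand. As an illustration (not part of the original course material), the following Python sketch uses scipy.stats to evaluate the distribution function, the density, a tail probability and a quantile for a normal distribution with arbitrarily chosen parameters.

from scipy.stats import norm

Y = norm(loc=500, scale=100)        # a normal distribution with mean 500 and st. dev. 100 (illustrative values)

print(Y.cdf(700))                   # F(700) = P{Y <= 700}
print(1 - Y.cdf(700))               # tail probability P{Y > 700}
print(Y.cdf(600) - Y.cdf(400))      # P{400 < Y <= 600}, as in Equation (2.12)
print(Y.pdf(500))                   # density f(500)
print(Y.ppf(1 - 0.05))              # quantile y_alpha for alpha = 0.05, i.e. P{Y > y_alpha} = 0.05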
2.5.2 Some Important Distributions
We will now discuss briefly 4 important continuous distributions: the normal, the t, the F and the \chi^2-distribution.
Applet. Applet 1b illustrates these 4 distributions and the interpretation
of their parameters.
The normal distribution
Without any doubt, the normal distribution is the most important distribution in statistics. Often we will assume that a random variable is normally distributed. A normally distributed random variable takes values in S = R, the set of real numbers ranging from -\infty up to +\infty. In practice, however, responses are often bounded. Still it is very often a reasonable approximation to use the normal distribution.
A normal distribution is characterized by its distribution function F, which is a function that is parameterized by two parameters: the mean \mu, and the variance \sigma^2. We sometimes write F_{\mu,\sigma^2} to stress this fact. The square root of \sigma^2, \sigma, is called the standard deviation. We say that the mean \mu determines the location of the distribution, and \sigma^2 determines the spread or the scale. This is very clearly illustrated in Applet 1b. A normally distributed random variable is denoted by
Y \sim N(\mu, \sigma^2).  (2.13)
Note also that the distribution is symmetric about \mu (i.e. the density f is symmetric). This implies e.g. that f(\mu - x) = f(\mu + x).
When \mu = 0 and \sigma^2 = 1, the corresponding normal distribution is called the standard normal distribution. We will often use the symbol Z to indicate a standard normally distributed random variable. This is a distribution that is symmetric about \mu = 0. Therefore, f(-z) = f(z), and
F(z) = P\{Z \leq z\} = 1 - P\{Z > z\},  (2.14)
and thus, z_\alpha = -z_{1-\alpha}.
Suppose that Y \sim N(\mu, \sigma^2); then the following important property holds:
Y \sim N(\mu, \sigma^2)  (2.15)
\Rightarrow Y - \mu \sim N(0, \sigma^2)  (2.16)
\Rightarrow \frac{Y - \mu}{\sigma} \sim N(0, 1),  (2.17)
i.e. the standardized variable \frac{Y - \mu}{\sigma} is a standard normal variable. More generally it holds for constants a and b that
Y \sim N(\mu, \sigma^2)  (2.18)
\Rightarrow Y - a \sim N(\mu - a, \sigma^2)  (2.19)
\Rightarrow \frac{Y - a}{b} \sim N\left(\frac{\mu - a}{b}, \frac{\sigma^2}{b^2}\right).  (2.20)
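The standardization property is what makes tables of the standard normal distribution sufficient for any normal distribution. A small sketch (with illustrative values for \mu, \sigma and x) shows that computing P\{Y \leq x\} directly and via the standardized variable gives the same result.

from scipy.stats import norm

mu, sigma = 500.0, 100.0            # illustrative parameter values
x = 700.0

p_direct = norm.cdf(x, loc=mu, scale=sigma)   # P{Y <= x} for Y ~ N(mu, sigma^2)
z = (x - mu) / sigma                          # standardize: z = (x - mu) / sigma
p_standard = norm.cdf(z)                      # P{Z <= z} for Z ~ N(0, 1)

print(p_direct, p_standard)                   # both equal 0.9772...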
The t-distribution
A t-distribution is parameterized by one parameter, f, which is called the
degrees-of-freedom. We say that a random variable, say T, is distributed
as t with f degrees of freedom, i.e.
T t
f
. (2.21)
Like the normal distribution, T is defined over S = R.
The t-distribution is also a symmetric distribution. Thus, for the quantiles corresponding to tail probabilities \alpha and 1 - \alpha it holds that
t_{f,\alpha} = -t_{f,1-\alpha}.  (2.22)
The F-distribution
The F-distribution is parameterized by two parameters, f_1 and f_2. Both are called degrees-of-freedom. In particular, f_1 is the degrees-of-freedom of the numerator, and f_2 is the degrees-of-freedom of the denominator. We say that e.g. F is distributed as F with f_1 and f_2 degrees-of-freedom,
F \sim F_{f_1, f_2}.  (2.23)
The F-distribution is not symmetric. F can take values in S = R^+.
The \chi^2-distribution
Finally we give the \chi^2-distribution, defined over S = R^+. It is parameterized by one parameter, f, the degrees-of-freedom. We say that X is \chi^2-distributed with f degrees-of-freedom,
X \sim \chi^2_f.  (2.24)
It is not a symmetric distribution.
2.6 Expected Value
2.6.1 The Mean
A very important concept in statistics is the expected value or the expectation of a random variable Y, denoted by E\{Y\}. It is sometimes also referred to as the mean of the variable Y. The expectation can be defined mathematically, but here we rather prefer to explain it in the context of a repeated sampling experiment.
As an example, we consider a normally distributed random variable Y. In particular, Y \sim N(\mu, \sigma^2). Suppose, in a repeated sampling experiment, that we sample an infinite number of times from such a distribution. And, at the end, we compute the average of all Y's. Then, by definition, this average is exactly equal to E\{Y\}, which is for the normal distribution equal to \mu. Thus, for a normal distribution it holds that
E\{Y\} = \mu.  (2.25)
Remember that we actually have already encountered this type of calculation in a repeated sampling experiment (Section 2.4). In particular, Y was there a Bernoulli random variable with probability \pi. There we have argued that \pi is exactly the fraction of the number of successes over the total number of repeated sampling experiments. Since the number of successes is also given by the sum of the Y's (remember they are 1 for a success, and 0 otherwise), this fraction is basically the average of the Y's, and thus \pi = E\{Y\}. This is an important example in the sense that here we have a probability that can be defined as an expectation.
If Y \sim Bin(n, \pi), then E\{Y\} = n\pi, and for X \sim \chi^2_f, E\{X\} = f.
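The repeated-sampling interpretation of the expectation is easy to imitate numerically. The following sketch (with arbitrary illustrative parameter values) averages a very large number of independently sampled observations and shows that this average settles down at E\{Y\}: at \mu for a normal variable and at \pi for a Bernoulli variable.

import numpy as np

rng = np.random.default_rng(2)
mu, sigma = 10.0, 2.0                       # illustrative values

y = rng.normal(mu, sigma, size=1_000_000)   # a huge number of repeated samples of size 1
print(y.mean())                             # close to E{Y} = mu = 10

y_bern = rng.binomial(1, 0.25, size=1_000_000)
print(y_bern.mean())                        # close to pi = 0.25 for a Bernoulli variable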
2.6.2 The Variance
Based on the definition of the expectation, the variance of a random variable can be defined as well:
Var\{Y\} = E\left\{ (Y - E\{Y\})^2 \right\},  (2.26)
i.e. the variance of Y is the expectation of the squared deviation from its mean. And, since it is defined as an expectation, it can be interpreted in the repeated sampling context, but here in every repetition of the experiment (Y - E\{Y\})^2 must be computed and averaged at the end.
For the normal distribution, it can be shown that
Var\{Y\} = \sigma^2,  (2.27)
for the Bernoulli distribution Var\{Y\} = \pi(1 - \pi), and for the binomial distribution Var\{Y\} = n\pi(1 - \pi). Finally, if X \sim \chi^2_f, Var\{X\} = 2f.
Its name already suggests that the variance is some kind of measure of the variability of a random variable. To illustrate this interpretation, we consider a very extreme limiting case: suppose that Var\{Y\} = 0, then E\left\{ (Y - E\{Y\})^2 \right\} = 0. Hence, if, in a repeated sampling experiment, the sampled Y were measured an infinite number of times, and every time (Y - E\{Y\})^2 is computed, then, in order to end up with an average of these (Y - E\{Y\})^2 equal to zero, almost every repetition must result in a (Y - E\{Y\})^2 extremely close to zero, thus Y extremely close to its expectation E\{Y\}. Hence, when Var\{Y\} = 0, we may be extremely sure that every sampled Y is almost exactly equal to E\{Y\}. We say that there is almost no variability. Sometimes we will also say that there is almost no uncertainty about Y.
2.7 The Relation between Population and Distribution
In this section we will illustrate that there is a very tight relation between a population and a distribution function, as soon as the response variable is defined and the experimental circumstances are set. We will split the discussion into two parts. First we will show that once the population is defined, the distribution function is set as well. Next, we will go the other way around.
Population → distribution function.
Example 2.6. Example 1 continued. As before, we define the population as the set of all people older than 12 years, living in Belgium. Further, we define the response variable as the Bernoulli variable Y indicating success (i.e. a weight reduction). We argue now that with these specifications the distribution of Y is determined, though not necessarily known (e.g. parameters may still be unknown).
Since Y is a binary variable, we know already that Y \sim B(\pi), though \pi is still unknown to us. However, in theory, now that the population is defined, we could repeatedly sample from the population. By counting the fraction of successes in an infinite number of repeated experiments, P\{Y = 1\} is completely determined. Hence, so is the unknown parameter \pi = P\{Y = 1\}. Of course, P\{Y = 0\} = 1 - P\{Y = 1\} is then also known. Remark that in the above example the link between distribution and population is only there thanks to the repeated sampling interpretation of probability. And the repeated sampling experiments cannot in general be performed in practice; we should only use them as a theoretical tool. Here we used it to establish a relation between population and distribution.
Distribution function → population.
Suppose now that we have only defined a distribution, e.g. a Bernoulli distribution with probability \pi. We further know that the experiment consists of measuring success (e.g. weight reduction). Then we know that these specifications imply the existence of an infinitely large population, such that the relative frequency of success is exactly equal to \pi.
The arguments presented here are fairly easy to understand when Y is a discrete Bernoulli variable. Let's now try to make the same exercise when Y is a continuous random variable, e.g. Y \sim N(0, 1). We now claim that there is a link between the population and the complete distribution function F(x) = P\{Y \leq x\} for all x. Well, the same reasoning as above applies, except that not the relative frequencies of success must be calculated from the repeated sampling experiments, but rather the relative frequencies of all the events of the form Y \in \; ]-\infty, x]. All these relative frequencies must equal the theoretical probabilities
P\{Y \in \; ]-\infty, x]\} = P\{Y \leq x\} = F(x).  (2.28)
In conclusion, the population may also be seen as a hypothetical infinitely large set of values that occur with the relative frequencies given by the distribution function. This construction of the population is purely mathematical, but of course, in statistics, there must be a one-to-one relation between this mathematical population and the real population from which a real sample can be taken. Thus, in Example 1, the population consists of the weight reductions of all people of 12 years and older, living in Belgium.
Finally, you may have noticed that we have often reasoned with a sample of size 1. Especially in the repeated sampling experiments, we have often said that each repetition consisted of sampling one unit and looking at one observation Y at a time. Of course, the same reasoning applies when larger samples are taken at each step in the repeated sampling experiments. In particular, remember that every sample has to be taken completely at random. This implies that any two observations in a sample are independently sampled. In the repeated sampling experiments, this means that there is no difference between two sampled observations in one sample, and two sampled observations in two different repeated samples. Hence, relative frequencies hold over all observations over all repeated samples.
2.8 Experimental versus Observational Studies
In this chapter we have put stress on the importance of sampling completely at random, and on the necessity of randomization. In practice, however, there are often situations in which this randomness cannot be assured. E.g. what is the effect of a fat diet on the risk of a heart attack? It is clear that you cannot randomize people over a low-fat and a high-fat diet for their whole lifetime. Still, statistical procedures exist that take this lack of randomness into account (or, better, procedures exist that let the randomness come into play at another stage!). Typically, such studies are the subject of epidemiology. These methods are, however, not treated in this course.
A study in which no explicit sampling is performed, as in the above example, is often called an observational study. Another typical example of an observational study is a survey in which several questions are asked to a sample of people. The aim is then often to find relations between the answers (e.g. the relation between smoking status and health problems). Here we should be careful: e.g. because of the lack of randomization (e.g. over the group of smokers and the group of non-smokers), we cannot infer causal relationships.
Experiments that are performed in laboratories are typically experimental studies, for the experiment can be designed and analyzed completely according to the experimenter's demands. These demands include the practical implications of the typical assumptions that are needed to guarantee the correctness of a formal statistical inference, e.g. randomization.
Chapter 3
The Mean and the Variance
In this chapter we discuss the sample mean and the sample variance, which are computed from a sample. We will show that these quantities, which are generally referred to as statistics, are random variables and that they thus have a particular distribution. Based on the knowledge of these distributions we will derive some very important properties which should be kept in mind whenever one uses the sample mean and sample variance. We will also discuss confidence intervals.
3.1 Estimating a Distribution: Introductory
Example
Example 3.1. Suppose that you need to report on the cholesterol levels in
blood of a particular group of people, say the inhabitants of Italy. Measuring
the cholesterol level in every person is of course not feasible. Therefore,
we will take only a sample of people. Even if we were able to measure all
cholesterol levels, it is clear that some people will have a low level, and others
will have a high level. We could think of the set of all cholesterol levels as
a population which is characterized by a distribution function or a density
function (Figure 3.1). Based on the sample we now want to estimate the true
distribution, and of course we want this estimate to be as close as possible
to the true but unknown distribution.
One simple solution could be to draw a histogram. This could indeed be
considered as an estimate of the distribution.

Figure 3.1: The density function of the cholesterol level (left) and a histogram of a sample of 100 people (right)

Figure 3.2: A histogram of a sample of 100 people with changed positions of the bins

Figure 3.1 shows a possible
histogram of the cholesterol levels, based on a sample of 100 people. We will often use the histogram, but rather as an exploratory tool, because the histogram has one major drawback: it cannot simply be used in further calculations. This is easy to see: a histogram is not a single number. It is rather a collection of numbers. In particular it is a collection of frequencies (counts) and intervals (bins). Moreover, the exact position and width of the bins must be chosen by the researcher. This may even make the histogram a subjective tool. Figure 3.2 shows a histogram, based on the same sample as for the previous histogram, but now we have changed the positions of the bins. As you can see, the histogram looks quite different. It even gives almost the opposite conclusion with respect to the skewness of the distribution. Thus, histograms may be helpful as an exploratory tool, but they may also be misleading because of the arbitrariness of the positions of the bins. Furthermore, a histogram is not a simple number which can be used for further calculations.
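The sensitivity of a histogram to the choice of the bins is easy to demonstrate. The sketch below (with simulated data standing in for the cholesterol sample, so the numbers are only illustrative) counts the same 100 observations in two sets of bins of equal width but shifted positions; the two count patterns can look quite different.

import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(500, 100, size=100)                                 # stand-in for 100 cholesterol levels

counts1, edges1 = np.histogram(x, bins=np.arange(100, 901, 50))    # bins 100-150, 150-200, ...
counts2, edges2 = np.histogram(x, bins=np.arange(125, 926, 50))    # same width, positions shifted by 25

print(counts1)
print(counts2)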
In our discussion thus far, we have assumed nothing about the true distribution of cholesterol levels. Sometimes, however, such an assumption may simplify the problem considerably. Here, we will assume that the distribution is a normal distribution (note that the distribution in Figure 3.1 indeed clearly shows a normal distribution). We have seen in the previous chapter that a normal distribution is completely determined by two parameters: the mean \mu and the variance \sigma^2. Thus, when the cholesterol level is indeed normally distributed, then it is sufficient to know these two parameters. In practice, this implies that we only have to estimate \mu and \sigma^2 from the sample. It is obvious that \mu and \sigma^2 are estimated by two numbers and that therefore they can be used in further calculations.
It can be shown that the mean \mu can be estimated by means of the sample mean, which is simply the average of the observations in the sample. Suppose the sample has n observations denoted by X_i (i = 1, \ldots, n); then the sample mean is given by
\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i.  (3.1)
In general, we will denote an estimator of a parameter by the same symbol, but with a hat added to it. Thus, \mu is estimated by \hat{\mu} and \sigma^2 by \hat{\sigma}^2. And, thus, \hat{\mu} = \bar{X}. The variance is estimated by means of the sample variance, which is given by
\hat{\sigma}^2 = S^2 = \frac{1}{n-1} \sum_{i=1}^{n} \left( X_i - \bar{X} \right)^2.  (3.2)
Later we will discuss \bar{X} and S^2 in more detail.
In this example, we find, for instance, \hat{\mu} = \bar{X} = 510.97 and \hat{\sigma}^2 = 11122.77 (or, \hat{\sigma} = 105.46).
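For completeness, here is how Equations 3.1 and 3.2 are evaluated in Python. The ten cholesterol values are made up for the illustration, so the resulting numbers differ from the ones quoted above; note the divisor n - 1 in the sample variance.

import numpy as np

x = np.array([434.0, 619.0, 513.0, 566.0, 471.0, 625.0, 398.0, 552.0, 487.0, 544.0])  # hypothetical readings
n = len(x)

xbar = x.sum() / n                       # sample mean, Equation 3.1
s2 = ((x - xbar) ** 2).sum() / (n - 1)   # sample variance, Equation 3.2

print(xbar, s2)
print(x.mean(), x.var(ddof=1))           # the same results with numpy's built-in functions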
Once \hat{\mu} and \hat{\sigma}^2 are calculated from the sample, the true distribution is estimated as a normal distribution with mean \hat{\mu} and variance \hat{\sigma}^2. Figure 3.3 shows the density function of this estimated normal distribution; the density function of the true distribution is added as a reference.
Figure 3.3: The estimated density function (full line) and the density function of the true distribution (dotted line)

Finally, we would like to note that if we had assumed another distribution, e.g. a t-distribution, then it is not the mean and the variance that should have been estimated, for these are not parameters of the t-distribution. The degrees-of-freedom is the one and only parameter of the t-distribution, and it is this parameter that in this case should be estimated (we will not discuss this any further here). Later, we will argue that even if we do not assume a distribution of which the mean and/or variance are natural parameters, it
is still meaningful to compute \bar{X} and S^2. One of the reasons is that these quantities have a clear and important interpretation (location and scale of the distribution).
3.2 Statistics
When observations are real numbers (e.g. the cholesterol level), we could e.g. assume that the observations are normally distributed, i.e.
X_i \sim N(\mu, \sigma^2),  (3.3)
where \mu and \sigma^2 are unknown.
In Example 1 we have shown that \mu and \sigma^2 can be estimated from a sample by means of the formulae in Equations 3.1 and 3.2. These expressions also show that \bar{X} and S^2 are functions of the elements of the sample (i.e. the sampled observations). From the previous chapter we know that these sampled observations (X_i) are random variables (e.g. with the repeated sampling reasoning: in repeated experiments the observations are sampled completely at random, and in each of these repeated experiments the sampled observations will very probably be different every time). And, therefore, \bar{X} and S^2 are random variables as well! Indeed, in the repeated sampling reasoning, \bar{X} and S^2 will be different for every new sample. Thus \bar{X} and S^2 are random variables themselves. As with every random variable, their behaviour can be characterized by means of a distribution (distribution function or density function).
Applet. The behaviour of \bar{X} in repeated sampling experiments is illustrated in Applet 3a. Applet 3c applies to S^2.
\bar{X} and S^2 are only two examples of what we will call statistics. In general a statistic is a random variable that is defined as a function of the observations in the sample. Suppose X = \{X_1, \ldots, X_n\} denotes the sample, and g denotes some function; then
T = g(X)  (3.4)
is called a statistic. Its distribution is completely determined by the distribution of the observations in X and by the function g. Later we will find the distributions of \bar{X} and S^2 because these are two very important examples of statistics.
Sometimes, \bar{X} and S^2 are referred to as summary statistics because they summarize the distributional behaviour of the observations in one number (Example 1 shows this very clearly when we are prepared to assume that the data is normally distributed). There are some other statistics that may be called summary statistics: e.g. the median, the 1st and the 3rd quartile. The median, M, is defined as the element of the sample such that half of the observations (n/2) are smaller than M and half are larger than M (if n is even, M is typically the average of the two middle observations). The 1st quartile, Q1, is defined as the element of the sample such that n/4 observations are smaller than Q1. The 3rd quartile, Q3, is defined similarly, but with 3n/4 observations smaller than Q3.
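These summary statistics are directly available in most software. A small sketch (with a made-up sample) computes the median and the two quartiles with numpy; note that software packages may use slightly different interpolation conventions than the textbook definition when n/4 is not an integer.

import numpy as np

x = np.array([3.1, 4.7, 2.9, 5.0, 3.8, 4.2, 3.5, 4.9])   # hypothetical sample

print(np.median(x))            # median M
print(np.percentile(x, 25))    # 1st quartile Q1
print(np.percentile(x, 75))    # 3rd quartile Q3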
Since \bar{X} and S^2 are estimators of parameters (\mu and \sigma^2), and since they are just single numbers, they are also called point estimators. Later we will see that there are statistics that are not just estimators of a parameter. In this course we will discuss three types of statistics:
- point estimators (e.g. \bar{X} = \hat{\mu} as a point estimator of \mu)
- interval estimators (see later)
- test statistics (see later)
3.3 The Distribution of the Sample Mean
The sample mean is given by Equation 3.1 and may be seen as a point estimator of the true mean of the distribution. When the distribution is a normal distribution, then the mean \mu is one of the two parameters characterizing the normal distribution. But also when the distribution is not a normal distribution, the mean is a meaningful quantity (see Section 2.6), i.e. the mean is always defined as the expected value. Thus, whatever distribution the observations X_i have, the expected value E\{X_i\} may be of interest. Even if the distribution is not normal, we will use the notation \mu = E\{X_i\} to denote the mean or the expected value.
3.3.1 The Distribution of the Sample Mean under the
Normal Assumption
Applet. Applet 3a may be used to illustrate the following properties of the distribution of the sample mean:
- the expected value of the sample mean equals the expected value of the distribution of the observations, i.e. E\{\bar{X}\} = E\{X_i\}
- the variance of the sample mean decreases as the sample size (n) increases
- when the observations are sampled from a normal distribution (i.e. X_i \sim N(\mu, \sigma^2)), then the sample mean is also normally distributed
Without proof we will now formulate the conclusions from the above applet in a more formal way. First we will assume that the observations are normally distributed. In the next section we will drop this assumption.
Suppose the observations are normally distributed, i.e.
X_i \sim N(\mu, \sigma^2),  (3.5)
and that we consider a sample X_1, \ldots, X_n of size n. Further, we will assume that all n observations are independent (e.g. by independent sampling). Sometimes, when we want to stress that the observations X_i are independent, we will write
X_i \text{ i.i.d. } N(\mu, \sigma^2),  (3.6)
where i.i.d. stands for "... identically and independently distributed as ...".
Under these conditions it can be shown that
\bar{X} \sim N\left( \mu, \frac{\sigma^2}{n} \right).  (3.7)
Thus, the sample mean \bar{X} is also normally distributed. Moreover it has the same expected value as the observations, i.e. E\{\bar{X}\} = E\{X_i\} (exercise: explain this in terms of repeated sampling experiments). The variance of \bar{X} is proportional to the variance \sigma^2 of the observations, and inversely proportional to the sample size n, i.e. Var\{\bar{X}\} = \frac{\sigma^2}{n}. This is an extremely important property! It says that, as long as the sample size is finite, there is sampling variability, i.e. in the repeated sampling experiments, the repeatedly calculated sample means will be different (i.e. the sample mean has a variance due to sampling). Thus, as we actually all know, the sample mean is not exactly equal to the true mean: the sample mean is only an estimator of the true mean. Moreover, since Var\{\bar{X}\} = \frac{\sigma^2}{n}, this sampling variability becomes smaller as the sample size increases. Thus, the larger the sample, the more accurate the sample mean is as an estimator of the true mean. And, in the limit as n approaches infinity, Var\{\bar{X}\} even becomes approximately zero, which means that, in the limit, there is no sampling variability left, and \bar{X} becomes exactly equal to the true mean. This property is an asymptotic property, i.e. it holds when the sample size goes to infinity. The property that for a point estimator T, Var\{T\} \to 0 as n \to \infty is called consistency. Thus, we may conclude that \bar{X} is a consistent estimator for the true mean \mu = E\{X_i\}.
The property that E\{\bar{X}\} = \mu is also extremely important. We say of such an estimator that it is unbiased. The bias can be determined as E\{\bar{X}\} - \mu, which is zero in our case.
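A repeated sampling experiment of this kind is easy to carry out on a computer. The sketch below (with arbitrary illustrative values of \mu, \sigma and n) draws many samples of size n from a normal distribution, computes the sample mean of each, and checks that the average of these sample means is close to \mu (unbiasedness) and that their variance is close to \sigma^2/n.

import numpy as np

rng = np.random.default_rng(4)
mu, sigma, n, reps = 10.0, 2.0, 25, 20_000

samples = rng.normal(mu, sigma, size=(reps, n))   # 20 000 repeated samples of size n
means = samples.mean(axis=1)                      # one sample mean per repetition

print(means.mean())     # close to mu = 10 (unbiasedness)
print(means.var())      # close to sigma**2 / n = 0.16
print(samples.var())    # variance of the individual observations, close to sigma**2 = 4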
3.3.2 The Distribution of the Sample Mean without
the Normal Assumption
When the observations X_i are not normally distributed, then, in general, the distribution of \bar{X} is not a normal distribution. Its distribution depends on the distribution of the observations used in the calculation of \bar{X}. Still the following important properties hold:
- E\{\bar{X}\} = E\{X_i\} = \mu
- Var\{\bar{X}\} = \frac{Var\{X_i\}}{n} = \frac{\sigma^2}{n}
(Note that these properties of course also hold when the observations X_i are normally distributed; also note that we still use the notation E\{X_i\} = \mu and Var\{X_i\} = \sigma^2, even for non-normal situations.) Thus, also here \bar{X} is a consistent estimator of the true mean. The above-mentioned properties remain valid for discrete distributions!
In the previous chapter we have seen that the square root of the variance is called the standard deviation. Here, the same applies, except that we now have to mention to which statistic the standard deviation refers. Thus, \sqrt{Var\{\bar{X}\}} is called the standard deviation of the mean. In some literature, you might read "standard error of the mean", or simply "standard error".
3.3.3 The Central Limit Theorem
Applet. Applets 3a and 3b may be used to illustrate the following property of the sample mean: when the observations are sampled from a distribution different from a normal distribution, then the distribution of the sample mean becomes approximately a normal distribution when the sample size is sufficiently large.
Thus, the applet shows that, as long as the sample size is sufficiently large, the sample mean is normally distributed, irrespective of the distribution of the observations X_i. This result is known as the Central Limit Theorem
(CLT). We may formulate the CLT as one of the following:
- the sample mean \bar{X} is asymptotically normally distributed (here, asymptotically really means the limit n \to \infty)
\bar{X} \sim N\left( \mu, \frac{\sigma^2}{n} \right)  (3.8)
- the sample mean \bar{X} is approximately normally distributed (here, approximately refers to large, but finite, sample sizes; in practice, often n = 30 or n = 50 is sufficiently large)
\bar{X} \stackrel{.}{\sim} N\left( \mu, \frac{\sigma^2}{n} \right)  (3.9)
The CLT applies to discrete distributions as well. This is illustrated in the next example.
Example 3.2. An example that will also be used later is the following. A group of 50 people (a panel) is given 3 glasses of milk. Two glasses are filled with the same milk, and the 3rd glass is filled with another type of milk. There is no visual difference between the 3 glasses. Each member of the panel is asked to point out the glass that tastes different from the other two. From a statistical point of view we could say that the response variable for person i is defined as
X_i = \begin{cases} 1 & \text{if person } i \text{ selects the right glass of milk} \\ 0 & \text{if person } i \text{ does not select the right glass of milk} \end{cases}  (3.10)
Thus, X_i is Bernoulli distributed. A Bernoulli distribution is characterized by only one parameter: the probability \pi (i.e. the probability of selecting the right glass of milk). In Section 2.6 we have seen that E\{X_i\} = \pi. Thus, the sample mean \bar{X} = \frac{1}{50}\sum_{i=1}^{50} X_i is a good estimator of \pi. In particular, it is an unbiased estimator in the sense that E\{\bar{X}\} = \pi, and it is a consistent estimator in the sense that Var\{\bar{X}\} = \frac{Var\{X_i\}}{n} \to 0 as n \to \infty.
From Section 2.6 we also know that Var\{X_i\} = \pi(1 - \pi). Thus, Var\{\bar{X}\} = \frac{\pi(1-\pi)}{n} = \frac{\pi(1-\pi)}{50}.
Thus far we already know the mean and the variance of the point estimator \bar{X} of \pi, but we still do not know its distribution. Fortunately, the CLT tells us that for sufficiently large n, the distribution of \bar{X} is approximately normal, even for discrete distributions. Thus we have
\bar{X} \stackrel{.}{\sim} N\left( \pi, \frac{\pi(1-\pi)}{50} \right).  (3.11)
Here, we believe that n = 50 is indeed sufficiently large to make the approximation hold.
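The approximation in Equation 3.11 can be checked by simulation. In the sketch below the true value of \pi is set to 1/3, which is what one would expect if every panellist simply guessed; this value is an assumption for the illustration only. Repeating the panel experiment many times shows that the mean and variance of \bar{X} match \pi and \pi(1-\pi)/50, and a histogram of the simulated \bar{X} values would look approximately normal.

import numpy as np

rng = np.random.default_rng(5)
pi, n, reps = 1/3, 50, 20_000              # pi = 1/3 corresponds to pure guessing (illustrative assumption)

x = rng.binomial(1, pi, size=(reps, n))    # reps repeated panels of n = 50 Bernoulli responses
xbar = x.mean(axis=1)                      # one sample mean (fraction of correct answers) per panel

print(xbar.mean(), pi)                     # E{Xbar} = pi
print(xbar.var(), pi * (1 - pi) / n)       # Var{Xbar} = pi(1 - pi)/50
# a histogram of xbar is approximately N(pi, pi(1 - pi)/50), as stated in Equation 3.11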
3.4 Accuracy of the Sample Mean and Sample Size
Equation 3.2 illustrates very well how the accuracy of an estimator is related to the sample size. Indeed, as we have seen before, Var\{\bar{X}\} is a measure of the variability of \bar{X} from sample to sample in the repeated sampling experiments. Thus, the smaller Var\{\bar{X}\}, the more accurately the mean is estimated by \bar{X}.
Equation 3.2 shows that Var\{\bar{X}\} depends on two quantities: the variance \sigma^2 = Var\{X_i\} of the observations, and the sample size n. We will suppose that we cannot influence the variance of the observations, i.e. \sigma^2 is an inherent property of the population of interest. What we do have under our control is the sample size n. In this section we will see an easy method that may sometimes be used to determine n such that a minimal level of accuracy is guaranteed. The method is first illustrated in the following example.
Example 3.3. Example 3.2 continued. The aim is to have an accurate estimate of the probability \pi. We will measure the accuracy by the absolute difference between the estimator \bar{X} and the true probability \pi, i.e. |\bar{X} - \pi|. Since \bar{X} is a random variable, |\bar{X} - \pi| is a random variable as well. Thus, we can only make probability statements about the accuracy. Since \bar{X} is the sample mean, i.e. it is the average of the n elements of the sample, we will write \bar{X}_n to stress the fact that it depends on the sample size n. Let \delta > 0 denote the minimal accuracy imposed by the researcher (here, e.g., \delta = 0.10). Then, we could e.g. determine n such that the accuracy is attained with a probability of 95%. Thus, the probability statement becomes
P\left\{ |\bar{X}_n - \pi| < \delta \right\} = 0.95,  (3.12)
or, equivalently,
P\left\{ -\delta < \bar{X}_n - \pi < \delta \right\} = 0.95.  (3.13)
In general, probabilities can only be calculated when the distribution is known. Here, we know that for sufficiently large n, \bar{X}_n is approximately normally distributed. In Section 2.5 we have seen that normally distributed random variables may be standardized by subtracting the expected value and dividing by the standard deviation. If we apply this here, we obtain
\bar{X} \sim N\left( \pi, \frac{\pi(1-\pi)}{n} \right)  (3.14)
\Rightarrow \bar{X} - \pi \sim N\left( 0, \frac{\pi(1-\pi)}{n} \right)  (3.15)
\Rightarrow \frac{\bar{X} - \pi}{\sqrt{\frac{\pi(1-\pi)}{n}}} \sim N(0, 1).  (3.16)
If we divide all parts in the inequality in the probability operator in Equation
Applied Statistics for the Food Sciences Chapter 3 p. 32
3.13, then we become
P
_
_
_


_
(1)
n
<

X
n

_
(1)
n
<

_
(1)
n
_
_
_
= 0.95, (3.17)
where now the middle part in the probability operator is a standard normal
distributed random variable. For such a standard normal random variable,
say Z, we know that by the denition of the quantiles (Section 2.5),
P
_
z
/2
< Z < z
/2
_
= 1 , (3.18)
which gives with 1 = 0.95,
P{z
0.025
< Z < z
0.025
} = 0.95, (3.19)
and from the table of the quantiles of a standard normal distribution, we nd
z
0.025
= 1.96. Since Equations 3.17 and 3.19 are equivalent, we nd

_
(1)
n
= 1.96, (3.20)
from which we nd
n =
1.96
2
(1 )

2
. (3.21)
Filling in = 0.10 into Equation 3.21 we become
n = 384.16(1 ). (3.22)
The problem with the above solution is, of course, that it still depends on
the unknown parameter . Indeed, is unknown, otherwise we would not
have been interested in estimating it! How should we proceed now??? A
simple solution is to calculated n for all possible values for , for which we
know that it is not smaller than 0 and not larger than 1 (because is a
probability). This is shown in Figure 3.4. It may now be concluded that n
is maximal for = 0.5, i.e. n = 96.04. Since n must be an integer, we will
have to take n = 97 which is the smallest integer larger than 96.04. This
solution is the safest solution, whatever the true value of .
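The curve of Figure 3.4 and the maximal sample size are easy to reproduce; a minimal R sketch (the syntax is essentially shared with S-Plus):

  delta <- 0.10
  pi <- seq(0, 1, by = 0.01)
  n.pi <- 1.96^2 * pi * (1 - pi) / delta^2   # Equation 3.21
  plot(pi, n.pi, type = "l")                 # reproduces Figure 3.4
  max(n.pi)                                  # 96.04, attained at pi = 0.5
  ceiling(max(n.pi))                         # 97: the safe choice, whatever pi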
In the above example we have encountered one problem: the solution for n, given by Equation 3.21, still contains one unknown parameter (π). Fortunately, π can only take values in the interval [0, 1], and for all these values the
Figure 3.4: The solution for n as a function of the unknown parameter π
solution for n was practically feasible (i.e. n remained finite; it even never exceeded 100). If, on the other hand, we would have done the same exercise, but starting from a normal population (i.e. the observations are normally distributed), then we would have found the following solution for the sample size n,

n = 1.96² σ² / δ²,   (3.23)

where σ² is the variance of the normal distribution of the observations. Very often σ² is unknown to the researcher. Furthermore, since σ² has no upper bound, in theory at least, σ² may be infinitely large, and by Equation 3.23, n will then become infinitely large as well. Clearly, this is not a neat solution. However, when σ² is known by the researcher, Equation 3.23 may be used.
Sometimes one does not know σ², but one can estimate σ² from previously obtained data, or from a small experiment. In that case we could use σ̂² = S² in Equation 3.23 instead of the unknown σ². There are, however, some theoretical problems with this approach. These will be discussed later in this chapter. First the distribution of the sample variance is discussed.
3.5 The Distribution of the Sample Variance
In the previous chapter it was illustrated how the variance, σ², or the standard deviation σ can be interpreted. It is the characteristic of the distribution that measures the scale. This is clear when looking at the density function: the larger σ, the wider the density, which means that the observations are more variable about the mean.
One can imagine that it is indeed important to know the variance of a distribution. Hence, it is important to have an estimate of the variance. Equation 3.2,

σ̂² = S² = (1/(n − 1)) Σ_{i=1}^{n} (X_i − X̄)²,   (3.24)

gives the formula. Exactly as with the mean and the sample mean, S² is the sample variance, and it is a point estimator of the variance σ². Therefore, we sometimes write σ̂² instead of S². Also, even if the observations are not normally distributed, the variance is still an interesting characteristic of the distribution.
Example 3.4. In a laboratory, one is often interested in knowing the variability due to a measurement method. This is the variability (or the variance) in the measurements that would appear if the measurement were repeated many times on the same sample (note: here "sample" is not meant in the statistical sense). Suppose we want to know the variance of measurements of the concentration of a chemical compound (e.g. NH4) (titrimetric measurements). Then we could make a solution of this chemical compound and distribute this solution over n recipients. We may now expect that the true concentration is equal in all recipients. Next, the concentration is measured n times independently. Thus, we can say that we have n independent observations from some distribution, with unknown mean (we do not know the exact concentration) and unknown variance. It is clear that the smaller the variance, the more accurate the measurement method. The variance can be estimated by using Equation 3.2.
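The repeated-sampling behaviour of this estimator can be explored with a small simulation; in the R sketch below the true mean and variance are arbitrary assumed values. It shows that the average of many simulated S² values stays close to the true σ², and that the spread of S² shrinks when n grows.

  set.seed(1)
  sigma2 <- 4; mu <- 10
  for (n in c(5, 20, 80)) {
    s2 <- numeric(5000)
    for (b in 1:5000) s2[b] <- var(rnorm(n, mean = mu, sd = sqrt(sigma2)))
    cat("n =", n, " mean of S^2 =", round(mean(s2), 2),
        " variance of S^2 =", round(var(s2), 2), "\n")
  }
  # the mean stays near sigma2 = 4 (unbiasedness); the variance decreases with n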
As with the sample mean, we will have to make a distinction between the
situation where all data are normally distributed, and the case where the
distribution of the observations is unknown.
3.5.1 The Distribution of the Sample Variance under
the Normal Assumption
Applet. The distribution of the sample variance is illustrated in Applet 3c. In particular it shows that
- S² is an unbiased estimator of σ²
- the variance of S² decreases as n increases
Suppose that X_i i.i.d. ~ N(μ, σ²) (i = 1, . . . , n). Then, it can be proven that

(n − 1)S² / σ² ~ χ²_{n−1}.   (3.25)

Since the mean of a χ²_{n−1} distributed random variable is equal to n − 1, we see immediately that

E{S²} = σ²,   (3.26)

i.e. S² is an unbiased estimator of σ².
We also know that the variance of a χ²_{n−1} distributed random variable is equal to 2(n − 1). Thus,

Var{S²} = 2σ⁴ / (n − 1),   (3.27)

from which we learn that the variance of the sample variance S² is inversely proportional to the sample size n. Thus, Var{S²} → 0 as n → ∞, and, hence, the sample variance is a consistent estimator of σ².
3.5.2 The Distribution of the Sample Variance without
the Normal Assumption
As with the sample mean, an assertion as in Equation 3.25 cannot be proven when the data are not assumed normally distributed.
We will only mention here that, asymptotically (i.e. when n → ∞), S² is still an unbiased, consistent estimator of σ². There is a form of the CLT that applies to S², but we will not go into detail here.
3.5.3 Some Final Remarks
Some of the terminology may be confusing, especially here, where we also consider the variance of the sample variance. In the next few lines we have written a few statements. Convince yourself, as an exercise, that these statements are correct.
- the mean of the sample variance is the variance
- the mean of the sample mean is the mean
- under the normal assumption, the variance of the variance estimator is given by 2σ⁴/(n − 1)
3.6 Interval Estimator of the Mean
In Section 3.2 we have already mentioned that point estimators are not the only type of statistics. They are not even the only type of estimators. In this section we will discuss the interval estimator of the mean. First we will argue why there is a need for another type of estimator.
3.6.1 Disadvantage of Point Estimators
It is easy to imagine that it is possible that in a first experiment n_1 observations are sampled, resulting in a sample mean X̄_1, and that in a second experiment, with n_2 > n_1 observations sampled from a distribution with the same variance (σ²) as in the first experiment, the sample mean X̄_2 exactly equals X̄_1. Although in this artificial example the two sample means are exactly equal, we know that

Var{X̄_1} = σ²/n_1 > σ²/n_2 = Var{X̄_2},   (3.28)

i.e. the second sample mean is more accurate than the first. However, if we would only report the sample means, the reader of this report would not know anything about the variability of the estimators. The reader would not have a clue how much faith he must put in the sample mean as an estimator of the true mean.
The above mentioned criticism can be avoided by not only reporting the sample mean, but also
- the sample size n
- the variance σ², or, when the true variance is not known, its estimator S²
Yet another way of presenting an estimator of the mean is by reporting the interval estimator of the mean. An interval estimator is not just a number, but an interval which is associated with a certain probability statement. The smaller the interval, the more accurately the mean is estimated. The width of the interval immediately reflects the degree of uncertainty about the mean which is present in the dataset. Therefore, interval estimators are often preferred over point estimators.
3.6.2 Interval Estimator of the Mean
An interval is completely determined by a lower and an upper bound, denoted by L and U, respectively. Basically, we will calculate L and U from the sample. Hence, L and U may be called statistics. The interval itself, [L, U], is also generally known as the confidence interval of the mean.
We will determine L and U such that the true mean μ is in the interval [L, U] with probability 1 − α, i.e.

1 − α = P{ μ ∈ [L, U] }   (3.29)
      = P{ L ≤ μ ≤ U }.   (3.30)
The solution is very simple when we assume that X_i i.i.d. ~ N(μ, σ²). We start from the standardized sample mean, for which we know that

X̄ ~ N(μ, σ²/n)   (3.31)
X̄ − μ ~ N(0, σ²/n)   (3.32)
(X̄ − μ) / √(σ²/n) ~ N(0, 1).   (3.33)
Hence,

1 − α = P{ −z_{α/2} ≤ (X̄ − μ)/√(σ²/n) ≤ z_{α/2} }   (3.34)
      = P{ −z_{α/2} √(σ²/n) ≤ μ − X̄ ≤ z_{α/2} √(σ²/n) }   (3.35)
      = P{ X̄ − z_{α/2} √(σ²/n) ≤ μ ≤ X̄ + z_{α/2} √(σ²/n) }.   (3.36)
Comparing Equations 3.30 and 3.36 gives us immediately

L = X̄ − z_{α/2} √(σ²/n)   (3.37)
U = X̄ + z_{α/2} √(σ²/n).   (3.38)

Hence, the interval estimator or the 1 − α confidence interval is given by

[ X̄ − z_{α/2} √(σ²/n) , X̄ + z_{α/2} √(σ²/n) ].   (3.39)

Note that this interval is symmetric about the sample mean X̄, that the width of the interval increases with increasing σ², and that the width decreases with increasing sample size n.
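In R/S the quantile z_{α/2} is obtained with qnorm, so Expression 3.39 translates directly into a few lines of code; the data vector and the known variance below are hypothetical values, used only to show the computation.

  x <- c(10.2, 9.8, 10.5, 10.1, 9.9, 10.4, 10.0, 10.3)   # hypothetical sample
  sigma2 <- 0.09                                          # variance assumed known
  n <- length(x)
  alpha <- 0.05
  z <- qnorm(1 - alpha/2)                                 # z_{alpha/2} = 1.96
  c(mean(x) - z * sqrt(sigma2/n), mean(x) + z * sqrt(sigma2/n))   # [L, U]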
Applet. Applet 5a shows, in the repeated sampling setting, the interpretation of the interval estimator (or confidence interval) of the mean.
When the observations X_i are not normally distributed, then, at least for large samples, we can still rely on the CLT, which tells us that

X̄ ≈ N(μ, σ²/n),   (3.40)

and thus, at least approximately, the same solution for L and U is obtained.
Expression 3.39 gives the solution for the interval estimator of the mean, but in practice the variance σ² is most often not known. Fortunately, it can be estimated from the sample by means of S² (Equation 3.2). However, substituting σ² with its estimator S² will change the distribution of the standardized sample mean, on which the whole solution for L and U is based. In the next section we will point out the consequence of replacing the unknown constant σ² by its estimator S², which is a random variable.
3.7 Studentization
We have already discussed the standardization of a normally distributed random variable. We apply it here immediately to the sample mean,

X̄ ~ N(μ, σ²/n)   (3.41)
X̄ − μ ~ N(0, σ²/n)   (3.42)
(X̄ − μ) / √(σ²/n) ~ N(0, 1).   (3.43)

Thus, Z = (X̄ − μ)/√(σ²/n) follows a standard normal distribution. An important property of Z is that its distribution does not depend on any unknown parameter (i.e. the mean of Z is zero and its variance is 1, whatever the values of μ and σ²). In particular this is used for obtaining the interval estimator of the mean (see Equation 3.34).
When σ² is unknown, a natural reflex is to replace it with its estimator S². But replacing a constant with a random variable will have an effect on the distribution of a statistic.
When σ̂² = S² is used instead of σ² in a standardization, the standardization is called a studentization. We define

T = (X̄ − μ) / √(S²/n).   (3.44)
Intuitively we can understand the difference between Z and T as follows. By replacing a constant by a random variable, additional variability is introduced. Thus, the variance of T will be larger than the variance of Z. Further, since S² is a consistent estimator of σ² (i.e. the variance of S² decreases with increasing sample size), we expect that the effect of using S² will decrease as n increases. This is indeed the case. It can be shown that

T ~ t_{n−1},   (3.45)

i.e. the studentized sample mean T is t-distributed with n − 1 degrees of freedom. A property of the t-distribution is that t_{n−1} converges to a standard normal distribution as n → ∞. This latter property confirms our intuitive idea that the effect of using the estimator S² vanishes as the sample size becomes very large.
3.8 Interval Estimator of the Mean (Variance
Unknown)
Expression 3.39 gives the interval estimator of the mean when the variance σ² is known. When, however, the variance is not known, we will replace σ² with its estimator S².
In order to find the solution to

1 − α = P{ L ≤ μ ≤ U },   (3.46)

we now start from the distribution of the studentized sample mean,

(X̄ − μ) / √(S²/n) ~ t_{n−1}.   (3.47)
Hence,

1 − α = P{ −t_{n−1,α/2} ≤ (X̄ − μ)/√(S²/n) ≤ t_{n−1,α/2} }   (3.48)
      = P{ −t_{n−1,α/2} √(S²/n) ≤ μ − X̄ ≤ t_{n−1,α/2} √(S²/n) }   (3.49)
      = P{ X̄ − t_{n−1,α/2} √(S²/n) ≤ μ ≤ X̄ + t_{n−1,α/2} √(S²/n) }.   (3.50)
We now find

L = X̄ − t_{n−1,α/2} √(S²/n)   (3.51)
U = X̄ + t_{n−1,α/2} √(S²/n).   (3.52)

Hence, the interval estimator or the 1 − α confidence interval is given by

[ X̄ − t_{n−1,α/2} √(S²/n) , X̄ + t_{n−1,α/2} √(S²/n) ].   (3.53)
When an interval estimator is based on the CLT (i.e. the data are not normally distributed, but due to the CLT the sample mean is still approximately normally distributed for large samples), there is no need to switch from a standard normal distribution to a t-distribution when the unknown variance is replaced by a consistent estimator. Intuitively, this may be seen in the calculations presented in this section: when the CLT applies, the sample size must be large, and when the sample size is large, the t-distribution behaves very similarly to a standard normal distribution.
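Expression 3.53 is just as easy to compute, with qt replacing qnorm. As an illustration, the R sketch below uses the ten cream measurements that appear in Section 4.3.3; it reproduces the 95% confidence interval (34.99, 35.71) reported in the S-Plus output there.

  x <- c(35.6, 35.8, 35.0, 34.5, 34.9, 35.9, 36.0, 34.9, 35.5, 35.4)
  n <- length(x)
  alpha <- 0.05
  tq <- qt(1 - alpha/2, df = n - 1)     # t_{n-1, alpha/2}
  s2 <- var(x)                          # sample variance S^2
  c(mean(x) - tq * sqrt(s2/n), mean(x) + tq * sqrt(s2/n))   # [L, U]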
Example 3.5. Example 3.3.3 continued. We have already found that

X̄ ≈ N(π, π(1 − π)/50),   (3.54)

where we believed that n = 50 is indeed sufficiently large to make the approximation hold. Based on 3.54 we can find the 1 − α confidence interval of the mean (π) (cfr. Equation 3.39):

[ X̄ − z_{α/2} √(σ²/50) , X̄ + z_{α/2} √(σ²/50) ],   (3.55)

where here σ² = π(1 − π). Thus, the confidence interval becomes

[ X̄ − z_{α/2} √(π(1 − π)/50) , X̄ + z_{α/2} √(π(1 − π)/50) ].   (3.56)
The parameter π is of course unknown (otherwise we would not have been interested in estimating it), but it can be estimated by X̄. Note that in the present example X̄ may also be denoted by π̂, for it is an estimator of π. Replacing π in Expression 3.56 with its estimator π̂ results in

[ π̂ − z_{α/2} √(π̂(1 − π̂)/50) , π̂ + z_{α/2} √(π̂(1 − π̂)/50) ].   (3.57)

And since n = 50 is sufficiently large, the interval in Expression 3.57 will still contain π with a probability of (approximately) 1 − α.
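For instance, if 22 of the 50 panel members were to pick out the correct glass (a purely hypothetical outcome, chosen only to show the computation), Expression 3.57 would be evaluated as follows:

  n <- 50
  x <- 22                    # hypothetical number of correct identifications
  pi.hat <- x / n            # estimate of pi (sample mean of the 0/1 scores)
  z <- qnorm(0.975)          # z_{0.025} = 1.96
  pi.hat + c(-1, 1) * z * sqrt(pi.hat * (1 - pi.hat) / n)   # approximate 95% CI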
3.9 Interval Estimator of the Sample Variance
Also here the interval estimator is determined by a lower and an upper limit, denoted by L and U, respectively. These bounds must be such that the true variance σ² has a probability of 1 − α to be in the interval [L, U], i.e.

1 − α = P{ L ≤ σ² ≤ U }.   (3.58)

Under the assumption that the observations are i.i.d. normally distributed, we have seen in Section 3.5 that

(n − 1)S² / σ² ~ χ²_{n−1}.   (3.59)
Hence,

1 − α = P{ χ²_{n−1,1−α/2} ≤ (n − 1)S²/σ² ≤ χ²_{n−1,α/2} }   (3.60)
      = P{ (n − 1)S²/χ²_{n−1,α/2} ≤ σ² ≤ (n − 1)S²/χ²_{n−1,1−α/2} }.   (3.61)

Thus, we find the interval

[ (n − 1)S²/χ²_{n−1,α/2} , (n − 1)S²/χ²_{n−1,1−α/2} ].   (3.62)

Note that this interval is not symmetric about S²! The reason is that the χ²-distribution is asymmetric.
Applet. Applet 5b shows the interpretation of the interval estimator of
the variance.
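The χ² quantiles in Expression 3.62 are available through qchisq. A minimal R sketch with hypothetical data follows; note that in these notes χ²_{n−1,a} denotes an upper-tail quantile, which corresponds to qchisq(1 − a, n − 1).

  x <- c(12.1, 11.8, 12.4, 12.0, 11.6, 12.3, 12.2, 11.9)   # hypothetical data
  n <- length(x)
  alpha <- 0.05
  s2 <- var(x)
  lower <- (n - 1) * s2 / qchisq(1 - alpha/2, df = n - 1)
  upper <- (n - 1) * s2 / qchisq(alpha/2, df = n - 1)
  c(lower, upper)        # 95% confidence interval for sigma^2; not symmetric about s2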
Chapter 4
Statistical Tests
In this chapter a third type of statistics is discussed: test statistics. A test statistic is a statistic that is used in the construction of a statistical test, which is basically a decision rule for choosing between two hypotheses, based on a sample.
We will discuss 2 important types of statistical tests: one-sample and two-sample t-tests. But before we come to that, a simple example is given to introduce and illustrate the most important concepts of statistical testing.
4.1 Statistical Testing: Introductory Example
4.1.1 Introduction
In a factory producing milk products, two types of cream are produced in batch: one with 35% and one with 40% fat. Something has gone wrong with the labelling of the bottles, so that there is now one batch of bottles of which nobody knows the fat content. Since there is some variability of the fat content from bottle to bottle, it is not sufficient to take one bottle and measure its fat content in order to decide for the whole batch. Therefore, a sample will have to be taken (e.g. 10 bottles), and the average fat content of these 10 bottles will be computed (i.e. the sample mean X̄). As we know from the previous chapter, the sample mean is a good estimator of the true mean μ, which is unknown in our example. But we know at least that
there are only two possible values for μ. These possibilities are stated in two competing hypotheses:

H_0: μ = 35
H_1: μ = 40

H_0 and H_1 are referred to as the null hypothesis and the alternative hypothesis, respectively.
We will throughout this example assume that all observations are normally distributed, with unknown mean (that is what we have to make a decision about) and known variance σ².
4.1.2 Decision Rule, Decision Errors and Associated
Probabilities
Based on the sample mean X̄ we will apply some decision rule to make the final decision which of the two hypotheses is the most likely. Note that we used the term "likely". This is because X̄ is a random variable, and, by chance, we may have bad luck, and X̄ may be very close to 40 although the sample came from a batch of the 35%-group. Thus, using X̄ to make a decision is also a random phenomenon. This implies that (1) decision errors may occur, and (2) probabilities of making decision errors can be calculated.
Suppose we adopt the following simple, but logical decision rule:

if X̄ ≤ 37.5 then decide H_0: μ = 35; if X̄ > 37.5 then decide H_1: μ = 40
We now take a closer look at the decision errors and the associated probabilities. Based on the assumption of normally distributed data, one of the two following situations will be true:
- if μ = 35, then X̄ ~ N(35, σ²/n)
- if μ = 40, then X̄ ~ N(40, σ²/n)
These two situations are shown in Figure 4.1.
In general one of four situations may occur:
Figure 4.1: The two possible distributions of the sample mean (σ² = 60, n = 10)
Table 4.1: Probabilities of decisions, conditional on the hypothesis (σ² = 60, n = 10). The probabilities in bold correspond to decision errors. Note that the rows represent the two possible decisions.

                      H_0: μ = 35    H_1: μ = 40
P{ X̄ ≤ 37.5 }            0.85           0.15
P{ X̄ > 37.5 }            0.15           0.85
- μ = 35, but based on the decision rule we decide H_1: μ = 40 (decision error)
- μ = 35, and based on the decision rule we correctly decide H_0: μ = 35
- μ = 40, but based on the decision rule we decide H_0: μ = 35 (decision error)
- μ = 40, and based on the decision rule we correctly decide H_1: μ = 40
These 4 situations may be presented in a table which contains the probabilities conditional on one of the two hypotheses. In order to calculate the probabilities explicitly, we have to know σ² and the sample size n; here we take e.g. σ² = 60 and n = 10. The table is given in Table 4.1. These probabilities may be seen as areas under the densities in Figure 4.1.
Thus, when this decision rule is applied, we see that the 2 probabilities of making decision errors are identical (0.15). This may sometimes be a disadvantage (see later). Yet another drawback of this approach is that both
Table 4.2: Probabilities of decisions, conditional on the hypothesis (σ² = 60, n = 25). The probabilities in bold correspond to decision errors. Note that the rows represent the two possible decisions.

                      H_0: μ = 35    H_1: μ = 40
P{ X̄ ≤ 37.5 }            0.95           0.05
P{ X̄ > 37.5 }            0.05           0.95
Figure 4.2: The two possible distributions of the sample mean (σ² = 60, n = 25)
probabilities of the decision errors depend on the variance σ² and the sample size n. This can be seen in Table 4.2, which shows similar probabilities, but now with n = 25 (see also Figure 4.2). Both probabilities of the decision errors are now only 5%. In a similar way these probabilities depend on the variance σ². Later, when we no longer assume that the variance is known, this might become very problematic, because then we would not know these probabilities prior to the experiment.
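The entries of Tables 4.1 and 4.2 are nothing but normal tail probabilities, so they can be reproduced with pnorm; a short R sketch:

  sigma2 <- 60
  for (n in c(10, 25)) {
    se <- sqrt(sigma2 / n)
    p1 <- 1 - pnorm(37.5, mean = 35, sd = se)   # P{Xbar > 37.5 | mu = 35}
    p2 <- pnorm(37.5, mean = 40, sd = se)       # P{Xbar <= 37.5 | mu = 40}
    cat("n =", n, ":", round(p1, 2), round(p2, 2), "\n")
  }
  # n = 10: 0.15 0.15   (Table 4.1);   n = 25: 0.05 0.05   (Table 4.2)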
Now we will construct a decision rule that controls one type of decision error, i.e. whatever the variance σ² and the sample size n, the probability of making this particular decision error remains constant. We say that with such a rule, the decision error rate is controlled for. (The decision error rate is the same as the probability of making the decision error.) In particular we want

P{decide H_1: μ = 40 | H_0: μ = 35} = 0.05.   (4.1)

The decision error of concluding H_1 when in reality H_0 is true is called the type I error. The corresponding probability (here: 0.05) is referred to as the type I error rate or the type I error level.
The decision rule is again of the same type as the previous one, but now the critical value to which X̄ is compared is not fixed at 37.5, but rather is represented by X̄_c. X̄_c will now be determined such that Equation 4.1 holds true (i.e. the type I error rate is controlled). Thus,

P{decide H_1: μ = 40 | H_0: μ = 35} = 0.05
P{ X̄ > X̄_c | H_0: μ = 35 } = 0.05
P{ (X̄ − 35)/(σ/√n) > (X̄_c − 35)/(σ/√n) | H_0: μ = 35 } = 0.05

Since (X̄ − 35)/(σ/√n) is standard normally distributed, we immediately find that

(X̄_c − 35)/(σ/√n) = z_{0.05} = 1.64   (4.2)

and thus

X̄_c = 35 + 1.64 σ/√n.   (4.3)
In the above example with σ² = 60 and n = 10, this means X̄_c = 35 + 1.64√(60/10) ≈ 39.0, which is very close to 40! When n = 25, this becomes X̄_c ≈ 37.5, which is already a bit closer to 35.
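These critical values follow directly from the standard normal quantile function; a two-line R check:

  sigma <- sqrt(60)
  z <- qnorm(0.95)                 # z_0.05 = 1.64 (upper 5% quantile)
  35 + z * sigma / sqrt(10)        # approx. 39.0 for n = 10
  35 + z * sigma / sqrt(25)        # approx. 37.5 for n = 25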
Note that in the above derivation, all probabilities are conditional on H_0, i.e. the probabilities are calculated as if H_0 were true. We say that the calculations are done under the null hypothesis.
The decision rule becomes, for a general type I error level α:

if X̄ ≤ 35 + z_α σ/√n, decide μ = 35
if X̄ > 35 + z_α σ/√n, decide μ = 40
It is, however, not customary to write decision rules directly in terms of the sample mean X̄. Rather, they are written in terms of test statistics. Very often, these test statistics are statistics that have a distribution, under the null hypothesis, that does not depend on any population parameter (e.g. μ or σ). In this example the test statistic is

T = (X̄ − 35) / (σ/√n).   (4.4)
Figure 4.3: The null distribution of the test statistic T = (X̄ − 35)/(σ/√n). The critical value z_{0.05} = 1.64 is shown.
Its distribution under the null hypothesis is of course the standard normal distribution. We write this as

T ~ N(0, 1) under H_0.   (4.5)

The distribution of a test statistic under the null hypothesis is called the null distribution. This null distribution is shown in Figure 4.3.
It is obvious that T is indeed a statistic: T is a function of the observations in the sample, and since the observations are random variables, the test statistic is also random. However, once a particular sample is taken and observed, all the observations are fixed (this can be seen as one realization of an experiment in the repeated sampling philosophy). Therefore, we will make a distinction between the test statistic as a random variable, denoted by T, and the test statistic obtained after observing a particular sample. In the latter case, the observed test statistic is denoted by t_o.
We will also introduce yet another change in terminology. Instead of writing "decide H_0" and "decide H_1" in the decision rules, we will write "accept H_0" and "reject H_0", respectively.
From Equation 4.2 the decision rule may now be derived in terms of the test statistic (see also Figure 4.3):

if t_o = (X̄ − 35)/(σ/√n) ≤ z_α, accept H_0: μ = 35
if t_o = (X̄ − 35)/(σ/√n) > z_α, reject H_0 and thus conclude H_1: μ = 40
Table 4.3: Decision errors

                        H_0: μ = 35    H_1: μ = 40
accept H_0: μ = 35          OK          β = F(z_α − 5/(σ/√n))
reject H_0: μ = 35          α               OK
Thus, we now have a decision rule which guarantees that the type I error rate is controlled at α. But what about the other decision error, i.e. the error that occurs when H_0 is decided while in reality H_1 is true? This type of decision error is called the type II error and the corresponding probability is

β = P{accept H_0 | H_1: μ = 40}
  = P{ (X̄ − 35)/(σ/√n) ≤ z_α | H_1: μ = 40 }
  = P{ (X̄ − 40)/(σ/√n) + (40 − 35)/(σ/√n) ≤ z_α | H_1: μ = 40 }
  = P{ (X̄ − 40)/(σ/√n) ≤ z_α − (40 − 35)/(σ/√n) | H_1: μ = 40 }
  = F( z_α − 5/(σ/√n) ),

where F(.) is the distribution function of a standard normal distribution. It is important to see that this probability still depends on σ and n. Therefore, we say that the type II error rate is not controlled for (this will become especially clear when σ is not known).
The two decision errors and their probabilities may be represented in Table
4.3.
A probability that is closely related to β is the power of a statistical test. It is defined as

power = P{reject H_0 | H_1} = 1 − β.   (4.6)

It is obvious that we want the statistical test to have a power as large as possible. Later we will see that a large power may be obtained by ensuring that the sample size n is large enough. At this point we still say that the power, too, is not controlled for.
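For the numbers used in this example (σ² = 60, α = 0.05) the type II error rate and the power are easily evaluated; an R sketch:

  sigma <- sqrt(60)
  z <- qnorm(0.95)                              # z_alpha for alpha = 0.05
  for (n in c(10, 25)) {
    beta <- pnorm(z - 5 / (sigma / sqrt(n)))    # beta = F(z_alpha - 5/(sigma/sqrt(n)))
    cat("n =", n, " beta =", round(beta, 2), " power =", round(1 - beta, 2), "\n")
  }
  # n = 10: beta approx. 0.35, power approx. 0.65
  # n = 25: beta approx. 0.06, power approx. 0.94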
4.1.3 Strong and Weak Conclusions
The construction of the statistical test implies that concluding H_0 and concluding H_1 are not equally strong conclusions. This can be seen from Table 4.3. We consider the two possible conclusions and their consequences.
Suppose that we have concluded to reject H_0. Then, according to Table 4.3, in reality there are two possibilities: (1) indeed H_1 is true, and in that case we have made a correct decision, or (2) H_0 is true, and then we have made a decision error (type I). The probability of making this error is exactly controlled at α, which is typically very small, e.g. α = 0.05. Thus, if we reject the null hypothesis, there is only a very small probability that we have made an error.
Thus, rejecting H_0 is a strong conclusion.
Suppose that we have concluded to accept H_0. Then, according to Table 4.3, there are two possibilities: (1) indeed H_0 is true, and in that case we have made a correct decision, or (2) H_1 is true, and then we have made a decision error (type II). The probability of making this error is given by β. We have already seen that β still depends on the variance σ² and the sample size n, and in later applications we will even see that β depends on unknown quantities, and, thus, the probability β is unknown as well. In practice it will often happen that β is even large, say larger than 20%, often even larger than 50%. Thus, when the null hypothesis is accepted, there is a possibly large probability that this is a decision error. Therefore, this type of conclusion is a weak conclusion. Accepting a null hypothesis is actually the same as saying that there was not enough evidence in the data to prove the opposite.
4.1.4 p-Value
Most statistical software packages calculate the p-value. The p-value may be used instead of the observed test statistic t_o in an equivalent decision rule. The p-value is generally defined as

p = P{T is more extreme than the observed t_o | H_0},   (4.7)
Figure 4.4: The null distribution of the test statistic T = (X̄ − 35)/(σ/√n). The critical value z_{0.05} = 1.64 is shown, as well as an observed test statistic t_o = 0.95 < 1.64 and its corresponding p-value = 0.17 > 0.05.
where "more extreme" means more extreme in the direction of the alternative hypothesis H_1. This must be interpreted for each statistical test. In our example, "more extreme" means "larger than", because the larger the test statistic, the more evidence there is that H_1 rather than H_0 is true. Thus,

p = P{T > t_o | H_0}.   (4.8)

The decision rule in terms of p-values becomes

if p ≥ α, accept H_0
if p < α, reject H_0 and conclude H_1
Note that the p-value is a conditional probability, i.e. it is a probability calculated under the condition that the null hypothesis is true. Suppose that p is extremely small, say p = 0.00001. Then this would mean that it is very unlikely to obtain a test statistic at least as extreme as the one observed from the present sample, given that the null hypothesis is true. Thus, it is very unlikely that the null hypothesis is true. Moreover, since we further know that under the alternative hypothesis more extreme values of T are to be expected, it is indeed a good conclusion to state that it is very unlikely that the null hypothesis is true, and that rather the alternative hypothesis is more likely to be true (hence for such a small p-value we reject the null hypothesis).
The above reasoning immediately implies that the p-value is a very valuable
quantity, measuring in some way the evidence in the sample against the null
hypothesis in favour of the alternative hypothesis.
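In the z-test of this introductory example the p-value of Equation 4.8 is simply a standard normal tail probability; for the observed test statistic t_o = 0.95 of Figure 4.4 it is obtained as follows (R sketch):

  t.o <- 0.95            # observed test statistic, as in Figure 4.4
  p <- 1 - pnorm(t.o)    # P{T > t_o | H0}
  p                      # approx. 0.17 > alpha = 0.05, so H0 is accepted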
4.2 Composite Hypothesis
Up to now we have considered only so-called simple alternative hypotheses, i.e. the alternative hypothesis (H_1: μ = 40) contains only one possibility for μ. Often, however, there are many more possibilities if the null hypothesis is not true. We will consider 3 types, which can be summarized as

H_1: μ ≠ 35   (two-sided hypothesis)
H_1: μ < 35   (one-sided to the left)
H_1: μ > 35   (one-sided to the right)

The corresponding statistical tests are called one-sample tests. "One sample" refers to the fact that all calculations are based on only one sample from one population (e.g. n bottles of cream from one batch).
4.3 One-Sample t-Tests
Up to now we have assumed that the variance σ² is known. In practice, however, this is almost never the case. Therefore, we will from now on drop this assumption. The one-sample tests are then called one-sample t-tests.
4.3.1 Unknown Variance
If the variance σ² is not known, a straightforward solution is to replace σ² with its estimator σ̂² = S² in the formula of the test statistic (Equation 4.4). Thus, we now have

T = (X̄ − 35) / (S/√n).   (4.9)
Under the null hypothesis H_0: μ = 35, the test statistic T is, however, no longer normally distributed. We have come across a similar situation in Section 3.7. There we have seen that, by replacing σ by S, the standard normal distribution changes to a t-distribution with n − 1 degrees of freedom.
Thus, we now have as null distribution

T = (X̄ − 35) / (S/√n) ~ t_{n−1} under H_0.   (4.10)

This result only holds under the assumption that the observations are normally distributed.
Remark: when n is very large, the CLT applies, and the sample mean X̄ will always be approximately normally distributed. In such a case we have, under the null hypothesis,

T = (X̄ − 35) / (S/√n) ≈ N(0, 1).   (4.11)
4.3.2 One-Sided One-Sample t-Test
Example 4.1. Example of Section 4.1 continued. Suppose that we are still interested in testing that the fat content is equal to 35 percent, but that we now do not know what the concentration may be if it is not equal to 35 percent (e.g. because the company makes many different types of cream with various fat contents). The only restriction is that it cannot be smaller than 35%, because such creams are not produced. Then we would like to test

H_0: μ = 35
against
H_1: μ > 35.

Now the variance is unknown and will have to be estimated from the sample as well.
Since the null hypothesis is the same as before, the null distribution is also unchanged:

T = (X̄ − 35) / (S/√n) ~ t_{n−1} under H_0.   (4.12)
Since the alternative hypothesis is different from the one considered in Section 4.1, we have to check whether the test statistic is indeed still a good statistic, i.e. does T in some sense measure the deviation between the null hypothesis (H_0: μ = 35) and the alternative hypothesis (H_1: μ > 35)? The answer is yes: when the alternative hypothesis is true, we expect that X̄ will probably be larger than 35, and, hence, the distribution of T will, under the alternative hypothesis, be shifted to the right. Therefore, we conclude that T is a suitable test statistic for testing H_0 against H_1. Furthermore, from this reasoning we also deduce that we will reject H_0 for extreme large positive values of T. This latter property must be reflected in the decision rule, which will thus in general be of the following form:
if t_o = (X̄ − 35)/(S/√n) ≤ t_c, accept H_0
if t_o = (X̄ − 35)/(S/√n) > t_c, reject H_0 and conclude H_1
where t_c is the critical value, which must be such that the type I error rate is controlled at the α level. Thus, t_c is such that

P{T > t_c | H_0} = α.   (4.13)

The solution is simple: since we know that under the null hypothesis T is distributed as t_{n−1}, the value of t_c is simply the α-quantile of this t-distribution, thus

t_c = t_{n−1,α}.   (4.14)

Hence, the decision rule becomes

if t_o = (X̄ − 35)/(S/√n) ≤ t_{n−1,α}, accept H_0
if t_o = (X̄ − 35)/(S/√n) > t_{n−1,α}, reject H_0 and conclude H_1
A major difference from the situation of the simple alternative hypothesis of Section ?? is that now we cannot calculate the probability of making a type II error (β). The reason is that this probability must be calculated under the alternative hypothesis, but now the alternative hypothesis contains an infinite number of possibilities for μ. Still, when we take one particular possibility within H_1, e.g. μ = 35 + δ, where δ > 0, then we can calculate β as a function of δ. Thus,

β(δ) = P{accept H_0 | μ = 35 + δ}.   (4.15)
Figure 4.5: The null distribution (left) of the test statistic T = (X̄ − 35)/(S/√20). The critical value t_{19,0.05} = 1.73 is shown as a vertical line. The middle distribution is the distribution of T when μ = 40, and the right distribution corresponds to μ = 45. b1 and b2 are the two corresponding type II errors β_1 and β_2.
Figure 4.5 shows the null distribution and, for some possibilities of μ under H_1, the corresponding distributions of T. It can be seen that as δ increases under H_1, β(δ) decreases, or, equivalently, the power (= 1 − β(δ)) increases.
Example 4.2. Example of Section 4.1 continued. Suppose now H_0: μ = 35, but H_1: μ < 35. Since the null hypothesis remains the same, the null distribution of T is still valid as well. In order for T to be a suitable test statistic, we must think about the behaviour of T under the alternative hypothesis, in the sense that T must measure the deviation from H_0 in the direction of the alternative hypothesis. Here, since under H_1 the sample mean X̄ will probably be smaller than 35, we expect the distribution of T to be shifted to the left as compared to the null distribution. Thus, T is still a good test statistic, but now we will reject the null hypothesis for extreme negative values of T. The decision rule is thus of the form
if t_o = (X̄ − 35)/(S/√n) ≥ t_c, accept H_0
if t_o = (X̄ − 35)/(S/√n) < t_c, reject H_0 and conclude H_1
where t_c is the critical value, which must be such that the type I error rate is controlled at the α level. Thus, t_c is such that

P{T < t_c | H_0} = α.   (4.16)

The solution is simple: since we know that under the null hypothesis T is distributed as t_{n−1}, the value of t_c is simply the 1 − α quantile of this t-distribution, thus

t_c = t_{n−1,1−α} = −t_{n−1,α},   (4.17)

where the last equality holds because of the symmetry of the t-distribution.
Hence, the decision rule becomes

if t_o = (X̄ − 35)/(S/√n) ≥ −t_{n−1,α}, accept H_0
if t_o = (X̄ − 35)/(S/√n) < −t_{n−1,α}, reject H_0 and conclude H_1
Example 4.3. Example of Section 4.1 continued. Suppose now that H_0: μ = 35 is to be tested against H_1: μ ≠ 35 (i.e. a two-sided alternative hypothesis). As before, the null distribution of T remains the same because the null hypothesis is not changed. Further, T still measures the deviation from the null hypothesis in the direction of the alternative hypothesis: extreme negative values of T indicate that μ < 35, which is part of H_1, and extreme positive values of T indicate that μ > 35, which is also part of H_1.
This reasoning implies that the decision rule will now reject H_0 for both extreme negative and extreme positive values of T. The type I error rate is now controlled by finding critical values t_{c,1} and t_{c,2} such that

P{reject H_0 | H_0} = α
P{T < t_{c,1} or T > t_{c,2} | H_0} = α
P{|T| > t_c | H_0} = α,   with t_c = −t_{c,1} = t_{c,2},

where the last step is explained by the symmetry of the t-distribution (see Figure ??). Thus, by dividing the total type I error rate α equally over the two tails of the null distribution, we find

t_c = t_{n−1,α/2}.   (4.18)
And the decision rule is

if |t_o| = |(X̄ − 35)/(S/√n)| ≤ t_{n−1,α/2}, accept H_0
if |t_o| = |(X̄ − 35)/(S/√n)| > t_{n−1,α/2}, reject H_0 and conclude H_1
4.3.3 Example in S-plus
Suppose the following data is sampled:
35.6 35.8 35.0 34.5 34.9 35.9 36.0 34.9 35.5 35.4
We test H_0: μ = 35 against H_1: μ ≠ 35, i.e. the two-sided alternative. We do this because, if the mean fat content is not 35 percent, there is no a priori knowledge that allows us to say whether the true mean will be larger or smaller than 35.
The test statistic to use is

T = (X̄ − 35) / (S/√n)   (4.19)

of which we know that

T ~ t_{n−1} under H_0.   (4.20)
S-plus gives the following output:

  One-sample t-Test
  data: cream
  t = 2.2063, df = 9, p-value = 0.0548
  alternative hypothesis: true mean is not equal to 35
  95 percent confidence interval:
   34.99113 35.70887
  sample estimates:
   mean of x
       35.35
Thus, t_o = 2.2063. This has to be compared to the critical value at the 5% level of significance, which is the α/2 = 0.025 quantile of t_9:

t_{9,0.025} = qt(0.975, df=9) = 2.262157.

Since |t_o| = 2.2063 < 2.262157 we decide to accept the null hypothesis, and conclude that the mean fat content in the batch of cream bottles is 35 percent (although this is a weak conclusion, because the type II error rate is not controlled for).
The same conclusion may be reached by looking at the p-value. Here we have p = 0.0548, which is larger than α = 0.05, and thus we indeed accept the null hypothesis. Note that the p-value is only slightly larger than α. This means that there actually is some evidence in the sample that the true mean fat content is different from 35 percent, but it is considered just not sufficient evidence to conclude the opposite. Here you see clearly that the p-value is very valuable for adding nuance to the conclusion. Maybe, if the sample size had been larger, a significant result could have been established!
Finally, the S-Plus output also gives a 95% confidence interval for the mean. This may only be interpreted when the alternative hypothesis is two-sided. For one-sided alternatives, S-plus gives another type of confidence interval. Also the sample mean is given.
We do the same exercise again, but now we will test H_0: μ = 35 against H_1: μ > 35. This may be the case when e.g. we know that the mean fat content cannot possibly be smaller than 35 percent (e.g. a characteristic of the production process). Here we do the same t-test, but in the decision rule we will only reject for large positive values of T. The critical value is now

t_{9,0.05} = qt(0.95, df=9) = 1.833113.
The S-Plus output is

  One-sample t-Test
  data: cream
  t = 2.2063, df = 9, p-value = 0.0274
  alternative hypothesis: true mean is greater than 35
  95 percent confidence interval:
   35.05919 NA
  sample estimates:
   mean of x
       35.35
Of course, the observed test statistic is the same as before, but now it has to be compared to another critical value. Since t_o = 2.2063 > 1.833113 we now reject the null hypothesis in favour of the alternative hypothesis and conclude formally that the mean fat concentration in the batch of cream bottles is larger than 35 percent. This may also be concluded from the p-value, which is now p = 0.0274 < 0.05. This is a strong conclusion.
It is a general property of one-sided tests that they are more powerful
(i.e. the power of the test is higher) than their two-sided counterparts.
Intuitively, the reason is that the alternative hypothesis is more specific, and the more specific the questions that have to be answered, the more the statistical test can concentrate (i.e. specialize) on them.
Actually we should have checked whether or not the assumption of normality of the observations holds. But since there are only 10 observations, this is hardly a sensible thing to do.
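The output shown above can be reproduced with the t.test function (the call below uses R syntax, which S-Plus shares for this function); it assumes the ten measurements are stored in a vector named cream.

  cream <- c(35.6, 35.8, 35.0, 34.5, 34.9, 35.9, 36.0, 34.9, 35.5, 35.4)
  t.test(cream, mu = 35, alternative = "two.sided")   # two-sided test of H0: mu = 35
  t.test(cream, mu = 35, alternative = "greater")     # one-sided alternative H1: mu > 35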
4.4 Paired t-test
4.4.1 Construction of the Test
It sometimes happens that one sample is taken from a population, but that on each of the elements of the population (e.g. humans) two measurements are taken (e.g. a measurement before, and one after, a particular treatment). We denote the two observations on the ith element as X_{i,1} and X_{i,2}, respectively. We want to test the null hypothesis H_0: μ_1 = μ_2, where μ_1 and μ_2 are the means of the first and the second series of observations, respectively. It can be shown that this null hypothesis is equivalent to H_0: μ_{12} = 0, where μ_{12} is the mean of the transformed observations (differences)

X_i = X_{i,1} − X_{i,2}.   (4.21)

Hence, the initial null hypothesis may be tested by means of a one-sample t-test (i.e. the paired t-test is essentially a one-sample t-test on the differences of the two paired observations). This implies that the assumption underlying the construction of this test is that the transformed observations X_i = X_{i,1} − X_{i,2} must be normally distributed.
4.4.2 Example in S-Plus
The dataset turkey.sdd has 3 variables. We consider only the variables PRE and POST.A, which are the weights of 50 turkeys before and after a certain treatment. The researcher knows that the mean weight cannot increase by giving the turkeys the treatment. Thus the hypotheses are H_0: μ_PRE = μ_POST.A against H_1: μ_PRE > μ_POST.A.
The S-plus output is:
Figure 4.6: QQ-plot of the difference variable
  Paired t-Test
  data: x: PRE in turkey, and y: POST.A in turkey
  t = 1.9138, df = 49, p-value = 0.0307
  alternative hypothesis: true mean of differences is greater than 0
  95 percent confidence interval:
   0.03440644 NA
  sample estimates:
   mean of x - y
       0.2775549
Since it is basically again a one-sample t-test, we give here only the interpretation based on the p-value.
Since p = 0.0307 < α = 0.05, we decide to reject the null hypothesis, and conclude that the mean change in weight (PRE − POST.A) is significantly larger than zero at the 5% level.
Actually the assumptions have to be assessed. Thus we should calculate a new variable, defined as the difference between the PRE and the POST.A variables, and check the normality of this variable by means of e.g. a QQ-plot (Figure 4.6). The QQ-plot shows some minor deviation from the straight line, indicating that there might be a slight deviation from normality. But since there are 50 observations involved, the CLT comes into play, telling us that the sample mean will be approximately normally distributed whatever the distribution of the observations. These two arguments (an acceptable QQ-plot and a large sample) bring us to the conclusion that we do not have to worry about the assumption of normality.
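The analysis of this section corresponds to calls of the following form (R syntax, shared by S-Plus for these functions; PRE, POST.A and turkey are the names used in the text, and attaching the data frame is just one way of making the variables available):

  attach(turkey)                       # makes PRE and POST.A available
  t.test(PRE, POST.A, paired = TRUE, alternative = "greater")   # paired t-test
  d <- PRE - POST.A                    # difference variable
  qqnorm(d)                            # QQ-plot of the differences (Figure 4.6)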
4.5 Two-Sample t-Test
4.5.1 Construction of Test Statistic: variances known
Example 4.4. The elasticity of puddings is an important physical property of puddings, in the sense that customers do not only like puddings because of the taste, but also because of the experience of the pudding in the mouth. This particular feeling can be measured by e.g. the elasticity of the pudding. This elasticity is influenced by the ingredients of a pudding. It is believed that it is especially influenced by the carrageen concentration. Therefore, an experiment is set up in which two formulas of puddings are compared w.r.t. the elasticity. n_1 puddings are prepared with a high concentration of carrageen, and n_2 puddings with a low concentration.
We are interested in testing whether the mean elasticities of both types of pudding are equal, i.e.

H_0: μ_1 = μ_2,

where μ_1 and μ_2 are the mean elasticities of the high and low concentration group, respectively. Suppose that the researcher is certain that the elasticity is definitely not smaller in the high concentration group; then the alternative hypothesis of interest is

H_1: μ_1 > μ_2.
The above example is a special case of a setting in which one wants to compare the means of two possibly different populations, by sampling n_1 and n_2 observations from the populations. For this reason the statistical test for this type of problem is called a two-sample test. In the example, the alternative hypothesis is one-sided to the right (H_1: μ_1 > μ_2). Later we will consider the other possibilities for the alternative hypothesis as well.
We will make the following assumptions:
- the observations in the first sample are normally distributed, i.e. X_{1,i} i.i.d. ~ N(μ_1, σ²_1) (i = 1, . . . , n_1)
- the observations in the second sample are normally distributed, i.e. X_{2,i} i.i.d. ~ N(μ_2, σ²_2) (i = 1, . . . , n_2)
- the variances σ²_1 and σ²_2 are known (this assumption will later be dropped)
For the construction of the test statistic, we first note that X̄_1 − X̄_2 is an estimate of μ_1 − μ_2. Thus, we will reject H_0 in favour of H_1 for large values of X̄_1 − X̄_2. It can be shown that

X̄_1 − X̄_2 ~ N( μ_1 − μ_2 , σ²_1/n_1 + σ²_2/n_2 ),   (4.22)

which becomes under H_0: μ_1 = μ_2,

X̄_1 − X̄_2 ~ N( 0 , σ²_1/n_1 + σ²_2/n_2 ) under H_0.   (4.23)

And, by standardization,

(X̄_1 − X̄_2) / √(σ²_1/n_1 + σ²_2/n_2) ~ N(0, 1) under H_0.   (4.24)

Thus, the following statistic seems to be a good test statistic (it has a nice null distribution and it measures the deviation from the null hypothesis in the direction of the alternative):

T = (X̄_1 − X̄_2) / √(σ²_1/n_1 + σ²_2/n_2) ~ N(0, 1) under H_0.   (4.25)
Since we will reject H_0 for large values of T, the decision rule will be of the form

if t_o = (X̄_1 − X̄_2) / √(σ²_1/n_1 + σ²_2/n_2) ≤ t_c, accept H_0
if t_o = (X̄_1 − X̄_2) / √(σ²_1/n_1 + σ²_2/n_2) > t_c, reject H_0 and conclude H_1
and the critical value t_c must be determined such that the type I error rate is α, i.e.

P{reject H_0 | H_0} = α   (4.26)
P{T > t_c | H_0} = α.   (4.27)

Since the null distribution of T is a standard normal distribution, we find

t_c = z_α.   (4.28)
4.5.2 Variances Unknown
Since usually the variances σ²_1 and σ²_2 are not known, they will have to be replaced by their estimators, S²_1 and S²_2. As in the one-sample case, we may expect that the null distribution will not be a standard normal distribution. Unfortunately, it turns out that we may not simply make this replacement. A distinction has to be made between two situations:
- σ²_1 = σ²_2
- σ²_1 ≠ σ²_2

The case σ²_1 = σ²_2.
When in reality the two variances are equal (note: this can be tested with a statistical test), the two point estimators of these variances, S²_1 and S²_2, can be pooled into one single estimator of the common variance σ² = σ²_1 = σ²_2. This estimator is the so-called pooled variance estimator and it is given by

S²_p = ( (n_1 − 1)S²_1 + (n_2 − 1)S²_2 ) / (n_1 + n_2 − 2).   (4.29)
Under these conditions it can be shown that

T = (X̄_1 − X̄_2) / √(S²_p/n_1 + S²_p/n_2) ~ t_{n_1+n_2−2} under H_0.   (4.30)
Still, the null hypothesis will be rejected for large values of the test statistic. Thus we can now easily find the decision rule:

if t_o = (X̄_1 − X̄_2) / √(S²_p/n_1 + S²_p/n_2) ≤ t_{n_1+n_2−2,α}, accept H_0
if t_o = (X̄_1 − X̄_2) / √(S²_p/n_1 + S²_p/n_2) > t_{n_1+n_2−2,α}, reject H_0 and conclude H_1

The case σ²_1 ≠ σ²_2.
In this situation, it is proposed to use the test statistic
T = (X̄_1 − X̄_2) / √(S²_1/n_1 + S²_2/n_2).   (4.31)
Unfortunately, the exact null distribution is not a simple distribution, but approximations exist. We will not give the details here, but in S-plus this test can be performed. As with all statistical tests, the interpretation can be based on the p-value given by S-plus.
When, on the other hand, the sample sizes n_1 and n_2 are large, then the CLT comes into play, and the null distribution of T may simply be approximated by a normal distribution.
4.5.3 Other Alternative Hypothesis
We have only given the solutions for the one-sided alternative hypothesis H_1: μ_1 > μ_2.
When H_1: μ_1 < μ_2, we will reject H_0 for extreme negative values of T. This implies that we only have to change the inequality signs in the decision rules given for H_1: μ_1 > μ_2.
When H_1: μ_1 ≠ μ_2, we will reject H_0 for both extreme negative and extreme positive values of T. In that case, the probability α will have to be divided over the two tails of the null distribution and the critical value t_c will thus be an α/2 quantile. Since all null distributions considered in this section are symmetric, the decision rule will be based on |t_o|, rejecting for extreme large values of this absolute value of t_o.
4.5.4 Example in S-Plus
We consider the dataset nutrition.sdd and we use the variables gender and bm. The former is the gender of the subject and the latter is the basal metabolism of the subject. We are interested in testing whether men and women have the same mean basal metabolism. Let group 1 contain the men, and group 2 the women (n_1 = 69, n_2 = 146). Thus,

H_0: μ_1 = μ_2
H_1: μ_1 ≠ μ_2.
Before we can test these hypotheses, we must decide which test to use. Since the variances are unknown, the choice of the test depends on whether the variances σ²_1 and σ²_2 are equal or not. One way to check this is to make a
Figure 4.7: Boxplots of bm for men and women
boxplot of both groups (see Figure 4.7). This plot suggests that the variances are more or less equal. Thus we will use the t-test with the pooled variance estimator, which is

S²_p = ( (n_1 − 1)S²_1 + (n_2 − 1)S²_2 ) / (n_1 + n_2 − 2) = 22568.38.   (4.32)

The observed test statistic is thus

t_o = (X̄_1 − X̄_2) / √(S²_p/n_1 + S²_p/n_2) = 9.863827.   (4.33)
According to the decision rule, this value has to be compared to the critical value t_{n_1+n_2−2,0.025} = 1.971164. Since |t_o| > 1.971164 we reject the null hypothesis and conclude that the mean basal metabolism of men and women is significantly different at the 5% level of significance. Moreover, since X̄_1 = 1380.87 > 1164.392 = X̄_2, we can conclude that the mean basal metabolism of men is larger than that of women.
The S-Plus output is (also assuming equal variances)

  Standard Two-Sample t-Test
  data: x: BM with SEXE = men, and y: BM with SEXE = women
  t = 9.8638, df = 213, p-value = 0
  alternative hypothesis: true difference in means is not equal to 0
  95 percent confidence interval:
   173.2176 259.7385
  sample estimates:
   mean of x mean of y
     1380.87  1164.392
Of course, S-Plus has computed the same t_o. The same conclusion can be based on the p-value, which is p = 0! This means that p is extremely small, approximately 0. Thus we can very strongly reject the null hypothesis in favour of the alternative.
The t-test is only formally valid for normally distributed observations. The boxplots in Figure 4.7 suggest that the distributions are indeed symmetric, but the figure also shows some outliers. On the other hand, the sample sizes n_1 = 69 and n_2 = 146 are large, and thus the CLT could apply. (Note that if the CLT applies, there was no need to calculate the pooled variance estimator, but this is still the better route when the data may be normally distributed.) In this specific example, both approaches seem to be justified. Moreover, since the p-value is extremely small, there will be no difference in the final conclusion.
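An analysis like the one above corresponds to a t.test call of the following form (R syntax, shared by S-Plus; the data frame name nutrition and the group labels "men" and "women" are assumptions based on the variable names used in the text):

  attach(nutrition)
  # pooled-variance two-sample t-test (assumes sigma_1^2 = sigma_2^2)
  t.test(bm[gender == "men"], bm[gender == "women"], var.equal = TRUE)
  # Welch-type test, for when the variances cannot be assumed equal
  t.test(bm[gender == "men"], bm[gender == "women"], var.equal = FALSE)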
4.6 Tests for Normality
In the previous chapter, as well as in this chapter, we have come across many methods that are based on the assumption of normality, i.e. it is assumed that X_i i.i.d. ~ N(μ, σ²). In this section we briefly discuss a statistical test of the hypothesis that the data are normally distributed. First, a graphical method for assessing normality is given: the QQ-plot.
4.6.1 QQ-Plot
A QQ-plot is a graphical tool to assess normality. It plots the observed data (X_i) against the expected observations if the data were indeed normally distributed. These expected outcomes can be calculated as

F⁻¹(F̂(X_i)),   (4.34)

where F̂(X_i) is the estimate of the distribution function at the observation X_i. Since the distribution function is basically a probability, it can be very easily estimated. The definition of the distribution function is

F(x) = P{X ≤ x},   (4.35)
Figure 4.8: QQ-plots. The two upper graphs are examples where the data are normally distributed, and the two lower graphs are examples where the data are not normally distributed.
which is estimated as

F̂(x) = (number of observations in the sample that are ≤ x) / n.   (4.36)

The distribution function F in Equation 4.34 (actually the inverse distribution function) is the distribution function of a normal distribution with the mean equal to the sample mean and the variance equal to the sample variance (i.e. μ = X̄ and σ² = S²).
Since a QQ-plot is a plot of the observed versus the expected observations under normality, we would expect the points in the plot to lie on a straight line. But since the sample is only a finite collection of observations randomly selected from e.g. a normal population, the plot may show sampling variability. Thus, even if the observations come from a normal distribution, the points may show some scattering around the straight line. But as soon as the plot shows a clear systematic deviation from the straight line, there is evidence in the sample that the data are not normally distributed.
This plot is only useful when there is sufficient data, say at least n > 20. Figure 4.8 shows some examples.
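The construction of Equations 4.34-4.36 can be mimicked in a few lines; in practice one would simply call qqnorm, but the R sketch below makes the mechanics explicit (the data are simulated, and a small correction keeps F̂ away from exactly 1 so that the inverse normal distribution function stays finite, an implementation choice rather than part of the text):

  set.seed(1)
  x <- rnorm(40, mean = 10, sd = 2)                    # hypothetical sample
  n <- length(x)
  Fhat <- (rank(x) - 0.5) / n                          # estimated distribution function
  expected <- qnorm(Fhat, mean = mean(x), sd = sd(x))  # F^{-1}(Fhat(X_i)), Equation 4.34
  plot(expected, x)                                    # points near a straight line
  abline(0, 1)
  qqnorm(x)                                            # the usual built-in alternative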
4.6.2 Kolmogorov-Smirnov Test for Normality
Probably one of the best known statistical tests for normality is the Kolmogorov-Smirnov (KS) test. It tests the null hypothesis

H_0: X is normally distributed
against
H_1: X is not normally distributed.
Theoretically, it is possible to derive the construction of the KS test, i.e. we could explain the test statistic, prove the null distribution of the test statistic, and show how the decision rule is constructed and how the critical values have to be computed. Nevertheless, we will not do so here (it is not an easy statistical test). Fortunately, we have learnt that any statistical test can be interpreted by its p-value alone. This is the way we will use the KS test in this course. Thus, if a KS test results in p > α, we will accept the null hypothesis of normality, and when p ≤ α, we will reject the null hypothesis and conclude that the data are not normally distributed.
Example 4.5. To illustrate the KS test, we consider the data of the example given in Section 4.4.2. The S-Plus output is

  One-sample Kolmogorov-Smirnov Test
  Hypothesized distribution = normal
  data: diff in turkey
  ks = 0.1052, p-value = 0.6006
  alternative hypothesis:
   True cdf is not the normal distn. with the specified parameters
From this output we see that p = 0.6006 > α = 0.05. Thus, we accept the null hypothesis of normality. Although we know that accepting a null hypothesis is a weak conclusion, we note that
- the p-value is considerably larger than α = 0.05. Taking into account that the sample size is not very small (n = 50), this p-value implies that there is indeed not much evidence against normality.
- the QQ-plot (Figure 4.6) did not indicate any substantial systematic departure from the straight line.
Figure 4.9: QQ-plots of two non-normal distributions. Left: a symmetric distribution; right: an asymmetric distribution.
These two arguments suggest that there is no good reason to say that the data are not normally distributed. Thus, we are quite convinced that the data are normally distributed, at least approximately enough to apply statistical tests in a valid way (see also the next section).
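In R the same test can be requested with ks.test; a minimal sketch is given below. The vector di is only a stand-in for the differences analysed in Section 4.4.2 (the turkey data are not reproduced here). Note that when the mean and standard deviation are estimated from the same data, the reported p-value is only approximate.

di <- rnorm(50, mean = 5, sd = 12)       # placeholder for the real differences
ks.test(di, "pnorm", mean = mean(di), sd = sd(di))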
4.6.3 Consequences of Nonnormal Data
In this section we comment briefly on the situation where the observations are not normally distributed (according to the KS test and/or the QQ-plot).

As we have seen, many of the statistical methods about the mean (confidence intervals and statistical tests) are based on the assumption that the observations are normally distributed (except when the CLT applies). Many studies have indicated, however, that these methods are not very sensitive to violations of the normality assumption. In particular, when the distribution of the observations is symmetric, the methods are not sensitive to this type of deviation from normality. The symmetry of a distribution can be assessed by means of e.g. a QQ-plot; Figure 4.9 shows a symmetric and an asymmetric non-normal distribution. (Symmetry may also be assessed by means of e.g. a histogram or a boxplot.)
According to the above-mentioned argument, the statistical inference (confidence intervals and statistical tests) remains valid as long as the distribution of the observations is symmetric. This is true, at least approximately. Still, one must be careful when the decision is not convincing, e.g. when the p-value is close to α. For example, when the sample size is small, say n = 20 (so the CLT does not apply), the data are not normal but seem to be symmetric, and a statistical test results in p = 0.053, which is only slightly larger than α = 0.05, then it is hard to formulate the conclusion formally. The reason is that, due to the small deviation from normality, the calculated p-value is not exactly the probability that refers to the correct null distribution; the exact p-value (i.e. the p-value calculated under the exact null distribution, which is unfortunately unknown because of the non-normality) may be a little smaller than the calculated one, so that this exact p-value might turn out to be smaller than α = 0.05, resulting in the opposite conclusion.

The same reasoning applies to a situation where e.g. p = 0.048, which is only slightly smaller than the cut-off value α = 0.05.
From this discussion we also learn again that it is important to always report
the p-value.
In the next section an alternative class of tests is proposed. This class of
tests can be used whenever the observations are not normally distributed.
4.7 Rank Tests
We will only very briefly discuss this class of tests. As before, in practice we will only use the p-value to interpret the test.

In general, rank tests are similar to t-tests, except that the original observations are not used directly; instead their so-called ranks are used. Thus, the observations are rank transformed. The rank transformation is illustrated in Table 4.4 (the rank transformation is denoted as R(Y)). This illustrates clearly that, whatever the distribution of the observations, they are transformed to a scale which is independent of the original one. A very important feature of this transformation is that the effect of outliers is eliminated. E.g. in Table 4.4 the largest observation of Y is an outlier, but once it is rank transformed, it simply receives the largest rank R = 10, which is exactly one unit larger than the second largest rank (R = 9).

In the next two sections two important rank tests are briefly discussed. Because of their property of being independent of the distribution of the original data, they are often referred to as nonparametric tests.
4.7.1 Wilcoxon Signed Rank Test
The Wilcoxon signed rank test is a nonparametric alternative to the paired
t-test. We will use the same notation as in Section 4.4.1.
Table 4.4: Observations Y and the rank transformed data R
Y 12.1 14.3 15.1 15.5 15.9 16.0 16.4 18.9 20.1 40.1
R 1 2 3 4 5 6 7 8 9 10
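The rank transformation of Table 4.4 can be verified directly with a one-line R sketch: the outlier 40.1 simply receives rank 10, one unit above the second largest rank.

Y <- c(12.1, 14.3, 15.1, 15.5, 15.9, 16.0, 16.4, 18.9, 20.1, 40.1)
rank(Y)    # 1 2 3 4 5 6 7 8 9 10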
As with the paired t-test, all paired observations are subtracted, i.e.

X_i = X_{i,1} − X_{i,2},   (4.37)

(i = 1, . . . , n). Some of these differences will be positive, and some will be negative. Next, all differences are rank transformed, resulting in the ranks R(X_i), which are by definition all integers between 1 and n. The Wilcoxon signed rank test statistic is defined as the sum of the ranks R(X_i) of only the positive X_i's. The null distribution of this test statistic can be easily enumerated because it only depends on all possible arrangements of the n ranks over the positive and negative differences under the null hypothesis, whatever the original form of the distribution of the X_i.
Finally, we give the S-Plus output for the same example as in Section 4.4.1.
Wilcoxon signed-rank test
data: x: PRE in turkey , and y: POST.A in turkey
signed-rank normal statistic without correction Z = 1.8679, p-value = 0.0618
alternative hypothesis: true mu is not equal to 0
From this analysis we see that p = 0.0618 > 0.05, and thus we accept the null hypothesis. Note that with the paired t-test we found p < 0.05 and there we rejected H_0. This seems to be a contradiction, but it is not. It can be shown that, if the data are normally distributed, then the paired t-test is more powerful, i.e. the paired t-test has a higher chance of rejecting H_0 when H_1 is true. Thus, if the data are normally distributed, it is recommended to use the paired t-test. This also illustrates the importance of checking the assumptions underlying the paired t-test. Finally, we remark that the p-values of the paired t-test and this Wilcoxon signed rank test are not very different, but unfortunately α = 0.05 lies in between the two.
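For completeness, a minimal R sketch of this signed rank test is given below; PRE and POST.A are assumed to be the paired columns of the turkey data set (not reproduced in these notes). The options switch off the exact computation and the continuity correction, mimicking the "normal statistic without correction" reported by S-Plus.

wilcox.test(turkey$PRE, turkey$POST.A, paired = TRUE,
            exact = FALSE, correct = FALSE)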
4.7.2 Wilcoxon Rank Sum Test
This test is also known as the Mann-Whitney test; both names refer to an alternative to the two-sample t-test. We will use the same notation as in Section 4.5.
In a first step all data are pooled, i.e. X_1 = X_{1,1}, X_2 = X_{1,2}, . . . , X_{n_1} = X_{1,n_1}, X_{n_1+1} = X_{2,1}, . . . , X_{n_1+n_2} = X_{2,n_2}. Next these pooled observations are ranked, i.e. R(X_i) (i = 1, . . . , n_1 + n_2). The test statistic is calculated as the sum of the ranks of the observations in the first sample.

Again, the null distribution can be exactly enumerated by considering all possible arrangements of the ranks under the null hypothesis. These arrangements do not depend on the distributions of the original observations. Therefore this test too is called a nonparametric test.

Finally, we give the S-Plus output for the same example as in Section 4.5.
Wilcoxon rank-sum test
data: x: BM with SEX = men , and y: BM with SEX = women
rank-sum normal statistic without correction Z = 7.8811, p-value = 0
alternative hypothesis: true mu is not equal to 0
Here we have p = 0 < 0.05, and thus we reject H_0 and conclude that the mean BM is not the same for men and women. This is the same conclusion as obtained with the two-sample t-test.
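A minimal R sketch of the rank sum test is given below; BM and SEX are assumed to be columns of the nutrition data set (not reproduced here).

wilcox.test(BM ~ SEX, data = nutrition, exact = FALSE, correct = FALSE)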
Chapter 5
Analysis of Variance
In the previous chapter we have seen methods to test the hypothesis that two
means are equal. In this chapter these methods are extended to test whether
an arbitrary number of means are equal. In doing so we will have to introduce
statistical models, which, in a later chapter, will be further extended to e.g.
regression models.
5.1 Statistical Models
5.1.1 Two populations
In the previous chapter we considered two populations of which we were
mainly interested in their means. We have seen methods to estimate these
means (point and interval estimators) and we have introduced the two sample
t test to test the equality of the two means. When we assume that the two
variances are equal, then this setting can be represented as a statistical model,
or, in particular, an ANOVA model.
The assumptions underlying the traditional two-sample t-test are

population 1: Y_{1,j} i.i.d. N(μ_1, σ²) (j = 1, . . . , n_1)
population 2: Y_{2,j} i.i.d. N(μ_2, σ²) (j = 1, . . . , n_2)

Thus, for population i = 1 and i = 2, the transformed variable

Y_{i,j} − μ_i = ε_{ij} i.i.d. N(0, σ²).   (5.1)

This may also be written as

Y_{i,j} = μ_i + ε_{ij},   (5.2)

where ε_{ij} i.i.d. N(0, σ²), i = 1, 2; j = 1, . . . , n_i. Equation 5.2 is a statistical model. The index i refers to the population, and the index j refers to the j-th observation of the sample from population i. Typically, one of the first steps in the application of a statistical model is the fitting of the model to the data, i.e. the unknown parameters in the model must be estimated. In Model 5.2, 3 parameters must be estimated: μ_1, μ_2 and σ². Of course, the solution is given in the previous chapter:

μ̂_1 = Ȳ_1,   μ̂_2 = Ȳ_2,   σ̂² = S²_p,

and the corresponding interval estimators or confidence intervals are also as seen previously. Finally, hypotheses can be formulated in terms of the parameters (e.g. H_0: μ_1 = μ_2) and these hypotheses can be tested.
5.1.2 p Populations
Example 5.1. In the nutrition.sdd dataset, many variables are measured at 3 different time periods. An interesting research question is how the mean QI evolves over time. In the study, the researchers have taken 3 independent samples of men and women at 3 different periods. These 3 periods correspond to 3 populations, each with a possibly different mean QI. We will assume that the variance of the QI is the same in the 3 populations.

The corresponding statistical model is

Y_ij = μ_i + ε_ij,   (5.3)

where ε_ij i.i.d. N(0, σ²) (i = 1, 2, 3; j = 1, . . . , n_i).

More generally, we may be interested in the means of p populations. Then the observations Y_ij of population i may be written according to the statistical model

Y_ij = μ_i + ε_ij,   (5.4)

where ε_ij i.i.d. N(0, σ²) (i = 1, . . . , p; j = 1, . . . , n_i). Note that this last statement about the error terms ε_ij constitutes the assumptions underlying the statistical model, which are often needed for the statistical analysis based on this model.
Before we say something more generally about the parameter estimation
and the analysis of variance, we shall introduce another formulation of the
statistical model given in Equation 5.3. The reason for changing to another
model formulation will become clear only later.
The model in Equation 5.3 may equivalently be written as

Y_ij = μ + τ_i + ε_ij,   (5.5)

where ε_ij i.i.d. N(0, σ²) (i = 1, . . . , p; j = 1, . . . , n_i), and with the additional restriction that

Σ_{i=1}^p τ_i = 0   (5.6)

(this restriction is called the Σ-restriction). Further, μ is given by

μ = (1/p) Σ_{i=1}^p μ_i.   (5.7)

By comparing the two equivalent models (Equations 5.3 and 5.5), we see immediately that

μ_i = μ + τ_i.   (5.8)

The parameter μ serves as some kind of general mean, and it is referred to as the constant or intercept in the model. The parameters τ_i (i = 1, . . . , p) are the effects. When the p population means (μ_i) are equal, i.e. μ_1 = μ_2 = . . . = μ_p, then, by Equation 5.8,

μ + τ_1 = μ + τ_2 = . . . = μ + τ_p,   (5.9)

which can only be true if

τ_1 = τ_2 = . . . = τ_p = 0   (5.10)

(note that this solution still satisfies the Σ-restriction). Hence, the null hypothesis of equality of means is equivalent to Equation 5.10.
As mentioned before, the parameters τ_i are called the effects, or the treatment effects (often the p populations correspond to p different treatments; this is the historical reason for the name treatment effects). In particular, these p populations (or treatments) are specified by a factor. A factor may be seen as a discrete variable that indicates to which of the p populations every observation belongs. For instance, let X_ij denote this factor variable; then X_ij = i. We also say that this factor has p levels, i.e. the factor determines p populations or treatments.
5.2 Parameter Estimation
5.2.1 Least Squares Criterion
Basically, a statistical model reflects the assumptions made about the observations / populations. In particular, the ANOVA models introduced in this chapter may be seen as a parameterization of the means of p populations, of which it is further assumed that all observations are normally distributed with the same variance σ². I.e. we allow the p populations to differ only in terms of their mean. Equation 5.8 shows that τ_i is the difference between the mean of population i and the general mean μ.

The unknown parameters are μ, τ_1, τ_2, . . . , τ_p and σ². Based on a sample of observations from each population, these parameters must be estimated (i.e. point estimators). We will denote the point estimators by μ̂, τ̂_1, τ̂_2, . . . , τ̂_p and σ̂². Sometimes the latter is also denoted by S², which is again to be interpreted as a pooled variance estimator.

The general method of parameter estimation is the method of least squares. We will only briefly discuss this method here.
Consider again the ANOVA model

Y_ij = μ + τ_i + ε_ij,   (5.11)

where ε_ij i.i.d. N(0, σ²) (i = 1, . . . , p; j = 1, . . . , n_i). Define the fitted values

Ŷ_ij = μ̂ + τ̂_i,   (5.12)

and the residuals

e_ij = Y_ij − Ŷ_ij,   (5.13)

and note that the residuals e_ij may be seen as estimates of the error terms ε_ij in the statistical model. Thus both the fitted values Ŷ_ij and the residuals e_ij depend on the parameter estimates μ̂, τ̂_1, τ̂_2, . . . , τ̂_p. To stress this dependence we write Ŷ_ij(μ̂, τ̂_1, . . . , τ̂_p) and e_ij(μ̂, τ̂_1, . . . , τ̂_p). Intuitively, it is clear that good parameter estimates will be such that the errors are small. Since the errors are estimated by the residuals, this requirement means that the residuals e_ij should be small. In particular, the least squares method consists in finding the parameter estimates such that

L(μ̂, τ̂_1, . . . , τ̂_p) = Σ_{i=1}^p Σ_{j=1}^{n_i} e²_ij(μ̂, τ̂_1, . . . , τ̂_p)   (5.14)

is minimized.
We now try to explain why this is a good approach.
Each fitted value Ŷ_ij = μ̂ + τ̂_i is the estimator of the mean μ_i = μ + τ_i of the corresponding i-th population. Thus, for population i, the statistic

S²_i = (1/(n_i − 1)) Σ_{j=1}^{n_i} (Y_ij − Ŷ_ij)²   (5.15)
     = (1/(n_i − 1)) Σ_{j=1}^{n_i} e²_ij   (5.16)

is the estimator of σ². Equation 5.14 can now be written as

L(μ̂, τ̂_1, . . . , τ̂_p) = Σ_{i=1}^p (n_i − 1) S²_i.   (5.17)

Remember from the 2-sample t-test (Section 4.5) that, when σ²_1 = σ²_2 = σ², we calculated a pooled variance estimator (S²_p = ((n_1 − 1)S²_1 + (n_2 − 1)S²_2)/(n_1 + n_2 − 2)) as the estimator of the common σ². A straightforward extension is thus

S²_p = L(μ̂, τ̂_1, . . . , τ̂_p) / (Σ_{i=1}^p n_i − p)   (5.18)

as a pooled variance estimator of the common variance σ². Let N = Σ_{i=1}^p n_i denote the total sample size; then σ̂² = S²_p = L/(N − p) is our estimator of σ². We will often simply use the notation S² to denote this estimator. Without proof, we state that S² is an unbiased estimator of σ².
In conclusion, we could say that the parameters are estimated such that the residual variance (i.e. the unexplained variance) is minimized. In the next section, where the statistical properties of the parameter estimators are given, other arguments will be given to illustrate that the least squares method is indeed a good method.

In general, minimizing the least squares criterion L may be done by means of a numerical algorithm. But for simple linear models (e.g. the ANOVA model), there exists an analytical solution of the minimization problem. In particular, the point estimators are given by

μ̂ = (1/p) Σ_{i=1}^p Ȳ_i   (5.19)

τ̂_i = (1/n_i) Σ_{j=1}^{n_i} Y_ij − μ̂ = Ȳ_i − μ̂   (5.20)

σ̂² = L/(N − p) = (1/(N − p)) Σ_{i=1}^p Σ_{j=1}^{n_i} e²_ij.   (5.21)

We will also use the notation Ȳ and Ȳ_i to denote the sample mean over all samples, and the sample mean of the i-th sample, respectively.
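The estimators in Equations 5.19–5.21 can be computed directly from the group means. Below is a minimal R sketch for a small, purely hypothetical response y and factor g with three groups of unequal size; it is only meant to make the formulas concrete.

y <- c(5.1, 4.8, 5.6, 6.2, 6.0, 5.9, 7.1, 6.8)          # hypothetical data
g <- factor(c("A", "A", "A", "B", "B", "B", "C", "C"))   # factor with p = 3 levels

ybar_i  <- tapply(y, g, mean)            # sample means of the p groups
mu_hat  <- mean(ybar_i)                  # Equation 5.19
tau_hat <- ybar_i - mu_hat               # Equation 5.20

e  <- y - ybar_i[as.integer(g)]          # residuals e_ij = Y_ij - Ybar_i
N  <- length(y); p <- nlevels(g)
S2 <- sum(e^2) / (N - p)                 # Equation 5.21: pooled variance estimator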
5.2.2 Statistical Properties of the Estimators
In this section we give the statistical properties of the estimators, i.e. how
they are distributed. We will, however, not go into the theoretical details.
Equations 5.19 and 5.20 show that the estimators μ̂ and τ̂_i are simple functions of sample means. From Chapter 3 we know that sample means are normally distributed when the observations are assumed to be normally distributed. In particular,

Ȳ_i ~ N(μ_i, σ²/n_i).   (5.22)

From these properties, the distributions of the parameter estimators may be derived. Let

1/n_r = (1/p) Σ_{i=1}^p 1/n_i,   (5.23)

i.e. n_r is the reciprocal (harmonic) mean of the sample sizes n_i (note that if all sample sizes are equal, n_1 = n_2 = . . . = r, then n_r = r). Then,

μ̂ ~ N(μ, σ²/(p n_r))   (5.24)

τ̂_i ~ N(τ_i, σ²/n_i − (σ²/p)(2/n_i − 1/n_r)),   (5.25)

where (σ²/p)(2/n_i − 1/n_r) > 0. Further, it can be shown that the variance of τ̂_i is minimal if n_1 = n_2 = . . . = n_p = n_r. Thus, such a study design is most efficient (it is called a balanced design). Clearly, the estimators are unbiased and consistent. Sometimes we will use the notation

σ²_μ̂ = σ²/(p n_r)   and   σ²_τ̂_i = σ²/n_i − (σ²/p)(2/n_i − 1/n_r).

The variances σ²_μ̂ and σ²_τ̂_i may be estimated by simply replacing σ² with its unbiased estimator S² (Equation 5.18), resulting in

S²_μ̂ = σ̂²_μ̂ = S²/(p n_r)   (5.26)

S²_τ̂_i = σ̂²_τ̂_i = S²/n_i − (S²/p)(2/n_i − 1/n_r).   (5.27)

An important characteristic is that both the estimated variance of μ̂ and that of τ̂_i are proportional to the variance estimator S². This variance estimator, however, is proportional to the minimized least squares criterion L. Hence, among all possible definitions of estimators, the least squares estimators are those that have the smallest estimated variance! This may be considered as an optimality property. Thus, the least squares estimators are optimal estimators. Finally, since S² is an unbiased estimator, the estimators S²_μ̂ and S²_τ̂_i are unbiased as well.

The statistical properties of S² are more difficult. It may be shown that

(N − p) S²/σ² ~ χ²_{N−p}.   (5.28)

It is obvious that this result only holds if indeed all observations have the same common variance σ².
5.2.3 Interval Estimators (confidence intervals)

Once the statistical properties of the point estimators are known, it is straightforward to obtain the confidence intervals (= interval estimators). Since this is completely analogous to Section 3.6, we omit the details here and give only the solutions.

The lower and upper limits of the (1 − α)-confidence interval of μ are given by

L = μ̂ − S_μ̂ t_{N−p,α/2}   (5.29)
U = μ̂ + S_μ̂ t_{N−p,α/2}.   (5.30)

The lower and upper limits of the (1 − α)-confidence interval of τ_i are given by

L = τ̂_i − S_τ̂_i t_{N−p,α/2}   (5.31)
U = τ̂_i + S_τ̂_i t_{N−p,α/2}.   (5.32)

Of course, it is also interesting to know the confidence interval of the mean μ_i of the i-th population. One solution would be to use only the observations of the i-th sample, and to apply the methods given in Section 3.6. This is indeed a correct approach, but when the observations of all samples have a common variance σ², there is a more efficient method. In particular, in this case the lower and upper limits of the (1 − α)-confidence interval of μ_i are given by

L = μ̂ + τ̂_i − (S/√n_i) t_{N−p,α/2} = Ȳ_i − (S/√n_i) t_{N−p,α/2}   (5.33)
U = μ̂ + τ̂_i + (S/√n_i) t_{N−p,α/2} = Ȳ_i + (S/√n_i) t_{N−p,α/2}.   (5.34)

The reason why this is a better approach is that now the observations of the p − 1 other samples are also used to estimate the common variance σ². Thus S² is a better estimator than S²_i, which is calculated only from the n_i observations in the i-th sample. The fact that S² is a better estimator is reflected in the degrees of freedom of the t-distribution used in the confidence interval: N − p degrees of freedom instead of the n_i − 1 if only the i-th sample had been used. (The larger the degrees of freedom, the smaller the t-quantile, resulting in a smaller confidence interval.)
5.3 ANOVA Table and Hypothesis Tests
A central role in the analysis of variance is played by the ANOVA table, in
which the decomposition of the Total Sum of Squares is presented. Also the
results of the statistical tests are presented in this table.
The hypotheses that are of interest here are

H_0: μ_1 = μ_2 = . . . = μ_p, or, equivalently, H_0: τ_1 = τ_2 = . . . = τ_p = 0   (5.35)

versus

H_1: at least two means are different.   (5.36)
5.3.1 Decomposition of the Total Sum of Squares
The name analysis of variance refers to the decomposition of the Total Sum of Squares, where the total sum of squares measures the total variability in the data, and where each of the components represents the variability that can be explained by a factor in the statistical model. All these components are sums of squares (SS); with each sum of squares a mean sum of squares (MS) and a number of degrees of freedom (df) are associated. The latter are determined such that each MS is an unbiased estimator of a variance. Furthermore, the MS is always defined as the SS divided by the corresponding df.

The total sum of squares is defined as

SSTot = Σ_{i=1}^p Σ_{j=1}^{n_i} (Y_ij − Ȳ)².   (5.37)

There are N − 1 degrees of freedom associated with SSTot, and the total mean sum of squares is given by

MSTot = SSTot / (N − 1).   (5.38)

The interpretation of SSTot is very easy. From Equations 5.37 and 5.38 it is seen that MSTot has the form of a sample variance estimator in which no distinction between the p populations is made (i.e. Ȳ is used as an estimator of a common mean). Thus, since under the null hypothesis there is indeed one common mean in the p populations, MSTot is then an unbiased estimator of σ². We will show that, when H_0 does not hold true, MSTot will be larger than what we would expect under H_0.
The second sum of squares is the Treatment Sum of Squares, which is defined as

SST = Σ_{i=1}^p Σ_{j=1}^{n_i} (Ȳ_i − Ȳ)² = Σ_{i=1}^p n_i (Ȳ_i − Ȳ)²,   (5.39)

which is associated with p − 1 degrees of freedom. The Treatment Mean Sum of Squares is given by

MST = SST / (p − 1).   (5.40)

Again MST has the general form of a sample variance, but this time the p sample means Ȳ_i are treated as observations, and Ȳ is considered as the estimator of the mean of a population from which the sample means Ȳ_i are sampled. From this reasoning, it may be understood that MST will be small when in reality all μ_i are equal, and MST will be large when the μ_i are different. We could think of MST or SST as a measure of the sample information against the null hypothesis. This will be helpful in the construction of the statistical test.
The third sum of squares is the Error Sum of Squares or Residual Sum of Squares, which is defined as

SSE = Σ_{i=1}^p Σ_{j=1}^{n_i} (Y_ij − Ȳ_i)² = Σ_{i=1}^p Σ_{j=1}^{n_i} e²_ij = L,   (5.41)

which is associated with N − p degrees of freedom. The Error Mean Sum of Squares, or Residual Mean Sum of Squares, is given by

MSE = SSE / (N − p) = S².   (5.42)

Thus MSE, which is exactly equal to S², is the unbiased estimator of the residual variance σ². This characteristic holds independently of whether or not the null hypothesis is true.

So far we have defined the sums of squares. The decomposition is given by the following important property, which holds for all samples:

SSTot = SST + SSE.   (5.43)

Thus SSTot, which is a measure of the total variability of the observations, is decomposed into (1) SST, the part of the total variability that can be explained by the differences between the p sample means, and (2) the residual variability, the part that cannot be explained by the differences between the p sample means.

Finally, the same decomposition as in Equation 5.43 holds for the degrees of freedom, i.e.

N − 1 = (p − 1) + (N − p).   (5.44)
5.3.2 Hypothesis Tests
As before, we will specify the statistical test by (1) looking for a suitable test statistic, (2) finding its null distribution, and (3) specifying the decision rule such that the type I error rate is exactly controlled at α.
1. From the previous section it is clear that SST, or MST, is a good starting point for a test statistic. It can be proven that

E{MST} = σ² + (Σ_{i=1}^p n_i τ²_i) / (p − 1).   (5.45)

Thus, indeed, the distribution of MST shifts to the right the more the τ_i differ from zero. On the other hand, MST also depends on the variance σ². Therefore, we normalize MST by dividing it by MSE = S², which is an unbiased estimator of σ².

Thus, the test statistic is defined as

T = MST / MSE,   (5.46)

and we will reject the null hypothesis for large values of T.

2. It can be shown that, under H_0,

T ~ F_{p−1,N−p}.   (5.47)

(Note that the degrees of freedom of the F-distribution are those of MST and MSE.)

3. Since (1) the null hypothesis is rejected for large values of the test statistic, and (2) its null distribution is F_{p−1,N−p}, the decision rule is
t_o = MST/MSE ≤ F_{p−1,N−p;α}: accept H_0
t_o = MST/MSE > F_{p−1,N−p;α}: reject H_0, conclude H_1

Since the null distribution is an F-distribution, the test is often called the F-test. Therefore, we will also often use the notation F = MST/MSE.

Table 5.1: An ANOVA Table

Source      SS      df      MS      F-value     p-value
Treatment   SST     p − 1   MST     MST/MSE     p
Error       SSE     N − p   MSE
Total       SSTot   N − 1   MSTot
5.3.3 ANOVA Table
In an ANOVA table, the decomposition of the sum of squares and the result
of the statistical test are summarized. In general the ANOVA table looks
like Table 5.1. Sometimes MSTot is not mentioned in the table.
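The construction of Table 5.1 can be mimicked by hand. The following minimal R sketch uses the same hypothetical y and g as the sketch in Section 5.2.1 and computes the sums of squares, the F statistic and its p-value; anova(lm(y ~ g)) or summary(aov(y ~ g)) would produce the same table.

y <- c(5.1, 4.8, 5.6, 6.2, 6.0, 5.9, 7.1, 6.8)           # hypothetical data
g <- factor(c("A", "A", "A", "B", "B", "B", "C", "C"))
N <- length(y); p <- nlevels(g)
ybar_i <- tapply(y, g, mean)

SSTot <- sum((y - mean(y))^2)                    # Equation 5.37
SSE   <- sum((y - ybar_i[as.integer(g)])^2)      # Equation 5.41
SST   <- SSTot - SSE                             # Equation 5.43
Fobs  <- (SST / (p - 1)) / (SSE / (N - p))       # Equation 5.46: MST/MSE
pval  <- 1 - pf(Fobs, p - 1, N - p)              # p-value of the F-test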
5.4 Example
Dataset: nutrition.sdd
The daily energy expenditure (ENEX) is measured at three different time points (periods). In particular, at these three time points independent samples are taken from the population.

An interesting question for the researcher is whether or not the mean ENEX is the same in the three periods, i.e.

H_0: μ_1 = μ_2 = μ_3   (5.48)

versus

H_1: at least two means are different.   (5.49)
Figure 5.1: A box-plot (left) and QQ-plots (right) of the ENEX at the 3 periods
(Note: the variable PERIOD is the factor variable; it has p = 3 levels in this example.)

The data are shown in Figure 5.1. The box-plot clearly suggests that the 3 variances are equal, and the QQ-plots illustrate that the observations in the 3 samples are normally distributed. Hence, all assumptions of the ANOVA are satisfied.
The S-Plus output is given below.
*** Analysis of Variance Model ***
Type III Sum of Squares
Df Sum of Sq Mean Sq F Value Pr(F)
PERIOD 2 13715701 6857851 28.19607 1.386768e-011
Residuals 212 51562663 243220
Estimated K Coefficients for K-level Factor:
$"(Intercept)":
(Intercept)
2358.651
$PERIOD:
period 1 period 2 period 3
356.2778 -103.1324 -253.1454
Tables of means
Grand mean
2335
PERIOD
period 1 period 2 period 3
2714.9 2255.5 2105.5
rep 63.0 73.0 79.0
Tables of adjusted means
Grand mean
2358.651
se 33.782
PERIOD
period 1 period 2 period 3
2714.9 2255.5 2105.5
se 62.1 57.7 55.5
Interpretation and conclusions:

1. From the ANOVA table we read p = 1.386768e−011 ≈ 0 < 0.05. Hence, we may reject the null hypothesis very strongly. We conclude that, at the 5% level of significance, the mean ENEX is not the same in the three periods. (Later we will see methods to determine which means are different, and which are not.)

2. Under the title "Estimated K Coefficients for K-level Factor:" we find the parameter estimates. In particular,

μ̂ = 2358.651   (5.50)
τ̂_1 = 356.2778   (5.51)
τ̂_2 = −103.1324   (5.52)
τ̂_3 = −253.1454   (5.53)

3. In the "Tables of means", the sample means are found: Ȳ = 2335, Ȳ_1 = 2714.9, Ȳ_2 = 2255.5 and Ȳ_3 = 2105.5. You can also read the number of observations in each sample.
4. In the "Tables of adjusted means", you can find the Grand Mean, which is exactly equal to μ̂ = (Ȳ_1 + Ȳ_2 + Ȳ_3)/3. The means given for the three periods are again the sample means Ȳ_i. Standard errors (se) of the form S_i/√n_i would be calculated only from the data of the corresponding sample, but we have seen that S² = MSE is a better estimator of σ². Thus, better standard errors are obtained by using S = √243220 = 493.17 instead of S_1, S_2 and S_3, resulting in S/√63 = 62.13, S/√73 = 57.72 and S/√79 = 55.49. With these standard errors, confidence intervals of the corresponding μ_i's can be calculated (note: t-distributions with N − p degrees of freedom must be used because S² is used as the variance estimator).
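For reference, the analysis above could be reproduced in R roughly as follows (the nutrition data set itself is not part of these notes, so the column names are assumed). The options() line requests the sum (Σ) restriction, so that the reported coefficients correspond to μ̂ and the τ̂_i; without it, R uses a different default parameterization.

options(contrasts = c("contr.sum", "contr.poly"))   # use the sum restriction
fit <- aov(ENEX ~ PERIOD, data = nutrition)
summary(fit)                  # ANOVA table and F-test of H0: mu_1 = mu_2 = mu_3
model.tables(fit, "means")    # grand mean and sample means per period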
5.5 The Kruskal-Wallis Test
When the observations in one of the p populations are not normally distributed, the F-test in ANOVA and the confidence intervals may not be formally interpreted. However, when this occurs for population i and the i-th sample size (n_i) is sufficiently large for the CLT to apply in that sample, the F-tests and confidence intervals may still be interpreted. In the other cases, we can rely on a nonparametric alternative to the F-test: the Kruskal-Wallis test. Like Wilcoxon's tests, the Kruskal-Wallis test is a rank test, which means that it is rather insensitive to outliers. We will not give the details of this test but, as with all other tests, it can be interpreted by looking only at the p-value.
Example 5.2. We take the same example as in Section 5.4, but now we
will act as if the observations were not normally distributed such that we
had to apply the Kruskal-Wallis test. Below is the S-Plus output.
Kruskal-Wallis rank sum test
data: ENEX and PERIOD from data set nutrition
Kruskal-Wallis chi-square = 45.3196, df = 2, p-value = 0
alternative hypothesis: two.sided
Also here, p = 0 < 0.05 which results in the same conclusion as in Section
5.4.
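In R the corresponding call would be (column names assumed):

kruskal.test(ENEX ~ PERIOD, data = nutrition)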
5.6 Multiple Comparison of Means
5.6.1 Situation of the Problem: Multiplicity
Again we consider the example of Section 5.4. Based on the F-test we have concluded that at least two means are different, but we were not able to say which of the three means are different and which are equal. Methods that solve this type of question are called multiple comparison or pairwise comparison methods.

One solution, which at first sight may look very obvious, is to apply 2-sample t-tests to test the three null hypotheses H_{0,1}: μ_1 = μ_2, H_{0,2}: μ_2 = μ_3 and H_{0,3}: μ_1 = μ_3. (Note that all three null hypotheses are true if and only if the general null hypothesis H_0: μ_1 = μ_2 = μ_3 is true.) The type I error rate for this procedure is thus

P{type I error} = P{reject H_{0,1} or reject H_{0,2} or reject H_{0,3} | H_0}.   (5.54)

Unfortunately, this probability is very hard to calculate, but basic probability calculus tells us that at least

P{reject H_{0,1} or reject H_{0,2} or reject H_{0,3} | H_0}
≤ P{reject H_{0,1} | H_0} + P{reject H_{0,2} | H_0} + P{reject H_{0,3} | H_0}.   (5.55)

Each of these three terms represents exactly the type I error rate of one of the three 2-sample t-tests. Thus, if each of these t-tests is performed at the α-level, then we have for the overall type I error rate

P{type I error} ≤ α + α + α = 3α.   (5.56)

Hence, the type I error rate is inflated, which is due to applying several tests to obtain a conclusion. This phenomenon is called multiplicity.

It is easy to extend the above discussion to the situation where p means have to be compared in a pairwise fashion. In particular, when all p means have to be compared, then q = p!/(2(p − 2)!) two-sample t-tests must be performed. The overall type I error rate is then (extending Equation 5.56)

P{type I error} ≤ qα.   (5.57)
5.6.2 Bonferroni Correction
A simple approximate solution can be derived directly from Equation 5.57. Equation 5.57 suggests that each of the individual t-tests should be performed at a lower type I error rate. In particular, suppose that each of the q t-tests is performed at some α_t-level. If we want the overall type I error rate to be controlled at α, then we could take α_t = α/q. Equation 5.57 then gives

P{type I error} ≤ q α_t = α.   (5.58)

This guarantees that the overall type I error rate does not exceed α. However, it turns out that, when q increases, the difference between the true type I error rate and the upper bound typically becomes larger and larger. (A testing procedure that results in a type I error rate smaller than the nominal α-level is called conservative.) Thus, the Bonferroni correction is actually a too strong correction procedure, but it is a safe method which has the advantage of being very simple to apply.

In the above procedure 2-sample t-tests have to be performed. When the ANOVA assumptions hold, however, the data of all p samples may be used to estimate the common variance σ². Therefore, S² = MSE is used to calculate the t-tests. The null distribution of the t-tests is thus t_{N−p} instead of t_{n_i+n_j−2}.

Since there exists an equivalence between α-level t-tests and (1 − α)-level confidence intervals, the results of a multiple comparison procedure may also be presented as (1 − α)-simultaneous confidence intervals for the true differences in means μ_i − μ_j (i ≠ j). In particular, the lower and upper limits of the simultaneous (1 − α)-confidence interval of μ_i − μ_j are given by

L = μ̂_i − μ̂_j − √(S²/n_i + S²/n_j) t_{N−p,(α/q)/2}   (5.59)
U = μ̂_i − μ̂_j + √(S²/n_i + S²/n_j) t_{N−p,(α/q)/2}.   (5.60)

Note that t_{N−p,(α/q)/2} is the critical value of the individual t-tests in the Bonferroni correction procedure. Confidence intervals that do not contain 0 correspond to μ_i ≠ μ_j (this is the equivalence between confidence intervals and statistical tests).
Example 5.3. Again we consider the example of Section 5.4. The S-plus
output is given below.
95 % simultaneous confidence intervals for specified
linear combinations, by the Bonferroni method
critical point: 2.4131
response variable: ENEX
intervals excluding 0 are flagged by ****
Estimate Std.Error Lower Bound Upper Bound
period 1-period 2 459 84.8 255.0 664 ****
period 1-period 3 609 83.3 408.0 810 ****
period 2-period 3 150 80.1 -43.2 343
The critical point is t_{N−p,(α/q)/2} = t_{212,0.008333} = 2.4131. From the simultaneous confidence intervals, we conclude that the mean ENEX in period 1 is significantly different from the mean ENEX in periods 2 and 3. There is insufficient evidence for a difference in mean ENEX between periods 2 and 3; therefore we consider them equal (accepting H_{0,2}: μ_2 = μ_3).
Finally, it is interesting to remark that the key step in the Bonferroni method
(Equation 5.55) does not depend on the type of the q individual statistical
tests. Thus, also when the data are not normally distributed, the Bonferroni
method may be applied to e.g. Wilcoxon rank sum tests.
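A minimal R sketch of Bonferroni-corrected pairwise comparisons (column names assumed) is given below. pairwise.t.test pools the variance over all samples by default, i.e. it uses S² = MSE as described above, and reports Bonferroni-adjusted p-values rather than simultaneous confidence intervals; pairwise.wilcox.test applies the same correction to rank sum tests, as in the remark above.

pairwise.t.test(nutrition$ENEX, nutrition$PERIOD, p.adjust.method = "bonferroni")
pairwise.wilcox.test(nutrition$ENEX, nutrition$PERIOD, p.adjust.method = "bonferroni")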
5.6.3 The Tukey Method
When all the ANOVA assumptions hold, the Tukey method is definitely the best choice to correct for multiplicity. The Tukey method consists in the construction of simultaneous confidence intervals for all q differences μ_i − μ_j in an exact way, i.e. the method guarantees that the coverage probability of the Tukey simultaneous confidence intervals is exactly 1 − α (Bonferroni only guarantees that this coverage probability is not smaller than 1 − α). Then, by the equivalence between statistical tests and confidence intervals, it can be directly deduced which means are different and which are equal.
Example 5.4. Again we consider the example of Section 5.4. The S-Plus
output is given below.
95 % simultaneous confidence intervals for specified
linear combinations, by the Tukey method
critical point: 2.3603
response variable: ENEX
intervals excluding 0 are flagged by ****
Estimate Std.Error Lower Bound Upper Bound
period 1-period 2 459 84.8 259 660 ****
period 1-period 3 609 83.3 413 806 ****
period 2-period 3 150 80.1 -39 339
The same conclusion as with the Bonferroni method is obtained. As with the Bonferroni method, the critical point (here: 2.3603) is the critical value to be used in the individual 2-sample t-tests (or, equivalently, in the calculation of the simultaneous confidence intervals).
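In R, Tukey simultaneous confidence intervals can be obtained from an aov fit (column names assumed); the output is presented slightly differently from the S-Plus listing above, but intervals excluding 0 lead to the same conclusions.

fit <- aov(ENEX ~ PERIOD, data = nutrition)
TukeyHSD(fit, conf.level = 0.95)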
5.6.4 LSD method
Finally, we briefly comment on the LSD (Least Significant Difference) method of Fisher. This method consists in applying 2-sample t-tests to test the partial hypotheses μ_i = μ_j without correcting for multiplicity. Thus, this is not a correction procedure!
5.7 Two-way ANOVA
In this section, the ANOVA model of Section 5.1.2 is extended so that it includes the effects of two factor variables.
Example 5.5. Dataset: nutrition.sdd.
With the ANOVA model that we have seen so far, it is possible to model e.g. the mean QI for the 3 populations defined by the 3 periods. It is also possible to use the ANOVA model to model the mean QI for the 2 populations defined by the 2 genders (men and women). It is, however, also possible to consider the data as samples from 6 populations, which are defined by the combinations of the 3 periods and the 2 genders. Figure 5.2 shows the data in this manner.
Figure 5.2: Box-plots of the 6 samples, obtained by considering all combinations of periods and genders
5.7.1 Two-way ANOVA Model
We consider two factor variables, denoted by T and B. Suppose T has t levels and B has b levels. By combining both factors, tb populations are defined. We will allow these tb populations to differ in mean, but we will assume that all tb populations have the same variance σ². In particular, the population defined by the i-th level of factor T and the j-th level of factor B is assumed to be a normal distribution with variance σ² and mean μ_ij. An observation from the sample from population ij is denoted by Y_ijk, where k = 1, . . . , n_ij. Thus, the model becomes

Y_ijk = μ_ij + ε_ijk,   (5.61)

where ε_ijk i.i.d. N(0, σ²) (i = 1, . . . , t; j = 1, . . . , b; k = 1, . . . , n_ij).

As in the one-factor ANOVA, the population means are modelled in terms of effects. Since we now have two factors, we will consider two sets of effects. Since there are bt possibly different population means μ_ij, there should be an exactly equal number of independent parameters parameterizing these means. By using factor effects, we could consider the additive model

μ_ij = μ + τ_i + β_j,   (5.62)

where for both sets of effects the Σ-restrictions hold, i.e. Σ_{i=1}^t τ_i = Σ_{j=1}^b β_j = 0. Although this model might hold in reality, there are only 1 + (t − 1) + (b − 1) = t + b − 1 independent parameters involved. Imposing such a model therefore implies a restriction (additivity of the factor effects).

A possible extension of Model 5.62 is

μ_ij = μ + τ_i + β_j + (τβ)_ij,   (5.63)

where τ_i and β_j are as before, and on the tb interaction parameters (τβ)_ij we impose the Σ-restrictions Σ_{i=1}^t (τβ)_ij = 0 (for each j) and Σ_{j=1}^b (τβ)_ij = 0 (for each i), which leaves (t − 1)(b − 1) of them independent. Hence, we now have 1 + (t − 1) + (b − 1) + (t − 1)(b − 1) = tb independent parameters. The model in Equation 5.63 is called the saturated model. The parameters (τβ)_ij are the effects of the interaction between the factors T and B.

The effects (τβ)_ij are called interaction effects, and the effects τ_i and β_j are called the main effects.
5.7.2 Interaction
Example 5.6. (Example 5.5 continued.) In terms of this example we will explain the interpretation of interaction between the two factors (period and gender).
Suppose there is no interaction, i.e. the additive model of Equation 5.62 applies. Comparing the models in Equations 5.62 and 5.63, this means that all (τβ)_ij = 0. Thus, the mean for the i-th period and the j-th gender is given by adding the corresponding effects: μ_ij = μ + τ_i + β_j. Or, put differently, the effect of period i is the same for both genders (j = 1 and j = 2).

Consider the difference in means for period i between e.g. the two genders, i.e.

μ_i1 − μ_i2 = (μ + τ_i + β_1) − (μ + τ_i + β_2) = β_1 − β_2,   (5.64)

which is independent of the period i. Hence, the difference in means between the two genders is always β_1 − β_2, irrespective of the period. A similar reasoning holds for the differences in means of e.g. the men (j = 1, say) between two periods; e.g.

μ_11 − μ_21 = (μ + τ_1 + β_1) − (μ + τ_2 + β_1) = τ_1 − τ_2,   (5.65)

and the same difference is found for the women (j = 2).

The level plot in the left panel of Figure 5.3 illustrates the no-interaction situation: parallel lines.
Figure 5.3: Level plots of an additive model (left) and an interaction model (right), showing the mean QI for all combinations of gender and period
When there is an interaction effect of the two factors on the mean, the lines in the level plot will typically no longer be parallel. This is illustrated in the right panel of Figure 5.3. In that figure we see that the difference in mean in period 2 between men and women is larger than in period 1, and the difference in mean in period 3 between men and women even has a different sign. Thus, the effects of gender are not additive, in the sense that the effect of gender differs from period to period. A consequence is that we cannot conclude anything about the effect of gender without specifying the period.

The interpretation of interaction can also be seen in the saturated model (Equation 5.63). The difference in mean in period i between men and women is given by

μ_i1 − μ_i2 = (μ + τ_i + β_1 + (τβ)_i1) − (μ + τ_i + β_2 + (τβ)_i2)   (5.66)
            = β_1 − β_2 + ((τβ)_i1 − (τβ)_i2).   (5.67)

Thus, indeed, this difference still depends on the period.

In general, when there is interaction, the effect of factor T depends on the level j of factor B, and, equivalently, the effect of factor B depends on the level i of factor T.
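Level plots such as Figure 5.3 are easy to produce; a minimal R sketch (column names assumed) is

with(nutrition, interaction.plot(PERIOD, SEX, QI))
# roughly parallel profiles suggest an additive model;
# crossing or diverging profiles suggest interaction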
5.7.3 Parameter Estimation
The method of least squares can be extended directly, resulting in point estimators μ̂, τ̂_i, β̂_j and estimates of the interaction effects (τβ)_ij. Again it can be shown that these estimators are unbiased and consistent. Furthermore, they are normally distributed and their variances are again proportional to σ². The residual variance σ² can again be estimated by S² = MSE (see later).

Confidence intervals (i.e. interval estimators) can also be calculated as before. Only the degrees of freedom have to be changed to the number of degrees of freedom associated with MSE (see later).
5.7.4 Decomposition of the Total Sum of Squares
In Section 5.3.1 the total sum of squares was introduced. It is straightforwardly extended to

SSTot = Σ_{i=1}^t Σ_{j=1}^b Σ_{k=1}^{n_ij} (Y_ijk − Ȳ)².   (5.68)

As before, SSTot does not depend on the statistical model. It is only a measure of the total variability in the dataset (pooled over all samples). The decomposition of SSTot does depend on the statistical model.

Suppose the additive model (Equation 5.62) is used. Then the decomposition is

SSTot = SST + SSB + SSE,   (5.69)

where SST and SSB are the sums of squares of factor T and of factor B, respectively. They are associated with t − 1 and b − 1 degrees of freedom, and SSTot is associated with N − 1 degrees of freedom. The degrees of freedom of the residual sum of squares (SSE) are most easily calculated as (N − 1) − (t − 1) − (b − 1).

Suppose the saturated model (Equation 5.63) is used. Then the decomposition is

SSTot = SST + SSB + SSTB + SSE,   (5.70)

where SSTB is the sum of squares of the interaction between the factors T and B. It is associated with (t − 1)(b − 1) degrees of freedom. Thus the degrees of freedom of SSE are calculated as (N − 1) − (t − 1) − (b − 1) − (t − 1)(b − 1).
5.7.5 Statistical Tests
Statistical tests can be constructed in a similar way as for the one-way ANOVA, but now three different null hypotheses may be of interest: the no-effect hypothesis of factor T, the no-effect hypothesis of factor B, and the no-interaction-effect hypothesis of T and B.

Before we give some more details on the construction of these tests, we argue that the no-interaction hypothesis must be tested first.

Suppose there is an interaction effect of T and B. Then we have already shown that the difference in means between two levels of factor T depends on the level of factor B. Hence, it is meaningless to say something about the main effect of factor T without specifying a level of the other factor, i.e. the parameters τ_i have no clear interpretation in the presence of interaction (see also Equation 5.67). The same reasoning applies to the main effect of factor B. Thus, in conclusion, we will always first test for interaction, and when we conclude that interaction is present, we will not test the main effects. When, on the other hand, no interaction turns out to be present, we first eliminate the interaction effects from the saturated model (i.e. we change to the additive model) before testing for main effects.

The no-interaction null hypothesis is

H_0: (τβ)_11 = (τβ)_12 = . . . = (τβ)_tb = 0.   (5.71)

The test (an F-test) is based on

F = MSTB/MSE ~ F_{(t−1)(b−1), N−t−b−(t−1)(b−1)+1} under H_0.   (5.72)

The test for the main effects of factor T is based on

F = MST/MSE ~ F_{t−1, N−t−b+1} under H_0.   (5.73)

(Note that the degrees of freedom of MSE are now those of the additive model, i.e. after the interaction effects have been eliminated.) In a similar way,
the test for the main effects of factor B is based on

F = MSB/MSE ~ F_{b−1, N−t−b+1} under H_0.   (5.74)

Table 5.2: The ANOVA Table of the interaction model

Source   SS      df                                 MS      F-value      p-value
T        SST     t − 1                              MST     MST/MSE      p
B        SSB     b − 1                              MSB     MSB/MSE      p
TB       SSTB    (t − 1)(b − 1)                     MSTB    MSTB/MSE     p
Error    SSE     N − 1 − (t − 1) − (b − 1) − (t − 1)(b − 1)   MSE
Total    SSTot   N − 1                              MSTot
5.7.6 The ANOVA Table
The ANOVA table of the saturated model is of the form given in Table 5.2. For the additive model, only the TB line must be deleted and the degrees of freedom of SSE must be increased by (t − 1)(b − 1).
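In R the two models could be fitted as sketched below (column names assumed). Note that summary(aov()) reports sequential sums of squares, which coincide with the Type III sums of squares of the S-Plus output only for balanced designs, so small numerical differences are possible.

fit.sat <- aov(QI ~ SEX * PERIOD, data = nutrition)   # saturated model, Eq. 5.63
summary(fit.sat)                                      # look at the interaction line first

fit.add <- aov(QI ~ SEX + PERIOD, data = nutrition)   # additive model, Eq. 5.62
summary(fit.add)                                      # main effects of SEX and PERIOD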
5.7.7 Example
Example 5.5 continued. The main purpose of the analysis is to test whether or not the mean QI depends on the gender and/or on the period. We start the analysis with the interaction model (Equation 5.63) (although all assumptions should actually be assessed, we do not show these analyses here).
The S-Plus output is shown below.
*** Analysis of Variance Model ***
Type III Sum of Squares
Df Sum of Sq Mean Sq F Value Pr(F)
SEX 1 70.713 70.71346 8.543015 0.0038503
PERIOD 2 28.876 14.43795 1.744273 0.1773064
SEX:PERIOD 2 25.786 12.89308 1.557635 0.2130687
Residuals 209 1729.965 8.27734
Estimated K Coefficients for K-level Factor:
$"(Intercept)":
(Intercept)
22.05973
$SEX:
men women
-0.6195476 0.6195476
$PERIOD:
period 1 period 2 period 3
0.08609352 0.4353718 -0.5214653
$"SEX:PERIOD":
menperiod 1 womenperiod 1 menperiod 2 womenperiod 2 menperiod 3
0.08940449 -0.08940449 0.4062557 -0.4062557 -0.4956602
womenperiod 3
0.4956602
In the ANOVA table we only look at the interaction effect. There we read p = 0.213. Since p = 0.213 > 0.05, we conclude that, at the 5% level of significance, there is no significant interaction effect between period and sex on the mean QI.

Since we have concluded that there is no interaction, we can adopt the additive model (Equation 5.62). The S-Plus output is given below.
*** Analysis of Variance Model ***
Type III Sum of Squares
Df Sum of Sq Mean Sq F Value Pr(F)
SEX 1 70.877 70.87659 8.517701 0.0038983
PERIOD 2 12.363 6.18139 0.742858 0.4769926
Residuals 211 1755.751 8.32109
Estimated K Coefficients for K-level Factor:
$"(Intercept)":
(Intercept)
22.07786
$SEX:
men women
-0.6202568 0.6202568
$PERIOD:
period 1 period 2 period 3
0.05222359 0.255565 -0.3077886
Now that the interaction effects are eliminated from the model, we can safely look at the main effects of both period and sex. For period, we read from the ANOVA table p = 0.477, which is greater than 0.05. Thus, we conclude that, at the 5% level of significance, there are no differences in mean QI between the three periods. For sex, we have p = 0.004 < 0.05, which results in the rejection of the corresponding null hypothesis. Thus, we may conclude that, at the 5% level of significance, the mean QI of the men is different from the mean QI of the women. Moreover, from the parameter estimates (the estimated sex effects are −0.620 for the men and 0.620 for the women) we may even conclude that the mean QI of men is smaller than the mean QI of women.
Chapter 6
Regression Analysis
6.1 Introductory Example
Dataset: fatplasma.sdd

Three different diets are compared w.r.t. the change (reduction) in fat concentration in blood plasma. The diets differ in their fat content. Since one suspects that the fat reduction very probably depends on the age of the subject, the study was designed such that one person from each of 5 age categories was included in the study for each diet. Figure 6.1 shows the interaction plot. This plot indeed suggests that the fat reduction depends on the age. The significance of this observation can be assessed by means of the ANOVA methods discussed in the previous chapter. However, since the dataset even contains the exact ages of the subjects (i.e. not only the age category, which is a factor, but also the exact numeric value of the age), other statistical techniques can be used: regression analysis.

For illustrating the regression methods, we will only consider the data of the extremely low fat diet. Figure 6.2 shows a scatter plot of the fat reduction w.r.t. the age of the subject. The plot suggests that there is more or less a linear relation between both (continuous) variables. In particular, a regression model will model the mean of the fat reduction distribution as a function of the age. If indeed this specific linear association between both variables exists, and if it is possible to estimate this linear association based on a sample, then this linear relation can be used to e.g. estimate the mean fat reduction for people of an age that was not included in the sample used for the estimation (e.g. for a person of 40 years old). Thus, regression analysis will allow us to predict outcomes. Such a prediction can be considered as a point estimate. The regression methods will also allow us to calculate confidence intervals for such predictions.
Figure 6.1: Interaction plot of the fatplasma.sdd dataset
Figure 6.2: A scatter plot of fat reduction w.r.t. the age of the subjects
In the same way as we reasoned in ANOVA, it is not because we see a linear relation in Figure 6.2, which is entirely based on a very small sample, that in reality there is indeed a true linear relation between age and mean fat reduction. In order to demonstrate that a linear relation is present, the regression analysis will provide us with statistical tests.

In this example, the age is referred to as the independent variable, or the predictor, or the regressor. The fat reduction is referred to as the dependent variable, or the response variable.
6.2 The Regression Model
6.2.1 Reformulation of the ANOVA Model
Example 6.1. (Continuation of the example of Section 6.1.)

If we consider the age as a factor and the fat reduction as the dependent variable, then the ANOVA model would be

Y_ij = μ + τ_i + ε_ij = μ_i + ε_ij,   (6.1)

(i = 1, . . . , 5; j = 1, . . . , n_i), where ε_ij i.i.d. N(0, σ²). The model is equivalent to

Y_ij i.i.d. N(μ_1, σ²) if the age of subject (i, j) is 15   (6.2)
Y_ij i.i.d. N(μ_2, σ²) if the age of subject (i, j) is 25   (6.3)
Y_ij i.i.d. N(μ_3, σ²) if the age of subject (i, j) is 36   (6.4)
Y_ij i.i.d. N(μ_4, σ²) if the age of subject (i, j) is 47   (6.5)
Y_ij i.i.d. N(μ_5, σ²) if the age of subject (i, j) is 60   (6.6)

Thus, the distribution of Y_ij is always normal with the same variance σ². Only the mean depends on the age of the subject. Let X_ij denote the age of the corresponding subject. Then we could write

μ_i = μ(X_ij).   (6.7)

With this notation, the index i becomes obsolete because its function is taken over by the specific age, which is given by the (independent) variable X. Let N denote the total number of observations, i.e. N = Σ_{i=1}^p n_i (here, p = 5). We then adopt the following notation: the index i refers directly to the subject. Thus, we have i = 1, . . . , N. The corresponding fat reductions are denoted by Y_i and the ages by X_i.

With this new notation, the original ANOVA model can be written as

Y_i = μ(X_i) + ε_i,   (6.8)

(i = 1, . . . , N), where ε_i i.i.d. N(0, σ²) and where X_i takes one of the values in {15, 25, 36, 47, 60}. The latter restriction is still needed here to make the model equivalent to the ANOVA model, which is defined specifically for only 5 different age levels. In a linear regression model, however, we will assume that the mean μ(X_i) is a linear function of the independent variable X_i. In particular,

μ(X_i) = α + βX_i,   (6.9)

where the parameters α and β are referred to as the intercept or constant, and the regression coefficient or slope, respectively. Furthermore, Equation 6.9 is often assumed to hold not only for the ages X_i which are present in the dataset, but, more generally, for a large interval of ages (e.g. ages in the interval [15, 64]).
6.2.2 The Regression Model
In general the regression model is defined as

Y_i = α + βX_i + ε_i,   (6.10)

(i = 1, . . . , N), where α and β are parameters, and where ε_i i.i.d. N(0, σ²).

Thus, the major difference between a regression model and an ANOVA model is that now the mean of the dependent variable is determined by a continuous linear function of the independent variable (X), whereas with ANOVA the mean of the dependent variable was determined by the level of a factor variable.
6.3 Parameter Estimation
6.3.1 Least Squares Criterion
The unknown parameters in the regression model 6.10 are estimated by means of exactly the same method as for the ANOVA model: least squares. The unknown regression parameters are now α and β. The parameter estimates are denoted by α̂ and β̂, respectively.

The fitted value, sometimes also called the predicted value, is denoted by Ŷ_i = Ŷ(α̂, β̂, X_i) (the latter notation stresses that the prediction depends on the parameters as well as on the independent variable X_i). Thus,

Ŷ_i = Ŷ(α̂, β̂, X_i) = α̂ + β̂X_i.   (6.11)

As with ANOVA, the residuals are given by

e_i = e_i(α̂, β̂) = Y_i − Ŷ_i(α̂, β̂) = Y_i − Ŷ_i.   (6.12)

Based on a sample, the least squares criterion is given by

L(α̂, β̂) = Σ_{i=1}^N e²_i(α̂, β̂).   (6.13)

The least squares parameter estimators are those values of α̂ and β̂ that minimize the least squares criterion L(α̂, β̂) (i.e. those values of the unknown parameters that minimize the total squared error).

Explicit formulae for the parameter estimates exist, but we will not give them here.

The residual variance σ² is estimated as

σ̂² = S² = L(α̂, β̂) / (N − 2).   (6.14)

Note that N − 2 is equal to the total number of observations minus the number of unknown parameters in the regression model (α and β). The residual variance of the ANOVA model was estimated in a similar way, i.e. the minimized least squares criterion divided by the total number of observations minus the number of unknown parameters (μ_1, . . . , μ_p): N − p. As with ANOVA, N − 2 is referred to as the degrees of freedom of the variance estimator.
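Although the explicit formulae are not given in the text, the minimization of Equation 6.13 has a simple closed-form solution, sketched below in R with purely hypothetical numbers (the real data are in fatplasma.sdd); lm(FATREDUC ~ AGE, ...) performs exactly this computation.

X <- c(15, 25, 36, 47, 60)                      # hypothetical ages
Y <- c(0.65, 0.90, 1.10, 1.30, 1.60)            # hypothetical fat reductions

beta_hat  <- sum((X - mean(X)) * (Y - mean(Y))) / sum((X - mean(X))^2)
alpha_hat <- mean(Y) - beta_hat * mean(X)

Yhat <- alpha_hat + beta_hat * X                # fitted values, Equation 6.11
e    <- Y - Yhat                                # residuals, Equation 6.12
S2   <- sum(e^2) / (length(Y) - 2)              # Equation 6.14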
6.3.2 Statistical Properties of the Estimators
We give here only a very general treatment of the statistical properties of the estimators (more is not needed for the interpretation of the output of statistical software).

Let θ denote either of the parameters α or β, and let θ̂ denote the corresponding least squares estimator. Then it can be shown that

θ̂ ~ N(θ, g(X_1, . . . , X_N) σ²),   (6.15)

where g(X_1, . . . , X_N) represents a function of the observed independent variables for which g(X_1, . . . , X_N) → 0 as the sample size N → ∞. Thus, the parameter estimators are unbiased and consistent. Since Var(θ̂) is a function of the independent variables, we may conclude that Var(θ̂) is a function of the study design, i.e. the accuracy of the parameter estimators depends on the choice of the values of the independent variables. We will not go into detail here, but this property opens up a whole range of techniques that allow us to choose the values of the independent variables such that Var(θ̂) is minimized, i.e. maximal accuracy is obtained.

As with the ANOVA model parameter estimators, the regression model parameter estimators have the property that their variance is proportional to the residual error variance σ². Moreover, σ² is the only unknown parameter in Var(θ̂). By replacing σ² by its estimator σ̂² = S² (Equation 6.14), the normal distribution becomes a t-distribution. In particular,

(θ̂ − θ) / s_θ̂ ~ t_{N−2},   (6.16)

where s_θ̂ = √(g(X_1, . . . , X_N) S²). Based on Equation 6.16, confidence intervals (i.e. interval estimates) and statistical tests can be constructed.
6.3.3 Interval Estimators (confidence intervals)
From Equation 6.16, lower and upper limits of confidence intervals for α and β can be calculated. In particular, the (1 − α) confidence interval of θ, where θ is either α or β, is given by

L = θ̂ − s_θ̂ t_{N−2;α/2}    (6.17)
U = θ̂ + s_θ̂ t_{N−2;α/2}.    (6.18)
6.3.4 Statistical Tests
Since there is an equivalence between (1 − α) confidence intervals and α-level statistical tests, we can immediately give the statistical test for testing

H_0: θ = 0,    (6.19)

where θ is either α or β. Again based on Equation 6.16, we find the null distribution of the test statistic,

T = θ̂ / s_θ̂  ~ (under H_0)  t_{N−2}.    (6.20)

Depending on the specific alternative hypothesis (one-sided or two-sided), the decision rule is found. For instance, for a two-sided hypothesis (H_1: θ ≠ 0), the decision rule is

|t_o| = |θ̂ / s_θ̂| ≤ t_{N−2;α/2}  :  accept H_0
|t_o| = |θ̂ / s_θ̂| > t_{N−2;α/2}  :  reject H_0, conclude H_1.

In practice, often only the hypothesis H_0: β = 0 is of interest.
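As a small numerical illustration (a sketch in R, using the slope estimate and standard error that appear in the example of the next subsection), the test statistic, two-sided p-value and confidence limits of Equations 6.17-6.20 could be computed as follows.

beta.hat <- 0.0208    # estimated slope (from the example below)
s.beta   <- 0.0031    # its estimated standard error
N        <- 5         # sample size
alpha    <- 0.05

t.obs  <- beta.hat / s.beta                      # Equation 6.20
t.crit <- qt(1 - alpha/2, df = N - 2)            # critical value t_{N-2;alpha/2} = 3.1824
p.val  <- 2 * (1 - pt(abs(t.obs), df = N - 2))   # two-sided p-value
ci     <- beta.hat + c(-1, 1) * s.beta * t.crit  # limits of Equations 6.17 and 6.18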
6.3.5 Example
Example 6.1 continued.
Although we have not yet covered all aspects of regression analysis, we already give a small example. Below, the S-Plus output is given for the example of Section 6.1.
*** Linear Model ***
Call: lm(formula = FATREDUC ~ AGE, data = fatplasma.Extremely.low,
na.action = na.exclude)
Residuals:
1 10 5 14 9
0.06946 -0.008626 -0.1575 0.0736 0.02309
Coefficients:
Value Std. Error t value Pr(>|t|)
(Intercept) 0.3484 0.1226 2.8409 0.0656
AGE 0.0208 0.0031 6.7672 0.0066
Residual standard error: 0.109 on 3 degrees of freedom
Multiple R-Squared: 0.9385
F-statistic: 45.8 on 1 and 3 degrees of freedom, the p-value is 0.006593
In this output we find:

- The residuals e_i for all 5 observations. Note that the numbers above the residuals are the observation numbers (identification numbers) in the original dataset, i.e. the fatplasma.sdd dataset before splitting according to AGE.CLASS.
  As with ANOVA, the assumption of normality (ε_i i.i.d. ~ N(0, σ²)) can be assessed by looking at the QQ-plot of these residuals (we have not done this here because it is quite meaningless with as few as 5 observations).
- The parameter estimates are

  α̂ = 0.3484    (6.21)
  β̂ = 0.0208    (6.22)

  and their respective estimated standard deviations are

  s_α̂ = 0.1226    (6.23)
  s_β̂ = 0.0031.    (6.24)

- From the estimated parameters, the estimated regression model can be written down:

  Ŷ = α̂ + β̂ X    (6.25)
    = 0.3484 + 0.0208 X.    (6.26)

  The interpretation of the model is as follows: for each increase in age of 1 year, the mean fat reduction increases by 0.0208 units.
- The test of the hypothesis H_0: β = 0 against H_1: β ≠ 0 (two-sided) is performed by calculating t_o = β̂ / s_β̂ = 6.7672 (this is directly read from the output). We know that the null distribution of the test statistic is t_{N−2}. Thus, at α = 0.05, the critical value is t_{N−2;α/2} = t_{3;0.025} = 3.1824. Since |t_o| > 3.1824 we reject the null hypothesis at the 5% level of significance, in favour of the alternative hypothesis. Thus, we conclude that there is a significant (positive) linear relation between the age and the mean fat reduction. This could also have been concluded from the p-value that is given by S-Plus: p = 0.0066 < 0.05 (since p is very small, the conclusion of a positive linear relation can even be stated very strongly).
- Further, S-Plus also gives the residual standard error: S = σ̂ = 0.109 (with N − 2 = 3 degrees of freedom).

- The multiple R-squared and the F-statistic are discussed later.

- Finally, S-Plus does not report confidence intervals for the parameters. But these can, of course, be calculated very easily by applying the methods of Section 6.3.3. The lower and upper limits of the 95% confidence interval of β are given by

  L = β̂ − s_β̂ t_{3;0.025}    (6.27)
  U = β̂ + s_β̂ t_{3;0.025},    (6.28)

  where β̂ = 0.0208, s_β̂ = 0.0031 and t_{3;0.025} = 3.1824. Thus, the 95% confidence interval becomes

  [0.0109; 0.0307].    (6.29)

  Note that 0 is not within the confidence interval, which is equivalent to concluding that β ≠ 0 at the 5% level of significance.
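A minimal sketch of how this analysis could be reproduced is given below (in R; the call mirrors the S-Plus lm() call above, but the data frame name fatplasma and the subsetting on AGE.CLASS are assumptions based on the description of the dataset, not code from the original analysis).

# Assumed layout: fatplasma contains the columns FATREDUC, AGE and AGE.CLASS.
sub <- subset(fatplasma, AGE.CLASS == "Extremely.low")
fit <- lm(FATREDUC ~ AGE, data = sub)

summary(fit)                 # coefficient table, residual standard error, R-squared, F-statistic
confint(fit, level = 0.95)   # 95% confidence intervals (Equations 6.27 and 6.28)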
6.4 Predictions
In Section 6.1 we have built the regression model from an ANOVA model by modelling the mean of the response variable as μ(X_i) = α + β X_i, where X_i takes values in a set of p values (e.g. {15, 25, 36, 47, 60} in the example). However, the model μ(X_i) = α + β X_i also allows us to predict the mean response at other values of the predictor X (e.g. predicting the mean fat reduction for a subject of 50 years of age). For an arbitrary X, the corresponding prediction of the mean response is simply given by

Ŷ = Ŷ(X) = μ̂(X) = α̂ + β̂ X,    (6.30)

where, as before, α̂ and β̂ are the estimators of α and β based on the sample, which does not necessarily include the value of X.
The prediction Ŷ(X) is to be interpreted as the estimate of the mean of the response variable Y at the value X of the predictor, i.e. Ŷ(X) is an estimator of μ(X) = α + β X. Therefore, we will often use μ̂(X) instead, to stress the fact that it is an estimator for the mean.
Since μ̂(X) is an estimator (in particular a point estimator), a confidence interval (interval estimator) can be calculated as well. The lower and upper limits of the (1 − α) confidence interval are given by

L = μ̂(X) − s_μ̂(X) t_{N−2;α/2}    (6.31)
U = μ̂(X) + s_μ̂(X) t_{N−2;α/2},    (6.32)

where the variance estimator s²_μ̂(X) is calculated as (without proof)

s²_μ̂(X) = s²_α̂ + X² s²_β̂ − 2 X X̄ σ̂² / ((n − 1) s²_X),    (6.33)

where X̄ is the sample mean of the predictors (X̄ = (1/n) Σ_{i=1}^{n} X_i) and s²_X = (1/(n − 1)) Σ_{i=1}^{n} (X_i − X̄)² is the sample variance of the predictors. Equation 6.33 can also be written as (without proof)

s²_μ̂(X) = [ 1/n + (X − X̄)² / Σ_{i=1}^{n} (X_i − X̄)² ] σ̂².    (6.34)

(Note that s²_μ̂(X) is also proportional to σ̂².)
Sometimes, however, it is of more interest to predict one single observation of Y at a given value X, rather than predicting the mean μ(X) at X. It is obvious that the point estimator is again given by Equation 6.30 (this is our best possible guess of an observation of Y at X). The interval estimator, on the other hand, must take additional uncertainty into account. This can best be understood by considering the following. When a mean is estimated, the uncertainty (i.e. variance) decreases as the sample size increases. Moreover, in the limit when the sample size becomes infinitely large, the variance of the estimator of the mean becomes zero (consistent estimator). When, on the other hand, a single observation is to be predicted, we must realize that the statistical model states that, for a given value of X, Var{Y} = σ², i.e. the variance of one single observation does not depend on the sample size and is always equal to the residual variance σ². It is thus exactly this variance that has to be added to the variance of the estimator of the mean response (Equation 6.34). In order to make the distinction clear between the estimator of the mean and the estimator of a single response Y, the variance of the latter is denoted by s²_Ŷ(X). Thus,

s²_Ŷ(X) = [ 1 + 1/n + (X − X̄)² / Σ_{i=1}^{n} (X_i − X̄)² ] σ̂².    (6.35)
The interval estimator (confidence interval) for the prediction of a single observation is given by

L = Ŷ(X) − s_Ŷ(X) t_{N−2;α/2}    (6.36)
U = Ŷ(X) + s_Ŷ(X) t_{N−2;α/2}.    (6.37)
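In R, both interval estimators can be obtained from a fitted model with predict(); a minimal sketch, using the fitted object fit from the sketch of Section 6.3.5 and the age of 50 years used as an example above:

new <- data.frame(AGE = 50)
predict(fit, newdata = new, interval = "confidence", level = 0.95)  # mean response (Eqs. 6.31-6.32)
predict(fit, newdata = new, interval = "prediction", level = 0.95)  # single observation (Eqs. 6.36-6.37)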
6.5 The ANOVA Table and F-test
6.5.1 The Decomposition of the Total Sum of Squares
As with an ANOVA model, there exists a decomposition of SSTot, which is calculated in exactly the same way as with ANOVA, i.e.

SSTot = Σ_{i=1}^{N} ( Y_i − Ȳ )².    (6.38)

SSTot measures the total variability of the data and is associated with N − 1 degrees of freedom. The regression sum of squares, SSR, is calculated as

SSR = Σ_{i=1}^{N} ( Ŷ_i − Ȳ )².    (6.39)

It measures the variability that is attributed to the regression relation between the mean of Y and the regressor X. SSR has 1 degree of freedom. Finally, the residual sum of squares, SSE, is given by

SSE = Σ_{i=1}^{N} ( Y_i − Ŷ_i )²,    (6.40)

and, as before, SSE measures the variability in the data that is not explained by the regression relation. SSE has N − 2 degrees of freedom.
The sums of squares SSR and SSE, divided by their degrees of freedom, 1 and N − 2, result in the mean sums of squares MSR = SSR/1 and MSE = SSE/(N − 2), respectively.
It can be shown that again MSE is an unbiased estimator of the residual variance σ².
The following decomposition holds,

SSTot = SSR + SSE,    (6.41)

and a similar decomposition holds for the degrees of freedom, i.e.

N − 1 = 1 + (N − 2).    (6.42)

The decomposition of SSTot means: the total variability of the data (SSTot) is decomposed into a part (SSR) that is explained by the regression relation, and a residual, unexplained part (SSE). The larger SSR is compared to SSE, the more evidence there is that there is a true linear regression relation μ(X) = α + βX (β ≠ 0). This argument will again result in a statistical test for testing H_0: β = 0.
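A small sketch (R) of this decomposition, computed from a fitted simple regression object such as the fit of Example 6.1 from the earlier sketch:

y.obs <- model.response(model.frame(fit))     # observed responses of the fitted model
SSTot <- sum((y.obs - mean(y.obs))^2)         # Equation 6.38
SSR   <- sum((fitted(fit) - mean(y.obs))^2)   # Equation 6.39
SSE   <- sum(residuals(fit)^2)                # Equation 6.40

all.equal(SSTot, SSR + SSE)                   # the decomposition of Equation 6.41
MSE <- SSE / df.residual(fit)                 # unbiased estimator of sigma^2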
6.5.2 F-test
As explained in the previous section, SSR and SSE can be used to construct a test for H_0: β = 0. In particular, we will use MSR and MSE instead. It can be shown that

F = MSR / MSE  ~ (under H_0)  F_{1,N−2}.    (6.43)

From this null distribution an α-level test can be constructed.
Note that SSR will tend to be large both when β < 0 and when β > 0. Thus, H_0 will only be rejected for large values of F. Hence, the F-test can only be used to test against the two-sided alternative hypothesis H_1: β ≠ 0, whereas the t-test of Section 6.3.4 can also be used for one-sided alternative hypotheses.
Finally, we remark that there is a one-to-one relation between the F-test statistic and the t-test statistic for testing H_0: β = 0. In particular, it can be shown that

T² = ( β̂ / s_β̂ )² = MSR / MSE = F.    (6.44)
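Continuing the previous sketch, the F-statistic and its equivalence with the squared t-statistic (Equation 6.44) can be checked numerically:

MSR <- SSR / 1
F   <- MSR / MSE                                    # Equation 6.43
p.F <- 1 - pf(F, df1 = 1, df2 = df.residual(fit))   # p-value of the F-test

t.o <- coef(summary(fit))[2, "t value"]             # t statistic of the slope
all.equal(F, t.o^2)                                 # F equals T^2 (Equation 6.44)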
6.5.3 ANOVA-Table
Often a regression analysis is accompanied by an ANOVA table. Table 6.1
shows its general form.
Table 6.1: An ANOVA table for a regression analysis

Source       SS      df       MS      F-value    p-value
Regression   SSR     1        MSR     MSR/MSE    p
Error        SSE     N − 2    MSE
Total        SSTot   N − 1    MSTot
6.5.4 Example
Example 6.1 continued.
Below, the ANOVA table from the S-Plus output is given.
*** Linear Model ***
Analysis of Variance Table
Response: FATREDUC
Terms added sequentially (first to last)
Df Sum of Sq Mean Sq F Value Pr(F)
AGE 1 0.5443411 0.5443411 45.79564 0.006593352
Residuals 3 0.0356589 0.0118863
Note that S-Plus does not give the line of SSTot in the ANOVA table, but,
of course, this can be simply calculated.
SSTot = SSR + SSE = 0.5443411 + 0.0356589 = 0.58. (6.45)
On the line for AGE we read the F-test for testing H_0: β = 0 against H_1: β ≠ 0. Since

F = 45.79564 > F_{1,N−2;α} = F_{1,3;0.05} = 10.13    (6.46)

we conclude that, at the 5% level of significance, the regression coefficient β is different from zero. Thus, there is a significant linear relation between the mean fat reduction and the age of the subjects. From Section 6.3.5 we know that β̂ = 0.0208, and thus we may even conclude that there is a significant positive regression relation. Of course, the same conclusion is obtained by only looking at the corresponding p-value (p = 0.006593352 < 0.05). Since the F-test and the t-test are equivalent, the p-values are exactly equal.
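The same table can be reproduced from a fitted model with the anova() function (shown in R; the same call exists in S-Plus), e.g. for the fit of Example 6.1 from the earlier sketch:

anova(fit)                  # one row for the regressor, one for the residuals (no SSTot row)
qf(0.95, df1 = 1, df2 = 3)  # critical value F_{1,3;0.05} of Equation 6.46 (about 10.13)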
6.6 Coefficient of Determination: R²
In Section 6.5.1 we have seen that SSR measures the variability that is explained by the regression, that SSE measures the residual (unexplained) variability, and that SSTot is a measure of the total variability of the observations in the sample. Since SSTot = SSR + SSE, it is also meaningful to consider the fraction of SSR over SSTot. This proportion is called the coefficient of determination, and it is calculated as

R² = SSR / SSTot,    (6.47)

i.e. R² is the proportion of explained variance over the total variance. Since 0 ≤ R² ≤ 1, R² is easy to interpret: the larger R², the better the regression relation.
Example 6.2. In the S-Plus output of the example of Section 6.3.5, we find that

R² = 0.9385.    (6.48)

Thus, about 94% of the total variability of the fat reduction in the sample is explained by the regression relation between the mean fat reduction and the age.
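Using the sums of squares from the earlier sketch, R² can be computed directly; it equals the "Multiple R-Squared" value reported in the output.

R2 <- SSR / SSTot                        # Equation 6.47
all.equal(R2, summary(fit)$r.squared)    # matches the value printed by summary()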
6.7 Assessing the Assumptions and Diagnostics
We will illustrate the assessment of the assumptions by means of an example.
We will, however, not use the example of Section 6.1 here because when there
are only N = 5 observations it is almost impossible to assess the assumptions.
Example 6.3. In the nutrition data set, an interesting question concerns
the relation between the QI and the ENEX. In particular, we want to study
how the QI depends on ENEX. The regression model is

Y_i = α + β X_i + ε_i,    (6.49)

where Y_i and X_i are the QI and ENEX of subject i, respectively, and where α and β are the unknown parameters which will be estimated from the sample. Further, it is assumed that ε_i i.i.d. ~ N(0, σ²). The interesting research questions are:

- Is there a relation between QI and ENEX? This will be checked by means of a t- or F-test for H_0: β = 0.
- If β ≠ 0, what is the exact relation between QI and ENEX? This question will be solved by estimating α and β from the sample. By presenting the confidence intervals, we will have an idea of the accuracy of the estimation. The established (estimated) regression model can be used for prediction.

Figure 6.3: Scatter plot of QI against ENEX. The line represents the fitted regression line.
Figure 6.3 shows the raw data in a scatter plot as well as the fitted regression line.
In the beginning of this chapter we have explained how the regression model is constructed and how the regression model must be seen as a summary of the assumptions. In particular, the regression model (Equation 6.49) states the following assumption: all observations Y_i are normally distributed with mean α + βX_i and constant variance σ². This assumption can be decomposed into:
- The normality of the observations is transferred to the error terms ε_i, which have to be normally distributed with mean 0 and constant variance σ².
  This assumption can be assessed by checking the normality of the residuals e_i = Y_i − Ŷ_i by means of e.g. a QQ-plot (see Figure 6.4). This plot shows a slight asymmetric deviation from normality, but for the moment we will not consider it a problem. Only if the p-values of formal normality tests were close to α = 0.05 would we worry.
- The mean of the response Y at a given value of X is assumed to be equal to α + βX, i.e. it is assumed that there is a linear relation between the regressor X and the mean of the response. Moreover, it is assumed that the error term ε_i has mean zero for all X. If this is indeed true, then there should be no dependence between the mean of the residuals e_i and the X_i. This can be visualized by plotting the residuals against the regressor X_i.
  Figure 6.5 shows this residual plot. The line through the points is a so-called smoother. It can be interpreted as a nonlinear curve that fits the data best. If indeed the linear model for the mean holds, then the mean of the residuals will indeed be equal to zero for all X. In the plot, this means that the smooth line should be more or less constant and equal to zero. In Figure 6.5 we see that this is more or less the case. At least there is no evidence for a systematic deviation.
- In the previous paragraphs, we have already checked the normality of ε_i and the constant zero mean of ε_i. Further, it is assumed that the error terms ε_i have constant variance σ² for all X. Again this can be assessed by looking at the residuals in the residual plot of Figure 6.5. If indeed the error terms have constant variance, then we would expect the residuals e_i to be spread around zero with more or less the same variability for every X. In Figure 6.5 the variability of the residuals seems indeed to be fairly constant. Maybe for regressor values 2000 < X < 3000 a slight increase in variance might be observed, but, on the other hand, there are more observations in this region, so that there is more chance of observing extremely large or small observations, which visually results in the impression of an increased variance.
- Another way to assess the assumption of constancy of variance is to plot the absolute values of the residuals against the predictor X. This is shown in Figure 6.6. If the assumption holds, then we would expect the mean of the absolute values of the residuals to be constant w.r.t. the regressor X. Figure 6.6 shows that, in the example, there is almost no substantial evidence that the variance changes with X.

Figure 6.4: QQ-plot of the residuals.

Figure 6.5: Residual plot.
Finally we show residual plots of an artificial example where the linear model does not hold and where the assumption of constancy of variance does not hold either. Figure 6.7 shows the scatter plot of the raw data, the residual plot and the plot of the absolute values of the residuals against the predictor.
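The diagnostic plots of this section could be produced along the following lines (a sketch in R; the data frame name nutrition and its columns QI and ENEX are assumptions based on the description of the dataset).

fit.qi <- lm(QI ~ ENEX, data = nutrition)

qqnorm(residuals(fit.qi)); qqline(residuals(fit.qi))   # normality of the residuals (cf. Figure 6.4)

plot(nutrition$ENEX, residuals(fit.qi),
     xlab = "ENEX", ylab = "residuals")                # residual plot (cf. Figure 6.5)
lines(lowess(nutrition$ENEX, residuals(fit.qi)))       # smoother through the residuals
abline(h = 0, lty = 2)

plot(nutrition$ENEX, abs(residuals(fit.qi)),
     xlab = "ENEX", ylab = "|residuals|")              # constancy of variance (cf. Figure 6.6)
lines(lowess(nutrition$ENEX, abs(residuals(fit.qi))))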
Figure 6.6: Plot of the absolute values of the residuals against the regressor ENEX.
Figure 6.7: The scatter plot (top) of an artificial example, its residual plot (bottom left), and the plot of the absolute values of the residuals (bottom right).
6.8 Multiple Linear Regression
In the previous sections of this chapter we have modelled the mean of the
response variable as a linear function of one single continuous regressor. Geo-
metrically, this resulted in a straight line. Such a regression model is referred
to as a simple linear regression model. In this section we will extend the
linear regression model to a model in which we will model the mean of the
dependent variable as a function of more than one regressor. This extended
model will be referred to as a multiple linear regression model.
6.8.1 Introductory Example
Dataset: CheeseTaste.sdd
As cheese ages, various chemical processes take place that determine the taste of the final product. This dataset contains concentrations of various chemicals in 30 samples of mature cheddar cheese, and a subjective measure of taste given by a panel of professional tasters. The chemicals are: acetic acid (Acetic), lactic acid (Lactic) and hydrogen sulphide (H2S).
In this example we will initially only be interested in modelling the mean taste score as a function of the lactic acid and hydrogen sulphide concentrations. Figure 6.8 shows a scatter plot matrix. This plot suggests that there might be a linear relation between taste and both the lactic acid and the hydrogen sulphide concentration. Instead of considering two regression models (one with Lactic and another with H2S as a regressor), we will now consider one single regression model with the two regressors simultaneously in it.
6.8.2 Statistical Model (additive model)
Suppose we want to model the mean of the response variable Y as a linear function of two regressors X_1 and X_2. The n observations in the sample can then be denoted by (Y_1, X_11, X_21), ..., (Y_n, X_1n, X_2n).
The regression model now becomes

Y_i = α + β_1 X_1i + β_2 X_2i + ε_i,    (6.50)

where ε_i i.i.d. ~ N(0, σ²) and where α, β_1 and β_2 are the unknown parameters. Geometrically, the model in Equation 6.50 represents a plane (Figure 6.9).
Figure 6.8: The scatter plot matrix of the variables taste, Lactic and H2S.
Basically, the regression model in Equation 6.50 implies the following assumptions:

- the error terms are normally distributed with mean zero and common variance σ²;
- the mean of the response variable is a linear, additive function of the regressors X_1 and X_2

(note that additive again means that the (linear) effect of the regressor X_1 is additive to the (linear) effect of the regressor X_2).
The regression model can be further extended to include (p − 1) ≥ 1 regressors X_1, ..., X_{p−1}. We will not treat this here in detail, though all properties that are discussed in this section can be readily applied to it.
As with the simple linear regression model, the unknown parameters in the multiple linear regression model can be estimated by means of the least squares method. As before, the estimates are denoted by α̂, β̂_1 and β̂_2. Furthermore, they are again normally distributed with means α, β_1 and β_2, respectively, and a variance that is proportional to the residual variance σ². The latter variance is again estimated as the residual mean squared error (for general p),

σ̂² = MSE = (1 / (n − p)) Σ_{i=1}^{n} ( Y_i − Ŷ_i )²,    (6.51)

where Ŷ_i = α̂ + β̂_1 X_1i + ... + β̂_{p−1} X_{p−1,i} are the fitted values. Since the parameter estimators are again normally distributed and since their variances are proportional to the residual variance, the standardized parameter estimators, with the variance replaced by its estimator, are again t-distributed with n − p degrees of freedom. Hence, statistical tests and confidence intervals may be computed in exactly the same way as before.

Figure 6.9: The geometrical representation of the additive regression model of Equation 6.50.
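As a small illustration of Equation 6.51 for the cheese example (a sketch in R; the data frame name cheese and the column names follow the S-Plus call shown in the example below):

fit2 <- lm(taste ~ Lactic + H2S, data = cheese)

n <- nrow(cheese)
p <- length(coef(fit2))                   # here p = 3: alpha, beta_1 and beta_2
MSE <- sum(residuals(fit2)^2) / (n - p)   # sigma.hat^2 with n - p degrees of freedom
sqrt(MSE)                                 # the residual standard error reported in the output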
6.8.3 The F-Test and R²
As with the simple linear regression model, the total sum of squares (SSTot) can be decomposed into a term measuring the variability that can be explained by the regression (SSR) and a term measuring the residual, unexplained variability (SSE): SSTot = SSR + SSE (the corresponding degrees of freedom decompose similarly: n − 1 = (p − 1) + (n − p)). Sometimes the regression sum of squares, SSR, is further decomposed into terms measuring the variability that can be explained by each regressor separately, but we will not discuss this here any further.
Based on the decomposition of the total sum of squares a hypothesis test can be constructed for testing

H_0: β_1 = ... = β_{p−1} = 0    (6.52)

against

H_1: at least one β_j is different from zero.    (6.53)

In particular it is an F-test. The test statistic and its null distribution are

F = MSR / MSE = (SSR/(p − 1)) / (SSE/(n − p))  ~ (under H_0)  F_{p−1, n−p}.    (6.54)
Thus, if the null hypothesis is accepted, then there is no linear regression relation at all, and when the null hypothesis is rejected, the mean response is linearly related to at least one predictor.
The R²-value (coefficient of determination) can again be calculated in exactly the same way as before,

R² = SSR / SSTot.    (6.55)

Its interpretation is also as before: the fraction of the total variability (SSTot) that is explained by the regression (SSR).
6.8.4 Example
Example 6.8.1 continued.
We consider the regression model

Y_i = α + β_1 X_1i + β_2 X_2i + ε_i,    (6.56)

where ε_i i.i.d. ~ N(0, σ²), and where X_1 and X_2 represent Lactic and H2S, respectively, and Y is the dependent variable taste.
The S-Plus output of the regression analysis is given below.
*** Linear Model ***
Call: lm(formula = taste ~ Lactic + H2S, data = cheese, na.action =
na.exclude)
Residuals:
Min 1Q Median 3Q Max
-17.34 -6.53 -1.164 4.844 25.62
Coefficients:
Value Std. Error t value Pr(>|t|)
(Intercept) -27.5918 8.9818 -3.0720 0.0048
Lactic 19.8872 7.9590 2.4987 0.0188
H2S 3.9463 1.1357 3.4748 0.0017
Residual standard error: 9.942 on 27 degrees of freedom
Multiple R-Squared: 0.6517
F-statistic: 25.26 on 2 and 27 degrees of freedom, the p-value is 6.551e-007
Figure 6.10: QQ-plot of the residuals.
It is again important to realize that we can only formally interpret the statistical tests in the output if the assumptions are fulfilled:

- Normality of the error terms.
  This can be assessed by looking at the QQ-plot of the residuals (see Figure 6.10). Figure 6.10 does not suggest any serious deviation from the assumption of normality.

- Constancy of the variance, i.e. the variance of Y (or, equivalently, of ε) should be equal for all X_1 and all X_2.
  This assumption may be assessed by looking at the residual plots, which are constructed as the residuals against X_1 (Figure 6.11, left) and against X_2 (Figure 6.11, right). In neither of the two residual plots is there clear evidence against the assumption of constancy of the variance. (Notes: (1) never mind the effect of one single outlier when checking for constancy of variance; (2) Figure 6.11 may suggest a slight increase in variance with increasing Lactic, but we will not consider this increase substantial.)

- Linearity of the model, i.e. the statistical model implies that in reality there is indeed a linear relation between the mean response and the two regressors X_1 and X_2.
  The residual plots in Figure 6.11 may again be used. In particular, we now expect the residuals to be nicely spread around zero for all predictor values. Or, equivalently, we expect that the mean of the residuals does not depend on the predictors anymore. The smooth line in the residual plots may help in assessing this assumption. In both residual plots we do not see any substantial deviation from what we expect under the assumption of linearity. Therefore, we conclude that the assumption of linearity holds in this example, i.e. the statistical model that we work with is appropriate.

Figure 6.11: Residual plots against Lactic (left) and H2S (right).
Thus, so far we have only assessed the assumptions underlying the statistical model, and we have concluded that all assumptions hold. Thus, we may proceed with interpreting the S-Plus output of the regression analysis. In particular, we conclude from the S-Plus output:

- The estimated regression model, which may be used for e.g. prediction, is

  Ŷ(X_1, X_2) = α̂ + β̂_1 X_1 + β̂_2 X_2    (6.57)
             = −27.5918 + 19.8872 X_1 + 3.9463 X_2.    (6.58)

  (Note: as with the simple linear regression model, prediction intervals may be calculated, but we will not discuss this here.)

- Often it is of interest to test whether the observed regression relation is indeed present in reality or not, i.e. we would like to test

  H_0: β_1 = β_2 = 0.    (6.59)

  This can be tested by means of the F-test. In the output we read F = 25.26 on 2 and 27 degrees of freedom, resulting in a p-value equal to 6.551e-007 ≈ 0 < 0.05. Hence we conclude that there is a regression relationship, at the 5% level of significance, with at least one of the two regressors.
- To see with which of the two regressors the mean taste score has a linear relation, we can test the effect of both regressors separately.
  For Lactic we test

  H_0: β_1 = 0 against H_1: β_1 ≠ 0.    (6.60)

  In the output we find p = 0.0188 < 0.05. Thus we conclude, at the 5% level of significance, that there is a linear relation between the mean taste score and the lactic acid concentration of the cheese.
  For H2S we test

  H_0: β_2 = 0 against H_1: β_2 ≠ 0.    (6.61)

  In the output we find p = 0.0017 < 0.05. Thus we conclude, at the 5% level of significance, that there is also a linear relation between the mean taste score and the hydrogen sulphide concentration of the cheese.

- From R² = 0.6517 we learn that about 65% of the total variability in the taste score is explained by the regression relation. Although this is not a large value, there was sufficient evidence in the data that the regression relation is significant. Thus, we have shown that the mean taste score does depend linearly on Lactic and H2S, but since R² is not very large, there is probably still a lot of variability of the observed taste scores left unexplained.
Finally, we return to the estimated regression model,

Ŷ(X_1, X_2) = −27.5918 + 19.8872 X_1 + 3.9463 X_2,    (6.62)

for some more details on the interpretation of the model. When more than one predictor is included in the model, the interpretation of the β-parameters needs some careful consideration. In particular, in this example, we have

for Lactic (X_1):
  two types of cheese with the same H2S, but with a Lactic of one unit difference, have an estimated difference in mean taste score of 19.8872 units;

for H2S (X_2):
  two types of cheese with the same Lactic, but with an H2S of one unit difference, have an estimated difference in mean taste score of 3.9463 units.

We say that the parameters have a conditional interpretation. In other words, β_1 is the effect of X_1, controlled for X_2 (actually, controlled for the linear effect of X_2), and, vice versa, β_2 is the effect of X_2, controlled for X_1. In a simple linear regression model, where e.g. the mean of Y is modelled as a linear function of one predictor, say X_1, the interpretation of β_1 is in some way averaged over all other possible predictors. Thus, in such a simple linear regression model, β_1 would have the interpretation of the difference in mean taste score between two types of cheese with a difference in Lactic of one unit, but without any knowledge about the H2S of the two types of cheese.
Consider the (fitted) multiple linear regression model of Equation 6.62:

Ŷ(X_1, X_2) = −27.5918 + 19.8872 X_1 + 3.9463 X_2.    (6.63)

According to the discussion on the interpretation of this model, it is clear that the effect of X_1 does not depend on a particular value of X_2, i.e. β_1 (estimated as β̂_1 = 19.8872) is the change in the mean of Y for an increase in X_1 of one unit, for a given value of X_2, but the particular value of X_2 does not matter! For example, the estimated increase in the mean of Y for a cheese with X_1 = 4 as compared to a cheese with X_1 = 3, when both cheeses have X_2 = 2, is

Ŷ(4, 2) − Ŷ(3, 2) = (−27.5918 + 19.8872 × 4 + 3.9463 × 2)
                    − (−27.5918 + 19.8872 × 3 + 3.9463 × 2)
                    = 19.8872 (= β̂_1).

And the estimated increase in the mean of Y for a cheese with X_1 = 4 as compared to a cheese with X_1 = 3, but now with both cheeses having X_2 = 7, is

Ŷ(4, 7) − Ŷ(3, 7) = (−27.5918 + 19.8872 × 4 + 3.9463 × 7)
                    − (−27.5918 + 19.8872 × 3 + 3.9463 × 7)
                    = 19.8872 (= β̂_1).

Thus, the effect of X_1 does not depend on the value of X_2 (and vice versa). This property characterizes an additive model, i.e. there is no interaction effect between X_1 and X_2 on the mean of Y.
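This additive property can also be checked numerically from the fitted model (a sketch in R, using the object fit2 from the earlier sketch):

grid <- data.frame(Lactic = c(3, 4, 3, 4), H2S = c(2, 2, 7, 7))
pred <- predict(fit2, newdata = grid)

pred[2] - pred[1]   # difference at H2S = 2: equals beta.hat_1 = 19.8872
pred[4] - pred[3]   # difference at H2S = 7: the same value, because the model is additive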
Figure 6.12: The geometrical representation of the interaction regression
model of Equation 6.64
6.8.5 Interaction
In the previous section we have worked with an additive multiple linear regression model, i.e. there was no interaction effect between X_1 and X_2 on the mean of Y. The model in Equation 6.56 can be extended to include an interaction term:

Y_i = α + β_1 X_1i + β_2 X_2i + β_3 X_1i X_2i + ε_i,    (6.64)

where ε_i i.i.d. ~ N(0, σ²) and where α, β_1, β_2 and β_3 are the unknown parameters. The parameters β_1 and β_2 refer to the main linear effects of the regressors X_1 and X_2. The parameter β_3 refers to the interaction effect.
Geometrically, the model in Equation 6.64 no longer represents a flat plane but a twisted surface (Figure 6.12). This figure (of an artificial example) illustrates e.g. that for small values of X_2 there is a positive linear effect of X_1 (positive slope), but for large values of X_2 there is a negative linear effect of X_1 (negative slope). Thus, in such a model, we cannot conclude anything about the effect of X_1 without specifying the value of X_2.
To make the interpretation of the interaction model clearer, we rewrite model 6.64 as

Y_i = α + (β_1 + β_3 X_2i) X_1i + β_2 X_2i + ε_i.    (6.65)

This representation shows that the regression coefficient of X_1 is equal to β_1 + β_3 X_2, i.e. the regression coefficient of X_1 is a function of X_2. Thus, when e.g. β_3 < 0, the effect of X_1 decreases with increasing value of X_2. Model 6.64 could also have been rewritten as

Y_i = α + β_1 X_1i + (β_2 + β_3 X_1i) X_2i + ε_i,    (6.66)

which illustrates that the same reasoning applies to the effect of X_2.
Since basically we could define a third regressor X_3 = X_1 X_2, the interaction model is just a special case of a multiple linear model with 3 regressors (p = 4). This implies that all parameters can again be estimated by means of the least squares method. Again all parameter estimators are normally distributed, etc. Thus, we can simply test the hypothesis that there is no interaction by fitting model 6.64 and applying a t-test to test

H_0: β_3 = 0 against H_1: β_3 ≠ 0.    (6.67)
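A sketch of how the interaction model could be fitted and the hypothesis (6.67) tested for the cheese example (R formula syntax, in which Lactic:H2S denotes the product term):

fit3 <- lm(taste ~ Lactic + H2S + Lactic:H2S, data = cheese)  # equivalently: taste ~ Lactic * H2S
summary(fit3)   # the row Lactic:H2S gives the t-test of H0: beta_3 = 0
# If the interaction is not significant, drop it and interpret the additive model fit2.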
6.8.6 Model Selection
From the previous section we have learnt that in an interaction model the parameters β_1 and β_2 (cf. main effects) are meaningless when we want to conclude anything about the effect of one of the regressors without specifying the other regressor. This way of reasoning is very similar to what we have seen in Chapter 5 with ANOVA models. Therefore, it seems reasonable to adopt a similar way of data modelling:

- start with a model including the interaction term β_3 X_1 X_2;
- test for interaction, i.e. H_0: β_3 = 0 against H_1: β_3 ≠ 0;
  - when H_0 is accepted, eliminate the interaction term from the model and fit the additive model in order to draw conclusions about the main effects of the regressors;
  - when H_0 is rejected, we cannot proceed as with an ANOVA model with interaction, i.e. it is now not possible to split the dataset according to one regressor (reason: a regressor is a continuous variable; splitting according to such a variable might result in sub-datasets containing only one observation each!). Thus, in this case we will have to try to interpret the interaction model.
Chapter 7
General Linear Model
In the two previous chapters we have seen two types of statistical models:
ANOVA models and regression models. Both models have in common that
they model the mean of a continuous response variable. The former uses one
or more factors, whereas the latter uses one or more continuous regressors.
In this chapter, for a very simple special case, we will see that both types of
models are equivalent. In particular, an ANOVA model can be formulated as
a regression model. Once this equivalence is known, both types of models can
be combined to include simultaneously a factor and a continuous regressor.
Such a model is called a General Linear model (GLM).
In this chapter we will restrict the discussion to one factor with only 2 levels,
in combination with 1 continuous regressor.
7.1 Regression Formulation of an ANOVA Model
Consider an ANOVA model with one factor with two levels. The ANOVA model is given by

Y_ij = μ + α_i + ε_ij,    (7.1)

where ε_ij i.i.d. ~ N(0, σ²) and α_1 + α_2 = 0 (i = 1, 2; j = 1, ..., n_i).
Next we will construct a regression model of which we will show that it is completely equivalent to the above ANOVA model.
Let now the index k denote the number of the observation. In particular, Y_1 = Y_{1,1}, Y_2 = Y_{1,2}, ..., Y_{n_1} = Y_{1,n_1}, Y_{n_1+1} = Y_{2,1}, ..., Y_{n_1+n_2} = Y_{2,n_2}. Further, let n = n_1 + n_2. Next, we define a regressor X_k as follows,

X_k = 1 if observation k is in group 1
X_k = 0 if observation k is in group 2.    (7.2)

A regressor that is defined as above is referred to as a dummy variable or an indicator variable. Consider the regression model

Y_k = α + β X_k + ε_k,    (7.3)

where ε_k i.i.d. ~ N(0, σ²) (k = 1, ..., n). This model looks exactly like a simple linear regression model.
In order to show the equivalence of both models, we write model 7.3 for the observations in the first group (X = 1),

Y_k = α + β + ε_k,    (7.4)

k = 1, ..., n_1. For the observations in the second group (X = 0), model 7.3 becomes

Y_k = α + ε_k,    (7.5)

k = n_1 + 1, ..., n_1 + n_2. Thus, the mean difference in the response variable between the two groups is simply β. This mean difference is, according to the ANOVA model (Equation 7.1), equal to α_1 − α_2 = 2α_1. Thus,

β = 2α_1,    (7.6)

and also, the ANOVA null hypothesis H_0: α_1 = α_2 = 0 is exactly equivalent to H_0: β = 0. Moreover (without proof), the statistical test for testing H_0: β = 0 in the regression model 7.3 is exactly the same as the test for testing H_0: α_1 = α_2 = 0 in the original ANOVA model 7.1.
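The equivalence can be illustrated numerically on simulated data (a sketch in R; the data are artificial and not one of the course datasets):

set.seed(1)
group <- rep(c(1, 2), each = 10)                             # two groups, n1 = n2 = 10
y     <- rnorm(20, mean = ifelse(group == 1, 5, 7), sd = 1)  # simulated responses
x     <- as.numeric(group == 1)                              # dummy variable of Equation 7.2

summary(lm(y ~ x))            # t-test of H0: beta = 0
anova(lm(y ~ factor(group)))  # one-way ANOVA F-test: same p-value, and F = t^2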
7.2 General Linear Model: Example
We will introduce and discuss the GLM by means of an example.
Dataset: RBC.sdd
Two groups of people are compared w.r.t. their red blood cell count (rbc).
One group consists of people that have a very busy life and sleep on average
less than 7 hours per day. The other group contains people that sleep on
average more than 7 hours per day. In the dataset, the groups are coded as 0 and 1, respectively. From both groups a random sample of 10 persons is taken. From each of these persons the RBC is measured and their age (age) is recorded as well.

Figure 7.1: Boxplot of RBC in both groups.

Figure 7.1 shows the boxplots of the RBC in both groups. From this plot one may immediately see that there seems to be a clear difference in RBC between both groups. This is indeed formally confirmed by means of a t-test, of which the S-Plus output is given here:
Standard Two-Sample t-Test
data: x: rbc with group = 0 , and y: rbc with group = 1
t = -10.1674, df = 18, p-value = 0
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-98.29703 -64.63062
sample estimates:
mean of x mean of y
169.868 251.3319
From the t-test we would conclude that there is a very significant (p = 0) difference in mean RBC between both groups. The mean difference is estimated as 251.3319 − 169.868 = 81.4639, in the sense that people with more than 7 hours of sleep have on average 81.4639 RBC units more than people with less than 7 hours of sleep. A 95% confidence interval of the mean difference is [64.63062, 98.29703] (0 is clearly far away from the boundary of this interval).

Figure 7.2: Boxplot of age in both groups.
Since we also have information on the age of the subjects, it may also be interesting to look at the distribution of the age over the two groups. This is shown in the boxplots in Figure 7.2. This plot suggests that there is no balanced distribution of the age over the two groups, although the sampling was completely at random! A possible explanation is that the population of long sleeping people is to a large extent characterized by older persons, whereas the population of short sleeping people is characterized by younger persons. Furthermore, if there exists a relation between the mean RBC and the age, then the observed difference in mean RBC may be partly due to the difference in age between both groups instead of the difference in hours of sleep! This should be further investigated. We will use a GLM that contains both the factor for group (as a dummy variable) and a regressor for age.
Before we continue with the GLM, we will present the data in a plot in which we can see both the effect of group and the effect of age (Figure 7.3). In this figure we see a very clear linear relation between mean RBC and age. As in the boxplot (Figure 7.1) we see again the large difference in the RBC sample means.
First we define the dummy variable as (i = 1, ..., 20)

X_1i = 0 if observation i is in group 0 (< 7 hours)
X_1i = 1 if observation i is in group 1 (> 7 hours).    (7.7)

Thus, X_1i is the first regressor, defined as a dummy variable to act as the two-level factor. The age is the second regressor X_2i. The GLM is

Y_i = α + β_1 X_1i + β_2 X_2i + ε_i,    (7.8)
where ε_i i.i.d. ~ N(0, σ²) (i = 1, ..., 20). Since the model in Equation 7.8 is basically a multiple linear regression model, the regression parameters β_1 and β_2 have the same kind of interpretation as in a multiple linear regression model. Thus, the effect of group, measured by β_1, is to be interpreted as the mean difference in RBC between a group of short sleeping people of age X_2 and a group of long sleeping people of the same age X_2 (it does not matter what particular age they have). This is easily seen by writing model 7.8 for people in the first group (X_1 = 0),

Y_i = α + β_2 X_2i + ε_i.    (7.9)

Figure 7.3: A scatter plot of RBC against age. The circles and the triangles represent the short and the long sleepers, respectively. The two horizontal lines represent the sample means of RBC of both groups.
This model basically says that the RBC of a short sleeping person of age X_2 is

Y ~ N( α + β_2 X_2, σ² ).    (7.10)

For the subjects in the second group (X_1 = 1), the model becomes

Y_i = α + β_1 + β_2 X_2i + ε_i,    (7.11)

which implies that the RBC of a long sleeping person of age X_2 is

Y ~ N( α + β_1 + β_2 X_2, σ² ).    (7.12)
Thus, β_1 is indeed the difference in mean RBC between both groups of people of age X_2 (whatever the value of X_2). Thus, β_1 measures the effect of hours of sleep, controlled for age (i.e. the effect of age is eliminated).
The S-Plus output of the GLM is given below.
*** Linear Model ***
Call: lm(formula = rbc ~ group + age, data = RBC, na.action = na.exclude)
Residuals:
Min 1Q Median 3Q Max
-1.857 -0.8047 0.3009 0.6348 1.623
Coefficients:
Value Std. Error t value Pr(>|t|)
(Intercept) 78.9711 1.2497 63.1904 0.0000
group 0.9030 1.1621 0.7770 0.4478
age 3.0400 0.0404 75.2259 0.0000
Residual standard error: 1.009 on 17 degrees of freedom
Multiple R-Squared: 0.9996
F-statistic: 19130 on 2 and 17 degrees of freedom, the p-value is 0
From this output we conclude the following:
- The fitted GLM is

  Ŷ(X_1, X_2) = 78.9711 + 0.9030 X_1 + 3.04 X_2    (7.13)

  (we used α̂ = 78.9711, β̂_1 = 0.9030 and β̂_2 = 3.04). This fitted model may be used for prediction. Note again that β̂_1 = 0.903 is the estimated effect of group, i.e. the mean difference in RBC between both groups, controlled for age, is estimated as 0.903, in the sense that long sleeping persons have on average a larger RBC than short sleeping persons.
- Thus, the effect of sleeping is estimated by β̂_1 = 0.903. To take the sampling variability into account, it is better to give the 95% confidence interval:

  [0.903 − 1.1621 t_{17;0.025}, 0.903 + 1.1621 t_{17;0.025}] = [−1.545, 3.355].

  From this interval estimator we may immediately conclude that H_0: β_1 = 0 is not rejected at the 5% level of significance. The same conclusion is obtained by looking at the result of the t-test for this null hypothesis: p = 0.4478 > 0.05.
- Hence, by controlling for age (i.e. eliminating the effect of age on the mean RBC), there is no significant difference between the two groups anymore! Thus, the difference that we concluded from the t-test was mainly due to the effect of age on the mean RBC and the fact that the groups differed in their age distribution.

- We may also look at the effect of age: β̂_2 = 3.04, which is highly significantly different from zero (p = 0). This shows that age is linearly related to the mean RBC.
  If, however, we had concluded here that the effect of age was not significant, then the effect of age might still have been retained in the model. (Reason: a non-significant result is a weak conclusion, and if the researcher believes that a variable may have an effect, it may always be retained in the model. Moreover, by keeping it in the model, it changes the interpretation of the other parameters in the model in such a way that the effect of age is guaranteed to be eliminated from them.)
Finally, Figure 7.4 shows a scatter plot of the residuals of the regression model

Y_i = α + β_2 X_2i + ε_i    (7.14)

(i.e. a simple linear regression model with only age in it). In Figure 7.4 the residuals are plotted against age. Now the linear effect of age is of course eliminated. The difference in mean RBC between the two groups is now indeed much smaller. This difference is approximately the difference as measured by β̂_1 in the GLM.
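A sketch of how this GLM analysis and the residual plot of Figure 7.4 could be reproduced (in R; the data frame RBC with columns rbc, group and age is taken from the lm() call in the output above):

fit.glm <- lm(rbc ~ group + age, data = RBC)   # group is the 0/1 dummy variable
summary(fit.glm)                               # beta.hat_1 = 0.903 with p = 0.4478
confint(fit.glm)["group", ]                    # 95% interval for beta_1: about [-1.545, 3.355]

fit.age <- lm(rbc ~ age, data = RBC)           # Equation 7.14: age only
plot(RBC$age, residuals(fit.age),
     pch = ifelse(RBC$group == 1, 17, 1),      # triangles: long sleepers, circles: short sleepers
     xlab = "age", ylab = "residuals")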
Figure 7.4: A scatter plot of residuals against age. The circles and the triangles represent the short and the long sleepers, respectively. The two horizontal lines represent the sample means of the residuals of both groups.