
MSc Quantitative Techniques

Statistics
Ali C. Tasiran and Roald J. Versteeg
Department of Economics, Mathematics and Statistics
Malet Street, London WC1E 7HX
September 2013
MSc Economics & MSc Financial Economics (FT & PT2)
MSc Finance & MSc Financial Risk Management (FT & PT2)
MSc Financial Engineering (FT & PT1)
PG Certificate in Econometrics (PT1)
Syllabus
Who is this course for?
This part of the MSc Quantitative Techniques course reviews basic statistics. The course is
designed for full-time and second-year part-time students in MSc Economics, MSc Finance,
MSc Financial Economics, and MSc Financial Risk Management; full-time and first-year
part-time students in MSc Financial Engineering; and first-year part-time students in
the PG Certificate in Econometrics.
Course Objectives
This course is a refresher of mathematical statistics. The emphasis will be on probability
and distribution theories together with estimation and hypothesis testing involving several
parameters. These theories are presented with a view to applying them to econometric
theory.
Outline of Topics
The range of topics covered in this course is listed below, split into the two main parts.
I Probability and Distribution Theories
1. Preliminaries.
2. Probability.
3. Random variables and probability distributions.
4. Expectations and moments.
II Statistical Inference
5. Sampling.
6. Large sample theory.
7. Estimation.
8. Maximum likelihood.
9. Tests of statistical hypotheses.
Textbooks
Lecture notes are provided. However, these are not a substitute for a textbook. We do
not recommend any particular text, but the following two textbooks both provide a good
overview of the topics covered in this class.
Rice, J. (2006). Mathematical Statistics and Data Analysis, 3rd ed., Cengage.
Wackerly, D., Mendenhall, W. and Scheaffer, R. (2008). Mathematical Statistics with
Applications, 7th ed., Cengage.
Students who desire a more advanced treatment of the materials might want to consider:
Casella, G. and Berger, R. (2001). Statistical Inference, 2nd ed., Duxbury press.
The following books are recommended for students that plan to take further courses in
econometrics. The appendices of these books also contain summaries of the material covered
in this class.
Greene, W. (2011). Econometric Analysis, 7th ed., Prentice-Hall.
Verbeek, M. (2012). A Guide to Modern Econometrics, 4th ed., Wiley.
Teaching arrangements and Assessment
There will be three lectures per week for four weeks. Performance in this course is assessed
through a written exam. You are required to pass this examination to continue on the MSc
programme. No resits are held.
Instructors
The instructors of this course are
Ali Tasiran, atasiran@ems.bbk.ac.uk
Roald Versteeg, r.versteeg@bbk.ac.uk
Contents
I Probability and Distribution Theories
1 Some preliminaries
Problems
2 Probability
2.1 Classical definition of probability
2.2 Frequency definition of probability
2.3 Subjective definition of probability
2.4 Axiomatic definition of probability
Problems
3 Random variables and probability distributions
3.1 Discrete Distributions
3.1.1 The Bernoulli Distribution
3.2 Continuous Distributions
3.2.1 Uniform Distribution on an Interval
3.2.2 The Normal Distribution
3.2.3 Example
Problems
4 Expectations and moments
4.1 Mathematical Expectation
4.1.1 Mean of a Distribution
4.1.2 Moments of a Distribution
4.1.3 Variance and Standard Deviation
4.1.4 Expectations and Probabilities
4.1.5 Approximate Mean and Variance of g(X)
4.1.6 Mode of a Distribution
4.1.7 Median, Upper and Lower Quartiles, and Percentiles
4.1.8 Coefficient of Variation
4.1.9 Skewness and Kurtosis
4.2 Moments
4.2.1 Mathematical Expectation
4.2.2 Moments
4.2.3 Joint Moments
4.2.4 Covariance
4.2.5 Correlation
Problems
II Statistical Inference
5 Sampling
Literature
5.1 Populations and Samples
5.2 Random, Independent, and Dependent Samples
5.3 Sample Statistics and Sampling Distributions
5.3.1 Sample Mean
5.3.2 Sample Variance
5.3.3 Finite Population Correction
Problems
6 Large sample theory
Literature
6.1 Law of Large Numbers
6.1.1 Convergence in Probability
6.1.2 Almost Sure Convergence
6.1.3 Convergence in Mean
6.1.4 Weak Law of Large Numbers
6.1.5 Strong Law of Large Numbers
6.2 The Central Limit Theorem
6.2.1 Convergence in Distribution
6.2.2 Definition CLT
6.3 Relations Between Convergence Modes
6.4 Distributions derived from the normal
6.4.1 Chi-Square
6.4.2 Student-t
6.4.3 F-distribution
Problems
7 Estimation
Literature
7.1 Point Estimation
7.1.1 Unbiasedness
7.1.2 Efficiency
7.1.3 Sufficiency
7.1.4 Consistency
7.1.5 Asymptotic Efficiency
7.2 Interval Estimation
7.2.1 Pivotal-Quantity Method of Finding CI
7.2.2 CI for the mean of a normal population
7.2.3 CI for the variance of a normal population
Problems
8 Maximum likelihood
Literature
8.1 Introduction
8.1.1 Estimation
8.2 Large sample properties
8.2.1 Consistency
8.2.2 Efficiency
8.2.3 Normality
8.2.4 Invariance
8.3 Variance-Covariance Matrix
Problems
9 Hypothesis testing
Literature
9.1 Hypotheses
9.2 The Elements of a Statistical Test
9.2.1 Null and Alternative Hypotheses
9.2.2 Simple and Composite Hypotheses
9.3 Statistical Tests
9.3.1 Type I and Type II Errors
9.4 Test statistics derived from maximum likelihood
9.4.1 Likelihood Ratio
9.4.2 Wald Test
9.4.3 Lagrange Multiplier Test
Problems
A Distributions
A.1 Discrete Distributions
A.1.1 The Bernoulli Distribution
A.1.2 The Binomial Distribution
A.1.3 Geometric Distribution
A.1.4 Poisson Distribution
A.1.5 Hypergeometric Distribution
A.1.6 Negative Binomial Distribution
A.1.7 Simple Random Walk
A.2 Continuous Distributions
A.2.1 Beta Distribution
A.2.2 Cauchy Distribution
A.2.3 Chi-Square Distribution
A.2.4 Exponential Distribution
A.2.5 F Distribution
A.2.6 Gamma Distribution
A.2.7 Lognormal Distribution
A.2.8 Normal Distribution
A.2.9 Student's t Distribution
A.2.10 Uniform Distribution on an Interval
A.2.11 Extreme Value Distribution (Gompertz Distribution)
A.2.12 Geometric Distribution
A.2.13 Logistic Distribution
A.2.14 Pareto Distribution
A.2.15 Weibull Distribution
B Multivariate distributions
B.1 Bivariate Distributions
B.1.1 The Bivariate Normal Distribution
B.1.2 Mixture Distributions
B.2 Multivariate Density Functions
B.2.1 The Multivariate Normal Distribution
B.2.2 Standard multivariate normal density
B.2.3 Marginal and Conditional Distributions of N(μ, Σ)
Problems
C Practice Exams
C.1 Exam I
C.2 Exam II
Part I
Probability and Distribution Theories
Chapter 1
Some preliminaries
Statistics is the science of observing data and making inferences about the characteristics
of a random mechanism that has generated the data. It is also called the science of uncertainty.
In Economics, theoretical models are used to analyze economic behavior. Economic
theoretical models are deterministic functions, but in the real world the relationships are not
exact and deterministic but rather uncertain and stochastic. We thus employ distribution
functions to make approximations to the actual processes that generate the observed data.
The process that generates the data is known as the data generating process (DGP or Super
Population). In Econometrics, to study economic relationships, we estimate statistical
models, which are built under the guidance of theoretical economic models and by taking
into account the properties of the data generating process.
Using the parameters of estimated statistical models, we make generalisations about the
characteristics of the random mechanism that has generated the data. In Econometrics, we use
observed data in samples to draw conclusions about populations. Populations are either
real, from which the data came, or conceptual, as processes by which the data were generated.
Inference in the first case is called design-based (for experimental data) and is used
mainly to study samples from populations with known frames. Inference in the second
case is called model-based (for observational data) and is used mainly to study stochastic
relationships.
The statistical theory used for such analyses is called Classical inference, and it is the
approach followed in this course. It is based on two premises:
1. The sample data constitute the only relevant information.
2. The construction and assessment of the different procedures for inference are based
on long-run behavior under similar circumstances.
The starting point of an investigation is an experiment. An experiment is a random
experiment if it satisfies the following conditions:
- all possible distinct outcomes are known ahead of time,
- the outcome of a particular trial is not known a priori,
- the experiment can be duplicated.
The totality of all possible outcomes of the experiment is referred to as the sample
space (denoted by S) and its distinct individual elements are called the sample points or
elementary events. An event is a subset of a sample space and is a set of sample points
that represents several possible outcomes of an experiment.
A sample space with a finite or countably infinite number of sample points (with a one-to-one
correspondence to the positive integers) is called a discrete space.
A continuous space is one with an uncountably infinite number of sample points (that
is, it has as many elements as there are real numbers).
Events are generally represented by sets, and some important concepts can be explained
by using the algebra of sets (known as Boolean Algebra).
Definition 1 The sample space is denoted by S. A = S implies that the events in A must
always occur. The empty set is a set with no elements and is denoted by ∅. A = ∅ implies
that the events in A do not occur.
The set of all elements not in A is called the complement of A and is denoted by Ā.
Thus, Ā occurs if and only if A does not occur.
The set of all points in either a set A or a set B or both is called the union of the two
sets and is denoted by ∪. A ∪ B means that either the event A or the event B or both occur.
Note: A ∪ Ā = S.
The set of all elements in both A and B is called the intersection of the two sets and
is represented by ∩. A ∩ B means that both the events A and B occur simultaneously.
A ∩ B = ∅ means that A and B cannot occur together; A and B are then said to be disjoint
or mutually exclusive. Note: A ∩ Ā = ∅.
A ⊂ B means that A is contained in B or that A is a subset of B, that is, every
element of A is an element of B. In other words, if an event A has occurred, then B must
have occurred also.
Sometimes it is useful to divide the elements of a set A into several subsets that are disjoint.
Such a division is known as a partition. If A_1 and A_2 are such partitions, then A_1 ∩ A_2 = ∅
and A_1 ∪ A_2 = A. This can be generalized to n partitions: A = ∪_{i=1}^n A_i with A_i ∩ A_j = ∅ for
i ≠ j.
Some postulates according to the Boolean Algebra:
Identity: There exist unique sets ∅ and S such that, for every set A, A ∩ S = A and
A ∪ ∅ = A.
Complementation: For each A we can define a unique set Ā such that A ∩ Ā = ∅ and
A ∪ Ā = S.
Closure: For every pair of sets A and B, we can define unique sets A ∪ B and A ∩ B.
Commutative: A ∪ B = B ∪ A; A ∩ B = B ∩ A.
Associative: (A ∪ B) ∪ C = A ∪ (B ∪ C). Also (A ∩ B) ∩ C = A ∩ (B ∩ C).
Distributive: A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C). Also A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C).
De Morgan's Laws: the complement of A ∪ B equals Ā ∩ B̄; the complement of A ∩ B equals Ā ∪ B̄.
Problems
1. Let the set S contain the ordered combinations of the sexes of two children:
S = {FF, FM, MF, MM}.
Let A denote the subset of possibilities containing no males, B the subset with two males,
and C the subset containing at least one male. List the elements of A, B, C, A ∩ B,
A ∪ B, A ∩ C, A ∪ C, B ∩ C, B ∪ C, and C ∩ B̄.
2. Verify De Morgan's Laws by drawing Venn Diagrams:
the complement of A ∪ B equals Ā ∩ B̄;
the complement of A ∩ B equals Ā ∪ B̄.
Chapter 2
Probability
Probability definitions and concepts
2.1 Classical definition of probability
If an experiment has n (n < ∞) mutually exclusive and equally likely outcomes, and if n_A of
these outcomes have an attribute A (that is, the event A occurs in n_A possible ways), then
the probability of A is n_A/n, denoted as P(A) = n_A/n.
2.2 Frequency definition of probability
Let n_A be the number of times the event A occurs in n trials of an experiment. If there
exists a real number p such that p = lim_{n→∞} (n_A/n), then p is called the probability of A
and is denoted as P(A). (Examples are histograms for the frequency distributions of variables.)
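The frequency definition can be illustrated numerically. Below is a minimal simulation sketch (the event, the sample sizes, and the random seed are illustrative choices, not part of the notes): it estimates the probability of rolling a six with a fair die by the relative frequency n_A/n for increasing n.

    import random

    random.seed(42)                        # illustrative seed for reproducibility

    def relative_frequency(n_trials):
        """Estimate P(roll a six) by the relative frequency n_A / n."""
        n_a = sum(1 for _ in range(n_trials) if random.randint(1, 6) == 6)
        return n_a / n_trials

    for n in (100, 10_000, 1_000_000):
        print(n, relative_frequency(n))    # should approach 1/6, about 0.1667, as n grows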
2.3 Subjective definition of probability
Subjective probabilities are personal judgments used to assess the relative likelihood of various
outcomes. They are based on educated guesses or intuition, for example: "The weather will be
rainy tomorrow with probability 0.6."
2.4 Axiomatic definition of probability
The probability of an event A is a real number such that
1) P(A) ≥ 0 for every A ∈ ℱ,
2) the probability of the entire sample space S is 1, that is, P(S) = 1, and
3) if A_1, A_2, ..., A_n are mutually exclusive events (that is, A_i ∩ A_j = ∅ for all i ≠ j),
then P(A_1 ∪ A_2 ∪ ... ∪ A_n) = Σ_i P(A_i), and this holds for n = ∞ also,
where ℱ is the set of all subsets (events) of the sample space S. The triple (S, ℱ, P(·)) is referred
to as the probability space, and P(·) is a probability measure.
We can derive the following theorems by using the axiomatic definition of probability.
Theorem 2 P(Ā) = 1 − P(A).
Theorem 3 P(A) ≤ 1.
Theorem 4 P(∅) = 0.
Theorem 5 If A ⊂ B, then P(A) ≤ P(B).
Theorem 6 P(A ∪ B) = P(A) + P(B) − P(A ∩ B).
Definition 7 Let A and B be two events in a probability space (S, ℱ, P(·)) such that P(B) > 0.
The conditional probability of A given that B has occurred, denoted by P(A | B), is
given by P(A ∩ B)/P(B). (It should be noted that the original probability space (S, ℱ, P(·))
remains unchanged even though we focus our attention on the subspace; this is (S, ℱ, P(· | B)).)
Theorem 8 Bonferroni's Theorem: Let A and B be two events in a sample space S. Then
P(A ∩ B) ≥ 1 − P(Ā) − P(B̄).
Theorem 9 Bayes' Theorem: If A and B are two events with positive probabilities, then
P(A | B) = P(A) P(B | A) / P(B).
Law of total probability
Assume that S = A_1 ∪ A_2 ∪ ... ∪ A_n, where A_i ∩ A_j = ∅ for i ≠ j. Then for any event
B ⊂ S,
P(B) = Σ_{i=1}^n P(A_i) P(B | A_i).
Theorem 10 Extended Bayes' Theorem: If A_1, A_2, ..., A_n constitute a partition of the sample
space, so that A_i ∩ A_j = ∅ for i ≠ j and ∪_i A_i = S, and P(A_i) ≠ 0 for any i, then for
a given event B with P(B) > 0,
P(A_i | B) = P(A_i) P(B | A_i) / Σ_i P(A_i) P(B | A_i).
Definition 11 Two events A and B with positive probabilities are said to be statistically
independent if and only if P(A | B) = P(A). Equivalently, P(B | A) = P(B) and
P(A ∩ B) = P(A)P(B).
The other type of statistical inference is called Bayesian inference, where sample infor-
mation is combined with prior information. The prior information is expressed in the form of a
probability distribution known as the prior distribution. When it is combined with the sample
information, a posterior distribution of the parameters is obtained. It can be derived by using
Bayes' Theorem. If we substitute Model (the model that generated the observed data) for A and
Data (the observed data) for B, then we have
P(Model | Data) = P(Data | Model) P(Model) / P(Data)          (2.1)
where P(Data | Model) is the probability of observing the data given that the Model is true.
This is usually called the likelihood (sample information). P(Model) is the probability
that the Model is true before observing the data (usually called the prior probability).
P(Model | Data) is the probability that the Model is true after observing the data (usually
called the posterior probability). P(Data) is the unconditional probability of observing the data
(whether the Model is true or not). Hence, the relation can be written
P(Model | Data) ∝ P(Data | Model) P(Model)          (2.2)
That is, the posterior probability is proportional to the likelihood (sample information) times
the prior probability. The inverse of an estimator's variance is called the precision. In
Classical Inference we use only the parameters' variances, but in Bayesian Inference we have
both a sample precision and a prior precision. The precision (inverse of the variance) of
the posterior distribution of a parameter is the sum of the sample precision and the prior
precision. For example, the posterior mean will lie between the sample mean and the prior mean,
and the posterior variance will be less than both the sample and prior variances. These are the
reasons behind the increasing popularity of Bayesian Inference in practical econometric
applications.
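As a worked illustration of equations (2.1) and (2.2), the sketch below computes a posterior probability for a hypothetical two-model example; the prior and likelihood values ("M1", "M2", 0.5, 0.8, 0.2) are invented purely for illustration.

    # Hypothetical illustration of Bayes' Theorem: two competing models (hypotheses).
    prior = {"M1": 0.5, "M2": 0.5}            # P(Model), assumed prior probabilities
    likelihood = {"M1": 0.8, "M2": 0.2}       # P(Data | Model), assumed likelihoods

    # P(Data) by the law of total probability.
    p_data = sum(prior[m] * likelihood[m] for m in prior)

    # Posterior P(Model | Data) = P(Data | Model) P(Model) / P(Data).
    posterior = {m: prior[m] * likelihood[m] / p_data for m in prior}
    print(posterior)                          # {'M1': 0.8, 'M2': 0.2}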
When we speak in econometrics of models to be estimated or tested, we refer to sets
of DGPs in the Classical Inference context. In design-based inference, we restrict our attention
to a particular sample size and characterize a DGP by the law of probability that governs
the random variables in a sample of that size. In model-based inference, we refer to a
limiting process in which the sample size goes to infinity, and it is clear that such a restricted
characterization will no longer suffice. When we indulge in asymptotic theory, the DGPs
in question must be stochastic processes. A stochastic process is a collection of random
variables indexed by some suitable index set. This index set may be finite, in which case we
have no more than a vector of random variables, or it may be infinite, with either a discrete
or a continuous infinity of elements. In order to define a DGP, we must be able to specify the
joint distribution of the set of random variables corresponding to the observations contained
in a sample of arbitrarily large size. This is a very strong requirement. In econometrics,
or any other empirical discipline for that matter, we deal with finite samples. How then
can we, even theoretically, treat infinite samples? We must in some way create a rule that
allows one to generalize from finite samples to an infinite stochastic process. Unfortunately,
for any observational framework, there is an infinite number of ways in which such a rule
can be constructed, and different rules can lead to widely different asymptotic conclusions. In the
process of estimating an econometric model, what we are doing is trying to obtain some
estimated characterization of the DGP that actually did generate the data. Let us denote
an econometric model that is to be estimated, tested, or both, as M and a typical DGP
belonging to M as μ.
The simplest model in econometrics is the linear regression model; one possibility is to
write
y = Xβ + u,   u ∼ N(0, σ²I_n)          (2.3)
where y and u are n-vectors and X is a nonrandom n × k matrix, so that y follows the
N(Xβ, σ²I_n) distribution. This distribution is unique if the parameters β and σ² are specified.
We may therefore say that the DGP is completely characterized by the model parameters. In
other words, knowledge of the model parameters β and σ² uniquely identifies an element μ in M.
On the other hand, the linear regression model can also be written as
y = Xβ + u,   u ∼ IID(0, σ²I_n)          (2.4)
with no assumption of normality. Many aspects of the theory of linear regressions still apply:
the OLS estimator is unbiased, and its covariance matrix is σ²(X′X)⁻¹. But the
distribution of the vector u, and hence also that of y, is now only partially characterized
even when β and σ² are known. For example, the errors u could be skewed to the left or to the
right, or could have fourth moments larger or smaller than 3σ⁴. Let us call the sets of DGPs
associated with these regressions M_1 and M_2, respectively; M_1 is in fact a proper subset of
M_2. For a given β and σ² there is an infinite number of DGPs in M_2 (only one of which
is in M_1) that all correspond to the same β and σ². Thus we must consider these models as
different models even though the parameters used in them are the same. In either case, it
must be possible to associate a parameter vector in a unique way to any DGP in the model
M, even if the same parameter vector is associated with many DGPs. We call the model M
together with its associated parameter-defining mapping a parametrized model. The main
task in our practical work is to build the association between the DGPs of a model and the
model parameters. For example, in the Generalized Method of Moments (GMM) context,
there are many possible ways of choosing the econometric model, i.e., the underlying set of
DGPs. One of the advantages of GMM as an estimation method is that it permits models
which consist of a very large number of DGPs. In striking contrast to Maximum Likelihood
estimation, where the model must be completely specified, any DGP is admissible if it
satisfies a relatively small number of restrictions or regularity conditions. Sometimes, the
existence of the moments used to define the parameters is the only requirement needed for
a model to be well defined.
Problems
1. A sample space consists of five simple events E_1, E_2, E_3, E_4, and E_5.
(a) If P(E_1) = P(E_2) = 0.15, P(E_3) = 0.4 and P(E_4) = 2P(E_5), find the probabilities
of E_4 and E_5.
(b) If P(E_1) = 3P(E_2) = 0.3, find the probabilities of the remaining simple events if you
know that the remaining events are equally probable.
2. A business office orders paper supplies from one of three vendors, V_1, V_2, and V_3.
Orders are to be placed on two successive days, one order per day. Thus (V_2, V_3)
might denote that vendor V_2 gets the order on the first day and vendor V_3 gets the
order on the second day.
(a) List the sample points in this experiment of ordering paper on two successive
days.
(b) Assume the vendors are selected at random each day and assign a probability to
each sample point.
(c) Let A denote the event that the same vendor gets both orders and B the event
that V_2 gets at least one order. Find P(A), P(B), P(A ∪ B), and P(A ∩ B) by
summing probabilities of the sample points in these events.
Chapter 3
Random variables and probability
distributions
Random variables, densities, and cumulative distribution functions
A random variable X is a function whose domain is the sample space and whose range is a
set of real numbers.
Definition 12 In simple terms, a random variable (also referred to as a stochastic vari-
able) is a real-valued set function whose value is a real number determined by the outcome
of an experiment. The range of a random variable is the set of all the values it can assume.
The particular values observed are called realisations x. If these are countable, x_1, x_2, ...,
the random variable is said to be discrete, with associated probabilities
P(X = x_i) = p(x_i) ≥ 0,   Σ_i p(x_i) = 1;          (3.1)
and cumulative distribution P(X ≤ x_j) = Σ_{i=1}^j p(x_i).
For a continuous random variable, defined over the real line, the cumulative distribution
function is
F(x) = P(X ≤ x) = ∫_{−∞}^x f(u) du,          (3.2)
where f denotes the probability density function
f(x) = dF(x)/dx          (3.3)
and ∫_{−∞}^{∞} f(x) dx = 1.
Also note that the cumulative distribution function satisfies lim_{x→∞} F(x) = 1 and
lim_{x→−∞} F(x) = 0.
Definition 13 The real-valued function F(x) such that F(x) = P_X(−∞, x] for each x ∈ ℝ
is called the distribution function, also known as the cumulative distribution (or
cumulative density) function, or CDF.
Theorem 14 P(a < X ≤ b) = F(b) − F(a).
Theorem 15 For each x ∈ ℝ, F(x) is continuous to the right of x.
Theorem 16 If F(x) is continuous at x ∈ ℝ, then P(X = x) = 0.
Although f(x) is defined at a point, P(X = x) = 0 for a continuous random variable.
The support of a distribution is the range over which f(x) ≠ 0.
Let f be a function from ℝ^k to ℝ. Let x_0 be a vector in ℝ^k and let y = f(x_0) be its
image. The function f is continuous at x_0 if whenever {x_n}_{n=1}^∞ is a sequence in ℝ^k which
converges to x_0, the sequence {f(x_n)}_{n=1}^∞ converges to f(x_0). The function f is said to
be continuous if it is continuous at each point in its domain.
All polynomial functions are continuous. As an example of a function that is not con-
tinuous consider
f(x) = 1 if x > 0, and f(x) = 0 if x ≤ 0.
If both g and f are continuous functions, then g(f(x)) is continuous.
3.1 Discrete Distributions
Definition 17 For a discrete random variable X, let f(x) = P(X = x). The function
f(x) is called the probability function (or probability mass function).
3.1.1 The Bernoulli Distribution
f(x; θ) = f(x; p) = p^x (1 − p)^(1−x) for x = 0, 1 (failure, success) and 0 ≤ p ≤ 1.
The Binomial Distribution
f(x; θ) = B(x; n, p) = (n choose x) p^x (1 − p)^(n−x) = [n! / (x! (n − x)!)] p^x (1 − p)^(n−x)          (3.4)
for x = 0, 1, ..., n (X is the number of successes in n trials) and 0 ≤ p ≤ 1.
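A short numerical sketch of the binomial probability function (3.4); the parameter values n = 20 and p = 0.05 are illustrative (they match Problem 3 in this chapter) and the code simply evaluates the formula.

    from math import comb

    def binomial_pmf(x, n, p):
        """Evaluate B(x; n, p) = C(n, x) p^x (1 - p)^(n - x)."""
        return comb(n, x) * p**x * (1 - p)**(n - x)

    n, p = 20, 0.05
    # P(at least three defectives) = 1 - P(X <= 2)
    p_at_least_3 = 1 - sum(binomial_pmf(x, n, p) for x in range(3))
    print(round(p_at_least_3, 4))   # approximately 0.0755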
3.2 Continuous Distributions
Definition 18 For a random variable X, if there exists a nonnegative function f(x), defined
on the real line, such that for any interval B,
P(X ∈ B) = ∫_B f(x) dx          (3.5)
then X is said to have a continuous distribution and the function f(x) is called the
probability density function or simply density function (or pdf).
The following can be written for continuous random variables:
F(x) = ∫_{−∞}^x f(u) du          (3.6)
f(x) = F′(x) = dF(x)/dx          (3.7)
∫_{−∞}^{+∞} f(u) du = 1          (3.8)
F(b) − F(a) = ∫_a^b f(u) du          (3.9)
3.2.1 Uniform Distribution on an Interval
A random variable X with the density function
f(x; a, b) = 1/(b − a)          (3.10)
on the interval a ≤ X ≤ b is said to have the uniform distribution on an interval.
3.2.2 The Normal Distribution
A random variable X with the density function
f(x; μ, σ) = [1 / (σ√(2π))] e^{−(x − μ)² / (2σ²)}          (3.11)
is called a Normal (Gaussian) distributed variable.
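The normal density (3.11) is easy to evaluate directly; the sketch below does so and checks the standardization Z = (X − μ)/σ used later in Chapter 4, with illustrative values μ = 2 and σ = 1.5.

    from math import exp, pi, sqrt

    def normal_pdf(x, mu, sigma):
        """Density of N(mu, sigma^2), equation (3.11)."""
        return exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * sqrt(2 * pi))

    mu, sigma = 2.0, 1.5
    x = 3.5
    z = (x - mu) / sigma                         # standardized value
    # f_X(x) = f_Z(z) / sigma, so the two evaluations below agree.
    print(normal_pdf(x, mu, sigma), normal_pdf(z, 0.0, 1.0) / sigma)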
3.2.3 Example
1. Toss of a single fair coin. X = number of heads.
F(x) = 0 if x < 0;  F(x) = 1/2 if 0 ≤ x < 1;  F(x) = 1 if x ≥ 1.
The cumulative distribution function (cdf) of a discrete random variable is always a step
function because the cdf increases only at a countable number of points.
f(x) = 1/2 if x = 0;  f(x) = 1/2 if x = 1.
F(x) = Σ_{x_j ≤ x} f(x_j).
Problems
1. Write P(a ≤ x ≤ b) in terms of integrals and draw a picture for it.
2. Assume the probability density function for x is:
f(x) = cx if 0 ≤ x ≤ 2, and 0 elsewhere.
(a) Find the value of c for which f(x) is a pdf.
(b) Compute F(x).
(c) Compute P(1 ≤ x ≤ 2).
3. A large lot of electrical fuses is supposed to contain only 5 percent defectives; assume
a binomial model. If n = 20 fuses are randomly sampled from this lot, find the
probability that at least three defectives will be observed.
4. Let the distribution function of a random variable X be given by
F(x) = 0 for x < 0;
F(x) = x/8 for 0 ≤ x < 2;
F(x) = x²/16 for 2 ≤ x < 4;
F(x) = 1 for x ≥ 4.
(a) Find the density function (i.e., pdf) of x.
(b) Find P(1 ≤ x ≤ 3).
(c) Find P(x ≥ 3).
(d) Find P(x ≤ 1 | x ≤ 3).
Chapter 4
Expectations and moments
Mathematical Expectation and Moments
The probability density and the cumulative distribution functions determine the proba-
bilities of random variables at various points or in different intervals. Very often we are
interested in summary measures of where the distribution is located, how it is dispersed
around some average measure, whether it is symmetric around some point, and so on.
4.1 Mathematical Expectation
Definition 19 Let X be a random variable with f(x) as its PMF or PDF, and let g(x) be
a single-valued function. The expected value (or mathematical expectation) of g(X) is
denoted by E[g(X)]. In the case of a discrete random variable
this takes the form E[g(X)] = Σ_i g(x_i) f(x_i), and in the continuous case, E[g(X)] =
∫_{−∞}^{+∞} g(x) f(x) dx.
4.1.1 Mean of a Distribution
For the special case of g(X) = X, the mean of a distribution is μ = E(X).
Theorem 20 If c is a constant, E(c) = c.
Theorem 21 If c is a constant, E[c g(X)] = c E[g(X)].
Theorem 22 E[u(X) + v(X)] = E[u(X)] + E[v(X)].
Theorem 23 E(X − μ) = 0, where μ = E(X).
Examples
Ex1: Let X have the probability density function
x:    1     2     3     4
f(x): 4/10  1/10  3/10  2/10
E(x) = Σ_x x f(x) = 1(4/10) + 2(1/10) + 3(3/10) + 4(2/10) = 23/10.
Ex2: Let X have the pdf
f(x) = 4x³ for 0 < x < 1, and 0 elsewhere.
E(x) = ∫_{−∞}^{+∞} x f(x) dx = ∫_0^1 x(4x³) dx = 4 ∫_0^1 x⁴ dx = 4 [x⁵/5]_0^1 = 4(1/5) = 4/5.
4.1.2 Moments of a Distribution
The mean of a distribution is the expected value of the random variable X. If the following
integral exists,
μ′_m = E(X^m) = ∫_{−∞}^{+∞} x^m dF          (4.1)
it is called the mth moment around the origin and is denoted by μ′_m. Moments can
also be obtained around the mean; these are the central moments (denoted by μ_m):
μ_m = E[(X − μ)^m] = ∫_{−∞}^{+∞} (x − μ)^m dF          (4.2)
4.1.3 Variance and Standard Deviation
The central moment of a distribution that corresponds to m = 2 is called the variance of
the distribution and is denoted by σ² or Var(X). The positive square root of the variance
is called the standard deviation and is denoted by σ or Std(X). The variance is an average of
the squared deviations from the mean. There are many deviations from the mean but only
one standard deviation. The variance shows the dispersion of a distribution, and by squaring
deviations one treats positive and negative deviations symmetrically.
Mean and Variance of a Normal Distribution
If a random variable X is normally distributed as N(μ, σ²), the mean is μ and the variance is σ².
The operation of subtracting the mean and dividing by the standard deviation is called
standardizing. The standardized variable Z = (X − μ)/σ is SN(0, 1).
Mean and Variance of a Binomial Distribution
A random variable X that is binomially distributed B(n, p) has mean np and variance
np(1 − p). (Show this!)
Theorem 24 If E(X) = μ and Var(X) = σ², and a and b are constants, then Var(a + bX) =
b²σ². (Show this!)
Example
Ex3: Let X have the probability density function
f(x) = 4x³ for 0 < x < 1, and 0 elsewhere.
E(x) = 4/5.
Var(x) = E(x²) − E²(x) = ∫_0^1 x²(4x³) dx − (4/5)² = 4 [x⁶/6]_0^1 − (4/5)² = 4/6 − 16/25 = 2/75 ≈ 0.0267.
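A quick numerical cross-check of Ex2 and Ex3 (the density f(x) = 4x³ on (0, 1) is the one from the notes; the use of scipy's quad integrator is an illustrative choice):

    from scipy.integrate import quad

    pdf = lambda x: 4 * x**3                       # f(x) = 4x^3 on (0, 1)

    mean, _ = quad(lambda x: x * pdf(x), 0, 1)     # E(X)   = 4/5
    ex2, _ = quad(lambda x: x**2 * pdf(x), 0, 1)   # E(X^2) = 4/6
    var = ex2 - mean**2                            # Var(X) = 2/75

    print(mean, var)                               # 0.8  0.0266...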
4.1.4 Expectations and Probabilities
Any probability can be interpreted as an expectation. Define the variable Z which is equal
to 1 if event A occurs, and equal to zero if event A does not occur. Then it is easy to see
that Pr(A) = E(Z).
How much information about the probability distribution of a random variable X is
provided by the expectation and variance of X? There are three useful theorems here.
Theorem 25 Markov's Inequality: If X is a nonnegative random variable, that is, if Pr(X <
0) = 0, and k is any positive constant, then Pr(X ≥ k) ≤ E(X)/k.
Theorem 26 Chebyshev's Inequality: Let b be a positive constant and h(X) be a nonnegative
measurable function of the random variable X. Then
Pr(h(X) ≥ b) ≤ (1/b) E[h(X)].
For any constant c > 0 and σ² = Var(X):
Corollary 27 Pr(|X − μ| ≥ c) ≤ σ²/c².
Corollary 28 Pr(|X − μ| < c) ≥ 1 − σ²/c².
Corollary 29 Pr(|X − μ| ≥ kσ) ≤ 1/k².
For linear functions the expectation of the function is the function of the expectation.
But if Y = h(X) is nonlinear, then in general E(Y) ≠ h[E(X)]. The direction of the
inequality may depend on the distribution of X. For certain functions, we can be more
definite.
Theorem 30 Jensen's Inequality: If Y = h(X) is concave and E(X) = μ, then E(Y) ≤ h(μ).
For example, the logarithmic function is concave, so E[log(X)] ≤ log[E(X)] regardless
of the distribution of X. Similarly, if Y = h(X) is convex, so that it lies everywhere
above its tangent line, then E(Y) ≥ h(μ). For example, the square function is convex, so
E(X²) ≥ [E(X)]² regardless of the distribution of X.
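The inequalities above can be checked by simulation. The sketch below draws from an exponential distribution (an arbitrary illustrative choice) and verifies Corollary 27 and Jensen's inequality for the convex square function.

    import random

    random.seed(0)
    n = 200_000
    xs = [random.expovariate(1.0) for _ in range(n)]    # Exp(1): mu = 1, sigma^2 = 1

    mu = sum(xs) / n
    var = sum((x - mu)**2 for x in xs) / n

    c = 2.0
    lhs = sum(1 for x in xs if abs(x - mu) >= c) / n    # Pr(|X - mu| >= c)
    print(lhs, "<=", var / c**2)                        # Chebyshev bound: sigma^2 / c^2

    e_x2 = sum(x**2 for x in xs) / n
    print(e_x2, ">=", mu**2)                            # Jensen: E(X^2) >= [E(X)]^2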
4.1.5 Approximate Mean and Variance of g(X)
Suppose X is a random variable defined on (S, ℱ, P(·)) with E(X) = μ and Var(X) = σ²,
and let g(X) be a differentiable and measurable function of X. We first take a linear
approximation of g(X) in the neighborhood of μ. This is given by
g(X) ≈ g(μ) + g′(μ)(X − μ)          (4.3)
provided g(μ) and g′(μ) exist. Since the second term has zero expectation, E[g(X)] ≈ g(μ),
and the variance is Var[g(X)] ≈ σ²[g′(μ)]².
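A small numerical check of this approximation for the illustrative choice g(X) = log(X) with X exponential with mean 2; the simulated mean and variance of g(X) are compared with g(μ) and σ²[g′(μ)]².

    import math
    import random

    random.seed(1)
    n = 200_000
    mu, sigma2 = 2.0, 4.0                       # Exp with mean 2 has variance 4
    xs = [random.expovariate(1 / mu) for _ in range(n)]

    g = math.log
    gx = [g(x) for x in xs]
    sim_mean = sum(gx) / n
    sim_var = sum((v - sim_mean)**2 for v in gx) / n

    approx_mean = g(mu)                         # g(mu)
    approx_var = sigma2 * (1 / mu)**2           # sigma^2 [g'(mu)]^2, with g'(x) = 1/x
    print(sim_mean, approx_mean)                # the approximation is quite rough here
    print(sim_var, approx_var)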
4.1.6 Mode of a Distribution
The point(s) at which f(x) is maximized are called the mode. It is the most frequently observed
value of X.
4.1.7 Median, Upper and Lower Quartiles, and Percentiles
A value of x such that P(X < x) ≤ 1/2 and P(X ≤ x) ≥ 1/2 is called a median of
the distribution. If the point is unique, then it is the median. Thus the median is the point
on either side of which lies 50 percent of the distribution. We often prefer the median as an
average measure because the arithmetic average can be misleading if extreme values are
present.
The point(s) with an area 1/4 to the left is (are) called the lower quartile(s), and the
point(s) corresponding to 3/4 is (are) called upper quartile(s).
For any probability p, the values of X for which the area to the right is p are called the
upper pth percentiles (also referred to as quantiles).
4.1.8 Coefficient of Variation
The coefficient of variation is defined as the ratio (σ/μ) × 100, where the numerator is the
standard deviation and the denominator is the mean. It is a measure of the dispersion of
a distribution relative to its mean and is useful in the estimation of relationships. We usually
say that the variable X does not vary much if the coefficient of variation is less than 5
percent. It is also helpful for making comparisons between two variables that are measured
on different scales.
4.1.9 Skewness and Kurtosis
If a continuous density f(x) has the property that f(μ + a) = f(μ − a) for all a (μ being
the mean of the distribution), then f(x) is said to be symmetric around the mean μ. If
a distribution is not symmetric about the mean, then it is called skewed. A commonly used
measure of skewness is α_3 = E[(X − μ)³/σ³]. For a symmetric distribution such as the
normal, this is zero (α_3 = 0). A positively skewed distribution (α_3 > 0) is skewed to the right,
with a long right tail; a negatively skewed distribution (α_3 < 0) is skewed to the left, with a
long left tail.
The peakedness of a distribution is called kurtosis. One measure of kurtosis is α_4 =
E[(X − μ)⁴/σ⁴]. A normal distribution is called mesokurtic (α_4 = 3). A narrow
distribution is called leptokurtic (α_4 > 3) and a flat distribution is called platykurtic
(α_4 < 3). The value E[(X − μ)⁴/σ⁴] − 3 is often referred to as excess kurtosis.
4.2 Moments
4.2.1 Mathematical Expectation
The concept of mathematical expectation is easily extended to bivariate random variables.
We have
E[g(X, Y)] = ∫∫ g(x, y) dF(x, y)          (4.4)
where the integral is over the (X, Y) space.
4.2.2 Moments
The rth moment of X is
E(X^r) = ∫ x^r dF(x)          (4.5)
4.2.3 Joint Moments
E(X^r Y^s) = ∫∫ x^r y^s dF(x, y).
Let X and Y be independent random variables and let u(X) be a function of X only and
v(Y) be a function of Y only. Then
E[u(X)v(Y)] = E[u(X)] E[v(Y)]          (4.6)
4.2.4 Covariance
The covariance between X and Y is defined as
σ_XY = Cov(X, Y) = E[(X − μ_x)(Y − μ_y)] = E(XY) − μ_x μ_y          (4.7)
In the continuous case this takes the form
σ_XY = ∫∫ (x − μ_x)(y − μ_y) f(x, y) dx dy          (4.8)
and in the discrete case it is
σ_XY = Σ_x Σ_y (x − μ_x)(y − μ_y) f(x, y)          (4.9)
Although the covariance measure is useful in identifying the nature of the association
between X and Y, it has a serious problem, namely, its numerical value is very sensitive
to the units of measurement. To avoid this problem, a normalized covariance measure is
used. This measure is called the correlation coefficient.
4.2.5 Correlation
The quantity
ρ_XY = σ_XY / (σ_X σ_Y) = Cov(X, Y) / √(Var(X) Var(Y))          (4.10)
is called the correlation coefficient between X and Y. If Cov(X, Y) = 0, then Cor(X, Y) = 0,
in which case X and Y are said to be uncorrelated. If two random variables are independent,
then σ_XY = 0 and ρ_XY = 0. The converse need not be true.
Theorem 31 |ρ_XY| ≤ 1, that is, −1 ≤ ρ_XY ≤ 1.
The inequality [Cov(X, Y)]² ≤ Var(X) Var(Y) is called the Cauchy-Schwarz Inequality;
equivalently ρ²_XY ≤ 1, that is, −1 ≤ ρ_XY ≤ 1. It should be emphasized that ρ_XY measures
only a linear relationship between X and Y. It is possible to have an exact relation but a
correlation less than 1, even 0.
Example
To illustrate, consider a random variable X which is distributed as Uniform[−a, a] and the
transformation Y = X². Then Cov(X, Y) = E(X³) − E(X)E(X²) = 0, because the distribution
is symmetric around the origin and hence all the odd moments about the origin are zero. It
follows that X and Y are uncorrelated even though there is an exact relation between them.
In fact, this result holds for any distribution that is symmetric around the origin.
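A simulation sketch of this example (the interval [−1, 1] and the sample size are illustrative): the sample correlation between X and Y = X² is close to zero even though Y is an exact function of X.

    import random

    random.seed(2)
    n = 100_000
    xs = [random.uniform(-1, 1) for _ in range(n)]   # X ~ Uniform[-1, 1]
    ys = [x**2 for x in xs]                          # Y = X^2, an exact function of X

    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sx = (sum((x - mx)**2 for x in xs) / n) ** 0.5
    sy = (sum((y - my)**2 for y in ys) / n) ** 0.5
    print(cov / (sx * sy))                           # sample correlation, close to 0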
Definition 32 Conditional Expectation: Let X and Y be continuous random variables
and g(Y) be a continuous function. Then the conditional expectation (or conditional mean)
of g(Y) given X = x, denoted by E_{Y|X}[g(Y) | X], is given by ∫ g(y) f(y | x) dy, where
f(y | x) is the conditional density of Y given X.
Note that E[g(Y) | X = x] is a function of x and is not a random variable because x is
fixed. The special case of E(Y | X) is called the regression of Y on X.
Theorem 33 Law of Iterated Expectation: E_{XY}[g(Y)] = E_X[E_{Y|X}(g(Y) | X)]. That
is, the unconditional expectation is the expectation of the conditional expectation.
Definition 34 Conditional Variance: Let μ_{Y|X} = E(Y | X) = μ(X) be the conditional
mean of Y given X. Then the conditional variance of Y given X is defined as
Var(Y | X) = E_{Y|X}[(Y − μ_{Y|X})² | X]. This is a function of X.
Theorem 35 Var(Y) = E_X[Var(Y | X)] + Var_X[E(Y | X)], that is, the variance of
Y is the mean of its conditional variance plus the variance of its conditional mean.
Theorem 36 Var(aX + bY) = a² Var(X) + 2ab Cov(X, Y) + b² Var(Y).
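Theorems 35 and 36 can be verified by simulation. The sketch below uses a hypothetical bivariate setup (X standard normal, Y equal to X plus independent standard normal noise, and the constants a = 1, b = 2) and compares both sides of each identity numerically.

    import random

    random.seed(3)
    n = 200_000
    xs = [random.gauss(0, 1) for _ in range(n)]             # X ~ N(0, 1)
    ys = [x + random.gauss(0, 1) for x in xs]               # Y | X ~ N(X, 1)

    def var(v):
        m = sum(v) / len(v)
        return sum((u - m)**2 for u in v) / len(v)

    def cov(u, v):
        mu, mv = sum(u) / len(u), sum(v) / len(v)
        return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)

    # Theorem 35: Var(Y) = E[Var(Y|X)] + Var(E[Y|X]) = 1 + Var(X) in this setup.
    print(var(ys), 1 + var(xs))

    # Theorem 36 with a = 1, b = 2.
    a, b = 1, 2
    z = [a * x + b * y for x, y in zip(xs, ys)]
    print(var(z), a**2 * var(xs) + 2 * a * b * cov(xs, ys) + b**2 * var(ys))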
Approximate Mean and Variance for g(X, Y)
After obtaining a linear approximation of the function g(X, Y),
g(X, Y) ≈ g(μ_X, μ_Y) + (∂g/∂X)(X − μ_X) + (∂g/∂Y)(Y − μ_Y)          (4.11)
its mean can be written E[g(X, Y)] ≈ g(μ_X, μ_Y).
Its variance is
Var[g(X, Y)] ≈ σ²_X (∂g/∂X)² + σ²_Y (∂g/∂Y)² + 2 σ_XY (∂g/∂X)(∂g/∂Y)          (4.12)
where the partial derivatives are evaluated at (μ_X, μ_Y) and σ_XY = Cov(X, Y).
Note that approximations may be grossly in error. You should be especially careful with
the variance and covariance approximations.
Problems
1. For certain ore samples the proportion Y of impurities per sample is a random variable
with density function given by
f(y) = (3/2)y² + y for 0 ≤ y ≤ 1, and 0 elsewhere.
The dollar value of each sample is W = 5 − 0.5Y. Find the mean and variance of W.
2. The random variable Y has the following probability density function:
f(y) = (3/8)(7 − y)² for 5 ≤ y ≤ 7, and 0 elsewhere.
(a) Find E(Y) and Var(Y).
(b) Find an interval shorter than (5, 7) in which at least 3/4 of the Y values must lie.
(c) Would you expect to see a measurement below 5.5 very often? Why?
Part II
Statistical Inference
Chapter 5
Sampling
Literature
Wackerly et al., 7.1
Rice, 7.1 - 7.3
Casella and Berger, 5.1 - 5.4
5.1 Populations and Samples
The start of a statistical investigation normally concerns some measures of interest concern-
ing. The totality of elements about which some information is desired is called a population.
Often we only use a small proportion of a population, known as a sample, because it is
impractical to gather data on the whole population.
We measure attributes of this sample and draw conclusions or make policy decisions
based on the data obtained. With statistical inference, we estimate the unknown parameters
underlying statistical distributions, measure their precision, test hypotheses on them, and
use them to generate forecasts of random variables.
Definition 5.1 (Population).
A population (of size N), x_1, x_2, . . . , x_N, is the totality of elements that we are
interested in.
Definition 5.2 (Parameter).
The numerical characteristics of a population are called parameters. Parameters
are often denoted by Greek letters such as θ.
Definition 5.3 (Sample).
A sample (of size n) is a set of random variables, X_1, X_2, . . . , X_n, that are drawn
from the population. The realization of the sample is denoted by x_1, . . . , x_n.
5.2 Random, Independent, and Dependent Samples
There are certain properties that we would like our sample to have. Generally speaking we
can divide samples into three different groups: random samples, independent samples, and
dependent samples.
Definition 5.4 (Random Sample).
The random variables X_1, X_2, . . . , X_n are called a random sample if they are in-
dependent and identically distributed (iid) random variables. Note that a random
sample is equal to an independent sample with each random variable drawn from
the identical distribution.
The joint density of the x_i's in a random sample has the form:
f(x_1, x_2, . . . , x_n) = Π_{i=1}^n f(x_i).          (5.1)
Definition 5.5 (Independent Sample).
The random variables X_1, X_2, . . . , X_n are said to form an independent sample if all
random variables in the sample are independently distributed. Note that f_{X_i} might
be different across i; it is not assumed that the X's have the same distribution.
The joint density of the x_i's in an independent sample has the form:
f(x_1, x_2, . . . , x_n) = f_{X_1}(x_1) f_{X_2}(x_2) . . . f_{X_n}(x_n) = Π_{i=1}^n f_{X_i}(x_i).          (5.2)
Definition 5.6 (Dependent Sample).
The random variables X_1, X_2, . . . , X_n are said to form a dependent sample if there
is a dependency between the distributions, for instance because the sample is collected
over time.
The joint density f(x_1, x_2, . . . , x_n) can be factored as follows:
f(x_1, x_2, . . . , x_n) = f(x_n | x_1, x_2, . . . , x_{n−1}) f(x_1, x_2, . . . , x_{n−1})          (5.3)
= f(x_n | A_{n−1}) f(x_1, x_2, . . . , x_{n−1})          (5.4)
= Π_{i=1}^n f_{X_i}(x_i | A_{i−1}),          (5.5)
where A_{i−1} ≡ {x_k}_{k=1}^{i−1} (and A_0 = ∅).
5.3 Sample Statistics and Sampling Distributions
When drawing a sample from a population, a researcher is normally interested in reducing
the data into some summary measures. Any well-defined measure may be expressed as a
function of the realized values of the sample. As the function will be based on a vector of
random variables, the function itself, called a statistic, will be a random variable as well.
Definition 5.7 (Statistic).
Let X_1, . . . , X_n be a sample of size n and T(x_1, . . . , x_n) be a real-valued or vector-
valued function whose domain includes the sample space of (X_1, . . . , X_n) and that does
not include any unknown parameters. Then the random variable T(X_1, . . . , X_n)
is called a statistic. The probability distribution of a statistic is called its sampling
distribution. Examples of statistics are: sample mean, variance, range, total, and
covariance. Statistics are often denoted by Roman letters, as opposed to the Greek
letters used to denote their parameter counterparts.
The analysis of these statistics and their sampling distributions is at the very core of
econometrics. As the definition of a statistic is very broad, it can include a wide range of
different measures. The most common statistic is probably the mean; other statistics may
measure the variability of the sample, the largest observation in the sample, or a correlation
between two sequences of random variables. Statistics do not need to be scalar, but may
also be vector-valued, returning for instance all the unique values observed in the sample.
Also note the important difference between the sampling distribution, which is the probability
distribution of the statistic T(X_1, . . . , X_n), and the distribution of the population, which is
the marginal distribution of each X_i.
Theorem 5.8.
The sample mean and the sample variance of a random normal sample have the
following three characteristics:
1. E[X̄] = μ, and X̄ ∼ N(μ, σ²/n),
2. E[S²] = σ², and (n − 1)S²/σ² ∼ χ²_{n−1},
3. X̄ and S² are independent random variables.
5.3.1 Sample Mean
Theorem 5.9.
If X_1, . . . , X_n are random variables with defined means, μ_i, and defined variances,
σ²_i, then a linear combination of those random variables,
Z = a + Σ_{i=1}^n b_i X_i,          (5.6)
will have the following mean and variance:
E(Z) = a + Σ_{i=1}^n [b_i E(X_i)],          (5.7)
Var(Z) = Σ_{i=1}^n Σ_{j=1}^n [b_i b_j Cov(X_i, X_j)].          (5.8)
Corollary 5.10.
If X_1, . . . , X_n is a random sample drawn from a population with mean μ and
variance σ², then the sample mean X̄_n = n⁻¹ Σ_{i=1}^n X_i will have the following
expectation and variance:
E(X̄_n) = μ,          (5.9)
and
Var(X̄_n) = σ²_{X̄_n} = σ²/n.          (5.10)
As discussed in the section above, a sample statistic is a random variable and thus has a
probability distribution. Let us consider what a sampling distribution may look like. As an
example take the case of the sample mean X̄_n of a random sample drawn from a normally
distributed population. We know that:
- linear combinations of normal variates are also normally distributed,
- the expected value of the sample mean will be E(X̄_n) = μ,
- the standard error of the sample mean will be σ/√n.
So the sample mean X̄_n will have a sampling distribution equal to
X̄_n ∼ N(μ, σ²/n).          (5.11)
We can now go one step further and calculate the standardized sample mean, by sub-
tracting the expected value, μ, and dividing by the standard error to create a standard
normally distributed random variable:
Z = (X̄_n − μ) / (σ/√n) ∼ N(0, 1).          (5.12)
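A simulation sketch of the sampling distribution in (5.11) and (5.12); the population parameters μ = 10, σ = 2 and the sample size n = 25 are illustrative choices.

    import random
    from math import sqrt

    random.seed(4)
    mu, sigma, n = 10.0, 2.0, 25
    reps = 50_000

    # Draw many samples of size n and record each sample mean.
    means = [sum(random.gauss(mu, sigma) for _ in range(n)) / n for _ in range(reps)]

    m = sum(means) / reps
    v = sum((x - m)**2 for x in means) / reps
    print(m, v, sigma**2 / n)        # mean of the sample means is about mu, variance about sigma^2 / n

    # Standardized sample means should be approximately N(0, 1).
    zs = [(x - mu) / (sigma / sqrt(n)) for x in means]
    print(sum(1 for z in zs if abs(z) <= 1.96) / reps)   # about 0.95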
5.3.2 Sample Variance
If X_1, . . . , X_n is a random sample drawn from a population with mean μ and variance σ²,
then the sample variance
S²_n = (n − 1)⁻¹ Σ_{i=1}^n (X_i − X̄_n)²,          (5.13)
will have the following expectation:
E(S²_n) = σ².          (5.14)
Proof 5.11.
Note that
S²_n = (n − 1)⁻¹ Σ_{i=1}^n (X_i − X̄_n)² = (n − 1)⁻¹ [Σ X²_i − n X̄²_n].
Therefore, by taking expectations we get
E(S²_n) = E[(n − 1)⁻¹ (Σ X²_i − n X̄²_n)]
= (n − 1)⁻¹ [Σ E(X²_i) − n E(X̄²_n)]
= (n − 1)⁻¹ [n(σ² + μ²) − n(n⁻¹σ² + μ²)]
= (n − 1)⁻¹ [nσ² − σ²]
= (n − 1)⁻¹ (n − 1)σ²
= σ².
If the sample is random and drawn from a normal population, then it can also be shown
that the sampling distribution is as follows:
(n − 1)S²/σ² ∼ χ²_{n−1}.          (5.15)
A proof of this can be found in, e.g., Casella and Berger, chapter 5.
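A simulation sketch checking that E(S²) = σ² and that (n − 1)S²/σ² behaves like a χ²_{n−1} variable (its mean should be about n − 1); the normal population N(0, 4) and n = 10 are illustrative.

    import random

    random.seed(5)
    sigma2, n, reps = 4.0, 10, 50_000

    def sample_variance(xs):
        """Unbiased sample variance S^2 with divisor n - 1."""
        m = sum(xs) / len(xs)
        return sum((x - m)**2 for x in xs) / (len(xs) - 1)

    s2 = [sample_variance([random.gauss(0, sigma2**0.5) for _ in range(n)])
          for _ in range(reps)]

    print(sum(s2) / reps, sigma2)                 # E(S^2) is about sigma^2
    chi = [(n - 1) * v / sigma2 for v in s2]
    print(sum(chi) / reps, n - 1)                 # mean of the chi-square values is about n - 1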
5.3.3 Finite Population Correction
As a brief aside, notice that if the whole population is sampled, the estimation error
of the sample mean will, logically, be equal to zero. Similarly, if a large proportion of the
population is sampled without replacement, the standard error calculated above will over-
estimate the true standard error. In such cases, the standard error should be adjusted using
a so-called finite population correction, √(1 − (n − 1)/(N − 1)) = √((N − n)/(N − 1)).
When the sampling fraction n/N approaches zero, the correction approaches 1.
So for most applications,
σ_{X̄_n} ≈ σ/√n,          (5.16)
which is the definition of the standard error as given in the previous section. As the sampling
fraction will be very small for most samples considered, the finite population correction
will be neglected throughout most of this syllabus.
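A tiny numerical illustration of the correction factor for a hypothetical population of N = 1,000 and several sample sizes:

    from math import sqrt

    N = 1_000
    for n in (10, 100, 500, 1_000):
        fpc = sqrt(1 - (n - 1) / (N - 1))   # finite population correction factor
        print(n, round(fpc, 4))             # near 1 for small n/N, and 0 when n = N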
Problems
1. For each of the following, identify the population, the parameter of interest, and
describe how you might collect a sample.
(a) A car manufacturer wants to determine whether the tires of a particular supplier
are prone to defects.
(b) A political scientist wants to determine whether a majority of Brits would prefer
proportional representation.
(c) Trading standards wants to nd out whether a particular brand of light bulb
meets the requirement that 95% of bulbs should last for at least 400 hours.
(d) A student wants to nd out whether Europeans have, on average, become taller
in the last 20 years.
(e) A medical student wants to determine the normal body temperature.
2. Which of the following is a random variable?
(a) The population size.
(b) The sample size.
(c) The population mean.
(d) The sample mean.
(e) The population variance.
(f) The sample variance.
(g) The variance of the sample mean.
(h) The largest value in the sample.
3. Let X_1, X_2, ..., X_m and Y_1, Y_2, ..., Y_n be two normally distributed independent random
samples, with X_i ∼ N(μ_1, σ²_1) and Y_i ∼ N(μ_2, σ²_2). Suppose that μ_1 = μ_2 = 10, σ²_1 = 2,
σ²_2 = 2.5, and m = n.
(a) Find E(X̄) and Var(X̄).
(b) Find E(X̄ − Ȳ) and Var(X̄ − Ȳ).
(c) Find the sample size n such that σ_{(X̄−Ȳ)} = 0.1.
Chapter 6
Large sample theory
Literature
Wackerly et al., 7.2 & 7.3.
Rice, 5 & 6.
Casella and Berger, 3 & 5.5 - 5.6.
In many situations it is not possible to derive exact distributions of statistics with the
use of a random sample of observations. This problem disappears, in most cases, if the
sample size is large, because we can derive an approximate distribution. Hence the need for
large sample or asymptotic distribution theory. Large sample theory builds heavily on the
notion of limits.
Definition 6.1 (Limit of a sequence).
Suppose a_1, a_2, ..., a_n constitute a sequence of real numbers. If there exists a real
number a such that for every real ε > 0, there exists an integer N(ε) with the
property that for all n > N(ε) we have |a_n − a| < ε, then we say that a is the
limit of the sequence {a_n} and write lim_{n→∞} a_n = a.
Intuitively, if a_n lies in an ε-neighborhood of a, (a − ε, a + ε), for all n > N(ε), then a is said
to be the limit of the sequence {a_n}. Examples of limits are
lim_{n→∞} (1 + 1/n) = 1, and          (6.1)
lim_{n→∞} (1 + a/n)^n = e^a.          (6.2)
The notion of convergence is easily extended to that of a function f(x).
Definition 6.2 (Limit of a function).
The function f(x) has the limit A at the point x_0 if for every ε > 0 there exists a
δ(ε) > 0 such that |f(x) − A| < ε whenever 0 < |x − x_0| < δ(ε).
6.1 Law of Large Numbers
One of the core principles in statistics is that a sample estimator will converge to the
true value when the sample gets larger. For instance, if a coin is flipped enough times,
the proportion of times it comes up tails should get very close to 0.5. The Law of Large
Numbers is a formalization of this notion.
In order to establish the Law of Large Numbers, we first need to establish two formal
definitions of convergence: convergence in probability and almost sure convergence.
6.1.1 Convergence in Probability
This type of convergence is relatively weak and so normally not too hard to verify.
Convergence in Probability
The sequence of random variables $X_n$ is said to converge in probability to the real number $x$ if $\lim_{n \to \infty} \Pr[\,|X_n - x| \geq \varepsilon\,] = 0$ for all $\varepsilon > 0$. Thus it becomes less and less likely that $(X_n - x)$ lies outside the interval $(-\varepsilon, +\varepsilon)$. We write $X_n \xrightarrow{p} x$ or $\operatorname{plim} X_n = x$.
Definition 6.3.
There exist different equivalent definitions of convergence in probability. Some equivalent definitions are given below:
1. $\lim_{n \to \infty} \Pr[\,|X_n - x| < \varepsilon\,] = 1$, for all $\varepsilon > 0$.
2. Given $\varepsilon > 0$ and $\delta > 0$, there exists $N(\varepsilon, \delta)$ such that $\Pr[\,|X_n - x| > \varepsilon\,] < \delta$ for all $n > N$.
3. $\Pr[\,|X_n - x| < \varepsilon\,] > 1 - \delta$ for all $n > N$, that is, $\Pr[\,|X_{N+1} - x| < \varepsilon\,] > 1 - \delta$, $\Pr[\,|X_{N+2} - x| < \varepsilon\,] > 1 - \delta$, and so on.
If $X_n \xrightarrow{p} X$ and $Y_n \xrightarrow{p} Y$, then
(a) $(X_n + Y_n) \xrightarrow{p} (X + Y)$,
(b) $(X_n Y_n) \xrightarrow{p} XY$, and
(c) $(X_n / Y_n) \xrightarrow{p} X/Y$ (if $Y_n, Y \neq 0$).
Theorem 6.4.
If $g(\cdot)$ is a continuous function, then $X_n \xrightarrow{p} X$ implies that $g(X_n) \xrightarrow{p} g(X)$. In other words, convergence in probability is preserved under continuous transformations.
Theorem 6.5.
6.1.2 Almost Sure Convergence
Almost Sure Convergence
The sequence of random variables $X_n$ is said to converge almost surely to the real number $x$, written $X_n \xrightarrow{a.s.} x$, if $\Pr[\lim_{n \to \infty} X_n = x] = 1$. In other words, the sequence $X_n$ may not converge everywhere to $x$, but the points where it does not converge form a set of measure zero in the probability sense. More formally, given $\varepsilon > 0$ and $\delta > 0$, there exists $N$ such that $\Pr[\,|X_{N+1} - x| < \varepsilon, \; |X_{N+2} - x| < \varepsilon, \ldots\,] > 1 - \delta$, that is, the probability of these events jointly occurring can be made arbitrarily close to 1. $X_n$ is said to converge almost surely to the random variable $X$ if $(X_n - X) \xrightarrow{a.s.} 0$.
Definition 6.6.
Do not be fooled by the similarity between the definitions of almost sure convergence and convergence in probability. Although they look the same, convergence in probability is much weaker than almost sure convergence. For almost sure convergence to happen, the $X_n$ must converge at all points in the sample space (that have a strictly positive probability). For convergence in probability all that is needed is for the likelihood of convergence to increase as the sequence gets larger.
6.1.3 Convergence in Mean
An alternate form of pointwise convergence is the concept of convergence in mean.
Convergence in Mean (r)
The sequence of random variables $X_n$ is said to converge in mean of order $r$ to $x$ ($r \geq 1$), designated $X_n \xrightarrow{(r)} x$, if $E[\,|X_n - x|^r\,]$ exists and $\lim_{n \to \infty} E[\,|X_n - x|^r\,] = 0$, that is, if the $r$-th moment of the difference tends to zero. The most commonly used version is mean squared convergence, which is when $r = 2$.
Definition 6.7.
The sample mean $\bar{x}_n$ converges in mean square to $\mu$, because $Var(\bar{x}_n) = E[(\bar{x}_n - \mu)^2] = \sigma^2/n$ tends to zero as $n$ goes to infinity.
6.1.4 Weak Law of Large Numbers
The concept of convergence in probability can be used to show that, under very general
conditions, the sample mean converges to the population mean, a result that is known as
the Weak Law of Large Numbers (WLLN). This property of convergence is also referred to
as consistency, and will be treated in more detail in the next chapter.
Let $X_1, X_2, \ldots, X_n$ be iid random variables with $E(X_i) = \mu$ and $Var(X_i) = \sigma^2 < \infty$. Define $\bar{X}_n = n^{-1}\sum_{i=1}^n X_i$. Then for every $\varepsilon > 0$,
$$\lim_{n \to \infty} \Pr\!\left(|\bar{X}_n - \mu| < \varepsilon\right) = 1;$$
that is, $\bar{X}_n$ converges in probability to $\mu$.
Theorem 6.8 (Weak Law of Large Numbers).
The Weak Law of Large Numbers can be proven straightforwardly using Chebyshev's inequality.
For every $\varepsilon > 0$ we have
$$\Pr\!\left[\,|\bar{X} - E(\bar{X})| \geq \varepsilon\,\right] = \Pr\!\left[\,(\bar{X} - E(\bar{X}))^2 \geq \varepsilon^2\,\right] \leq \frac{E[(\bar{X} - \mu)^2]}{\varepsilon^2}, \qquad (6.3)$$
with
$$\frac{E[(\bar{X} - \mu)^2]}{\varepsilon^2} = \frac{Var(\bar{X})}{\varepsilon^2} = \frac{\sigma^2}{n\varepsilon^2}. \qquad (6.4)$$
As $\lim_{n \to \infty} (\sigma^2/n\varepsilon^2) = 0$, we have
$$\lim_{n \to \infty} \Pr\!\left[\,|\bar{X} - E(\bar{X})| \geq \varepsilon\,\right] = 0. \qquad (6.5)$$
Proof 6.9 (Weak Law of Large Numbers).
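The convergence described by the WLLN is easy to see in a small simulation. The sketch below is illustrative only (the parameter values are arbitrary choices, not taken from the notes): it tracks the running mean of iid normal draws, which settles ever closer to $\mu$.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 10.0, 2.0                      # arbitrary illustrative values

draws = rng.normal(mu, sigma, size=100_000)
running_mean = np.cumsum(draws) / np.arange(1, draws.size + 1)

# The running mean approaches mu as n grows.
for n in (10, 100, 1_000, 10_000, 100_000):
    print(n, running_mean[n - 1])
```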
6.1.5 Strong Law of Large Numbers
As in the case of convergence in probability, almost sure convergence can be used to prove the convergence (almost surely) of the sample mean to the population mean. This stronger result is known as the Strong Law of Large Numbers (SLLN).
Let $X_1, X_2, \ldots, X_n$ be iid random variables with $E(X_i) = \mu$ and $Var(X_i) = \sigma^2 < \infty$. Define $\bar{X}_n = n^{-1}\sum_{i=1}^n X_i$. Then for every $\varepsilon > 0$,
$$\Pr\!\left[\lim_{n \to \infty} |\bar{X}_n - \mu| < \varepsilon\right] = 1; \qquad (6.6)$$
that is, $\bar{X}_n$ converges almost surely to $\mu$:
$$(\bar{X}_n - \mu) \xrightarrow{a.s.} 0. \qquad (6.7)$$
Theorem 6.10 (Strong Law of Large Numbers).
6.2 The Central Limit Theorem
Perhaps the most important theorem in large sample theory is the central limit theorem,
which states, under quite general conditions, that the mean of a sequence of random variables
(e.g. the sample mean) converges in distribution to a normal distribution, even though the
population is not normal. Thus, even if we did not know the statistical distribution of the
population from which a sample is drawn, we can approximate the distribution of the sample mean quite well by the normal distribution, provided the sample is large.
In order to establish this result, we rely on the concept of convergence in distribution.
6.2.1 Convergence in Distribution
Convergence in Distribution
Given a sequence of random variables $X_n$ with CDFs $F_n(x)$, and a CDF $F_X(x)$ corresponding to the random variable $X$, we say that $X_n$ converges in distribution to $X$, and write $X_n \xrightarrow{d} X$, if $\lim_{n \to \infty} F_n(x) = F_X(x)$ at all points $x$ at which $F_X(x)$ is continuous.
Definition 6.11.
If $X_n \xrightarrow{d} X$ and $Y_n \xrightarrow{p} c$, where $c$ is a non-zero constant, then
(a) $(X_n + Y_n) \xrightarrow{d} (X + c)$, and
(b) $(X_n / Y_n) \xrightarrow{d} (X/c)$.
Theorem 6.12.
Intuitively, convergence in distribution occurs when the distribution of $X_n$ comes closer and closer to that of $X$ as $n$ increases indefinitely. Thus, $F_X(x)$ can be taken to be an approximation to the distribution of $X_n$ when $n$ is large.
6.2.2 Definition CLT
Let $X_1, X_2, \ldots, X_n$ be an iid sequence of random variables, let $T_n = \sum_{i=1}^n X_i$ be their sum, and let $\bar{X}_n = T_n/n$ be their mean. Define the standardized mean as
$$Z_n = \frac{\bar{X}_n - E(\bar{X}_n)}{\sqrt{Var(\bar{X}_n)}} = \frac{T_n - E(T_n)}{\sqrt{Var(T_n)}}.$$
Then, under a variety of alternative assumptions,
$$Z_n \xrightarrow{d} N(0, 1). \qquad (6.8)$$
Theorem 6.13 (Central Limit Theorem).
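A minimal simulation sketch of the CLT, using an exponential population as an arbitrary illustration of a non-normal parent distribution (the sample size and number of replications are my own choices):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, reps = 50, 20_000                                  # illustrative choices

# Exponential population with mean 1 and variance 1 (clearly non-normal).
samples = rng.exponential(scale=1.0, size=(reps, n))
z = (samples.mean(axis=1) - 1.0) / (1.0 / np.sqrt(n))  # standardized means

# Empirical quantiles of Z_n are close to standard normal quantiles.
for p in (0.05, 0.5, 0.95):
    print(p, np.quantile(z, p), stats.norm.ppf(p))
```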
6.3 Relations Between Convergence Modes
The following relations hold for the different convergence modes:
$$X_n \xrightarrow{a.s.} X \;\Rightarrow\; X_n \xrightarrow{p} X \;\Rightarrow\; X_n \xrightarrow{d} X, \qquad (6.9)$$
also,
$$X_n \xrightarrow{(r)} X \;\Rightarrow\; X_n \xrightarrow{(s)} X \;\Rightarrow\; X_n \xrightarrow{p} X, \qquad (6.10)$$
with $r > s \geq 1$.
6.4 Distributions derived from the normal
6.4.1 Chi-Square
Chi-Square distribution
If $Z$ is a standard normal random variable, the distribution of $U = Z^2$ is called the chi-square ($\chi^2$) distribution with 1 degree of freedom, denoted $\chi^2_1$. If $U_1, U_2, \ldots, U_n$ are independent chi-square random variables with 1 degree of freedom, the distribution of $V = U_1 + U_2 + \ldots + U_n$ is a chi-square distribution with $n$ degrees of freedom, denoted $\chi^2_n$.
Definition 6.14.
The moment generating function of a $\chi^2_n$ distribution is
$$M(t) = (1 - 2t)^{-n/2}. \qquad (6.11)$$
This implies that if $V_n \sim \chi^2_n$, then
$$E(V_n) = n, \quad \text{and} \qquad (6.12)$$
$$Var(V_n) = 2n. \qquad (6.13)$$
Like the other distributions that are derived from the normal distribution, the chi-square distribution often appears as the distribution of a test statistic, for instance when testing the joint significance of two (or more) independent normally distributed variables. If $Z_a \sim N(\mu_a, \sigma_a)$ and $Z_b \sim N(\mu_b, \sigma_b)$ and $V$ is defined as
$$V = \left(\frac{Z_a - \mu_a}{\sigma_a}\right)^2 + \left(\frac{Z_b - \mu_b}{\sigma_b}\right)^2, \qquad (6.14)$$
then $V \sim \chi^2_2$ (remember that $\frac{Z_a - \mu_a}{\sigma_a} \sim N(0, 1)$).
Also, if $X_1, X_2, \ldots, X_n$ is a sequence of independent normally distributed variables, then the estimated variance satisfies
$$(n-1)S^2/\sigma^2 \sim \chi^2_{n-1}. \qquad (6.15)$$
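A short simulation sketch of (6.14): the sum of two squared standardized normals behaves like a $\chi^2_2$ variable. The means and standard deviations used are arbitrary illustrative values.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
reps = 50_000

za = rng.normal(loc=5.0, scale=2.0, size=reps)    # illustrative mu_a, sigma_a
zb = rng.normal(loc=-1.0, scale=0.5, size=reps)   # illustrative mu_b, sigma_b
v = ((za - 5.0) / 2.0) ** 2 + ((zb + 1.0) / 0.5) ** 2

# Mean and variance should be close to n = 2 and 2n = 4; quantiles match chi2(2).
print(v.mean(), v.var())
print(np.quantile(v, 0.95), stats.chi2.ppf(0.95, df=2))
```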
6.4.2 Student-t
Student t distribution
Let $Z \sim N(0, 1)$ and $U_n \sim \chi^2_n$, with $Z$ and $U_n$ independent. Then
$$T_n = \frac{Z}{\sqrt{U_n/n}} \qquad (6.16)$$
has a t distribution with $n$ degrees of freedom, often denoted by $t_n$.
Definition 6.15.
The mean and variance of a t-distribution with $n$ degrees of freedom are
$$E(T_n) = 0, \quad \text{and} \qquad (6.17)$$
$$Var(T_n) = \frac{n}{n-2}, \quad n > 2. \qquad (6.18)$$
Like the normal, the t-distribution has expected value 0 and is symmetric around its mean, implying that $f(-t) = f(t)$. In contrast to the normal, the t distribution has more probability mass in its tails, a property called fat tails. As the degrees of freedom $n$ increase, the tails become lighter. Indeed, in appearance the Student t distribution is very similar to the normal distribution; in the limit $n \to \infty$ the t distribution converges in distribution to a standard normal distribution. Already for values of $n$ as small as 20 or 30, the t distribution is very similar to a standard normal.
Remember that for a random sample drawn from a normal distribution, $Z = (\bar{X} - \mu)/(\sigma/\sqrt{n}) \sim N(0, 1)$. In reality, however, we do not have information about $\sigma$; thus we normally substitute the sample estimate $\sqrt{S^2} = S$ for $\sigma$. The resulting test statistic $T = (\bar{X} - \mu)/(S/\sqrt{n})$ has a t-distribution.

To prove $T = (\bar{X} - \mu)/(S/\sqrt{n}) \sim t(n-1)$, rewrite the statistic to get:
$$\frac{\bar{X} - \mu}{S/\sqrt{n}} = \frac{\bar{X} - \mu}{S/\sqrt{n}} \cdot \frac{\sigma/\sqrt{n}}{\sigma/\sqrt{n}} \qquad (6.19)$$
$$= \frac{(\bar{X} - \mu)/(\sigma/\sqrt{n})}{(S/\sqrt{n})/(\sigma/\sqrt{n})} = \frac{Z}{S/\sigma} = \frac{Z}{\sqrt{S^2/\sigma^2}} = \frac{Z}{\sqrt{\dfrac{(n-1)S^2/\sigma^2}{n-1}}} = \frac{Z}{\sqrt{\dfrac{U_{n-1}}{n-1}}},$$
where
$$Z = (\bar{X} - \mu)/(\sigma/\sqrt{n}) \sim N(0, 1) \quad \text{and} \quad U_{n-1} = (n-1)S^2/\sigma^2 \sim \chi^2_{n-1}. \qquad (6.20)$$
Thus
$$\frac{\bar{X} - \mu}{S_n/\sqrt{n}} \sim t_{n-1}. \qquad (6.21)$$
Proof 6.16.
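A quick numerical illustration of the convergence claim above: the 97.5% quantile of the t distribution shrinks toward the standard normal's 1.96 as the degrees of freedom grow. (Illustrative check only, using scipy's t and normal quantile functions.)

```python
from scipy import stats

# t quantiles approach the normal quantile as df increases.
for df in (2, 5, 10, 20, 30, 100):
    print(df, stats.t.ppf(0.975, df))
print("normal", stats.norm.ppf(0.975))
```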
6.4.3 F-distribution
F distribution
Let $U_n \sim \chi^2_n$ and $V_m \sim \chi^2_m$, with $U_n$ and $V_m$ independent of each other. Then
$$W_{n,m} = \frac{U_n/n}{V_m/m} \qquad (6.22)$$
has an F distribution with $n$ and $m$ degrees of freedom, often denoted by $F_{n,m}$.
Definition 6.17.
The mean and variance of an F-distribution with $n$ and $m$ degrees of freedom are
$$E(W_{n,m}) = \frac{m}{m-2}, \quad m > 2, \qquad (6.23)$$
$$Var(W_{n,m}) = 2\left(\frac{m}{m-2}\right)^2 \frac{n + m - 2}{n(m-4)}, \quad m > 4. \qquad (6.24)$$
Under specific circumstances, the F distribution converges to either a t or a $\chi^2$ distribution. In particular,
$$F_{1,m} = t^2_m, \qquad (6.25)$$
and, as $m \to \infty$,
$$n F_{n,m} \xrightarrow{d} \chi^2_n. \qquad (6.26)$$
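A one-line numerical check of (6.25): if $T \sim t_m$ then $\Pr(T^2 > c) = 0.05$ exactly when $c$ is the 95% quantile of $F(1, m)$, so the two quantities below should coincide. (Illustrative sketch using scipy.)

```python
from scipy import stats

for m in (5, 10, 30):
    print(m, stats.f.ppf(0.95, dfn=1, dfd=m), stats.t.ppf(0.975, m) ** 2)
```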
Problems
1. Let $X_1, X_2, \ldots, X_n$ be an independent sample (i.e. independent but not identically distributed), with $E(X_i) = \mu_i$ and $Var(X_i) = \sigma_i^2$. Let $\bar{\mu} = n^{-1}\sum_{i=1}^n \mu_i$. Show that if $n^{-2}\sum_{i=1}^n \sigma_i^2 \to 0$, then $\bar{X} - \bar{\mu} \to 0$ in probability.
2. The service times for customers coming through a checkout counter in a retail store are independent random variables with mean 1.5 minutes and variance 1.0. Approximate the probability that 100 customers can be serviced in less than 2 hours of total service time.
3. Suppose that a measurement has mean $\mu$ and variance $\sigma^2 = 25$. Let $\bar{X}$ be the average of $n$ such independent measurements. How large should $n$ be so that $\Pr(|\bar{X} - \mu| < 1) = 0.95$?
4. Let $Z_1, Z_2, Z_3, Z_4$ be a sequence of independent standard normal variables. Derive distributions for the following random variables.
(a) $X_1 = Z_1 + Z_2 + Z_3 + Z_4$.
(b) $X_2 = Z_1^2 + Z_2^2 + Z_3^2 + Z_4^2$.
(c) $X_3 = \dfrac{Z_1^2}{(Z_2^2 + Z_3^2 + Z_4^2)/3}$.
(d) $X_4 = \dfrac{Z_1}{\sqrt{Z_2^2 + Z_3^2 + Z_4^2}/\sqrt{3}}$.
5. Let $X_1, X_2, \ldots, X_n$ and $Y_1, Y_2, \ldots, Y_m$ be two independent random samples drawn from the same normal distribution $N(\mu, \sigma)$. Define $S_X^2$ and $S_Y^2$ as the estimated variances of both samples. Find the distribution of $S_X^2 / S_Y^2$.
6. Let $F_{n,m}$ be an F distribution with $n$ and $m$ degrees of freedom. Derive the distribution of $1/F_{n,m}$.
7. Prove that an $F(1, n)$ distribution is equal to $t_n^2$, i.e. the square of a t distribution with $n$ degrees of freedom.
8. Amongst others, Jarque and Bera have shown that for a (large) sample that is drawn from a normal distribution the sample skewness and kurtosis (centralised 3rd and 4th moments, respectively) are asymptotically normal with distributions
$$S \sim N(0, 6/n), \qquad K \sim N(3, 24/n),$$
where $n$ is the sample size. Derive a test for the joint hypothesis that the skewness of a sample equals 0 and the kurtosis equals 3. Derive the sampling distribution of this test.
Chapter 7
Estimation
Literature
Wackerly et al., 8 & 9.1 - 9.4.
Rice, 8.6 - 8.8.
Casella and Berger, 7,9, & 10.1.
7.1 Point Estimation
The formula for obtaining the estimate of a parameter is referred to as an estimator, often denoted with a hat (e.g. $\hat{\theta}$). An estimator is a function of the observations $x_1, x_2, \ldots, x_n$, just like a statistic. The numerical value associated with an estimator is called an estimate. There are two types of parametric estimation: point and interval estimation.
A point estimation procedure uses the information in the sample to arrive at a single number that is intended to be close to the true value of the target parameter in the population. For example, the sample mean
$$\bar{X} = \frac{\sum_{i=1}^n X_i}{n} \qquad (7.1)$$
is one possible point estimator of the population mean $\mu$. There may be more than one estimator for a population parameter. The sample median, $X_{(n/2)}$, for example, might be another estimator for the population mean.
Small Sample Criteria for Estimators
The standard notation for an unknown parameter is $\theta$ and an estimator of this parameter is normally denoted by $\hat{\theta}$. The parameter space is denoted by $\Theta$. A function $g(\theta)$ is called estimable if there exists a statistic $u(x)$ such that $E[u(x)] = g(\theta)$.
7.1.1 Unbiasedness
Unbiasedness
An estimator $\hat{\theta}$ is called an unbiased estimator of $\theta$ if $E(\hat{\theta}) = \theta$. If $E(\hat{\theta}) - \theta = b(\theta)$ and $b(\theta)$ is nonzero, it is called the bias.
Definition 7.1.
7.1.2 Efficiency
Mean Square Error (MSE)
A commonly used measure of the adequacy of an estimator is $E[(\hat{\theta} - \theta)^2]$, which is called the mean square error (MSE). It is a measure of how close $\hat{\theta}$ is, on average, to the true $\theta$. The MSE can be decomposed into two parts:
$$MSE = E[(\hat{\theta} - \theta)^2] = E[(\hat{\theta} - E(\hat{\theta}) + E(\hat{\theta}) - \theta)^2] = Var(\hat{\theta}) + bias^2(\theta). \qquad (7.2)$$
Definition 7.2.
Relative Efficiency
Let $\hat{\theta}_1$ and $\hat{\theta}_2$ be two alternative estimators of $\theta$. Then the ratio of the respective MSEs, $E[(\hat{\theta}_1 - \theta)^2] / E[(\hat{\theta}_2 - \theta)^2]$, is called the relative efficiency of $\hat{\theta}_1$ with respect to $\hat{\theta}_2$.
Definition 7.3.
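A small simulation sketch of Definition 7.3, comparing the sample mean and the sample median as estimators of a normal population mean (the example estimators mentioned in Section 7.1; the parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma, n, reps = 0.0, 1.0, 25, 20_000        # illustrative values

samples = rng.normal(mu, sigma, size=(reps, n))
mse_mean = np.mean((samples.mean(axis=1) - mu) ** 2)
mse_median = np.mean((np.median(samples, axis=1) - mu) ** 2)

# Relative efficiency of the median with respect to the mean: for normal data
# this ratio is well above 1, i.e. the sample mean is the more efficient estimator.
print(mse_median / mse_mean)
```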
Uniformly Minimum Variance Unbiased Estimator (UMVUE)
An estimator $\hat{\theta}^*$ of $\theta$ is called a uniformly minimum variance unbiased estimator (UMVUE) if
1. $E(\hat{\theta}^*) = \theta$, and
2. for any other unbiased estimator $\hat{\theta}$, $Var(\hat{\theta}^*) \leq Var(\hat{\theta})$, for all $\theta \in \Theta$.
Thus, among the class of unbiased estimators, a UMVUE has the smallest variance.
Definition 7.4.
Let $X_1, \ldots, X_n$ be a random (iid) sample with density function $f(x|\theta)$. Let $\hat{\theta}$ be an unbiased estimator of $\theta$. Then, under smoothness assumptions on $f(x|\theta)$,
$$Var(\hat{\theta}) \geq \frac{1}{I(\theta)}, \quad \text{where} \qquad (7.3)$$
$$I(\theta) = -nE\!\left[\frac{\partial^2 \ln f(X|\theta)}{\partial \theta^2}\right]. \qquad (7.4)$$
Theorem 7.5 (Cramer-Rao Inequality).
The quantity $I(\theta)$, also called the Fisher information, is closely bound up with the concept of maximum likelihood. It is a measure of the curvature of the likelihood function near the maximum likelihood estimate of $\theta$. If the curvature is very steep, then a lot of information can be obtained from the sample and the estimator is likely to be precise. Conversely, if the curvature is very blunt, then the sampling distribution of $\hat{\theta}$ is more dispersed.
As the Cramer-Rao inequality stipulates a minimum variance for unbiased estimators, it is also sometimes called the Cramer-Rao Lower Bound. If an estimator achieves this bound, or comes very close to it, it can be said to be efficient. However, it is not always possible to construct an estimator which achieves this lower bound. Also, the lower bound is not applicable to all estimators, only to those for which the smoothness conditions mentioned in the definition are met.
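As a hedged illustration (a Bernoulli example of my own, not from the notes): for an iid Bernoulli($p$) sample, $I(p) = n/[p(1-p)]$, so the Cramer-Rao bound is $p(1-p)/n$, and the MLE $\hat{p}$ (the sample proportion) attains it.

```python
import numpy as np

rng = np.random.default_rng(4)
p, n, reps = 0.3, 200, 20_000             # illustrative values

crlb = p * (1 - p) / n                    # Cramer-Rao lower bound for Bernoulli
p_hat = rng.binomial(n, p, size=reps) / n # MLE: sample proportion
print(crlb, p_hat.var())                  # the empirical variance matches the bound
```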
7.1.3 Sufficiency
Sufficiency
A statistic $T(X)$ is a sufficient statistic for $\theta$ if $f(X_1, \ldots, X_n \,|\, T(X))$ is independent of $\theta$.
Definition 7.6.
A sufficient statistic for a parameter $\theta$ is a statistic that captures all information about $\theta$ that is contained in the sample. All other information in the sample, not captured in the statistic, is not relevant for estimating $\theta$.
For any parameter there are often many sufficient statistics. For example, the whole sample itself $(X_1, \ldots, X_n)$ is always a sufficient statistic. Of course, the whole sample still contains a lot of information that is not relevant for estimating $\theta$. We would ideally like to have a statistic that contains all information about $\theta$, but is reduced enough so that it contains no unimportant information:
Minimally sufficient
$T(X)$ is said to be a minimal sufficient statistic if, for any other sufficient statistic $T'(X)$, $T(X)$ is a function of $T'(X)$.
Definition 7.7.
Large Sample Properties of Estimators
7.1.4 Consistency
Consistency
An estimator $\hat{\theta}$ is consistent if the sequence $\hat{\theta}_n$ converges to $\theta$ in the limit, i.e. $\hat{\theta}_n \to \theta$.
Definition 7.8.
There are three types of consistency, corresponding to the three different modes of convergence discussed earlier:
1. $\hat{\theta}_n \xrightarrow{p} \theta$ (weak consistency)
2. $\hat{\theta}_n \xrightarrow{(2)} \theta$ (squared-error consistency)
3. $\hat{\theta}_n \xrightarrow{a.s.} \theta$ (strong consistency)
7.1.5 Asymptotic Efficiency
Asymptotic Efficiency
Let $\hat{\theta}_n$ be a consistent estimator of $\theta$. $\hat{\theta}_n$ is said to be asymptotically efficient if there is no other consistent estimator $\tilde{\theta}_n$ for which
$$\limsup_{n \to \infty} \; E[(\hat{\theta}_n - \theta)^2] \,/\, E[(\tilde{\theta}_n - \theta)^2] > 1 \qquad (7.5)$$
for all $\theta$ in some open interval.
Definition 7.9.
7.2 Interval Estimation
Confidence Interval
Instead of obtaining a point estimate, we can also estimate an interval $[T_1, T_2]$ that captures $\tau(\theta)$ with some probability $(1-\alpha) \cdot 100\%$. This is called a $(1-\alpha) \cdot 100\%$ confidence interval.
Definition 7.10.
More formally, let $X_1, \ldots, X_n$ be a random sample, and $T_1 \leq T_2$ two statistics such that
$$\Pr\!\left(T_1 \leq \tau(\theta) \leq T_2\right) = 1 - \alpha. \qquad (7.6)$$
Then $[T_1, T_2]$ is a two-sided $(1-\alpha) \cdot 100\%$ confidence interval for $\tau(\theta)$. Alternatively, we can define $T_l$ and $T_u$ such that
$$\Pr\!\left[T_l \leq \tau(\theta)\right] = 1 - \alpha, \qquad (7.7)$$
and
$$\Pr\!\left[\tau(\theta) \leq T_u\right] = 1 - \alpha, \qquad (7.8)$$
to create $[T_l, \infty)$ and $(-\infty, T_u]$, the one-sided $(1-\alpha) \cdot 100\%$ confidence intervals.
7.2.1 Pivotal-Quantity Method of Finding CIs
Pivotal quantity
The random variable $Q = q(X_1, \ldots, X_n)$ is said to be a pivotal quantity if the distribution of $Q$ is independent of $\theta$.
Definition 7.11.
For example, if the sample $X_1, \ldots, X_n$ is a random sample drawn from $N(\mu, 1)$, then:
$\bar{X} - \mu$ is a pivotal quantity, since $\bar{x} - \mu \sim N(0, 1/n)$.
$\bar{X}/\mu$ is not a pivotal quantity, since $\bar{x}/\mu \sim N(1, 1/(\mu^2 n))$.
If $Q$ is a pivotal quantity with a pdf, then for any fixed $\alpha \in (0, 1)$ there will exist $q_1$ and $q_2$ such that
$$\Pr(q_1 \leq Q \leq q_2) = 1 - \alpha = \Pr(cq_1 \leq cQ \leq cq_2) = \Pr(d + cq_1 \leq d + cQ \leq d + cq_2). \qquad (7.9)$$
Note in the above that the probability of the event $\Pr(q_1 < Q < q_2)$ is unaffected by a change of scale or a translation of $Q$. Thus, if we know the pdf of $Q$, it may be possible to use these operations to form the desired confidence interval. For example, assume that
$$\bar{X} \sim N(\mu, 1/n). \qquad (7.10)$$
Then
$$Q = \frac{\bar{x} - \mu}{1/\sqrt{n}} \sim N(0, 1) \qquad (7.12)$$
is a pivotal quantity.
$$\Pr\!\left(q_1 \leq \frac{\bar{x} - \mu}{1/\sqrt{n}} \leq q_2\right) = 1 - \alpha$$
$$= \Pr\!\left(\frac{1}{\sqrt{n}}q_1 \leq \bar{x} - \mu \leq \frac{1}{\sqrt{n}}q_2\right)$$
$$= \Pr\!\left(\frac{1}{\sqrt{n}}q_1 - \bar{x} \leq -\mu \leq \frac{1}{\sqrt{n}}q_2 - \bar{x}\right)$$
$$= \Pr\!\left(\bar{x} - \frac{1}{\sqrt{n}}q_2 \leq \mu \leq \bar{x} - \frac{1}{\sqrt{n}}q_1\right). \qquad (7.13)$$
So $\left[\bar{x} - \frac{1}{\sqrt{n}}q_2, \; \bar{x} - \frac{1}{\sqrt{n}}q_1\right]$ is a $(1-\alpha) \cdot 100\%$ confidence interval for $\mu$.
7.2.2 CI for the mean of a normal population
Consider the case where both $\mu$ and $\sigma^2$ are unknown. We know that
$$Q = \frac{\bar{x} - \mu}{S/\sqrt{n}} \sim t_{(n-1)}. \qquad (7.14)$$
As the distribution of $Q$ does not depend on any unknown parameters, $Q$ is a pivotal quantity. Using the CDF of the t-distribution, we can always find a number $q$ such that
$$\Pr\!\left(-q \leq \frac{\bar{x} - \mu}{S/\sqrt{n}} \leq q\right) = 1 - \alpha, \qquad (7.15)$$
which can be written as
$$\Pr\!\left(\bar{x} - q\frac{S}{\sqrt{n}} < \mu < \bar{x} + q\frac{S}{\sqrt{n}}\right) = 1 - \alpha, \qquad (7.16)$$
where $q = t_{(\alpha/2, \, n-1)}$.
Let $n = 10$, $\bar{x} = 3.22$, $s = 1.17$, $(1-\alpha) = 0.95$. The 95% CI for $\mu$ equals
$$\left[3.22 - \frac{(2.262)(1.17)}{\sqrt{10}}, \; 3.22 + \frac{(2.262)(1.17)}{\sqrt{10}}\right] \qquad (7.17)$$
$$= [2.38, 4.06]. \qquad (7.18)$$
Example 1.
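A hedged check of the numbers in Example 1, using scipy for the t quantile:

```python
import numpy as np
from scipy import stats

n, xbar, s, alpha = 10, 3.22, 1.17, 0.05       # values from Example 1

q = stats.t.ppf(1 - alpha / 2, df=n - 1)       # approx 2.262
half_width = q * s / np.sqrt(n)
print(q, (xbar - half_width, xbar + half_width))   # approx (2.38, 4.06)
```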
7.2.3 CI for the variance of a normal population
We know that
$$Q = (n-1)\frac{S^2}{\sigma^2} \sim \chi^2_{(n-1)}. \qquad (7.19)$$
Note that the distribution of $Q$ does not depend on any unknown parameters, hence $Q$ is a pivotal quantity. Therefore we can calculate:
$$\Pr(q_1 \leq Q \leq q_2) = 1 - \alpha = \Pr\!\left(q_1 \leq (n-1)\frac{S^2}{\sigma^2} \leq q_2\right) = \Pr\!\left(\frac{(n-1)S^2}{q_2} \leq \sigma^2 \leq \frac{(n-1)S^2}{q_1}\right). \qquad (7.20)$$
So $\left[\frac{(n-1)S^2}{q_2}, \; \frac{(n-1)S^2}{q_1}\right]$ is a $(1-\alpha) \cdot 100\%$ CI for $\sigma^2$.
As in the previous example, let $n = 10$, $\bar{x} = 3.22$, $s = 1.17$, $(1-\alpha) = 0.95$.
The 95 percent CI for $\sigma^2$ is
$$\left[(n-1)\frac{s^2}{\chi^2_{(\alpha/2, \, n-1)}}, \; (n-1)\frac{s^2}{\chi^2_{(1-\alpha/2, \, n-1)}}\right],$$
with $\chi^2_{(0.025, 9)} = 19.02$ and $\chi^2_{(0.975, 9)} = 2.70$, so the 95% CI equals $[0.65, 4.56]$.
Example 2.
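The same check for Example 2, using the chi-square quantiles from scipy:

```python
import numpy as np
from scipy import stats

n, s, alpha = 10, 1.17, 0.05                    # values from Example 2

q1 = stats.chi2.ppf(alpha / 2, df=n - 1)        # approx 2.70
q2 = stats.chi2.ppf(1 - alpha / 2, df=n - 1)    # approx 19.02
ci = ((n - 1) * s**2 / q2, (n - 1) * s**2 / q1)
print(ci)                                       # approx (0.65, 4.56)
```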
Problems
1. Let $X_1, X_2, \ldots, X_n$ be a random sample with mean $\mu$ and variance $\sigma^2$. Consider the following four estimators:
(a) $\hat{\mu}_1 = \dfrac{X_1 + X_n}{2}$
(b) $\hat{\mu}_2 = \dfrac{X_1}{4} + \dfrac{1}{2}\dfrac{\sum_{i=2}^{n-1} X_i}{n-2} + \dfrac{X_n}{4}$
(c) $\hat{\mu}_3 = \dfrac{\sum_{i=1}^{n} X_i}{n+k}$, where $0 < k \leq 3$
(d) $\hat{\mu}_4 = \bar{X}$
(a) Explain for each estimator whether it is unbiased and/or consistent.
(b) Find the efficiency of $\hat{\mu}_4$ relative to $\hat{\mu}_1$, $\hat{\mu}_2$, and $\hat{\mu}_3$. Assume $n = 36$.
2. Suppose that $E(\hat{\theta}_1) = E(\hat{\theta}_2) = \theta$, $Var(\hat{\theta}_1) = \sigma_1^2$, and $Var(\hat{\theta}_2) = \sigma_2^2$. A new unbiased estimator $\hat{\theta}_3$ is to be formed by $\hat{\theta}_3 = a\hat{\theta}_1 + (1-a)\hat{\theta}_2$.
(a) How should the constant $a$ be chosen in order to minimise the variance of $\hat{\theta}_3$? Assume that $\hat{\theta}_1$ and $\hat{\theta}_2$ are independent.
(b) How should the constant $a$ be chosen in order to minimise the variance of $\hat{\theta}_3$? Assume that $\hat{\theta}_1$ and $\hat{\theta}_2$ are not independent but are such that $Cov(\hat{\theta}_1, \hat{\theta}_2) = c \neq 0$.
Chapter 8
Maximum likelihood
Literature
Wackerly et al., 9.6 - 9.7.
Rice, X.X.
Casella and Berger, 6.3 & 7.2.
Greene, 14
8.1 Introduction
Let $f(x|\theta)$ be the joint pdf of a sample $Y_1, \ldots, Y_n$ with observed realizations $y_1, \ldots, y_n$. The function of $\theta$ defined by
$$L(\theta|x) = f(x|\theta)$$
is called the likelihood function. Sometimes the likelihood function is written as $L(\theta)$ with the conditioning on the observed sample points implied (rather than explicitly stated).
Definition 8.1 (Likelihood Function).
The joint pdf $f(x|\theta)$ denotes the probability of observing the observed sample for a given parameter value $\theta$, hence the label likelihood function. A natural way to find the parameter values that are most likely to be the true values is to find the parameter values that maximize this likelihood function.
To simplify the problem of finding the maximum of $L(\theta)$ we can exploit the assumed independence of the sample to factorize the likelihood:
$$L(\theta) = f_\theta(y_1, \ldots, y_n) = \prod_{i=1}^n f_\theta(y_i).$$
We can further simplify the problem by taking the (natural) log of the likelihood function to obtain
$$l(\theta) = \ln(L(\theta)) = \sum_{i=1}^n \ln(f_\theta(y_i)).$$
Taking the log is a monotonic transformation, so $L(\theta)$ and $l(\theta)$ attain their maximum at the same value of $\theta$. However, it is often much easier to maximize the so-called log-likelihood than the likelihood itself.
Given a random sample $Y_1, \ldots, Y_n \sim f_\theta$, with likelihood $L(\theta)$ and log-likelihood $l(\theta)$, the Maximum Likelihood Estimator (MLE) of $\theta$ is defined as
$$\hat{\theta} = \arg\max_{\theta \in \Theta} L(\theta) = \arg\max_{\theta \in \Theta} l(\theta),$$
where $\Theta$ is the set of permissible parameter values, also known as the parameter space.
Definition 8.2 (Maximum Likelihood Estimator).
The advantage of the MLE is the theoretical simplicity of the estimator: it is the value that is most likely to be correct, given our observations. However, in practice, MLEs are not always as easy to come by. Only in a limited number of cases can we solve analytically for the value of the estimator; in most cases one has to use numerical optimization methods to find the estimate.
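A minimal sketch of numerical maximum likelihood for an assumed exponential model (my own illustration, not from the notes), using scipy's general-purpose optimizer on the negative log-likelihood:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)
y = rng.exponential(scale=2.0, size=500)      # illustrative data, true scale = 2

# Exponential log-likelihood: l(theta) = -n ln(theta) - sum(y)/theta,
# so we minimize its negative.
def neg_loglik(theta):
    theta = theta[0]
    return len(y) * np.log(theta) + y.sum() / theta

res = minimize(neg_loglik, x0=[1.0], bounds=[(1e-6, None)])
print(res.x[0], y.mean())    # numerical MLE agrees with the analytic MLE (the mean)
```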
8.1.1 Estimation
The gradient of the log-likelihood is defined as
$$g(\theta) = \frac{\partial l(\theta)}{\partial \theta} \quad (k \times 1),$$
where $k$ is the number of parameters to be estimated. Essentially this vector contains the partial derivatives of the log-likelihood with respect to the parameters and is also referred to as the score vector.
The MLE $\hat{\theta}$ satisfies the condition that $g(\hat{\theta}) = 0$.
Definition 8.3 (Score Vector).
The Fisher information matrix, $I(\theta_0)$, provides a measure of the precision with which $\theta$ is estimated, and is defined as
$$I(\theta_0) = E_{\theta_0}[g(\theta_0)g(\theta_0)'] = -E_{\theta_0}[H(\theta_0)],$$
where $H(\theta)$ is the Hessian, a $(k \times k)$ matrix containing the second-order partial derivatives of the log-likelihood.
Definition 8.4 (Information Matrix).
8.2 Large sample properties
8.2.1 Consistency
Under certain conditions (which often hold) the ML estimator of $\theta$ can be shown to be consistent. That is,
$$\operatorname{plim} \hat{\theta} = \theta_0.$$
8.2.2 Efficiency
The maximum likelihood estimator asymptotically achieves the Cramer-Rao lower bound. This implies that it is asymptotically efficient, as no estimator can have a lower asymptotic variance than the Cramer-Rao lower bound.
8.2.3 Normality
Using central limit theory it can be shown that asymptotically
$$n^{1/2}(\hat{\theta} - \theta_0) \xrightarrow{d} N\!\left(0, \; n\,I(\theta_0)^{-1}\right).$$
Furthermore, the score evaluated at the true parameter value is asymptotically normal,
$$n^{-1/2}\, g(\theta_0) \xrightarrow{d} N\!\left(0, \; n^{-1} I(\theta_0)\right),$$
which is a property that is used in various test statistics.
8.2.4 Invariance
A special property of maximum likelihood estimators is that if $\hat{\theta}$ is the MLE of $\theta_0$, then $u(\hat{\theta})$ will be the MLE of $u(\theta_0)$. For example, given that $\hat{\sigma}^2$ is the MLE of $\sigma^2$, we know that $\hat{\sigma}$ will be the MLE of $\sigma$.¹ Note that this implies that many estimators based on the maximum likelihood procedure will not be unbiased.
¹ Here $u(\cdot) = \sqrt{(\cdot)}$.
8.3 Variance-Covariance Matrix
As stated before, the asymptotic variance-covariance matrix is equal to the inverse of the information matrix
$$[I(\theta_0)]^{-1} = \left\{-E_0\!\left[\frac{\partial^2 l(\theta_0)}{\partial\theta_0\,\partial\theta_0'}\right]\right\}^{-1}.$$
However, in practice we only have information on $\hat{\theta}$ and not on $\theta_0$, so a solution must be found to obtain an empirical V-Cov matrix. There are three possible ways:
1. If the expected value of the Hessian is known, then one can evaluate
$$[I(\theta_0)]^{-1} = \left\{-E_0\!\left[\frac{\partial^2 l(\theta_0)}{\partial\theta_0\,\partial\theta_0'}\right]\right\}^{-1}$$
at the estimates $\hat{\theta}$. Normally, however, the expectation of the Hessian will involve complex nonlinear functions, so option 2 or 3 will be used instead.
2. The second estimator simply evaluates the actual (instead of the expected) Hessian at the maximum likelihood estimates:
$$[\hat{I}(\hat{\theta})]^{-1} = \left[-\frac{\partial^2 l(\hat{\theta})}{\partial\hat{\theta}\,\partial\hat{\theta}'}\right]^{-1}.$$
3. A third estimator is based on the alternative formulation of the information matrix as the expectation of the outer product of the gradient (score vector). In this case the outer product of the empirical gradients is evaluated at the MLE:
$$[\check{I}(\hat{\theta})]^{-1} = \left[\sum_{i=1}^{n} g_i g_i'\right]^{-1}.$$
This estimator of the V-Cov matrix is referred to as the BHHH estimator or the Outer Product of Gradients (OPG) estimator.
Asymptotically all three estimators of the V-Cov matrix are the same, so there is no statistical reason to choose one over the other (in large samples). Computationally the BHHH is often the easiest to compute, as it does not require calculation of the Hessian. In small samples, however, there can be large differences between the three estimates. In particular the BHHH estimates tend to be very large in small samples. Greene suggests that in small/moderate samples V-Cov estimates based on the (empirical) Hessian are preferred.
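A hedged sketch comparing estimators 2 and 3 for the exponential model used earlier (my own example): the per-observation log-likelihood is $l_i = -\ln\theta - y_i/\theta$, so the score and second derivative are available analytically.

```python
import numpy as np

rng = np.random.default_rng(6)
y = rng.exponential(scale=2.0, size=500)        # illustrative data
theta_hat = y.mean()                            # analytic MLE for the scale

# Per-observation score and second derivative of l_i = -ln(theta) - y_i/theta.
g_i = -1.0 / theta_hat + y / theta_hat**2
h_i = 1.0 / theta_hat**2 - 2.0 * y / theta_hat**3

var_hessian = -1.0 / h_i.sum()                  # estimator 2: minus inverse Hessian
var_opg = 1.0 / np.sum(g_i**2)                  # estimator 3: BHHH / OPG
print(var_hessian, var_opg)                     # close, but not identical, in finite samples
```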
Problems
1. One observation is taken on a discrete random variable $X$ with pmf $f(x|\theta)$, where $\theta \in \{1, 2, 3\}$. Find the MLE of $\theta$.

  x   f(x|1)   f(x|2)   f(x|3)
  0    1/3      1/4      0
  1    1/3      1/4      0
  2    0        1/4      1/4
  3    1/6      1/4      1/2
  4    1/6      0        1/4

2. Let $X_1, \ldots, X_n$ be a random sample drawn from a population with pmf
$$P_\theta(X = x) = \theta^x (1-\theta)^{1-x}, \quad x \in \{0, 1\}, \; \theta \in [0, 0.5].$$
(a) Find the method of moments estimator of $\theta$.
(b) Find the MLE of $\theta$.
3. Assume that the relation between income (y) and education (x) is given by the following model:
$$f(y_i, x_i, \beta) = \frac{1}{\beta + x_i}\, e^{-y_i/(\beta + x_i)}.$$
Twenty observations have been collected and tabulated below.
Observation Income (y) Education (x)
1 20.5 12
2 31.5 16
3 47.7 18
4 26.2 16
5 44.0 12
6 8.28 12
7 30.8 16
8 17.2 12
9 19.9 10
10 9.96 12
11 55.8 16
12 25.2 20
13 29.0 12
14 85.5 16
15 15.1 10
16 28.5 18
17 21.4 16
18 17.7 20
19 6.42 12
20 84.9 16
(a) Find the log-likelihood equation.
(b) Find an expression for the score of observation $i$, conditional on some parameter $\beta$.
(c) Find the MLE $\hat{\beta}$ (note: this is best done in a spreadsheet; it is not easily feasible on a calculator).
(d) Calculate the variance of $\hat{\beta}$ (this can be done in three ways, again best done in a spreadsheet).
Chapter 9
Hypothesis testing
Literature
Wackerly et al., 10.
Rice, 9.1 - 9.3.
Casella and Berger, 8.
9.1 Hypotheses
The testing of statistical hypotheses on unknown parameters of a probability model is one
of the most important steps of any empirical study. Three examples of common hypothesis
tests include:
The comparison of two alternative models,
The evaluation of the effects of a policy change,
The testing of the validity of an economic theory.
The term hypothesis stands for a statement or conjecture regarding the values that the parameters, $\theta$, might take. The testing of a hypothesis consists of three basic steps:
1. Formulate two opposing hypotheses,
2. Derive a test statistic and identify its sampling distribution,
3. Create a decision rule and choose one of the opposing hypotheses.
9.2 The Elements of a Statistical Test
9.2.1 Null and Alternative Hypotheses
Consider a family of distributions represented by the density function $f(x; \theta)$, $\theta \in \Theta$. A hypothesis can be thought of as a binary partition of the parameter space $\Theta$ into two sets, $\Theta_0$ and $\Theta_1$, such that $\Theta_0 \cup \Theta_1 = \Theta$ and $\Theta_0 \cap \Theta_1 = \emptyset$. The set $\Theta_0$ is called the null hypothesis, denoted by $H_0$. The set $\Theta_1$ is called the alternative hypothesis, denoted by $H_1$, and is the class of alternatives to the null hypothesis.
9.2.2 Simple and Composite Hypotheses
If the null is of the form $H_0: \theta = \theta_0$ and the alternative is $H_1: \theta = \theta_1$, then they are said to be simple hypotheses. If either $H_0$ or $H_1$ specifies a range of values for $\theta$ instead of a single value (for example, $H_1: \theta \neq \theta_0$), then we have a composite hypothesis. If both the null and the alternative are simple hypotheses, the hypothesis test reduces to choosing between the two density functions $f(x; \theta_0)$ and $f(x; \theta_1)$.
9.3 Statistical Tests
A decision rule that selects one of the inferences "reject the null hypothesis" or "do not reject the null hypothesis" is called a statistical test. A test procedure is usually described by a sample statistic $T(x) = T(x_1, x_2, \ldots, x_n)$, which is called the test statistic.
The range of values of $T$ for which the test procedure recommends rejection of the null is called the critical region, and the range in which the null is not rejected is called the acceptance region.
9.3.1 Type I and Type II Errors
In performing a test one may either arrive at the correct conclusion or commit one of two
types of errors:
Type I error: rejecting $H_0$ when it is true.
Type II error: not rejecting $H_0$ when it is false.
Power ($1 - \beta$)
The probability of rejecting $H_0$ when it is false is called the power of the test. This probability is normally denoted as $(1 - \beta)$.
Definition 9.1.
Operating Characteristic ($\beta$)
The probability of not rejecting $H_0$ when it is actually false (i.e. committing a type II error) is known as the operating characteristic. This probability is usually denoted as $\beta$. This concept is widely used in statistical quality control theory.
Definition 9.2.
Size ($\alpha$)
The probability of committing a type I error, given a testing procedure, is called the size of the test. It is also sometimes called the level of significance of the test. This probability is normally set by the researcher and is often denoted as $\alpha$. Common sizes used in hypothesis testing are $\alpha = 0.10$, $\alpha = 0.05$, and $\alpha = 0.01$.
Definition 9.3.
Ideally we would want to keep both $\alpha$ and $\beta$ as low as possible. But this is impossible because, given some sample, reducing $\alpha$ generally increases $\beta$. Usually the only way to decrease both $\alpha$ and $\beta$ is to increase the sample size.
The classical decision procedure chooses an acceptable value for $\alpha$ and then selects a decision rule (that is, a test procedure) that minimizes $\beta$. In other words, given $\alpha$, among the class of decision rules for which $\Pr(\text{type I error}) \leq \alpha$, choose the one for which $\beta$ is minimized or, equivalently, for which $(1-\beta)$ is maximized. Thus the test procedure selects the decision rule that maximizes $(1-\beta)$ subject to $\Pr(\text{type I error}) \leq \alpha$. Such a test is called a most powerful test. If the critical region obtained this way is independent of the alternative $H_1$, then we have a uniformly most powerful test.
It is also worth mentioning that in small samples the empirical size associated with a critical value of a test statistic is often larger than the asymptotic size. Thus, if a researcher is not careful, he risks choosing a test which rejects the null hypothesis more often than he realizes.
9.4 Test statistics derived from maximum likelihood
Consider a restriction on the parameter space, in the form of
$$H_0: c(\theta) = 0.$$
9.4.1 Likelihood Ratio
If the restriction is valid, then imposing it should not lead to a large reduction in the log-likelihood function. An intuitive testing method is therefore simply to look at the difference in the log-likelihood of the model with and without the restriction imposed. The Likelihood Ratio (LR) test normally takes the form
$$LR = 2(l_u - l_r) \sim \chi^2_k,$$
where $l_r$ and $l_u$ are the log-likelihoods of the restricted and unrestricted model respectively, and $k$ is the number of restrictions imposed on the model. Note that this test statistic is always non-negative.
This type of test uses information on both the restricted and the unrestricted model.
9.4.2 Wald Test
If the restriction is valid, then $c(\hat{\theta}_{MLE})$ should be close to zero. Therefore, if this value is significantly different from zero, we can reject the null hypothesis. A disadvantage of the Wald test is that it is not invariant: often there are different ways to write the restriction, and the Wald test statistic may change if you rewrite the constraint. The LR and LM tests do not suffer from this shortcoming.
This type of test uses information on the unrestricted model only.
9.4.3 Lagrange Multiplier Test
The last test procedure is the Lagrange Multiplier (LM) test. Where the LR test uses information on both the restricted and unrestricted model, and the Wald test uses information on only the unrestricted model, the LM test uses only information on the restricted model. This is convenient if you impose a restriction like, say, homoskedasticity but do not quite know what the unrestricted model would look like.
The LM test is based on the observation that at the true MLE the gradient (score vector) of the log-likelihood is zero. If the gradient evaluated at the restricted estimate is far away from zero, then we can conclude that the restriction is not met and we can reject the null hypothesis.
The LM test is generally $\chi^2_k$ distributed. In many settings the LM test takes the form $LM = nR^2 \sim \chi^2_k$.
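A hedged sketch of an LR test, reusing the exponential example from Chapter 8 (my own illustration): we test $H_0: \theta = 2$ against the unrestricted MLE, with one restriction imposed.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
y = rng.exponential(scale=2.0, size=500)        # illustrative data

def loglik(theta):
    # Exponential log-likelihood with scale parameter theta.
    return -len(y) * np.log(theta) - y.sum() / theta

lr = 2 * (loglik(y.mean()) - loglik(2.0))       # unrestricted MLE vs H0: theta = 2
p_value = stats.chi2.sf(lr, df=1)               # k = 1 restriction
print(lr, p_value)
```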
Problems
1. The output voltage for a certain electric circuit is specied to be 130. A sample of 40
independent readings on the voltage for this circuit gave a sample mean of 128.6 and
a standard deviation of 2.1. Test the hypothesis that the average output voltage is
130 against the alternative that it is less than 130. Use a test with level 0.05.
2. Let $Y_1, Y_2, \ldots, Y_n$ be a random sample of size $n = 20$ from a normal distribution with unknown mean $\mu$ and known variance $\sigma^2 = 5$. We wish to test $H_0: \mu \leq 7$ versus $H_1: \mu > 7$.
(a) Find the uniformly most powerful test with significance level 0.05.
(b) For the test in (a), find the power at each of the following alternative values for $\mu$: $\mu_1 = 7.5$, $\mu_1 = 8.0$, $\mu_1 = 8.5$, and $\mu_1 = 9.0$.
3. In a study to assess various effects of using a female model in automobile advertising, each of 100 male subjects was shown photographs of two automobiles matched for price, colour, and size but of different makes. Fifty of the subjects (group A) were shown automobile 1 with a female model and automobile 2 with no model. Both automobiles were shown without the model to the other 50 subjects (group B). In group A, automobile 1 (shown with the model) was judged to be more expensive by 37 subjects. In group B, automobile 1 was judged to be more expensive by 23 subjects. Do these results indicate that using a female model increases the perceived cost of an automobile? Find the associated p-value and indicate your conclusion for an $\alpha = 0.05$ level test.
Appendices
Appendix A
Distributions
Literature
For a superb overview of the different types of distributions that you may come across in your studies, the appendix at the back of Casella and Berger comes highly recommended. It not only summarizes all salient features of the individual distributions, it also charts the numerous links that exist between the different families of distributions.
A.1 Discrete Distributions
Definition 37 A random variable X is said to have a discrete distribution if it can take only a finite number of different values $x_1, x_2, \ldots, x_n$, or a countably infinite number of distinct points.
A.1.1 The Bernoulli Distribution
We have this distribution when there are only two possible outcomes to an experiment, one labeled a success (with probability $p$) and the other a failure (with probability $1-p = q$). If there is only a single trial of the experiment, then we have the Bernoulli distribution with probability density function
$$\Pr(X = x \,|\, p) = p^x (1-p)^{1-x}; \quad x = 0, 1, \qquad (A.1)$$
for $X = 0, 1$ (failure, success) and $0 \leq p \leq 1$.
The mean and variance of a Bernoulli distribution are:
$$E(X) = p, \qquad (A.2)$$
$$Var(X) = p(1-p) = pq. \qquad (A.3)$$
A.1.2 The Binomial Distribution
This is also built on Bernoulli trials, but now with $n$ independent trials:
$$\Pr(X = x \,|\, n, p) = \binom{n}{x} p^x (1-p)^{n-x} = \frac{n!}{x!(n-x)!} p^x (1-p)^{n-x}, \qquad (A.4)$$
for $x = 0, 1, \ldots, n$ ($X$ is the number of successes in $n$ trials) and $0 \leq p \leq 1$.
The mean and variance of a binomial distribution are:
$$E(X) = np, \qquad (A.5)$$
$$Var(X) = npq. \qquad (A.6)$$
Assume a student is given a test with 10 true-false questions. Also assume that the student is totally unprepared for the test and guesses the answer to every question. What is the probability that the student will answer 7 or more questions correctly?
Let $X$ be the number of questions answered correctly. The test represents a binomial experiment with $n = 10$, $p = 1/2$, so $X \sim Bin(n = 10, p = 1/2)$.
$$\Pr(X \geq 7) = \Pr(X=7) + \Pr(X=8) + \Pr(X=9) + \Pr(X=10) = \sum_{k=7}^{10} \binom{10}{k}\left(\frac{1}{2}\right)^k\left(\frac{1}{2}\right)^{10-k} = \sum_{k=7}^{10} \binom{10}{k}\left(\frac{1}{2}\right)^{10} = 0.17.$$
Example 3.
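A one-line check of Example 3 with scipy's binomial survival function:

```python
from scipy import stats

# Pr(X >= 7) for X ~ Bin(10, 1/2), as in Example 3.
print(stats.binom.sf(6, n=10, p=0.5))   # approx 0.172
```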
A.1.3 Geometric Distribution
Let $X$ be the number of the trial at which the first success occurs. The distribution of $X$ is known as the geometric distribution. It has the density function
$$\Pr(X = x \,|\, p) = p(1-p)^{x-1}; \quad x = 1, 2, 3, \ldots \qquad (A.7)$$
The mean and variance of a geometric distribution are:
$$E(X) = \frac{1}{p}, \qquad (A.8)$$
$$Var(X) = \frac{1-p}{p^2}. \qquad (A.9)$$
A.1.4 Poisson Distribution
When $n \to \infty$ and $p \to 0$ in a binomial distribution, with $np = \lambda$ ($> 0$) held fixed, the probability of success is very small and the number of trials is large. The limiting distribution is known as the Poisson distribution:
$$\Pr(X = x \,|\, \lambda) = \frac{e^{-\lambda}\lambda^x}{x!}; \quad x = 0, 1, 2, \ldots \qquad (A.10)$$
We use this distribution in queuing theory, for example to model the arrival of the next customer at a checkout line or the occurrence of a phone call in a specific small interval. The increments of a Poisson process are independent of each other and have the memoryless property, as the waiting times between arrivals are iid exponentially distributed.
The mean and variance of a Poisson distribution are:
$$E(X) = \lambda, \qquad (A.11)$$
$$Var(X) = \lambda. \qquad (A.12)$$
A.1.5 Hypergeometric Distribution
The binomial distribution is often referred to as sampling with replacement, which is needed to maintain the same probabilities across the trials.
Let there be $a$ objects in a certain class (defective) and $b$ objects in another class (non-defective). If we draw a random sample of size $n$ without replacement, then there are $\binom{a}{x}$ possible ways to get $x$ from class A. For each such outcome, there are $\binom{b}{n-x}$ possible ways of drawing from B. Thus the probability density function of a hypergeometric distributed variable $X$ is
$$f(x; n, a, b) = \frac{\binom{a}{x}\binom{b}{n-x}}{\binom{a+b}{n}}. \qquad (A.13)$$
A.1.6 Negative Binomial Distribution
In a binomial experiment, let $Y$ be the number of trials needed to get exactly $k$ successes. To get exactly $k$ successes, there must be $k-1$ successes in the first $y-1$ trials and the next outcome must be a success. Let $X = Y - k$ be the number of failures until $k$ successes have been obtained. The density function of $X$ is known as the negative binomial:
$$f(x; k, p) = \binom{x+k-1}{k-1} p^k (1-p)^x, \quad x = 0, 1, 2, \ldots \qquad (A.14)$$
A.1.7 Simple Random Walk
This is a process often used to describe the behavior of stock prices. Suppose that $\varepsilon_t$ is a purely random series with mean $\mu$ and variance $\sigma^2$. Then a process $X_t$ is said to be a random walk if
$$X_t = X_{t-1} + \varepsilon_t. \qquad (A.15)$$
Let us assume that $X_0$ is equal to zero. Then the process evolves as follows:
$$X_1 = \varepsilon_1, \qquad (A.16)$$
$$X_2 = X_1 + \varepsilon_2 = \varepsilon_1 + \varepsilon_2, \qquad (A.17)$$
and so on. We have by successive substitution
$$X_t = \sum_{i=1}^{t} \varepsilon_i. \qquad (A.18)$$
Hence $E(X_t) = t\mu$ and $Var(X_t) = t\sigma^2$. Since the mean and variance change with $t$, the process is nonstationary, but its first difference is stationary. Referring to share prices, this says that the changes in a share price will be a purely random process.
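A minimal simulation sketch of (A.15)-(A.18) with arbitrary illustrative values: the level wanders, while its first difference is simply the purely random series.

```python
import numpy as np

rng = np.random.default_rng(8)
t, mu, sigma = 250, 0.0, 1.0                    # illustrative values

eps = rng.normal(mu, sigma, size=t)             # purely random increments
x = np.cumsum(eps)                              # X_t = eps_1 + ... + eps_t, X_0 = 0

# The level X_t is nonstationary; its first difference recovers the increments.
print(x[-1], np.allclose(np.diff(x), eps[1:]))
```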
A.2 Continuous Distributions
Definition 38 (Continuous distribution) For a random variable X, if there exists a nonnegative function f(x), given on the real line, such that for any interval B,
$$P(X \in B) = \int_B f(x)\,dx,$$
then X is said to have a continuous distribution and the function f(x) is called the probability density function (PDF) or simply density function.
A.2.1 Beta Distribution
The density function for this distribution has the form
$$f(x; \alpha, \beta) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{\int_0^1 x^{\alpha-1}(1-x)^{\beta-1}\,dx}; \quad 0 < x < 1, \; \alpha > 0, \; \beta > 0. \qquad (A.19)$$
The denominator is known as the Beta function. This distribution, $B(\alpha, \beta)$, reduces to the uniform distribution for $\alpha = \beta = 1$.
The mean and variance of a Beta distribution are:
$$E(X) = \frac{\alpha}{\alpha+\beta}, \qquad (A.20)$$
$$Var(X) = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}. \qquad (A.21)$$
A.2.2 Cauchy Distribution
The standard Cauchy distribution has the density function
$$f(x) = \frac{1}{\pi(1+x^2)}; \quad -\infty < x < \infty. \qquad (A.22)$$
The Cauchy distribution is related to the normal distribution: it arises when the ratio of two independent normal variates is computed. Although it looks similar to the normal, it has much fatter tails. In fact, its mean, variance and higher moments do not exist.
A.2.3 Chi-Square Distribution
If $Z_1, Z_2, \ldots, Z_n$ are independent $N(0, 1)$ variables, and $X = \sum_{i=1}^n Z_i^2$, then the probability density function of the chi-square distributed variable $X$ is
$$f(x; n) = \frac{(x/2)^{n/2-1} e^{-x/2}}{2\,\Gamma(n/2)}; \quad x > 0, \; n = 1, 2, \ldots \qquad (A.23)$$
where $\Gamma(n/2)$ is the Gamma function; the $\chi^2_n$ distribution is a Gamma distribution with $\alpha = n/2$ and $\beta = 2$.
The mean and variance of a chi-square distribution are:
$$E(X) = n, \qquad (A.24)$$
$$Var(X) = 2n. \qquad (A.25)$$
A.2.4 Exponential Distribution
The distribution
$$f(x; \beta) = \frac{1}{\beta} e^{-x/\beta}; \quad x > 0, \; \beta > 0, \qquad (A.26)$$
is called the exponential distribution. The special feature of an exponential distribution is that it is memoryless: $\Pr(X > m + n \,|\, X \geq m) = \Pr(X > n)$. The increments (interarrival times) of a Poisson process with arrival rate $\lambda$ have an exponential distribution with parameter $\beta = 1/\lambda$. Also, the exponential is a special case of the Gamma distribution $\Gamma(\alpha, \beta)$ with $\alpha = 1$.
The mean and variance of the exponential distribution are:
$$E(X) = \beta, \qquad (A.27)$$
$$Var(X) = \beta^2. \qquad (A.28)$$
A.2.5 F Distribution
If $x = (w_1/m)/(w_2/n)$, where $w_1 \sim \chi^2(m)$ and $w_2 \sim \chi^2(n)$ are independent, then $x \sim F(m, n)$ with the following density function:
$$f(x; m, n) = \left(\frac{m}{n}\right)^{m/2} \frac{\Gamma\!\left(\frac{m+n}{2}\right)}{\Gamma\!\left(\frac{m}{2}\right)\Gamma\!\left(\frac{n}{2}\right)} \frac{x^{(m-2)/2}}{\left[1 + \frac{m}{n}x\right]^{(m+n)/2}}; \quad x > 0. \qquad (A.29)$$
That is, the ratio of two independent chi-square variables, each divided by its degrees of freedom, has the Snedecor F distribution, with numerator and denominator degrees of freedom equal to those of the respective chi-squares.
$$E(X) = \frac{n}{n-2}, \quad n > 2, \qquad (A.30)$$
$$Var(X) = \frac{2n^2(m+n-2)}{m(n-2)^2(n-4)}, \quad n > 4. \qquad (A.31)$$
A.2.6 Gamma Distribution
This distribution has the density function
$$f(x; \alpha, \beta) = \frac{1}{\beta^\alpha \Gamma(\alpha)} x^{\alpha-1} e^{-x/\beta}; \quad x > 0, \; \alpha > 0, \; \beta > 0. \qquad (A.32)$$
When $\alpha = 1$, the Gamma distribution reduces to the exponential distribution.
The mean and variance of a Gamma distribution are as follows:
$$E(X) = \alpha\beta, \qquad (A.33)$$
$$Var(X) = \alpha\beta^2. \qquad (A.34)$$
A.2.7 Lognormal Distribution
A random variable $X$ is said to be lognormally distributed if $\ln(X) \sim N(\mu, \sigma^2)$. A lognormal distribution has a density function defined by
$$f_X(x; \mu, \sigma^2) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left[-\frac{(\ln x - \mu)^2}{2\sigma^2}\right] \frac{1}{x}; \quad 0 \leq x < \infty. \qquad (A.35)$$
Because $\ln(X)$ is defined only for positive $X$ and most economic variables take only positive values, this distribution is very popular in economics. It has been used to model the size of firms, stock prices at the end of a trading day, income distributions, expenditure on particular commodities, and certain commodity prices. Although it performs well in the center of the distribution, one of its shortcomings is that it cannot explain well the frequency of extreme events observed in data.
The mean and variance of a lognormal distribution are as follows:
$$E(X) = \exp[\mu + (\sigma^2/2)], \qquad (A.36)$$
$$Var(X) = \exp[2(\mu + \sigma^2)] - \exp[2\mu + \sigma^2]. \qquad (A.37)$$
A.2.8 Normal Distribution
A random variable $X$ with the density function
$$f(x; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\frac{(x-\mu)^2}{\sigma^2}} \qquad (A.38)$$
is called a normal (Gaussian) distributed variable. The integral in the cdf of the standard normal distribution does not have a closed-form solution but requires numerical integration.
The mean and variance of a normal distribution are as follows:
$$E(X) = \mu, \qquad (A.39)$$
$$Var(X) = \sigma^2. \qquad (A.40)$$
A.2.9 Student's t Distribution
If $Z \sim N(0, 1)$ and $W \sim \chi^2_{(n)}$, with $Z$ and $W$ independent, and $X = Z/\sqrt{W/n}$, then the probability density function of $X$ is
$$f(x; n) = \frac{1}{\sqrt{n}} \frac{\Gamma\!\left(\frac{n+1}{2}\right)}{\Gamma\!\left(\frac{n}{2}\right)\Gamma\!\left(\frac{1}{2}\right)} \left(1 + \frac{x^2}{n}\right)^{-\frac{1}{2}(n+1)}. \qquad (A.41)$$
The probability density function is symmetric, centered at zero, and similar in shape to a standard normal probability density function. In fact, as $n \to \infty$ the t distribution converges to a standard normal.
The mean and variance of a t distribution are as follows:
$$E(X) = 0, \quad n > 1, \qquad (A.42)$$
$$Var(X) = \frac{n}{n-2}, \quad n > 2. \qquad (A.43)$$
A.2.10 Uniform Distribution on an Interval
A random variable $X$ with the density function
$$f(x; a, b) = \frac{1}{b-a} \qquad (A.44)$$
on the interval $a \leq X \leq b$ is called the uniform distribution on an interval.
The mean and variance of a uniform distribution are as follows:
$$E(X) = \frac{a+b}{2}, \qquad (A.45)$$
$$Var(X) = \frac{(b-a)^2}{12}. \qquad (A.46)$$
A.2.11 Extreme Value Distribution (Gompertz Distribution)
For modeling extreme values, such as the peak electricity demand in a day, maximum rainfall, and so on, we can use the extreme value distribution which, in its standard form, has the following density:
$$f(x) = e^{-x}\exp\!\left[-e^{-x}\right], \quad -\infty < x < \infty. \qquad (A.47)$$
A.2.12 Geometric Distribution
The density function for this distribution is
$$f(x; \theta) = \theta x^{\theta-1}; \quad 0 < x < 1, \; \theta > 0. \qquad (A.48)$$
A.2.13 Logistic Distribution
This distribution has the following density function:
$$f(x) = \frac{e^{x}}{(1+e^{x})^2}, \quad -\infty < x < \infty. \qquad (A.49)$$
A.2.14 Pareto Distribution
The density function for this distribution is
$$f(x) = \frac{\alpha}{x_0}\left(\frac{x_0}{x}\right)^{\alpha+1}, \quad x > x_0, \; \alpha > 0. \qquad (A.50)$$
Although the lognormal distribution is often used to model the distribution of incomes, it has been found to approximate incomes in the middle range very well but to fail in the upper tail. A more appropriate distribution for this purpose is the Pareto distribution.
A.2.15 Weibull Distribution
In some more general situations the conditions for the exponential distribution are not met. An exponential distribution provides an appropriate model for the lifetime of a piece of equipment, but, being memoryless, it is not suitable for the lifetime of a human population. The Weibull distribution has the density function
$$f(x; a, b) = abx^{b-1}e^{-ax^b}, \quad x > 0, \; a, b > 0. \qquad (A.51)$$
Note that when $b = 1$, this reduces to the exponential distribution.
Appendix B
Multivariate distributions
B.1 Bivariate Distributions
In most cases, the outcome of an experiment may be characterized by more than one variable.
For instance, X may be the income, Y the total expenditures of a household, and Z be family
size. We observe (X, Y, Z).
Definition 39 (Joint Distribution Function) Let X and Y be two random variables. Then the function $F_{XY}(x, y) = P(X \leq x \text{ and } Y \leq y)$ is called the joint distribution function.
1) $F_{XY}(x, \infty) = F_X(x)$ and $F_{XY}(\infty, y) = F_Y(y)$.
2) $F_{XY}(-\infty, y) = F_{XY}(x, -\infty) = 0$.
Definition 40 (Joint Probability Density Function)
Discrete probability function:
$$f_{XY}(x, y) = P(X = x, Y = y). \qquad (B.1)$$
Continuous probability function:
$$f_{XY}(x, y) = \frac{\partial^2 F(x, y)}{\partial x\, \partial y}, \qquad (B.2)$$
and hence
$$F_{XY}(x, y) = \int_{-\infty}^{x}\int_{-\infty}^{y} f_{XY}(u, v)\,du\,dv. \qquad (B.3)$$
In the univariate case, if $\Delta x$ is a small increment of $x$, then $f_X(x)\Delta x$ is the approximate probability that $x - (1/2)\Delta x < X \leq x + (1/2)\Delta x$. Similarly, in a bivariate distribution, $f_{XY}(x, y)\Delta x\Delta y$ is the approximate probability that $x - (1/2)\Delta x < X \leq x + (1/2)\Delta x$ and $y - (1/2)\Delta y < Y \leq y + (1/2)\Delta y$. The bivariate density function satisfies the conditions $f_{XY}(x, y) \geq 0$ and $\int\!\!\int dF(x, y) = 1$, where $dF(x, y)$ is the bivariate analog of $dF(x)$.
Definition 41 (Marginal Density Function) If X and Y are discrete random variables, then $f_X(x) = \sum_Y f_{XY}(x, y)$ is the marginal density of X, and $f_Y(y) = \sum_X f_{XY}(x, y)$ is the marginal density of Y. In the continuous case, $f_X(x) = \int f_{XY}(x, y)\,dy$ is the marginal density of X and $f_Y(y) = \int f_{XY}(x, y)\,dx$ is the marginal density of Y.
Definition 42 (Conditional Density Function) The conditional density of Y given X = x is defined as $f(y|x) = f(x, y)/f(x)$, provided $f(x) \neq 0$. The conditional density of X given Y = y is defined as $f(x|y) = f(x, y)/f(y)$, provided $f(y) \neq 0$. This definition holds for both discrete and continuous random variables.
Definition 43 (Statistical Independence) The random variables X and Y are said to be statistically independent if and only if $f(y|x) = f(y)$ for all values of X and Y for which $f(x, y)$ is defined. Equivalently, $f(x|y) = f(x)$ and $f(x, y) = f(x)f(y)$.
Theorem 44 Random variables X and Y with joint density function f(x, y) will be sta-
tistically independent if and only if f(x, y) can be written as a product of two nonnegative
functions, one in X alone and another in Y alone.
Theorem 45 If X and Y are statistically independent and a, b, c, d are real constants with
a < b and c < d, then P(a < X < b, c < Y < d) = P(a < X < b)P(c < Y < d).
B.1.1 The Bivariate Normal Distribution
Let (X, Y) have the joint density
$$f(x, y) = \frac{1}{2\pi\sigma_x\sigma_y\sqrt{1-\rho^2}} \exp\!\left\{-\frac{1}{2(1-\rho^2)}\left[\left(\frac{x-\mu_x}{\sigma_x}\right)^2 - 2\rho\,\frac{(x-\mu_x)(y-\mu_y)}{\sigma_x\sigma_y} + \left(\frac{y-\mu_y}{\sigma_y}\right)^2\right]\right\} \qquad (B.4)$$
for $-\infty < x < \infty$, $-\infty < y < \infty$, $-\infty < \mu_X < \infty$, $-\infty < \mu_Y < \infty$, $\sigma_X, \sigma_Y > 0$ and $-1 < \rho < 1$. Then (X, Y) is said to have the bivariate normal distribution.
Theorem 46 If (X, Y) is bivariate normal, then the marginal distribution of X is $N(\mu_X, \sigma^2_X)$ and that of Y is $N(\mu_Y, \sigma^2_Y)$. The converse of this theorem need not be true; that is, if the marginal distribution of X is univariate normal, the joint density of X and Y need not be bivariate normal.
Theorem 47 For a bivariate normal, the conditional density of Y given X = x is univariate normal with mean $\mu_Y + \rho(\sigma_Y/\sigma_X)(x - \mu_X)$ and variance $\sigma^2_Y(1-\rho^2)$. The conditional density of X given Y = y is also normal with mean $\mu_X + \rho(\sigma_X/\sigma_Y)(y - \mu_Y)$ and variance $\sigma^2_X(1-\rho^2)$.
In the case of the bivariate normal density, the conditional expectation $E(Y|X)$ is of the form $\alpha + \beta X$, where $\alpha$ and $\beta$ depend on the respective means, standard deviations and the correlation coefficient. This is a simple linear regression in which the conditional expectation is a linear function of X.
B.1.2 Mixture Distributions
If the distribution of a random variable depends on parameters or variables which themselves depend on other random variables, then we say that we have a mixture distribution. This might take the form $f(x; \theta)$ where $\theta$ depends on a random variable, or the form $f(x|y)$, where Y is another random variable. The unobserved heterogeneity in hazard models is an example of the latter: the density function for the duration $t$, $f(t|v)$, is conditional on an unobserved heterogeneity term which is itself a random variable.
B.2 Multivariate Density Functions
The joint density function of $X_1, X_2, \ldots, X_n$ has the form $f(x_1, x_2, \ldots, x_n)$. If the X's are continuous random variables,
$$f_X(x_1, x_2, \ldots, x_n) = \frac{\partial^n F_X(x_1, x_2, \ldots, x_n)}{\partial x_1\,\partial x_2 \cdots \partial x_n}. \qquad (B.5)$$
B.2.1 The Multivariate Normal Distribution
Definition 48 (Mean vector) Let $X' = (X_1, X_2, \ldots, X_n)$ be an n-dimensional vector random variable defined on $\mathbb{R}^n$ with a density function $f(x)$, $E(X_i) = \mu_i$, and $\mu' = (\mu_1, \mu_2, \ldots, \mu_n)$. Then the mean of the distribution is $\mu = E(X)$, where $\mu$ and $E(X)$ are $n \times 1$ vectors, and hence $E(X - \mu) = 0$.
Definition 49 (Covariance (Variance) Matrix) The covariance between $X_i$ and $X_j$ is defined as $\sigma_{ij} = E[(X_i - \mu_i)(X_j - \mu_j)]$, where $\mu_i = E(X_i)$. The matrix
$$Var(X) = \Sigma = \begin{pmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1n} \\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{n1} & \sigma_{n2} & \cdots & \sigma_{nn} \end{pmatrix} \qquad (B.6)$$
also denoted $Var(X)$, is called the covariance matrix of X. In matrix notation, this can be expressed as $\Sigma = E[(X - \mu)(X - \mu)']$, where $(X - \mu)$ is $n \times 1$.
Note that the diagonal elements are variances.
Properties:
1. If $Y_{m\times 1} = A_{m\times n} X_{n\times 1} + b_{m\times 1}$, then $E(Y) = A\mu + b$.
2. $\Sigma$ is a symmetric positive semi-definite matrix.
3. $\Sigma$ is positive definite if and only if it is nonsingular.
4. $\Sigma = E[XX'] - \mu\mu'$.
5. If $Y = AX + b$, then the covariance matrix of Y is $A\Sigma A'$.
B.2.2 Standard multivariate normal density
Let $X_1, X_2, \ldots, X_n$ be n independent random variables, each of which is $N(0, 1)$. Then their joint density function is the product of the individual density functions and is the standard multivariate normal density:
$$f_X(x_1, x_2, \ldots, x_n) = \left(\frac{1}{\sqrt{2\pi}}\right)^n \exp\!\left[-\frac{\sum_{i=1}^n x_i^2}{2}\right] = \left(\frac{1}{2\pi}\right)^{n/2} \exp\!\left[-\frac{x'x}{2}\right]. \qquad (B.7)$$
We have the density function of the general multivariate normal distribution $N(\mu, \Sigma)$ as
$$f_Y(y) = \frac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}} \exp\!\left[-\frac{(y-\mu)'\Sigma^{-1}(y-\mu)}{2}\right]. \qquad (B.8)$$
Properties:
1) If Y is multivariate normal, then $Y_1, Y_2, \ldots, Y_n$ will be independent if and only if $\Sigma$ is diagonal.
2) A linear combination of multivariate normal random variables is also multivariate normal. More specifically, let $Y \sim N(\mu, \Sigma)$. Then $Z = AY \sim N(A\mu, A\Sigma A')$, where A is an $n \times n$ matrix.
3) If $Y \sim N(\mu, \Sigma)$ and $\Sigma$ has rank $k < n$, then there exists a nonsingular $k \times k$ matrix $A$ such that the $k \times 1$ vector $X = [A^{-1}\; O](Y - \mu)$ is a k-variate normal with zero mean and covariance matrix $I_k$, where O is a $k \times (n-k)$ matrix of zeros.
B.2.3 Marginal and Conditional Distributions of $N(\mu, \Sigma)$
Let $Y \sim N(\mu, \Sigma)$, and consider the following partition
$$Y = \begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix}; \quad \mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}; \quad \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}; \qquad (B.9)$$
where the n random variables are partitioned into $n_1$ and $n_2$ variates ($n_1 + n_2 = n$).
Theorem 50 Given the above partition, the marginal distribution of $Y_1$ is $N(\mu_1, \Sigma_{11})$ and the conditional density of $Y_2$ given $Y_1$ is multivariate normal with mean $\mu_2 + \Sigma_{21}\Sigma_{11}^{-1}(Y_1 - \mu_1)$ and covariance matrix $\Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12}$.
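A hedged numerical sketch of Theorem 50's conditional-mean formula for a bivariate case (the mean vector and covariance matrix are arbitrary illustrative values, and the conditioning is approximated by a narrow window):

```python
import numpy as np

rng = np.random.default_rng(9)
mu = np.array([1.0, 2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])                   # illustrative values

y = rng.multivariate_normal(mu, Sigma, size=200_000)
y1, y2 = y[:, 0], y[:, 1]

# Theorem 50: E(Y2 | Y1 = y1*) = mu2 + Sigma21 Sigma11^{-1} (y1* - mu1).
y1_star = 2.0
theory = mu[1] + Sigma[1, 0] / Sigma[0, 0] * (y1_star - mu[0])
near = np.abs(y1 - y1_star) < 0.05               # crude conditioning window
print(theory, y2[near].mean())
```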
Problems
1. Let $Y_1$ and $Y_2$ have the bivariate uniform distribution
$$f(y_1, y_2) = \begin{cases} 1, & 0 \leq y_1 \leq 1; \; 0 \leq y_2 \leq 1 \\ 0, & \text{otherwise.} \end{cases}$$
(a) Sketch the probability density surface.
(b) Find $F(0.2, 0.4)$.
(c) Find $P(0.1 \leq Y_1 \leq 0.3, \; 0 \leq Y_2 \leq 0.5)$.
2. Let
$$f(y_1, y_2) = \begin{cases} 2y_1, & 0 \leq y_1 \leq 1; \; 0 \leq y_2 \leq 1 \\ 0, & \text{otherwise.} \end{cases}$$
(a) Sketch $f(y_1, y_2)$.
(b) Find the marginal density functions for $Y_1$ and $Y_2$.
Appendix C
Practice Exams
C.1 Exam I
This exam has 20 multiple choice questions. Answer ALL questions in this exam, by circling
the letter for the best answer in this exam booklet. You can use the margins for your rough
work or calculations but you are not required to provide any explanation for your choice.
For each correct answer you will score 5 points. An incorrect answer or an unanswered
question will score zero points. The maximum score in this exam is 100.
The pass mark for this exam is 50.
Good luck!
Questions
1. A (B C) can be expressed as
(a) (A B) (A C)
(b) (A B) (A C)
(c) (A B) (A C)
(d) (A B) (A C).
2. For a positively skewed distributed random variable, X, the following can be written
(a) Mean(X) < Median(X)
(b) Mean(X) > Median(X)
(c) Mean(X) < Mode(X)
(d) Median(X) < Mode(X).
3. For $P(A|B)$, Bayes' Theorem can be written as:
(a) $\dfrac{P(A \cap B)}{P(A)}$
(b) $\dfrac{P(A \cap B)}{P(A \cap B^c) + P(A \cap B)}$
(c) $\dfrac{P(A \cap B)}{P(B)}$
(d) $\dfrac{P(A \cap B)}{P(A \cap B) + P(A^c \cap B)}$.
4. Let $f(x)$ be the probability density function of $X$; then $\int_{-\infty}^{\infty} f(x)\,dx$ will be equal to:
(a) 0
(b) 1
(c) $E(X)$
(d) none of the above.
5. If the random variables X and Y are negatively correlated, then the relationship between $Var(X+Y)$ and $Var(X-Y)$ can be expressed as:
(a) $Var(X+Y) > Var(X-Y)$
(b) $Var(X+Y) \geq Var(X-Y)$
(c) $Var(X+Y) < Var(X-Y)$
(d) $Var(X+Y) \leq Var(X-Y)$.
6. Consider a population consisting of five values - 1, 2, 2, 4, 8. Calculate the population variance, $\sigma^2$.
(a) $\sigma^2 < 2$
(b) $2 \leq \sigma^2 < 3$
(c) $3 \leq \sigma^2 < 4$
(d) $\sigma^2 \geq 4$.
7. Which of the following is a random variable?
(a) The population mean
(b) The sample mean
(c) The sample size
(d) All of the above.
8. Let $X_1, \ldots, X_n$ and $Y_1, \ldots, Y_m$ be independent random samples with $X_i \sim N(\mu_1, \sigma_1^2)$ and $Y_i \sim N(\mu_2, \sigma_2^2)$, with $\sigma_1^2 = 1$, $\sigma_2^2 = 2$ and $n = m$. Find the sample size $n$ such that the standard error of $(\bar{X}_n - \bar{Y}_m)$ equals 0.5.
(a) $n < 10$
(b) $10 \leq n < 20$
(c) $20 \leq n < 30$
(d) $n \geq 30$.
9. Which of the following is correct?
(a) Convergence in probability implies almost sure convergence
(b) Almost sure convergence implies convergence in probability
(c) Both statement (A.) and statement (B.) are incorrect
(d) Both statement (A.) and statement (B.) are correct.
10. The central limit theorem implies that:
(a) A standardized population mean will converge in distribution to a normal distri-
bution
(b) A standardized population mean will converge in probability to its true value.
(c) A standardized sample mean will converge in distribution to a normal distribution
(d) A standardized sample mean will converge in probability to its true value.
11. An estimator is called unbiased if:
(a) It converges in probability to its population parameter
(b) Its expected value is equal to its population parameter
(c) There exists no other estimator of the same population parameter with a lower
variance
(d) All of the above is true.
12. Consider the following conditional density function
$$f(y \,|\, x) = \frac{2y}{x^2} \quad \text{for } 0 \leq x \leq 1 \text{ and } 0 \leq y \leq x.$$
Calculate $P[(1/4) < Y < (1/2) \,|\, X = (5/8)]$.
(a) 0.48
(b) 0.52
(c) 0.58
(d) 0.69
13. Let $\hat{\theta}_1$ and $\hat{\theta}_2$ be two unbiased estimators of $\theta$, with $Var(\hat{\theta}_1) < Var(\hat{\theta}_2)$. A new estimator is defined as $\hat{\theta}_3 = a\hat{\theta}_1 + (1-a)\hat{\theta}_2$, with $a$ chosen in such a way as to minimize $Var(\hat{\theta}_3)$. Which of the following is correct:
(a) $Var(\hat{\theta}_3) \leq Var(\hat{\theta}_1)$
(b) If $Cov(\hat{\theta}_1, \hat{\theta}_2) = 0$, then $a = 0.5$
(c) $\hat{\theta}_3$ is a biased estimator of $\theta$
(d) All of the above statements are incorrect.
14. In hypothesis testing a type I error occurs when:
(a) The null hypothesis is rejected when it is in fact not true
(b) The null hypothesis is not rejected when it is in fact not true
(c) The null hypothesis is rejected when it is in fact true
(d) The null hypothesis is not rejected when it is in fact true.
15. Suppose a single observation X is drawn from a uniform density on [0, θ]. Consider the hypothesis test H0: θ = 1 vs. H1: θ = 2. H0 is rejected if X > 1. Which of the following is correct:
(a) The size of this test is 0
(b) The power of this test is 0.5
(c) Both statements A. and B. are correct
(d) Both statements A. and B. are incorrect.
16. The events A and B form a partition of the sample space S. Calculate P(A | B) + P(A | B̄).
(a) 0
(b) 1/4
(c) 1
(d) None of the above.
17. Let A, B and C be events and assume the following holds: P(A) = P(B) = P(C) = 1/4, A ⊂ B, and A and C are mutually exclusive. Calculate P(A ∪ B ∪ C).
(a) 1/4
(b) 3/8
(c) 1/2
(d) 5/8
(e) 3/4.
18. Let
    xi       −1    0     1
    p(xi)    1/8   3/8   1/2
Calculate F_X(0) + F_X(1/2).
(a) 3/8
(b) 1/2
(c) 3/4
(d) 7/8
(e) 1.
19. Suppose the pdf of the random variable X is given by
f(x) = 2(1 − x) for 0 < x < 1, and 0 otherwise.
Find F(x).
(a) 3x − 2
(b) 2(x − 1)
(c) x(3 − 2x)
(d) x(2 − x).
20. A student is answering a multiple choice exam with 20 questions. The probability that he answers any single question correctly is 0.25. Find the threshold c such that Pr[p̂ > c] = 0.025, where p̂ is the realized fraction of questions that has been answered correctly.
(a) c ≤ 0.20
(b) 0.20 < c ≤ 0.30
(c) 0.30 < c ≤ 0.40
(d) 0.40 < c ≤ 0.50
(e) c > 0.50.
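For a question that asks for a numerical threshold, such as Question 20 above, a short script is a convenient way to check an answer against the exact binomial tail (a sketch only; it assumes the 20 answers are independent with success probability 0.25, and c denotes the candidate threshold):

    from math import comb

    n, p = 20, 0.25

    def tail_above(c):
        # P(p_hat > c) = P(X > c*n) for X ~ Binomial(n, p)
        return sum(comb(n, k) * p**k * (1 - p)**(n - k)
                   for k in range(n + 1) if k > c * n)

    for c in (0.20, 0.30, 0.40, 0.50):
        print(c, round(tail_above(c), 4))
    # The 0.025 cut-off falls between c = 0.40 and c = 0.50.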
C.2 Exam II
This exam has 20 multiple choice questions. Answer ALL questions by circling the letter of
the best answer in this exam booklet. You can use the margins for your rough
work or calculations but you are not required to provide any explanation for your choice.
For each correct answer you will score 5 points. An incorrect answer or an unanswered
question will score zero points. The maximum score in this exam is 100.
The pass mark for this exam is 50.
Good luck!
Questions
Q1 & Q2: Let X1, X2, . . . , Xn be a random sample with the following probability density function: f(x) = 2x⁻³ for x > 1.
1. Compute E(X).
(a) 1 ≤ E(X) < 1.3
(b) 1.3 ≤ E(X) < 1.6
(c) 1.6 ≤ E(X) < 1.9
(d) 1.9 ≤ E(X) < 2.2
(e) 2.2 ≤ E(X) < 2.5
(f) E(X) ≥ 2.5
2. Compute the median M = Median(X).
(a) 1 ≤ M < 1.3
(b) 1.3 ≤ M < 1.6
(c) 1.6 ≤ M < 1.9
(d) 1.9 ≤ M < 2.2
(e) 2.2 ≤ M < 2.5
(f) M ≥ 2.5
Q3. & Q4. Suppose a manufacturer of TV tubes draws a random sample of 10 tubes. The
probability that a single tube is defective is 10 percent.
3. Calculate the probability of having exactly 3 defective tubes.
(a) 0.057
(b) 0.194
(c) 0.930
(d) 0.987
4. Calculate the probability of having no more than 2 defectives.
(a) 0.057
(b) 0.194
(c) 0.930
(d) 0.987
Q5. & Q6. The ages of a group of executives attending a convention are uniformly distributed between 35 and 65 years. If X denotes age in years, the probability density function is
f(x) = 1/30 for 35 < X < 65, and 0 otherwise.
5. Find the probability that the age of a randomly chosen executive in this group is
between 40 and 50 years.
(a) 1/4
(b) 1/3
(c) 1/2
(d) 2/3
(e) 3/4
6. Find the mean age of executives in the group.
(a) 42
(b) 47
(c) 50
(d) 52
(e) 59
7. Let A and B be two events such that P(A ∪ B) = 0.9, P(A | B) = 0.625, and P(A | B̄) = 0.5. Calculate P(A).
(a) P(A) = 0.10
(b) P(A) = 0.25
(c) P(A) = 0.45
(d) P(A) = 0.50
(e) P(A) = 0.60
8. Consider the following two statements
(I.) The probability of the union of the events A and B can be written as P(A ∪ B) = P(A) + P(B)[1 − P(A | B)]
(II.) In a sample, the observations of a random variable can be seen as degenerated
values of its marginal distribution (not all equal to each other, of course).
(a) Only statement (I.) is correct
(b) Only statement (II.) is correct
(c) Both statement (I.) and (II.) are correct
(d) Neither statement (I.) or (II.) are correct.
9. Consider the following two statements
(I.) Unbiasedness is related to the number of observations in each sample, while
consistency is related to the number of samples.
(II.) Consider a bivariate sample X, Y with sample size n drawn independently
from a population. Because the sample is independent the correlation between
X and Y must be zero.
(a) Only statement (I.) is correct
(b) Only statement (II.) is correct
(c) Both statement (I.) and (II.) are correct
(d) Neither statement (I.) or (II.) are correct.
10. Consider the following two statements
(I.) The following equality always holds
F(y) = P(Y ≤ y | X ≤ x)F(x) + P(Y ≤ y | X > x)(1 − F(x))
(II.) One normally tries to minimise the probability of committing a type II error
after choosing an acceptable value for the probability of a type I error in testing
procedures.
(a) Only statement (I.) is correct
(b) Only statement (II.) is correct
(c) Both statement (I.) and (II.) are correct
(d) Neither statement (I.) or (II.) are correct.
11. For a negatively skewed distributed random variable, X, the following can be written
(a) Mean(X) < Median(X)
(b) Mean(X) > Median(X)
(c) Mean(X) > Mode(X)
(d) Median(X) > Mode(X).
12. Let P(A | B) > 0. Which of the following is correct?
(a) P(A | B) = P(A ∩ B | B)
(b) P(A | B) = P(A ∩ B | B) + P(A ∩ B̄ | B)
(c) P(A | B) > P(A ∩ B̄ | B)
(d) All of the above.
Q. 13 & Q. 14 Consider the following frequency distribution for the random variables X and Y.

             Y = 1   Y = 2   Y = 3   Y = 4   Total
    X = 20     218     302     198     660    1378
    X = 30     125     411     305     310    1151
    X = 40     201     256     287     327    1071
    Total      544     969     790    1297    3600
13. Calculate Z1 = E(X | Y = 2).
(a) 20 ≤ Z1 < 25
(b) 25 ≤ Z1 < 30
(c) 30 ≤ Z1 < 35
(d) 35 ≤ Z1 < 40
14. Calculate Z2 = E(Y | X = 40).
(a) 2 ≤ Z2 < 2.2
(b) 2.2 ≤ Z2 < 2.4
(c) 2.4 ≤ Z2 < 2.6
(d) 2.6 ≤ Z2 < 2.8
(e) 2.8 ≤ Z2 < 3.0
15. A ∩ (B ∪ C) can be expressed as
(a) (A ∩ B) ∪ (A ∩ C)
(b) (A ∪ B) ∩ (A ∪ C)
(c) (A ∩ B) ∩ (A ∩ C)
(d) (A ∪ B) ∪ (A ∪ C).
16. Consider the following conditional density function
f(y | x) = c1 y / x²  for 0 ≤ x ≤ 1 and 0 ≤ y ≤ x.
Calculate c1.
(a) 2
(b) 0.5
(c) x
(d) 1
17. Let X1, X2, . . . , Xn be a random sample from a Poisson distribution with probability function
f(x) = e^{−λ} λ^x / x!   for λ > 0 and x = 0, 1, 2, ...
The maximum likelihood estimator of λ equals
(a) −λ + xi ln(λ) − ln(xi!)
(b) n⁻¹ Σ_{i=1}^{n} xi
(c) −1 + xi / λ
(d) Σ_{i=1}^{n} [−λ + xi ln(λ) + ln(xi!)]
18. If P(A ∩ B) ≥ 0, then the following inequality can be written
(a) P(A ∪ B) ≤ P(A) + P(B)
(b) P(A) + P(B) ≤ P(A ∩ B)
(c) P(A ∩ B) ≥ P(A)P(B)
(d) P(A) ≤ P(A ∩ B) ≤ P(B).
19. Consider the following two statements
(I.) The Cramér-Rao Inequality establishes a lower bound for the variance of an unbiased estimator of θ. However, it does not necessarily imply that the variance of the minimum variance unbiased estimator of θ has to be equal to the Cramér-Rao Lower Bound.
(II.) Non-random samples may not be used to make scientific inferences.
(a) Only statement (I.) is correct
(b) Only statement (II.) is correct
(c) Both statement (I.) and (II.) are correct
(d) Neither statement (I.) or (II.) are correct.
20. Which of the following statements about measures of location is correct
(a) Harmonic means are generally appropriate to use for samples consisting of ratios
(b) Arithmetic means are generally appropriate to use for samples that include ex-
tremely low or high values
(c) Geometric means are generally appropriate to use for samples consisting of pro-
portions
(d) None of the above.
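Conditional expectations read off a frequency table, as in Questions 13 and 14 above, are easy to verify with a few lines of code (a minimal sketch that simply re-uses the counts from the table):

    # Frequency table from Questions 13 and 14: rows are X = 20, 30, 40 and
    # columns are Y = 1, 2, 3, 4.
    freq = {
        20: [218, 302, 198, 660],
        30: [125, 411, 305, 310],
        40: [201, 256, 287, 327],
    }
    y_values = [1, 2, 3, 4]

    # E(X | Y = 2): weight each X value by its count in the Y = 2 column.
    col = {x: counts[1] for x, counts in freq.items()}
    print(sum(x * c for x, c in col.items()) / sum(col.values()))   # about 29.5

    # E(Y | X = 40): weight each Y value by its count in the X = 40 row.
    row = freq[40]
    print(sum(y * c for y, c in zip(y_values, row)) / sum(row))     # about 2.69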