
Chapter 1 Introduction

Statistics is learning from data: the science of designing studies and analysing their results. Statistics uses a lot of mathematics, and some like statistics because of its challenging mathematics. Statistics also involves many other skills that arise when applying statistical thinking in practice, and some like statistics because of its usefulness in a range of interesting applications. Statistics is pervasive: everyone has data they need to analyse! Statistical reasoning can give you a completely new perspective on everyday things you hear about that involve data: news stories, ads, political campaigns, surveys... in fact it has been argued that an understanding of statistics is essential to truly understand the world around you.

Example Shortly before the last New South Wales (NSW) state election, a poll was held asking a selection of NSW voters who they would vote for. Out of 365 respondents, 65% said they would vote for the Liberal party ahead of Labour. Some important questions where statistics can help are: How accurate is this estimate of the proportion of people voting for Liberal ahead of Labour? Is this sample of just 365 NSW voters sufficient to predict that the Liberal party would win the election?


Example Does smoking while pregnant affect the cognitive development of the foetus? Johns et al (1993) conducted a study to look at this question using guinea pigs as a model. They planned to inject nicotine tartrate in a saline solution into some pregnant guinea pigs, inject no nicotine into others, and compare the cognitive development of offspring by getting them to complete a simple maze where they look for food. Some important questions where statistics can help are: How many guinea pigs should they include in the experiment? How should the "no nicotine" treatment have been applied? How should the data be analysed to assess the effect of nicotine on cognitive development?

Some other examples of questions we will answer in this course: Is climate change affecting species distributions? What's your chance of winning the lottery? Is it worth playing? Is there gender bias in promotion? Does praying for a patient improve their rate of post-operative recovery? Is Sydney getting less rainfall than it used to? Statistics is of fundamental importance in a wide range of disciplines, and as such statistical skills are highly valued in the job market. Some major examples are in research (studying whether a new treatment works, understanding the effects of a new pest on wildlife) and in business (studying and predicting sales patterns, trialling new products). There is a serious shortage of statisticians in the jobs market, which is good news for you if you choose to do a stats major!

Key themes in statistics


Below are some key ideas in statistics: themes that will come up repeatedly in this course, and indeed in the statistics major.


Sample vs population
The population is the total set of subjects we are interested in. A sample is the subset of the population for which we have data. A census is a study which obtains data on the whole population. The vast majority of studies use a sample rather than a census, for logistical reasons: it is easier to get data on just a sample, and you can often get better data by spending a lot of time with a few subjects rather than spending a little time with a lot.

Example Consider the guinea pig study of the effects of smoking during pregnancy on offspring. The study used a sample of guinea pigs; the alternative would be to enrol every guinea pig on the planet in the study!

Example The Australian Bureau of Statistics co-ordinates a census of all Australians every five years, to find out demographic information such as population size, age of Australians, education, etc. However, they supplement this information with face-to-face interviews of a sample of people, e.g. for monthly estimates of the unemployment rate.

When using a sample, a challenging question we will consider is: what can we say about the population, based on our sample?

Description vs inference
Descriptive statistics refers to methods for summarising data. Inferential statistics refers to methods for making statements, decisions or predictions about a population, based on data obtained from a sample of that population. Most statistical analyses that you have met in previous studies are descriptive statistics: calculating a sample mean, histogram, etc. Calculating descriptive statistics is a key step in analysis, and it is useful for looking for patterns in a sample. But often we want to say something about a population based on the sample, that is, we want to make inferences about the population.

Example Consider the NSW poll, which included 365 registered NSW voters. When we say "65% of respondents would vote for the Liberal party ahead of Labour", we are reporting a descriptive statistic. When we use the data to answer the question "How much evidence does this study provide that the Liberal party will win the next election?", we are making an inference about the population of all 4,635,810 (!) NSW voters, based on a sample of just 365.

Inference is where things get challenging in statistics, mathematically and conceptually, and later in this course we will meet some core tools for making inferences about populations based on samples, in one- and two-variable situations.

Sampling introduces variation


A fundamental idea in statistical inference is that sampling induces variation: different samples will give you different data, depending on which subjects end up getting included in the sample.

Example Consider again the NSW election example. 365 NSW voters were sampled, and of these, 65% said they would vote for the Liberal party ahead of the Labour party. If a different 365 NSW voters were sampled, would you expect to get exactly 65% voting for the Liberal party again?

When we want to make inferences about a population based on a sample, we need to take into account the sampling variation: the extent to which we would expect results to vary from one sample to the next. If we sample randomly, the responses in our sample are random, which motivates the use of probability theory in data analysis. Probability has a key role to play in statistical inference, so a focus in this course is to develop key probabilistic ideas.

Signal vs noise
In statistical inference, the goal is to answer some general population-level question, while taking into account sample-to-sample variability. This can be understood as looking for a signal, while accounting for the noise. The methods we use for analysis, in this course and throughout the statistics major, will all have a component representing the signal and a component representing the noise. Typically we are primarily interested in the signal. It is important in analysis that we model the noise as well as we can, in order to better estimate the signal and make valid inferences about it: experience tells us that if you don't model the noise correctly you can draw incorrect conclusions about the signal. Whenever you see a model for data, it might help you understand it to think about which components represent noise, and which represent the signal. (And what sort of signal are we specifically interested in?)

Example Consider again the guinea pig study, where we are interested in whether there is an effect of the nicotine treatment on cognitive development of offspring, as tested using the number of errors the offspring make when completing a maze. A model for the number of errors Y_i of the i-th guinea pig is:

Y_i = \mu_i + \epsilon_i

where \mu_i is the signal (the true average number of errors of guinea pigs assigned the same treatment, nicotine or none, as the i-th guinea pig), and \epsilon_i is the noise (the error variation around this true average). We are interested in whether the true average is the same across all treatments, that is, whether \mu_i = \mu for all i, for some constant \mu.


What's the research question?


How you collect data and how you analyse it depend on what the primary research question you are trying to answer is. So the first thing to understand when thinking about the design and analysis of a study is: what is the primary purpose of the study? All else depends on it.

Example How you collect data depends on the question. Consider the NSW election example: we want to answer the question "Who will win the next election?" To answer this question we need to start with a representative sample of NSW voters. In obtaining a representative sample of NSW voters we need to make sure we don't include people who are not eligible to vote in the election, for example: people under the age of 18, people who have not registered to vote, and people not registered to vote in NSW. We also need to sample in a manner that gives all NSW voters the opportunity of being in the sample, to make sure all types of voter are represented. For example, using random digit dialling (dialling random landline phone numbers) would exclude from the sample any voters who do not have a landline (because they use a mobile phone instead, or no phone).

Example How you analyse data depends on the question. Consider the following data:

Sample A: 8 30 27 33 42 18 34 45 21 20
Sample B: 37 29 26 0 21 65 78 43 25 75

How should it be analysed? Below are three graphs that would each be appropriate for their own research questions.



[Figure: three graphs of the data, labelled (a), (b) and (c): one comparing the number of errors in Samples A and B, one plotting Sample B against Sample A, and one plotting the paired differences.]

These data are actually from the guinea pig experiment, where Sample A is the number of errors in the maze made by control guinea pigs (no nicotine treatment) and Sample B is the number of errors made by treatment guinea pigs (given the nicotine treatment). Which plot is most suitable for visualising the effect of nicotine on cognitive development of guinea pigs?

The main lesson here is that whenever collecting or analysing data, or answering questions on how to do it, you need to keep in mind the primary purpose of the study!


Part One Summarising data

Chapter 2 Descriptive statistics


Given a sample of data, \{x_1, x_2, \ldots, x_n\}, how would you summarise it, graphically or numerically? Below we will quickly review some key tools. You will not be expected to construct the following numerical and graphical summaries by hand, but you should understand how they are produced, know how to produce them using the statistics package R, and know how to interpret such summaries. For more details on methods for descriptive statistics see: W. S. Cleveland (1994) Elements of Graphing Data, Hobart Press.

Two steps to data analysis


The first two things to think about in data analysis are:
1. What is the research question? Descriptive statistics should primarily focus on providing insight into this question.
2. What are the properties of the variables of primary interest? The most important property to think about when constructing descriptive statistics is whether each variable is categorical or quantitative.
Any variable measured on subjects is either categorical or quantitative. Categorical responses can be sorted into a finite set of (unordered) categories, e.g. gender. Quantitative responses are measured on some sort of scale, e.g. height. If the sample \{x_1, x_2, \ldots, x_n\} comes from a quantitative variable, then the x_i are real numbers, x_i \in \mathbb{R}. If it comes from a categorical variable, then each x_i comes from a finite set of categories or levels, x_i \in \{C_1, C_2, \ldots, C_K\}.


Example Consider the following questions:
1. Will more people vote for the Liberal party ahead of Labour at the next election?
2. Does whether or not pregnant guinea pigs are given a nicotine treatment affect the number of errors made in a maze by their offspring?
3. Is whether or not a Titanic passenger survived related to their gender?
4. How does brain mass change in dinosaurs, as body mass increases?
What are the variables of interest in these questions? Is each of these variables categorical or quantitative?

Summary of descriptive methods
Useful descriptive methods, when we wish to summarise one variable or the association between two variables, depend on whether these variables are categorical or quantitative:

One variable, categorical: numerics: table of frequencies; graph: bar chart.
One variable, quantitative: numerics: mean/sd, median/quantiles; graphs: dotplot, boxplot, histogram, etc.
Two variables, both categorical: numerics: two-way table; graph: clustered bar chart.
Two variables, both quantitative: numerics: correlation; graph: scatterplot.
Two variables, one of each: numerics: mean/sd per group; graphs: boxplots, histograms, etc. per group.

We will work through each of the methods mentioned in the above table.

Example Consider again the research questions of the previous example. What method(s) would you use to construct a graph to answer each research question?


Categorical data
We will simultaneously treat the problems of summarising one categorical variable and studying the association between two categorical variables, because similar methods are used for these problems.

Numerical summaries of categorical data


The main tool for summarising categorical data is a table of frequencies (or percentages). A table of frequencies consists of the counts of how many subjects fall into each level of a categorical variable. A two-way table (of frequencies) counts how many subjects fall into each combination of levels from a pair of categorical variables.

Example We can summarise the NSW election poll as follows:

Party:      Liberal  Labour
Frequency:      237     128

Example Consider the question of whether there is an association between gender and whether or not a passenger on the Titanic survived. We can summarise the results from passenger records as follows:

                    Outcome
Gender      Survived    Died
Male             142     709
Female           308     154

which suggests that a much higher proportion of females survived: their survival rate was 67% vs 17%!


In the Titanic example, an alternative summary was the percentage survival for each gender. Whenever one of the variables of interest has only two possible outcomes, a list (or table) of percentages is a useful alternative way to summarise the data. If you are interested in an association between more than two categorical variables you can extend the above ideas, e.g. construct a three-way table...
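As a rough sketch, frequency tables like the two above can be produced in R with the table() function; the short vectors below are made-up stand-ins for the real poll and Titanic records, used only to show the calls:

    party <- factor(c("Liberal", "Labour", "Liberal", "Liberal", "Labour"))
    table(party)                 # one-way table of frequencies

    gender  <- factor(c("Male", "Male", "Female", "Female", "Male"))
    outcome <- factor(c("Died", "Survived", "Survived", "Survived", "Died"))
    table(gender, outcome)       # two-way table of frequencies

    # percentages within each gender (rounded), as quoted for the Titanic data
    round(100 * prop.table(table(gender, outcome), margin = 1))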

Graphical summaries of categorical data


A bar chart is a graph of a table of frequencies. A clustered bar chart graphs a two-way table, spacing the bars out as clusters to indicate the two-variable structure:

[Figure: a bar chart of the party frequencies (Liberal, Labour), and a clustered bar chart of frequency by gender (male, female) and outcome (Died, Survived).]

Pie charts are often used to graph categorical variables; however, these are not generally recommended. It has been shown that readers of pie charts find it more difficult to understand the information that is contained in them, e.g. when comparing the relative size of frequencies across categories. (For details, see the Wikipedia entry on pie charts and the references therein: http://en.wikipedia.org/wiki/Pie_chart)
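A sketch of the corresponding graphs in R, continuing the made-up party, gender and outcome vectors from the earlier sketch:

    barplot(table(party), ylab = "Frequency")            # bar chart

    barplot(table(outcome, gender), beside = TRUE,       # clustered bar chart
            legend.text = TRUE, ylab = "Frequency")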

Quantitative data
When summarising a quantitative variable, we are usually interested in three things:
Location or centre: a value around which most of the data lie.
Spread: how variable the values are around their centre.
Shape: other information about the variable apart from location and spread. Skewness is an important example.


Numerical summaries of quantitative data


The most commonly used numerical summaries of a quantitative variable are the sample mean, variance and standard deviation. The sample mean

\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i

is a natural measure of location of a quantitative variable. The sample variance

s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2

is a common measure of spread. The sample standard deviation is defined as s = \sqrt{s^2}. The variance is a useful quantity for theoretical purposes, as we will see in the coming chapters. The standard deviation, however, is of more practical interest because it is on the same scale as the original variable and hence is more readily interpreted. The sample mean and variance are very widely used and we will derive a range of useful results about these estimators in this course.

Let's say we order the n values in the dataset and write them in increasing order as \{x_{(1)}, x_{(2)}, \ldots, x_{(n)}\}. For example, x_{(3)} is the third smallest observation in the dataset. The sample median is

x_{0.5} = x_{\left(\frac{n+1}{2}\right)}  if n is odd,
x_{0.5} = \frac{1}{2}\left( x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n+2}{2}\right)} \right)  if n is even.

More generally, the p-th sample quantile of the data is x_p = x_{(k)}, where

p = \frac{k - 0.5}{n}

for k \in \{1, 2, \ldots, n\}. We can estimate the sample quantile for other values of p by linear interpolation. The median is sometimes suggested as a measure of location, instead of \bar{x}, because it is much less sensitive to unusual observations (outliers). However, it is much less widely used in practice. There are a number of alternative (but very similar) ways of defining sample quantiles; a different method again is used as the default approach in the statistics package R.


Example The following (ordered) dataset is the number of mistakes made when ten subjects are each asked to do a repetitive task 500 times. 2 4 5 7 8 10 14 17 27 35

Find the 5th and 15th sample percentiles of the data, and hence find the 10th percentile. There are ten observations in the dataset, so the 5th sample percentile is x_{(1)}, since p = (1 - 0.5)/10 = 0.05: that is, x_{0.05} = 2. Similarly, the 15th sample percentile is x_{0.15} = x_{(2)} = 4. The 10th sample percentile is estimated by linear interpolation as the average of these two:

x_{0.1} = \frac{1}{2}\left(x_{(1)} + x_{(2)}\right) = \frac{1}{2}(2 + 4) = 3.

Apart from x_{0.5}, the other two important quantiles are the first and third quartiles, x_{0.25} and x_{0.75} respectively. These terms are used to define the interquartile range

IQR = x_{0.75} - x_{0.25}

which is sometimes suggested as an alternative measure of spread to the sample standard deviation, because it is much less sensitive to unusual observations (outliers).
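As a sketch, these numerical summaries for the mistakes data above can be computed in R as follows (quantile type 5 is used because it appears to match the (k - 0.5)/n definition above; R's default is a different interpolation method):

    mistakes <- c(2, 4, 5, 7, 8, 10, 14, 17, 27, 35)
    mean(mistakes)                                   # sample mean
    var(mistakes)                                    # sample variance
    sd(mistakes)                                     # sample standard deviation
    median(mistakes)                                 # sample median
    quantile(mistakes, c(0.05, 0.1, 0.25, 0.75), type = 5)   # sample quantiles
    IQR(mistakes, type = 5)                          # interquartile range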

Graphical summaries of quantitative data


There are many ways to summarise a variable, and a key thing to consider when choosing a graphical method is the sample size (n). Some common plots:
[Figure: example plots of the same data: a dotchart (small n), a boxplot (moderate n) and a histogram (large n).]


A dotchart is a plot of each value (x-axis) against its observation number, with data labels (if available). This is useful for small samples (e.g. n < 20).

A boxplot concisely describes location, spread and shape via the median, quartiles and extremes:
The line in the middle of the box is the median, the measure of centre.
The box is bounded by the upper and lower quartiles, so box width is a measure of spread (the interquartile range, IQR).
The whiskers extend to the most extreme value within one and a half interquartile ranges (1.5 IQR) of the nearest quartile. Any value farther than 1.5 IQR from its nearest quartile is classified as an extreme value (or outlier), and labelled as a dot or open circle.
Boxplots are most useful for moderate-sized samples (e.g. 10 < n < 50).

A histogram is a plot of the frequencies or relative frequencies of values within different intervals or "bins" that cover the range of all observed values in the sample. Note that this involves breaking the data up into smaller subsamples, and as such it will only find meaningful structure if the sample is large enough (e.g. n > 30) for the subsamples to contain non-trivial counts. An issue in histogram construction is the choice of the number of bins. A useful rough rule is to use

number of bins = \sqrt{n}.

A histogram is a step-wise rather than smooth function. A quantitative variable that is continuous (i.e. a variable that can take any value within some interval) might be better summarised by a smooth function. So an alternative estimator that often has better properties for continuous data is a kernel density estimator:

\hat{f}_h(x) = \frac{1}{n} \sum_{i=1}^{n} w_h(x - x_i)

for some choice of weighting function w_h(x) which includes a bandwidth parameter h. Usually, w_h(x) is chosen to be the normal density (defined in Chapter 4) with mean 0 and standard deviation h. A lot of research has studied the issue of how to choose a bandwidth h, and most statistics packages are now able to automatically choose an estimate of h that usually performs well. The larger h is, the larger the bandwidth that is used, i.e. the larger the range of observed values x_i that influence estimation of \hat{f}_h(x) at any given point x.
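A sketch of these graphical summaries in R (the vector x here is just a made-up numeric sample):

    x <- rnorm(200, mean = 30, sd = 10)       # hypothetical sample of size 200

    stripchart(x, method = "jitter")          # dotchart-style plot, best for small n
    boxplot(x)                                # boxplot, for moderate n
    hist(x, breaks = sqrt(length(x)))         # histogram with roughly sqrt(n) bins (a suggestion only)
    plot(density(x))                          # kernel density estimate; bandwidth h chosen automatically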


Shape of a distribution
Something we can see from a graph that is hard to see from numerical summaries is the shape of a distribution. Shape properties, broadly, are characteristics of the distribution apart from location and spread. An example of an important shape property is skew: if the data tend to be asymmetric about their centre, they are skewed. We say data are left-skewed if the left tail is longer than the right; conversely, data are right-skewed if the right tail is longer.

There are some numerical measures of shape, e.g. the coefficient of skewness \gamma_1:

\gamma_1 = \frac{1}{(n-1) s^3} \sum_{i=1}^{n} (x_i - \bar{x})^3

but they are rarely used, perhaps because of extreme sensitivity to outliers, and perhaps because shape properties can be easily visualised as above. Another important thing to look for in graphs is outliers: unusual observations that might carry large weight in analysis. Such values need to be investigated: are they errors, are they special cases that offer interesting insights, how dependent are the results on these outliers?
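Base R has no built-in function for the skewness coefficient defined above; a minimal sketch implementing that formula:

    skewness <- function(x) {
      # coefficient of skewness: sum of cubed deviations, scaled by (n - 1) s^3
      sum((x - mean(x))^3) / ((length(x) - 1) * sd(x)^3)
    }

    skewness(c(2, 4, 5, 7, 8, 10, 14, 17, 27, 35))   # positive value, i.e. right-skewed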

Summarising associations between variables


We have already considered the situation of summarising the association between categorical variables, which leaves two possibilities to consider...

Associations between quantitative variables


Consider a pair of samples from two quantitative variables, \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}. We would like to understand how the x and y variables are related. An effective graphical display of the relationship between two quantitative variables is a scatterplot: a plot of the y_i against the x_i.

Example How did brain mass change as a function of body size in dinosaurs?
[Figure: scatterplot of the brain size vs body mass relationship in dinosaurs, with brain mass (ml) on a log scale plotted against body mass (kg) on a log scale.]

An effective numerical summary of the relationship between two quantitative variables is the correlation coefficient (r):

r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)

where \bar{x} and s_x are the sample mean and standard deviation of x, and similarly for y. r measures the strength and direction of the association between x and y:

Results
1. |r| \le 1.
2. r = -1 if and only if y_i = a + b x_i for each i, for some constants a, b such that b < 0.
3. r = 1 if and only if y_i = a + b x_i for each i, for some constants a, b such that b > 0.

Can you prove these results? (Hint: consider the square of \frac{x_i - \bar{x}}{s_x} \pm \frac{y_i - \bar{y}}{s_y}.)

These results imply that r measures the strength and direction of associations between x and y: Strength of (linear) association: values closer to -1 or 1 suggest that the relationship is closer to a straight line. Direction of association: values less than zero suggest a decreasing relationship, values greater than zero suggest an increasing relationship. Examples:
[Figure: four example scatterplots, each labelled with its correlation coefficient r, illustrating associations of different strengths and directions.]
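A quick sketch of computing r in R, and a check of Results 2 and 3 above (the x and y vectors are made up for illustration):

    x <- c(1.2, 3.4, 2.2, 5.1, 4.0, 6.3)
    y <- c(2.0, 4.1, 2.9, 6.0, 5.2, 7.1)
    cor(x, y)           # sample correlation coefficient r

    cor(x, 3 + 2 * x)   # exact increasing linear relationship: r = 1
    cor(x, 3 - 2 * x)   # exact decreasing linear relationship: r = -1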


We will construct a diagram demonstrating the intuition behind how r measures these two properties.

Associations between categorical and quantitative variables


When studying whether a categorical and a quantitative variable are associated, an effective strategy is to summarise the quantitative variable(s) separately for each level of the categorical variable(s).

Example Recall the guinea pig experiment: we want to explore whether there is an association between the nicotine treatment (categorical) and the number of errors made by offspring (quantitative). To summarise the number of errors on its own, we might typically use the mean/sd and a boxplot. To instead look at the association between number of errors and nicotine treatment, we calculate the mean/sd of the number of errors for each of the two levels of treatment (nicotine and no nicotine), and construct a boxplot for each level of treatment:


            x̄      s
Sample A   23.4   12.3
Sample B   44.3   21.5

[Figure: comparative boxplots of the number of errors for Sample A and Sample B, shown on a common axis.]

Note that in the above example the boxplots are presented on a common axis; sometimes this is referred to as comparative boxplots or side-by-side boxplots. An advantage of boxplots over histograms is that they can be quite narrow and hence readily compared across many samples by stacking them side-by-side. Some interesting extensions are reviewed in the article "40 years of boxplots" by Hadley Wickham and Lisa Stryjewski at Rice University. See the R lab exercises for an extension of this idea to when there are more than two variables. For other interesting graphs in action, have a look at the Joy of Statistics (http://www.youtube.com/watch?v=jbkSRLYSojo)!
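A sketch of how these grouped summaries and comparative boxplots might be produced in R, using the Sample A and Sample B values from earlier in the chapter as the two treatment groups (the group labels are assumptions based on the description of the experiment):

    errors    <- c(8, 30, 27, 33, 42, 18, 34, 45, 21, 20,    # Sample A (no nicotine)
                   37, 29, 26, 0, 21, 65, 78, 43, 25, 75)    # Sample B (nicotine)
    treatment <- rep(c("no nicotine", "nicotine"), each = 10)

    tapply(errors, treatment, mean)   # mean number of errors per group
    tapply(errors, treatment, sd)     # standard deviation per group

    boxplot(errors ~ treatment, ylab = "Number of errors")   # comparative boxplots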

Transforming data
Transforming data is typically done for one of two reasons: to change the scale the data were measured on (a linear transformation), or to improve the properties of the data (a non-linear transformation). We will treat each of these in turn.

Linear transformation
A linear transformation of a sample from a quantitative variable, from \{x_1, x_2, \ldots, x_n\} to \{y_1, y_2, \ldots, y_n\}, satisfies

y_i = a + b x_i  for each i.

Linear transformation does not affect the shape of a distribution, only its location and spread.

Effects of linear transformation on statistics
Consider a linear transformation y_i = a + b x_i, i = 1, \ldots, n, and its effects on some statistic calculated from the x_i (call it m_x) and from the y_i (m_y).
If m_x is a measure of location, m_y = a + b m_x.
If m_x is a measure of spread in the same units as x, m_y = |b| m_x.
If m_x is a measure of shape then it is invariant under linear transformation (taking b > 0): m_y = m_x.

These results are necessary by definition: to be a measure of location of x, m_x has to move with the data under changes of scale. For m_x to be a measure of spread, it needs to be invariant under translation but it needs to vary with resizing. A measure of shape, on the other hand, should be invariant under any change of scale.

Example Show the following sequence of results: under linear transformation,
1. The sample mean \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i behaves as a measure of location.
2. The standard deviation s_x = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2} behaves as a measure of spread.
3. The correlation coefficient

r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)

behaves as a measure of shape (consider linear transformations of the x_i and y_i).

Example Dinosaur body mass (x) was measured (well, estimated!) in kilograms.


If we transform the body mass data into grams instead (denoted y), how will the following values calculated from y relate to their counterparts calculated from x? 1. \bar{y}, the mean body mass in grams; 2. s_y, the standard deviation of body mass in grams; 3. r_y, the correlation between body mass (in grams) and brain mass. What about when the above three statistics are calculated on log-transformed data, rather than the raw data?

A particularly important example of a linear transformation is the standardisation of data to z-scores, as below. The z-score, or standardised score, of a value of a quantitative variable is defined as

z = \frac{x - \bar{x}}{s_x}.

The z-score is a measure of unusualness: it measures how many standard deviations above or below the mean a value is (extreme values being unusual ones, far from zero). We will attach probabilities to precisely how unusual a given z-score is in the coming chapters.

Example Sydney's daily maximum temperature in March has a mean of about 25 degrees Celsius, and a standard deviation of 2.2. Hence the following z-scores: A March maximum temperature of 20 degrees in Sydney: z = -2.3. A maximum temperature of 35 degrees in Sydney: z = 4.5. Some other unusually large z-scores: Sachin Tendulkar's batting average: z = 1.5. Don Bradman's batting average: z = 5.5. Your winnings if you win the jackpot in the Powerball lotto: z = 7367!!!

Nonlinear transformations

If you have a quantitative variable that is strongly skewed, then the patterns you see (in scatterplots or elsewhere) can be dominated by a few outlying values. In such cases transforming the data can be a good idea: applying a non-linear transformation to a dataset will change its shape, often for the better. The most common transformation is a log-transformation. You can use any base (base 2 and base 10 are the most common, for interpretability) and it doesn't really matter...

Example Consider brain mass vs body mass data for dinosaurs, reptiles and birds. Compare scatterplots of log-transformed data and untransformed data!
[Figure: two scatterplots of brain mass (ml) against body mass (kg) for dinosaurs, reptiles and birds: one of the untransformed variables and one of the log-transformed variables.]

Note that in the untransformed plot, little can be seen except for three outlying values (Tyrannosaurus, Carcharodontosaurus and Allosaurus). On transformation, a lot of interesting structure becomes apparent.

Example* Show that measures of shape calculated on yi = logc (xi ), for i = 1, . . . , n, are the same for all positive values of the base c.

One reason why the log-transformation often works so well in revealing structure is the special property log(ab) = log(a) + log(b). This can be understood as taking multiplicative processes and making them additive: that is, a variable that grows in a multiplicative way (e.g. account balance, size, profit, population size) can be understood as growing in an additive way once log-transformed. This is useful because in graphs (and in most analyses) additive patterns are the easiest to perceive. In the body mass-brain mass example we are studying two size variables, which are naturally understood as the outcomes of multiplicative (growth) processes, hence log-transformation better reveals the relation between these processes.

Let y = h(x) be some non-linear transformation of real-numbered values x. In most cases,

\bar{y} \neq h(\bar{x}).

This point should be kept in mind when analysing transformed data: the mean of transformed data is a different quantity to the mean of the originally observed variable, and they do not even have a one-to-one correspondence in most cases.
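A quick numerical illustration of this point in R, with a small made-up right-skewed sample:

    x <- c(1, 2, 4, 8, 100)   # hypothetical right-skewed data
    mean(log(x))              # mean of the log-transformed data: about 1.75
    log(mean(x))              # log of the mean of the raw data: about 3.14, a different quantity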

Part Two Modelling data


Chapter 3 Random Variables


When we have taken a sample of measurements of a variable, the next step beyond characterising it using descriptive statistics is to construct a model for the underlying processes (the signal and the noise). In this section of the notes we will describe some common distributions (data models) and their properties. Measurements always vary from one subject to another, due to factors that are beyond our control or beyond our knowledge. For this reason, we treat the measurements as random variables: variables that have some random (noise) component. In this chapter we will explore random variables and some properties of random variables that are important in their study.

Introduction
For a discrete sample space S, a random variable X is a function defined on S, with

P(X = x) = \sum_{s : X(s) = x} P(\{s\})

being the probability that X takes the value x.

Example Toss a fair coin 3 times. S = {HHH, HHT, HTH, THH, TTH, THT, HTT, TTT}. Let X denote the number of heads turned up. Then

P(X = 0) = 1/8,  P(X = 1) = 3/8,  P(X = 2) = 3/8,  P(X = 3) = 1/8.


Discrete Random Variables and Probability Functions


Definition The random variable X is discrete if there are countably many values x for which P(X = x) > 0. The probability structure of X is most commonly described by its probability function (sometimes referred to as its probability mass function, as in Hogg et al 2005).

Definition The probability function of the discrete random variable X is the function f_X given by f_X(x) = P(X = x).

Results The probability function of a discrete random variable X has the following properties:
1. f_X(x) \ge 0 for all x \in \mathbb{R}.
2. \sum_{\text{all } x} f_X(x) = 1.

Example Below is the probability function for the number of heads in three coin tosses:

[Figure: a bar chart of f(x) against x, with bars of height 1/8 at x = 0 and x = 3, and height 3/8 at x = 1 and x = 2.]


Probabilities can be understood as the long-run frequencies of the outcomes of the variable X. Hence fX (x) can be understood as a bar chart of the relative frequencies of a very large sample from X.
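A rough simulation sketch of this idea in R, for the three-coin-toss example (the number of heads in 3 fair tosses can be simulated with rbinom()):

    x <- rbinom(100000, size = 3, prob = 0.5)   # number of heads in 3 tosses, simulated many times
    table(x) / length(x)                        # relative frequencies: close to 1/8, 3/8, 3/8, 1/8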

Continuous Random Variables and Density Functions


When a random variable has a continuum of possible values it is continuous (e.g. the lifetime of a light bulb has possible values in [0, \infty)). The analogue of the probability function for continuous random variables is the density function (sometimes called the probability density function).

Definition The density function of a continuous random variable is a real-valued function f_X on \mathbb{R} with the property

\int_A f_X(x)\, dx = P(X \in A)

for any (measurable) set A \subseteq \mathbb{R}. Just as f_X(x) can be understood as the long-run bar chart (of relative frequencies) in the discrete case, in the continuous case it can be understood as a long-run histogram: a histogram (of relative frequencies) for a very large sample from X.

Results The density function of a continuous random variable X has the following properties:
1. f_X(x) \ge 0 for all x \in \mathbb{R}.
2. \int_{-\infty}^{\infty} f_X(x)\, dx = 1.

Result For any continuous random variable X and pair of numbers a \le b,

P(a \le X \le b) = \int_a^b f_X(x)\, dx = area under f_X between a and b.

These results demonstrate the importance of f_X(x): if you know or can derive f_X(x), then you can derive any probability you want to about X, and hence any property of X that is of interest.

Some particularly important properties of X are described later in this chapter.

Example The lifetime (in thousands of hours) X of a light bulb has density f_X(x) = e^{-x}, x > 0. Find the probability that a bulb lasts between 2 thousand and 3 thousand hours.
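This density is that of an exponential distribution with rate 1, so as a quick numerical check of the answer in R:

    pexp(3, rate = 1) - pexp(2, rate = 1)   # P(2 <= X <= 3) = e^{-2} - e^{-3}, about 0.085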

Continuous random variables X have the property

P(X = a) = 0 for any a \in \mathbb{R}.    (3.1)

It only makes sense to talk about the probability of X lying in some subset of \mathbb{R}. A consequence of (3.1) is that, with continuous random variables, we don't have to worry about distinguishing between < and \le signs; the probabilities are not affected. For example, P(2 < X < 3) = P(2 \le X < 3) = P(2 < X \le 3) = P(2 \le X \le 3). This is not the case for discrete random variables.

Cumulative Distribution Function


The cumulative distribution function (cdf) of the random variable X is F_X(x) = P(X \le x). This definition applies irrespective of whether a variable is discrete or continuous. The following figure shows the cdf for the coin toss example given earlier:


[Figure: the step-function cdf F(x), jumping from 0 to 1/8 at x = 0, to 1/2 at x = 1, to 7/8 at x = 2, and to 1 at x = 3.]

Result For any random variable X and pair of numbers a \le b,

F_X(b) - F_X(a) = P(a < X \le b).

Proof:
F_X(b) - F_X(a) = P(X \le b) - P(X \le a)
              = P(\{a < X \le b\} \cup \{X \le a\}) - P(X \le a)
              = P(a < X \le b) + P(X \le a) - P(X \le a)
              = P(a < X \le b).

The next two results show how F_X may be found from f_X and vice versa.

Result The cumulative distribution function F_X(x) can be found from the probability or density function f_X(x) in the following way:

F_X(x) = \sum_{t \le x} f_X(t)  if X is discrete,
F_X(x) = \int_{-\infty}^{x} f_X(t)\, dt  if X is continuous.

Result The probability or density function f_X(x) can be found from the cumulative distribution function F_X(x) in the following way:

f_X(x) = F_X(x) - F_X(x^-)  if X is discrete,
f_X(x) = F_X'(x)  if X is continuous,

where F_X(x^-) is the limiting value of F_X(x) as we approach x from the negative direction.


Example A coin, with p = probability of a head on a single toss, is tossed until a head turns up for the first time. Let X denote the number of tosses required. Find the probability function and the cumulative distribution function of X.

Example The lifetime (in thousands of hours) X of a light bulb has density f_X(x) = e^{-x}, x > 0. Find the cumulative distribution function of X, and hence the probability that a bulb lasts between 2 thousand and 3 thousand hours.


Definition If F_X is strictly increasing in some interval, then F_X^{-1} is well defined and, for a specified p \in (0, 1), the p-th quantile of F_X is x_p, where

F_X(x_p) = p, or equivalently x_p = F_X^{-1}(p).

x_{0.5} is the median of F_X (or f_X). x_{0.25} and x_{0.75} are the lower and upper quartiles of F_X (or f_X).

Example Let X be a random variable with cumulative distribution function F_X(x) = 1 - e^{-x}, x > 0. Find the median and quartiles of X.

[Figure: the density f(x) = e^{-x}, with the quartiles marked on the x-axis at ln(4/3), ln 2 and 2 ln 2, dividing the area under the curve into regions of probability 1/4.]
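As a quick numerical check in R (this F_X is the cdf of the exponential distribution with rate 1):

    qexp(c(0.25, 0.5, 0.75), rate = 1)   # ln(4/3), ln(2), 2 ln(2): about 0.29, 0.69, 1.39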


Expectation and Moments


The measures of location and spread introduced in Chapter 2 for samples of data have parallel definitions for random variables.

Expectation
Consider a random variable X with P(X = 5) = 1/5, P(X = 10) = 4/5. If we observed the values of, say, 100 random variables with the same distribution as X, we would expect to observe about twenty 5s and about eighty 10s, so that the mean or average of the 100 numbers should be about

\frac{5 \times 20 + 10 \times 80}{100} = 5 \times \frac{1}{5} + 10 \times \frac{4}{5} = 9;

that is, the sum of the possible values of X weighted by their probabilities.

Definition The expected value or mean of a discrete random variable X is

E(X) = \sum_{\text{all } x} x\, P(X = x) = \sum_{\text{all } x} x f_X(x),

where f_X is the probability function of X. By analogy, in the continuous case we have:

Definition The expected value or mean of a continuous random variable X is

E(X) = \int_{-\infty}^{\infty} x f_X(x)\, dx,

where f_X is the density function of X.

In both cases, E(X) has the interpretation of being the long run average of X: in the long run, as you observe an increasing number of values of X, the average of these values approaches E(X). In both cases, E(X) also has the physical interpretation of the centre of gravity of the function f_X. So if a piece of thick wood or stone was carved in the shape of f_X, it would balance on a fulcrum placed at E(X).

[Figure: a density curve balancing on a fulcrum placed at E(X).]

Example Let X be the number of females in a committee with three members. Assume that there is a 50:50 chance of each committee member being female, and that committee members are chosen independently of each other. Find E(X).

x:
f_X(x) = P(X = x):
E(X) =

The interpretation of 3/2 is not that you expect X to be 3/2 (it can only be 0, 1, 2 or 3) but that if you repeated the experiment, say, 100 times, then the average of the 100 numbers observed should be about 3/2 (= 150/100).

That is, we expect to observe about 150 females in total in 100 committees. We don't expect to see exactly 1.5 females on each committee!
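For reference, a sketch of the calculation behind the 3/2 quoted above, using the probabilities from the coin-toss example (X here has the same Bin(3, 1/2) distribution):

E(X) = 0 \times \frac{1}{8} + 1 \times \frac{3}{8} + 2 \times \frac{3}{8} + 3 \times \frac{1}{8} = \frac{12}{8} = \frac{3}{2}.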

Example Suppose X is a standard uniform random number generator (such as can be found on most hand-held calculators). X has density f_X(x) = 1, 0 < x < 1. Find E(X).

E(X) = \int_{-\infty}^{\infty} x f_X(x)\, dx = \int_0^1 x\, dx = \frac{1}{2}.

[Figure: the density f(x) = 1 on (0, 1), balancing at E(X).]

Note that in both of the above examples, fX is symmetric and it is symmetric about E(X). This is not always the case, as in the following examples:

Example* X has probability function f_X(x) = P(X = x) = (1 - p)^{x-1} p, x = 1, 2, \ldots; 0 < p < 1. Find E(X).

Example X has density f_X(x) = e^{-x}, x > 0. Find E(X).


Special case: if X is degenerate, that is, X = c with probability 1 for some constant c, then X is in fact just a constant and

E(X) = \sum_{\text{all } x} x P(X = x) = c \times 1 = c.

Thus the expected value of a constant is the constant; that is, E(c) = c.

Expectation of transformed random variables


As discussed in Chapter 2, sometimes we are interested in studying a transformation of a random variable. The following is an important result regarding the expectation of a transformation of a random variable:

Result The expected value of a function g(X) of a random variable X is

E\{g(X)\} = \sum_{\text{all } x} g(x) f_X(x)  if X is discrete,
E\{g(X)\} = \int_{-\infty}^{\infty} g(x) f_X(x)\, dx  if X is continuous.

While transformations are of interest in their own right, expectations of transformations are specifically of interest when studying the properties of a random variable. The variance, defined in the following section, is an expectation of a function of a random variable. More generally, the r-th moment of X about some constant a, defined as E\{(X - a)^r\}, can be used to characterise a range of properties of a random variable (spread when r = 2, skew when r = 3, ...).

Example Let I denote the electric current through a particular circuit, where I has density function

f_I(x) = 1/2 for 1 \le x \le 3, and 0 elsewhere.

Power is a function of current and resistance; for a circuit with resistance 3 Ohms, P = 3I^2. What is the expected value of the power P through this circuit?
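A sketch of the calculation, using the result above with g(I) = 3I^2:

E(P) = E(3I^2) = \int_1^3 3x^2 \times \frac{1}{2}\, dx = \left[ \frac{x^3}{2} \right]_1^3 = \frac{27 - 1}{2} = 13.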


Expectation of a variable under transformation


Often a change of scale is required when studying a random variable. An example is when a change of measurement units is required (g to kg, degrees F to degrees C).

Expectation under linear transformation
If a and b are constants,

E(a + bX) = \sum_{\text{all } x} (a + bx) P(X = x)
          = a \sum_{\text{all } x} P(X = x) + b \sum_{\text{all } x} x P(X = x)
          = a + b E(X),

and similarly for X continuous. Thus E(X) behaves as a measure of location of X, as defined in Chapter 2.

Expectation of a sum
Expectations are additive when summing functions of a random variable:

E\{g_1(X) + \cdots + g_n(X)\} = E\{g_1(X)\} + \cdots + E\{g_n(X)\}.

Expectation of a nonlinear transformation
In most situations,

E\{g(X)\} \neq g\{E(X)\}.

The main exceptions to this rule are when g(X) is a linear transformation of X, or when X is a constant (P(X = c) = 1 for some constant c).

Standard Deviation and Variance


The standard deviation of a random variable is a measure of its spread. It is closely tied to the variance of a random variable, defined below:

Definition If we let \mu = E(X), then the variance of X, denoted by Var(X), is defined as

Var(X) = E\{(X - \mu)^2\}

(which is the second moment of X about \mu).

Definition The standard deviation of a random variable X is the square root of its variance:

standard deviation of X = \sqrt{Var(X)}.

These definitions and their interpretation correspond to those introduced in Chapter 2 when studying descriptive statistics.

Result Var(X) = E(X^2) - \mu^2.

Proof:
Var(X) = E\{(X - \mu)^2\}
       = E(X^2 - 2\mu X + \mu^2)
       = E(X^2) - 2\mu E(X) + E(\mu^2)
       = E(X^2) - 2\mu^2 + \mu^2
       = E(X^2) - \mu^2.

Example Assume the lifetime of a lightbulb (in thousands of hours) has density function f_X(x) = e^{-x}, x > 0. Calculate Var(X).

We will use Var(X) = E(X^2) - [E(X)]^2. Recall from the earlier example that E(X) = 1.

E(X^2) = \int_0^{\infty} x^2 e^{-x}\, dx
       = \left[ -x^2 e^{-x} \right]_0^{\infty} + \int_0^{\infty} 2x e^{-x}\, dx
       = 0 + 2 E(X) = 2,

so Var(X) = E(X^2) - [E(X)]^2 = 2 - 1^2 = 1.


Example Consider two games described as follows. Game A: toss a fair coin; if it lands heads I give you $2, if it lands tails you give me $2. Game B: toss a fair coin; if it lands heads I give you $2,000, if it lands tails you give me $2,000. Which game would you prefer to play? In your answer consider the expectation and variance of your winnings from each game.
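A sketch of the comparison: in both games the winnings X take the values +c and -c with probability 1/2 each (c = 2 for Game A, c = 2000 for Game B), so

E(X) = \frac{1}{2} c + \frac{1}{2}(-c) = 0 for both games, while Var(X) = E(X^2) = c^2,

which is 4 for Game A but 4,000,000 for Game B: the same expected winnings, but a vastly larger spread.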

Result Var(a + bX) = b^2 Var(X), and sd(a + bX) = |b| sd(X). Thus sd(X) behaves as a measure of spread of X, as defined in Chapter 2.

Chebychev's Inequality
Chebychev's Inequality is a fundamental result concerning tail probabilities of general random variables. It is useful for the derivation of convergence results given later in the notes.

Chebychev's Inequality: If X is any random variable with E(X) = \mu and Var(X) = \sigma^2, then

P(|X - \mu| > k\sigma) \le \frac{1}{k^2}.

The probability statement in Chebychev's Inequality is often stated verbally as: the probability that X is more than k standard deviations from its mean. Or equivalently, recalling the idea of z-scores from Chapter 2: the probability that X has a z-score whose absolute value exceeds k. Note that Chebychev's Inequality makes no assumptions about the distribution of X. This is handy because, in practice, we usually do not know what the distribution of X is. But using Chebychev's Inequality, we can make probabilistic statements about a random variable given only its mean and standard deviation!

Proof (continuous case):

\sigma^2 = Var(X) = \int_{-\infty}^{\infty} (x - \mu)^2 f_X(x)\, dx
        \ge \int_{|x - \mu| > k\sigma} (x - \mu)^2 f_X(x)\, dx
        \ge \int_{|x - \mu| > k\sigma} (k\sigma)^2 f_X(x)\, dx
          (since |x - \mu| > k\sigma implies (x - \mu)^2 f_X(x) > (k\sigma)^2 f_X(x))
        = k^2 \sigma^2 \int_{|x - \mu| > k\sigma} f_X(x)\, dx
        = k^2 \sigma^2 P(|X - \mu| > k\sigma),

so P(|X - \mu| > k\sigma) \le \frac{1}{k^2}.

Example The number of items a factory produces per day has mean 500 and variance 100. What is a lower bound for the probability that between 400 and 600 items will be produced tomorrow? Let X denote the number of items produced tomorrow.
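A sketch of one way to apply the inequality here (treating "between 400 and 600" as inclusive): the standard deviation is \sigma = \sqrt{100} = 10, and the interval corresponds to k = 10 standard deviations from the mean, so

P(400 \le X \le 600) = 1 - P(|X - 500| > 10 \times 10) \ge 1 - \frac{1}{10^2} = 0.99.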

Deriving probability distributions


In some special cases, the probability distribution of a variable can be derived from first principles. A trick with such derivations is to try to derive a general expression for calculating probabilities. For discrete variables, this means one attempts to derive f_X(x) directly, which we have already done in some exercises earlier in this chapter. But for continuous random variables, this means attempting to derive an expression for the cumulative probabilities F_X(x); then f_X(x) = F_X'(x).

Example A point is chosen at random inside a circle that has radius r. Let X be the distance from the centre of the circle.

Find f_X(x), the density function of X.
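A sketch of the cdf approach for this example, assuming "at random" means uniformly over the area of the circle:

For 0 \le x \le r,  F_X(x) = P(X \le x) = \frac{\pi x^2}{\pi r^2} = \frac{x^2}{r^2},  so  f_X(x) = F_X'(x) = \frac{2x}{r^2},  0 < x < r.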


Another example where the above cdf strategy is useful is in deriving the distribution of some transformation of a random variable.

Transformation of random variables


We've seen earlier that sometimes it is of interest to transform a random variable. In this section we will briefly study the implications of transformation for the distribution of a random variable. Consider a random variable X with known f_X(x). What is the distribution of Y = h(X), for some known function h of X?

Transformations of discrete random variables


Result For discrete X,

f_Y(y) = P(Y = y) = P\{h(X) = y\} = \sum_{x : h(x) = y} P(X = x).

Example Suppose X has probability function

x:        -1    0    1    2
f_X(x):  1/8  1/4  1/2  1/8

Find f_Y(y) where Y = X^2.
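A sketch of the solution using the result above (and assuming the first value in the table is indeed -1):

f_Y(0) = P(X = 0) = 1/4,  f_Y(1) = P(X = -1) + P(X = 1) = 1/8 + 1/2 = 5/8,  f_Y(4) = P(X = 2) = 1/8.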


Example* Assume that X is discrete and the transformation h(X) is monotonic over the set of allowable values of X. Show that f_Y(y) = f_X\{h^{-1}(y)\}.

Transformations of continuous random variables


Now we will consider the continuous case. The density function of a transformed continuous variable is simple to determine when the transformation is monotonic, i.e. strictly increasing or decreasing over the range of allowable values of the variable being transformed.

Result For continuous X, if h is monotonic over the set \{x : f_X(x) > 0\} then

f_Y(y) = f_X(x) \left| \frac{dx}{dy} \right| = f_X\{h^{-1}(y)\} \left| \frac{d h^{-1}(y)}{dy} \right|

for y such that f_X\{h^{-1}(y)\} > 0.

The proof is of course derived using the cumulative distribution function!

Proof:
F_Y(y) = P(Y \le y) = P\{h(X) \le y\}
       = P\{X \le h^{-1}(y)\} = F_X\{h^{-1}(y)\}  if h is increasing,
       = P\{X \ge h^{-1}(y)\} = 1 - F_X\{h^{-1}(y)\}  if h is decreasing.

Differentiating with respect to y,

f_Y(y) = f_X\{h^{-1}(y)\} \frac{d h^{-1}(y)}{dy} = f_X(x) \frac{dx}{dy}  if h is increasing,
f_Y(y) = -f_X\{h^{-1}(y)\} \frac{d h^{-1}(y)}{dy} = -f_X(x) \frac{dx}{dy}  if h is decreasing.

Now \frac{dx}{dy} > 0 if h is increasing and \frac{dx}{dy} < 0 if h is decreasing, and so in both cases

f_Y(y) = f_X(x) \left| \frac{dx}{dy} \right|.


Example f_X(x) = 3x^2, 0 < x < 1. Find f_Y(y) where Y = 2X - 1.

Example
Let f_X(x) = \frac{1}{\theta} e^{-x/\theta}, x > 0; \theta > 0.

Find f_Y(y) where Y = X/\theta.

Chapter 4 Common Distributions


In the previous chapter we saw that random variables may be characterised by probability functions (discrete case) and density functions (continuous case). Any non-negative function that sums to 1 is a legal probability function. Any non-negative function that integrates to 1 is a legal density function. There are certain families of probability functions and density functions that are particularly useful in modelling data. This chapter covers the most common ones. This material is summarised concisely in Hogg et al (2005) Chapter 3 and in Rice (2007) Chapter 2.

Bernoulli Distribution
The Bernoulli distribution is very important in statistics, because it can be used to model the response to any Bernoulli trial, defined as below:

Definition A Bernoulli trial is an experiment with 2 possible outcomes. The outcomes are often labelled "success" and "failure".

Example The tossing of a coin is a Bernoulli trial. We may define: success = heads, failure = tails

or vice versa. Some more examples of Bernoulli trials:

Dead or alive? Sick or not? Flowering or not flowering? Labour or Coalition? Faulty or not?


There are only two possible responses to any of the questions in the above example. Hence the above variables can all be modelled using the Bernoulli distribution, defined below:

Definition For a Bernoulli trial, define the random variable

X = 1 if the trial results in success, and X = 0 otherwise.

Then X is said to have a Bernoulli distribution.

Result If X is a Bernoulli random variable defined according to a Bernoulli trial with success probability 0 < p < 1, then the probability function of X is

f_X(x) = p if x = 1, and f_X(x) = 1 - p if x = 0.

An equivalent way of writing this is

f_X(x) = p^x (1 - p)^{1-x},  x = 0, 1.

Example

Consider coin-tossing as in the previous example. If the coin is fair, p = 1/2 and

f_X(x) = \left( \frac{1}{2} \right)^x \left( 1 - \frac{1}{2} \right)^{1-x} = \frac{1}{2},  x = 0, 1.

Definition A constant like p above in a probability function or density function is called a parameter. The Bernoulli distribution is a special case of the Binomial distribution, covered in the next section.


Binomial Distribution
The Binomial distribution arises when several Bernoulli trials are repeated in succession.

Definition Consider a sequence of n independent Bernoulli trials, each with success probability p. If X = total number of successes, then X is a Binomial random variable with parameters n and p. A common shorthand is: X ~ Bin(n, p).

The previous definition uses the symbol ~. In statistics this symbol has a special meaning:

Notation The symbol ~ is commonly used in statistics for the phrase "is distributed as" or "has distribution". The mathematical expression Y ~ Bin(29, 0.72) is usually read: "Y has a Binomial distribution with parameters n = 29 and p = 0.72."

Whenever we sum the number of times we observe a particular binary outcome across n independent trials, we have a binomial distribution.

Example Write down a distribution that could be used to model X when X is: 1. The number of patients who survive a new type of surgery, out of 12 patients who each have 95% chance of surviving.

2. The number who would vote Liberal ahead of Labour in a random sample of 365 voters (with probability p of voting Liberal)

Note that in the above, we require the assumption of independence of responses across the n units in order to use the binomial distribution. This assumption is guaranteed to be satisfied if we randomly select units from some larger population (as was done in the poll example).

Result If X ~ Bin(n, p) then its probability function is given by

f_X(x) = \binom{n}{x} p^x (1 - p)^{n-x},  x = 0, \ldots, n.

This result follows from the fact that there are \binom{n}{x} ways by which X can take the value x, and each of these ways has probability p^x (1 - p)^{n-x} of occurring.

Results If X ~ Bin(n, p) then
1. E(X) = np,
2. Var(X) = np(1 - p).

Example Adam pushed 10 pieces of toast off a table. Seven of these landed butter side down. What distribution could be used to model the number of slices of toast that land butter side down? Assume that there is a 50:50 chance of each slice landing butter side down.

What is the expected number of pieces of toast that land butter side down, and the standard deviation?

What is the probability that exactly 7 slices land butter side down? Is this unusual? (Use a tail probability to answer this question.)
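A sketch of these calculations in R, modelling the number landing butter side down as X ~ Bin(10, 0.5):

    n <- 10; p <- 0.5
    n * p                                               # expected number: 5
    sqrt(n * p * (1 - p))                               # standard deviation: about 1.58
    dbinom(7, size = n, prob = p)                       # P(X = 7), about 0.117
    pbinom(6, size = n, prob = p, lower.tail = FALSE)   # tail probability P(X >= 7), about 0.17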

As mentioned in the previous section, the Binomial distribution generalises the Bernoulli distribution. The next result makes this explicit:

Results X has a Bernoulli distribution with parameter p if and only if X ~ Bin(1, p).

Geometric Distribution
The Geometric distribution arises when a Bernoulli trial is repeated until the first success. In this case X = number of trials until the first success, and X is said to have a geometric distribution with parameter p, where p is the probability of success on each trial.

Result If X has a Geometric distribution with parameter 0 < p < 1 then X has probability function

f_X(x; p) = p (1 - p)^{x-1},  x = 1, 2, \ldots

Results If X has a Geometric distribution with parameter p then
1. E(X) = \frac{1}{p},
2. Var(X) = \frac{1 - p}{p^2}.

Alternative definitions of the geometric distribution are possible. For example, a common definition is X = number of failures before the first success. This leads to a distribution on x = 0, 1, \ldots, with a different mean than is given above, but the variance is unchanged.

Hypergeometric Distribution
Hypergeometric random variables arise when counting the number of binary responses when objects are sampled without replacement from a finite population, and the total number of successes in the population is known. Suppose that a box contains N balls, m of which are red and N - m black. n balls are drawn at random. Let X = number of red balls drawn.


Then X has a Hypergeometric distribution with parameters N, m and n. This can be thought of as a finite population version of the binomial distribution. Instead of assuming some constant probability p of success in the population, we say that there are N units in the population, of which m are successes.

Result If X has a Hypergeometric distribution with parameters N, m and n then its probability function is given by

f_X(x; N, m, n) = \frac{\binom{m}{x} \binom{N-m}{n-x}}{\binom{N}{n}},  0 \le x \le \min(m, n).

Example Lotto. A machine contains 45 balls, and you select 6. Seven winning numbers are then drawn (6 main, one supplementary), and you win a major prize ($10,000+) if you pick six of the winning numbers. What's the chance that you win a major prize from playing one game?

Let X be the number of winning numbers you picked. X is hypergeometric with N = 45, m = 6, n = 7:

P(X = x) = f(x; 45, 6, 7) = \frac{\binom{6}{x} \binom{39}{7-x}}{\binom{45}{7}}

and

P(win major prize) = P(X = 6) = \frac{\binom{6}{6} \binom{39}{1}}{\binom{45}{7}}

which is less than 1 in a million (it's 1 in 1,163,580).

Example About 1920, the famous statistician Ronald Fisher imposed a tea-room experiment on a colleague, algal biologist Dr Muriel Bristol, who had claimed that she could tell whether milk had been poured into her cup before or after the tea had been poured. http://en.wikipedia.org/wiki/Lady_tasting_tea


Fisher presented Dr Bristol with eight cups of tea, four poured milk-first and four poured tea-first, and asked her to taste each and tell him which four were poured milk-first. Assuming she guessed randomly, what is the chance that she correctly identified at least three of the four cups that had been poured milk-first?
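A sketch of this probability in R: with X = the number of milk-first cups she identifies correctly, X is hypergeometric with N = 8, m = 4 and n = 4 (note that R's dhyper() uses a different naming convention: m = successes in the population, n = failures, k = sample size):

    sum(dhyper(3:4, m = 4, n = 4, k = 4))   # P(X >= 3) = (16 + 1)/70, about 0.24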

Is the hypergeometric distribution symmetric with respect to n and m?

Results If X has a Hypergeometric distribution with parameters N, m and n then

1. E(X) = n \frac{m}{N},
2. Var(X) = n \frac{m}{N} \left( 1 - \frac{m}{N} \right) \frac{N-n}{N-1}.

It can be shown that as N gets large, a hypergeometric distribution with parameters N, m and n approaches Y ~ Bin(n, m/N). A suggestion of this can be seen in the above formulae: E(X) has the form of a binomial expectation with p = m/N, and Var(X) only differs from the corresponding binomial variance formula by a finite population correction factor \frac{N-n}{N-1}, which tends to one as N gets large.

Poisson Distribution
Definition The random variable X has a Poisson distribution with parameter \lambda > 0 if its probability function is

f_X(x; \lambda) = P(X = x) = \frac{e^{-\lambda} \lambda^x}{x!},  x = 0, 1, 2, \ldots

A common abbreviation is X ~ Poisson(\lambda).

The Poisson distribution often arises when the variable of interest is a count. For example, the number of traffic accidents in a city on any given day could be well described by a Poisson random variable. The Poisson is a standard distribution for the occurrence of rare events. Such events are often described by a Poisson process. A Poisson process is a model for the occurrence of point events in a continuum, usually a time-continuum. The occurrence or not of points in disjoint intervals is independent, with a uniform probability rate over time. If the probability rate is \lambda, then the number of points occurring in a time interval of length t is a random variable with a Poisson(\lambda t) distribution.

Results If X ~ Poisson(\lambda) then
1. E(X) = \lambda,
2. Var(X) = \lambda.

Example Suggest a distribution that could be useful for studying X in each of the following cases: 1. The number of workplace accidents in a month (when the average number of accidents is 1.4).

2. The number of ATM customers overnight when a bank is closed (when the average number is 5.6).

Example If, on average, five university servers go offline per week, what is the chance that no more than one will go offline this week? (Assume independence of servers going offline.)
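A sketch of how this kind of Poisson probability can be checked in R, assuming the weekly count is modelled as Poisson(5) as the example suggests:

ppois(1, lambda = 5)            # P(X <= 1) when X ~ Poisson(5), about 0.040
dpois(0, 5) + dpois(1, 5)       # the same, summing the probability function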


Exponential Distribution
The Exponential distribution is the simplest common distribution for describing the probability structure of positive random variables, such as lifetimes.
Definition A random variable X is said to have an exponential distribution with parameter β > 0 if X has density function
$$f_X(x; \beta) = \frac{1}{\beta} e^{-x/\beta}, \qquad x > 0.$$

Result If X has an Exponential distribution with parameter β then E(X) = β and Var(X) = β².
The exponential distribution is closely related to the Poisson distribution of the previous section. We know from previously that if a variable follows a Poisson process, counts of the number of times a particular event happens have a Poisson distribution with parameter λt over an interval of length t. It can be shown that the time until the next event has an exponential distribution with parameter β = 1/λ.

Example If, on average, 5 servers go offline during the week, what is the chance that no servers will go offline today? (Hint: note that a day is 1/7 of a week!)
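A sketch of the corresponding R calculation, assuming (as the hint suggests) that the waiting time to the next offline event is exponential with mean 1/5 of a week, i.e. rate 5 per week:

pexp(1/7, rate = 5, lower.tail = FALSE)   # P(no event in the next 1/7 week) = exp(-5/7), about 0.49
exp(-5/7)                                 # the same, directly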

An important property of the exponential distribution is lack of memory: if X has an exponential distribution, then
P(X > s + t | X > s) = P(X > t).
In words, if the waiting time until the next event is exponential, then the waiting time until the next event is independent of the time you've already been waiting. The exponential distribution is a special case of the Gamma distribution described in a later section.


Uniform Distribution
The uniform distribution is the simplest common distribution for continuous random variables.
Definition A continuous random variable X that can take values in the interval (a, b) with equal likelihood is said to have a uniform distribution on (a, b). A common shorthand is X ∼ Uniform(a, b).
Definition If X ∼ Uniform(a, b) then the density function of X is
$$f_X(x; a, b) = \frac{1}{b - a}, \qquad a < x < b \ (\text{with } a < b).$$

fX(x; a, b) is simply a constant function over the interval (a, b), and zero otherwise. The following figure shows four different uniform density functions.
[Figure: density functions fX(x) against x for Uniform(0,1), Uniform(2,3), Uniform(3,3.5) and Uniform(0.6,2.8) random variables.]
Results If X ∼ Uniform(a, b) then 1. E(X) = (a + b)/2, 2. Var(X) = (b − a)²/12.


There is also a discrete version of the uniform distribution, useful for modelling the outcome of an event that has k equally likely outcomes (such as the roll of a die). This has different formulae for its expectation and variance than the continuous case does.

Normal Distribution
A particularly important family of continuous random variables is those following the normal distribution.
Definition The random variable X is said to have a normal distribution with parameters μ and σ² (where −∞ < μ < ∞ and σ² > 0) if X has density function
$$f_X(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \qquad -\infty < x < \infty.$$
A common shorthand is X ∼ N(μ, σ²). Normal density functions are bell-shaped curves, symmetric about μ. The following figure shows four different normal density functions.
[Figure: density functions fX(x) against x for N(0,1), N(2,1), N(4,0.25) and N(4,9) random variables.]
The normal distribution is the most important distribution in statistical practice largely because of the Central Limit Theorem, to be discussed when we study


inference in the second half of the course.
Results If X ∼ N(μ, σ²) then 1. E(X) = μ, 2. Var(X) = σ².
The special case of μ = 0 and σ² = 1 is known as the standard normal distribution. It is common to use the letter Z to denote standard normal random variables: Z ∼ N(0, 1). The standard normal density function is
$$f_Z(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}x^2}.$$

Computing Normal Distribution Probabilities


Consider the problem:
$$P(Z \le 0.47) = \int_{-\infty}^{0.47} \frac{1}{\sqrt{2\pi}} e^{-x^2/2}\, dx.$$
The standard normal density function does not have a closed-form anti-derivative, so this integral cannot be evaluated in the usual way. Note, however:
Result If Z ∼ N(0, 1) then
$$P(Z \le x) = F_Z(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}} e^{-t^2/2}\, dt = \Phi(x).$$
Φ(x) is a special notation that is often used to denote the cumulative distribution function of the N(0, 1) random variable. This result means that probabilities concerning Z ∼ N(0, 1) can be computed whenever the Φ function is available. It is available in R via the pnorm function. A table for Φ is available in the back of the Course Pack (and on UNSW Blackboard). This can be used, for example, to show that
P(Z ≤ 0.47) = Φ(0.47) ≈ 0.6808.

The shaded area in the following figure corresponds to Φ(0.47).


Result The Φ function has the following properties:
1. lim_{x→−∞} Φ(x) = 0,
2. lim_{x→∞} Φ(x) = 1,
3. Φ(0) = 1/2,
4. Φ is monotonically increasing over R.

[Figure: plot of the Φ function, Φ(x) against x.]

It follows from the previous result that the inverse of Φ, Φ⁻¹(x), is well-defined for all 0 < x < 1.

Some other examples: P(Z ≤ 1) = Φ(1) ≈ 0.8413 and P(Z ≤ 1.54) = Φ(1.54) ≈ 0.9382.

For finding a probability such as P(Z > 0.81), we need to work with the complement P(Z ≤ 0.81):
P(Z > 0.81) = 1 − P(Z ≤ 0.81) = 1 − Φ(0.81) ≈ 1 − 0.7910 = 0.2090.

How about probabilities concerning non-standard normal random variables? For example, how do we find P(X ≤ 12) where X ∼ N(10, 9)?
Result If X ∼ N(μ, σ²) then
$$Z = \frac{X - \mu}{\sigma} \sim N(0, 1).$$
More generally, any linear transformation of a normal random variable is normal. This means that the Z-score of a normal variable, (X − μ)/σ, is standard normal, and so we can use the table for the Φ function to calculate any probability concerning a normally distributed random variable, as in the examples below.

Example Find P(X ≤ 12) where X ∼ N(10, 9).
$$P(X \le 12) = P\!\left(\frac{X - 10}{3} \le \frac{12 - 10}{3}\right) = P(Z \le 0.67) \text{ where } Z \sim N(0, 1),\ = \Phi(0.67) \approx 0.7486.$$
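The same calculation can be checked in R (the small difference from the table answer is because the table rounds the Z-score to 0.67):

pnorm(12, mean = 10, sd = 3)   # about 0.7475, using the exact z = 2/3
pnorm(0.67)                    # Phi(0.67), about 0.7486, matching the table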

Example The distribution of young men's heights is approximately normally distributed with mean 174 cm and standard deviation 6.4 cm.

1. What percentage of these men are taller than six foot (182.9 cm)?

2. What's the chance that a randomly selected young man is 170-something cm tall?
3. Find a range of heights that contains 95% of young men.

Gamma Distribution
Definition The Gamma function at x > 0 is given by
$$\Gamma(x) = \int_0^\infty t^{x-1} e^{-t}\, dt.$$
Results Some basic results for the Gamma function are
1. Γ(x) = (x − 1)Γ(x − 1)
2. Γ(n) = (n − 1)! for n = 1, 2, 3, . . .
3. Γ(1/2) = √π


Result 2 above suggests that the Gamma function is an extension of the factorial function (e.g. 4! = 24) to general real numbers.
Definition A random variable X is said to have a Gamma distribution with parameters α and β (where α, β > 0) if X has density function
$$f_X(x; \alpha, \beta) = \frac{e^{-x/\beta} x^{\alpha - 1}}{\Gamma(\alpha)\,\beta^{\alpha}}, \qquad x > 0,$$
where Γ(α) is the Gamma function. A common shorthand for this distribution is X ∼ Gamma(α, β). Gamma density functions are skewed curves on the positive half-line. The following figure shows four different Gamma density functions.
[Figure: density functions fX(x) against x for Gamma(1.5,0.2), Gamma(4,3), Gamma(13,5) and Gamma(13,10) random variables.]
Results If X ∼ Gamma(α, β) then 1. E(X) = αβ, 2. Var(X) = αβ².
The Gamma distribution generalises the Exponential distribution, as below:

Result X has an Exponential distribution with parameter β if and only if X ∼ Gamma(1, β).

Quantile-quantile plots of data


Consider the situation in which we have a sample of size n, {x1, x2, . . . , xn}, from some unknown random variable, and we want to check whether these data appear to come from a random variable with cumulative distribution function FX(x). This can be achieved using a quantile-quantile plot (sometimes called a Q-Q plot), as defined below.
Quantile-quantile plots To check how well the sample {x1, x2, . . . , xn} approximates the distribution with cdf FX(x), plot the n sample quantiles against the corresponding quantiles of FX(x). That is, plot the points
$$\left(F_X^{-1}(p),\, x_{(k)}\right), \qquad p = \frac{k - 0.5}{n}, \quad k \in \{1, 2, \ldots, n\},$$
where x_{(k)} denotes the k-th smallest sample value.

If the data come from the distribution FX(x), then the points will show no systematic departure from the one-to-one line. According to the above definition, we need to know the exact cdf FX(x) to construct a quantile-quantile plot. However, if the shape of the distribution is invariant under linear transformation (known as a location-scale distribution, e.g. the normal distribution), we can construct the quantile-quantile plot using an arbitrary choice of parameters. Then we only need to check for systematic departures from a straight line rather than from the one-to-one line when assessing goodness-of-fit. Hence, in the special case of location-scale distributions, we can see how well data approximate a type of distribution without requiring any knowledge of what the values of the parameters are.

Example Recall the example dataset from Chapter 1: 2 4 5 7 8 10 14 17 27 35

Use a quantile-quantile plot to assess how well these data approximate a normal distribution.


There are ten values in this dataset, so the values of p we will use to plot the data are (k − 0.5)/10 for all k ∈ {1, 2, . . . , 10}, that is,
p ∈ {0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95}.
We want to find the quantiles corresponding to these values of p from a normal distribution, and compare those to the observed values. We have tables for the standard normal distribution, so we will use these to obtain quantiles. That is, we will plot the x(k) against Φ⁻¹(p) for the ten values of p displayed above. Using tables, we can show that the corresponding standard normal quantiles are
{−1.64, −1.04, −0.67, −0.39, −0.13, 0.13, 0.39, 0.67, 1.04, 1.64}
and so we plot these values against our ordered example dataset. This results in the following plot:

This plot does not follow a straight line: it has a systematic concave-up curve, so the data are clearly not normally distributed. In fact, because the curve is concave-up, the data are right-skewed (since the larger values in the dataset are much larger than expected for a normal distribution).
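A sketch of how this plot can be produced in R; the built-in qqnorm function gives a similar plot, though it uses slightly different plotting positions for small samples:

x <- c(2, 4, 5, 7, 8, 10, 14, 17, 27, 35)
p <- (1:10 - 0.5) / 10
plot(qnorm(p), sort(x),
     xlab = "Standard normal quantiles", ylab = "Sample quantiles")
qqnorm(x)    # similar built-in normal Q-Q plot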

Chapter 5 Bivariate Distributions


Observations are often taken in pairs, leading to bivariate observations (X, Y ), i.e. observations of two variables measured on the same subjects. Examples we have seen include (brain volume, body mass) measured for several species of animal, and (gender, survival) of passengers on the Titanic. Often we are interested in exploring the nature of the relationship between two variables that have been measured on the same set of subjects. In this chapter we develop a notation for the study of relationship between two variables, and explore some key concepts. For further reading, consider Hogg et al (2005) Chapter 2 or Rice (2007) Chapter 3.

Joint Probability Function and Density Function


Definition If X and Y are discrete random variables then the joint probability function of X and Y is fX,Y(x, y) = P(X = x, Y = y), the probability that X = x and Y = y.

Why study joint probabilities?


Recall from first year that if two events A and B are dependent, then
P(A ∩ B) ≠ P(A) P(B).
In the context of two dependent discrete random variables X and Y,
P(X = x, Y = y) ≠ P(X = x) P(Y = y).


So when we want to calculate P(X = x, Y = y), or any joint probability involving both X and Y, we cannot find it using the probability functions of X and Y, which give us P(X = x) and P(Y = y). We instead need to know the joint probability function fX,Y(x, y).

Example Suppose that X and Y have joint probability function as tabulated below.

fX,Y(x, y)        y
            −1      0      1
x    0     1/8     1/4    1/8
     1     1/8     1/16   1/16
     2     1/16    1/16   1/8

Find P(X = 0 and Y = 0). Show that P(X = 0 and Y = 0) ≠ P(X = 0)P(Y = 0).

Note that this means that we wouldn't have been able to get the correct answer without looking at the joint probability distribution of X and Y.

Example Let X = number of successes in the first of two Bernoulli trials each with success probability p, and let Y = total number of successes in the two trials. Then, for example,
fX,Y(1, 1) = P(X = 1 and Y = 1) = P(success on the first trial and failure on the second) = p(1 − p).

fX,Y(x, y)        y
            0          1         2
x    0    (1 − p)²   p(1 − p)    0
     1       0       p(1 − p)   p²

Joint density functions


Definition The joint density function of continuous random variables X and Y is a bivariate function fX,Y with the property
$$\int\!\!\int_A f_{X,Y}(x, y)\, dx\, dy = P((X, Y) \in A)$$
for any (measurable) subset A of R².
For two continuous random variables X and Y, probabilities have the following geometrical interpretation: fX,Y is a surface over the plane R², and probabilities over subsets A ⊆ R² correspond to the volume under fX,Y over A. For example, if
$$f_{X,Y}(x, y) = \frac{12}{7}(x^2 + xy) \qquad \text{for } x, y \in (0, 1)$$

then the joint density function looks like this:

[Figure: surface plot of f(x, y) over the unit square, 0 < x < 1, 0 < y < 1.]


Example (X, Y) have joint density function
$$f_{X,Y}(x, y) = \frac{12}{7}(x^2 + xy) \qquad \text{for } x, y \in (0, 1).$$
Find P(X < 1/2, Y < 2/3).

First, what region of the (X, Y ) plane do we want to integrate over?

[Figure: the unit square with the rectangle 0 < x < 1/2, 0 < y < 2/3 (the region X < 1/2, Y < 2/3) shaded.]

We want to integrate fX,Y (x, y) over this region:


$$
\begin{aligned}
P(X < 1/2,\, Y < 2/3) &= \int_0^{1/2}\!\!\int_0^{2/3} f_{X,Y}(x, y)\, dy\, dx
= \int_0^{1/2}\!\!\int_0^{2/3} \frac{12}{7}(x^2 + xy)\, dy\, dx \\
&= \frac{12}{7}\int_0^{1/2} \left[ x^2 y + \frac{x y^2}{2} \right]_0^{2/3} dx
= \frac{12}{7}\cdot\frac{2}{3}\int_0^{1/2} \left( x^2 + \frac{x}{3} \right) dx \\
&= \frac{8}{7}\left[ \frac{x^3}{3} + \frac{x^2}{6} \right]_0^{1/2}
= \frac{8}{7}\left( \frac{1}{8\cdot 3} + \frac{1}{4\cdot 6} \right) = \frac{2}{21}
\end{aligned}
$$
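As a sketch, the double integral can be checked numerically in R (the helper names f and inner are just illustrative):

f <- function(x, y) 12/7 * (x^2 + x * y)
inner <- function(x) sapply(x, function(xx) integrate(function(y) f(xx, y), 0, 2/3)$value)
integrate(inner, 0, 1/2)$value    # about 0.0952
2/21                              # the exact answer, 0.0952...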


Other results for fX,Y (x, y)


Many of the definitions and results for random variables, considered in chapter 3, generalise directly to the bivariate case. We consider some of these in this section. Essentially, all that changes from chapter 3 to the results stated here is that instead of doing a single summation or integral, we now do a double summation or double integral, because there are now two variables under consideration. In each of the following cases, think of what the univariate (one-variable) version of the result is, as given in chapter 3.
Firstly, we have the following property for joint probability functions.
Result If X and Y are discrete random variables then
$$\sum_{\text{all } x}\sum_{\text{all } y} f_{X,Y}(x, y) = 1.$$
If X and Y are continuous random variables then
$$\int_{-\infty}^{\infty}\!\!\int_{-\infty}^{\infty} f_{X,Y}(x, y)\, dx\, dy = 1.$$

Next, we will consider a definition of the joint cumulative distribution function (cdf).
Definition The joint cdf of X and Y is
$$F_{X,Y}(x, y) = P(X \le x, Y \le y) = \begin{cases} \displaystyle\sum_{u \le x}\sum_{v \le y} P(X = u, Y = v) & \text{(discrete)} \\[2ex] \displaystyle\int_{-\infty}^{y}\!\!\int_{-\infty}^{x} f_{X,Y}(u, v)\, du\, dv & \text{(continuous)} \end{cases}$$

Example Consider again
$$f_{X,Y}(x, y) = \frac{12}{7}(x^2 + xy), \qquad 0 < x < 1,\ 0 < y < 1.$$
Find FX,Y(x, y). First, what region of the (X, Y) plane do we want to integrate over?


[Figure: the unit square (where fX,Y > 0), with the region u ≤ x, v ≤ y shaded.]

For 0 < x < 1 and 0 < y < 1,
$$
\begin{aligned}
F_{X,Y}(x, y) &= \int_0^y\!\!\int_0^x f_{X,Y}(u, v)\, du\, dv = \frac{12}{7}\int_0^y\!\!\int_0^x (u^2 + uv)\, du\, dv \\
&= \frac{12}{7}\int_0^y \left[ \frac{u^3}{3} + \frac{u^2 v}{2} \right]_0^x dv
= \frac{12}{7}\int_0^y \left( \frac{x^3}{3} + \frac{x^2 v}{2} \right) dv \\
&= \frac{12}{7}\left[ \frac{x^3 v}{3} + \frac{x^2 v^2}{4} \right]_0^y
= \frac{12}{7}\left( \frac{x^3 y}{3} + \frac{x^2 y^2}{4} \right)
\end{aligned}
$$

Thus, for example,
$$P(X < 1/2,\, Y < 1/3) = F_{X,Y}\!\left(\tfrac{1}{2}, \tfrac{1}{3}\right) = \frac{12}{7}\left( \frac{1}{72} + \frac{1}{144} \right) = \frac{3}{84}.$$

Finally, we will consider expectations. We define the expectation of some joint function of X and Y, g(X, Y), as below.
Result If g is any function of X and Y,
$$E\{g(X, Y)\} = \begin{cases} \displaystyle\sum_{\text{all } x}\sum_{\text{all } y} g(x, y)\, P(X = x, Y = y) & \text{(discrete)} \\[2ex] \displaystyle\int_{-\infty}^{\infty}\!\!\int_{-\infty}^{\infty} g(x, y)\, f_{X,Y}(x, y)\, dx\, dy & \text{(continuous)} \end{cases}$$

Note that this formula has the same form as that of E{g(X)} from chapter 3.


Example

fX,Y(x, y)        y
             0     1     2
x    0      0.1   0.2   0.2
     1      0.2   0.2    a

where a is a constant.

For the above bivariate distribution,

1. Find a, if fX,Y(x, y) is a joint probability function.
2. Find FX,Y(1, 1).
3. Find E(XY).

Marginal probability/density functions


Result If X and Y are discrete, then fX(x) and fY(y) can be calculated from fX,Y(x, y) as follows:
$$f_X(x) = \sum_{\text{all } y} f_{X,Y}(x, y), \qquad f_Y(y) = \sum_{\text{all } x} f_{X,Y}(x, y).$$

fX (x) is sometimes referred to as the marginal probability function of X.


Example

                  y
             0     1     2    fX(x)
x    0      0.1   0.2   0.2
     1      0.2   0.2   0.1
fY(y)

Find the marginal distributions of X and Y .

Result If X and Y are continuous, then fX(x) and fY(y) can be calculated from fX,Y(x, y) as follows:
$$f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\, dy, \qquad f_Y(y) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\, dx.$$

fX (x) is sometimes referred to as the marginal density function of X.

Example
$$f_{X,Y}(x, y) = \frac{12}{7}(x^2 + xy), \qquad 0 < x < 1,\ 0 < y < 1.$$
Find fX(x) and fY(y).
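A sketch of how marginalisation by integration can be checked numerically in R, without giving the analytic answer away (the helper names f and fX are illustrative):

f <- function(x, y) 12/7 * (x^2 + x * y)
fX <- function(x) sapply(x, function(xx) integrate(function(y) f(xx, y), 0, 1)$value)
fX(c(0.25, 0.5, 0.75))        # numerical values of the marginal density of X
integrate(fX, 0, 1)$value     # should be (approximately) 1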


Conditional Probability and Density Functions


Definition If X and Y are discrete, the conditional probability function of X given Y = y is
$$f_{X|Y}(x|y) = P(X = x \mid Y = y) = \frac{P(X = x, Y = y)}{P(Y = y)} = \frac{f_{X,Y}(x, y)}{f_Y(y)}.$$
Similarly,
$$f_{Y|X}(y|x) = P(Y = y \mid X = x) = \frac{f_{X,Y}(x, y)}{f_X(x)}.$$
Note that this is simply an application of the definition of conditional probability from first year.
Definition If X and Y are continuous, the conditional density function of X given Y = y is
$$f_{X|Y}(x|Y = y) = \frac{f_{X,Y}(x, y)}{f_Y(y)}.$$
Similarly,
$$f_{Y|X}(y|X = x) = \frac{f_{X,Y}(x, y)}{f_X(x)}.$$

Often we write fY |X (y|x) as shorthand for fY |X (y|X = x).

Example

                  y
             0     1     2    fX(x)
x    0      0.1   0.2   0.2    0.5
     1      0.2   0.2   0.1    0.5
fY(y)       0.3   0.4   0.3     1

Find fX|Y(x|2) and fY|X(y|0).


Example
$$f_{X,Y}(x, y) = \frac{12}{7}(x^2 + xy), \qquad 0 < x < 1,\ 0 < y < 1.$$
Find fX|Y(x|y) and fY|X(y|x).

Let X and Y be continuous. For a given value of x, fY|X(y|x) is an ordinary density function and has the usual properties, such as:
Result If X and Y are continuous then
$$P(a \le Y \le b \mid X = x) = \int_a^b f_{Y|X}(y|x)\, dy.$$
Similar results apply to discrete X and Y:
Result If X and Y are discrete then
$$P(Y \in A \mid X = x) = \sum_{y \in A} f_{Y|X}(y|X = x).$$

Conditional Expected Value and Variance


The conditional expected value of X given Y = y is
$$E(X|Y = y) = \begin{cases} \displaystyle\sum_{\text{all } x} x\, P(X = x \mid Y = y) & \text{if } X \text{ is discrete} \\[2ex] \displaystyle\int_{-\infty}^{\infty} x\, f_{X|Y}(x|y)\, dx & \text{if } X \text{ is continuous} \end{cases}$$

Similarly,
$$E(Y|X = x) = \begin{cases} \displaystyle\sum_{\text{all } y} y\, P(Y = y \mid X = x) & \text{if } Y \text{ is discrete} \\[2ex] \displaystyle\int_{-\infty}^{\infty} y\, f_{Y|X}(y|x)\, dy & \text{if } Y \text{ is continuous} \end{cases}$$

Note that this can be thought of as an application of the definition of E(X) from chapter 3.

Example Recall the earlier discrete example, in which

                  y
             0     1     2    fX(x)
x    0      0.1   0.2   0.2    0.5
     1      0.2   0.2   0.1    0.5
fY(y)       0.3   0.4   0.3     1

Find E(X|Y = 2) and E(Y|X = 0).

Example Recall the earlier continuous example, in which
$$f_{X,Y}(x, y) = \frac{12}{7}(x^2 + xy), \qquad 0 < x < 1,\ 0 < y < 1.$$
Find E(X|Y).


The conditional variance of X given Y = y is
$$\mathrm{Var}(X|Y = y) = E(X^2|Y = y) - \{E(X|Y = y)\}^2,$$
where
$$E(X^2|Y = y) = \sum_{\text{all } x} x^2\, P(X = x \mid Y = y) \quad \text{or} \quad \int_{-\infty}^{\infty} x^2\, f_{X|Y}(x|y)\, dx.$$
Similarly for Var(Y|X = x). Note that these definitions can be thought of as an application of the definition of Var(X) from chapter 3.

Example Find Var(X|Y = 2) for the earlier discrete data example.

Example Find Var(X|Y) for the earlier continuous data example.


Independent Random Variables


Definition Random variables X and Y are independent if and only if, for all x, y,
fX,Y(x, y) = fX(x) fY(y).
Result Random variables X and Y are independent if and only if, for all x, y,
fY|X(y|x) = fY(y), or equivalently fX|Y(x|y) = fX(x).
This result allows an interpretation that conforms with the everyday meaning of the word independent: if X and Y are independent, then the probability structure of Y is unaffected by the knowledge that X takes on some value x (and vice versa).
Result If X and Y are independent, FX,Y(x, y) = FX(x) FY(y).

Example

                   y
            −1      0      1    fX(x)
x    0     0.01   0.02   0.07    0.1
     1     0.04   0.13   0.33    0.5
     2     0.05   0.05   0.30    0.4
fY(y)      0.1    0.2    0.7      1

Are X and Y independent?


Example X and Y have joint probability function fX,Y(x, y) = p²(1 − p)^{x+y}, x = 0, 1, . . . , y = 0, 1, . . . ; 0 < p < 1. Are X and Y independent?

Example X and Y have joint density fX,Y(x, y) = 6xy², 0 < x < 1, 0 < y < 1. Are X and Y independent?


Result If X and Y are independent, E(XY ) = E(X) E(Y ) and more generally, for any functions g(X) and h(Y ), E{g(X) h(Y )} = E{g(X)} E{h(Y )}

Covariance and Correlation


Covariance
Definition The covariance of X and Y is
Cov(X, Y) = E{(X − μX)(Y − μY)}, where μX = E(X), μY = E(Y).
Cov(X, Y) measures not only how X and Y vary about their means, but also how they vary together linearly. Cov(X, Y) > 0 if X and Y are positively associated, i.e. if X is likely to be large when Y is large and X is likely to be small when Y is small. If X and Y are negatively associated, Cov(X, Y) < 0.
Results
1. Cov(X, X) = Var(X)
2. Cov(X, Y) = E(XY) − μX μY.
Result If X and Y are independent then Cov(X, Y) = 0.
Proof (continuous case):

$$E(XY) = \int_{-\infty}^{\infty}\!\!\int_{-\infty}^{\infty} xy\, f_X(x) f_Y(y)\, dx\, dy = \left(\int_{-\infty}^{\infty} x f_X(x)\, dx\right)\left(\int_{-\infty}^{\infty} y f_Y(y)\, dy\right) = E(X)\, E(Y),$$
so Cov(X, Y) = E(XY) − E(X)E(Y) = 0.

Results

1. For arbitrary constants a, b, Var(aX + bY) = a²Var(X) + b²Var(Y) + 2ab Cov(X, Y). Hence
2. Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y),
3. Var(X + Y) = Var(X) + Var(Y) when X and Y are independent, and also
4. Var(X − Y) = Var(X) + Var(Y) when X and Y are independent.
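Result 1 can be illustrated with a small simulation in R; the choices of a, b and the way the correlated pair is generated below are arbitrary:

set.seed(1)
n <- 1e5
x <- rnorm(n); y <- 0.5 * x + rnorm(n)     # a correlated pair (X, Y)
a <- 2; b <- -3
var(a * x + b * y)                                       # simulated variance
a^2 * var(x) + b^2 * var(y) + 2 * a * b * cov(x, y)      # matches, up to simulation error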

Correlation
Definition The correlation between X and Y is
$$\mathrm{Corr}(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}}.$$

Corr(X, Y) measures the strength of the linear association between X and Y, and is analogous to the sample correlation coefficient introduced in Chapter 2.
Definition If Corr(X, Y) = 0, then X and Y are said to be uncorrelated.
Independent random variables are uncorrelated, but uncorrelated variables are not necessarily independent. For example, if X has a distribution which is symmetric about zero and Y = X², then E(XY) = E(X³) = 0 and E(X) = 0, so Cov(X, Y) = 0 and Corr(X, Y) = 0; but since Y = X², X and Y are dependent.
Results
1. |Corr(X, Y)| ≤ 1
2. Corr(X, Y) = −1 if and only if P(Y = a + bX) = 1 for some constants a, b such that b < 0.
3. Corr(X, Y) = 1 if and only if P(Y = a + bX) = 1 for some constants a, b such that b > 0.
Proof: 1. Let ρ = Corr(X, Y).


Write σ²X = Var(X) and σ²Y = Var(Y). Then
$$0 \le \mathrm{Var}\!\left(\frac{X}{\sigma_X} + \frac{Y}{\sigma_Y}\right) = \frac{\mathrm{Var}(X)}{\sigma_X^2} + \frac{\mathrm{Var}(Y)}{\sigma_Y^2} + 2\,\mathrm{Cov}\!\left(\frac{X}{\sigma_X}, \frac{Y}{\sigma_Y}\right) = 2 + 2\rho = 2(1 + \rho),$$
so ρ ≥ −1. Also,
$$0 \le \mathrm{Var}\!\left(\frac{X}{\sigma_X} - \frac{Y}{\sigma_Y}\right) = 2(1 - \rho),$$
so ρ ≤ 1.

2. If ρ = −1,
$$\mathrm{Var}\!\left(\frac{X}{\sigma_X} + \frac{Y}{\sigma_Y}\right) = 2(1 + \rho) = 0.$$
This means that X/σX + Y/σY is a constant, i.e. P(X/σX + Y/σY = c) = 1 for some constant c. But
$$\frac{X}{\sigma_X} + \frac{Y}{\sigma_Y} = c \iff Y = -\frac{\sigma_Y}{\sigma_X}X + c\,\sigma_Y,$$
so P(Y = a + bX) = 1 for some constants a = cσY and b = −σY/σX < 0.

3. Similarly, for ρ = 1, P(Y = a + bX) = 1 for some constant a and b = σY/σX > 0.

The Bivariate Normal Distribution


The most commonly used special type of bivariate distribution is the bivariate normal. X and Y have the bivariate normal distribution if
$$f_{X,Y}(x, y) = \frac{1}{2\pi\sigma_X\sigma_Y\sqrt{1-\rho^2}} \exp\left\{ -\frac{1}{2(1-\rho^2)}\left[ \left(\frac{x-\mu_X}{\sigma_X}\right)^2 - 2\rho\left(\frac{x-\mu_X}{\sigma_X}\right)\left(\frac{y-\mu_Y}{\sigma_Y}\right) + \left(\frac{y-\mu_Y}{\sigma_Y}\right)^2 \right] \right\}$$
for −∞ < x < ∞, −∞ < y < ∞; where −∞ < μX < ∞, −∞ < μY < ∞, σX > 0, σY > 0, −1 < ρ < 1.
Result
1. X ∼ N(μX, σ²X)
2. Y ∼ N(μY, σ²Y).
Result ρ = Corr(X, Y).
Another useful result is that the conditional distributions Y|X = x and X|Y = y are normal, as shown in one of this chapter's tutorial exercises.


Visualisation of the Bivariate Normal Density Function


The bivariate normal density is a bivariate function fX,Y(x, y) which appears as a bell-shaped surface in three-dimensional space. Below are two views of the surface when μX = μY = 0, σX = σY = 1 and ρ = 0.5:

When looking down on the surface from directly above, the bivariate normal distribution has elliptical contours. The following figure provides contour plots of the bivariate normal density for μX = 3, μY = 7, σX = 2, σY = 5 in all cases, but with ρ = Corr(X, Y) taking four different values:
[Figure: contour plots of the bivariate normal density for Corr(X, Y) = 0.3, 0.7, −0.7 and 0.]
These, respectively, correspond to moderate positive correlation between X and Y , strong positive correlation between X and Y , strong negative correlation between X and Y , X and Y uncorrelated.
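A sketch of how one such contour panel can be drawn in R, coding the density formula above directly (dbvnorm is just an illustrative helper name, and the grid limits are arbitrary):

dbvnorm <- function(x, y, mx = 3, my = 7, sx = 2, sy = 5, rho = 0.7) {
  zx <- (x - mx) / sx; zy <- (y - my) / sy
  exp(-(zx^2 - 2 * rho * zx * zy + zy^2) / (2 * (1 - rho^2))) /
    (2 * pi * sx * sy * sqrt(1 - rho^2))
}
x <- seq(-5, 11, length = 100); y <- seq(-8, 22, length = 100)
contour(x, y, outer(x, y, dbvnorm), main = "Corr(X,Y) = 0.7")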

Result If X and Y are uncorrelated jointly normal variables, then X and Y are independent.

Note that this is a special exception to the earlier rule that uncorrelated variables are not necessarily independent. You will prove this exception in one of the tutorial exercises for this chapter.

More Than Two Random Variables


All of the definitions and results in this chapter extend to the case of more than two random variables. For the general case of K random variables we now give some of the most fundamental of these.
Definition The joint probability function of X1, . . . , XK is
fX1,...,XK(x1, . . . , xK) = P(X1 = x1, . . . , XK = xK).
We often write X = (X1, . . . , XK)^T in vector notation, and refer to the probability function as fX(x).
Definition The joint cdf in both the discrete and continuous cases is
FX(x) = FX1,...,XK(x1, . . . , xK) = P(X1 ≤ x1, . . . , XK ≤ xK).
Definition The joint density of X1, . . . , XK is
$$f_{\mathbf{X}}(\mathbf{x}) = f_{X_1,\ldots,X_K}(x_1, \ldots, x_K) = \frac{\partial^K}{\partial x_1 \cdots \partial x_K} F_{X_1,\ldots,X_K}(x_1, \ldots, x_K).$$
Definition X1, . . . , XK are independent if
fX(x) = fX1,...,XK(x1, . . . , xK) = fX1(x1) · · · fXK(xK)
or
FX(x) = FX1,...,XK(x1, . . . , xK) = FX1(x1) · · · FXK(xK).

Part Three Collecting data

Chapter 6 Introduction to Study Design


Now we've met some common methods for modelling data, it's time to think about how to collect data in such a way that we can use these models. Whether or not it is appropriate to do statistical modelling, using tools from the previous chapters, depends critically on how the data were collected. For example, we have been treating data as random variables. For this to be a reasonable thing to do, we need to sample randomly in order to introduce randomness. In this chapter we will work through some key ideas concerning the collection of data. We will also meet the important idea that an appropriate study design enables the examination of the properties of the statistics we calculate from samples, using random variables. For further reading on survey design, consider Rice (2007) Chapter 7.

Survey design
When collecting data in a survey it is critical to try to collect data that is representative and random.

Representativeness
When we collect a sample from a population, typically we would like to use this sample to make inferences about some property of the population at large. However, this is only reasonable if the sample is representative of the population. If this is not achieved then inferences about the population can be completely wrong. We can formally define representativeness as below.

Definition Consider a sample X1, . . . , Xn from a random variable X which has probability or density function fX(x). The sample is said to be representative if
fXi(x) = fX(x) for each i.

Representativeness is typically a more important consideration than sample size: it is better to have a small but representative sample than a large but unrepresentative sample.

Example The poll that changed polling http://historymatters.gmu.edu/d/5168 The Literary Digest correctly predicted the outcomes of each of the 1916-1932 US presidential elections by conducting polls. These polls were a lucrative venture for the magazine: readers liked them; they got a lot of news coverage; and they were linked to subscription renewals. The 1936 postal card poll claimed to have asked one fourth of the nation's voters which candidate they intended to vote for. Based on more than 2,000,000 returned post cards, it issued its prediction: Republican presidential candidate Alfred Landon would win in a landslide. But this largest poll in history couldn't have been more wrong: the Democrat Roosevelt won the election by the largest margin in history! (Roosevelt got more than 60% of the vote, but was predicted to get only 43%.) The Literary Digest lost a lot of credibility from this result and was soon discontinued. The result was correctly predicted by a new pollster, George Gallup, based on just 50,000 voters selected in a representative fashion and interviewed face-to-face. Gallup not only predicted the election result, but before the Literary Digest poll was released, he correctly predicted that it would get it wrong! This election made Gallup polls famous, and formed a template for polling methods ever since. What went wrong in the Literary Digest poll?

How do you ensure a sample is representative? One way to ensure this is to take a simple random sample from the population of interest, as below.


Random samples
Definition A random sample of size n is a set of random variables X1, . . . , Xn with the properties:
1. the Xi's each have the same probability distribution;
2. the Xi's are independent.
We often say that the Xi are iid (independently and identically distributed).

Example Sampling with replacement Consider sampling a variable X in a population of 10 subjects, which take the following (sorted) values: 2 4 5 7 8 10 14 17 27 35

We sample three subjects randomly (with equal sampling probability for each subject), with replacement. Let these values be X1 , X2 and X3 . Show that X1 , X2 and X3 are iid.

It is more common, however, to sample without replacement. The most common method of obtaining such a random sample is to take a simple random sample:
Definition A simple random sample of size n is a set of subjects sampled in such a way that all possible samples of size n are equally likely.


To obtain a random sample using R: obtain a list of all subjects in the population, assign each subject a number from 1 to N, and use sample(N,n) to take a simple random sample of size n. (The sample function generates N random numbers, assigns one to each subject, then includes in the sample the n subjects with the smallest random numbers.)
Strictly speaking, a simple random sample does not consist of iid random variables: they are identically distributed, but they are dependent, since knowledge that Xi = xi implies that Xj ≠ xi, because the ith subject can only be included in the sample once. However, this dependence is very weak when the population size N is large compared to the sample size n (e.g. if N > 100n), and so in most instances it can be ignored. See MATH3831 for finite sample survey methods when this dependence cannot be ignored.
It is important in surveys, wherever possible, to ensure sampling is random. This is important for a few reasons: it ensures the n values in the sample are iid, which is an important assumption of most methods of statistical inference (as in the coming chapters); random sampling removes selection bias, since the choice of who is included in the study is taken away from the experimenter, hence it is not possible for them to (intentionally or otherwise) manipulate results through choice of subjects; and randomly sampling from the population of interest guarantees that the sample is representative of the population. Unfortunately, it is typically very hard to obtain a simple random sample from the population of interest, so the best we can hope for is a good approximation.
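A minimal sketch of the sampling step in R (the population size 1000 is an arbitrary choice for illustration):

N <- 1000; n <- 10
sample(N, n)                     # labels of a simple random sample of n subjects
sample(N, n, replace = TRUE)     # sampling *with* replacement instead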

Example Consider polling NSW voters to predict the result of a state election. You do not have access to the list of all registered voters (for privacy reasons). How would you sample NSW voters? It is difficult to think of a method that ensures a representative sample!


Statistics calculated from samples


Definition Let X1, . . . , Xn be a random sample. A statistic is any real-valued function of the random sample.
While any real-valued function can in theory be considered as a statistic, in practice we focus on particular functions which measure something of interest to us about the sample. Important examples of statistics are:
$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i, \text{ the sample mean;}$$
$$S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2, \text{ the sample variance;}$$
$$X_{0.5}, \text{ the sample median.}$$

A key advantage of random sampling is that the fact that the sample is random implies that any statistic calculated from the sample is random. Hence we can treat statistics as random variables and study their properties. Further, the iid property makes it a lot easier to derive some important properties of statistics, as in the important examples below.
Properties of the sample mean If X1, . . . , Xn is a random sample from a variable with mean μ and variance σ², then the sample mean X̄ satisfies:
E(X̄) = μ and Var(X̄) = σ²/n.
Properties of the sample proportion If X1, . . . , Xn is a random sample from a Bernoulli(p) variable, then the sample proportion p̂ satisfies:
E(p̂) = p and Var(p̂) = p(1 − p)/n.

Can you prove the above results?
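These properties can also be illustrated by simulation; a sketch in R, where the exponential population (mean 2, variance 4) and the number of replicates are arbitrary choices:

set.seed(1)
n <- 10
xbars <- replicate(1e4, mean(rexp(n, rate = 1/2)))   # many sample means, mu = 2, sigma^2 = 4
mean(xbars)        # close to mu = 2
var(xbars)         # close to sigma^2 / n = 0.4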


Note that while the variance results given above require the variables to be iid (hence a random sample), the expectation results only require the observations in the sample to be identically distributed. Because of the property that E(X̄) = μ, we say that sample means of random samples are unbiased, in a sense that will be defined formally in the following chapter. Similarly, p̂ is unbiased.

Methods of survey sampling


There are many methods of survey sampling beyond taking a simple random sample. Key considerations when choosing a sampling scheme are efficiency and effort: ideally we would like a statistic that gives us a good estimate relative to the effort invested in sampling.
Definition Consider two unbiased alternative statistics, denoted as g(X1, . . . , Xn) and h(Y1, . . . , Ym). We say that g(X1, . . . , Xn) is more efficient than h(Y1, . . . , Ym) if:
Var[g(X1, . . . , Xn)] < Var[h(Y1, . . . , Ym)].
Note that the above notation implies that not only can the statistics that we are using differ (g() vs h()), but the observations used to calculate the statistics can also differ (X1, . . . , Xn vs Y1, . . . , Ym). This reflects that there are two ways to achieve efficiency: use a different statistic (discussed in later chapters) or sample differently. The most obvious way that sampling differently can increase efficiency is by increasing the sample size, but even for a fixed sample size (n = m) efficiency varies with sampling method. Below are three common methods of sampling; for more, and a deeper study of their properties, see MATH3831.
Simple random sample Weaknesses are that it can be difficult to implement in practice, requiring a high effort, and it can be inefficient.
Stratified random sample If the population can be broken into subpopulations (or strata) which differ from each other in the variable of interest, it is more efficient to sample separately within each stratum than to sample once across the whole population. e.g. Estimating average taxable income: this varies considerably with age, so a good survey design would involve sampling separately within age strata (if possible).
Cluster sampling This is useful when subjects in the population arise in clusters, and it takes less effort to sample within clusters than across clusters. Effort-


per-subject can be reduced by sampling clusters and then measuring all (or sampling many) subjects within a cluster. e.g. Face-to-face interviews with 100 NSW household owners: it is easier logistically to sample ten postcodes, then sample ten houses in each postcode, than to travel to a random sample of 100 households spread across (potentially) 100 NSW postcodes!

Example Consider estimating the average heart rate of students, μ. Males and females are known to have different heart rates, μM and μF, but the same variance σ². Consider estimating the mean using a stratified random sample, as follows: take a random sample from each stratum, of size n, and calculate the sample mean of each gender, X̄M and X̄F. Since males and females occur with (approximately) equal frequency in the student population, we can estimate the overall mean heart rate as
X̄s = (X̄M + X̄F)/2.
1. Find Var(X̄s).
2. * Show that the marginal variance of heart rate across the student population (ignoring gender) is Var(X) = σ² + (μM − μF)²/4.
3. Hence show that stratified random sampling is more efficient than using a simple random sample of size 2n, if μM ≠ μF.


Design of experiments
Often in science we would like to demonstrate causation, e.g. does smoking cause learning difficulties? Does praying for a patient cause a better recovery following heart surgery? While surveying a population often provides valuable information, it is very difficult to demonstrate causation based on just observing an association between two variables. The reason for this is that lurking variables can induce an association between X and Y when there is actually no causal relationship, or when the causal relationship has a completely different nature to what we observe.

Example Student survey results demonstrate that students who walk to UNSW take a lot less time than students who use public transport! Results:

Does this mean that walking to UNSW is faster than using public transport? i.e. Should we all walk to UNSW to save time?

Definitions An observational study (or survey) is a study in which we observe variables (X, Y) on subjects without manipulating them in any way. An experiment is a study in which subjects are manipulated in some way (changing X) and we observe their response (Y). The purpose of an experiment is to demonstrate that changes in X cause changes in Y.


Example 1. The great prayer experiment (popularised in Richard Dawkins' book The God Delusion) Does praying for a patient influence their recovery from illness? A clinical trial was conducted to answer this question (Benson et al. 2006, published in the American Heart Journal), in which 1201 patients due to undergo coronary bypass surgery were randomly assigned to one of two groups: a group to receive daily prayers for 14 days following surgery, and a group who received no prayers. The study was double-blind, meaning that neither the patient nor anyone treating them knew whether or not they were being prayed for. The outcome variable of interest was whether or not each patient had any complications during the first 30 days following surgery. 2. A guinea pig experiment Does smoking while pregnant affect the cognitive development of the foetus? Johns et al (1993) conducted a study to look at this question using guinea pigs as a model. They injected nicotine tartate in a saline solution into ten pregnant guinea pigs, injected saline solution with no nicotine into ten other pregnant guinea pigs, and compared the cognitive development of offspring by getting them to complete a simple maze where they look for food. Cognitive development was measured as the number of errors made by the offspring when looking for food in a maze. For each experiment, what is X (the treatment variable) and Y (the response variable)?

Note that both the above experiments (indeed any good experiment) are designed so that the only thing allowed to vary across groups is the treatment variable of interest (X), so if a significant effect is detected in Y, the only plausible explanation would be that it was caused by X.

Key considerations in experimental design


Any experiment should compare, randomise and repeat: Compare to demonstrate that changes in X cause changes in Y, we need to compare across suitably designed treatment groups (for which we have introduced changes in the value of X). These groups need to be carefully designed so that the only thing that differs across groups is the treatment variable X. Double-blinding is often used for this reason (e.g. the prayer experiment), as is a placebo or sham treatment (e.g. saline-only injections in the guinea

92 pig experiment).

CHAPTER 6. INTRODUCTION TO STUDY DESIGN

Randomise the allocation of subjects to treatment groups. This ensures that any differences across groups, apart from those caused by treatment, are governed by chance (which we can then model!). Repeat the application of the treatment to the different subjects in each treatment group. It is important that application of treatment is replicated (rather than applied once, in bulk) in order that we can make inferences about the effect of the treatment in general. The above points may seem relatively obvious, but they can be difficult to implement correctly, and errors in design can be hard to spot.

Example Consider the following experiments. 1. Consider the Mythbusters Is yawning contagious? episode: http://www.yourdiscovery.com/video/mythbusters-top-10-is-yawning-contagious/ The first attempt to answer this question involved sitting nine subjects together in a room for ten minutes, counting the number of yawns, and comparing results to when there was a seed yawner in the room with them who pretended to yawn for ten minutes. (Results were inconclusive!) 2. Greg was studying how mites affect the growth of cotton plants. He applied a mite treatment to eight (randomly chosen) plants by crumpling up a mite-infested leaf and leaving it at the base of each plant. He applied a no-mite treatment by not putting any leaves at the base of each of eight control plants. (Surprisingly, plants in the mite treatment had faster growth!?) 3. The success of a course on road rules was assessed by using the RTA's practice driving test: http://www.rta.nsw.gov.au/licensing/tests/driverknowledgetest/demonstrationdriverknowledgetest Participants were asked to complete the test before the course, then again afterwards, and results were compared. (There was a significant improvement in scores on the test.) What error has been made in each study?


Common experimental designs


Below are a few common experimental designs. Randomised comparative experiment. Define K treatment groups (each with different levels of the variable X) and randomly assign subjects to each group. Randomised blocks design. If there is some blocking variable known to be important to the response variable, break subjects into blocks according to this variable and randomise allocation of subjects to treatment groups separately within each block. This controls for the effects of the blocking variable. Matched pairs design. A common special case of a randomised blocks design, where the blocks come in pairs. Common examples are before-after experiments (a pair of measurements is taken on a subject before and after treatment application), which control for subject-to-subject variation, and twins experiments (a pair of identical twins are studied, with one assigned to each of two treatment groups), which control for genetic variation. For more details on the above and other common types of experiment, see MATH3851. There is an analogy between randomised comparative experiments and simple random samples (which treat all subjects as equal), and between stratified random sampling and randomised blocks experiments (which break subjects into blocks/strata which are expected to differ in response). The terminology used is different but the concept is similar!

Example Does regularly taking vitamins guard against illness? Consider two experiments on a set of 2n subjects: A. Randomly assign subjects to one of two groups, each consisting of n subjects. The first group are given a vitamin supplement to take daily over the study period (three months), the second are given a placebo tablet (with no vitamins in it), to take daily. Number of illnesses is recorded over the study period.


B. All subjects are given a set of tablets (vitamins or placebo) and asked to take them daily for three months. They are then given a different set of tablets (placebo or vitamin, whichever they didn't have last time) and are asked to take these for three months. Number of illnesses is recorded and compared over the two periods. Let the mean number of illnesses in the two treatment groups be Ȳv and Ȳp. We are interested in the mean difference in number of illnesses between takers of vitamin tablets and takers of a placebo, estimated using the sample mean difference Ȳv − Ȳp. Assume Var(Ȳv) = Var(Ȳp) = σ².
1. What type of experiment has been done in each of A and B above?
2. Find Var(Ȳv − Ȳp) for experiment A.
3. It is noted in analysis that there is a correlation between number of illnesses in the two study periods (because some people get sick more often than others). Find Var(Ȳv − Ȳp) for experiment B, assuming that the correlation in measurements (and in sample means) across the two study periods is 0.5.
4. Which experiment gives a more efficient estimate of the treatment effect?


Sample size determination


The studies we have discussed so far have used a range of different sample sizes, as seen in the examples in the following table:

Study                          n
Mites on plants experiment     8
Guinea pig experiment          10
NSW voter survey               365
Prayer experiment              600

The studies with a binary response had a larger sample size. This is often the case because less information is typically contained in a binary response than a quantitative response (although this depends on the variance σ²). But in any given situation, how do you choose your sample size?
Sample size determination via variances One method of sample size determination is to: identify the key statistic of interest; specify the desired value of the variance of this statistic, c; estimate the variance of this statistic under random sampling, as a function of sample size, V(n); and solve c = V(n) for n.

Example In the prayer experiment, it was of interest to estimate the difference in proportion of patients with complications between the prayer treatment and the control, p1 − p2. The researchers decided that they wanted to be able to estimate this quantity with a variance less than 0.03² (for reasons to be explored later).
1. Find Var(p̂1 − p̂2) as a function of n, the planned sample size in each treatment group.
2. Solve for n such that Var(p̂1 − p̂2) ≤ 0.03².
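A sketch of the arithmetic in R, assuming the conservative planning value p1 = p2 = 0.5 (an assumption for illustration; the planning value actually used by the researchers is not stated here), so that Var(p̂1 − p̂2) = 2p(1 − p)/n:

p <- 0.5                               # conservative planning value (assumption)
target <- 0.03^2
ceiling(2 * p * (1 - p) / target)      # n per group, about 556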

Part Four Inference from data

Chapter 7 Estimators and their properties


Introduction
In Part 2 we dealt with the situation where we know the distribution of the variable of interest to us, and we know the values of parameters in the distribution. But in practice, we usually do not know the value of the parameters, and our primary goal is to use a random sample to make inferences about the parameters. This is the subject of the next few chapters. For further reading, consider Hogg et al (2005) sections 4.1, 5.1 and 5.4 or Rice (2007) sections 7.1-7.3 (but ignore the finite population correction material).

Example Consider the situation where we would like to know the cadmium concentration in water from a particular dam. We take five (representative) water samples, estimate the cadmium concentration in each, then average. What does this tell us about μ, the true average cadmium concentration of water in the dam? In this chapter we will learn how to make inferences about μ based on a sample.

Statistical inference is a very powerful tool: it allows us to make very specific statements about a variable, or about a population, based on just a sample of measurements!


Statistical Models
Given an observed random sample X1, . . . , Xn it is common to postulate a statistical model for the data. This is a set of density or probability functions fX that are consistent with the data, and facilitates answering certain questions of interest. A parametric model is a set of fX's that can be parametrised by a finite number of parameters. We will only deal with parametric models in this course, although the more general non-parametric case is also worth studying; this is considered in MATH3811.

Example Keating, Glaser and Ketchum (Technometrics, 1990) describe data on the lifetimes (hours) of 20 pressure vessels constructed of fibre/epoxy composite materials wrapped around metal liners. The data are: 274 28.5 1.7 20.8 871 363 1311 1661 236 828 458 290 54.9 175 1787 970 0.75 1278 776 126. The following figure shows a graphical representation of the data:

Given the positive and right-skewed nature of the data, a plausible parametric model for the data is:
$$f_X(x; \theta) = \frac{1}{\theta} e^{-x/\theta}, \qquad x > 0;\ \theta > 0.$$
Since this family of density functions is parameterised by the single parameter θ > 0, this is a parametric model. We could check how well this parametric model fits the data using a quantile-quantile plot.

A general parametric model with a single parameter is {fX(x; θ) : θ ∈ Θ}.


The set Θ ⊆ R is the set of possible values of θ and is known as the parameter space. If this model is assumed for a random sample X1, . . . , Xn then we write X1, . . . , Xn ∼ fX(x; θ), θ ∈ Θ.

Note that a model for a random sample induces a probability measure on its members. However, these probabilities depend on the value of θ. This is sometimes indicated using subscripted notation such as Pθ, Eθ, Varθ to describe probabilities and expectations according to the model and particular values of θ, although use of such notation will be minimised in these notes.

Estimation
Let X1, . . . , Xn be a random sample with model {fX(x; θ) : θ ∈ Θ}. A fundamental problem in statistics is that of determining a single θ ∈ Θ that is most consistent with the sample. This is known as the estimation problem, sometimes referred to as point estimation.
Definition Suppose X1, . . . , Xn ∼ fX(x; θ), θ ∈ Θ. An estimator for θ, denoted by θ̂, is any real-valued function of X1, . . . , Xn; i.e. θ̂ = g(X1, . . . , Xn) where the function g : Rⁿ → R.
This definition was used in the previous chapter to define a statistic; here we are talking about exactly the same thing, but focussing on the situation where we wish to estimate a parameter θ.

Example Let p = proportion of UNSW students who watched the Men's Australian Open tennis final be a parameter of interest. Suppose that we survey 8 UNSW students, asking them whether they watched the Australian Open final.


Let X1, . . . , X8 be such that Xi = 1 if the ith surveyed student watched the Australian Open final, and 0 otherwise.
An appropriate model is X1, . . . , X8 ∼ fX(x; p), 0 < p < 1, where
$$f_X(x; p) = p^x(1-p)^{1-x} = \begin{cases} p, & x = 1 \\ 1 - p, & x = 0. \end{cases}$$
Then the natural estimator for p is
$$\hat{p} = \frac{X_1 + X_2 + X_3 + X_4 + X_5 + X_6 + X_7 + X_8}{8},$$
corresponding to the proportion from the survey that watched the Australian Open final. However, there are many other possible estimators for p, such as
$$\hat{p}_{\text{alt}} = \frac{X_2 + X_4 + X_6 + X_8}{4},$$
based on every second person in the survey. But even
$$\hat{p}_{\text{weird}} = \sin(X_1 e^{X_5}) + \coth(X_3/(7X_8 + 12))$$
satisfies the definition of being an estimator for p! (But it's not a very good one...)

The previous definition permits any function of the sample to be an estimator for a parameter θ. However, only certain functions have good properties. So how do we identify an estimator of θ that has good properties? This is the subject of the remainder of this section.
Before we start studying properties of estimators, we state a set of facts that are fundamental to statistical inference: the estimator θ̂ is a function of the random variables X1, . . . , Xn and is therefore a random variable itself. It has its own probability function or density function f_θ̂ that depends on θ. We use f_θ̂ to study the properties of θ̂ as an estimator of θ. We met this idea in the previous chapter, where we were studying some key properties of the sample mean and sample proportion. This fundamental idea will be used repeatedly in the remainder of the notes.


Bias
The first property of an estimator that we will study is bias. This corresponds to the difference between E(θ̂), the centre of gravity of f_θ̂, and the target parameter θ:
Definition Let θ̂ be an estimator of a parameter θ. The bias of θ̂ is given by
bias(θ̂) = E(θ̂) − θ.
If bias(θ̂) = 0 then θ̂ is said to be an unbiased estimator of θ.
Bias is a measure of the systematic error of an estimator: how far we expect the estimator to be from its true value θ, on average. Often, we want to use an estimator which is unbiased, or has bias as close to zero as possible. In many practical situations, we can identify an estimator of θ that is unbiased. If we cannot, then we would like an estimator that has as small a bias as possible.

Example For the Australian Open tennis example,
p̂ = Y/8,
where Y is the number of students who watched the Australian Open final, and Y ∼ Bin(8, p). Find f_p̂(x) and bias(p̂). Compare these to the corresponding results for p̂_alt = (X2 + X4 + X6 + X8)/4.


Standard error
The next fundamental property of an estimator is its standard error:
Definition Let θ̂ be an estimator of a parameter θ. The standard error of θ̂ is simply its estimated standard deviation:
$$\mathrm{se}(\hat\theta) = \sqrt{\widehat{\mathrm{Var}}(\hat\theta)}.$$
To obtain the estimated variance, we first derive Var(θ̂), the variance of θ̂, then we replace unknown parameters (θ) by their sample estimates (θ̂). Standard error is a measure of the sampling error of an estimator: how variable we expect θ̂ to be from one sample to another. Like the bias, the standard error of an estimator is ideally as small as possible. However, unlike the bias, the standard error can never be made zero (except in trivial cases).

Example Consider, again, the Australian Open example. Find se(p̂) and se(p̂_alt). Comment.

Therefore the standard error of p̂_alt will always be larger than that of p̂ by a factor of √2 ≈ 1.4, which suggests that p̂ is a better estimator of p than p̂_alt.


Mean squared error


The bias and standard error of an estimator θ̂ are fundamental measures of different aspects of the quality of θ̂ as an estimator of θ: bias is concerned with the systematic error in θ̂, while the standard error is concerned with its inherent random (or sampling) error. But consider a situation in which we want to choose between two alternative estimators of θ, where one has smaller bias, and the other has smaller standard error. How could we choose between these two estimators? One approach is to use a combined measure of the quality of θ̂, which combines the bias and standard error in some way. The mean squared error is the most common way of doing this:
Definition The mean squared error of θ̂ is given by
MSE(θ̂) = E{(θ̂ − θ)²}.
The following result shows how MSE(θ̂) takes care of both the bias and standard error in θ̂:
Result Let θ̂ be an estimator of a parameter θ. Then
MSE(θ̂) = bias(θ̂)² + Var(θ̂),
and the estimated mean squared error is
MSÊ(θ̂) = bias(θ̂)² + se(θ̂)².

Proof:
$$
\begin{aligned}
\mathrm{MSE}(\hat\theta) &= E\{(\hat\theta - \theta)^2\} = E[\{\hat\theta - E(\hat\theta) + E(\hat\theta) - \theta\}^2] \\
&= E[\{\hat\theta - E(\hat\theta)\}^2] + E[\{E(\hat\theta) - \theta\}^2] + 2E[\{\hat\theta - E(\hat\theta)\}\{E(\hat\theta) - \theta\}] \\
&= \mathrm{Var}(\hat\theta) + E[\{\mathrm{bias}(\hat\theta)\}^2] + 2\{E(\hat\theta) - \theta\}E[\hat\theta - E(\hat\theta)] \\
&= \mathrm{Var}(\hat\theta) + \mathrm{bias}(\hat\theta)^2 + 2\{E(\hat\theta) - \theta\}\{E(\hat\theta) - E(\hat\theta)\} = \mathrm{Var}(\hat\theta) + \mathrm{bias}(\hat\theta)^2.
\end{aligned}
$$

Definition Let θ̂1 and θ̂2 be two estimators of a parameter θ. Then θ̂1 is better than θ̂2 (with respect to MSE) at θ0 if MSE_θ0(θ̂1) < MSE_θ0(θ̂2). If θ̂1 is better than θ̂2 for all θ ∈ Θ then we say θ̂1 is uniformly better than θ̂2.

This is essentially a generalisation of the notion of efficiency that we met in the previous chapter.

Example For the Australian Open example, find MSE(p̂) and MSE(p̂_alt). Is p̂ uniformly better than p̂_alt?

(This result makes intuitive sense, since p̂ is based on twice as many responses.)

Example Consider the problem of estimating μ, the average cadmium concentration in a dam. We take a random sample of 5 vials of water, and estimate the cadmium concentration in each sample. Let X1, X2, . . . , X5 be the corresponding values, and assume that we have a representative random sample. Consider the estimator
$$\bar{X} = \frac{1}{5}\sum_{i=1}^{5} X_i.$$
1. Find the bias of X̄.
2. Find the standard error of X̄.
3. Find MSE(X̄).

Common se formulas The standard error expressions for sample means and sample proportions are as follows:
$$\mathrm{se}(\bar{X}) = \frac{s}{\sqrt{n}},$$
where s is the sample standard deviation; and
$$\mathrm{se}(\hat{p}) = \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}.$$

These results are implied by the variance results stated and proved for sample means and proportions in the previous chapter.

Consistency
The consistency property of an estimator is concerned with its performance as the amount of data increases. It seems reasonable to demand that θ̂ = θ̂n gets better as the sample size n grows. Consistency corresponds to θ̂n converging to θ as n becomes larger. Consistency uses the notion of convergence in probability, defined below.

Definition The sequence of random variables X1, X2, . . . converges in probability to a random variable X if, for all ε > 0,
$$\lim_{n\to\infty} P(|X_n - X| > \varepsilon) = 0.$$
This is usually written as Xn →^P X.

Definition The estimator θ̂n is consistent for θ if θ̂n →^P θ.

Example X1, X2, . . . are independent Uniform(0, θ) variables. Let θ̂n = max(X1, . . . , Xn) for n = 1, 2, . . . . Then it can be shown that
$$F_{\hat\theta_n}(y) = \begin{cases} (y/\theta)^n, & 0 < y < \theta \\ 1, & y \ge \theta \end{cases} \qquad \text{and} \qquad f_{\hat\theta_n}(y) = \frac{n y^{n-1}}{\theta^n}, \quad 0 < y < \theta.$$
Show that θ̂n is consistent for θ.
For 0 < ε < θ,
$$P(|\hat\theta_n - \theta| > \varepsilon) = P(\hat\theta_n < \theta - \varepsilon) = \left(\frac{\theta - \varepsilon}{\theta}\right)^n \to 0 \text{ as } n \to \infty.$$
For ε ≥ θ, P(|θ̂n − θ| > ε) = 0 for all n ≥ 1, so
$$\lim_{n\to\infty} P(|\hat\theta_n - \theta| > \varepsilon) = 0 \qquad \text{and} \qquad \hat\theta_n \xrightarrow{P} \theta.$$

CONSISTENCY

107

Weak Law of Large Numbers


A particularly important convergence in probability result is the one that shows consistency of a sample mean, for a (representative) random sample. This result is known as the Weak Law of Large Numbers. Weak Law of Large Numbers Suppose X1 , X2 , . . . are independent, each with mean and variance 2 < 1 . The sample mean Xn = n n Xi converges in probability to the true i=1 mean: P Xn ; Proof: Recall Chebychevs Inequality. For any random variable Y , P (|Y E(Y )| > k Replace k Var(Y ) by so that k =
Var(Y )

Var(Y )) and

1 . k2

P (|Y E(Y )| > ) Now E(Xn ) = , Var(Xn ) =


2 n

Var(Y ) . 2

so replacing Y by Xn we have for any > 0 2 0 as n . n2

P (|Xn | > )

P Thus Xn .

This result extends to sample proportions also.

Example There are n independent and identically distributed trials, each with probability p of success. Let X1 , . . . , Xn be dened by Xi = 1, if ith surveyed subject is a success 0, otherwise.
X , n

Consider the sample proportion pn = the n trials.

where X is the number of successes in

Use the Weak Law of Large Numbers to show that pn is a consistent estimator of p.

108

CHAPTER 7. ESTIMATORS AND THEIR PROPERTIES

In some instances consistency is dicult to check via convergence in probability arguments. The following result reduces the problem to rst and second moments of the estimator: Result If
n

lim MSE(n ) = 0

then n is consistent for .

Example For the Australian Open example, nd MSE(n ). Hence show that pn is consistent p for p.

Result Let g() be some function of an estimator which is smooth in the neighbourhood of . P P implies that g() g() So demonstrating consistency of an estimator implies that (almost) any function of it is also consistent for its corresponding parameter. This result can be proved using a Taylor expansion, following a similar argument to the Delta Method proof coming in the following chapter.

Asymptotic normality
A nal property of an estimator that is of interest is its limiting (or asymptotic) distribution that it, the distribution to which it converges as n becomes large.

ASYMPTOTIC NORMALITY

109

Denition Let X1 , X2 , . . . be a sequence of random variables. We say that Xn converges in distribution to X if


n

lim FXn (x) = FX (x)

for all x where FX is continuous. A common shorthand is Xn X. We say that FX is the limiting distribution of Xn . This diers importantly from the idea of convergence in probability. Convergence in probability is concerned with whether the actual values of the random variables (the xi ) converge. Convergence in distribution, in contrast, is concerned with whether the distributions (the FXi (x)) converge. Convergence in distribution allows us to make approximate probability statements about an estimator n , for large n, if we can derive the limiting distribution FX (x). The most common limiting distribution we encounter in practice is the normal distribution, as below: Denition The estimator is asymptotically normal if D Z se() where Z N (0, 1). This is often written for short as: D N (0, 1) se() A particularly important example, which we will prove in the following chapter, is that X and p are asymptotically normal: Central limit theorem results For a random sample X1 , . . . , Xn from a distribution with mean and variance 2 < , X D N (0, 1) / n If X Bin(n, p), it can be shown that the sample proportion p = satises: pp D N (0, 1)
p(1p) n X n D

110

CHAPTER 7. ESTIMATORS AND THEIR PROPERTIES

A result which is often helpful in demonstrating asymptotic normality is known as Slutskys Theorem. The proof is omitted in these notes, but may be found in advanced texts such as: Sering, R.J. (1980). Approximation Theorems of Mathematical Statistics, New York: John Wiley & Sons. Slutskys Theorem Let X1 , X2 , . . . be a sequence of random variables that converges in distribution to X, i.e. D Xn X. Let Y1 , Y2 , . . . be another sequence of random variables that converges in probability to a constant c, i.e. Yn c. Then 1. Xn + Yn X + c, 2. Xn Yn cX, This theorem remains valid if we replace convergence in distribution, Xn X, with P convergence in probability, Xn X.
D D D P

Example Consider the following results: For a random sample X1 , . . . , Xn from a distribution with mean , and variance 2 < , X D N (0, 1) s/ n where s is a consistent estimator of . If X Bin(n, p), the sample proportion p = pp
D p(1) p n X n

satises:

N (0, 1)

Prove these results using the Central Limit Theorem results stated on page 109, together with Slutzkys theorem.

OBSERVED VALUES OF ESTIMATORS

111

Observed values of estimators


Throughout this section we have considered the random sample X1 , . . . , Xn and the properties of the estimator , by treating it as a random variable. This allows us to do theoretical calculations concerning, say, the bias and large sample properties of . In practice, we only take one sample, and observe a single value of , known as the observed value of . This is also sometimes called the estimate of (as opposed to the estimator, which is the random variable we use to obtain the estimate).

Example Consider the pressure vessel example and let Xi = ith lifetime before the data are observed. An unbiased estimator for is 1 =X= n After the data are observed to be: 274 28.5 1.7 20.8 871 363 1311 1661 236 828 458 290 54.9 175 1787 970 0.75 1278 776 126 the observed value of becomes (274 + 28.5 + . . . + 126)/20 = 575.53. Xi .
i

Notation for observed values Some statistics texts distinguish between random variables and their observed values via use of lower-case letters. For the previous example this would involve: x1 = 274, x2 = 28.5, . . . , x20 = 126. and x = 575.53.

112

CHAPTER 7. ESTIMATORS AND THEIR PROPERTIES

Good notation for the observed value of is a bit trickier since is already lower-case (in Greek). These notes will not worry too much about making such distinctions. So denotes the random variable X before the data are observed. But we will also say = 575.53 for the observed value. The meaning of should be clear from the context. Estimate (standard error) notation In applied statistics, a common notation when reporting the observed value of an estimator (or estimate) is to add the estimated standard error in parentheses: estimate (standard error)

Example Eight students are surveyed and two watched the Australian Open nal. Find p (dened previously) and its standard error, and write you answer in estimate (standard error) notation.

Condence Intervals
We have seen that an estimator of a parameter leads to a single number for inferring the true value of . For example, in the Australian Open example if we survey 50 people and 16 watched the Australian Open nal then the estimator p has an observed value of 0.32. However, the number 0.32 alone does not tell us much about the inherent variability in the underlying estimator. Condence intervals aim to improve this situation with a range of plausible values, e.g. we are condent that p is in the range 0.19 to 0.45.

CONFIDENCE INTERVALS

113

Denition Let X1 , . . . , Xn be a random sample from a model that includes an unknown parameter . Let L = L(X1 , . . . , Xn ) and U = U (X1 , . . . , Xn ) be statistics (i.e. functions of the Xi s) for which P (L < < U ) = 1 , for all .

Then (L, U ) is a 1 , or 100(1 )%, condence interval for . It is important to note that in the probability statement P (L < < U ) = 1 the quantity in the middle () is xed, but the limits (L and U ) are random. This is the reverse situation from many probability statements that arise in earlier chapters, such as P (2 X 7) for a random variable X. This is actually the reason we call it a condence interval rather than a probability interval: because once we estimate the random variables L and U from data, we no longer have a probability expression.

Approximate condence intervals via asymptotic normality


If an estimator is consistent and asymptotically normal, then we can construct a condence interval using the following result: Result Let X1 , . . . , Xn be a random sample from a model that includes an unknown parameter . Let be a consistent and asymptotically normal estimator of . Then an approximate 100(1 )% condence interval for is z1/2 se(), + z1/2 se() . where z1/2 is the 1 quantile of the standard normal distribution, 2 satisfying P (Z < z1/2 ) = 1 . 2 Proof: Because is consistent and asymptotically normal, D N (0, 1) se()

114 and hence 1

CHAPTER 7. ESTIMATORS AND THEIR PROPERTIES

z1/2 <

< z1/2 se()

= P z1/2 se() < < z1/2 se() = P z1/2 se() < < + z1/2 se() = P z1/2 se() < < + z1/2 se() and so z1/2 se(), + z1/2 se() is an approximate 100(1)% condence interval for .

Example Let X Bin(n, p). Then an approximate 100(1 )% condence interval for p is p z1/2 se(), p + z1/2 se() . p p where p =
X n

and se() = p

p(1 p)/n.

Example Let X1 , . . . , Xn be a random sample from a variable with mean and nite variance. Then an approximate or large-sample 100(1 )% condence interval for is s s x z1/2 , x + z1/2 n n .

Note that in the above examples, the condence intervals are only approximate for two reasons: because the estimator ( and X respectively) is only approximately p normal, and because we are using a standard error in place of the true standard deviation ( p(1 p)/n and / n respectively).

Example Using the class as a sample, calculate an approximate 95% condence interval for

CONFIDENCE INTERVALS the proportion of students who watched the Australian Open Final.

115

Example Cadmium is a naturally occurring heavy metal, which is found in drinking water in low levels. Five water samples are taken from a small dam, and cadmium levels are measured. The average of these ve values is then used as an estimate of cadmium concentration in the dam. The sample mean was 0.055 mg/L, with a standard deviation of 0.006 mg/L. Construct a 95% condence interval for the true mean cadmium concentration in the dam. The Australian Drinking Water Guidelines recommend that drinking water contain no more than 0.05 mg/L of cadmium, due to health considerations. Would you conclude that there is evidence that the cadmium levels are currently unsafe?

116

CHAPTER 7. ESTIMATORS AND THEIR PROPERTIES Sample size determination via condence intervals A common method of sample size determination is to: Identify the key parameter of interest , and an asymptotically normal estimator for it n . Specify the desired margin-of-error m the precision to which we would like to estimate this parameter with 100(1 )% condence. Solve m = z1/2 se(n ) for n.

Example Recall the prayer experiment, where we wish to estimate the dierence in proportion of patients with complications between the prayer and non-prayer treatments, p1 p2 . We do so using the dierence in proportions p1 p2 , which is asymptotically normal with standard error: 0.5 se() p n Find the desired sample size in each group n such that we can estimate the change in proportion p1 p2 to within 0.05 of its true value with 90% condence.

Benson et al. arrived at the value n = 600 using similar arguments to the above, although they increased the sample size a little to account for the possibility of some patients being lost to follow-up (e.g. because they became unable to be contacted later in the study period).

Chapter 8 Distribution of sums and averages


We have already seen some examples where we wish to make inferences about sums or averages of random variables. In this chapter we will learn some tools and key results that are used for making inferences about sums, averages, and related estimators. We will start by considering the simplest situation of summing random variables when we have two independent random variables X and Y . Then we will consider extensions to sums and averages of n random variables.

Probability function/density function approach


Result Suppose that X and Y are independent random variables taking only nonnegative integer values, and let Z = X + Y . Then
z

fZ (z) =
y=0

fX (z y)fY (y),

z = 0, 1, . . .

This formula is called a (discrete) convolution formula. Proof: fZ (z) = P (X + Y = z) = P (X = z, Y = 0) + P (X = z 1, Y = 1) + . . . + P (X = 0, Y = z) = fX (z)fY (0) + fX (z 1)fY (1) + . . . + fX (0)fY (z)
z

= P (X = z)P (Y = 0) + P (X = z 1)P (Y = 1) + . . . + P (X = 0)P (Y = z) =


y=0

fX (z y)fY (y) 117

118

CHAPTER 8. DISTRIBUTION OF SUMS AND AVERAGES

Example* X Poisson (1 ), Y Poisson (2 ). P (X = k) = P (Y = k) = e1 k 1 , k = 0, 1, 2, . . . k!

e2 k 2 , k = 0, 1, 2, . . . k! Find the probability function of Z = X + Y .

Thus X + Y Poisson (1 + 2 ). It follows by induction that if X1 , X2 , . . . , Xn are independent with Xi Poisson (i ), then
n n i=1

Xi Poisson

i
i=1

This is an important and useful property of Poisson random variables.

Result Suppose X and Y are independent continuous variables with X fX (x) and Y fY (y). Then Z = X + Y has density fZ (z) =
all possible y

fX (z y)fY (y)dy.

This formula is called a (continuous) convolution formula.

PROBABILITY FUNCTION/DENSITY FUNCTION APPROACH Proof: FZ (z) = P (Z z) = P (X + Y z) =


x+yz zy

119

fX,Y (x, y) dx dy fX (x)fY (y) dx dy

=
all possible y

=
all possible y

FX (z y)fY (y) dy

To complete the proof we dierentiate wrt z in order to obtain the density function fZ (z): d fX (z y)fY (y) dy fZ (z) = FZ (z) = dz all possible y

Example X and Y are independent variables and fX (x) = ex , x > 0, fY (y) = ey , y > 0. Find the density function of Z = X + Y .

Note that the answer is the density function of a Gamma(2,1) random variable. Thus the sum of two independent exponential(1) random variables is a Gamma(2,1) variable.

120

CHAPTER 8. DISTRIBUTION OF SUMS AND AVERAGES

Moment generating function approach


An important alternative approach to deriving the distribution of a sum, which works in certain special cases, is based on what are known as moment generating functions. Denition The moment generating function (mgf) of a random variable X is mX (u) = E(euX ). The name moment generating function comes from the following result concerning the rth moment of X about zero, E(X r ): Result (r) In general, E(X r ) = mX (0) for r = 0, 1, 2, . . . Derivation: mX (u) = E(euX ) uX (uX)2 + + ... 1! 2! u u2 u3 = 1 + E(X) + E(X 2 ) + E(X 3 ) + ... 1! 2! 3! = E 1+ Thus mX (0) = 1 = E(X 0 ). Next E(X) 2u 3u2 + E(X 2 ) + E(X 3 ) + ... 1! 2! 3! = mX (0) = mX (u)|u=0 = E(X). 2 3.2u (2) + ... mX (u) = E(X 2 ) + E(X 3 ) 2! 3! (2) = mX (0) = E(X 2 ), etc. mX (u) =

Moment generating functions provide an alternative way to derive the moments (expectation, variance, etc) of a distribution, which can be more convenient than direct evaluation using the density function.

Example Let X come from an exponential distribution with parameter . Find the moment generating function of X, and hence show that E(X) = and Var(X) = 2 .

MOMENT GENERATING FUNCTION APPROACH

121

Moment generating functions of common distributions Below are the moment generating functions of common distributions we met in chapter 2: X mX (u) Bernoulli(p) 1 p + peu Bin(n, p) (1 p + peu )n u Poisson() e(e 1) 1 Exponential() 1u 1 2 2 N (, 2 ) eu+ 2 u Gamma(, )
1 1u

Properties of moment generating functions


The following results on uniqueness and convergence for moment generating functions are particularly important in motivating their use in deriving distributions of sums and averages. Result Let X and Y be two random variables all of whose moments exist. If mX (u) = mY (u) for all u in a neighbourhood of 0 (i.e. for all |u| < for some > 0) then FX (x) = FY (x) for all x R. (i.e. the mgf of a random variable is unique)

Result Let {Xn : n = 1, 2, . . .} be a sequence of random variables, each with moment generating function mXn (u). Furthermore, suppose that
n

lim mXn (u) = mX (u) for all u in a neighbourhood of 0

and mX (u) is a moment generating function of a random variable X. Then


n

lim FXn (x) = FX (x) for all x R.

(i.e. convergence of mgfs implies convergence in distribution)

122

CHAPTER 8. DISTRIBUTION OF SUMS AND AVERAGES

These two results are stated as theorems in the book Casella, G. and Berger, R.L. (1990). Statistical Inference, Duxbury.

The proofs rely on the theory of Laplace transforms but are not given in this reference. Instead, the reader is referred to Widder, D.V. (1946). The Laplace Transform. Princeton, New Jersey: Princeton University Press.

Moment generating functions of sums and averages


Moment generating functions are useful in deriving the distribution of sums and averages of random variables because of the following results: Result Suppose that X and Y independent random variables with moment generating functions mX and mY . Then mX+Y (u) = mX (u)mY (u). Proof : If X and Y are independent, then Z = X + Y has mgf mZ (u) = E(eu(X+Y ) ) = E(euX euY )

= E(euX ) E(euY ) = mX (u) mY (u)

More generally, Result If X1 , X2 , . . . , Xn are independent random variables, then ment generating function
n n n i=1

Xi has mo-

m Proof:

n i=1

Xi (u) = i=1

mXi (u) and mX (u) =


i=1

mXi

u . n

n i=1

Xi (u)

= E(eu
n

n i=1

Xi

= E
i=1 n

euXi
n

=
i=1

E(e

uXi

)=
i=1

mXi (u).

MOMENT GENERATING FUNCTION APPROACH mX (u) = E(eu n = E(e = m


n
u n 1 n i=1 n i=1

123

Xi

Xi

i=1

Xi

=
i=1

mXi

u n u n

This oers us a useful approach for deriving the distribution of the sum or average of independent random variables, using the 1-1 correspondence between distributions and moment generating functions. For this approach to work however we need to be able to recognise the distribution of the sum from its moment generating function.

Example Find the distribution of Y = Z1 + Z2 where Z1 N (0, 1) independently of Z2 N (0, 1).

Example Consider a normal random sample, X1 , X2 , . . . , Xn N (, 2 ). Use moment generating functions to show that
n

i=1

Xi N (n, n 2 ) and X N

2 n

(This result extends to weighted sums (and averages) of independent normal random variables.)

124

CHAPTER 8. DISTRIBUTION OF SUMS AND AVERAGES

Example Let X1 , X2 , . . . , Xn be a random sample from Exponential(). An estimator of is X. Use the mgf of Xi to deduce the distribution of X. Hence write down MSE(X).

Application: exact condence intervals


The above two examples, and others that can be derived using a moment generating approach, can be used to make inferences about parameters, as below.

Example Consider a normal random sample, X1 , X2 , . . . , Xn N (, 2 ), where we want to make inferences about . An exact 100(1 )% condence interval for , if where known, is: X z1/2 , X + z1/2 n n If were unknown, as is usually the case, we would replace with s in the above to obtain an approximate condence interval. We will however be able to do even better than that (using the t distribution) in Chapter 11.

CENTRAL LIMIT THEOREM

125

Example* Let X1 , X2 , . . . , Xn be a random sample from Exponential(). An estimator of is X. Find a formula for an exact 95% condence interval for . (Hint: First nd the distribution of X/.) Hence construct a 95% condence interval for for the below pressure vessel data:

274 28.5 1.7 20.8 871 363 1311 1661 236 828 458 290 54.9 175 1787 970 0.75 1278 776 126

Central Limit Theorem


In the above we have derived the exact distribution of X, and hence made inferences about mean parameters, in some special cases. The Central Limit Theorem however is a fundamental result in statistics which allows us to make inferences about mean parameters using a random sample X1 , . . . , Xn from any distribution:

126

CHAPTER 8. DISTRIBUTION OF SUMS AND AVERAGES Central Limit Theorem Suppose X1 , X2 , . . . are independent and identically distributed random variables with common mean = E(Xi ) and common variance 2 = n X Var(Xi ) < . For each n 1 let Xn = i=1 i . Then n Xn D N (0, 1). / n

The key part of the Central Limit Theorem is in the assumptions (or lack thereof) averages are asymptotically normal even when we have made no assumptions about the shape of the common distribution of the Xi . This makes the result very general and particularly useful in practice. In fact, the Central Limit Theorem is often considered the single most important result in statistics, and it forms the basis of most of the statistical inference tools that are used by researchers today. Proof : The method of proof will be to show that the moment generating function of Xnn converges as n to the mgf of a N (0, 1) random variable; that is, /
n

lim mn (u) = eu

2 /2

) where mn (u) = E exp u (Xnn /

First, note that each standardized variable (Xi )/ has mean 0 and variance 1, so has mgf of the form: m(u) = 1 + u2 /2 + smaller terms, as u 0 Then mn (u) = E exp u = E = = exp (Xn ) / n
n

u n
n

i=1

Xi

u m , since the mgf of a sum of independent rvs is the product of mgfs, n n u2 1 1+ + smaller terms than n 2n
a n n

And because limn 1 +

= ea , lim mn (u) = exp(u2 /2) 2

The Central Limit Theorem stated above provides the limiting distribution of X . / n However, sometimes probabilities involving related quantities such as the sum n Xi i=1 are required. Since n Xi = nX the Central Limit Theorem also applies to the i=1

CENTRAL LIMIT THEOREM

127

sum of a sequence of random variables. The following result provides alternative forms of the Central Limit Theorem. Result Suppose X1 , X2 , . . . are independent and identically distributed random variables with common mean = E(Xi ) and common variance 2 = Var(Xi ) < . Then the Central Limit Theorem may also be stated in the following alternative forms: 1. 2. 3. D n(X ) N (0, 2 ),
i

Xi n D n Xi n D n

N (0, 1), N (0, 2 ).

Example graphs
To see the central limit theorem in action, try out some of the applets on the UNSW Blackboard interesting links page. In particular, you can nd the distribution of Xn for any choice you like of X and for several choices of n at the following page: http://www.ruf.rice.edu/lane/stat sim/sampling dist/index.html Use these applets to draw some graphs that demonstrate the central limit theorem.

128

CHAPTER 8. DISTRIBUTION OF SUMS AND AVERAGES

How well does the central limit theorem work?*


Because the normal approximation to the sample mean is an asymptotic approximation, it can be expected to work well when n is large. But how large does n need to be for the normal approximation to be reasonable? It turns out that the answer depends on the distribution of the Xi , which we characterise by its probability or density function fX (x). Recall that in our proof of the central limit theorem, we used the rst two terms of the mgf of (Xi )/: u n
n

u2 1+ + smaller terms than n1 2n

This approximation takes only the rst two terms in the mgf for (Xi )/ and ignores all higher-order terms. To consider how valid this assumption is, lets consider the next two terms in the mgf:

u n

1+

1 u2 1 u3 2 u4 + 3/2 + 2 + smaller terms than n2 n 2 n 6 n 24

i where 1 = E(X) is the skewness of the distribution of Xi and 2 = 3 known as the kurtosis (long-tailedness) of Xi .

E(Xi )4 4

is

When we say that the distribution of X is normal, we are ignoring the third and fourth terms in the mgf of (Xi )/ (and indeed we ignore all higher order terms also). How reasonable this approach is depends on how small the coecients of these additional terms are. Clearly, if n is very large then these terms will be small, because the denominator of the coecient increases by n1/2 for each term in the sequence (from the coecient of u2 /2 onwards). The third coecient (1 /n3/2 ) not only gets smaller as n gets larger, but it will also be small when the skewness of the distribution of Xi is small. In fact, if fX (x) is symmetric, 1 = 0. The fourth coecient is proportional to 2 , the kurtosis of the distribution, a measure of how long-tailed the distribution of Xi is. This term will be small if fXi (x) is short-tailed and large if the distribution has long tails. However because this coecient is proportional to n2 rather than n3/2 or n1 , this term only comes into play for very small samples or for noticeably long-tailed distributions.

APPLICATIONS OF THE CENTRAL LIMIT THEOREM

129

From the above analysis, together with further study of the distribution of the sample mean for dierent choices of Xi , we can work out the following rough rules of thumb: For most distributions encountered in practice, n > 30 is a large enough value of n such that the normal approximation to the sample mean is reasonable. If it is reasonable to assume that the distribution under consideration has little skew, and is not long-tailed (i.e. does not have high kurtosis) then the central limit theorem will work well for even smaller n. In such cases n > 10 is often sucient. It should be noted however that pathological distributions exist for which n > 30 is not sucient to ensure approximate normality of X. For any given n, in theory one can always produce an X such that X is not close to normal. A simple example is X P oisson(1/n). Fortunately, such distributions are rarely encountered in practice.

Applications of the Central Limit Theorem


In this section we review some applications of the Central Limit Theorem (CLT). Being the most important theorem in statistics, its applications are many and varied. A key application is in making inferences about means, e.g. the condence interval we constructed for the true cadmium concentration in a dam, at the end of Chapter 7. Another application is a descriptive one. The CLT tells us that the sum of many independent small random variables has an approximate normal distribution. Therefore it is plausible that any real-life random variable, formed by the combined effect of many small independent random inuences, will be approximately normally distributed. Thus the CLT provides an explanation for the widespread empirical occurrence of the normal distribution. The CLT also provides some useful normal approximations to common distributions which can be interpreted as sums, as below. Central Limit Theorem for Binomial Distribution Suppose X Bin (n, p). Then X np np(1 p) N (0, 1).
D

130 Proof:

CHAPTER 8. DISTRIBUTION OF SUMS AND AVERAGES

Let X1 , . . . , Xn be a set of independent Bernoulli random variables with parameter p. Then X=


i

Xi

From the Central Limit Theorem, as it applies to sums of independent random variables, X n lim P x = P (Z x) n n where Z N (0, 1) and = E(Xi ) = p and 2 = V ar(Xi ) = p(1 p). The required result follows immediately. 2 A similar result applies to p =
X , n

as stated in Chapter 6.

The normal approximation to the binomial tends to work better when n is larger and when p is farther from 0 or 1 (because then the distribution is less skewed). A useful rule-of-thumb being that if np > 5 and n(1 p) > 5 then the normal approximation to the binomial is good. This rule of thumb means that we dont actually need a very large value of n for this large sample approximation to work well if p = 0.5, we only need n = 10 for the normal approximation to work well. On the other hand, if p = 0.005, we would need a sample size of n = 1000 . . . The Poisson and Gamma distributions can also be understood as sums of independent random variables, and so following similar reasoning to the above they also are approximately normal the Poisson for large (e.g. > 5), and the Gamma for large (e.g. > 30). The above result suggests that probabilities involving binomial random variables with large n can be approximated by normal probabilities. However a slight adjustment, known as a continuity correction, is often used to improve the approximation: Normal Approximation to Binomial Distribution with Continuity Correction Suppose X Bin (n, p). Then P (X x) where Z N (0, 1). The continuity correction is based on the fact that a discrete random variable is being approximated by a continuous random variable. P Z x np +
1 2

np(1 p)

THE DELTA METHOD

131

Example

Adam tosses 25 piece of toast o a roof, and 10 of them land butter side up. Is this evidence that toast lands butter side down more often than butter side up? i.e. is P (X 10) unusually small? X Bin(25, 0.5). We could answer this question by calculating the exact probability, but this would be time consuming. Instead, we use the fact that Y
D

Compare this with the exact answer, obtained from the binomial distribution: P (X 10) = to 4 decimal places

The Delta Method


The Central Limit Theorem provides a large sample approximation to the distribu tion of Xn . But what about other functions of a sequence X1 , X2 , . . .? Some special examples include 1. Functions of X such as (Xn )3 and sin1 ( Xn ).

132

CHAPTER 8. DISTRIBUTION OF SUMS AND AVERAGES

2. Functions dened through a non-linear equation such as the solution in to Xn


1 n i

ln(Xi ) + () = 0.

This second example is particularly important in statistics, as we will see when we study likelihood-based inference in an upcoming chapter. It turns out that these random variable sequences also converge in distribution to a normal random variable. The general technique for establishing such results has become known as the delta method. The reason for this name is a bit mysterious, although it seems to be related to notation () often used in Taylor series expressions. The Delta Method Let 1 , 2 , . . . be a sequence of estimators of such that n D N (0, 1). / n Suppose the function g is dierentiable at and g () = 0. Then g(Yn ) g() D N (0, 1). g ()/ n Equivalently, the Delta Method result can be stated as follows: If 1 1 Yn = + Zn + (terms in or smaller) n n
D

where Zn N (0, 1) then 1 1 g(Yn ) = g() + g ()Zn + (terms in or smaller) n n It is often useful in statistics to use this latter notation, where one expands a statistic into a constant term and terms which vanish at dierent rates as n increases. Sketch of Delta Method proof: Taylor series expansion gives g(Yn ) = g( + Yn ) = g() + g ()(Yn ) + . . . . Ignoring the lower order terms and re-arranging one obtains the following approximation: n{g(Yn ) g()} g () n(Yn ). But the right-hand side is n(Yn ) multiplied by a constant g () a linear transformation of n(Yn ). But linear transformations of normal random variables are normal, so n D N (0, 1) / n

THE DELTA METHOD implies that g(Yn ) g() D g ()2 N (0, 1). / n from which the result follows.

133

Yet another way to write the Delta Method result, which is more informal but useful for practical purposes, is as follows: If Yn
approx

2 n

then g(Yn ) N

approx

g(), g ()2

2 n

These two expressions are only valid for nite n, but n must be large enough for results when n to oer a reasonable approximation to the distribution of g(Yn ). Hence we refer to the above as a large sample approximation to the distribution of g(Yn ).

Example A random sample of ten trees are chosen and the distance to the nearest tree is measured. The sample mean is 7.8m and the standard deviation 1.2m. We are interested in using the Clark & Evans (1954) formula = the density of trees (per square metre).
1 4X 2

to approximate

Find the asymptotic distribution of , and hence construct an approximate 95% condence interval for average tree density.

There is a multivariate extension of the Delta Method, that will be used later in these notes. A proof can be found in Chapter 3 of: Sering, R.J. (1980). Approximation Theorems of Mathematical Statistics, New York: John Wiley & Sons.

134

CHAPTER 8. DISTRIBUTION OF SUMS AND AVERAGES

Chapter 9 Parameter estimation and inference


We have studied sums, means and proportions key estimators that are very widely used in practice and how we can make inferences from sample estimates of these quantities about the true population value, using condence intervals. But what about more general statistical models? How do you come up with a good estimator of a parameter, and a condence interval for it, in such situations? This chapter provides some tools for answering these important questions. Throughout this chapter it is useful to keep in mind the distinction between estimates and estimators. An estimate of a parameter is a function = (x1 , . . . , xn ) of observed values x1 , . . . , xn , whereas the corresponding estimator is the same function (X1 , . . . , Xn ) of the observable random variables X1 , . . . , Xn . Thus an estimator is a random variable whose properties may be examined and considered before the observation process occurs, whereas an estimate is an actual number, the realized value of the estimator, evaluated after the observations are available. In deriving estimation formulas it is often easier to work with estimates, but in considering theoretical properties, switching to estimators may be necessary. For notational convenience, we usually denote the density or probability function fX simply by f . For further reading, consider Hogg et al (2005) sections 6.1-6.2 or Rice (2007) sections 8.4-8.5.

Method-of-Moments Estimation
A simple approach to estimation is the method of moments. 135

136

CHAPTER 9. PARAMETER ESTIMATION AND INFERENCE Denition Let x1 , . . . , xn be observations from the model f = f (x; 1 , . . . , k ) containing k parameters = (1 , . . . , k ). Form the system of k equations that equates the moments of fX with their sample counterparts: E(X) = E(X 2 ) = . . . E(X k ) = 1 n xk . i
i

1 n 1 n

xi
i

x2 i
i

Then the method-of-moments estimates are the solutions of these equations in 1 , . . . , k .

Example Consider a random sample with normal model: X1 , . . . , Xn N (, 2 ). Find the method-of-moments estimators of and 2 . The method-of-moments equations are: E(X) = E(X 2 ) = 1 n 1 n xi
i

x2 i
i

But E(X) = and E(X 2 ) = Var(X) + {E(X)}2 = 2 + 2 which leads to the system of equations: = x 1 2 + 2 = n

x2 i
i

PROPERTIES OF METHOD-OF-MOMENTS ESTIMATORS Substitution of the rst equation into the second leads to = 1 n x2 x2 = i 1 n (xi X)2 .

137

So the method-of-moments estimators of and are: =X and = 1 n (Xi X)2 .

Properties of method-of-moments estimators


Let x1 , . . . , xn be a random sample of observations from the model f = f (x; 1 , . . . , k ) and let 1 , . . . , k be the corresponding method-of-moments estimators. Assuming that Var(X k ) < , the method-of-moments estimators are: consistent and asymptotically normal. Proof (outline): Consistency follows from the Weak Law of Large Numbers, which states that 1 P Xij E(X j ), 1 j k. n i Since j is a function of the sample moments, we can establish that
P j j ,

1 j k.

To demonstrate asymptotic normality we apply the Central Limit Theorem to sample moments: j 1 j i Xi E(X ) D n N (0, 1), 1 j k Var(X j )/n Asymptotic normality of method-of-moments estimators follows via application of (a multivariate version of) the Delta Method. 2 Method-of-moments estimation is useful in practice because it is a simple approach that guarantees us a consistent estimator. However it is not always optimal, in the sense that it does not always provide us with an estimator that has the smallest possible standard errors and mean squared error. There is however a method that is (usually) optimal...

138

CHAPTER 9. PARAMETER ESTIMATION AND INFERENCE

Maximum Likelihood Estimation


Maximum likelihood estimation is a procedure that has optimal performance for large samples, for almost any model! Hence this method is a very important tool in statistical work. When this estimation method is possible, it is usually as good or better than method of moments estimation. We will start with the single parameter case. The extension to multi-parameter models will be discussed later in this chapter. We rst dene the likelihood function: Denition Let x1 , . . . , xn be observation from the pdf f where f (x) = f (x; ) depending on a parameter . The likelihood function L, a function of , is
n

L() = f (x1 ; ) f (xn ; ) = and the log-likelihood function of is () = ln{L()} =


i

f (xi ; ),
i=1

ln{f (xi ; )}.

Note that the likelihood function looks like the joint density function of the observations x1 , . . . , xn . However, a joint density function is used to calculate how likely a given set of observations x1 , . . . , xn is, given . Here we want to reverse the argument: we are given a set of observations x1 , . . . , xn , and want to know how likely these observations would be at dierent values of . Hence the likelihood function is regarded as a function of , for xed values of {xi } (whereas a density function does the reverse).

Example

Recall the example from the previous chapter with

Xi =

1, if ith surveyed student watched the Australian Open mens tennis nal 0, otherwise.

MAXIMUM LIKELIHOOD ESTIMATION for i = 1, . . . , n. An appropriate probability function for {Xi } is f (x; p), where f (x; p) = p, x=1 1 p, x = 0

139

= px (1 p)1x

1. Find the likelihood and log-likelihood functions, ie L(p) and (p). 2. Supposed the observed data are: x1 = 0, x2 = 1, x3 = 1, x4 = 1, x5 = 1, x6 = 0, x7 = 1, x8 = 1. Find the observed likelihood function.

Denition Let x1 , . . . , xn be observations from probability/density function f , where f (x) = f (x; ) containing the parameter . The maximum likelihood estimate of is the choice = that maximises L() over .

Example Consider the Australian Open example where the observed data are: x1 = 0, x2 = 1, x3 = 1, x4 = 1, x5 = 1, x6 = 0, x7 = 1, x8 = 1.

140

CHAPTER 9. PARAMETER ESTIMATION AND INFERENCE

The likelihood function, already calculated as L(p) = (p)6 (1 p)2 , looks like this:
0.015 0.010

(p)

0.005

0.2

0.4

0.6

0.8

1.0

The maximum likelihood estimate of p is labelled with a vertical line, and it can be seen to be around about 0.75. In fact, a little calculus (as seen later) shows that the value of p that maximises L(p) is exactly p = 0.75.

Obtaining maximum likelihood estimators


As the previous denition shows, maximum likelihood estimation boils down to the problem of determining where a function reaches its maximum. The mechanics of determination of the maximum dier depending on the smoothness of L().

Smooth likelihood functions


Consider estimation of a general parameter . If L() is smooth then dierential calculus methods can be employed to obtain the maximiser of L(). However, it is usually simpler to work with the log-likelihood function (). Maximising () rather than L() is justied by: Result The point at which L() attains its maximum over is also that where () = ln{L()} =
i

ln{f (xi ; )}

attains its maximum. Therefore, the maximum likelihood estimate of is = that maximises () over .

OBTAINING MAXIMUM LIKELIHOOD ESTIMATORS

141

Example Consider the Australian Open example: we observe independent Bernoulli trials x1 , . . . , xn and wish to estimate p, the probability of watching the Australian Open Final. 1. Find the maximum likelihood estimator. 2. Find the maximum likelihood estimate when the observed data are: x1 = 0, x2 = 1, x3 = 1, x4 = 1, x5 = 1, x6 = 0, x7 = 1, x8 = 1 1. (p) = ln{L(p)} =
i=1 n n

xi

ln(p) +

xi
i=1

ln(1 p).

The rst derivative is d (p) = dp and is zero if and only if


n i=1 n i=1

xi

n n xi i=1 1p
n i=1

xi

n n xi i=1 = 0 p = 1p

n i=1

xi

Is this the unique maximiser of (p) over p (0, 1)? The second derivative is
d2 dp2

(p) =

xi

p2

n n xi i=1 (1p)2

which is negative for all 0 < p < 1 and samples xi {0, 1}, 1 i n. Hence (p) d is concave (downwards) over 0 < p < 1 and the point at which dp (p) = 0 must be a maximum. This same answer can be obtained by dierentiating L(p), but it is much messier! 2. For the observed data, p =
4

6 8

= 0.75.
(p)

16 0.0

14

12

10

0.2

0.4 p

0.6

0.8

1.0

142

CHAPTER 9. PARAMETER ESTIMATION AND INFERENCE

Note that we have showed in the above example that the maximum likelihood estimator of p for a random Bernoulli sample is the sample proportion p = X , studied n in previous chapters.

Example Consider an observed sample x1 , . . . , xn with common density function f , where f (x; ) = 2xex , x 0; > 0. Write down the log-likelihood function for . Show that any stationary point on this function maximises (), hence nd the maximum likelihood estimator of .
2

Non-smooth likelihood functions*


Not all likelihood functions are dierentiable, or even continuous, over . In such non-smooth cases calculus methods, alone, cannot be used to locate the maximiser, and it is usually better to work directly with L() rather than (). The following notation is useful in non-smooth likelihood situations. Denition Let P be a logical condition. Then the indicator function of P, I(P) is given by 1 if P is true, I(P) = 0 if P is false.

Example

OBTAINING MAXIMUM LIKELIHOOD ESTIMATORS Some examples of use of I are

143

I(Barry OFarrell was the premier of New South Wales on 14th January, 2012) = 1. I(42 = 16) = 1, I(e = 17) = 0,

I(The Earth is bigger than the Moon & The Moon is made of blue cheese) = 0.

I(The Earth is bigger than the Moon) = 1,

The I notation allows one to write density functions in explicit algebraic terms. For example, the Gamma(, ) density function is usually written as f (x; , ) = ex/ x1 , () x > 0.

However, it can also be written using I as f (x; , ) = ex/ x1 I(x > 0). ()

The following result is useful for deriving maximum likelihood estimators when the likelihood function is non-smooth: Result For any two logical conditions P and Q, I(P Q) = I(P)I(Q). Non-smooth likelihood functions arise when the range of f depends on .

Example Suppose that the observation x1 , . . . , xn come from pdf f , where f (x; ) = 5(x4 /5 ), 0 < x < .

Use the I notation to nd an expression for L(), and simplify. f (x; ) = 5(x4 /5 )I(0 < x < ) = 5(x4 /5 )I(x > 0)I( > x).

144

CHAPTER 9. PARAMETER ESTIMATION AND INFERENCE

The likelihood function is then


n

L() =

f (xi ; )
i=1 4

= 5(x4 /5 )I(x1 > 0)I( > x1 ) 5(x4 /5 )I(xn > 0)I( > xn ) 1 n
n n n

= 5

n i=1

xi
i=1

I(xi > 0)

i=1

I( > xi ) 5n

Note that
n

Also, I(xi > 0) = 1 with probability 1, since P (Xi > 0) = 1. Hence, the likelihood function is
n 4

i=1 n i=1

I( > xi ) = I( > x1 , > x2 , , > xn ) = I( > max(x1 , . . . , xn )).

L() = 5 or, even more digestibly,

n i=1

xi

5n I( > max(x1 , . . . , xn ))

L() = where Cn = 5n ( L().


n i=1 4

Cn 5n , > max(x1 , . . . , xn ) 0, otherwise

xi ) . The accompanying gure shows an example of such an

0.6

()
0.4 0.2

4 max(X1,,Xn)

Since Cn 5n is clearly decreasing for > max(x1 , . . . , xn ) (easily veried via calculus) it is clear that L() attains its maximum at max(x1 , . . . , xn ). Thus, the maximum likelihood estimator of is = max(X1 , . . . , Xn ).

PROPERTIES OF MAXIMUM LIKELIHOOD ESTIMATORS

145

Properties of maximum likelihood estimators


Consistency
It can be shown under fairly mild conditions that the maximum likelihood estimator of is consistent; i.e. P . However, the derivation is relatively complicated and will be omitted from these notes. The interested reader is referred to page 316 of Hogg et al or, for an outline of an alternative proof, Theorem A on page 275 of Rice (2007).

Equivariance
Maximum likelihood estimators are equivariant under functions of the parameter of interest: Result Suppose is the maximum likelihood estimator of . Then for any function g g() is the maximum likelihood estimator of g().

Example Let X1 , . . . , Xn be random variables each with density function f , where f (x; ) = 2xex ,
2

x > 0.

It has previously been shown that the maximum likelihood estimator of is n = . 2 i Xi Find the maximum likelihood estimators of = 1/ and = ln(). From the equivariance property of maximum likelihood estimation, the maximum likelihood estimator of = 1/ is = 1 1 = n Xi2
i

and the maximum likelihood estimator of = ln() is = ln() = ln(n) ln Xi2


i

146

CHAPTER 9. PARAMETER ESTIMATION AND INFERENCE

Variance and standard error


For smooth likelihood functions it is possible to show that a maximum likelihood estimate has asymptotic variance and standard error that are functions of the Fisher information: Denition Let X1 , . . . , Xn be random variables, each with density function f , where f (x) = f (x; ), and let () =
i

ln{f (Xi : )}

be the corresponding log-likelihood function, written as a function of the random variables to be observed, and assumed to be smooth. Then the Fisher information, is dened as In () = E{ ()} = E{ ()}2 . Result Let X1 , . . . , Xn be random variables with common density function f de pending on a parameter , and let be the maximum likelihood estimator of . Then as n P In () Var() 1 Hence we can say that se() 1 In ()

Example Recall the previous example of a random sample x1 , . . . , xn from a common density function f , where 2 f (x; ) = 2xex , x 0; > 0. Find the Fisher Information for , and hence the approximate se(). From previously, the log-likelihood function, written as a random variable, is () = n ln(2) + n ln() +
i

ln(Xi )

Xi2 .
i

the rst and second derivatives of () are () = n1 Xi2


i

and

() = n2 .

PROPERTIES OF MAXIMUM LIKELIHOOD ESTIMATORS Therefore, the Fisher information is In () = E(n2 ) = n/2 . Hence the standard error of is approximately = n In () 1

147

se()

Note that in the above example, () has no random variables present so the expected value operation is redundant. This will not be the case in general, as in the example below.

Example Consider the daily number of hospital admissions due to asthma attacks. This can be considered as a count of rare events, which can be modelled using the Poisson distribution, where f (x; ) = e x , x! x {0, 1, 2, ...}

Let X1 , X2 , . . . , Xn be the observed number of hospital admissions due to asthma attacks on n randomly selected days (which can be considered to be iid). Find:

1. ().

2. The maximum likelihood estimator for .

3. The Fisher information for .

4. Hence approximate se() for large n.

148

CHAPTER 9. PARAMETER ESTIMATION AND INFERENCE

Result For iid samples of size n, In () = nI1 () where I1 () = E 2 ln{f (X; )} 2

is the Fisher information based on X, a random sample of size n = 1.

Example Recall the previous example of a random sample X1 , . . . , Xn with common density function:
2

f (x; ) = 2xex , x 0; > 0.

Find I1 (), and hence In ().

PROPERTIES OF MAXIMUM LIKELIHOOD ESTIMATORS

149

Asymptotic normality
Theorem: Asymptotic Normality of Maximum Likelihood Estimators Under appropriate regularity conditions, where se() = 1
In ()

Var()

N (0, 1) and

D N (0, 1) se()

This is an important result with widespread applications. The proof is given in the appendix at the end of the chapter. This result is important because it means that maximum likelihood is not only useful for estimation, but is also a method for making inferences about parameters. Because we now know how to nd the approximate distribution of a maximum likelihood estimator , we can now calculate standard errors and construct condence intervals for using for data from any family of distributions of known form.

Example Recall the previous example: X1 , . . . , Xn f, where f (x; ) = 2xex , x 0; > 0


2

Find the estimated standard error of , and the approximate distribution of . We previously found the Fisher information to be In () = n/2 . Therefore, the approximate variance of is Var() and the asymptotic standard error of is se() = 1 Thus
In ()

2 n = / n.

D N (0, 1). / n

This means that we can approximate the distribution of using approx N (, 2 /n)

150

CHAPTER 9. PARAMETER ESTIMATION AND INFERENCE

Example Recall the n measurements of the daily number of hospital admissions due to asthma attacks, X1 , X2 , . . . , Xn , which we will model using the Poisson distribution, i.e. f (x; ) = e x , x!

x {0, 1, 2, ...}

Here, has the interpretation of being the average number of hospital admissions due to asthma attacks per day. Find the approximate distribution of , the maximum likelihood estimator of .

Asymptotic optimality
In the case of smooth likelihood functions where asymptotic normality results can be derived it is possible to argue that, asymptotically, the maximum likelihood estimator is optimal or best.

PROPERTIES OF MAXIMUM LIKELIHOOD ESTIMATORS Result Let X1 , . . . , Xn be random variables from pdf f , where f (x) = f (x; ),

151

and suppose that the maximum likelihood estimator n is asymptotically normal; i.e. D n(n ) N (0, {I1 ()}1 ). Let n be any other estimator of for which D n(n ) N (0, 2 ). Then 2 I1 () 1. Hence, maximum likelihood estimation has the lowest possible asymptotic variance and asymptotic standard error.

Example* Recall the example: X1 , . . . , Xn f, where f (x; ) = 2xex , x 0; > 0


2

We have shown previously that if is the maximum likelihood estimator of , then =n


i

Xi2

and

Var()

1 In ()

= 2 /n

It is known that E(X) = 1 2 and Var(X) = 1 1 4

1. Find the method of moments estimator of , . 2. Find an expression for the approximate distribution of X. 3. Hence use the delta method to nd the approximate distribution of . 4. Compare the asymptotic properties of the estimators and . Which is the better estimator?

152

CHAPTER 9. PARAMETER ESTIMATION AND INFERENCE

Likelihood-based Condence Intervals


Because maximum likelihood estimators are asymptotically normal, condence intervals are immediately available using the method described in Chapter 6: Result Let X1 , . . . , Xn be random variables with common density function f , where f (x) = f (x; ),

and let be the maximum likelihood estimator of . Under the regularity conditions for which is asymptotically normal n z1/2 se() , + z1/2 se() is an approximate 1 condence interval for for large n, where se() = 1/ In (). This type of condence interval is known as a Wald condence interval. Because maximum likelihood estimators have a convenient asymptotic standard error expression in terms of Fisher information, they lead easily to Wald CIs for parameters .

Example Recall the example of a sample x1 , . . . , xn from the density function f , where f (x; ) = 2xex , x 0; > 0. 1. Derive a formula for a 95% condence interval for . 2. Use the following data to nd an approximate 95% condence interval for . 0.366 0.412 0.265 0.406 0.276 0.127 0.257 0.433 0.262 0.299 0.568 0.190 0.375 0.336 0.232 0.054 0.640 0.253 0.625 0.226 0.300 0.207 0.267 0.305 0.133 0.440 0.147 0.234 0.183 0.069 0.115 0.147 0.360 0.321 0.316 0.228 0.273 0.514 0.541 0.211 0.204 0.116 0.250 0.268 0.468 0.249 0.112 0.177 0.705 0.195 0.128 0.326 0.258 0.361 0.496 0.754 0.389 0.221 0.078 0.381 0.277 0.256 0.583 0.632 0.573 0.430 0.126 0.534 0.847 0.317 0.391 0.524 0.413 0.283 0.523 0.111 0.356 0.509 0.149 0.467 0.328 0.217 0.481 0.258 0.256 0.459 0.273 0.510 0.031 0.289 0.451 0.485 0.468 0.466 0.491 0.233 0.296 0.269 0.453 0.593
2

LIKELIHOOD-BASED CONFIDENCE INTERVALS (Note for these data (x1 , . . . , x100 ) we have
100 i=1

153

x2 = 14.018.) i

Inference about functions of maximum likelihood estimators


We have shown how to make inference about some parameter using likelihoodbased condence intervals. But what if we are interested in some function = g() of the parameter of interest? The Delta Method permits extension of the asymptotic normality result to a general smooth function of : Result Under appropriate regularity conditions, including the existence of two derivatives of L(), if = g() and = g(), where g is dierentiable and g () = 0, then D N (0, 1) Var() where Var() {g ()}2 In ()

This result follows directly from the delta method result given on page 132.

Example Recall the example: X1 , . . . , Xn f, where f (x; ) = 2xex , x 0; > 0


2

154

CHAPTER 9. PARAMETER ESTIMATION AND INFERENCE

Suppose that the parameter of interest is = ln(). Use maximum likelihood estimation to nd an estimator of and its approximate distribution. From previous working, the maximum likelihood estimator of is = ln(n) ln and the variance of is Var( ) Thus Xi
i

|1/|2 1 = . n/2 n

D N (0, 1). 1/ n

We can use this result to construct condence intervals for : Result Under the same conditions as previously, with = g() and = g(),
n

lim P z1/2 se() < < + z1/2 se() = 1

where se() = |g ()|/ In (). Therefore, z1/2 se() , + z1/2 se() is an approximate 1 condence interval for for large n.

Multi-parameter Maximum Likelihood Inference*


In multi-parameter models such as X1 , . . . , Xn N (, 2 ) and X1 , . . . , Xn Gamma(, ) the maximum likelihood principle still applies. Instead of maximising over a single variable, the maximisation is performed simultaneously over several variables.

Example Consider the model X1 , . . . , Xn N (, 2 ), < < , > 0.

MULTI-PARAMETER MAXIMUM LIKELIHOOD INFERENCE* Find the maximum likelihood estimators of and . The log-likelihood function is
n

155

(, ) = ln
i=1

1 2 2 1 e(xi ) /(2 ) = 22 2 (, ) =
1 2 i (xi

i (xi

)2

n 2

ln(2) n ln().

Then,

) =

n ( x 2

) = 0

if and only if =x and regardless of the value of . Also


(, ) = 3 1 n

i (xi

)2 n 1 = 0

if and only if =
i

(xi )2

The unique stationary point of (, ) is then (, ) = x, 1 n (xi x)2 .

Analysis of the second order partial derivatives can be used to show that this is the global maximiser of (, ) over R and > 0. Hence, the maximum likelihood estimators of and are =X and = 1 n (Xi X)2 .

In multi-parameter maximum likelihood estimation the extension of Fisher information is as follows: Denition Let = (1 , . . . , k ) be the vector of parameters in a multi-parameter model. The Fisher information matrix is given by E(H11 ) E(H12 ) E(H1k ) E(H21 ) E(H22 ) E(H2k ) In () = . . . ... . . . . . . E(Hk1 ) E(Hk2 ) E(Hkk ) where Hij = 2 (). i j

156

CHAPTER 9. PARAMETER ESTIMATION AND INFERENCE

Example Consider the model X1 , . . . , Xn N (, 2 ), < < , > 0.

Find the Fisher Information matrix for and . It was shown previously that the rst order partial derivatives are

(, ) =

n ( x 2

(, ) = 3

)
i

(xi )2 n 1

The second order partial derivatives are then n 2 (, ) = 2 2 2 (, ) = 2n ( ) x 3 and


2 2

(, ) = 3 4

(xi )2 + n 2 .

Noting that E(Xi ) = and E(Xi )2 = 2 for each i we then get E E and E
2 2 2 2 2

(, ) (, ) (, )

n 2,

= 0 = 3 4 n 2 + n 2 = 2n 2 . n/ 2 0 0 2n/ 2

The Fisher information matrix is then In (, ) = n/ 2 0 0 2n 2 =

Given the Fisher Information matrix, we can nd the asymptotic distribution of any function of a vector of maximum likelihood estimators = g(). But rst we need to dene the gradient vector, as below. Denition Let = (1 , . . . , k ) be the vector of parameters in a multi-parameter model and g() = g(1 , . . . , k ) be a real-valued function. The gradient vector of g is given by
g() 1

g() =

. . .

g() k

MULTI-PARAMETER MAXIMUM LIKELIHOOD INFERENCE*

157

Example Find
g(, ) where g(, ) = .

Result Let = g() be a real-valued function of = (1 , . . . , k ), with maximum likelihood estimate and = g(). Under appropriate regularity conditions, including the existence of all second order partial derivatives of L() and rst order partial derivatives of g, as n D N (0, 1) se() where se() = g()T In ()1 g().

This is a multi-parameter extension of the result given on page 153.

158

CHAPTER 9. PARAMETER ESTIMATION AND INFERENCE

Example Let X1 , . . . , Xn N (, 2 ) Derive a convergence result that gives us the approximate distribution of = g(, ) = / . Given the invariance property of maximum likelihood estimators, the maximum likelihood estimator of is = = X
1 n

2 i (Xi X)

The Fisher information matrix may be shown to be In (, ) = which has inverse In (, )1 = The gradient vector is g(, ) = Since g(, )T In (, )1 g(, ) = = 1 [1/ / 2 ] n 1 n 1+ 2 2 2 2 0 0
1 2 2

n/ 2 0 0 2n/ 2 1 n 2 0 0
1 2 2

1/ / 2

1/ / 2

the approximate standard error of is se() = 1 n 1+ x2 . 2 i (xi x)

2 n

The asymptotic normality result for is then


1 n

1+

2 n

X2 (Xi X)2 i

N (0, 1).

MULTI-PARAMETER MAXIMUM LIKELIHOOD INFERENCE*

159

We can use the asymptotic normality result previously stated to obtain an approximate condence interval for = g() = g(1 , . . . , k ). Result Under appropriate regularity conditions, including the existence of all second order partial derivatives of L(), if = g() where each component of g is dierentiable then:
n

lim P z1/2 se() < < + z1/2 se() = 1 se() = g()T In ()1 g().

where Therefore, z1/2 se(n ) , + z1/2 se() is an approximate 1 condence interval for for large n. Example Let X1 , . . . , Xn N (, 2 ) Derive a formula for an approximate 99% condence interval for = g(, ) = /. As shown previously, the maximum likelihood estimator for is X = 1 2 i (Xi X) n and the approximate standard error of is se() = g()T In ()1 g() = 1 n 1+
2 n

x2 . 2 i (xi x)

The appropriate quantile from the N (0, 1) distribution is z0.995 = 2.576. An approximate 99% condence interval for / is then X
1 n i (Xi

X)2

2.576 + 2.576

1 n 1 n

1+

2 n

X2 2 , i (Xi X) X2 2 i (Xi X) .

X
1 n

2 i (Xi X)

1+

2 n

160

CHAPTER 9. PARAMETER ESTIMATION AND INFERENCE

Appendix: Proof of Theorem on Asymptotic Normality of Maximum Likelihood Estimators


Lemma:
Under some regularity conditions E and Var ln{f (X; )} = 0. ln f (X; ) = I1 ().

Proof of Lemma:
Since f is a density function we have 1=

f (x; ) dx.

Dierentiation of both sides with respect to leads to 0 = = =


f (x; ) dx =

f (x; )

f (x; ) dx

f (x; ) dx f (x; ) ln f (x; ) f (x; ) dx = E ln f (X; ) .

Dierentiation of the above equation: 0 = leads to 0=


ln f (x;)

f (x; ) dx with respect to

2 ln{f (x; )} f (x; ) dx + 2

ln f (x; ) ln f (x; ) f (x; ) dx

which is equivalent to 0 = I1 () + E But since E


ln f (X; )

ln f (X; ) = 0 the previous displayed equation is actually: 0 = I1 () + Var ln f (X; )

and the required result follows.

MULTI-PARAMETER MAXIMUM LIKELIHOOD INFERENCE*

161

Proof of Theorem:
Let 2 () and () = 2 (). The maximum likelihood estimator n satises () = 0 = (n ) = ( + n ) = () + (n ) () + . . .

Since n we are justied in ignoring higher order terms and working with the approximation n ()/ (). With some minor rearrangement, this may be rewritten as Nn n(n ) Dn where the numerator is 1 Nn = Ni with Ni = ln{f (Xi ; )} nI1 () i and the denominator is 1 Dn = n From the Lemma, E(Ni ) = 0 and Var(Ni ) = I1 (). Application of the Central Limit Theorem to Nn then leads to Nn =
1 n i 2 2

Di
i

with Di =

ln{f (Xi ; )} I1 ()

Var(Ni )/n
2 2

Ni E(Ni )

N (0, 1).

Application of the Weak Law of Large Numbers to the denominator leads to Dn E(Di ) = E
P

ln{f (Xi ; )} I1 ()

I1 () I1 ()

I1 ().

Application of Slutzkys Theorem leads to Nn D N (0, 1) Dn I1 () which, in turn, leads to N (0, 1).
D

1/ nI1 () Then substitute In () = nI1 () to obtain the rst result of the theorem. Another application of Slutzkys Theorem completes the proof, giving 1 In () N (0, 1).
D

162

CHAPTER 9. PARAMETER ESTIMATION AND INFERENCE

Chapter 10 Hypothesis Testing


We have previously discussed condence intervals as a method of making statistical inference. A related method, known as hypothesis testing, is the most commonly used statistical inference tool for making decisions given some data. For further reading, consider Hogg et al (2005) sections 5.5, 5.6 and 6.3 (likelihood ratio and Wald-type tests only), or Rice (2007) sections 9.1-9.4 and 11.2.1. There are many examples of situations in which we are interested in testing a specic hypothesis using sample data. In fact, most science is done using this approach! (And indeed other forms of quantitative research.) Below are some examples that we will consider in this chapter.

Example Recall that the Mythbusters were testing whether or not toast lands butter side down more often than butter side up. In 24 trials, they found that 14 slices of bread landed butter side down. Is this evidence that toast lands butter-side down more often than butter side up?

Example Before the installation of new machinery at a chemical plant, the daily yield of fertilizer produced at the plant had a mean = 880 tonnes. Some new machinery was installed, and we would like to know if the new machinery is more ecent (i.e. if > 880). 163

164

CHAPTER 10. HYPOTHESIS TESTING

During the rst n = 50 days of operation of the new machinery, the yield of fertilizer was recorded and the sample mean obtained was x = 888 with a standard deviation s = 21. Is there evidence that the new machinery is more ecient?

Example (Ecology 2005, 86:1057-1060) Do ravens intentionally y towards gunshot sounds (to scavenge on the carcass they expect to nd)? Crow White addressed this question by going to 12 locations, ring a gun, then counting raven numbers 10 minutes later. He repeated the process at 12 dierent locations where he didnt re a gun. Results: no gunshot gunshot 0 2 0 1 2 4 3 1 5 0 0 5 0 0 0 1 1 0 0 3 1 5 0 2

Is there evidence that ravens y towards the location of gunshots?

Example Most mammals give birth to their ospring during Spring more often than at other times of the year. Is this true for people too? In a sample of months. MATH2801/2901 students, are born during Spring

Does this sample provide evidence that more people are born in Spring than during other times of the year?

Stating the hypotheses


A key rst step in hypothesis testing is identifying the hypothesis that we would like to test (the null hypothesis), and the alternative hypothesis of interest to us,

STATING THE HYPOTHESES dened as below.

165

Denitions The null hypothesis, labelled H0 , is a claim that a parameter of interest to us () takes a particular value (0 ). Hence H0 has the form = 0 for some pre-specied value 0 . The alternative hypothesis, labelled H1 , is a more general hypothesis about the parameter of interest to us, which we will accept to be true if the evidence against the null hypothesis is strong enough. The form of H1 tends to be one of the following: H1 : = 0 H1 : > 0 H1 : < 0 In a hypothesis test, we use our data to test H0 , by measuring how much evidence our data oer against H0 in favour of H1 . In later statistics courses you will meet hypothesis tests concerning hypotheses that have a more general form than the above. H0 does not actually need to have the form H0 : = 0 but instead can just specify some restriction on the range of possible values for , i.e. H0 : 0 for some set 0 where is the parameter space. It is typical for the alternative hypothesis to be more general than the null, e.g. H1 : 0 .

Example Consider the Mythbusters example on page 163. State the null and alternative hypotheses. Here we have a sample from a binomial distribution, where the binomial parameter p is the probability that a slice of toast will land butter side down. We are interested in whether there is evidence that the parameter p is larger than 0.5. H0 : p = 0.5 H1 : p > 0.5

Example Consider the machinery example on page 164. State the null and alternative hypotheses.

166

CHAPTER 10. HYPOTHESIS TESTING

In all of the examples of the previous section, there is one hypothesis of primary interest, which involves specifying a specic value for a parameter of interest to us. Can you nd the null and alternative hypotheses for all the above situations?

The process of hypothesis testing


A hypothesis test has the following steps. 1. State the null hypothesis (H0 ) and the alternative hypothesis (H1 ). By convention, the null hypothesis is the more specic of the two hypotheses. 2. Then we use our data to answer the question: How much evidence is there against the null hypothesis? A common way to achieve this is to: i) Find a test statistic that measures how far our data are from what is expected under the null hypothesis. (You must know the approximate distribution of this test statistic assuming H0 to be true. This is called the null distribution.) ii) Calculate a P -value, a probability that measures how much evidence there is against the null hypothesis, for the data we observed. A P -value is dened as the probability of observing a test statistic value as or more unusual than the one we observed, if the null hypothesis were true. 3. Reach a conclusion. A helpful way to think about the conclusion is to return to our original question: How much evidence is there against H0 ?

Example The Mythbusters example can be used to illustrate the above steps of a hypothesis test. 1. Let p = P (Toast lands butter side down) Then what we want to do is choose between the following two hypotheses:

INTERPRETING P -VALUES H0 : p =
1 2

167 versus H1 : p >


1 2

2. We want to answer the question How much evidence (if any) does our sample (14 of 24 land butter side down) give us against the claim that p = 0.5? (a) To answer this question, we will consider p, the sample proportion, and in particular we will look at the test statistic Z= Under the null hypothesis, Z= 0.5(1 0.5)/n p 0.5 N (0, 1)
D

p(1 p)/n

pp

N (0, 1)

This statistic measures how far our sample data are with H0 . The further p is from 0.5, the further Z is from 0. (b) To nd out if p = P 14 p 24
14 24

is unusually large, if p = 0.5, we can calculate Z


14 24

0.5(1 0.5)/n

0.5

P (Z > 0.78)

0.2177

14 3. So we can say that we would expect p to be at least as large as 24 quite often (22% of the time) due to sample variation alone. Observing an event of probability 0.22 is not particularly surprising, so we conclude that we have no evidence against the claim that p = 0.5 because our data are consistent with this hypothesis.

Interpreting P -values
The most common way that a hypothesis test is conducted and reported is via the calculation of a P -value.

Example When reading scientic research, you will often see comments such as the following: Paired t-tests indicated that there was no change in attitude (P > 0.05)

168

CHAPTER 10. HYPOTHESIS TESTING

The Yahoo! search activity associated with specic cancers correlated with their estimated incidence (Spearman rank correlation = 0.50, P = 0.015) There was no signicant dierence (p > 0.05) between the rst and second responses

So what is a P -value? And how do you interpret a P -value once youve calculated it? Denition The P -value of an observed test statistic is P -value = P (observing a test statistic as or more extreme than the observed test statistic when H0 is true.)

The following are some rough guidelines on interpreting P -values. Note though that the interpretation of P -values should depend on the context to some extent, so the below should be considered as a guide only and not as strict rules.

Range of P -value

Conclusion

P -value 0.1

little or no evidence against H0

0.01 P -value < 0.1

some, but inconclusive evidence against H0

0.001 P -value < 0.01

evidence against H0

P -value < 0.001

strong evidence against H0

It is common for people to use 0.05 as a cut-o between a signicant nding (P < 0.05) and a non-signicant nding (P > 0.05) hence the interpretations in the example quotes on the previous page. Nevertheless, it is helpful to keep in mind

ONE-SIDED AND TWO-SIDED TESTS

169

that P is continuous and our interpretation of P should reect this a P -value of 0.049 (just less than 0.05) is hardly dierent from a P -value of 0.051 (just larger than 0.05), so there should be little dierence in interpretation of these values. In contrast, a P -value of 0.049 oers less evidence against H0 than a P -value of 0.0001!

Example Consider again the Mythbusters example. We calculated that the P -value was about 0.22. (This P -value measured how often you would get further from 0.5 than p = 14 .) 24 Draw a conclusion from this hypothesis test. This P -value is large it is in the little or no evidence against H0 range. Hence we conclude that there is little or no evidence against the claim that toast lands butter side down just as often as it lands butter side up.

One-sided and two-sided tests


The Mythbusters hypothesis test is an example of a one-sided hypothesis test, because we are interested in whether p > 1 , i.e. we are interested in alternative values 2 of p on one side of our hypothesised value, H0 : p = 1 . But there are many situa2 tions where we are interested in nding evidence that a parameter lies either side of a hypothesised value. This involves a two-sided test. Denition A one-sided hypothesis test about a parameter is either of the form: H0 : = 0 or H0 : = 0 versus H1 : > 0 . versus H1 : < 0

A two-sided hypothesis test about is of the form H0 : = 0 versus H1 : = 0 .

A two-sided test is sometimes called a two-tailed test, because we calculate the P -

170

CHAPTER 10. HYPOTHESIS TESTING

value using both tails of the null distribution of the test statistic (instead of just using one tail as in a one-sided test).

Example Forty subjects are asked to taste two types of coee and say which they prefer (Drink A or Drink B). Of the two types, 28 subjects preferred Drink A. Conduct a hypothesis test to test if there is evidence for a preference of one type of coee over the other. Let p be the proportion preferring Drink A. The hypotheses to be tested are: H0 : p = 0.5 versus H1 : 0 = 0.5 (the proportion preferring Drink A is not 0.5) (proportion preferring Drink A is 0.5)

The test statistic we will use is Z= 0.5(1 0.5)/40 p 0.5

Then, using Z tables,

which has a N (0, 1) distribution if the null hypothesis is true. The observed value is 28 0.5 40 = 2.53 0.5(1 0.5)/40 p 0.5

P -value = Pp=0.5

= 2 P (Z > 2.53),

0.5(1 0.5)/40

> 2.53

2 0.0057 = 0.01

Z N (0, 1)

so there is strong evidence against H0 . We can conclude that there is a systematic preference between the two drinks.

Example The OzTAM survey claims that Open Mens tennis nal. % of Sydney-siders watched the Australian

ONE-SIDED AND TWO-SIDED TESTS

171

Use this result to test the hypothesis that MATH2801/2901 students have dierent TV viewing patterns than the remainder of Sydney-siders.

If n is small, it might not be reasonable to assume that p is approximately normal. Instead, we could use the binomial distribution to obtain an exact P -value (which you have done previously in MATH1B!).

172

CHAPTER 10. HYPOTHESIS TESTING

Rejection regions
Instead of using a P -value to draw a conclusion based on data, an alternative approach that is sometimes used is to base the test on a rejection region (as dened below): Denition The rejection region is the set of values of the test statistic for which H0 is rejected in favour of H1 . The term rejection region comes from the fact that often people speak of rejecting H0 (if our data provide evidence against H0 ) or retaining H0 (if our data do not provide evidence against H0 ). If out test statistic is in the rejection region we reject H0 , if the test statistic is not in the rejection region we retain H0 . To determine a rejection region, we rst choose a size or signicance level for the test, which can be dened as the P -value at which we would start rejecting H0 . It should be set to a small number (typically 0.05), and by convention is usually denoted by . Once we have determined the desired size of the test, we can then derive the rejection region, as in the below examples.

Example Recall the coee example we wanted to test H0 : p = 0.5 versus H1 : p = 0.5 Our test statistic is Z= p 0.5

0.5(1 0.5)/40

1. Find a rejection region for a test of size 0.05 2. Hence test H0 versus H1 when z = 2.53. 1. From tables, if Z N (0, 1), then P (|Z| > 1.96) 0.05 and so our rejection region is {Z < 1.96, Z > 1.96}. In other words, if our observed value of Z is greater than 1.96 or less than 1.96, we will reject H0 in favour of H1 . Alternatively, if 1.96 < Z < 1.96, we retain H0 .

TYPE I AND TYPE II ERROR 2. Our observed value of Z was 2.53 which is in our rejection region = we reject H0 and conclude that there is evidence that p = 0.5.

173

We now have two dierent approaches. One approach involves computing the P value based on the observed test statistic, and deciding whether or not it is small enough to reject H0 . The other approach is based on setting the signicance level of the test to some number (like 0.05), then working out the range of values of our test statistic for which H0 should be rejected at this signicance level. We will mainly use the P -value approach because it is more informative, and because it is more commonly used in practice. The signicance level approach is however useful in determining important properties of tests, such as their Type I and Type II error.

Type I and Type II error


How could we choose a signicance level ? This problem can be answered by considering the possible errors that can made in reaching our decision. These can be categorised into two types of errors, called Type I error and Type II error. Denition Type I error corresponds to rejection of the null hypothesis when it is really true. Type II error corresponds to acceptance of the null hypothesis when it is really false.

reject H0

accept H0

H0 true

Type I error

No error

H0 false

No error

Type II error

174

CHAPTER 10. HYPOTHESIS TESTING

Example For the coee example, a Type I error would correspond to concluding that there was a preference among the two drink types, when in actual fact there isnt. Type II error corresponds to concluding that there is a preference amongst the two types of drinks, when in actual fact there is.

Distribution of the test statistic

if H were true 0 f(x) if Ha were true

Type I error

critical value Distribution of the test statistic

if H were true 0 f(x) if Ha were true

Type II error

critical value

Clearly we would like to avoid both Type I and Type II errors, however there is always a chance that either one will occur so the idea is to reduce these chances as much as possible. Unfortunately, making the probability of one type of error small has the eect of making the probability of the other large. In most practical situations, we can readily control Type I error, but Type II error is more dicult to get a handle on.

TYPE I AND TYPE II ERROR

175

Denition The size or signicance level of a test is the probability of committing a Type I error. It is usually denoted by . Therefore, = size = signicance level = P (committing Type I error) = P (reject H0 when H0 is true) Type II error is usually quantied through a concept called power, discussed in MATH3811. A popular choice of is = 0.05. This corresponds to the following situation: We have set up our test in such a way that if we do reject H0 then there is only a 5% chance that we will wrongfully do so. There is nothing special about = 0.05. Sometimes it might be better to have = 0.1 or = 0.01, depending on the application.

Example If we want to test the following hypotheses about a new vaccine: H0 : vaccine is perfectly safe. versus H1 : vaccine has harmful side eects then it is important to minimise Type II error we want to detect any harmful side eects that are present. To assist in minimising Type II error, we might be prepared to accept a reasonably high Type I error (such as 0.1 or maybe even 0.25).

Example Suppose we are testing a potential water source for toxic levels of heavy metals. If the hypotheses are H0 : the water has toxic levels of heavy metals versus H1 : the water is OK to drink

176

CHAPTER 10. HYPOTHESIS TESTING

then it is important to minimise type I error we wont want to make the mistake of saying that toxic water is OK to drink! So we would want to choose a low Type I error, maybe 0.001 say.

Wald Tests
So far we have only considered the special case of binomial data. How about the general situation: X1 , . . . , Xn f, where f (x) = f (x; ) ?

When the sample size is large the Wald test procedure often provides a satisfactory solution: The Wald Test Consider the hypotheses H0 : = 0 versus H1 : = 0

and let be an estimator of that is asymptotically normal: D N (0, 1). se() The Wald test statistic is W = 0 se()
approx.

N (0, 1)

Let w be the observed value of W . Then the approximate P -value is given by P -value P (|Z| > |w|) = 2(|w|) where Z N (0, 1). Usually the estimator in the Wald test is the maximum likelihood estimator since, for smooth likelihood situations, this estimator satises the asymptotic normality requirement, and a formula for the (approximate) standard error is readily available.

Example Consider the sample of size n = 100

WALD TESTS 0.366 0.412 0.265 0.406 0.276 0.127 0.257 0.433 0.262 0.299 0.568 0.190 0.375 0.336 0.232 0.054 0.640 0.253 0.625 0.226 0.300 0.207 0.267 0.305 0.133 0.440 0.147 0.234 0.183 0.069 0.115 0.147 0.360 0.321 0.316 0.228 0.273 0.514 0.541 0.211 0.204 0.116 0.250 0.268 0.468 0.249 0.112 0.177 0.705 0.195 0.128 0.326 0.258 0.361 0.496 0.754 0.389 0.221 0.078 0.381 0.277 0.256 0.583 0.632 0.573 0.430 0.126 0.534 0.847 0.317 0.391 0.524 0.413 0.283 0.523 0.111 0.356 0.509 0.149 0.467 0.328 0.217 0.481 0.258 0.256 0.459 0.273 0.510 0.031 0.289 0.451 0.485 0.468 0.466 0.491 0.233 0.296 0.269 0.453 0.593

177

which may be considered to be the observed value of a random sample X1 , . . . , X100 with common density function: f (x; ) = 2xex , x 0; > 0 Use a Wald test to test the hypotheses: H0 : = 6 versus H1 : = 6
2

In Chapter 9 it was shown that the maximum likelihood estimator is = with standard error
100 i=1

100 Xi2

se() = . 100

It may also be shown that is asymptotically normal so the Wald test applies. The Wald test statistic is 6 6 W = = = se() / 100 Since
100 i=1
100 i=1

100 2 6 Xi . 100 100 2 /10 Xi i=1

x2 = 14.081 the observed value of W is i w=


100 6 14.081 100 /10 14.081

= 1.5514.

Then, with Z N (0, 1), p-value = P (|Z| > 1.55) = 2(1.55) = 0.12. There is little or no evidence against H0 so we should retain the null hypothesis.

178

CHAPTER 10. HYPOTHESIS TESTING

Example* (Ecology 2005, 86:1057-1060) Do ravens intentionally y towards gunshot sounds (to scavenge on the carcass they expect to nd)? Crow White addressed this question by going to 12 locations, ring a gun, then counting raven numbers 10 minutes later. He repeated the process at 12 dierent locations where he didnt re a gun. Results: no gunshot gunshot 0 2 0 1 2 4 3 1 5 0 0 5 0 0 0 1 1 0 0 3 1 5 0 2

Is there evidence that ravens y towards the location of gunshots? Answer this question using an appropriate Wald test.

LIKELIHOOD RATIO TESTS*

179

Likelihood Ratio Tests*


The Wald test described in the previous section is a general testing procedure for the situation where an asymptotically normal estimator is available. An even more general procedure, with good power properties, is the likelihood ratio test. The Likelihood Ratio Test (Single Parameter Case) Consider the hypotheses H0 : = 0 versus H1 : = 0

The likelihood ratio test statistic is = 2 ln L() L(0 ) = 2{ () (0 )}.

Under H0 and certain regularity conditions 2 . 1 Let be the observed value of . Then the approximate P -value is given by P -value P0 ( > ) = P (Q > ) = 2( ) where Q 2 . 1 Example Consider, one last time, the sample of size n = 100 0.366 0.412 0.265 0.406 0.276 0.127 0.257 0.433 0.262 0.299 0.568 0.190 0.375 0.336 0.232 0.054 0.640 0.253 0.625 0.226 0.300 0.207 0.267 0.305 0.133 0.440 0.147 0.234 0.183 0.069 0.115 0.147 0.360 0.321 0.316 0.228 0.273 0.514 0.541 0.211 0.204 0.116 0.250 0.268 0.468 0.249 0.112 0.177 0.705 0.195 0.128 0.326 0.258 0.361 0.496 0.754 0.389 0.221 0.078 0.381 0.277 0.256 0.583 0.632 0.573 0.430 0.126 0.534 0.847 0.317 0.391 0.524 0.413 0.283 0.523 0.111 0.356 0.509 0.149 0.467 0.328 0.217 0.481 0.258 0.256 0.459 0.273 0.510 0.031 0.289 0.451 0.485 0.468 0.466 0.491 0.233 0.296 0.269 0.453 0.593
D

which may be considered to be the observed value of a random sample X1 , . . . , X100 with common density function f given by: f (x; ) = 2xex , x 0; > 0.
2

180

CHAPTER 10. HYPOTHESIS TESTING

Use a likelihood ratio test to test the hypotheses: H0 : = 6 versus H1 : = 6

First, note that the likelihood ratio statistic is = 2 ln where


100 100 100

L() L(6)

= 2{ () (6)}

() =
i=1

ln{f (Xi ; )} = 100 ln(2) + 100 ln() +


i=1

ln(Xi )

Xi2 .
i=1

As shown before, the maximum likelihood estimator is = so () = 100 ln(2) + 100 ln 100 100 2 i=1 Xi
100 100 i=1

100 Xi2 100 100 2 i=1 Xi


100

+
i=1 100

ln(Xi )
100

Xi2
i=1

= 100{ln(2) + ln(100)} 100 ln Also,

Xi2
i=1

+
i=1

ln(Xi ) 100.

100

100

(6) = 100 ln(2) + 100 ln(6) +


i=1

ln(Xi ) 6
100

Xi2
i=1

and so the likelihood ratio statistic is


100

= 2 100{ln(100/6) 1} 100 ln Since


100 i=1

Xi2
i=1

+6
i=1

Xi2 .

x2 = 14.081 the observed value of is i

= 2 [100{ln(100/6) 1} 100 ln(14.081) + 6 14.081] = 2.689. Then p-value = P=6 ( > ) = P (Q > 2.689), = P (Z 2 > 2.689), Z N (0, 1) = 2P (Z 2.689) Q 2 1

= 2(1.64) = 0.10

LIKELIHOOD RATIO TESTS*

181

which is close to the p-value of 0.12 obtained via the Wald test in the previous section. The conclusion remains that there is little or no evidence against H0 and that H0 should be retained.

Relation between Wald and Likelihood Ratio Test Statistics


Wald and likelihood ratio test statistics are related. Conceptually, the relation between the two tests can be understood by considering a log-likelihood function as below. There is some null value of the parameter 0 that we are testing. This null value does not attain the maximum of () - the maximum is attained under the alternative hypothesis, at .

A Wald statistic uses the horizontal axis, for , to construct a test statistic we take and compare it to 0 , to see if it is signicantly far from 0 . A likelihood ratio statistic uses the vertical axis, for (), to construct a test statistic we take the maximised log-likelihood, (), and compare it to the log-likelihood under the null hypothesis, (0 ), to see if (0 ) is signicantly far from the maximum. A quadratic approximation to the log-likelihood function can be used to nd a relation between Wald and likelihood ratio statistics: Consider a random sample X1 , . . . , Xn f (x; ) and a test of the hypothesis H0 : = 0 . Assume that H0 is true. Under some regularity conditions: W2 0 Proof :
P

182 Consider a Taylor expansion of


1 n

CHAPTER 10. HYPOTHESIS TESTING (0 ) about


1 n

():

1 1 (0 ) d 1 (0 )2 d2 (0 ) = ()+ ()+ ()+terms in (0 )3 and smaller 2 n n n d 2 n d Now the gradient of the log-likelihood estimate:
d d

() is zero at the maximum likelihood

2 1 1 11 d (0 ) = () + (0 )2 2 () + terms in (0 )3 and smaller n n 2n d

Rearranging this expression and multiplying by 2n: d2 () = terms in n(0 )3 and smaller d2 2 P P d (since n(0 )3 0) 2 (0 ) () (0 )2 2 () 0 d d2 P P 2 (0 ) () (0 )2 In () 0 (since 2 ()/In () 1) d P i.e. W 2 0 2 (0 ) () (0 )2 2 Slutskys Theorem was used twice in the above proof. The third-last line, in which P we state that n(0 )3 0, follows since: n(0 )3 = n(0 )2 (0 ) Now if H0 : = 0 is true, n(0 )2 converges in distribution to X, say (where P X is proportional to a chi-squared distribution) and (0 ) 0. So by Slutskys P Theorem n(0 )3 X 0 = 0. Where else was Slutskys theorem used in the above proof? This result shows that Wald and likelihood ratio tests are asymptotically equivalent when the null hypothesis is true and in large samples, they typically return similar test statistics hence similar conclusions. However, when the null hypothesis is not true, these tests can have quite dierent properties, especially in small samples.

Multiparameter Extension of the Likelihood Ratio Test


The likelihood ratio test procedure can be extended to hypothesis tests involving several parameters simultaneously. Such situations arise, for example, in the important branch of statistics known as regression, explored in MATH2831. Consider a model with parameter vector and corresponding parameter space . A general class of hypotheses is: H0 : 0 versus H1 : 0 /

LIKELIHOOD RATIO TESTS* where 0 is a subset of . Then the likelihood ratio statistic is = 2 ln sup L() sup0 L() .

183

Under H0 and regularity conditions on L, 2 d where d is the dimension of minus the dimension of 0 .
D

184

CHAPTER 10. HYPOTHESIS TESTING

Chapter 11 Small-sample inference for normal samples


In previous chapters we have met methods of large-sample inference based on asymptotic results (all of which make use of the Central Limit Theorem, directly or indirectly). In this chapter we will meet a set of tools for making exact inference about normal random samples, even for small sample sizes. These methods are very widely used in practice because even when data are not normal, they still work quite well. Denition Let X1 , . . . , Xn be a random sample with common distribution N (, 2 ) for some and 2 . Then X1 , . . . , Xn is a normal random sample. Suppose X1 , X2 , . . . , Xn is a normal random sample. The sample mean X = n n 1 1 2 2 i=1 Xi and the sample variance S = n1 i=1 (Xi X) are two statistics of n particular interest to us they can be used to estimate the two parameters of the normal distribution, and 2 respectively. In order to understand the nature of the uncertainty when using X and S 2 to estimate their respective parameters, we need to know what the distribution of these statistics is.

Distribution theory for normal random samples


Distribution of the sample mean
We have shown previously that any linear function of the Xi s has a Normal distribution, implying that the sample mean is normally distributed: 185

186 CHAPTER 11. SMALL-SAMPLE INFERENCE FOR NORMAL SAMPLES Result If X1 , . . . , Xn is a random sample from the N (, 2 ) distribution then X N (0, 1) . / n Note that the Central Limit Theorem from chapter 8 states that a sum or mean of a random sample from any distribution is approximately normal. The above result states that this result is exact if we have a normal random sample. (It follows then that for random samples from non-normal distributions, the central limit theorem will work better when X is closer to a normal distribution.) A common task of interest is to make inferences about , e.g. we often wish to construct a condence interval for . We could use the above result (to construct an exact Wald CI), however it involves knowing the true standard deviation . If we dont know , and we replace it with s, the condence interval is no longer be exact. What we really would like to know is the exact distribution of: X S/ n and in the following sections we will derive the exact distribution.

Distribution of the sample variance


Result Suppose X1 , X2 , . . . , Xn are a random sample from the N (, 2 ) distribu tion. Then X and S 2 are independent. This result can be proved by noting that X and the Xi X are jointly normal, and showing that their correlation is zero hence that they are independent. Then since S is a function of the Xi X, it also is independent of X. The sample variance S 2 of a normal random sample is related to a special distribution known as the chi-squared distribution: Denition If X has density fX (x) = ex/2 x/21 , x>0 2/2 (/2)

then X has the 2 (chi-squared) distribution with degrees of freedom . A common shorthand is X 2 . Note: we pronounce chi as kai, rhymes with buy and tie.

DISTRIBUTION THEORY FOR NORMAL RANDOM SAMPLES

187

The chi-squared distribution is related to the standard normal distribution, as below.

Results 1. If Z N (0, 1) then Z 2 2 . 1 2. If X1 , X2 , . . . , Xn are independent 2 random variables with Xi 2i ,


n n

then
i=1

Xi

2 ,

where =
i=1

i .

Result 1 can be derived via the cdf of Z 2 , Result 2 can be proved using moment generating functions (mgf given overleaf). Can you prove these results? Thus if X N (, 2 ), then X N (0, 1) and X 2 are independent N (, ) random variables, then
n 2

2 . Also, if X1 , X2 , . . . , Xn 1

i=1

Xi

2 , n

by Result 2.

This leads us to a key result the distribution of S 2 for a normal random sample. Result Let X1 , . . . , Xn be a random sample from the N (, 2 ) distribution. Then (n 1)S 2 2 . n1 2 To prove this result, the trick is n (Xi X)2 = n (Xi + X)2 then i=1 i=1 expand and simplify to terms involving (Xi )2 and n(X )2 , each of which is related to the chi-squared distribution...

Further properties of the chi-squared distribution In order to better understand the distribution of sample variances, it is useful to note a few further properties of the chi-squared distribution. Firstly, it is a special case of the Gamma distribution dened in Chapter 4: Result If X 2 then X Gamma(/2, 2). This means that Chapter 4 results can be used to obtain similar results for chisquared random variables:

188 CHAPTER 11. SMALL-SAMPLE INFERENCE FOR NORMAL SAMPLES Results If X 2 then 1. E(X) = , 2. Var(X) = 2, 3. mX (u) =
/2 1 12u

u < 1/2.

We can calculate cumulative probabilities and quantiles for the chi-squared distribution on R using the pchisq and qchisq functions. Alternatively, tables for the chi-squared distribution can be used these are in the back of the Course Pack (and on UNSW Blackboard). Note that unlike R, the tables are right-tailed: they give probabilities p of the form P (2 > a) = p.
2 X density

area = area = 2 ,1

= mean 2 ,

Note that in the above gure (and elsewhere) we write the pth quantile of the 2 2 2 2 distribution as denote ,p , i.e. ,p satises P (X , ) = .

Example

Use tables to nd the following. 1. 2 10,0.95 2. 2 15,0.975 3. p such that 2 = 19.02 9,p 4. The smallest interval you can for p where 2 = 4.2. 1,p Check your answers using R.
1

DISTRIBUTION THEORY FOR NORMAL RANDOM SAMPLES

189

The t Distribution
We would like to make inferences about using: X S/ n But what is the exact distribution of this quantity? In the special case where the Xi are a normal random sample, it has what is known as a t distribution (or Students t distribution). Denition If T t then +1 2 fT (t) = 2 A common shorthand is T t . Denition Suppose Z N (0, 1) and Q 2 and Z and Q are independent. Then Z Q/ t . t2 1+

(+1) 2

< t < .

A proof of this result is most simply achieved using a bivariate transformation, which will be outlined in MATH2831. The above result leads to the following, which has fundamental importance in applied statistics: Result Let X1 , . . . , Xn be a random sample from the N (, 2 ) distribution. Then X tn1 . S/ n This result is important because it allows us to make exact inferences about based on a normal random sample, without knowledge of . Further, it turns out that X has approximately a tn1 distribution when we have a random samples from S/ n a distribution that is not normal. While the result is only exact for normal random samples, it is approximately true for approximately normal random samples, and it usually works quite well in practice even for random samples from grossly nonnormal distributions (provided that we have sucient sample size).

190 CHAPTER 11. SMALL-SAMPLE INFERENCE FOR NORMAL SAMPLES Proof: Let T =
X , S/ n (X) / n

then Z
Q n1

T =

S 2 / 2

where Z N (0, 1) and Q 2 . n1

Also, the N (0, 1) and 2 variables are independent since X and S 2 are inden1 pendent. Thus, from the denition of the t distribution we can conclude that X T = S/n tn1 . 2 The t-test was rst derived by William S. Gosset, a statistician working for the Guinness brewery in Ireland on barley trials. Guinness wouldnt permit employees to publish results, in case they revealed trade secrets, so instead Gosset published his work under the pen-name Student. Hence the t-tables are sometimes referred to as Students t-tables. For details see: http://en.wikipedia.org/wiki/William Sealy Gosset Further properties of the t distribution Result If T t then fT (u) = fT (u) and so fT is symmetric about 0. This in turn implies that E(T ) = 0, provided that E(T ) exists. (It doesnt actually exist for = 1, as shown in a Chapter 3 exercise.) Result If T t then as , T converges to a N (0, 1) random variable. Proof: fT (t) and

t2 1+

(+1) 2

lim

t2 1+

= et
t2

so

lim fT (t) e 2 . 2

This result can also be derived via Slutskys theorem. Can you see how? When the degrees of freedom k is small, the variance estimator S 2 of 2 may not be very reliable, and the presence of S in the denominator of tk adds considerable extra variability. Thus a tk distribution for small k is more variable, and has thicker tails, than a standard normal N (0, 1) distribution. Similarly, as the degrees of freedom k becomes large, the precision of estimation of 2 by S 2 becomes accurate, and the tk distributions converge in distribution to N (0, 1).

INFERENCE FOR IN A NORMAL RANDOM SAMPLE

191

The pt and qt functions on R can be used to calculate cumulative probabilities and quantiles from t distributions. Alternatively, tables are available in the back of the Course Pack (and on UNSW Blackboard). Again, the tabulated values are for right-tailed probabilities, unlike the left-tailed probabilities on R.

t density

area =

area =

t, = t,1

t,1

We use a similar notation for quantiles of the t distribution as for quantiles of the 2 distribution: if T t , then P (T t, ) = . That is, t, is the th quantile of the t distribution.

Example Find the following (using tables and R):

1. t10,.95 2. t15,0.99 3. Find c such that P (t8 > c) = 0.01.


1

Inference for in a Normal Random Sample


The t-distribution result above can be used to make exact, small-sample inference about the true mean:

192 CHAPTER 11. SMALL-SAMPLE INFERENCE FOR NORMAL SAMPLES Exact inference about from a normal random sample Let X1 , . . . , Xn be a random sample from the N (, 2 ) distribution. Then an exact 100(1 )% condence interval for is S S X tn1,1/2 , X + tn1,1/2 n n .

To test H0 : = 0 , an exact test can be constructed using: T = X 0 S/ n

which has exactly a tn1 distribution if H0 is true. The above hypothesis test is known as a t-test. In rst year you learnt an alternative to this method based on the binomial distribution the sign test. But t-tests are known to outperform sign tests in most settings intuitively, this is because a t-test uses all the information in the data rather than just summarising each value as > 0 or < 0 .

Example Body temperature of adults is claimed to be 37 degrees Celsius on average, and this is routinely used as the baseline in determining whether or not a patient has a fever. A study at the University of Maryland tested this claim based on measurements of 106 randomly selected healthy American adults. The dataset Bodytemp, containing the results of this study, is available on UNSW Blackboard. The dataset looks something like this (but with 106 measurements in it): 37.00, 37.00, 36.67, 36.67, 37.22, 36.89, . . ., 36.67, 36.11 Find a 90% condence interval for the true average body temperature of healthy American adults.

Example The UNSW hairdresser charges $35 for a haircut. In comparison, a sample of students spend at the hairdressers an average of , with a standard deviation of .

INFERENCE ABOUT X Y FOR TWO NORMAL RANDOM SAMPLES 193 Is there evidence that the university hairdresser is over-charging, i.e. charging more than the average student spends?

We could use a similar strategy, but via the chi-square distribution, to make inferences about based on a random sample. However, this is not advisable because methods of small-sample inference about are not robust to non-normality the Central Limit Theorem oers robustness when making inference about , but not .

Inference about X Y for two Normal Random Samples

A common situation in applied statistics is one of comparison between two samples. For example, Is recovery time shorter for patients using a new treatment than for patients on the old treatment? Do people lose more weight, on average, if they go on the CSIRO diet than the Atkins diet? Such data commonly arises from randomised comparative experiments with two treatment groups.

194 CHAPTER 11. SMALL-SAMPLE INFERENCE FOR NORMAL SAMPLES Result Let X1 , . . . , XnX N (X , 2 ) and Y1 , . . . , YnY N (Y , 2 ) be two independent normal random samples; each with the same variance 2 . Then a 100(1 )% condence interval for X Y is X Y tnX +nY 2,1/2 Sp where
2 Sp =

1 1 + nX nY

2 2 (nX 1)SX + (nY 1)SY , nX + nY 2 known as the pooled sample variance. To test H0 : X = Y , an appropriate test statistic is

X Y Sp
1 nX

1 nY

tnX +nY 2

under H0

As previously, the normality assumption is not critical when making inferences about , because some robustness to non-normality is inherited from the Central Limit Theorem. The proof is analogous to that for the single sample condence interval for .

Example Peak expiratory ow is measured for 7 normal six-year-old children and 9 sixyear-old children with asthma. The data are as follows: Normal children 55 57 80 71 62 56 77 asthmatic children 53 39 56 54 49 53 45 37 44

We would like to know if peak ow is dierent between normal and asthmatic chil-

INFERENCE ABOUT X Y FOR TWO NORMAL RANDOM SAMPLES 195 dren, and if so, how dierent. Use a condence interval to answer this question. Graphical inspection of the data shows that the data do not appear to be too far from the normal distribution, for each sample. Let X = mean peak expiratory ow for normal children Y = mean peak expiratory ow for asthmatic children.

We will obtain a 95% condence interval for X Y under the assumption that the variances for each population are equal. The sample sizes, sample means and sample variances are nX = 7, nY = 9, The pooled sample variance is x y 65.43, 47.78, s2 X s2 Y 109.62 47.19

The appropriate t-distribution quantile is The condence interval is then

In conclusion, we can be 95% condent that the dierence in mean peak expiratory ow between the two groups is between

Example Do MATH2801 and MATH2901 students spend the same amount at the hairdressers, on average? A class survey obtained the following results:

196 CHAPTER 11. SMALL-SAMPLE INFERENCE FOR NORMAL SAMPLES Carry out the hypothesis test: H0 : 2801 = 2901 versus H1 : 2801 = 2901

Robustness against inequality of variances*


In the previous example, the two sample variances had distinctly dierent values. This seems to call into question the validity of the homogeneous variance assumption 2 2 that X = Y . However, there is a valuable principle of robustness against inequality of variances, 2 2 due to G. E. P. Box, which states that even if X = Y , assuming equality anyway has minimal eect as long as the sample sizes nX , nY are nearly equal. To see this, equate the two expressions for se2 (X Y ), either assuming equal variances, or not assuming equal variances. This gives 1 1 + nX nY
2 Sp = 2 2 SX SY + nX nY

2 2 Solving gives two solutions, either (i) SX = SY , suggesting that the equal variances assumption is valid, or (ii) nX = nY , equal sample sizes.

Further analysis suggests that the equal variance assumption ends up being important only when the two sample sizes dier substantially, i.e. nX nY or nX nY and if the variances also dier substantially. A few rules of thumb are available one fairly lenient approach is to only worry about unequal variances if both the sample variances dier by a factor of at least two and if the sample sizes dier by a factor of two or more. An approach that is sometimes used to check the equal variance assumption is 2 2 to construct a test or condence interval for X /Y . However, this is not a very

INFERENCE ABOUT X Y FOR PAIRED NORMAL RANDOM SAMPLES197 practical idea it doesnt take into account what we are practically interested in (in particular, whether the sample sizes dier substantially as well), and most such tests do not work well when sample sizes are small (and hence condence intervals for variances are very broad). In the previous example, the sample sizes are nearly equal, so the CI calculated under the homogeneous variances assumption is likely to be reliable, even though the true underlying variances may be dierent.

Example Consider again the question of whether MATH2801 students spend more at the hairdressers than MATH2901 students, on average. Does the equal variance assumption appear to be reasonable?

Inference about X Y for Paired Normal Random Samples

Sometimes we have two samples and we want to compare the means, but the samples are not independent. This most commonly arises because the study is designed in such a way that data are collected in pairs, e.g. a matched pairs experiments. When data are paired, it is not appropriate to use the method of the previous chapter, which assumes independence of samples. But we can make inferences using the paired dierences:

198 CHAPTER 11. SMALL-SAMPLE INFERENCE FOR NORMAL SAMPLES Exact inference about X Y from a paired normal random sample Let (Xi , Yi ), i = 1, . . . , n be a paired random sample that is, Cor(Xi , Yi ) = 0. The paired dierences Di = Xi Yi can be considered as a random sample, and if they come from N (X Y , 2 ) distribution, we can use standard one-sample methods from page 11 to make inference about X Y . Example The number of drivers exceeding the speed limit, per month, was recorded before and after the installation of speed cameras at four locations (data from the Sydney Morning Herald, 22nd September 2003). location before after dierence Concord West 5719 1786 Edgecli Wentworthville 7535 6254 2228 528 Green Valley 2200 260

1. Test the hypothesis that there was no reduction in average number of drivers exceeding the speed limit. 2. Use a condence interval to estimate the average reduction in number of people speeding per month per camera, on average.

Chapter 12 Inference for categorical data


In the previous chapter we introduced common methods for making inference about one or two random samples from a quantitative variable. In this chapter we will review common methods for the categorical case. We will at times treat as distinct the situation where we have Bernoulli categorical variable(s) (i.e. variables with only two categories), because in that case the methods simplify to inference about binomial proportions.

Inference for a Bernoulli variable


We have already studied how to make inferences about a categorical variable that as only two categories in this case, we are interested in making inferences about the probability of success p which can be achieved using the following result: One-sample inference about proportions Consider a random sample X1 , . . . , Xn from a Bernoulli(p) variable. We can make inferences about the probability of success p as follows: An approximate 100(1 )% condence interval for p is p z1/2 p(1 p) , p + z1/2 n p(1 p) n

To test the hypothesis H0 : p = p0 , use as a test statistic Z= p0 (1 p0 )/n p p0 N (0, 1) if H0 is true

These normal approximations are known to work well when np0 and n(1p0 ) exceed 5 (for the Z-test), or when n and n(1 p) exceed 5 (for the CI). p Note the test statistic uses in the denominator the standard error as estimated 199

200

CHAPTER 12. INFERENCE FOR CATEGORICAL DATA

under H0 , p0 (1 p0 )/n, rather than the usual standard error p(1 p)/n. The reason for this is that this statistic is known to be better approximated by a normal distribution. (We could also apply a continuity correction as in Chapter 8, but strangely, this tends not to be common practice.) If n is not large enough for the normal approximation to be reasonable, you can use the binomial distribution directly for hypothesis testing (as you did last year in MATH1B).

Inference for a categorical variable


A random sample from a categorical variable with K levels can be summarised as a table of frequencies. We can test hypothesis about such a table as follows: Chi-square test of a categorical variable Consider a table of frequencies, X1 , . . . , XK , constructed using a random sample from a categorical variable with K < levels. The vector of probabilities p = (p1 , . . . , pK ) give the probabilities that an observation will be categories into each of the K levels. To test the hypothesis H0 : p = p0 , use as a test statistic
K

X =
k=1

(Xk k )2 2 K1 k

if H0 is true

where k is the kth element of np0 , the vector of expected counts under H0 . The chi-squared approximation is known to work well when all k 5. Deriving this test statistic is a little tricky, but it has a quite intuitive form the k Xk can be understood as counts, and Xk standardises the count to a Z-score k (since the variance is k for a Poisson count). For large enough k these Z-scores are approximately normal, so the test statistic is a sum of squared standard normal variables, hence the chi-squared distribution. This is a slight oversimplication more details provided in MATH3811. This test is due to Karl Pearson (1900) hence it is sometimes called a Pearson chisquared test. (Pearson made many important contributions to statistics, including method-of-moments estimation, the sample correlation coecient, and the term spurious correlation for associations that arise due to mutual response to lurking variables.) The degrees of freedom are K 1 rather than K you will see a more detailed explanation in MATH3831, but intuitively, degrees of freedom follow the formula: Degrees of freedom = Number of counts Number of constraints to estimate k

INFERENCE FOR A CATEGORICAL VARIABLE

201

In our case there are K values in the table of frequencies and one constraint (that the values sum to n), hence K 1 degrees of freedom.

Example The seasons in which MATH2801/2901 students were born are as follows:

Summer

Autumn

Winter

Spring

Total

Is there evidence that people are more likely to be born at one time of year than another?

It can actually be shown that the test of a Binomial proportion, as in the previous section, is mathematically equivalent to a chi-square test, when data have been re-expressed in a table of frequencies. Can you show this result?

Example In a famous experiment in the early days of genetic trials, Mendel crossed 556 smooth, yellow male peas with wrinkled, female peas. The expected relative frequencies predicted by the laws of inheritance, and the observed values, are as follows:

202

CHAPTER 12. INFERENCE FOR CATEGORICAL DATA Type Predicted ( k ) n 9 Smooth yellow 16 3 Smooth green 16 3 Wrinkled yellow 16 1 Wrinkled green 16 Observed 315 108 102 31

Do these data provide any evidence against the laws of inheritance?

Goodness-of-t testing for a categorical variable


A related statistic can be used to assess goodness-of-t of a discrete distribution to a random sample: Chi-squared test for goodness-of-t Consider a random sample from a discrete random variable X with an unknown probability function fX (x) over a nite set of K possible values. To test the hypothesis H0 : fX (x) = f (x; 1 , . . . , p ) with unknown 1 , . . . , p , use as a test statistic
K

X =
k=1

(Xk k )2 2 Kp1 k

if H0 is true

where k = nf (x; 1 , . . . , p ). The chi-squared approximation is known to work well when all k 5. The k are typically estimated via maximum likelihood, although other methods of

ASSOCIATIONS BETWEEN CATEGORICAL VARIABLES estimation can also be used.

203

Note that the degrees of freedom are K p 1 rather than K 1: this is because p parameters need to be estimated in order to nd the expected counts.

Example A Poisson distribution often provides a good model for rare counts. A famous example, due to Ladislaus Bortkiewicz (1898 The Law of Small Numbers), is the number of deaths per unit per year in the Prussian cavalry:

Deaths Frequency

0 109

1 65

2 22

3 4 >4 3 1 0

Is there evidence that these data depart from a Poisson distribution?

Chi-squared goodness-of-t tests are an alternative to quantile plots for small discrete datasets. They can be applied to quantitative data as well (by binning the values and comparing observed and expected counts in the ensuing frequency table), although this is not common, as a quantile plot is a more informative way to assess goodness-of-t for quantitative data.

Associations between categorical variables


Recall that when we want to summarise the association between two categorical variables we typically use a two-way table. Denote as xrc the count of the number of subjects which take the rth level of the rst variable and the cth level of the second.

204

CHAPTER 12. INFERENCE FOR CATEGORICAL DATA

Then the two-way table is as follows: x11 x21 . . . xR1 n1 x12 . . . x1C n1 x22 . . . x2C n2 . .. . . . . . . . . . xR2 . . . xRC nR n2 . . . nC n

Pearsons chi squared statistic can be used to test for an association between two categorical variables, using the two-way table. Chi-squared test for association Consider an R C two-way table based on a random sample in which we observe two categorical random variables, for which the marginal probabilities are denoted as pr and pc , respectively, and the joint probability is prc , for r {1, . . . , R} and c {1, . . . , C}. To test the hypothesis H0 : prc = pr pc for each (r, c), use as a test statistic
R C

X =
r=1 c=1

(Xrc rc )2 2 (R1)(C1) rc

if H0 is true

n where rc = nr pc = nrn c . p The chi-squared approximation is known to work well when all rc 5.

Note that the null hypothesis stated above, H0 : prc = pr pc for each (r, c), means that the two categorical variables are independent. Hence evidence against H0 is evidence of an association between variables. A common application of this method is when we wish to compare proportions across two independent samples of subjects (pX and pY , say): results can be written in a 2 2 table, and the test of independence is a test of pX = pY .

Example The results of the great prayer experiment were as follows: Any complications? Yes No Total Prayers 315 289 604 No prayers 304 293 597 Total 619 582 1201 Is there evidence of divine intervention? (i.e. Does whether or not you are prayed for inuence your chance of complications?)

ASSOCIATIONS BETWEEN CATEGORICAL VARIABLES

205

An alternative way to test pX = pY would be to construct a Wald test for pX pY . The Wald test however can be shown to be mathematically equivalent to the chisquared test, but the advantage of the chi-squared test is that it has a natural extension to R C tables where R, C > 2. Another statistic related to the Pearson chi-square statistic is the likelihood ratio statistic the two statistics can be shown to be asymptotically equivalent (using similar methods to page 181). The reason for use of (R1)(C 1) degrees of freedom can be understood as follows: Number of counts = RC Number of constraints for rc = 1 (for the n, since rc = nr pc ) p +R 1 (for the pr for r = 1 . . . , R) +C 1 (for the pc for r = 1 . . . , C)

Degrees of freedom = Number of counts Number of constraints = RC 1 (R 1) (C 1) = RC R C + 1 = (R 1)(C 1)

206

CHAPTER 12. INFERENCE FOR CATEGORICAL DATA

Paired samples of Bernoulli variables


Recall that in Chapter 11, if we had two paired samples, we made inference about the mean dierence by calculating the dierences and applying the standard onesample method (of page 191) to these dierences. A very similar thing happens for categorical data: Inference for paired samples of Bernoulli variables Consider a random sample of a pair of Bernoulli observations, (Xi , Yi ), i = 1, . . . , n, where Xi Bernoulli(px ) and Yi Bernoulli(py ). We want to make inferences about the paired dierence in proportions pX pY . To test: H0 : pX = pY we construct the table of frequencies: Yi = 0 Yi = 1 Xi = 0 x00 x01 x10 x11 Xi = 1 The dierences (x01 , x10 ) can be considered as a sample from Bin(x01 + x10 , p), and we can test the hypothesis H0 : p = 0.5 using page 199 methods. A dierent procedure, known as McNemars test, is commonly recommended for this situation. McNemars formulation is mathematically equivalent to the above test.

Example Johnson & Johnson (1985) studied whether Hodgkins disease was more prevalent in patients who had a tonsillectomy. Their method was to match each of 85 Hodgkins patients with a sibling, and record whether or not each had a tonsillectomy. The results can be summarised using the following tables: Table A No tonsillectomy 44 52

Hodgkins Control

Tonsillectomy 41 33

Patient

No tonsillectomy Tonsillectomy

Table B Sibling No tonsillectomy Tonsillectomy 37 7 tonsillectomy 15 26

PAIRED SAMPLES OF BERNOULLI VARIABLES 1. What type of study design was used?

207

2. Which table should be used to test if tonsillectomies aect the incidence of Hodgkins disease? 3. Test the hypothesis of independence of tonsillectomies and Hodgkins disease.

The Rice text explains that in their original publication, Johnson & Johnson (1985) used the wrong table and analysis, and found no association. Hence the researchers found and published the wrong result because they neglected to take into account the pairing structure in their data! It is important to be able to recognise paired data situations because the appropriate method of analysis is quite distinct from what happens with independent samples, and as above, you can get dierent results.

208

CHAPTER 12. INFERENCE FOR CATEGORICAL DATA

Chapter 13 Summary statistical inference


We have now met most of the common methods of statistical inference that are used in practice. They can be summarised in the following table: Summary of methods of inference Useful methods for making inference about a variable, or about the association between two variables, depend on whether these variables are categorical or quantitative. Does the research question involve: One variable Two variables Both Both One of each Data type: Categorical Quantitative quantative categorical Key p p X pY X Y (MATH2831) parameters: Z-test for p Tests: t-test for 2 test t-test (MATH2831) or 2 test

If the categorical variable has only two categories, i.e. we have a two-sample situation If data are from a known, non-normal distribution, we can use likelihood-based inference (e.g. Wald test) If data are paired, calculate dierences and use the appropriate one-variable method

Whats next?
The above table raises a number of questions, which will be answer during the remainder of a statistics major! How do we make inferences about the relationship between quantitative variables? This uses regression analysis, an important statistical tool (see MATH2831 and MATH3821). What if we have more than two samples? This uses analysis of variance, an important special case of regression (MATH2831). What about when there are more than two variables? 209

210

CHAPTER 13. SUMMARY STATISTICAL INFERENCE

This situation is handled by regression analysis (MATH2831) for quantitative responses and by categorical data analysis (MATH3851) for categorical responses. What if parametric assumptions are not satised? Hence we cannot assume normality, nor can we use likelihood-based inference. Two common approaches in this setting are non-parametric statistics and robust statistics (MATH3811). What if data are not iid? Often data are not collected as an iid sample. A common example is when data follow a stochastic process, where events occur randomly in space or time (MATH3801 and MATH3841). How do we know whether or not an inferential procedure works well? Several methods of making statistical inferences have been recommended in this course. To demonstrate that these are good methods to use, we need to think about power and eciency (MATH3811). How should we collect data in order to use these methods? We will consider two cases survey design (MATH3831) for observation studies and experimental design (MATH3851) for controlled experiments. So youll get to answer all the above questions (and encounter some more) if you choose to do a stats major!

You might also like