

PROBABILITY AND STATISTICS

Some Basic Definitions

A random variable has a large set of possible values, of which only one will occur. The set of possible values together with their probabilities is the probability distribution.


A continuous random variable $X$ has a probability density function (pdf) $f_X$ such that
$$P(X \in A) = \int_A f_X(x)\,dx \quad \text{for all sets } A.$$


CDFs

The cumulative distribution function (CDF) of $X$ is $F_X(x) := P(X \le x)$. If $X$ has a pdf, then
$$F_X(x) = \int_{-\infty}^{x} f_X(u)\,du.$$


Quantiles

If the CDF of $X$ is continuous and strictly increasing, then it has an inverse function $F^{-1}$. For $q$ between 0 and 1, $F^{-1}(q)$ is called the $q$th quantile or $100q$th percentile.


The probability that $X$ is below its $q$th quantile is $q$: $P\{X \le F^{-1}(q)\} = q$. This is also called the $q$th lower quantile; the $q$th upper quantile is the $(1-q)$th lower quantile.
[Figure: plots of the CDF $F$ and its quantiles $F^{-1}(q)$ for $q$ between 0 and 1.]


The median is the 50th percentile, or 0.5 quantile. The 25th and 75th percentiles (the 0.25 and 0.75 quantiles) are called the first and third quartiles. For 95% confidence intervals we use the 0.025 and 0.975 quantiles, i.e., the 0.025 lower and 0.025 upper quantiles.
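For a data sample, these quantiles are easy to compute. Below is a minimal sketch with NumPy; the simulated data and variable names are only for illustration.

```python
import numpy as np

x = np.random.default_rng(0).normal(size=1000)  # hypothetical simulated sample

median = np.quantile(x, 0.5)
q1, q3 = np.quantile(x, [0.25, 0.75])           # first and third quartiles
lo, hi = np.quantile(x, [0.025, 0.975])         # quantiles used for 95% intervals
print(median, q1, q3, lo, hi)
```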


Expectations and Variances

The expectation of $X$ is
$$E(X) := \int_{-\infty}^{+\infty} x f_X(x)\,dx.$$

The variance of $X$ is
$$\sigma_X^2 := \int \{x - E(X)\}^2 f_X(x)\,dx = E\{X - E(X)\}^2.$$

Useful formula: $\sigma_X^2 = E(X^2) - \{E(X)\}^2$.

The standard deviation is the square root of the variance: $\sigma_X := \sqrt{E\{X - E(X)\}^2}$.


If $X_1, \ldots, X_n$ is a sample from a probability distribution, then the expectation is estimated by the sample mean
$$\bar{X} = n^{-1} \sum_{i=1}^{n} X_i$$
and the variance is estimated by the sample variance
$$s_X^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}.$$
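As a quick illustration, here is a minimal NumPy sketch of these two estimators; the array values are made up for the example.

```python
import numpy as np

x = np.array([2.1, -0.3, 1.7, 0.4, 2.9])  # hypothetical sample

n = len(x)
sample_mean = x.sum() / n                              # X-bar = n^{-1} * sum(X_i)
sample_var = ((x - sample_mean) ** 2).sum() / (n - 1)  # s_X^2 with the n-1 divisor

# NumPy equivalents: np.mean(x) and np.var(x, ddof=1)
print(sample_mean, sample_var)
```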


Correlation and Covariance

The covariance is
$$\sigma_{XY} = E[\{X - E(X)\}\{Y - E(Y)\}].$$
If $(X, Y)$ are continuously distributed, then
$$\sigma_{XY} = \iint \{x - E(X)\}\{y - E(Y)\}\, f_{XY}(x, y)\,dx\,dy.$$


Useful formulas:
$$\sigma_{XY} = E(XY) - E(X)E(Y)$$
$$\sigma_{XY} = E[\{X - E(X)\}Y] = E[\{Y - E(Y)\}X]$$
$$\sigma_{XY} = E(XY) \quad \text{if } E(X) = 0 \text{ or } E(Y) = 0.$$


The correlation coefficient between $X$ and $Y$ is
$$\rho_{XY} := \sigma_{XY}/(\sigma_X \sigma_Y).$$
For any $(X, Y)$ it is true that $-1 \le \rho_{XY} \le 1$.


Given a bivariate sample $\{(X_i, Y_i)\}_{i=1}^{n}$, the sample correlation coefficient is
$$\frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{(n-1)\, s_X s_Y}, \qquad (1)$$
where $\bar{X}$ and $\bar{Y}$ are the sample means and $s_X$ and $s_Y$ are the sample standard deviations.
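A quick check of (1) in Python (a minimal sketch on made-up data; `np.corrcoef` returns the same quantity):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.1, 1.9, 3.2, 3.8, 5.3])   # hypothetical paired sample

n = len(x)
r_manual = np.sum((x - x.mean()) * (y - y.mean())) / ((n - 1) * x.std(ddof=1) * y.std(ddof=1))
r_numpy = np.corrcoef(x, y)[0, 1]          # same quantity
print(r_manual, r_numpy)
```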


[Figure: eight scatterplots of simulated bivariate samples with sample correlations r = 0.01, 0.25, 0.5, 0.95, 0.11, 0.83, 0.89, and 1.]


An absolute correlation of .25 is very weak. An absolute correlation of .5 is only moderate. An absolute correlation of .95 is rather strong. An absolute correlation of 1 implies a linear relationship. A strong nonlinear relationship may or may not imply a high correlation.


Positive correlations indicate an increasing relationship; negative correlations indicate a decreasing relationship.


$X$ and $Y$ are independent if for all sets $A$ and $B$, $P(X \in A \text{ and } Y \in B) = P(X \in A)\,P(Y \in B)$. If $X$ and $Y$ are independent, then for all functions $g$ and $h$, $E\{g(X)h(Y)\} = E\{g(X)\}\,E\{h(Y)\}$. If $X$ and $Y$ are independent, then $\rho_{XY} = 0$. However, $\rho_{XY} = 0$ does not imply independence; there could be a strong nonlinear association between $X$ and $Y$.


Best Linear Prediction

Idea: use $X$ (observed) to predict $Y$ (unobserved). A linear predictor is $\hat{Y} := \beta_0 + \beta_1 X$. The prediction error is $Y - \hat{Y}$, and the squared error is $\{Y - \hat{Y}\}^2 = \{Y - (\beta_0 + \beta_1 X)\}^2$.


Best linear prediction means finding $\beta_0$ and $\beta_1$ to minimize the expected squared prediction error $E\{Y - (\beta_0 + \beta_1 X)\}^2$. This is similar to linear regression, but uses populations rather than samples.


So, we want to minimize
$$E\{Y - (\beta_0 + \beta_1 X)\}^2 = E(Y^2) - 2\beta_0 E(Y) - 2\beta_1 E(XY) + \beta_0^2 + 2\beta_0\beta_1 E(X) + \beta_1^2 E(X^2).$$

Setting the partial derivatives to zero, we get
$$0 = -E(Y) + \beta_0 + \beta_1 E(X)$$
$$0 = -E(XY) + \beta_0 E(X) + \beta_1 E(X^2).$$


After some algebra (exercise), we find that
$$\beta_1 = \sigma_{XY}/\sigma_X^2$$
and
$$\beta_0 = E(Y) - \beta_1 E(X) = E(Y) - (\sigma_{XY}/\sigma_X^2)\,E(X).$$
Thus, the best linear predictor of $Y$ is
$$\hat{Y} := \beta_0 + \beta_1 X = E(Y) + \frac{\sigma_{XY}}{\sigma_X^2}\{X - E(X)\}.$$
Another way to look at this (exercise):
$$\frac{\hat{Y} - E(Y)}{\sigma_Y} = \rho_{XY}\,\frac{X - E(X)}{\sigma_X}.$$
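As an illustration, the population formulas can be mimicked with sample moments. This is only a sketch with made-up data; with sample quantities the result coincides with the least-squares simple regression line.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.3])   # hypothetical data

beta1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)  # sigma_XY / sigma_X^2
beta0 = y.mean() - beta1 * x.mean()                     # E(Y) - beta1 * E(X)
y_hat = beta0 + beta1 * x                               # best linear predictor
print(beta0, beta1)
```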


The prediction error is $Y - \hat{Y}$. It can be proved that $E\{Y - \hat{Y}\} = 0$, i.e., the prediction is unbiased. The expected squared prediction error is
$$E\{Y - \hat{Y}\}^2 = \sigma_Y^2 - \frac{\sigma_{XY}^2}{\sigma_X^2} = \sigma_Y^2(1 - \rho_{XY}^2).$$


How much does $X$ help us predict $Y$? Suppose we do not observe $X$ and predict $Y$ using a constant, denoted by $c$. The expected squared prediction error is
$$E(Y - c)^2 = \mathrm{Var}(Y) + \{c - E(Y)\}^2 \quad \text{(exercise: check)},$$
which is minimized by $c = E(Y)$, and then the expected squared prediction error is $\sigma_Y^2$.


If $X$ is observed, then the expected squared prediction error is
$$\sigma_Y^2(1 - \rho_{XY}^2).$$
So $\rho_{XY}^2$ is the fraction by which the prediction error is reduced when $X$ is known. This is an important fact that we will see again.


Example: if $\rho_{XY} = .5$, then the prediction error is reduced by 25% by observing $X$. If $\sigma_Y^2 = 3$, the expected squared prediction error is 3 if $X$ is unobserved, but only $2.25 = 3\{1 - (.5)^2\}$ if $X$ is observed.


Conditional Distributions

Let $f_{XY}(x, y)$ be the joint density of a pair of random variables $(X, Y)$. The marginal density of $X$ is
$$f_X(x) := \int f_{XY}(x, y)\,dy,$$
and similarly for $f_Y$. The conditional density of $Y$ given $X$ is
$$f_{Y|X}(y|x) = \frac{f_{XY}(x, y)}{f_X(x)}.$$


The conditional expectation of $Y$ given $X$ is the expectation calculated using $f_{Y|X}(y|x)$:
$$E(Y|X = x) = \int y\, f_{Y|X}(y|x)\,dy,$$
which is a function of $x$. The conditional variance of $Y$ given $X$ is
$$\mathrm{Var}(Y|X = x) = \int \{y - E(Y|X = x)\}^2 f_{Y|X}(y|x)\,dy.$$


The Normal Distribution

The standard normal distribution has density
$$\phi(x) := \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{x^2}{2}\right).$$
The $N(\mu, \sigma^2)$ density is
$$\frac{1}{\sigma}\,\phi\!\left(\frac{x - \mu}{\sigma}\right).$$
The standard normal CDF is
$$\Phi(x) := \int_{-\infty}^{x} \phi(u)\,du.$$
$\Phi$ can be evaluated using tables or more easily using software such as MATLAB or MINITAB.


Important: If $X \sim N(\mu, \sigma^2)$ then $P(X \le x) = \Phi\{(x - \mu)/\sigma\}$.

Example: If $X \sim N(5, 4)$, what is $P(X \le 7)$?

Answer: Using $x = 7$, $\mu = 5$, and $\sigma^2 = 4$, we have $(x - \mu)/\sigma = (7 - 5)/2 = 1$, and then $\Phi(1) = .8413$. In MATLAB 6, cdfn(1) gives ans = 0.8413.
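An equivalent check in Python, as a minimal sketch using SciPy's normal CDF:

```python
from scipy.stats import norm

# P(X <= 7) for X ~ N(mu=5, sigma^2=4), i.e. sigma = 2
p = norm.cdf(7, loc=5, scale=2)   # equivalently norm.cdf((7 - 5) / 2)
print(round(p, 4))                # 0.8413
```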


Conditional expectations and variances

Calculation of conditional expectations and variances can be difficult, but they are easy for a bivariate normal distribution. For a bivariate normal pair, the conditional expectation equals the best linear predictor:
$$E(Y|X) = E(Y) + \frac{\sigma_{XY}}{\sigma_X^2}\{X - E(X)\}.$$
The conditional variance is the expected squared prediction error:
$$\mathrm{Var}(Y|X) = \sigma_Y^2(1 - \rho_{XY}^2).$$


Linear Functions of Random Variables

$$E(aY + b) = aE(Y) + b, \qquad \mathrm{Var}(aY + b) = a^2\,\mathrm{Var}(Y).$$
$$E(w_1 X + w_2 Y) = w_1 E(X) + w_2 E(Y),$$
$$\mathrm{Var}(w_1 X + w_2 Y) = w_1^2\,\mathrm{Var}(X) + 2 w_1 w_2\,\mathrm{Cov}(X, Y) + w_2^2\,\mathrm{Var}(Y).$$

Note that
$$\mathrm{Var}(w_1 X + w_2 Y) = \begin{pmatrix} w_1 & w_2 \end{pmatrix}
\begin{pmatrix} \mathrm{Var}(X) & \mathrm{Cov}(X, Y) \\ \mathrm{Cov}(X, Y) & \mathrm{Var}(Y) \end{pmatrix}
\begin{pmatrix} w_1 \\ w_2 \end{pmatrix}.$$


Fact:
$$\begin{pmatrix} w_1 & \cdots & w_N \end{pmatrix}
\begin{pmatrix} a_{11} & \cdots & a_{1N} \\ \vdots & \ddots & \vdots \\ a_{N1} & \cdots & a_{NN} \end{pmatrix}
\begin{pmatrix} w_1 \\ \vdots \\ w_N \end{pmatrix}
= \sum_{i=1}^{N} \sum_{j=1}^{N} w_i w_j a_{ij}.$$


Suppose
$$X = \begin{pmatrix} X_1 \\ \vdots \\ X_N \end{pmatrix}.$$
Then the expectation of $X$ is
$$E(X) := \begin{pmatrix} E(X_1) \\ \vdots \\ E(X_N) \end{pmatrix}$$
and the covariance matrix of $X$ is
$$\mathrm{COV}(X) := \begin{pmatrix}
\mathrm{Var}(X_1) & \mathrm{Cov}(X_1, X_2) & \cdots & \mathrm{Cov}(X_1, X_N) \\
\mathrm{Cov}(X_2, X_1) & \mathrm{Var}(X_2) & \cdots & \mathrm{Cov}(X_2, X_N) \\
\vdots & \vdots & \ddots & \vdots \\
\mathrm{Cov}(X_N, X_1) & \mathrm{Cov}(X_N, X_2) & \cdots & \mathrm{Var}(X_N)
\end{pmatrix}.$$


Let $w$ be a vector of weights:
$$w = \begin{pmatrix} w_1 \\ \vdots \\ w_N \end{pmatrix}, \qquad
w^T X = \sum_{i=1}^{N} w_i X_i.$$
Then
$$E(w^T X) = w^T E(X) = \sum_{i=1}^{N} w_i E(X_i).$$


$$\mathrm{Var}(w^T X) = \sum_{i=1}^{N}\sum_{j=1}^{N} w_i w_j\,\mathrm{Cov}(X_i, X_j) = w^T\,\mathrm{COV}(X)\,w.$$


Example: Suppose that $X = (X_1\ X_2\ X_3)^T$, $\mathrm{Var}(X_1) = 2$, $\mathrm{Var}(X_2) = 3$, $\mathrm{Var}(X_3) = 5$, $\rho_{X_1, X_2} = .6$, and that $X_1$ and $X_2$ are independent of $X_3$. Find $\mathrm{Var}(X_1 + X_2 + \tfrac{1}{2}X_3)$.

Answer: The covariance between $X_1$ and $X_3$ is 0 by independence; the same is true of $X_2$ and $X_3$. The covariance between $X_1$ and $X_2$ is $(.6)\sqrt{(2)(3)} = 1.47$.


Therefore,
$$\mathrm{COV}(X) = \begin{pmatrix} 2 & 1.47 & 0 \\ 1.47 & 3 & 0 \\ 0 & 0 & 5 \end{pmatrix},$$
and
$$\mathrm{Var}(X_1 + X_2 + X_3/2)
= \begin{pmatrix} 1 & 1 & \tfrac{1}{2} \end{pmatrix}
\begin{pmatrix} 2 & 1.47 & 0 \\ 1.47 & 3 & 0 \\ 0 & 0 & 5 \end{pmatrix}
\begin{pmatrix} 1 \\ 1 \\ \tfrac{1}{2} \end{pmatrix}
= \begin{pmatrix} 1 & 1 & \tfrac{1}{2} \end{pmatrix}
\begin{pmatrix} 3.47 \\ 4.47 \\ 2.5 \end{pmatrix}
= 9.19.$$
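The quadratic form $w^T\,\mathrm{COV}(X)\,w$ is easy to check numerically. A minimal NumPy sketch:

```python
import numpy as np

cov = np.array([[2.0, 1.47, 0.0],
                [1.47, 3.0, 0.0],
                [0.0,  0.0, 5.0]])
w = np.array([1.0, 1.0, 0.5])

var = w @ cov @ w     # w^T COV(X) w
print(var)            # 9.19
```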


Let $w_1$ and $w_2$ be two weight vectors. Then
$$\mathrm{Cov}(w_1^T X, w_2^T X) = w_1^T\,\mathrm{COV}(X)\,w_2.$$
If $X_1, \ldots, X_N$ are independent, or at least uncorrelated, then
$$\mathrm{Var}(w^T X) = \sum_{i=1}^{N} w_i^2\,\mathrm{Var}(X_i).$$


Example continued: Suppose, as in the previous example, that $X = (X_1\ X_2\ X_3)^T$, $\mathrm{Var}(X_1) = 2$, $\mathrm{Var}(X_2) = 3$, $\mathrm{Var}(X_3) = 5$, $\rho_{X_1, X_2} = .6$, and that $X_1$ and $X_2$ are independent of $X_3$. Find the covariance between $(X_1 + X_2 + X_3)/3$ and $(X_1 + X_2)/2$.

Answer: Let
$$w_1 = \begin{pmatrix} \tfrac{1}{3} \\ \tfrac{1}{3} \\ \tfrac{1}{3} \end{pmatrix}
\quad \text{and} \quad
w_2 = \begin{pmatrix} \tfrac{1}{2} \\ \tfrac{1}{2} \\ 0 \end{pmatrix}.$$


Then
$$\mathrm{Cov}\!\left(\frac{X_1 + X_2 + X_3}{3}, \frac{X_1 + X_2}{2}\right)
= w_1^T\,\mathrm{COV}(X)\,w_2
= \begin{pmatrix} \tfrac{1}{3} & \tfrac{1}{3} & \tfrac{1}{3} \end{pmatrix}
\begin{pmatrix} 2 & 1.47 & 0 \\ 1.47 & 3 & 0 \\ 0 & 0 & 5 \end{pmatrix}
\begin{pmatrix} \tfrac{1}{2} \\ \tfrac{1}{2} \\ 0 \end{pmatrix}
= \begin{pmatrix} 1.157 & 1.490 & 1.667 \end{pmatrix}
\begin{pmatrix} \tfrac{1}{2} \\ \tfrac{1}{2} \\ 0 \end{pmatrix}
= 1.323.$$
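This can also be verified numerically; a short NumPy sketch reusing the covariance matrix from the previous example:

```python
import numpy as np

cov = np.array([[2.0, 1.47, 0.0],
                [1.47, 3.0, 0.0],
                [0.0,  0.0, 5.0]])
w1 = np.array([1/3, 1/3, 1/3])
w2 = np.array([1/2, 1/2, 0.0])

print(w1 @ cov @ w2)   # approximately 1.323
```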


Important fact: If $X$ has a multivariate normal distribution, then $w^T X$ is a normal random variable.


Example: Suppose that $E(X_1) = 1$, $E(X_2) = 1.5$, $\sigma_{X_1}^2 = 1$, $\sigma_{X_2}^2 = 2$, and $\mathrm{Cov}(X_1, X_2) = .5$. Find $E(.3X_1 + .7X_2)$ and $\mathrm{Var}(.3X_1 + .7X_2)$. If $(X_1\ X_2)^T$ is bivariate normal, find $P(.3X_1 + .7X_2 < 2)$.

Answers:
$$E(.3X_1 + .7X_2) = (.3)(1) + (.7)(1.5) = 1.35$$
$$\mathrm{Var}(.3X_1 + .7X_2) = (.3)^2(1) + (.7)^2(2) + (2)(.3)(.7)(.5) = 1.28$$
$$P(.3X_1 + .7X_2 < 2) = \Phi\{(2 - 1.35)/\sqrt{1.28}\} = \Phi(.5745) = .7172$$
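These numbers can be reproduced in a few lines of Python (a minimal sketch using NumPy and SciPy):

```python
import numpy as np
from scipy.stats import norm

mean = np.array([1.0, 1.5])
cov = np.array([[1.0, 0.5],
                [0.5, 2.0]])
w = np.array([0.3, 0.7])

m = w @ mean            # E(.3 X1 + .7 X2) = 1.35
v = w @ cov @ w         # Var(.3 X1 + .7 X2) = 1.28
p = norm.cdf(2, loc=m, scale=np.sqrt(v))   # P(.3 X1 + .7 X2 < 2)
print(m, v, round(p, 4))                   # 1.35 1.28 0.7172
```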


Hypothesis Testing

Example: $H_0: \mu = 1$ versus $H_1: \mu \ne 1$.

The rejection region is the set of possible samples that lead us to reject $H_0$; for example, reject $H_0$ if $|\bar{X} - 1|$ exceeds a cutoff $c$.

A Type I error occurs when the null hypothesis is true but we reject it. A Type II error occurs when the null hypothesis is false and we accept it.


The rejection region is chosen so that P(Type I error) is below a pre-specified value $\alpha$, called the level of the test. Typical values of $\alpha$ used in practice are .01, .05, or .1. As $\alpha$ is made smaller, the rejection region must be made smaller.


p-values

The p-value for a sample is defined as the smallest value of $\alpha$ for which the null hypothesis is rejected for that sample. To do the test, we find (typically using statistical software) the p-value of the sample. $H_0$ is rejected if we decide to use an $\alpha$ larger than the p-value; for example, if $p = .005$ and $\alpha = .01$, then reject $H_0$. $H_0$ is accepted if we use an $\alpha$ smaller than the p-value; for example, if $p = .03$ and $\alpha = .01$, then accept $H_0$.


Thus a small p-value is evidence against the null hypothesis, while a large p-value shows that the data are consistent with the null hypothesis.


Maximum Likelihood Estimation

$Y = (Y_1, \ldots, Y_n)^T$ is the vector of data. $\theta = (\theta_1, \ldots, \theta_p)^T$ is the vector of parameters. $f(y; \theta)$ is the density of $Y$, which depends on $\theta$.


Example: Suppose that $Y_1, \ldots, Y_n$ are IID $N(\mu, \sigma^2)$. Then $\theta = (\mu, \sigma^2)$ and
$$f(y; \theta) = \prod_{i=1}^{n} \frac{1}{\sigma}\,\phi\!\left(\frac{Y_i - \mu}{\sigma}\right)
= \prod_{i=1}^{n} \frac{1}{(2\pi)^{1/2}\sigma} \exp\!\left\{-\frac{1}{2\sigma^2}(Y_i - \mu)^2\right\}
= \frac{1}{(2\pi)^{n/2}\sigma^n} \exp\!\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(Y_i - \mu)^2\right\}.$$


$L(\theta) := f(Y; \theta)$ is the likelihood function. The maximum likelihood estimator (MLE) is the value of $\theta$ that maximizes $L(\theta)$; denote the MLE by $\hat{\theta}_{ML}$. Often it is mathematically easier to maximize $\log\{L(\theta)\}$.


Example: In the example, $\hat{\mu}_{ML} = \bar{Y}$:
$$\log\{L(\mu, \sigma^2)\} = \log\left[\frac{1}{(2\pi)^{n/2}\sigma^n} \exp\!\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(Y_i - \mu)^2\right\}\right]
= -n\left\{\log(\sigma) + \frac{1}{2}\log(2\pi)\right\} - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(Y_i - \mu)^2.$$
The MLE of $\mu$ minimizes $\sum_{i=1}^{n}(Y_i - \mu)^2$. Setting the derivative to zero,
$$0 = \sum_{i=1}^{n}(Y_i - \mu) \quad \Longrightarrow \quad \sum_{i=1}^{n} Y_i = n\mu \quad \Longrightarrow \quad \hat{\mu} = \bar{Y}.$$


With $\mu$ fixed at $\bar{Y}$, the MLE of $\sigma^2$ solves
$$0 = \frac{d}{d\sigma^2} \log\{L(\bar{Y}, \sigma^2)\} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n}(Y_i - \bar{Y})^2.$$
The solution is
$$\hat{\sigma}^2_{ML} = \frac{1}{n}\sum_{i=1}^{n}(Y_i - \bar{Y})^2.$$
$\hat{\sigma}^2_{ML}$ has a small bias; the bias-corrected MLE is the so-called sample variance
$$s_Y^2 = \frac{1}{n-1}\sum_{i=1}^{n}(Y_i - \bar{Y})^2.$$
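In NumPy the two estimators correspond to the `ddof` argument; a small illustrative sketch with made-up data:

```python
import numpy as np

y = np.array([1.2, -0.4, 0.8, 2.1, 0.3])  # hypothetical sample

sigma2_mle = np.var(y, ddof=0)   # (1/n)     * sum((y - ybar)^2), the MLE
s2_y       = np.var(y, ddof=1)   # (1/(n-1)) * sum((y - ybar)^2), bias-corrected
print(sigma2_mle, s2_y)
```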


In this textbook example there is an explicit formula for the MLE. With more complex models, there is no explicit formula. Rather, one writes a program to compute $\log\{L(\theta)\}$ for any $\theta$ and uses optimization software to maximize this function numerically; a sketch of this approach is shown below. For some models, such as the ARIMA time series models, there are software packages, e.g., MINITAB and SAS, that compute the MLE.
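For instance, the normal-model MLE above can be reproduced by numerical maximization of the log-likelihood. This is only a minimal sketch with SciPy and made-up data; in practice the analytic answers are $\bar{Y}$ and $\hat{\sigma}^2_{ML}$.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

y = np.array([1.2, -0.4, 0.8, 2.1, 0.3])   # hypothetical data

def neg_log_lik(theta):
    mu, log_sigma = theta                   # optimize log(sigma) so that sigma > 0
    return -np.sum(norm.logpdf(y, loc=mu, scale=np.exp(log_sigma)))

fit = minimize(neg_log_lik, x0=[0.0, 0.0])  # numerical maximization of log L
mu_hat, sigma_hat = fit.x[0], np.exp(fit.x[1])
print(mu_hat, sigma_hat**2)                 # close to y.mean() and np.var(y, ddof=0)
```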


Likelihood Ratio Tests (LRTs)

LRTs are a convenient, all-purpose tool. Let
$$\theta = \begin{pmatrix} \theta_1 \\ \theta_2 \end{pmatrix}.$$
We want to test a hypothesis about $\theta_1$ without making any hypothesis about the value of $\theta_2$. Example: we want to test that the population mean $\mu$ is zero; then $\theta_1 = \mu$ and $\theta_2 = \sigma^2$. Let $\theta_{1,0}$ be the hypothesized value of $\theta_1$. Example: $\theta_{1,0} = 0$ if we want to test that $\mu$ is zero.


The hypotheses are $H_0: \theta_1 = \theta_{1,0}$ and $H_1: \theta_1 \ne \theta_{1,0}$.

Neither hypothesis says anything about $\theta_2$.


Example: testing that $\mu$ is zero, the hypotheses are $H_0: \mu = 0$ and $H_1: \mu \ne 0$. Neither hypothesis specifies anything about $\sigma$. Let $\hat{\theta}_{ML}$ be the maximum likelihood estimator, and let $\hat{\theta}_{2,0}$ be the value of $\theta_2$ that maximizes $L(\theta)$ when $\theta_1 = \theta_{1,0}$.


The likelihood ratio test rejects $H_0$ if
$$2\log\left\{\frac{L(\hat{\theta}_{ML})}{L(\theta_{1,0}, \hat{\theta}_{2,0})}\right\}
= 2\left[\log\{L(\hat{\theta}_{ML})\} - \log\{L(\theta_{1,0}, \hat{\theta}_{2,0})\}\right]
\ge \chi^2_{\alpha,\,\dim(\theta_1)},$$
where $\dim(\theta_1)$ is the number of components of $\theta_1$ and $\chi^2_{\alpha,k}$ is the upper-$\alpha$ probability value of the chi-squared distribution with $k$ degrees of freedom. In other words, $\chi^2_{\alpha,k}$ is the $(1-\alpha)$ quantile, so that the probability above $\chi^2_{\alpha,k}$ is $\alpha$.


Example: $Y_1, \ldots, Y_n$ are IID $N(\mu, \sigma^2)$ and $\theta = (\mu, \sigma^2)$. We want to test that $\mu$ is zero.
$$\log(L) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(Y_i - \mu)^2.$$
$\log(L)$ at the MLE is
$$\log\{L(\bar{Y}, \hat{\sigma}^2_{ML})\} = -\frac{n}{2}\left\{1 + \log(2\pi) + \log(\hat{\sigma}^2_{ML})\right\}.$$
The value of $\sigma^2$ that maximizes $L$ when $\mu = 0$ is
$$\hat{\sigma}^2_0 = \frac{1}{n}\sum_{i=1}^{n} Y_i^2. \quad \text{(Exercise: check)}$$


Therefore,
$$2\left[\log\{L(\bar{Y}, \hat{\sigma}^2_{ML})\} - \log\{L(0, \hat{\sigma}^2_0)\}\right]
= n\left\{\log(\hat{\sigma}^2_0) - \log(\hat{\sigma}^2_{ML})\right\}
= n\log\!\left(\frac{\hat{\sigma}^2_0}{\hat{\sigma}^2_{ML}}\right)
= n\log\!\left\{\frac{\sum_{i=1}^{n} Y_i^2}{\sum_{i=1}^{n}(Y_i - \bar{Y})^2}\right\}.$$


The likelihood ratio test rejects $H_0$ if
$$n\log\!\left\{\frac{\sum_{i=1}^{n} Y_i^2}{\sum_{i=1}^{n}(Y_i - \bar{Y})^2}\right\} > \chi^2_{\alpha,1}. \qquad (2)$$
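The test statistic in (2) is easy to compute directly. A minimal Python sketch on made-up data; `chi2.ppf(1 - alpha, df=1)` gives the critical value $\chi^2_{\alpha,1}$:

```python
import numpy as np
from scipy.stats import chi2

y = np.array([0.9, -0.2, 1.4, 0.6, 1.1, 0.3])  # hypothetical sample
alpha = 0.05
n = len(y)

lrt_stat = n * np.log(np.sum(y**2) / np.sum((y - y.mean())**2))
critical = chi2.ppf(1 - alpha, df=1)            # upper-alpha chi-squared value

print(lrt_stat, critical, lrt_stat > critical)  # reject H0: mu = 0 if True
```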

To appreciate why (2) is a reasonable test, first consider the case $\mu = 0$. Simple algebra shows that
$$\sum_{i=1}^{n} Y_i^2 = \sum_{i=1}^{n}(Y_i - \bar{Y})^2 + n(\bar{Y})^2.$$
$\bar{Y}$ will be close to $\mu = 0$, so the fraction inside the log will be close to 1. The log of 1 is 0, so the left-hand side of (2) will be small and we do not reject the null (the right decision).


If $\mu$ is not 0, then $\bar{Y} \approx \mu \ne 0$ and the left-hand side of (2) will be large, so that we reject the null (the correct decision).
