

PROBABILITY AND STATISTICS

Some Basic Definitions

A random variable has a large set of possible values, of which only one will occur. The set of possible values together with their probabilities is the probability distribution.


A continuous random variable $X$ has a probability density function (pdf) $f_X$ such that
$$P(X \in A) = \int_A f_X(x)\,dx \quad \text{for all sets } A.$$


CDFs

The cumulative distribution function (CDF) of $X$ is $F_X(x) := P(X \le x)$. If $X$ has a pdf, then
$$F_X(x) = \int_{-\infty}^{x} f_X(u)\,du.$$


Quantiles

If the CDF of $X$ is continuous and strictly increasing, then it has an inverse function $F^{-1}$. For $q$ between 0 and 1, $F^{-1}(q)$ is called the $q$th quantile or $100q$th percentile.


The probability that $X$ is below its $q$th quantile is $q$: $P\{X \le F^{-1}(q)\} = q$. This is also called the $q$th lower quantile; the $q$th upper quantile is the $(1-q)$th lower quantile.
[Figure: plots of the CDF $F$ and its quantiles $F^{-1}(q)$ for $q$ between 0 and 1.]


The median is the 50th percentile, or 0.5 quantile. The 25th and 75th percentiles (the 0.25 and 0.75 quantiles) are called the first and third quartiles. For 95% confidence intervals we use the 0.025 and 0.975 quantiles, i.e., the 0.025 lower and 0.025 upper quantiles.
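For a data sample, these quantiles are easy to compute. Below is a minimal sketch with NumPy; the simulated data and variable names are only for illustration.

```python
import numpy as np

x = np.random.default_rng(0).normal(size=1000)  # hypothetical simulated sample

median = np.quantile(x, 0.5)
q1, q3 = np.quantile(x, [0.25, 0.75])           # first and third quartiles
lo, hi = np.quantile(x, [0.025, 0.975])         # quantiles used for 95% intervals
print(median, q1, q3, lo, hi)
```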


Expectations and Variances

The expectation of $X$ is
$$E(X) := \int_{-\infty}^{+\infty} x f_X(x)\,dx.$$

The variance of $X$ is
$$\sigma_X^2 := \int \{x - E(X)\}^2 f_X(x)\,dx = E\{X - E(X)\}^2.$$

Useful formula: $\sigma_X^2 = E(X^2) - \{E(X)\}^2$.

The standard deviation is the square root of the variance: $\sigma_X := \sqrt{E\{X - E(X)\}^2}$.


If $X_1, \ldots, X_n$ is a sample from a probability distribution, then the expectation is estimated by the sample mean
$$\bar{X} = n^{-1} \sum_{i=1}^{n} X_i$$
and the variance is estimated by the sample variance
$$s_X^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}.$$
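As a quick illustration, here is a minimal NumPy sketch of these two estimators; the array values are made up for the example.

```python
import numpy as np

x = np.array([2.1, -0.3, 1.7, 0.4, 2.9])  # hypothetical sample

n = len(x)
sample_mean = x.sum() / n                              # X-bar = n^{-1} * sum(X_i)
sample_var = ((x - sample_mean) ** 2).sum() / (n - 1)  # s_X^2 with the n-1 divisor

# NumPy equivalents: np.mean(x) and np.var(x, ddof=1)
print(sample_mean, sample_var)
```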


Correlation and Covariance

The covariance is
$$\sigma_{XY} = E[\{X - E(X)\}\{Y - E(Y)\}].$$
If $(X, Y)$ are continuously distributed, then
$$\sigma_{XY} = \iint \{x - E(X)\}\{y - E(Y)\}\, f_{XY}(x, y)\,dx\,dy.$$


Useful formulas:
$$\sigma_{XY} = E(XY) - E(X)E(Y)$$
$$\sigma_{XY} = E[\{X - E(X)\}Y] = E[\{Y - E(Y)\}X]$$
$$\sigma_{XY} = E(XY) \quad \text{if } E(X) = 0 \text{ or } E(Y) = 0.$$


The correlation coefficient between $X$ and $Y$ is
$$\rho_{XY} := \sigma_{XY}/(\sigma_X \sigma_Y).$$
For any $(X, Y)$ it is true that $-1 \le \rho_{XY} \le 1$.


Given a bivariate sample $\{(X_i, Y_i)\}_{i=1}^{n}$, the sample correlation coefficient is
$$\frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{(n-1)\, s_X s_Y}, \qquad (1)$$
where $\bar{X}$ and $\bar{Y}$ are the sample means and $s_X$ and $s_Y$ are the sample standard deviations.
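A quick check of (1) in Python (a minimal sketch on made-up data; `np.corrcoef` returns the same quantity):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.1, 1.9, 3.2, 3.8, 5.3])   # hypothetical paired sample

n = len(x)
r_manual = np.sum((x - x.mean()) * (y - y.mean())) / ((n - 1) * x.std(ddof=1) * y.std(ddof=1))
r_numpy = np.corrcoef(x, y)[0, 1]          # same quantity
print(r_manual, r_numpy)
```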


[Figure: eight scatterplots of simulated bivariate samples with sample correlations r = 0.01, 0.25, 0.5, 0.95, 0.11, 0.83, 0.89, and 1.]


An absolute correlation of .25 is very weak. An absolute correlation of .5 is only moderate. An absolute correlation of .95 is rather strong. An absolute correlation of 1 implies a linear relationship. A strong nonlinear relationship may or may not imply a high correlation.


Positive correlations indicate an increasing relationship; negative correlations indicate a decreasing relationship.


$X$ and $Y$ are independent if for all sets $A$ and $B$, $P(X \in A \text{ and } Y \in B) = P(X \in A)\,P(Y \in B)$. If $X$ and $Y$ are independent, then for all functions $g$ and $h$, $E\{g(X)h(Y)\} = E\{g(X)\}\,E\{h(Y)\}$. If $X$ and $Y$ are independent, then $\rho_{XY} = 0$. However, $\rho_{XY} = 0$ does not imply independence; there could be a strong nonlinear association between $X$ and $Y$.


Best Linear Prediction

Idea: use $X$ (observed) to predict $Y$ (unobserved). A linear predictor is $\hat{Y} := \beta_0 + \beta_1 X$. The prediction error is $Y - \hat{Y}$, and the squared error is $\{Y - \hat{Y}\}^2 = \{Y - (\beta_0 + \beta_1 X)\}^2$.


Best linear prediction means finding $\beta_0$ and $\beta_1$ to minimize the expected squared prediction error $E\{Y - (\beta_0 + \beta_1 X)\}^2$. This is similar to linear regression, but uses populations rather than samples.


So, we want to minimize
$$E\{Y - (\beta_0 + \beta_1 X)\}^2 = E(Y^2) - 2\beta_0 E(Y) - 2\beta_1 E(XY) + \beta_0^2 + 2\beta_0\beta_1 E(X) + \beta_1^2 E(X^2).$$

Setting the partial derivatives to zero, we get
$$0 = -E(Y) + \beta_0 + \beta_1 E(X)$$
$$0 = -E(XY) + \beta_0 E(X) + \beta_1 E(X^2).$$


After some algebra (exercise), we find that
$$\beta_1 = \sigma_{XY}/\sigma_X^2$$
and
$$\beta_0 = E(Y) - \beta_1 E(X) = E(Y) - (\sigma_{XY}/\sigma_X^2)\,E(X).$$
Thus, the best linear predictor of $Y$ is
$$\hat{Y} := \beta_0 + \beta_1 X = E(Y) + \frac{\sigma_{XY}}{\sigma_X^2}\{X - E(X)\}.$$
Another way to look at this (exercise):
$$\frac{\hat{Y} - E(Y)}{\sigma_Y} = \rho_{XY}\,\frac{X - E(X)}{\sigma_X}.$$
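As an illustration, the population formulas can be mimicked with sample moments. This is only a sketch with made-up data; with sample quantities the result coincides with the least-squares simple regression line.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.3])   # hypothetical data

beta1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)  # sigma_XY / sigma_X^2
beta0 = y.mean() - beta1 * x.mean()                     # E(Y) - beta1 * E(X)
y_hat = beta0 + beta1 * x                               # best linear predictor
print(beta0, beta1)
```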


The prediction error is $Y - \hat{Y}$. It can be proved that $E\{Y - \hat{Y}\} = 0$, i.e., the prediction is unbiased. The expected squared prediction error is
$$E\{Y - \hat{Y}\}^2 = \sigma_Y^2 - \frac{\sigma_{XY}^2}{\sigma_X^2} = \sigma_Y^2(1 - \rho_{XY}^2).$$


How much does $X$ help us predict $Y$? Suppose we do not observe $X$ and predict $Y$ using a constant, denoted by $c$. The expected squared prediction error is
$$E(Y - c)^2 = \mathrm{Var}(Y) + \{c - E(Y)\}^2 \quad \text{(exercise: check)},$$
which is minimized by $c = E(Y)$, and then the expected squared prediction error is $\sigma_Y^2$.


If $X$ is observed, then the expected squared prediction error is
$$\sigma_Y^2(1 - \rho_{XY}^2).$$
So $\rho_{XY}^2$ is the fraction by which the prediction error is reduced when $X$ is known. This is an important fact that we will see again.


Example: if $\rho_{XY} = .5$, then the prediction error is reduced by 25% by observing $X$. If $\sigma_Y^2 = 3$, the expected squared prediction error is 3 if $X$ is unobserved, but only $2.25 = 3\{1 - (.5)^2\}$ if $X$ is observed.


Conditional Distributions

Let $f_{XY}(x, y)$ be the joint density of a pair of random variables $(X, Y)$. The marginal density of $X$ is
$$f_X(x) := \int f_{XY}(x, y)\,dy,$$
and similarly for $f_Y$. The conditional density of $Y$ given $X$ is
$$f_{Y|X}(y|x) = \frac{f_{XY}(x, y)}{f_X(x)}.$$


The conditional expectation of $Y$ given $X$ is the expectation calculated using $f_{Y|X}(y|x)$:
$$E(Y|X = x) = \int y\, f_{Y|X}(y|x)\,dy,$$
which is a function of $x$. The conditional variance of $Y$ given $X$ is
$$\mathrm{Var}(Y|X = x) = \int \{y - E(Y|X = x)\}^2 f_{Y|X}(y|x)\,dy.$$


The Normal Distribution

The standard normal distribution has density
$$\phi(x) := \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{x^2}{2}\right).$$
The $N(\mu, \sigma^2)$ density is
$$\frac{1}{\sigma}\,\phi\!\left(\frac{x - \mu}{\sigma}\right).$$
The standard normal CDF is
$$\Phi(x) := \int_{-\infty}^{x} \phi(u)\,du.$$
$\Phi$ can be evaluated using tables or more easily using software such as MATLAB or MINITAB.


Important: If $X \sim N(\mu, \sigma^2)$ then $P(X \le x) = \Phi\{(x - \mu)/\sigma\}$.

Example: If $X \sim N(5, 4)$, what is $P(X \le 7)$?

Answer: Using $x = 7$, $\mu = 5$, and $\sigma^2 = 4$, we have $(x - \mu)/\sigma = (7 - 5)/2 = 1$, and then $\Phi(1) = .8413$. In MATLAB 6, cdfn(1) gives ans = 0.8413.
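An equivalent check in Python, as a minimal sketch using SciPy's normal CDF:

```python
from scipy.stats import norm

# P(X <= 7) for X ~ N(mu=5, sigma^2=4), i.e. sigma = 2
p = norm.cdf(7, loc=5, scale=2)   # equivalently norm.cdf((7 - 5) / 2)
print(round(p, 4))                # 0.8413
```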


Conditional expectations and variances

Calculation of conditional expectations and variances can be difficult, but they are easy for a bivariate normal distribution. For a bivariate normal pair, the conditional expectation equals the best linear predictor:
$$E(Y|X) = E(Y) + \frac{\sigma_{XY}}{\sigma_X^2}\{X - E(X)\}.$$
The conditional variance is the expected squared prediction error:
$$\mathrm{Var}(Y|X) = \sigma_Y^2(1 - \rho_{XY}^2).$$


Linear Functions of Random Variables

$$E(aY + b) = aE(Y) + b, \qquad \mathrm{Var}(aY + b) = a^2\,\mathrm{Var}(Y).$$
$$E(w_1 X + w_2 Y) = w_1 E(X) + w_2 E(Y),$$
$$\mathrm{Var}(w_1 X + w_2 Y) = w_1^2\,\mathrm{Var}(X) + 2 w_1 w_2\,\mathrm{Cov}(X, Y) + w_2^2\,\mathrm{Var}(Y).$$

Note that
$$\mathrm{Var}(w_1 X + w_2 Y) = \begin{pmatrix} w_1 & w_2 \end{pmatrix}
\begin{pmatrix} \mathrm{Var}(X) & \mathrm{Cov}(X, Y) \\ \mathrm{Cov}(X, Y) & \mathrm{Var}(Y) \end{pmatrix}
\begin{pmatrix} w_1 \\ w_2 \end{pmatrix}.$$


Fact:
$$\begin{pmatrix} w_1 & \cdots & w_N \end{pmatrix}
\begin{pmatrix} a_{11} & \cdots & a_{1N} \\ \vdots & \ddots & \vdots \\ a_{N1} & \cdots & a_{NN} \end{pmatrix}
\begin{pmatrix} w_1 \\ \vdots \\ w_N \end{pmatrix}
= \sum_{i=1}^{N} \sum_{j=1}^{N} w_i w_j a_{ij}.$$


Suppose
$$X = \begin{pmatrix} X_1 \\ \vdots \\ X_N \end{pmatrix}.$$
Then the expectation of $X$ is
$$E(X) := \begin{pmatrix} E(X_1) \\ \vdots \\ E(X_N) \end{pmatrix}$$
and the covariance matrix of $X$ is
$$\mathrm{COV}(X) := \begin{pmatrix}
\mathrm{Var}(X_1) & \mathrm{Cov}(X_1, X_2) & \cdots & \mathrm{Cov}(X_1, X_N) \\
\mathrm{Cov}(X_2, X_1) & \mathrm{Var}(X_2) & \cdots & \mathrm{Cov}(X_2, X_N) \\
\vdots & \vdots & \ddots & \vdots \\
\mathrm{Cov}(X_N, X_1) & \mathrm{Cov}(X_N, X_2) & \cdots & \mathrm{Var}(X_N)
\end{pmatrix}.$$


Let $w$ be a vector of weights:
$$w = \begin{pmatrix} w_1 \\ \vdots \\ w_N \end{pmatrix}, \qquad
w^T X = \sum_{i=1}^{N} w_i X_i.$$
Then
$$E(w^T X) = w^T E(X) = \sum_{i=1}^{N} w_i E(X_i).$$


$$\mathrm{Var}(w^T X) = \sum_{i=1}^{N}\sum_{j=1}^{N} w_i w_j\,\mathrm{Cov}(X_i, X_j) = w^T\,\mathrm{COV}(X)\,w.$$


Example: Suppose that $X = (X_1\ X_2\ X_3)^T$, $\mathrm{Var}(X_1) = 2$, $\mathrm{Var}(X_2) = 3$, $\mathrm{Var}(X_3) = 5$, $\rho_{X_1, X_2} = .6$, and that $X_1$ and $X_2$ are independent of $X_3$. Find $\mathrm{Var}(X_1 + X_2 + \tfrac{1}{2}X_3)$.

Answer: The covariance between $X_1$ and $X_3$ is 0 by independence; the same is true of $X_2$ and $X_3$. The covariance between $X_1$ and $X_2$ is $(.6)\sqrt{(2)(3)} = 1.47$.


Therefore,
$$\mathrm{COV}(X) = \begin{pmatrix} 2 & 1.47 & 0 \\ 1.47 & 3 & 0 \\ 0 & 0 & 5 \end{pmatrix},$$
and
$$\mathrm{Var}(X_1 + X_2 + X_3/2)
= \begin{pmatrix} 1 & 1 & \tfrac{1}{2} \end{pmatrix}
\begin{pmatrix} 2 & 1.47 & 0 \\ 1.47 & 3 & 0 \\ 0 & 0 & 5 \end{pmatrix}
\begin{pmatrix} 1 \\ 1 \\ \tfrac{1}{2} \end{pmatrix}
= \begin{pmatrix} 1 & 1 & \tfrac{1}{2} \end{pmatrix}
\begin{pmatrix} 3.47 \\ 4.47 \\ 2.5 \end{pmatrix}
= 9.19.$$
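The quadratic form $w^T\,\mathrm{COV}(X)\,w$ is easy to check numerically. A minimal NumPy sketch:

```python
import numpy as np

cov = np.array([[2.0, 1.47, 0.0],
                [1.47, 3.0, 0.0],
                [0.0,  0.0, 5.0]])
w = np.array([1.0, 1.0, 0.5])

var = w @ cov @ w     # w^T COV(X) w
print(var)            # 9.19
```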


Let $w_1$ and $w_2$ be two weight vectors. Then
$$\mathrm{Cov}(w_1^T X, w_2^T X) = w_1^T\,\mathrm{COV}(X)\,w_2.$$
If $X_1, \ldots, X_N$ are independent, or at least uncorrelated, then
$$\mathrm{Var}(w^T X) = \sum_{i=1}^{N} w_i^2\,\mathrm{Var}(X_i).$$


Example continued: Suppose, as in the previous example, that $X = (X_1\ X_2\ X_3)^T$, $\mathrm{Var}(X_1) = 2$, $\mathrm{Var}(X_2) = 3$, $\mathrm{Var}(X_3) = 5$, $\rho_{X_1, X_2} = .6$, and that $X_1$ and $X_2$ are independent of $X_3$. Find the covariance between $(X_1 + X_2 + X_3)/3$ and $(X_1 + X_2)/2$.

Answer: Let
$$w_1 = \begin{pmatrix} \tfrac{1}{3} \\ \tfrac{1}{3} \\ \tfrac{1}{3} \end{pmatrix}
\quad \text{and} \quad
w_2 = \begin{pmatrix} \tfrac{1}{2} \\ \tfrac{1}{2} \\ 0 \end{pmatrix}.$$


Then
$$\mathrm{Cov}\!\left(\frac{X_1 + X_2 + X_3}{3}, \frac{X_1 + X_2}{2}\right)
= w_1^T\,\mathrm{COV}(X)\,w_2
= \begin{pmatrix} \tfrac{1}{3} & \tfrac{1}{3} & \tfrac{1}{3} \end{pmatrix}
\begin{pmatrix} 2 & 1.47 & 0 \\ 1.47 & 3 & 0 \\ 0 & 0 & 5 \end{pmatrix}
\begin{pmatrix} \tfrac{1}{2} \\ \tfrac{1}{2} \\ 0 \end{pmatrix}
= \begin{pmatrix} 1.157 & 1.490 & 1.667 \end{pmatrix}
\begin{pmatrix} \tfrac{1}{2} \\ \tfrac{1}{2} \\ 0 \end{pmatrix}
= 1.323.$$
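This can also be verified numerically; a short NumPy sketch reusing the covariance matrix from the previous example:

```python
import numpy as np

cov = np.array([[2.0, 1.47, 0.0],
                [1.47, 3.0, 0.0],
                [0.0,  0.0, 5.0]])
w1 = np.array([1/3, 1/3, 1/3])
w2 = np.array([1/2, 1/2, 0.0])

print(w1 @ cov @ w2)   # approximately 1.323
```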


Important fact: If $X$ has a multivariate normal distribution, then $w^T X$ is a normal random variable.


Example: Suppose that $E(X_1) = 1$, $E(X_2) = 1.5$, $\sigma_{X_1}^2 = 1$, $\sigma_{X_2}^2 = 2$, and $\mathrm{Cov}(X_1, X_2) = .5$. Find $E(.3X_1 + .7X_2)$ and $\mathrm{Var}(.3X_1 + .7X_2)$. If $(X_1\ X_2)^T$ is bivariate normal, find $P(.3X_1 + .7X_2 < 2)$.

Answers:
$$E(.3X_1 + .7X_2) = (.3)(1) + (.7)(1.5) = 1.35$$
$$\mathrm{Var}(.3X_1 + .7X_2) = (.3)^2(1) + (.7)^2(2) + (2)(.3)(.7)(.5) = 1.28$$
$$P(.3X_1 + .7X_2 < 2) = \Phi\{(2 - 1.35)/\sqrt{1.28}\} = \Phi(.5745) = .7172$$
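These numbers can be reproduced in a few lines of Python (a minimal sketch using NumPy and SciPy):

```python
import numpy as np
from scipy.stats import norm

mean = np.array([1.0, 1.5])
cov = np.array([[1.0, 0.5],
                [0.5, 2.0]])
w = np.array([0.3, 0.7])

m = w @ mean            # E(.3 X1 + .7 X2) = 1.35
v = w @ cov @ w         # Var(.3 X1 + .7 X2) = 1.28
p = norm.cdf(2, loc=m, scale=np.sqrt(v))   # P(.3 X1 + .7 X2 < 2)
print(m, v, round(p, 4))                   # 1.35 1.28 0.7172
```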


Hypothesis Testing

Example: $H_0: \mu = 1$ versus $H_1: \mu \ne 1$.

The rejection region is the set of possible samples that lead us to reject $H_0$; for example, reject $H_0$ if $|\bar{X} - 1|$ exceeds a cutoff $c$.

A Type I error occurs when the null hypothesis is true but we reject it. A Type II error occurs when the null hypothesis is false and we accept it.


The rejection region is chosen so that P(Type I error) is below a pre-specified value $\alpha$, called the level of the test. Typical values of $\alpha$ used in practice are .01, .05, or .1. As $\alpha$ is made smaller, the rejection region must be made smaller.


p-values

The p-value for a sample is defined as the smallest value of $\alpha$ for which the null hypothesis is rejected for that sample. To do the test, we find (typically using statistical software) the p-value of the sample. $H_0$ is rejected if we decide to use an $\alpha$ larger than the p-value; for example, if $p = .005$ and $\alpha = .01$, then reject $H_0$. $H_0$ is accepted if we use an $\alpha$ smaller than the p-value; for example, if $p = .03$ and $\alpha = .01$, then accept $H_0$.


Thus a small p-value is evidence against the null hypothesis, while a large p-value shows that the data are consistent with the null hypothesis.


Maximum Likelihood Estimation

$Y = (Y_1, \ldots, Y_n)^T$ is the vector of data. $\theta = (\theta_1, \ldots, \theta_p)^T$ is the vector of parameters. $f(y; \theta)$ is the density of $Y$, which depends on $\theta$.


Example: Suppose that $Y_1, \ldots, Y_n$ are IID $N(\mu, \sigma^2)$. Then $\theta = (\mu, \sigma^2)$ and
$$f(y; \theta) = \prod_{i=1}^{n} \frac{1}{\sigma}\,\phi\!\left(\frac{Y_i - \mu}{\sigma}\right)
= \prod_{i=1}^{n} \frac{1}{(2\pi)^{1/2}\sigma} \exp\!\left\{-\frac{1}{2\sigma^2}(Y_i - \mu)^2\right\}
= \frac{1}{(2\pi)^{n/2}\sigma^n} \exp\!\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(Y_i - \mu)^2\right\}.$$


$L(\theta) := f(Y; \theta)$ is the likelihood function. The maximum likelihood estimator (MLE) is the value of $\theta$ that maximizes $L(\theta)$; denote the MLE by $\hat{\theta}_{ML}$. Often it is mathematically easier to maximize $\log\{L(\theta)\}$.


Example: In the example, $\hat{\mu}_{ML} = \bar{Y}$:
$$\log\{L(\mu, \sigma^2)\} = \log\left[\frac{1}{(2\pi)^{n/2}\sigma^n} \exp\!\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(Y_i - \mu)^2\right\}\right]
= -n\left\{\log(\sigma) + \frac{1}{2}\log(2\pi)\right\} - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(Y_i - \mu)^2.$$
The MLE of $\mu$ minimizes $\sum_{i=1}^{n}(Y_i - \mu)^2$. Setting the derivative to zero,
$$0 = \sum_{i=1}^{n}(Y_i - \mu) \quad \Longrightarrow \quad \sum_{i=1}^{n} Y_i = n\mu \quad \Longrightarrow \quad \hat{\mu} = \bar{Y}.$$


With $\mu$ fixed at $\bar{Y}$, the MLE of $\sigma^2$ solves
$$0 = \frac{d}{d\sigma^2} \log\{L(\bar{Y}, \sigma^2)\} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n}(Y_i - \bar{Y})^2.$$
The solution is
$$\hat{\sigma}^2_{ML} = \frac{1}{n}\sum_{i=1}^{n}(Y_i - \bar{Y})^2.$$
$\hat{\sigma}^2_{ML}$ has a small bias; the bias-corrected MLE is the so-called sample variance
$$s_Y^2 = \frac{1}{n-1}\sum_{i=1}^{n}(Y_i - \bar{Y})^2.$$
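In NumPy the two estimators correspond to the `ddof` argument; a small illustrative sketch with made-up data:

```python
import numpy as np

y = np.array([1.2, -0.4, 0.8, 2.1, 0.3])  # hypothetical sample

sigma2_mle = np.var(y, ddof=0)   # (1/n)     * sum((y - ybar)^2), the MLE
s2_y       = np.var(y, ddof=1)   # (1/(n-1)) * sum((y - ybar)^2), bias-corrected
print(sigma2_mle, s2_y)
```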


In this textbook example there is an explicit formula for the MLE. With more complex models, there is no explicit formula. Rather, one writes a program to compute $\log\{L(\theta)\}$ for any $\theta$ and uses optimization software to maximize this function numerically; a sketch of this approach is shown below. For some models, such as the ARIMA time series models, there are software packages, e.g., MINITAB and SAS, that compute the MLE.
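For instance, the normal-model MLE above can be reproduced by numerical maximization of the log-likelihood. This is only a minimal sketch with SciPy and made-up data; in practice the analytic answers are $\bar{Y}$ and $\hat{\sigma}^2_{ML}$.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

y = np.array([1.2, -0.4, 0.8, 2.1, 0.3])   # hypothetical data

def neg_log_lik(theta):
    mu, log_sigma = theta                   # optimize log(sigma) so that sigma > 0
    return -np.sum(norm.logpdf(y, loc=mu, scale=np.exp(log_sigma)))

fit = minimize(neg_log_lik, x0=[0.0, 0.0])  # numerical maximization of log L
mu_hat, sigma_hat = fit.x[0], np.exp(fit.x[1])
print(mu_hat, sigma_hat**2)                 # close to y.mean() and np.var(y, ddof=0)
```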


Likelihood Ratio Tests (LRTs)

LRTs are a convenient, all-purpose tool. Let
$$\theta = \begin{pmatrix} \theta_1 \\ \theta_2 \end{pmatrix}.$$
We want to test a hypothesis about $\theta_1$ without making any hypothesis about the value of $\theta_2$. Example: we want to test that the population mean $\mu$ is zero; then $\theta_1 = \mu$ and $\theta_2 = \sigma^2$. Let $\theta_{1,0}$ be the hypothesized value of $\theta_1$. Example: $\theta_{1,0} = 0$ if we want to test that $\mu$ is zero.


The hypotheses are $H_0: \theta_1 = \theta_{1,0}$ and $H_1: \theta_1 \ne \theta_{1,0}$.

Neither hypothesis says anything about $\theta_2$.


Example: testing that $\mu$ is zero, the hypotheses are $H_0: \mu = 0$ and $H_1: \mu \ne 0$. Neither hypothesis specifies anything about $\sigma$. Let $\hat{\theta}_{ML}$ be the maximum likelihood estimator, and let $\hat{\theta}_{2,0}$ be the value of $\theta_2$ that maximizes $L(\theta)$ when $\theta_1 = \theta_{1,0}$.


The likelihood ratio test rejects $H_0$ if
$$2\log\left\{\frac{L(\hat{\theta}_{ML})}{L(\theta_{1,0}, \hat{\theta}_{2,0})}\right\}
= 2\left[\log\{L(\hat{\theta}_{ML})\} - \log\{L(\theta_{1,0}, \hat{\theta}_{2,0})\}\right]
\ge \chi^2_{\alpha,\,\dim(\theta_1)},$$
where $\dim(\theta_1)$ is the number of components of $\theta_1$ and $\chi^2_{\alpha,k}$ is the upper-$\alpha$ probability value of the chi-squared distribution with $k$ degrees of freedom. In other words, $\chi^2_{\alpha,k}$ is the $(1-\alpha)$ quantile, so that the probability above $\chi^2_{\alpha,k}$ is $\alpha$.


Example: $Y_1, \ldots, Y_n$ are IID $N(\mu, \sigma^2)$ and $\theta = (\mu, \sigma^2)$. We want to test that $\mu$ is zero.
$$\log(L) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(Y_i - \mu)^2.$$
$\log(L)$ at the MLE is
$$\log\{L(\bar{Y}, \hat{\sigma}^2_{ML})\} = -\frac{n}{2}\left\{1 + \log(2\pi) + \log(\hat{\sigma}^2_{ML})\right\}.$$
The value of $\sigma^2$ that maximizes $L$ when $\mu = 0$ is
$$\hat{\sigma}^2_0 = \frac{1}{n}\sum_{i=1}^{n} Y_i^2. \quad \text{(Exercise: check)}$$


Therefore,
$$2\left[\log\{L(\bar{Y}, \hat{\sigma}^2_{ML})\} - \log\{L(0, \hat{\sigma}^2_0)\}\right]
= n\left\{\log(\hat{\sigma}^2_0) - \log(\hat{\sigma}^2_{ML})\right\}
= n\log\!\left(\frac{\hat{\sigma}^2_0}{\hat{\sigma}^2_{ML}}\right)
= n\log\!\left\{\frac{\sum_{i=1}^{n} Y_i^2}{\sum_{i=1}^{n}(Y_i - \bar{Y})^2}\right\}.$$


The likelihood ratio test rejects $H_0$ if
$$n\log\!\left\{\frac{\sum_{i=1}^{n} Y_i^2}{\sum_{i=1}^{n}(Y_i - \bar{Y})^2}\right\} > \chi^2_{\alpha,1}. \qquad (2)$$
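The test statistic in (2) is easy to compute directly. A minimal Python sketch on made-up data; `chi2.ppf(1 - alpha, df=1)` gives the critical value $\chi^2_{\alpha,1}$:

```python
import numpy as np
from scipy.stats import chi2

y = np.array([0.9, -0.2, 1.4, 0.6, 1.1, 0.3])  # hypothetical sample
alpha = 0.05
n = len(y)

lrt_stat = n * np.log(np.sum(y**2) / np.sum((y - y.mean())**2))
critical = chi2.ppf(1 - alpha, df=1)            # upper-alpha chi-squared value

print(lrt_stat, critical, lrt_stat > critical)  # reject H0: mu = 0 if True
```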

To appreciate why (2) is a reasonable test, first consider the case $\mu = 0$. Simple algebra shows that
$$\sum_{i=1}^{n} Y_i^2 = \sum_{i=1}^{n}(Y_i - \bar{Y})^2 + n(\bar{Y})^2.$$
$\bar{Y}$ will be close to $\mu = 0$, so the fraction inside the log will be close to 1. The log of 1 is 0, so the left-hand side of (2) will be small and we do not reject the null (the right decision).


If $\mu$ is not 0, then $\bar{Y} \approx \mu \ne 0$ and the left-hand side of (2) will be large, so that we reject the null (the correct decision).
