1. For a given loss function $L$, the risk $R$ is given by the expected loss $R(f) = E[L(Y, f(X))]$. For the squared loss $L(y, f(x)) = (y - f(x))^2$, the risk conditional on $X = x$ decomposes as
\begin{align*}
E[(Y - f(x))^2 \mid X = x] &= E[Y^2 \mid X = x] - 2 f(x) E[Y \mid X = x] + f(x)^2 \\
&= \mathrm{Var}[Y \mid X = x] + (E[Y \mid X = x] - f(x))^2,
\end{align*}
which is minimized by the conditional mean
\[
f(x) = E[Y \mid X = x].
\]
For the absolute loss $L(y, f(x)) = |y - f(x)|$, the conditional risk is minimized when $P(Y < f(x) \mid X = x) = P(Y > f(x) \mid X = x) = 1/2$, i.e. at the conditional median $f(x) = \mathrm{med}[Y \mid X = x]$.
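As a quick numerical check (not part of the original solution), the following R sketch evaluates both risks over a grid of candidate predictions on simulated data; the minimizers should land near the sample mean and sample median respectively:

# R sketch: the mean minimizes average squared loss,
# the median minimizes average absolute loss.
set.seed(1)
y <- rexp(1e5)                     # a skewed stand-in for Y | X = x
f <- seq(0, 3, by = 0.01)          # candidate predictions f(x)
sq_risk  <- sapply(f, function(a) mean((y - a)^2))
abs_risk <- sapply(f, function(a) mean(abs(y - a)))
f[which.min(sq_risk)]              # near mean(y) = 1
f[which.min(abs_risk)]             # near median(y) = log(2), about 0.69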
2. Let $(x_i, y_i)_{i=1}^n$ be our dataset, with $x_i \in \mathbb{R}^p$ and $y_i \in \mathbb{R}$. Linear regression can be formulated as empirical risk minimization, where the model is to predict $y$ as $x^\top \theta$, and we use the squared loss:
\[
R^{\mathrm{emp}}(\theta) = \sum_{i=1}^n \frac{1}{2} \left( y_i - x_i^\top \theta \right)^2
\]
(a) Show that the optimal parameter is $\hat{\theta} = (X^\top X)^{-1} X^\top Y$, where $X \in \mathbb{R}^{n \times p}$ has rows $x_i^\top$ and $Y = (y_1, \ldots, y_n)^\top$.
Answer: Setting the gradient of $R^{\mathrm{emp}}$ to zero,
\begin{align*}
(X\theta - Y)^\top X &= 0 \\
\theta^\top (X^\top X) - Y^\top X &= 0 \\
\hat{\theta} &= (X^\top X)^{-1} X^\top Y
\end{align*}
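A minimal R sketch (not from the original solutions; the data here are made up for illustration) checking the closed-form estimator against R's built-in lm():

# Hypothetical data, for illustration only
set.seed(42)
n <- 100; p <- 3
X <- matrix(rnorm(n * p), n, p)
theta_true <- c(1, -2, 0.5)
y <- X %*% theta_true + rnorm(n, sd = 0.1)

# Closed-form OLS: solve (X'X) theta = X'y
theta_hat <- solve(crossprod(X), crossprod(X, y))

# Should match lm() fit without an intercept
coef(lm(y ~ X - 1))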
(b) Consider regularizing our empirical risk by incorporating an L2 regularizer. That is, find $\hat{\theta}$ minimizing
\[
\frac{\lambda}{2} \|\theta\|_2^2 + \sum_{i=1}^n \frac{1}{2} \left( y_i - x_i^\top \theta \right)^2
\]
Show that the optimal parameter is given by the ridge regression estimator $\hat{\theta} = (\lambda I + X^\top X)^{-1} X^\top Y$.
Answer: Setting the gradient to zero,
\begin{align*}
(X\theta - Y)^\top X + \lambda \theta^\top &= 0 \\
\theta^\top (\lambda I + X^\top X) - Y^\top X &= 0 \\
\hat{\theta} &= (\lambda I + X^\top X)^{-1} X^\top Y
\end{align*}
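Continuing the sketch above (same made-up X and y), the ridge estimator is one extra line, and the shrinkage effect can be checked directly:

lambda <- 0.1
theta_ridge <- solve(lambda * diag(ncol(X)) + crossprod(X), crossprod(X, y))

# Ridge shrinks: its L2 norm is no larger than the OLS norm
sum(theta_ridge^2) <= sum(theta_hat^2)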
(c) Compare the regularized and unregularized estimators of $\theta$ for $n = 10$ samples generated from the following model:
\[
y \sim N(X\theta, \sigma^2) \tag{1}
\]
use $\sigma = 0.1$. Which estimator has smaller L2 norm? Investigate the bias/variance tradeoff across values of $\lambda \in [0, 10]$ as follows. Generate new datasets for training and testing and make a learning curve plot comparing training and testing error like the one shown in lecture on slide 13 of Lecture 2. Put $\lambda$ on the x-axis and mean squared error (prediction error) on the y-axis. Label the points on the plot corresponding to unregularized linear regression ($\lambda = 0$).
Answer: See the R markdown document: http://www.stats.ox.ac.uk/flaxman/sml17/problem_2c_on_regularization.html.
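The full solution is in the linked document; the following is only a rough sketch of the experiment (settings such as theta_true and p are my own illustrative choices):

# Sketch of the lambda vs. MSE learning curve experiment
set.seed(1)
n <- 10; p <- 3
theta_true <- rnorm(p)
make_data <- function() {
  X <- matrix(rnorm(n * p), n, p)
  list(X = X, y = X %*% theta_true + rnorm(n, sd = 0.1))
}
tr <- make_data(); te <- make_data()
mse <- function(d, th) mean((d$y - d$X %*% th)^2)
lambdas <- seq(0, 10, by = 0.1)
err <- t(sapply(lambdas, function(l) {
  th <- solve(l * diag(p) + crossprod(tr$X), crossprod(tr$X, tr$y))
  c(train = mse(tr, th), test = mse(te, th))
}))
matplot(lambdas, err, type = "l", lty = 1, col = 1:2,
        xlab = "lambda", ylab = "mean squared error")
legend("topright", c("train", "test"), col = 1:2, lty = 1)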
3. Consider two univariate normal distributions $N(\mu, \sigma^2)$ with known parameters $\mu_A = 10$ and $\sigma_A = 5$ for class A and $\mu_B = 20$ and $\sigma_B = 5$ for class B. Suppose class A represents the random score $X$ of a medical test of normal patients and class B represents the score of patients with a certain disease. A priori there are 100 times more healthy patients than patients carrying the disease.
(a) Find the optimal decision rule in terms of misclassification error (0-1 loss) for allocating a new
observation x to either class A or B.
Answer: The optimal decision for $X = x$ is to allocate to class
\[
\operatorname{argmax}_{k \in \{A, B\}} \pi_k g_k(x),
\]
where $\pi_k$ is the prior probability and $g_k$ the density of class $k$. Taking logs and using $\sigma_A = \sigma_B = \sigma$, the decision boundary satisfies
\[
2x(\mu_B - \mu_A) + \mu_A^2 - \mu_B^2 - 2\sigma^2 \log(\pi_A / \pi_B) = 0.
\]
For the given values, this implies that the decision boundary is at $x \approx 26.51$; that is, all patients with a test score above 26.51 are classified as having the disease.
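This value is easy to verify numerically; a small R check (illustrative, not part of the original answer):

mu_A <- 10; mu_B <- 20; sigma <- 5
prior_ratio <- 100   # pi_A / pi_B
# Solve 2x(mu_B - mu_A) + mu_A^2 - mu_B^2 - 2 sigma^2 log(pi_A/pi_B) = 0 for x
x_star <- (mu_B^2 - mu_A^2 + 2 * sigma^2 * log(prior_ratio)) / (2 * (mu_B - mu_A))
x_star   # about 26.51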
(b) Repeat (a) if the cost of a false negative (allocating a sick patient to group A) is $c > 1$ times that of a false positive (allocating a healthy person to group B). Describe how the rule changes as $c$ increases. For which value of $c$ are 84.1% of all patients with disease correctly classified?
Answer: The optimal decision minimizes $E[L(Y, f(x)) \mid X = x]$. It is hence optimal to choose class A (healthy) over class B if and only if
\[
\pi_A \frac{1}{\sqrt{2\pi\sigma_A^2}} \exp\!\left( -\frac{(x - \mu_A)^2}{2\sigma_A^2} \right) \ge c\, \pi_B \frac{1}{\sqrt{2\pi\sigma_B^2}} \exp\!\left( -\frac{(x - \mu_B)^2}{2\sigma_B^2} \right).
\]
The decision boundary is now attained if $x$ fulfills
\[
2x(\mu_B - \mu_A) + \mu_A^2 - \mu_B^2 - 2\sigma^2 \log(\pi_A / (c\,\pi_B)) = 0.
\]
For increasing values of $c$, the boundary moves toward smaller test scores, so patients with ever smaller scores are classified as having the disease.
84.1% of all patients carrying the disease are correctly classified if the decision boundary is at the 15.9%-quantile of the $N(\mu_B, \sigma_B^2)$ distribution, which is at $q = 20 + 5\,\Phi^{-1}(0.159) \approx 15$. Since $x = 15$ lies halfway between the two means and the variances are equal, the two densities coincide there, so the boundary condition reduces to $\pi_A = c\,\pi_B$, giving $c = \pi_A/\pi_B = 100$.
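Reusing the variables from the sketch above, this value of $c$ can be recovered numerically (again illustrative only):

q <- qnorm(0.159, mean = 20, sd = 5)   # boundary giving 84.1% sensitivity
# Invert the boundary equation
# 2q(mu_B - mu_A) + mu_A^2 - mu_B^2 = 2 sigma^2 log(pi_A / (c pi_B)) for c
c_star <- prior_ratio / exp((2 * q * (mu_B - mu_A) + mu_A^2 - mu_B^2) / (2 * sigma^2))
c_star   # approximately 100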
We could also consider $\tilde{\theta}$ corresponding to test data:
\[
\tilde{\theta} = \operatorname{argmin}_\theta R^{\mathrm{te}}(\theta) = \operatorname{argmin}_\theta \frac{1}{M} \sum_{i=1}^M \left( y_i - x_i^\top \theta \right)^2
\]
From this statement we see that $R^{\mathrm{te}}(\tilde{\theta}) \le R^{\mathrm{te}}(\hat{\theta})$ for any particular test dataset $(x_i, y_i)_{1 \le i \le M}$. If we now consider the expectation over randomly drawn test datasets we have:
\[
E[R^{\mathrm{te}}(\tilde{\theta})] \le E[R^{\mathrm{te}}(\hat{\theta})]. \tag{2}
\]
Since the training and test sets are drawn from the same distribution, $E[R^{\mathrm{tr}}(\hat{\theta})] = E[R^{\mathrm{te}}(\tilde{\theta})]$, and thus we see:
\[
E[R^{\mathrm{tr}}(\hat{\theta})] = E[R^{\mathrm{te}}(\tilde{\theta})] \le E[R^{\mathrm{te}}(\hat{\theta})]
\]
by Eq. 2.
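A quick simulation consistent with this inequality (made-up settings, not from the original answer):

# Average training vs. test MSE of OLS across many replications
set.seed(2)
n <- 20; p <- 5; reps <- 2000
errs <- replicate(reps, {
  Xtr <- matrix(rnorm(n * p), n, p); ytr <- Xtr %*% rep(1, p) + rnorm(n)
  Xte <- matrix(rnorm(n * p), n, p); yte <- Xte %*% rep(1, p) + rnorm(n)
  th <- solve(crossprod(Xtr), crossprod(Xtr, ytr))
  c(train = mean((ytr - Xtr %*% th)^2), test = mean((yte - Xte %*% th)^2))
})
rowMeans(errs)   # average training error should be below average test error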