
Problem Set 2

1. The ytrue vectors correspond to the ideal y values, generated directly from the “true” model. In contrast, the
ynoisy vectors are the actual, noisy observations, generated by adding Gaussian noise to the ytrue vectors. You
should use ynoisy for any estimation. ytrue is provided only to make it easier to evaluate the error in your
predictions (it simulates an infinite test set). You would not have ytrue in any real task.

(a) Write MATLAB functions theta = linear_regress(y,X) and y_hat = linear_pred(theta,X_test).
(b) The feature mapping can substantially affect the regression results. We will consider two possible feature
mappings:

φ1(x1, x2, x3) = [1, x1, x2, x3]^T
φ2(x1, x2, x3) = [1, log x1^2, log x2^2, log x3^2]^T

Use the provided MATLAB function feature_mapping to transform the input data matrix into a matrix

X = [ φ(x1)^T
      φ(x2)^T
      ...
      φ(xn)^T ]

For example, X = feature_mapping(X_in,1) would get you the first feature representation. Using
your completed linear regression functions, compute the mean squared prediction error for each
feature mapping (2 numbers).

(c) The selection of points to query in an active learning framework might depend on the feature
representation. We will use the same selection criterion as in the lectures, the expected squared error in
the parameters, proportional to Tr[(X^T X)^{-1}]. Write a MATLAB function idx = active_learn(X,k1,k2).
Your function should assume that the top k1 rows in X have been queried and your goal is to
sequentially find the indices of the next k2 points to query. The final set of k1 + k2 indices should be
returned in idx. The latter may contain repeated entries. For each feature mapping, and k1 = 5
and k2 = 10, compute the set of points that should be queried (i.e., X(idx,:)). For each set of
points, use the feature mapping φ2 to perform regression and compute the resulting mean squared
prediction errors (MSE) over the entire data set (again, using φ2).

(d) Let us repeat the steps of part (c) with randomly selected additional points to query. We have
provided a MATLAB function idx = randomly_select(X,k1,k2) which is essentially the same
as active_learn except that it selects the k2 points uniformly at random from X. Repeat the
regression steps as in the previous part, and compute the resulting mean squared prediction error again.
To get a reasonable comparison you should repeat this process 50 times, and use the median MSE.
Compare the resulting errors with the active learning strategies. What conclusions can you draw?
(e) Let us now compare the two sets of points chosen by active learning due to the different feature
representations. We have provided a function plot_points(X,idx_r,idx_b) which will plot each
row of X as a point in R^3. The points indexed by idx_r will be circled in red and those indexed by
idx_b will be circled (larger) in blue. Can you see evidence of the effect of the feature representation
in the two plots?

2. We derived a kernel version of linear regression in class using the regularized least squares criterion. The
resulting predictions were given by

y(x) = Σ_{t=1}^{n} (α̂_t/λ) K(x_t, x)

where the optimal setting of the coefficients (prediction differences) α̂_t was given in vector form by

α̂ = λ(λI + K)^{-1} y

Here K is the Gram matrix based on the available data points x1, . . . , xn and y = [y1, . . . , yn]^T. We are
interested in exploring how the regularization parameter λ ∈ [0, ∞) affects the solution when the kernel
function is the radial basis kernel

K(x, x') = exp( −(β/2) ‖x − x'‖^2 ),   β > 0

a) Let’s see first that the limit λ → 0 (no regularization) makes sense. In this case we need the Gram
matrix K to be invertible. Indicate why the Gram matrix corresponding to the radial basis kernel
is always invertible based on the following Michelli theorem:
If ρ(t) is a monotonic function of t ∈ [0, ∞) then the matrix ρ_ij = ρ(‖x_i − x_j‖) is invertible for
any distinct set of points x1, . . . , xn.
b) How do we make predictions when λ → 0? In other words, what is the function y(x) in this limit?
c) Show that the corresponding training error is exactly zero.
d) Setting λ = 0 (no regularization) seems hardly optimal. Implement the kernel linear regression
method above for λ > 0. We have provided training and test data as well as helpful MATLAB
scripts in hw2/prob2. You should only need to complete the relevant lines in the run_prob2 script.
The data pertain to the problem of predicting Boston housing prices based on various indicators
(normalized). Evaluate and plot the training and test errors (mean squared errors) as a function
of λ in the range λ ∈ (0, 1). Use β = 0.05. Explain the qualitative behavior of the two curves.

3. Most linear classifiers can be turned into a kernel form. We will focus here on the simple perceptron
algorithm and use the resulting kernel version to classify data that are not linearly separable.
(a) First we need to turn the perceptron algorithm into a form that involves only inner products between
the feature vectors. We will focus on hyperplanes through the origin in the feature space (any offset
component is provided as part of the feature vectors). The mistake-driven parameter updates are:
θ ← θ + y_t φ(x_t) if y_t θ^T φ(x_t) < 0, where θ = 0 initially. Show that we can rewrite the perceptron
updates in terms of simple additive updates on the discriminant function f(x) = θ^T φ(x):

f(x) ← f(x) + y_t K(x_t, x) if y_t f(x_t) < 0

where K(x_t, x) = φ(x_t)^T φ(x) is any kernel function and f(x) = 0 initially.
(b) We can replace K(x_t, x) with any kernel function of our choice, such as the radial basis kernel, where
the corresponding feature mapping is infinite dimensional. Using the analysis, show that there
always is a separating hyperplane if we use the radial basis kernel.
(c) With the radial basis kernel we can therefore conclude that the perceptron algorithm will converge
(stop updating) after a finite number of steps for any dataset with distinct points. The resulting
function can therefore be written as

f(x) = Σ_{i=1}^{n} w_i y_i K(x_i, x)

where w_i is the number of times we made a mistake on example x_i. Most of the w_i's are exactly zero,
so our function won't be difficult to handle. The same form holds for any kernel, except that we
can no longer tell whether the w_i's remain bounded (i.e., whether the problem is separable with the
chosen kernel). Implement the new kernel perceptron algorithm in MATLAB using radial basis and
polynomial kernels.
Define functions
alpha = train_kernel_perceptron(X, y, kernel_type) and
f = discriminant_function(alpha, X, kernel_type, X_test)
to train the perceptron and to evaluate the resulting f(x) for test examples, respectively.
(d) Load the data using the load_p3_a script. When you use a polynomial kernel to separate the
classes, what degree polynomials do you need? Draw the decision boundary for the lowest-degree
polynomial kernel that separates the data. Repeat the process for the radial basis kernel. Briefly
discuss your observations.
Problem Set 2: Solutions

1. (a) The optimal parameter values for linear regression, given the matrix of training examples X and the
corresponding response variables y, are:

θ = (X^T X)^{-1} X^T y

The quantity (X^T X)^{-1} X^T is also known as the pseudo-inverse of X, and often occurs when dealing
with linear systems of equations. When X is a square matrix and invertible, it is exactly the same
as the inverse of X.
MATLAB provides many ways of achieving our desired goal. We can directly write out the expression
above or use the function pinv. Here are the functions linear_regress and linear_pred:

function theta = linear_regress(y,X)
% least-squares estimate via the pseudo-inverse of X
theta = pinv(X)*y;

function y_hat = linear_pred(theta,X)
% predictions for the rows of X given the parameters theta
y_hat = X*theta;
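Equivalently, one can write out the expression above directly; for example, using a backslash solve rather than an explicit inverse (an alternative sketch, not the only option):

theta = (X'*X) \ (X'*y);   % normal-equations form of the same estimate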

In linear_regress, note that we are not calculating θ0 separately. This differs from the description
in the lecture notes, where the training examples were explicitly padded with 1's, allowing us to
introduce an offset θ0. Instead, we will use a feature mapping to achieve the same effect.
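The provided feature_mapping implementation is not reproduced in this handout; a minimal sketch consistent with the definitions of φ1 and φ2 in the problem statement might look like the following (illustrative only, not the actual provided code):

function X = feature_mapping(X_in, mapping)
% mapping = 1: phi_1(x) = [1, x1, x2, x3]
% mapping = 2: phi_2(x) = [1, log x1^2, log x2^2, log x3^2]
n = size(X_in,1);
if mapping == 1
    X = [ones(n,1), X_in];
else
    X = [ones(n,1), log(X_in.^2)];
end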
(b) Before we describe the solution, we first describe how the dataset was created. This
may help us appreciate why some feature mappings may work better than others. The x values
were created by sampling each coordinate uniformly at random from (-1,1):

X_in = rand(1000,3)*2 - 1

Given a particular x, the corresponding ytrue and ynoisy values were created as follows:

ytrue = −10 log x1^2 − 15 log x2^2 − 7.5 log x3^2 + 2

ynoisy = ytrue + ε,   ε ∼ N(0, 100)
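In MATLAB, the generation step above corresponds to code along these lines (a sketch of the process just described; the actual generation script is not provided):

X_in    = rand(1000,3)*2 - 1;                              % coordinates uniform on (-1,1)
y_true  = -10*log(X_in(:,1).^2) - 15*log(X_in(:,2).^2) ...
          - 7.5*log(X_in(:,3).^2) + 2;
y_noisy = y_true + 10*randn(1000,1);                       % epsilon ~ N(0,100), i.e. std 10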

To evaluate the performance of linear regression on given training and test sets, we created
the function test_linear_reg, which combines the regression, prediction, and evaluation steps.
We may, of course, do it in some other way:
function err = test_linear_reg(y_train, X_train, y_test, X_test)
% train linear regression using X_train and y_train
% evaluate the mean squared prediction error on X_test and y_test

theta = linear_regress(y_train, X_train);

y_hat = linear_pred(theta,X_test);

yd = y_hat - y_test;

err = sum(yd.^2)/length(yd); %err = Mean Squared Prediction Error

Using this, we can now calculate the mean squared prediction error (MSPE) for the two feature
mappings:

>> X1 = feature_mapping(X_in,1);
>> test_linear_reg(y_noisy, X1, y_true, X1)
ans = 1.5736e+003
>> X2 = feature_mapping(X_in,2);
>> test_linear_reg(y_noisy, X2, y_true, X2)
ans = 0.5226

The two sets of errors are 1573.6 (φ1 ) and 0.5226 (φ2 ), respectively. Unsurprisingly, the mapping
φ2 performs much better than φ1 – it is exactly the space in which the relationship between X and
y is linear.
(c) The desired quantity we need to maximize is

(v^T A A v) / (1 + v^T A v)

where A = (X^T X)^{-1}. In the notes, an offset parameter is explicitly assumed, so that v = [x^T, 1]^T.
However, in our case this is not necessary and so v = x.
Active learning may, in general, select the same point x to be observed repeatedly. Each of
these observations corresponds to a different ynoisy. However, due to practical limitations, we had
supplied you with only one set of ynoisy values. Thus, if some xi occurs repeatedly in idx, you will
need to use the same ynoisy,i for each occurrence of xi. Alternatively, you could change your code
so as to disallow repetitions in idx. This is the option we have chosen here.
Given any feature space, the criterion above will aim to find points that are far apart in that space. However,
these points may not be far apart in the feature space where the regression actually occurs. This is
of particular concern when the latter feature space might not be easily accessible (e.g., when using
a kernel like the radial basis function kernel).
Here’s the active learning code:

function idx = active_learn(X,k1,k2)

idx = 1:k1;
N = size(X,1);
for i=1:k2
    var_reduction = zeros(N,1);
    X1 = X(idx,:);
    A = inv(X1'*X1);
    AA = A*A;
    for j=1:N
        if ismember(j,idx)   % this is where we disallow repetitions
            continue;
        end
        v = X(j,:);
        a = v*AA*v' / (1 + (v*A*v'));
        var_reduction(j) = a;
    end
    [a, aidx] = max(var_reduction);
    idx(end+1) = aidx;
end
Using it, we can now compute the desired quantities:

>> idx1 = active_learn(X1,5,10)
idx1 = 1 2 3 4 5 437 270 928 689 803 670 979 932 558 490
>> test_linear_reg(y_noisy(idx1), X2(idx1,:), y_true, X2)
ans = 2.5734e+003
>> idx2 = active_learn(X2,5,10)
idx2 = 1 2 3 4 5 955 803 270 558 628 490 283 592 761 414
>> test_linear_reg(y_noisy(idx2), X2(idx2,:), y_true, X2)
ans = 24.4643

Thus, active learning with φ1 chooses points that are not well placed (for this particular dataset),
and the resulting MSPE is large; φ2 performs much better.
The answers change when repetitions are allowed (MSPE for φ1 = 117.47; for φ2 = 25.25), but they
still illustrate the same point.

(d) This error will vary, depending upon the number of iterations you perform and the random points
selected. In my simulations, the value of the error (using mapping φ2 for regression and evaluation) was
33.694. When the simulation was run for 1000 runs, the error was close to 25.34. Clearly, it seems
to be much better to perform random sampling than to perform active learning in φ1's space. This may
seem surprising at first, but is completely understandable: in the space where regression is
performed (φ2), the points chosen by performing active learning in the φ1 space are not far apart at
all, and are thus particularly bad points to sample.
For completeness' sake, the answers for the original version of the problem are: φ1: 2218.9, φ2: 33.694.
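A minimal sketch of the repetition loop, assuming the provided randomly_select function and the test_linear_reg helper from part (b) (the 50 runs and the median follow the problem statement; variable names are illustrative):

n_runs = 50;
errs = zeros(n_runs,1);
for r = 1:n_runs
    idx = randomly_select(X2, 5, 10);                      % 5 given + 10 random points
    errs(r) = test_linear_reg(y_noisy(idx), X2(idx,:), y_true, X2);
end
median_mse = median(errs)                                  % median MSPE over the runs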
(e) The figures are shown in Fig. 1. For clarity's sake, we have only plotted the last 10 points of idx.
In the original feature space (fig. a), the points selected by active learning in the φ1 space are spread
far apart, as expected. However, as part (b) showed, a better fit to the data is obtained by using φ2. In
this space (fig. b), the points selected by active learning in the φ1 space are very closely bunched
together. The points selected by active learning in the φ2 space are, in contrast, spread far apart.
This helps explain why the points selected by active learning in the φ1 space did not lead to good
performance in the regression step.

2. (a) The function f(t) = −βt^2/2 monotonically decreases with t (for β > 0). The function
g(t) = e^t monotonically increases with t. Thus, the composition h(t) = g(f(t)) = e^{−βt^2/2} monotonically
decreases with t. As such, the RBF kernel K(x, x') = e^{−(β/2)‖x−x'‖^2} defines a Gram matrix that
satisfies the conditions of Michelli's theorem and is hence invertible.
(b) Let A = (λI + K)^{-1} y. Then,

lim_{λ→0} A = K^{-1} y

Since K is always invertible, this limit is always well-defined. Now, we have α̂_t = λA_t, where A_t is
the t-th element of A. We then have:
[Figure 1 (for Problem 1(e)): (a) In φ1 space; (b) In φ2 space. The red circles correspond to points chosen by performing active learning on the φ1 space and the blue ones correspond to those chosen by performing active learning on the φ2 space. Only the last 10 points of idx are shown for each; the first 5 points are the same.]

y(x) = Σ_{t=1}^{n} (α̂_t/λ) K(x_t, x), or
y(x) = Σ_{t=1}^{n} (λA_t/λ) K(x_t, x), or
y(x) = Σ_{t=1}^{n} A_t K(x_t, x)

We then have,

lim_{λ→0} y(x) = Σ_{t=1}^{n} (lim_{λ→0} A_t) K(x_t, x), or
lim_{λ→0} y(x) = Σ_{t=1}^{n} B_t K(x_t, x), where B = K^{-1} y

Thus, in the required limit, the function y(x) = Σ_{t=1}^{n} B_t K(x_t, x).


(c) To prove that the training error is zero, we need to prove that y(x_t) = y_t for t = 1, . . . , n.
From part (b), we have

y(x_t) = Σ_{i=1}^{n} B_i K(x_i, x_t), or
y(x_t) = Σ_{i=1}^{n} ( Σ_{j=1}^{n} K^{-1}(i, j) y_j ) K(x_i, x_t), or
y(x_t) = Σ_{j=1}^{n} y_j Σ_{i=1}^{n} K^{-1}(i, j) K(x_i, x_t), or
y(x_t) = Σ_{j=1}^{n} y_j δ(j, t), or
y(x_t) = y_t

where K^{-1}(i, j) is the (i, j)-th entry of K^{-1} and

δ(i, j) = 0 for i ≠ j, 1 for i = j.

Here, we made use of the fact that K(x_i, x_t) = K(x_t, x_i) and that if A = B^{-1} then Σ_i A(t, i) B(i, j) =
δ(t, j).
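As an optional numerical sanity check (not part of the required solution), the exact-fit property in the λ → 0 limit can be verified on a small synthetic dataset; the values of n, d, and β below are illustrative:

% Verify that B = K^{-1} y interpolates the training targets with an RBF kernel.
n = 20; d = 3; beta = 0.05;
X = randn(n, d);
y = randn(n, 1);
sq = sum(X.^2, 2);
D2 = repmat(sq, 1, n) + repmat(sq', n, 1) - 2*(X*X');  % squared pairwise distances
K  = exp(-(beta/2) * D2);                              % radial basis Gram matrix
B  = K \ y;                                            % coefficients in the lambda -> 0 limit
max(abs(K*B - y))                                      % essentially zero (numerical error only)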
(d) Code for this problem is shown below:

Ntrain = size(Xtrain,1);
Ntest = size(Xtest,1);
for i=1:length(lambda),
    lmb = lambda(i);
    % coefficients alpha-hat = lambda*(lambda*I + K)^{-1}*y on the training Gram matrix K
    alpha = lmb * ((lmb*eye(Ntrain) + K)^-1) * Ytrain;
    % predictions y(x) = sum_t (alpha_t/lambda) K(x_t, x) on training and test points
    Atrain = (1/lmb) * repmat( alpha', Ntrain, 1);
    yhat_train = sum(Atrain.*K, 2);
    Atest = (1/lmb) * repmat( alpha', Ntest, 1);
    yhat_test = sum(Atest.*(Ktrain_test'), 2);
    E(i,:) = [mean((yhat_train-Ytrain).^2), mean((yhat_test-Ytest).^2)];
end;

[Figure 2: Training and test error for Problem 2(d), plotted as a function of λ ∈ (0, 1). Legend: Train Error, Test Error.]

The resulting plot is shown in Fig. 2. As can be seen, the training error is zero at λ = 0 and increases
as λ increases. The test error initially decreases, reaches a minimum around λ ≈ 0.1, and then increases
again. This is exactly as we would expect. λ ≈ 0 results in over-fitting (the model is too powerful):
our regression function has low bias but high variance. By increasing λ we constrain the model,
thus increasing the training error. While the regularization increases bias, the variance decreases
faster, and we generalize better. High values of λ result in under-fitting (high bias, low variance),
and both training and test errors are high.

3. (a) Observe that θ is only a sum of the y_t φ(x_t)'s, so we can just store the coefficients:

θ = Σ_{t=1}^{n} w_t y_t φ(x_t).

We can update the w_t's by incrementing w_t by one when a mistake is made on example t. Classifying
new examples means evaluating:

θ^T φ(x) = Σ_{t=1}^{n} w_t y_t K(x_t, x),

which only involves kernel operations.

(b) Use the regression argument: with the radial basis kernel we can fit the training points exactly, so
not only can we achieve the correct sign, but the value of the discriminant function can be made ±1
for every training example.
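In symbols, a brief sketch of the argument, reusing the Gram-matrix invertibility from Problem 2(a): choose c = K^{-1} y and set

f(x) = Σ_{j=1}^{n} c_j K(x_j, x).

Then f(x_i) = Σ_{j=1}^{n} c_j K(x_j, x_i) = y_i = ±1 for every training example, so y_i f(x_i) = 1 > 0 for all i, i.e., the training set is separated in the RBF feature space.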
(c) Here is the solution.

% kernel = '(1 + transpose(xi)*xj)^d';  d = 5;                % polynomial kernel
% kernel = 'exp(-transpose(xi-xj)*(xi-xj)/(2*s^2))';  s = 3;  % radial basis function kernel

function alpha = train_kernel_perceptron(X, y, kernel)
n = size(X,1);
K = zeros(n,n);
% build the Gram matrix by evaluating the kernel string on each pair (xi, xj)
for i = 1:n
    xi = X(i,:)';
    for j = 1:n
        xj = X(j,:)';
        K(i,j) = eval(kernel);
    end
end
alpha = zeros(n,1);
mistakes = 1;
while mistakes > 0
    mistakes = 0;
    for i = 1:n
        if alpha'*K(:,i)*y(i) <= 0      % mistake on example i
            alpha(i) = alpha(i) + y(i);
            mistakes = mistakes + 1;
        end
    end
end

function f = discriminant_function(alpha, X, kernel, X_test)
% evaluates f(x) = sum_i alpha_i K(x_i, x) at a test point X_test (column vector)
n = size(X,1);
K = zeros(n,1);
for i = 1:n
    xi = X(i,:)';
    xj = X_test;
    K(i) = eval(kernel);
end
f = alpha'*K;
(d) The original dataset requires d = 4; the new dataset requires d = 2. An RBF kernel will easily separate
either dataset.

>> load_p3_a
“X” and “y” loaded.
>> kernel = '(1 + transpose(xi)*xj)^2';
>> alpha = train_kernel_perceptron(X, y, kernel);
>> figure
>> hold on
>> plot(X(1:1000,1), X(1:1000,2), 'rs')
>> plot(X(1001:2000,1), X(1001:2000,2), 'bo')
>> plot_dec_boundary(alpha, X, kernel, [-4,-2], [0.5,0.5], [2,4])
[Figure: decision boundary of the degree-2 polynomial kernel perceptron on the two classes.]

>> kernel = 'exp(-transpose(xi-xj)*(xi-xj)/18)';
>> alpha = train_kernel_perceptron(X, y, kernel);
>> figure
>> hold on
>> plot(X(1:1000,1), X(1:1000,2), 'rs')
>> plot(X(1001:2000,1), X(1001:2000,2), 'bo')
>> plot_dec_boundary(alpha, X, kernel, [-4,-2], [0.5,0.5], [2,4])

[Figure: decision boundary of the radial basis kernel perceptron (s = 3) on the two classes.]
