Problem Set 5

1. Wilhelm and Klaus are friends, but Wilhelm only speaks English and Klaus only speaks German. Klaus has sent Wilhelm an urgent message, and Wilhelm is locked in a room with only the proceedings of
the European Parliament (in both German and English) to keep him company. Luckily, Wilhelm is a
machine learning expert.
He decides to treat Klaus as a “noisy channel”: assume that thoughts in Klaus’s head start out in
English, but before they are emitted, they pass through a medium that replaces the English words with
new (German) words and rearranges them from the English word-ordering to a new (German) ordering
all according to some unknown distributions. Next, he plans to come up with parametrizations for the
distributions and then to optimize the parameters over the European Parliament data he is conveniently
locked up with.
Wilhelm thus arrives at the following optimization problem:

$$\arg\max_e P(e \mid g) \;=\; \arg\max_e P(g \mid e)\,\frac{P(e)}{P(g)} \;=\; \arg\max_e P(g \mid e)\,P(e),$$

where e and g denote English and German sentences, respectively. This corresponds to Wilhelm’s
noisy-channel characterization of Klaus’s head.
Next, Wilhelm decides on parameterizations for the two terms in his optimization criterion. To estimate probabilities of English sentences, he will pretend that English is a Markov process:

$$P(e) = \prod_{i=1}^{|e|} LM(e_i \mid e_{i-1}, e_{i-2}),$$

where $|e|$ denotes the length of the English sentence $e$, $e_i$ denotes the $i$th word of $e$, and $e_0$ and $e_{-1}$ correspond to start symbols $S_0$ and $S_{-1}$, respectively, which are prepended to every English sentence for simplicity.
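For concreteness, a hypothetical three-word English sentence e = (the, cat, sat) would be scored under this trigram model as

$$P(e) = LM(\text{the} \mid S_0, S_{-1}) \cdot LM(\text{cat} \mid \text{the}, S_0) \cdot LM(\text{sat} \mid \text{cat}, \text{the}).$$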
For the other term, Wilhelm is aware that both the words and their order may change, so he introduces
a hidden variable: the alignment between the words.
$$P(g, a \mid e) = \prod_{j=1}^{|g|} T(g_j \mid e_{a_j})\, D(a_j \mid j, |e|, |g|)$$

Thus,

$$\begin{aligned}
P(g \mid e) &= \sum_a P(g, a \mid e) \\
&= \sum_a \prod_{j=1}^{|g|} T(g_j \mid e_{a_j})\, D(a_j \mid j, |e|, |g|) \\
&= \prod_{j=1}^{|g|} \sum_{a_j=1}^{|e|} T(g_j \mid e_{a_j})\, D(a_j \mid j, |e|, |g|).
\end{aligned}$$

This set-up is known as IBM Model 2.
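As a concrete illustration of the last identity above, here is a minimal MATLAB sketch that evaluates P(g | e) for one sentence pair. The table layouts (a matrix T indexed by German and English word IDs, and a four-dimensional array D indexed by alignment position, German position, and the two sentence lengths) are assumptions made for this sketch only; they do not match the packed representation used in the provided templates.

% Illustrative sketch only: evaluate the IBM Model 2 likelihood of one sentence pair.
% g, e are vectors of word indices; T(de_word, en_word) and D(i, j, l, m) are
% hypothetical probability tables (layout differs from the provided templates).
function p = model2_likelihood(g, e, T, D)
  m = length(g);                 % German sentence length |g|
  l = length(e);                 % English sentence length |e|
  p = 1;
  for j = 1:m                    % product over German positions
    s = 0;
    for i = 1:l                  % sum over possible alignments a_j
      s = s + T(g(j), e(i)) * D(i, j, l, m);
    end
    p = p * s;
  end
end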

(a) To perform the optimization, Wilhelm decides to use the EM algorithm. Derive the updates of the
parameters LM , T , and D. We place no constraints on these distributions other than that they
are distributions.
(b) The initial parameter settings are important. Give an example of a bad choice. What should you
choose?
(c) Insert the updates and initial parameter settings in the provided templates ibm2 train lm.m and
ibm2 train.m and train the system using the provided data europarl.m. Use the provided decoder
ibm2 beam decoder.m to use the trained system to decode Klaus’s message klaus.m. The system
is able to achieve surprisingly good results with a very simple probability model!
(d) Suppose Wilhelm believes that his model is reordering the German too aggressively. He is considering regularizing the alignments. He has several ideas; what effect do you think each of these regularization techniques will have?

i. Regularize $D(a_j \mid j, |e|, |g|)$ toward a uniform distribution.
ii. Regularize $D(a_j \mid j, |e|, |g|)$ toward the point-mass distribution $D\bigl(a_j = j \cdot \tfrac{|e|}{|g|} \,\bigm|\, j, |e|, |g|\bigr) = 1$, i.e., all probability mass on the alignment position proportional to $j$.
iii. Estimate first a simpler distribution $D(a_j \mid |e|, |g|)$ and regularize $D(a_j \mid j, |e|, |g|)$ toward it.

2. In this problem, we will explore the combined use of regression and Expectation Maximization. The
motivating application is as follows: we’d like to understand how the expression levels of genes (in a
cell) change with time. The expression level of a gene indicates, roughly, the amount of gene-product in
the cell. The experimental data we have available are noisy measurements of this level (for each gene)
at certain time points. More precisely, we have m genes (g1 , . . . , gm ) and measurements are available
for r timepoints (t1 , . . . , tr ). The data matrix is Y = [y1 , . . . , ym ]T where each row yiT = [yi1 , . . . , yir ]
corresponds to the expression levels of gene gi across the r timepoints.
Our goal is to estimate continuous functions, each fˆi (t) capturing the expression level of one gene gi .
We’ll call these expression curves. Clearly, if we treat each gene’s expression as independent from others,
we would need to perform m unrelated regression tasks. We can do better, however, since we expect
that there are groups of genes with similar expression curves.
We hypothesize that the expression curves can be grouped into k classes and that, within each group,
the functions fi (t) look very similar. When estimating fˆi (t) we can pool together data from all the
genes in the same class. We have two problems to solve: (1) figure out which class each gene belongs
to, and (2) given class assignments, estimate the expression curves. As you might guess, we will use the
EM algorithm. However, we still need to understand how to estimate continuous curves from discrete
data-points.
To perform regression, we will use an idea similar to kernel regression. Recall that even if f (t) is some
non-linear function of t, it is possible to find a feature mapping φ(t) = [φ1 (t), . . . , φp (t)]T such that f (t)

is a linear function of the $\phi$'s: $f(t) = \sum_j w_j \phi_j(t)$. In class, we had discussed how to implement this idea to
perform kernel regression without explicitly constructing the feature mapping. In the current setting,
we extend this to include a prior over the parameters $w = [w_1, \ldots, w_p]^T$. We set this to be simply a spherical multivariate Gaussian $N(w; 0, \sigma_w^2 I)$. Now, given any set of time points $t_1, \ldots, t_r$, we imagine generating function values at those time points according to the following procedure:

Step 1: draw one sample $w \sim N(0, \sigma_w^2 I)$.
Step 2: generate function values for the time points of interest:

$$f(t_k) = \sum_{j=1}^{p} w_j \phi_j(t_k), \qquad k = 1, \ldots, r$$

Since the $w_j$ are normally distributed, so is $f = [f(t_1), \ldots, f(t_r)]^T$. The distribution of this vector is $N(f; 0, K)$, where $K_{kl} = K(t_k, t_l) = \sigma_w^2\, \phi(t_k)^T \phi(t_l)$. Looks familiar? Let's try to understand it a bit better. We have a distribution over function values at any set of time points. In other words, we have defined a prior distribution over functions! The kernel function $K(t, t')$, where the inputs are the time points, gives rise to a Gram matrix $K_{kl} = K(t_k, t_l)$ over any finite set of inputs. This probability distribution over functions is known as a Gaussian Process (GP for short).
The choice of the kernel $K(t, t')$ controls the kind of functions that are sampled. Consider the following kernel, similar to the radial basis function kernel seen before:

$$K(t, t') = \sigma_f^2 \exp\!\left(-\frac{(t - t')^2}{2\rho^2}\right)$$

The hyperparameter $\rho$ controls the time span over which the function values are strongly coupled. A large $\rho$ implies, for example, that the sampled functions would appear roughly constant across a long time range. $\sigma_f$ scales this correlation by a constant factor.
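To make the figures in part (a) easier to interpret, here is a minimal MATLAB sketch of how such sample curves can be drawn from the GP prior N(0, K); the hyperparameter values and plotting range are arbitrary examples, not taken from the problem.

% Illustrative sketch: sample functions from the GP prior with the kernel above.
sigma_f = 2;  rho = 0.7;                        % example hyperparameter values
t = linspace(0, 10, 200);                       % finely spaced time points
[T1, T2] = meshgrid(t, t);
K = sigma_f^2 * exp(-(T1 - T2).^2 / (2*rho^2)); % Gram matrix over the time points
L = chol(K + 1e-8*eye(length(t)), 'lower');     % small jitter for numerical stability
f = L * randn(length(t), 3);                    % three sampled functions (columns)
plot(t, f);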

(a) Match each of the following four figures, with the four kernels defined below. Each figure describes
some functions sampled using a particular kernel. Note that these functions are drawn by selecting
finely spaced time points and drawing a sample from the corresponding function values at these
points from N (f ; 0, K) where K is the Gram matrix over the same points.
Here are the kernels:

i. $3\exp\!\left(-\dfrac{(t-t')^2}{2(2.1)^2}\right)$
ii. $2\exp\!\left(-\dfrac{(t-t')^2}{2(0.7)^2}\right)$
iii. $2\exp\!\left(-\dfrac{\sin^2\!\bigl(\pi(t-t')/4\bigr)}{2(0.3)^2}\right)$
iv. $2\exp\!\left(-\dfrac{(t-t')^2}{2(0.05)^2}\right) + 0.1\,(t\,t')^2$
Figure 1: Each subfigure corresponds to one kernel

Let us now consider how we will use GP’s to simultaneously perform regression across genes belonging
to the same class. Since we assume they have similar expression curves we will model them as samples
from the same Gaussian process. Our model for the data is as follows. There are k classes in the data
and each gene belongs to a single class. Let the genes in class l be gl1 , gl2 , . . . , glml where ml is the
number of genes in that class. The expression curves for these genes are assumed to be samples from
the same GP. The kernels of the k GP’s (one for each class) all have the same parametric form, but
different hyperparameters $\theta = (\sigma_f, \rho, \sigma_n)$:

$$K(t, t') = \sigma_f^2 \exp\!\left(-\frac{(t - t')^2}{2\rho^2}\right) + \sigma_n^2\, \delta(t, t'), \quad \text{where } \delta(t, t') = \begin{cases} 1 & \text{if } t = t' \\ 0 & \text{otherwise} \end{cases}$$

This is essentially the same as before except for the second term, which accounts for measurement noise. In other words, the responses $y$ are modeled as $y(t) = f(t) + \epsilon_t$, $\epsilon_t \sim N(0, \sigma_n^2)$, where $f(t)$ is a Gaussian process with the noise-free kernel given earlier. Thus, for any time points $t_1, \ldots, t_r$, the response vector $y$ has a distribution $N(0, K)$, where $K$ is the Gram matrix of the noisy kernel above.

(b) We have provided for you a function log likelihood gp(params, t, Yobs), where params specify
the hyperparameters of the kernel, t is a vector of r timepoints and Yobs is a q×r matrix of observed
gene expression values (for q genes) at those timepoints. For each of the q sets of observations,
the function computes the log-likelihood that observations were generated from the GP specified
by params. Fill in the missing lines of the code. Attach a printout of your code (you only need to
show the parts you changed).

We now go back to our original problem and set up an Expectation Maximization framework to solve it.
Suppose we know that the genes in the data cluster into k classes (i.e., k is a user-specified parameter).
We start off by guessing some initial values for the hyperparameters of the k GPs. In each E-step, we
compute the posterior assignment probability P(k | i), i.e., the probability that gene gi is in class k. To
simplify the M-step we will turn these probabilities into hard assignments. In each M-step, we compute
the MLE estimates for the hyperparameters of the GPs associated with the classes.
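Schematically, one run of this hard-assignment EM looks like the sketch below. The hyperparameter layout V (one row [sigma_f rho sigma_n] per class), the prior vector P, and the helper fit_gp_hyperparameters are assumptions for illustration only; they are not the interface of the provided em gp code.

% Illustrative sketch of hard-assignment EM (not the provided em_gp implementation).
% Yobs: m-by-r expression matrix, t: 1-by-r time points, k: number of classes.
max_iters = 20;
for it = 1:max_iters
  % E-step: log posterior of each class for every gene, then hard-assign
  logpost = zeros(size(Yobs,1), k);
  for j = 1:k
    logpost(:,j) = log_likelihood_gp(V(j,:), t, Yobs) + log(P(j));
  end
  [~, z] = max(logpost, [], 2);                 % hard class assignments
  % M-step: refit each class's hyperparameters on its assigned genes
  for j = 1:k
    V(j,:) = fit_gp_hyperparameters(t, Yobs(z == j, :));  % hypothetical helper
    P(j) = mean(z == j);                        % update class priors
  end
end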

(c) To perform EM, we have provided you a function [W,V,L] = em gp(t, Yobs, k, init class).
The code is missing some steps in the Expectation sub-function. Fill them in. Attach a printout
of your code for the Expectation function.
(d) Using the dataset provided with this problem, run em gp with different choices of k (k = 2, 3, 4, 5).
We have provided a function plot results(W,V,t,Yobs,k,nn) which you may find useful. You can
use it to plot nn randomly chosen curves from each of the k classes. What choice of k seems most
appropriate?
(e) Gene expression data often has missing values, i.e., you may not know the value yij , the expression
of gene gi at time tj . With some machine learning methods, handling these missing values can be
a big hassle. But with our model, the missing values problem can be handled very simply. How?
(f) (Optional) Our initial choice of hyperparameters is constructed as follows: we assign genes to
specific classes, either randomly or based upon some insights. Using this class assignment, the initial
hyperparameters are computed by an MLE approach. Describe a method that could be used to
generate a good initial class assignment.
(g) (Optional) Since we have defined a prior over functions, we could turn the clustering problem into a Bayesian model selection problem. We assume that each cluster has a single underlying expression curve f(t), sampled from a fixed GP model, and the responses for each gene in the cluster are subsequently sampled from $y(t) = f(t) + \epsilon_t$, $\epsilon_t \sim N(0, \sigma_n^2)$, using the same f(t).
Write an expression for the marginal likelihood of the data corresponding to a fixed assignment of genes into clusters.
Problem Set 5: Solutions

1. (a) For the LM, there is no need to iterate; the maximum likelihood estimates are easy to derive: they are simply the normalized counts.
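Concretely (a standard maximum-likelihood result, stated here for completeness),

$$LM(e_i \mid e_{i-1}, e_{i-2}) = \frac{\operatorname{count}(e_{i-2}, e_{i-1}, e_i)}{\operatorname{count}(e_{i-2}, e_{i-1})}.$$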

The updates for T and D are:

$$T'(f \mid e) = \frac{1}{Z_T(e)} \sum_{\substack{i,\,j,\,k:\\ e_i^{(k)} = e,\; f_j^{(k)} = f}} \frac{D(a_j = i \mid j, \ell, m)\, T(f_j^{(k)} \mid e_i^{(k)})}{\sum_{i'=0}^{\ell} D(a_j = i' \mid j, \ell, m)\, T(f_j^{(k)} \mid e_{i'}^{(k)})}$$

$$D'(a_j = i \mid j, \ell, m) = \frac{1}{Z_D(j, \ell, m)} \sum_{\substack{k:\\ |e^{(k)}| = \ell,\; |f^{(k)}| = m}} \frac{D(a_j = i \mid j, \ell, m)\, T(f_j^{(k)} \mid e_i^{(k)})}{\sum_{i'=0}^{\ell} D(a_j = i' \mid j, \ell, m)\, T(f_j^{(k)} \mid e_{i'}^{(k)})},$$

where $Z_T$ and $Z_D$ are normalization constants, $e^{(k)}$ and $f^{(k)}$ denote the $k$th English and German training sentences, and $\ell = |e^{(k)}|$, $m = |f^{(k)}|$ are their lengths. One can derive these updates using the formal EM formulation; however, simply using the soft counts is fine as well.
(b) Generally, a choice that has lots of zeros will be bad. Other choices that depend on a lot of symmetry in the data will also cause problems. This model is far from convex, so there are plenty of local optima; finding a good initial setting is nontrivial in general, but a simple one (for example, uniform distributions) is fine here.
(c) My update to the code was this:

LM:

% accumulate trigram counts
LM(m2m1,english(i,j)) = LM(m2m1,english(i,j)) + 1;
...
% normalize counts into conditional probabilities
LM(m2m1,i) = LM(m2m1,i)/LMc(m2,m1);

EM updates (soft counts for sentence pair idx):

for j=1:m
  % posterior over alignments a_j for German position j
  a = [];
  for i=1:l
    a(i) = T(deutsch(idx,j),english(idx,i)) * D(indexpack(j,l,m),i);
  end
  a = a / sum(a);
  % accumulate expected (soft) counts
  for i=1:l
    Tn(deutsch(idx,j),english(idx,i)) = Tn(deutsch(idx,j),english(idx,i)) + a(i);
    Dn(indexpack(j,l,m),i) = Dn(indexpack(j,l,m),i) + a(i);
  end
end
normalization:

T = sparse(de_vocab, en_vocab);
D = sparse((mmax+1) * 50 * 50, lmax);
% renormalize translation counts per English word
for enword=1:en_vocab
  nfactor = sum(Tn(:,enword));
  for deword=find(Tn(:,enword))
    T(deword, enword) = Tn(deword, enword)/nfactor;
  end
end
% renormalize distortion counts per (j, l, m) index
Dnfactor = sum(Dn,2) * sparse(ones(1,size(D,2)));
idxpack = find(Dnfactor);
D(idxpack) = Dn(idxpack)./Dnfactor(idxpack);

Here is the output I got:

Dear Klaus !
I have you . locked a small room .
You can read the books .
It

One problem with this approach is that it tends to be somewhat sensitive to the starting point; random initialization was not the best choice, but it is OK. A better choice is a uniform distribution for D and co-occurrence counts for T. In practice, one would train a simpler model (IBM Model 1) and transfer the probabilities. Also, a NULL word on the English side would help to “explain” common particle words on the German side. The last sentence is seven words long in German; this length does not appear in the corpus, so its translation is nonsense. All in all, this method has many limitations, but considering its simplicity, the results are quite good, and this type of method is very appealing.
(d) (i) It will reorder more freely.
(ii) It will reorder less.
(iii) This is (essentially) the same as (i). If there are spots that tend to have attractive words on
the English side (words that explain a lot of the German words), then all of the words will try
to reorder there. In general, however, it will have a smoothing effect on the alignments.

2. (a) If the kernel is defined so as to encode a high covariance between two points x1 and x2,
then the observations at those two values should be highly correlated (i.e. should have roughly
similar values). Thus, if the covariance between neighboring points is high along a long stretch of
the x-axis, the function value will remain roughly constant along that stretch.
Using this intuition, we have the following set of matches: 1-d, 2-b, 3-a, 4-c. The kernels can also capture periodicity. For example, kernel (iii) indicates that points which are t − t' = 2 apart will have low covariance, while points which are t − t' = 4 apart will have higher covariance (i.e., roughly similar values). This introduces periodicity in the function values, as observed in Fig (a). Also, the (tt')² component of kernel (iv) implies that the covariance between neighboring points increases with t, resulting in the curve moving in a single direction.
(b) The log-likelihood of observing a set of values y1, . . . , yr at time-points t1, . . . , tr is simply
the probability of sampling the r-dimensional vector (y1 , . . . , yr ) from N (0, G) where G is the Gram
matrix corresponding to the time-points. The code looks as follows:
function ll = log_likelihood_gp(params, t, Yobs)

[q, r] = size(Yobs);          % q genes observed at r time points

% Gram matrix of the noisy kernel over the time points in t
G = ...

Ginv = inv(G);
detG = det(G);

ll = zeros(q,1);
for i=1:q
  y = reshape(Yobs(i,:), r, 1);
  a = 0.5*(y'*Ginv*y);                      % quadratic term
  b = (r/2) * log(2*pi) + (0.5*log(detG));  % normalization term
  ll(i) = -a - b;                           % Gaussian log-density
end
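The Gram-matrix line is left elided as in the original. One possible way to fill it in, assuming params contains [sigma_f, rho, sigma_n] in that order (the actual ordering in the provided template may differ), is:

% Possible Gram-matrix construction (parameter ordering is an assumption).
sigma_f = params(1);  rho = params(2);  sigma_n = params(3);
[T1, T2] = meshgrid(t, t);
G = sigma_f^2 * exp(-(T1 - T2).^2 / (2*rho^2)) + sigma_n^2 * eye(length(t));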

(c) The Expectation code is shown below. Recall that W(i,j) is the probability of gene i belonging to cluster j.

function W_new = Expectation(t, Y_obs, k, W, V)

% class priors estimated from the current (soft) assignments
P = sum(W);
P = P/sum(P);

[n,T] = size(Y_obs);
W_new = zeros(n,k);

for j=1:k
  % log-likelihood of each gene under class j's GP
  ll = log_likelihood_gp(V(j,:), t, Y_obs);
  % unnormalized log posterior assignment probability
  W_new(:,j) = ll + log(P(j));
end

% normalize the posterior; a more sophisticated handling of very small
% numbers would be better, but the simplest normalization method
% suffices for grading purposes
for i=1:n
  W_new(i,:) = exp(W_new(i,:)) / sum(exp(W_new(i,:)));
end
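As the comment above notes, exponentiating large negative log posteriors can underflow. A standard log-sum-exp normalization, shown here as an optional refinement rather than part of the graded solution, avoids this:

% Optional, numerically stable normalization via the log-sum-exp trick.
for i=1:n
  mx = max(W_new(i,:));            % subtract the row maximum before exponentiating
  p = exp(W_new(i,:) - mx);
  W_new(i,:) = p / sum(p);
end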

(d) The best results are obtained with k = 3 clusters. With higher k, some of the clusters are either
empty or two different clusters have similar curves. With lower k (e.g. k = 2), one of the clusters
contains two different kinds of curves.
(e) There are a few different ways of dealing with missing values. One approach would be to use the function curves estimated in the previous iteration of EM to compute the expected value of the gene's expression at the missing time-point.
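A minimal sketch of that idea, using the standard GP posterior mean to fill in one missing value for one gene; all variable names here are illustrative and do not come from the provided code:

% Impute a missing value at time t_miss for one gene, given its observed time
% points t_obs (1-by-s row vector), observed values y_obs (s-by-1 column vector),
% and the hyperparameters sigma_f, rho, sigma_n of the gene's class.
k_star = sigma_f^2 * exp(-(t_miss - t_obs).^2 / (2*rho^2));           % 1-by-s
[T1, T2] = meshgrid(t_obs, t_obs);
K_obs = sigma_f^2 * exp(-(T1 - T2).^2 / (2*rho^2)) + sigma_n^2 * eye(length(t_obs));
y_hat = k_star * (K_obs \ y_obs);     % GP posterior mean at the missing time point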
(f) Optional: There are many possible ways to guess an initial estimate of the gene clusters (for grading purposes, any reasonable approach is fine). Here we describe one such approach:

• Between each pair of genes $g_1$ and $g_2$, define a distance measure. This distance measure may be the Euclidean distance between the observations, $\sqrt{\sum_{i=1}^{r} (y_{1i} - y_{2i})^2}$. Another measure might be the Pearson correlation between these sets of observations.
• Using these pairwise distances, perform Hierarchical Agglomerative Clustering on the genes, i.e., start with each gene as a singleton cluster and, at each step, merge the two most similar clusters. For example, if we perform complete-linkage clustering, the distance between two clusters will be defined as the largest distance between any pair of genes, one from each cluster (see the MATLAB sketch below).
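One possible realization of this initialization in MATLAB, using the Statistics Toolbox functions pdist, linkage, and cluster; this is one reasonable implementation, not the official solution:

% Hierarchical agglomerative clustering to produce an initial class assignment,
% which can then be passed as init_class to em_gp.
d = pdist(Yobs, 'euclidean');             % pairwise distances between gene profiles
Z = linkage(d, 'complete');               % complete-linkage agglomerative clustering
init_class = cluster(Z, 'maxclust', k);   % cut the dendrogram into k clusters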

(g) Optional: For notational convenience, we define the marginal likelihood for the case of a single cluster. The generalization to multiple clusters is straightforward.

Suppose there are r timepoints $t_1, \ldots, t_r$, and let $f(t)$ be the cluster mean curve. Then, for any gene, the probability of seeing the observations $[y(1), \ldots, y(r)]$ in the cluster is the probability that each $y(l)$ is sampled from the normal distribution $N(f(t_l), \sigma_n^2)$:

$$\prod_{l=1}^{r} N\bigl(y(l);\, f(t_l),\, \sigma_n^2\bigr), \quad \text{or} \quad \prod_{l=1}^{r} \frac{1}{\sqrt{2\pi}\,\sigma_n} \exp\!\left(-\frac{(y(l) - f(t_l))^2}{2\sigma_n^2}\right)

The prior probability of seeing a cluster mean curve $f(t)$ is given by the probability of sampling the r-dimensional vector $f = [f(t_1), \ldots, f(t_r)]$ from the r-dimensional Gaussian formed by using the Gram matrix constructed from $t_1, \ldots, t_r$:

$$N(f; 0, G)$$

where G is the Gram matrix as per the fixed GP.

The marginal likelihood of the cluster is then obtained by multiplying the likelihoods of all genes in the cluster (since they share the same $f$) and integrating out $f$:

$$\int \left[\prod_{i \in \text{cluster}} \prod_{l=1}^{r} N\bigl(y_i(l);\, f(t_l),\, \sigma_n^2\bigr)\right] N(f; 0, G)\, df$$
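Note that, for a cluster containing a single gene, this Gaussian integral can be evaluated in closed form:

$$\int \prod_{l=1}^{r} N\bigl(y(l); f(t_l), \sigma_n^2\bigr)\, N(f; 0, G)\, df = N(y; 0, G + \sigma_n^2 I),$$

which is consistent with the noisy-kernel GP likelihood used in part (b).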
