There are a few ways we can divide up the material as we go along. We'll adopt the first method and work in the second two along the way.
OVERVIEW: SUPERVISED LEARNING
[Figure: supervised regression example, target t plotted against input x]
[Figure: spam classification example, "Is this spam?" asked of an email beginning "hi everyone,"]
[Figure: topic model overview showing documents, topic proportions, and topic assignments]

Topics (top words with probabilities):
    health 0.03, medical 0.03, disease 0.02, hospital 0.01, ...
    team 0.03, basketball 0.02, points 0.01, score 0.01, ...
    government 0.04, law 0.02, politics 0.01, legislation 0.01, ...
    business 0.04, money 0.02, economic 0.02, company 0.01, ...
    computer 0.03, system 0.02, software 0.02, program 0.01, ...
With unsupervised learning our goal is often to uncover structure in the data.
This helps with predictions, recommendations, and efficient data exploration.
1 Figure from Koren, Y., Bell, R., and Volinsky, C. "Matrix factorization techniques for recommender systems." Computer 42.8 (2009): 30-37.
EXAMPLE UNSUPERVISED PROBLEM
Probabilistic Models
▶ A probabilistic model is a set of probability distributions, p(x|θ).
▶ We pick the distribution family p(·), but don't know the parameter θ.
▶ The maximum likelihood estimate is the value of θ that best explains the data according to the chosen distribution family.
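As a minimal sketch of this idea (my own illustration, not from the slides): take the family to be Bernoulli, with θ the unknown heads-probability, and compare how well different θ values explain a small hypothetical data set.

```python
import numpy as np

# A probabilistic model is a family of distributions p(x | theta).
# Here the family is Bernoulli(theta); the data are hypothetical coin flips.
x = np.array([1, 1, 0, 1, 1])

def likelihood(theta, x):
    # p(x_1, ..., x_n | theta) under the i.i.d. Bernoulli model
    return np.prod(theta ** x * (1 - theta) ** (1 - x))

for theta in (0.2, 0.5, 0.8):
    print(theta, likelihood(theta, x))
```

With four heads out of five flips, θ = 0.8 assigns the data the highest probability of the three candidates, which is exactly the sense in which one parameter value "best explains" the data.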
Logarithm trick
Calculating ∇_θ ∏_{i=1}^n p(x_i | θ) can be complicated. We use the fact that the logarithm is monotonically increasing on R_+, and the equality

    ln ∏_i f_i = ∑_i ln(f_i).

Because the logarithm is monotone,

    arg max_y ln g(y) = arg max_y g(y).

The location of the maximum does not change.
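Beyond keeping the calculus manageable, the logarithm trick also matters numerically. A short sketch (my own illustration, assuming i.i.d. Gaussian data): the product of many small likelihood values underflows to zero in floating point, while the sum of log-likelihoods stays finite.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=1000)  # i.i.d. samples

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Product of 1000 values, each well below 1: underflows to exactly 0.0.
product = np.prod(gaussian_pdf(x, 2.0, 1.0))

# Sum of log densities: a finite, well-behaved number.
log_sum = np.sum(np.log(gaussian_pdf(x, 2.0, 1.0)))

print(product)   # 0.0 due to floating-point underflow
print(log_sum)
```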
BLOCK #3: ANALYTIC MAXIMUM LIKELIHOOD
    θ̂_ML = arg max_θ ∏_{i=1}^n p(x_i | θ)
         = arg max_θ ln ∏_{i=1}^n p(x_i | θ)
         = arg max_θ ∑_{i=1}^n ln p(x_i | θ)
Setting the gradient of the log-likelihood to zero, we can solve for µ and Σ. (Try doing this without the log to appreciate its usefulness.)
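The analytic solution for the multivariate Gaussian can be checked numerically. A minimal sketch (my own illustration, with made-up parameters): setting the gradient of the log-likelihood to zero yields the sample mean for µ and the sample covariance (dividing by n, not n−1) for Σ.

```python
import numpy as np

rng = np.random.default_rng(1)
true_mu = np.array([1.0, -2.0])
true_Sigma = np.array([[2.0, 0.5],
                       [0.5, 1.0]])
X = rng.multivariate_normal(true_mu, true_Sigma, size=5000)  # shape (n, d)

# Analytic MLE: mu_hat = (1/n) sum_i x_i
mu_ml = X.mean(axis=0)

# Analytic MLE: Sigma_hat = (1/n) sum_i (x_i - mu_hat)(x_i - mu_hat)^T
centered = X - mu_ml
Sigma_ml = centered.T @ centered / len(X)  # divides by n, the ML convention

print(mu_ml)      # close to true_mu
print(Sigma_ml)   # close to true_Sigma
```

With n = 5000 the estimates land close to the parameters that generated the data, as the theory predicts.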
EXAMPLE: GAUSSIAN MEAN MLE
Are we done? This approach involves several assumptions and issues that
make finding the "best" parameter values not a complete victory.
▶ We made a model assumption (multivariate Gaussian).
▶ We made an i.i.d. assumption.
▶ We assumed that maximizing the likelihood is the best thing to do.
Comment: We often use θML to make predictions about xnew (Block #4).
How does θML generalize to xnew ?
If x1:n don’t “capture the space” well, θML can overfit the data.
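A concrete sketch of this overfitting (my own illustration, not from the slides): with only n = 3 samples from a standard normal, the ML variance estimate, which divides by n rather than n−1, is systematically too small, so the fitted Gaussian is overconfident about where new data will fall.

```python
import numpy as np

rng = np.random.default_rng(0)

n, reps = 3, 20000
# Many small training sets of n = 3 draws from N(0, 1).
samples = rng.normal(loc=0.0, scale=1.0, size=(reps, n))

# ML variance estimate: divides by n (numpy's default, ddof=0).
var_ml = samples.var(axis=1)

# True variance is 1, but E[var_ml] = (n - 1) / n = 2/3:
# on average the ML estimate underestimates the spread of new data.
print(var_ml.mean())
```

The estimate is unbiased as n grows, which is one way to see why θ̂_ML generalizes well only when x_1:n cover the space adequately.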