CSE291D Lecture 4: Exponential Families and Generalized Linear Models

Why exponential families?
• So far, we’ve been studying fairly concrete models,
capturing phenomena such as coin flips, dice rolls,
drawing balls from urns…
• Today we’ll study a more general class of models at a
higher level of abstraction: the exponential family
• Many standard distributions are in the exponential family
(Gaussian, Bernoulli, Dirichlet, Poisson, exponential, …)
• Derive algorithms for the exponential family, and you
get all those distributions for free!
• A finite set of sufficient statistics captures all relevant properties
of a data set
• Conjugate priors are available for exponential family likelihoods
Exponential family distributions in a nutshell
• Suppose we want to come up with a new probability distribution.
What’s the simplest way that parameters and data could be mapped to
probabilities?
– How about a linear mapping via a dot product?
• Probabilities and probability densities must be non-negative.
What is the simplest way that we can ensure non-negativity?
• Probabilities are proportional to e to the power
of a dot product of parameters and data
• The actual definition is slightly more complicated, but
it just generalizes this basic idea.
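As a minimal sketch of this idea, the following Python snippet builds a distribution over three outcomes whose probabilities are proportional to e raised to a dot product of parameters and features. The function name `exp_dot_distribution`, the outcome labels, and the one-hot feature vectors are illustrative assumptions, not part of the lecture.

```python
import math

def exp_dot_distribution(theta, features):
    """Return probabilities proportional to exp(theta . phi(x)) for each outcome x.

    theta    -- list of parameters
    features -- dict mapping each outcome to its feature vector (hypothetical data)
    """
    scores = {x: math.exp(sum(t * f for t, f in zip(theta, phi)))
              for x, phi in features.items()}
    total = sum(scores.values())  # normalizer so the probabilities sum to 1
    return {x: s / total for x, s in scores.items()}

# Three outcomes described by one-hot features (an illustrative choice).
features = {"a": [1, 0, 0], "b": [0, 1, 0], "c": [0, 0, 1]}
theta = [0.0, math.log(2.0), math.log(3.0)]
probs = exp_dot_distribution(theta, features)
```

With these parameters the unnormalized scores are 1, 2 and 3, so the probabilities come out as 1/6, 1/3 and 1/2. Exponentiating guarantees non-negativity, and dividing by the total guarantees that the probabilities sum to one.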
Maximum entropy motivation
• Suppose we’d like to model some phenomenon with a
probability distribution, and all we know is how certain
features/properties behave on average.
– Which distribution should we use?
• The principle of maximum entropy states that in the
absence of further knowledge, we should select the
distribution with the most entropy
• Entropy measures the “flatness” of a distribution, and
can be interpreted as the amount of uncertainty
Entropy
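A small sketch of how entropy measures flatness, assuming the standard Shannon definition H(p) = -sum_x p(x) log p(x), measured in nats. The two example distributions are made up for illustration.

```python
import math

def entropy(probs):
    """Shannon entropy in nats; outcomes with p = 0 contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)

uniform = [0.25, 0.25, 0.25, 0.25]   # as flat as possible over 4 outcomes
peaked = [0.97, 0.01, 0.01, 0.01]    # almost all mass on one outcome

h_uniform = entropy(uniform)  # equals log(4), the maximum for 4 outcomes
h_peaked = entropy(peaked)    # much smaller: little uncertainty remains
```

The flat distribution attains the largest possible entropy for four outcomes, while the peaked one has far less.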
Maximum entropy motivation
• It can be shown (see the readings) that the
maximum entropy distribution, subject to constraints on the
expected values of the features, has the form of the
exponential family.
• So, in the absence of any further information, we
should select an exponential family distribution.
Statistical mechanics
• The second law of thermodynamics:
“the entropy of the universe tends
towards a maximum” (Rudolf Clausius)
• We would therefore expect exponential family
distributions to be prevalent in nature
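• Suppose we are modeling particles in a gas. The
Boltzmann distribution (Gibbs distribution), named after Ludwig Boltzmann, has the form:
$p(\text{state } i) \propto \exp(-E_i / (k T))$
where $E_i$ is the energy of state $i$, $k$ is Boltzmann’s constant, and $T$ is the temperature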
Learning outcomes
By the end of the lesson, you should be able to:
• Write distributions in exponential family form
• Perform maximum likelihood estimation and
Bayesian inference for exponential families
• Specify generalized linear models by selecting an
appropriate exponential family distribution and
link function
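• A pdf or pmf is said to be in the exponential
family if it can be written in the form:
$p(x \mid \theta) \propto h(x) \exp(\theta^\top \phi(x))$
– $h(x)$: some scaling factor which doesn’t depend on the parameters, often = 1
– $\theta^\top \phi(x)$: dot product of parameters $\theta$ and sufficient statistics $\phi(x)$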
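With normalizing constant
• Two ways to write it:
$p(x \mid \theta) = \frac{1}{Z(\theta)} h(x) \exp(\theta^\top \phi(x))$, where $Z(\theta) = \sum_x h(x) \exp(\theta^\top \phi(x))$ (an integral for continuous $x$) is the “partition function”
• Or:
$p(x \mid \theta) = h(x) \exp(\theta^\top \phi(x) - A(\theta))$, where $A(\theta) = \log Z(\theta)$ is the “log partition function”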
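The two ways of writing the normalizer can be checked numerically. The sketch below assumes a small finite support, h(x) = 1, and hypothetical sufficient statistics phi(x) = [x, x^2]; none of these choices come from the lecture.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def phi(x):
    # Hypothetical sufficient statistics chosen for this illustration.
    return [x, x * x]

support = [0, 1, 2, 3]
theta = [1.0, -0.5]

# Partition function: sum of unnormalized probabilities (h(x) = 1 here).
Z = sum(math.exp(dot(theta, phi(x))) for x in support)
# Log partition function.
A = math.log(Z)

form_1 = [math.exp(dot(theta, phi(x))) / Z for x in support]   # divide by Z
form_2 = [math.exp(dot(theta, phi(x)) - A) for x in support]   # subtract A inside exp
```

Both lists hold the same probabilities, which is all that the "Or:" rewriting claims: dividing by Z is the same as subtracting log Z inside the exponent.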
Sufficient statistics
• $\phi(x)$ is called sufficient because the likelihood
only depends on the data through $\phi(x)$
• To learn the parameters, all the information we
need is captured in the sufficient statistics.
• Exponential family distributions are the only
distributions with a fixed-dimensional set of
sufficient statistics (under weak assumptions)
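To illustrate sufficiency, here is a sketch using the Bernoulli likelihood, whose sufficient statistic is the number of ones (together with the sample size). The two data sets are invented for the example: they differ in order but share the same count of ones.

```python
import math

def bernoulli_likelihood(data, mu):
    """Likelihood of i.i.d. binary data under a Bernoulli with P(x = 1) = mu."""
    like = 1.0
    for x in data:
        like *= mu if x == 1 else (1.0 - mu)
    return like

data_a = [1, 0, 0, 1, 1, 0, 1, 0]
data_b = [0, 0, 0, 0, 1, 1, 1, 1]  # same length, same number of ones

# The likelihoods agree for every value of mu, because they depend on the
# data only through the sufficient statistics.
same = all(
    math.isclose(bernoulli_likelihood(data_a, mu), bernoulli_likelihood(data_b, mu))
    for mu in [0.1, 0.3, 0.5, 0.9]
)
```

Any learning procedure based on the likelihood therefore cannot tell these two data sets apart.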
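A slight generalization
• Allow a different parameterization, as long as it can
still be written the same way after transformation:
$p(x \mid \theta) = h(x) \exp(\eta(\theta)^\top \phi(x) - A(\eta(\theta)))$, where $\eta(\theta)$ are the transformed (natural) parameters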
Summary / notation
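Writing distributions in exponential family form: Bernoulli
• $\mathrm{Ber}(x \mid \mu) = \mu^x (1-\mu)^{1-x} = \exp(x \log \mu + (1-x) \log(1-\mu))$
• This is in exponential family form with sufficient statistics $\phi(x) = [x, 1-x]$ and parameters $\theta = [\log \mu, \log(1-\mu)]$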
Minimal exponential families
• The sufficient statistics are in some sense redundant,
as we can compute x and 1 – x as linear functions of
each other.
• The representation is said to be over-complete.
• If the sufficient statistics are linearly independent,
the representation is said to be minimal.
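A sketch comparing an over-complete and a minimal representation of the Bernoulli distribution. It assumes the usual minimal form with phi(x) = x, natural parameter theta = log(mu / (1 - mu)) (the log-odds), and log partition function A(theta) = log(1 + e^theta); the function names are illustrative.

```python
import math

def bernoulli_pmf(x, mu):
    return mu if x == 1 else 1.0 - mu

def overcomplete_form(x, mu):
    # theta = [log mu, log(1 - mu)], phi(x) = [x, 1 - x]; no normalizer needed.
    theta = [math.log(mu), math.log(1.0 - mu)]
    phi = [x, 1 - x]
    return math.exp(theta[0] * phi[0] + theta[1] * phi[1])

def minimal_form(x, mu):
    # theta = log-odds, phi(x) = x, A(theta) = log(1 + e^theta).
    theta = math.log(mu / (1.0 - mu))
    A = math.log(1.0 + math.exp(theta))
    return math.exp(theta * x - A)
```

Both functions reproduce `bernoulli_pmf` for x in {0, 1}; the minimal form simply uses one sufficient statistic instead of two linearly dependent ones.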
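Definitions
• Natural exponential family: the sufficient statistics are the data themselves, $\phi(x) = x$
• Canonical form: the transformed parameters are the parameters themselves, $\eta(\theta) = \theta$
• Curved exponential family: $\dim(\theta) < \dim(\eta(\theta))$
(a representation where the transformed
parameters $\eta(\theta)$ live in some “curved” space)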
Gaussian distribution
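A sketch checking one common way to write the univariate Gaussian in exponential family form, assuming sufficient statistics phi(x) = [x, x^2], natural parameters theta = [mu / sigma^2, -1 / (2 sigma^2)], scaling factor h(x) = 1 / sqrt(2 pi), and log partition function A(theta) = -theta_1^2 / (4 theta_2) - (1/2) log(-2 theta_2). Other equivalent conventions move the constant between h and A.

```python
import math

def gaussian_pdf(x, mu, sigma2):
    """Textbook Gaussian density with mean mu and variance sigma2."""
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)

def gaussian_expfam_pdf(x, mu, sigma2):
    """The same density written as h(x) * exp(theta . phi(x) - A(theta))."""
    theta1 = mu / sigma2
    theta2 = -1.0 / (2.0 * sigma2)
    A = -theta1 ** 2 / (4.0 * theta2) - 0.5 * math.log(-2.0 * theta2)
    h = 1.0 / math.sqrt(2.0 * math.pi)
    return h * math.exp(theta1 * x + theta2 * x * x - A)
```

Evaluating both functions at the same point gives the same value, since expanding the square in the exponent yields exactly the terms theta_1 x + theta_2 x^2 - A.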
Poisson distribution
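A sketch of the Poisson distribution in exponential family form, assuming phi(x) = x, natural parameter theta = log(lambda), scaling factor h(x) = 1 / x!, and log partition function A(theta) = e^theta.

```python
import math

def poisson_pmf(x, lam):
    """Textbook Poisson pmf with rate lam."""
    return lam ** x * math.exp(-lam) / math.factorial(x)

def poisson_expfam_pmf(x, lam):
    """The same pmf written as h(x) * exp(theta * x - A(theta))."""
    theta = math.log(lam)
    A = math.exp(theta)            # equals lam
    h = 1.0 / math.factorial(x)
    return h * math.exp(theta * x - A)
```

Here the scaling factor h(x) is not 1: the 1 / x! term depends on the data but not on the parameters, which is exactly the role h plays in the definition.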
Gamma distribution
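A sketch of the Gamma distribution in exponential family form. It assumes the shape/rate parameterization, sufficient statistics phi(x) = [log x, x], natural parameters theta = [shape - 1, -rate], h(x) = 1, and A(theta) = log Gamma(theta_1 + 1) - (theta_1 + 1) log(-theta_2).

```python
import math

def gamma_pdf(x, shape, rate):
    """Textbook Gamma density in the shape/rate parameterization."""
    return rate ** shape * x ** (shape - 1.0) * math.exp(-rate * x) / math.gamma(shape)

def gamma_expfam_pdf(x, shape, rate):
    """The same density written as exp(theta . phi(x) - A(theta)), with h(x) = 1."""
    theta = [shape - 1.0, -rate]
    phi = [math.log(x), x]
    A = math.lgamma(theta[0] + 1.0) - (theta[0] + 1.0) * math.log(-theta[1])
    return math.exp(theta[0] * phi[0] + theta[1] * phi[1] - A)
```

This is an example with two sufficient statistics, log x and x, matching the two parameters of the distribution.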
Bayesian inference
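A minimal sketch of conjugate Bayesian inference, using the Beta prior for the Bernoulli likelihood as an assumed example. With a conjugate prior, the posterior stays in the same family and the update just adds the observed sufficient statistics to the prior's pseudo-counts. The prior values and data below are made up.

```python
def beta_bernoulli_posterior(alpha, beta, data):
    """Update a Beta(alpha, beta) prior with i.i.d. Bernoulli observations."""
    n_ones = sum(data)             # sufficient statistic of the data
    n_zeros = len(data) - n_ones
    return alpha + n_ones, beta + n_zeros

data = [1, 0, 1, 1, 0, 1]
post_alpha, post_beta = beta_bernoulli_posterior(2.0, 2.0, data)
posterior_mean = post_alpha / (post_alpha + post_beta)
```

With four ones and two zeros, the Beta(2, 2) prior becomes a Beta(6, 4) posterior with mean 0.6. The prior parameters act like counts from previously seen data.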
The gradient of the log partition function is the expected sufficient statistics
• $\nabla_\theta A(\theta) = \mathbb{E}_\theta[\phi(x)]$
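This relationship can be checked numerically. The sketch below assumes the minimal Bernoulli form, where A(theta) = log(1 + e^theta) and the expected sufficient statistic E[x] is the sigmoid of theta, and compares a finite-difference derivative of A with that mean.

```python
import math

def A(theta):
    """Log partition function of the Bernoulli in minimal form."""
    return math.log(1.0 + math.exp(theta))

def sigmoid(theta):
    """Mean of the Bernoulli, E[x] = P(x = 1), as a function of theta."""
    return 1.0 / (1.0 + math.exp(-theta))

theta = 0.7
eps = 1e-5
# Central finite-difference approximation to dA/dtheta.
dA = (A(theta + eps) - A(theta - eps)) / (2 * eps)
mean = sigmoid(theta)
```

The two numbers agree up to finite-difference error, so differentiating A recovers the expected sufficient statistic without computing any sum over outcomes.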
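MLE for exponential families
• The likelihood function is:
$p(\mathcal{D} \mid \theta) = \left[\prod_{i=1}^N h(x_i)\right] \exp\left(\theta^\top \sum_{i=1}^N \phi(x_i) - N A(\theta)\right)$
• Take the log:
$\log p(\mathcal{D} \mid \theta) = \sum_{i=1}^N \log h(x_i) + \theta^\top \sum_{i=1}^N \phi(x_i) - N A(\theta)$
• Take the derivative and set to zero:
$\nabla_\theta \log p(\mathcal{D} \mid \theta) = \sum_{i=1}^N \phi(x_i) - N \nabla_\theta A(\theta) = 0$
• Since $\nabla_\theta A(\theta) = \mathbb{E}_\theta[\phi(x)]$, the MLE satisfies $\mathbb{E}_\theta[\phi(x)] = \frac{1}{N} \sum_{i=1}^N \phi(x_i)$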
• At the MLE, the expected sufficient statistics under
the model match the average observed statistics
• If you were to simulate “fantasy data” under the
model, on average the statistics would match
those of the data
Maximum likelihood gradient updates
• The gradient is the difference between:
– the average sufficient statistics under the data
– the expected sufficient statistics under “fantasy data”
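A sketch of these gradient updates for a Poisson model in natural-parameter form, where the expected sufficient statistic under the model is e^theta. The data, step size, and iteration count are arbitrary choices for illustration.

```python
import math

data = [3, 1, 4, 1, 5, 9, 2, 6]
avg_stat = sum(data) / len(data)   # average sufficient statistic under the data

theta = 0.0   # natural parameter, theta = log(rate)
step = 0.1
for _ in range(500):
    expected_stat = math.exp(theta)   # E[x] under the current model
    grad = avg_stat - expected_stat   # data average minus model expectation
    theta += step * grad              # gradient ascent on the average log-likelihood

fitted_mean = math.exp(theta)
```

The updates stop changing theta once the model's expected statistic equals the data average, so `fitted_mean` ends up equal to the sample mean, as the moment-matching condition predicts.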
Mean parameterization
• Many distributions are traditionally parameterized by
their mean
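• For minimal exponential families, there is a
1-1 mapping between natural parameters and the
mean of the distribution (more precisely, the expected
sufficient statistics). In this case, we can use
the mean parameterization, $\mu = \mathbb{E}[\phi(x)]$
• The MLE of the mean parameters is just the sample mean of the sufficient statistics!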
Generalized linear models
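• Linear regression: $y \sim \mathcal{N}(w^\top x, \sigma^2)$
• Logistic regression: $y \sim \mathrm{Bernoulli}(\mathrm{sigm}(w^\top x))$, where $\mathrm{sigm}(a) = 1 / (1 + e^{-a})$ is the logistic sigmoid
• What about other types of data?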
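• Linear predictor: $\eta = w^\top x$
• Mean parameter: $\mu = g^{-1}(\eta)$
• Mean-parameterized exponential family: the response $y$ is drawn from an exponential family distribution with mean $\mu$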
Link functions
• $g^{-1}$, which maps the linear predictor to the mean parameter, is called the mean function,
or inverse link function
• $g$ is called the link function
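A sketch of how the pieces of a GLM fit together: a linear predictor followed by an inverse link (mean) function. The dictionary `INVERSE_LINKS`, the function names, and the example weights are hypothetical; the three links shown are the standard pairings for linear, logistic, and Poisson regression.

```python
import math

def linear_predictor(w, x):
    """eta = w . x"""
    return sum(wj * xj for wj, xj in zip(w, x))

# Inverse link (mean) functions, keyed by the name of the link function.
INVERSE_LINKS = {
    "identity": lambda eta: eta,                          # linear regression
    "logit": lambda eta: 1.0 / (1.0 + math.exp(-eta)),    # logistic regression
    "log": lambda eta: math.exp(eta),                     # Poisson regression
}

def glm_mean(w, x, link):
    """Mean of the response: apply the inverse link to the linear predictor."""
    return INVERSE_LINKS[link](linear_predictor(w, x))

w = [0.5, -1.0]
x = [2.0, 1.0]   # linear predictor is 0.5 * 2.0 - 1.0 * 1.0 = 0.0
means = {link: glm_mean(w, x, link) for link in INVERSE_LINKS}
```

The same linear predictor of 0 maps to a mean of 0 under the identity link, 0.5 under the logit link, and 1 under the log link, which shows how the link constrains the mean to the right range for each response type.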
Example: Poisson regression
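A sketch of Poisson regression, assuming the usual log link so that the mean of the count response is exp(w . x). The snippet gives the log-likelihood and its gradient, sum_i (y_i - mu_i) x_i; the data and weights are invented for illustration.

```python
import math

def poisson_log_lik(w, X, y):
    """Log-likelihood of Poisson regression with log link: mu_i = exp(w . x_i)."""
    total = 0.0
    for xi, yi in zip(X, y):
        eta = sum(wj * xj for wj, xj in zip(w, xi))
        # log Poisson pmf: y * eta - exp(eta) - log(y!)
        total += yi * eta - math.exp(eta) - math.lgamma(yi + 1)
    return total

def poisson_grad(w, X, y):
    """Gradient of the log-likelihood: sum_i (y_i - mu_i) * x_i."""
    grad = [0.0] * len(w)
    for xi, yi in zip(X, y):
        mu = math.exp(sum(wj * xj for wj, xj in zip(w, xi)))
        for j, xj in enumerate(xi):
            grad[j] += (yi - mu) * xj
    return grad

# Each row of X is [1, feature]; the leading 1 gives an intercept term.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1, 2, 4, 9]
w = [0.1, 0.5]
grad = poisson_grad(w, X, y)
```

The gradient has the same "observed minus expected" structure as the exponential family MLE, with each residual weighted by its input features.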
Estimation of GLMs
• MLE or MAP estimation:
– Second-order optimization methods are common, e.g.
Newton-Raphson /
iteratively reweighted least squares (IRLS)
• Bayesian inference:
– MCMC, or Laplace approximation
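A sketch of Newton-Raphson (equivalently IRLS) for maximum likelihood in a GLM, using logistic regression with an intercept and one slope so that the 2x2 linear system can be solved in closed form. The data set is made up and deliberately not linearly separable, so the MLE is finite; a practical implementation would use a linear algebra library and a convergence check.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic_newton(xs, ys, n_iter=25):
    """Fit P(y = 1 | x) = sigmoid(b + a * x) by Newton-Raphson updates."""
    b, a = 0.0, 0.0
    for _ in range(n_iter):
        g0 = g1 = 0.0            # gradient of the log-likelihood
        h00 = h01 = h11 = 0.0    # entries of X^T S X (the negative Hessian)
        for x, y in zip(xs, ys):
            mu = sigmoid(b + a * x)
            s = mu * (1.0 - mu)  # IRLS weight for this observation
            g0 += y - mu
            g1 += (y - mu) * x
            h00 += s
            h01 += s * x
            h11 += s * x * x
        det = h00 * h11 - h01 * h01
        # Solve the 2x2 system (X^T S X) delta = gradient in closed form.
        b += (h11 * g0 - h01 * g1) / det
        a += (h00 * g1 - h01 * g0) / det
    return b, a

xs = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [0, 0, 1, 0, 1, 1]
b, a = fit_logistic_newton(xs, ys)
```

At the fitted parameters the gradient is numerically zero, meaning the observed and expected statistics match. Newton-Raphson typically needs only a handful of iterations here, which is why second-order methods are popular for GLMs.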
GLM example: Think-pair-share
• You are a data scientist working on a competitor to
Google Maps. Your boss asks you to design a GLM to
predict the number of cars that pass each of various
stretches of the 5 freeway on any given hour on any
given day in the next 12 months.
• What input features will you use, and what exponential
family response distribution / link function / inverse
link function will you use for the GLM?
Assume you have all of the same data that Google can
collect.
