Computational Journalism
Columbia Journalism School

Week 3: Document Topic Modelling
September 24, 2012
- H. P. Luhn, A Statistical approach to mechanized encoding and searching of literary information, 1957
Example

D1 = I like databases
D2 = I hate hate databases
Each row = document vector
All rows = term-document matrix
Individual entry = tf(t,d) = term frequency
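The construction above can be sketched in a few lines of Python. This is illustrative code, not from the slides; tokenization here is just lowercased whitespace splitting.

```python
# Build a term-document matrix for the two example documents.
from collections import Counter

docs = {"D1": "I like databases", "D2": "I hate hate databases"}

# tf(t, d): term frequency per document.
counts = {name: Counter(text.lower().split()) for name, text in docs.items()}

# Vocabulary = union of all terms; each document becomes one row vector.
vocab = sorted(set().union(*counts.values()))
matrix = {name: [c[t] for t in vocab] for name, c in counts.items()}

print(vocab)         # ['databases', 'hate', 'i', 'like']
print(matrix["D1"])  # [1, 0, 1, 1]
print(matrix["D2"])  # [1, 2, 1, 0]  -- "hate" counted twice
```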
Distance function

Useful for:
  clustering documents
  finding docs similar to example
  matching a search query

Basic idea: look for overlapping terms
Cosine similarity

Given document vectors a, b define

  similarity(a, b) = a · b
If each word occurs exactly once in each document, equivalent to counting overlapping words.

Note: not a distance function, as similarity increases when documents are similar. (What part of the definition of a distance function is violated here?)
        a   b   q
this    1   0   0
car     1   3   1
runs    1   0   0
fast    1   0   1
my      0   1   0
is      0   1   0
old     0   1   0
I       0   1   0
want    0   1   0
a       0   1   0
new     0   1   0
shiny   0   1   0
similarity(a, q) = 1·1 [car] + 1·1 [fast] = 2
similarity(b, q) = 3·1 [car] + 0·1 [fast] = 3

Longer document more similar, by virtue of repeating words.
Cosine similarity

  cos θ = similarity(a, b) = (a · b) / (|a| |b|)

  dist(a, b) = 1 − (a · b) / (|a| |b|)
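A quick sketch contrasting the raw dot product with the length-normalized cosine, using the a, b, q vectors from the table. Illustrative code, not from the slides.

```python
# Raw dot product vs. cosine similarity on the table's vectors.
import math

# term order: this, car, runs, fast, my, is, old, I, want, a, new, shiny
a = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
b = [0, 3, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
q = [0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def cosine(u, v):
    # Normalize by vector lengths so document length stops mattering.
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

print(dot(a, q), dot(b, q))        # 2 3 -- longer doc b wins on raw dot product
print(cosine(a, q), cosine(b, q))  # after normalization, a is more similar
```

Normalization reverses the ranking: b only beat a by repeating "car", and dividing by |b| removes that advantage.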
Context matters

[figure: the same terms clustered differently in General News vs. Car Reviews]
Document frequency

Idea: de-weight common words

Common = appears in many documents

  df(t, D) = |{d ∈ D : t ∈ d}| / |D|

document frequency = fraction of docs containing term
tf-idf

Multiply term frequency by inverse document frequency

  tfidf(t, d, D) = tf(t, d) · idf(t, D)
                 = n(t, d) · log( |D| / n(t, D) )

n(t, d) = number of times term t appears in doc d
n(t, D) = number of docs in D containing t
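A direct transcription of the formula above into Python, on a made-up three-document corpus (the corpus and tokenization are assumptions for illustration):

```python
# tfidf(t, d, D) = n(t, d) * log(|D| / n(t, D))
import math
from collections import Counter

D = ["this car runs fast",
     "my car is old I want a new shiny car",
     "I like databases"]
counts = [Counter(d.lower().split()) for d in D]

def tfidf(t, d_idx):
    n_td = counts[d_idx][t]                  # times t occurs in doc d
    n_tD = sum(1 for c in counts if t in c)  # docs containing t
    return n_td * math.log(len(D) / n_tD)

# "car" appears in 2 of 3 docs, "databases" in only 1,
# so "databases" scores higher per occurrence.
print(tfidf("car", 1))        # 2 * log(3/2)
print(tfidf("databases", 2))  # 1 * log(3/1)
```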
- from Salton, Wong, Yang, A Vector Space Model for Automatic Indexing, 1975
tf-idf

nj-senator-menendez corpus, Overview sample files
color = human tags generated from tf-idf clusters
Cluster Hypothesis

"Documents in the same cluster behave similarly with respect to relevance to information needs."
- Manning, Raghavan, Schütze, Introduction to Information Retrieval

Not really a precise statement, but the crucial link between human semantics and mathematical properties. Articulated as early as 1971, it has been shown to hold at web scale, and is widely assumed.
Problem Statement

Can the computer tell us the topics in a document set?

Can the computer organize the documents by topic?
Bag of words model cannot differentiate polysemes, and incorrectly splits synonyms.
Contextual information

Word sense disambiguation research has shown that surrounding words can be used to determine sense:

  The bank has my money
  The plane banked sharply
Contextual information

Synonym detection: similar usages imply similar meanings:

  I bought a couch
  We need to buy more stamps
  I purchased a couch
  We need to purchase more stamps
U = words to concepts
D = diagonal matrix, concept strength in document set
V = concepts to documents

Synonymy = many rows in U (many words) map to same column in D (same concept)
Polysemy = same row in U (word) maps to multiple columns in D (different concepts)
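The decomposition above can be sketched with numpy's SVD. Note that numpy returns the diagonal of D as a vector s and returns V already transposed (Vt); the toy matrix below is an assumption for illustration, with two synonym rows sharing one concept.

```python
# X ≈ U · diag(s) · Vt : words-to-concepts, concept strengths, concepts-to-documents.
import numpy as np

# toy term-document matrix: rows = terms, columns = documents
X = np.array([[1.0, 1.0, 0.0],   # "car"
              [1.0, 1.0, 0.0],   # "auto" -- synonym: same usage pattern as "car"
              [0.0, 0.0, 1.0]])  # "databases"

U, s, Vt = np.linalg.svd(X, full_matrices=False)

# LSA truncation: keep only the top k concepts.
k = 2
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print(np.round(s, 3))       # concept strengths, sorted descending
print(np.allclose(X, X_k))  # True: two concepts already explain this toy matrix
```

The two synonym rows get identical coordinates in U, which is exactly how the decomposition merges them into one concept.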
Intuition: these concepts don't contribute much to term weights.

LSA tends to do best on retrieval metrics with ~100 factors.
PLSI in practice

Start with term-document matrix X = p(w|d)

Find:
  topics p(w|z)
  document coordinates p(z|d)

by maximizing the probability p(w|d) that they produced the observed X, aka the likelihood:

  p(w|d) = Σ_{z ∈ Z} p(w|z) p(z|d)    for each d ∈ D
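The sum over topics is just a matrix product, which a quick numpy sketch can show. The numbers below are made up for illustration (not fit by EM); each column is a probability distribution, so columns sum to 1.

```python
# p(w|d) = sum over z of p(w|z) * p(z|d), as one matrix multiplication.
import numpy as np

p_w_given_z = np.array([[0.5, 0.0],   # rows: words, columns: topics z
                        [0.5, 0.1],
                        [0.0, 0.9]])
p_z_given_d = np.array([[1.0, 0.3],   # rows: topics z, columns: documents d
                        [0.0, 0.7]])

# The matrix product carries out the sum over z for every (w, d) pair.
p_w_given_d = p_w_given_z @ p_z_given_d

print(p_w_given_d)
print(p_w_given_d.sum(axis=0))  # each document's word distribution sums to 1
```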
Dimensionality reduction

Output of LSA, PLSI, LDA is a vector of much lower dimension for each document. Dimensions are concepts or topics instead of words.

Can measure cosine distance, cluster, etc. in this new space.
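Cosine similarity works the same way in the reduced space, just on shorter vectors. The topic vectors below are made-up examples of what p(z|d) columns from a two-topic model might look like.

```python
# Cosine similarity between documents in a 2-dimensional topic space.
import math

doc_topics = {
    "d1": [0.9, 0.1],   # mostly topic 0
    "d2": [0.8, 0.2],   # mostly topic 0
    "d3": [0.1, 0.9],   # mostly topic 1
}

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(x * x for x in v))
    return dot / norm

print(cosine(doc_topics["d1"], doc_topics["d2"]))  # near 1: same dominant topic
print(cosine(doc_topics["d1"], doc_topics["d3"]))  # much smaller: different topics
```

Two documents with no words in common can now score as similar, as long as they land near each other in topic space.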