
Frontiers of Computational Journalism

Columbia Journalism School
Week 3: Document Topic Modeling
September 24, 2012

Week 3: Document Topic Modeling

Vector Space Model
Cosine Distance
TF-IDF
Topic Models


Vector representation for documents

As before, we want to find numerical features that describe the document. An old idea, going back to proto-search-engine research by Luhn at IBM in the 1950s.

- H. P. Luhn, A Statistical Approach to Mechanized Encoding and Searching of Literary Information, 1957

Turns out features = words works fine


Encode each document as the list of words it contains. Dimensions = vocabulary of the document set. Value on each dimension = number of times the word appears in the document.

Example

D1 = "I like databases"
D2 = "I hate hate databases"

          I  like  hate  databases
    D1 =  1   1     0       1
    D2 =  1   0     2       1

Each row = document vector
All rows = term-document matrix
Individual entry = tf(t,d) = term frequency

Aka "bag of words" model

Throws out word order. E.g. "soldiers shot civilians" and "civilians shot soldiers" are encoded identically.
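To make this concrete, here is a minimal sketch of building a term-document matrix, using scikit-learn's CountVectorizer (my choice of library; any simple tokenizer would do):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["I like databases", "I hate hate databases"]

# Each column is a word in the vocabulary, each row a document vector.
# token_pattern keeps single-letter tokens like "I" (dropped by default).
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
X = vectorizer.fit_transform(docs)          # term-document matrix (sparse)

print(vectorizer.get_feature_names_out())   # ['databases' 'hate' 'i' 'like']
print(X.toarray())                          # [[1 0 1 1]
                                            #  [1 2 1 0]]
```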

Week 3: Document Topic Modeling

Vector Space Model
Cosine Distance
TF-IDF
Topic Models

Distance function

Useful for: clustering documents, finding docs similar to an example, matching a search query.

Basic idea: look for overlapping terms.

Cosine similarity

Given document vectors a, b, define:

    similarity(a, b) = a · b

If each word occurs exactly once in each document, this is equivalent to counting overlapping words.

Note: this is not a distance function, as similarity increases when documents are more similar. (What part of the definition of a distance function is violated here?)

Problem: long documents always win


Let a = "This car runs fast."
Let b = "My car is old. I want a new car, a shiny car."
Let query q = "fast car"

    term    a  b  q
    this    1  0  0
    car     1  3  1
    runs    1  0  0
    fast    1  0  1
    my      0  1  0
    is      0  1  0
    old     0  1  0
    I       0  1  0
    want    0  1  0
    a       0  1  0
    new     0  1  0
    shiny   0  1  0

Problem: long documents always win


similarity(a, q) = 1·1 [car] + 1·1 [fast] = 2
similarity(b, q) = 3·1 [car] + 0·1 [fast] = 3

The longer document is more similar, by virtue of repeating words.

Normalize document vectors

    similarity(a, b) = (a · b) / (|a| |b|)

= cos θ, returns a result in [0, 1]

Normalized query example

    term    a  b  q
    this    1  0  0
    car     1  3  1
    runs    1  0  0
    fast    1  0  1
    my      0  1  0
    is      0  1  0
    old     0  1  0
    I       0  1  0
    want    0  1  0
    a       0  1  0
    new     0  1  0
    shiny   0  1  0

similarity(a, q) = 2 / (√4 · √2) ≈ 0.707
similarity(b, q) = 3 / (√17 · √2) ≈ 0.514

Cosine similarity

    cos θ = similarity(a, b) = (a · b) / (|a| |b|)

Cosine distance (finally)

    dist(a, b) = 1 − (a · b) / (|a| |b|)
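A small numpy sketch of these two formulas, checked against the example above (function names are mine):

```python
import numpy as np

def cosine_similarity(a, b):
    """Dot product of length-normalized vectors: cos of the angle between them."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def cosine_distance(a, b):
    return 1.0 - cosine_similarity(a, b)

# Vectors from the example, in vocabulary order:
# [this, car, runs, fast, my, is, old, I, want, a, new, shiny]
a = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0])
b = np.array([0, 3, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1])
q = np.array([0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0])

print(cosine_similarity(a, q))   # 0.707... -- the short doc now wins
print(cosine_similarity(b, q))   # 0.514...
```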

Week 3: Document Topic Modeling


Vector Space Model Cosine Distance F-idf Topic Models

Problem: common words


We want to weight words that discriminate among documents.
Stopwords: if all documents contain "the", are all documents similar?
Common words: if most documents contain "car", then "car" doesn't tell us much about (contextual) similarity.

Context matters

[Figure: two document sets, "General News" and "Car Reviews"; legend: filled dot = contains "car", open dot = does not contain "car"]

Document Frequency

Idea: de-weight common words. Common = appears in many documents.

    df(t, D) = |{d ∈ D : t ∈ d}| / |D|

document frequency = fraction of docs containing the term

Inverse Document Frequency

Invert (so more common = smaller weight) and take the log:

    idf(t, D) = log( |D| / |{d ∈ D : t ∈ d}| )

TF-IDF

Multiply term frequency by inverse document frequency:

    tfidf(t, d, D) = tf(t, d) · idf(t, D)
                   = n(t, d) · log( |D| / n(t, D) )

n(t, d) = number of times term t appears in doc d
n(t, D) = number of docs in D containing t
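A from-scratch sketch of these definitions in plain Python (helper names are mine; production indexers such as Lucene use tuned variants of the idf formula):

```python
import math

docs = [doc.lower().split() for doc in [
    "this car runs fast",
    "my car is old i want a new car a shiny car",
]]

def tf(t, d):
    return d.count(t)                    # n(t,d): raw term frequency

def idf(t, D):
    n_t_D = sum(1 for d in D if t in d)  # n(t,D): docs containing t
    return math.log(len(D) / n_t_D)

def tfidf(t, d, D):
    return tf(t, d) * idf(t, D)

print(tfidf("car", docs[1], docs))    # 0.0  -- "car" is in every doc
print(tfidf("shiny", docs[1], docs))  # 0.69 -- rarer, so up-weighted
```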

Salton's description of TF-IDF

- from Salton, Wong, Yang, A Vector Space Model for Automatic Indexing, 1975

TF-IDF

[Figure: nj-senator-menendez corpus, Overview sample files; color = human tags generated from TF-IDF clusters]

Cluster Hypothesis

"Documents in the same cluster behave similarly with respect to relevance to information needs."

- Manning, Raghavan, Schütze, Introduction to Information Retrieval

Not really a precise statement, but it is the crucial link between human semantics and mathematical properties. Articulated as early as 1971, it has been shown to hold at web scale, and is widely assumed.

Bag of words + TF-IDF is hard to beat

Practical win: good precision-recall metrics in tests with human-tagged document sets.

Still the dominant text indexing scheme used today (Lucene, FAST, Google). Many variants.

Some, but not much, theory to explain why this works. (E.g., why that particular idf formula? Why doesn't indexing bigrams improve performance?)

Collectively: the vector space document model

Week 3: Document Topic Modeling

Vector Space Model
Cosine Distance
TF-IDF
Topic Models

Problem Statement
Can the computer tell us the topics in a document set? Can the computer organize the documents by topic?

Topic Modeling approaches

1. Find clusters directly in TF-IDF space
   Fast and simple. This is what Overview does.

2. Reduce dimensionality of the space, so each dimension is a topic
   Potentially better results, less sensitive to noise. LSI, PLSI, LDA.

Polysemy and Synonymy

Polysemy: same word, different meanings
  "bank" the institution vs. "bank" an airplane

Synonymy: different words, same meaning
  "buy" vs. "purchase"

The bag of words model cannot differentiate polysemes, and incorrectly splits synonyms.

Contextual information

Word sense disambiguation research has shown that surrounding words can be used to determine sense:

"The bank has my money" vs. "The plane banked sharply left"

Contextual information

Synonym detection: similar usages imply similar meanings:

"I bought a couch" / "We need to buy more stamps"
"I purchased a couch" / "We need to purchase more stamps"

Latent Semantic Analysis

Basic idea: factor the term-document matrix, reconstruct a low-rank approximation from the leading factors.

(what?)

Given term-document matrix X, factor by singular value decomposition:

    X = U D Vᵀ

(what?)

Latent Semantic Analysis

U = words to concepts
D = diagonal matrix of concept strengths in the document set
V = concepts to documents

Polysemy and synonymy

Synonymy = many rows in U (many words) map to the same column of D (one concept)
Polysemy = one row in U (one word) maps to multiple columns of D (different concepts)

Throw out trailing concepts

Technically: a low-rank approximation to X.

Intuition: the trailing concepts don't contribute much to term weights. LSA tends to do best on retrieval metrics with ~100 factors.
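A minimal LSA sketch using numpy's SVD; the random matrix stands in for a real tf-idf term-document matrix, and k = 10 is an arbitrary illustration of the rank cutoff:

```python
import numpy as np

# X: term-document matrix (terms x docs), e.g. tf-idf weighted
rng = np.random.default_rng(0)
X = rng.random((500, 40))          # toy stand-in for a real corpus

U, D, Vt = np.linalg.svd(X, full_matrices=False)

k = 10                             # keep only the k leading "concepts"
X_k = U[:, :k] @ np.diag(D[:k]) @ Vt[:k, :]   # low-rank approximation of X

doc_coords = Vt[:k, :].T           # each doc as a k-dim concept vector
```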

Probabilistic Latent Semantic Indexing

Fixed set of topics Z. For each topic z ∈ Z, there is a distribution of words p(w|z).

Assume each document is generated by:
1. Choose a mixture of topics p(z|d)
2. Choose each word from the topic mixture: p(w|z) p(z|d)

PLSI in practice

Start with the term-document matrix X = p(w|d).

Find topics p(w|z) and document coordinates p(z|d) by maximizing the probability that they produced the observed X, aka the likelihood:

    p(w|d) = Σ_{z ∈ Z} p(w|z) p(z|d),  for each d ∈ D
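A compact sketch of that maximization via EM, the standard fitting procedure for PLSI (the vectorized numpy layout and variable names are mine; real implementations add smoothing and convergence checks):

```python
import numpy as np

def plsi(X, K, iters=50, seed=0):
    """Fit PLSI by EM. X is a term-document count matrix (terms x docs)."""
    rng = np.random.default_rng(seed)
    W, D = X.shape
    p_w_z = rng.random((W, K)); p_w_z /= p_w_z.sum(axis=0)   # p(w|z)
    p_z_d = rng.random((K, D)); p_z_d /= p_z_d.sum(axis=0)   # p(z|d)
    for _ in range(iters):
        # E-step: posterior p(z|w,d) proportional to p(w|z) p(z|d), shape (K, W, D)
        joint = p_w_z.T[:, :, None] * p_z_d[:, None, :]
        post = joint / joint.sum(axis=0, keepdims=True)
        # M-step: re-estimate parameters from expected counts X[w,d] * p(z|w,d)
        counts = X[None, :, :] * post
        p_w_z = counts.sum(axis=2).T
        p_w_z /= p_w_z.sum(axis=0, keepdims=True)
        p_z_d = counts.sum(axis=1)
        p_z_d /= p_z_d.sum(axis=0, keepdims=True)
    return p_w_z, p_z_d
```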

Extracted topics p(w|z)

[Table: one row per topic z, listing the highest-probability words in that topic under p(w|z)]

Latent Dirichlet Allocation

Same underlying idea, but now generate a document by:
1. For each doc d, choose a mixture of topics p(z|d)
2. For each word w in d, choose a topic z from p(z|d)
3. Then choose the word from p(w|z)

Difference: each word in a doc can come from a different topic.
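In practice you would fit LDA with an off-the-shelf library. A minimal sketch with scikit-learn's LatentDirichletAllocation (the toy corpus and topic count are placeholders):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["I like databases", "I hate hate databases",
        "my car is old", "this car runs fast"]

X = CountVectorizer().fit_transform(docs)   # term counts, docs x terms

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)           # one topic mixture p(z|d) per doc

# lda.components_ holds per-topic word weights, proportional to p(w|z)
```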

Dimensionality reduction

The output of LSA, PLSI, or LDA is a much lower-dimensional vector for each document. The dimensions are "concepts" or "topics" instead of words. We can measure cosine distance, cluster, etc. in this new space.
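For instance, pairwise document distances can be computed in topic space just as in word space (a sketch; the doc_topics matrix stands in for output from LSA/PLSI/LDA):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_distances

# rows = documents, columns = topics (e.g. from the LDA sketch above)
doc_topics = np.array([[0.9, 0.1],
                       [0.8, 0.2],
                       [0.1, 0.9]])

print(cosine_distances(doc_topics))   # pairwise cosine distances in topic space
```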

Which method is best?

LSA, PLSI, and LDA all improve performance in information retrieval applications (area under the precision/recall curve). But are they useful for journalism? Not clear. The smoothing they apply may destroy interesting outliers.

Homework: let's find out.
