
Frontiers of Computational Journalism

Columbia Journalism School
Week 3: Document Topic Modeling
September 24, 2012

Week 3: Document Topic Modeling

Vector Space Model
Cosine Distance
TF-IDF
Topic Models


Vector representation for documents

As before, we want to find numerical features that describe the document. An old idea, going back to proto-search-engine research by Luhn at IBM in the 1950s.

- H. P. Luhn, A Statistical Approach to Mechanized Encoding and Searching of Literary Information, 1957

Turns out features = words works fine


Encode each document as the list of words it contains. Dimensions = vocabulary of the document set. Value on each dimension = number of times the word appears in the document.

Example

D1 = "I like databases"
D2 = "I hate hate databases"

          I  like  hate  databases
    D1 =  1   1     0       1
    D2 =  1   0     2       1

Each row = document vector
All rows = term-document matrix
Individual entry = tf(t,d) = term frequency

Aka "bag of words" model

Throws out word order. E.g. "soldiers shot civilians" and "civilians shot soldiers" are encoded identically.
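To make this concrete, here is a minimal sketch of building a term-document matrix, using scikit-learn's CountVectorizer (my choice of library; any simple tokenizer would do):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["I like databases", "I hate hate databases"]

# Each column is a word in the vocabulary, each row a document vector.
# token_pattern keeps single-letter tokens like "I" (dropped by default).
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
X = vectorizer.fit_transform(docs)          # term-document matrix (sparse)

print(vectorizer.get_feature_names_out())   # ['databases' 'hate' 'i' 'like']
print(X.toarray())                          # [[1 0 1 1]
                                            #  [1 2 1 0]]
```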

Week 3: Document Topic Modeling

Vector Space Model
Cosine Distance
TF-IDF
Topic Models

Distance function

Useful for: clustering documents, finding docs similar to an example, matching a search query.

Basic idea: look for overlapping terms.

Cosine similarity

Given document vectors a, b, define:

    similarity(a, b) = a · b

If each word occurs exactly once in each document, this is equivalent to counting overlapping words.

Note: this is not a distance function, as similarity increases when documents are more similar. (What part of the definition of a distance function is violated here?)

Problem: long documents always win


Let a = "This car runs fast."
Let b = "My car is old. I want a new car, a shiny car."
Let query q = "fast car"

    term    a  b  q
    this    1  0  0
    car     1  3  1
    runs    1  0  0
    fast    1  0  1
    my      0  1  0
    is      0  1  0
    old     0  1  0
    I       0  1  0
    want    0  1  0
    a       0  1  0
    new     0  1  0
    shiny   0  1  0

Problem: long documents always win


similarity(a, q) = 1·1 [car] + 1·1 [fast] = 2
similarity(b, q) = 3·1 [car] + 0·1 [fast] = 3

The longer document is more similar, by virtue of repeating words.

Normalize document vectors

    similarity(a, b) = (a · b) / (|a| |b|)

= cos θ, returns a result in [0, 1]

Normalized query example

    term    a  b  q
    this    1  0  0
    car     1  3  1
    runs    1  0  0
    fast    1  0  1
    my      0  1  0
    is      0  1  0
    old     0  1  0
    I       0  1  0
    want    0  1  0
    a       0  1  0
    new     0  1  0
    shiny   0  1  0

similarity(a, q) = 2 / (√4 · √2) ≈ 0.707
similarity(b, q) = 3 / (√17 · √2) ≈ 0.514

Cosine similarity

    cos θ = similarity(a, b) = (a · b) / (|a| |b|)

Cosine distance (finally)

    dist(a, b) = 1 − (a · b) / (|a| |b|)
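A small numpy sketch of these two formulas, checked against the example above (function names are mine):

```python
import numpy as np

def cosine_similarity(a, b):
    """Dot product of length-normalized vectors: cos of the angle between them."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def cosine_distance(a, b):
    return 1.0 - cosine_similarity(a, b)

# Vectors from the example, in vocabulary order:
# [this, car, runs, fast, my, is, old, I, want, a, new, shiny]
a = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0])
b = np.array([0, 3, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1])
q = np.array([0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0])

print(cosine_similarity(a, q))   # 0.707... -- the short doc now wins
print(cosine_similarity(b, q))   # 0.514...
```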

Week 3: Document Topic Modeling


Vector Space Model Cosine Distance F-idf Topic Models

Problem: common words


We want to weight words that discriminate among documents.
Stopwords: if all documents contain "the", are all documents similar?
Common words: if most documents contain "car", then "car" doesn't tell us much about (contextual) similarity.

Context matters

[Figure: two document sets, "General News" and "Car Reviews"; legend: filled dot = contains "car", open dot = does not contain "car"]

Document Frequency

Idea: de-weight common words. Common = appears in many documents.

    df(t, D) = |{d ∈ D : t ∈ d}| / |D|

document frequency = fraction of docs containing the term

Inverse Document Frequency

Invert (so more common = smaller weight) and take the log:

    idf(t, D) = log( |D| / |{d ∈ D : t ∈ d}| )

TF-IDF

Multiply term frequency by inverse document frequency:

    tfidf(t, d, D) = tf(t, d) · idf(t, D)
                   = n(t, d) · log( |D| / n(t, D) )

n(t, d) = number of times term t appears in doc d
n(t, D) = number of docs in D containing t
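A from-scratch sketch of these definitions in plain Python (helper names are mine; production indexers such as Lucene use tuned variants of the idf formula):

```python
import math

docs = [doc.lower().split() for doc in [
    "this car runs fast",
    "my car is old i want a new car a shiny car",
]]

def tf(t, d):
    return d.count(t)                    # n(t,d): raw term frequency

def idf(t, D):
    n_t_D = sum(1 for d in D if t in d)  # n(t,D): docs containing t
    return math.log(len(D) / n_t_D)

def tfidf(t, d, D):
    return tf(t, d) * idf(t, D)

print(tfidf("car", docs[1], docs))    # 0.0  -- "car" is in every doc
print(tfidf("shiny", docs[1], docs))  # 0.69 -- rarer, so up-weighted
```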

Salton's description of TF-IDF

- from Salton, Wong, Yang, A Vector Space Model for Automatic Indexing, 1975

TF-IDF

[Figure: nj-senator-menendez corpus, Overview sample files; color = human tags generated from TF-IDF clusters]

Cluster Hypothesis

"Documents in the same cluster behave similarly with respect to relevance to information needs."

- Manning, Raghavan, Schütze, Introduction to Information Retrieval

Not really a precise statement, but it is the crucial link between human semantics and mathematical properties. Articulated as early as 1971, it has been shown to hold at web scale, and is widely assumed.

Bag of words + TF-IDF is hard to beat

Practical win: good precision-recall metrics in tests with human-tagged document sets.

Still the dominant text indexing scheme used today (Lucene, FAST, Google). Many variants.

Some, but not much, theory to explain why this works. (E.g., why that particular idf formula? Why doesn't indexing bigrams improve performance?)

Collectively: the vector space document model

Week 3: Document Topic Modeling

Vector Space Model
Cosine Distance
TF-IDF
Topic Models

Problem Statement
Can the computer tell us the topics in a document set? Can the computer organize the documents by topic?

Topic Modeling approaches

1. Find clusters directly in TF-IDF space
   Fast and simple. This is what Overview does.

2. Reduce dimensionality of the space, so each dimension is a topic
   Potentially better results, less sensitive to noise. LSI, PLSI, LDA.

Polysemy and Synonymy

Polysemy: same word, different meanings
  "bank" the institution vs. "bank" an airplane

Synonymy: different words, same meaning
  "buy" vs. "purchase"

The bag of words model cannot differentiate polysemes, and incorrectly splits synonyms.

Contextual information

Word sense disambiguation research has shown that surrounding words can be used to determine sense:

"The bank has my money" vs. "The plane banked sharply left"

Contextual information

Synonym detection: similar usages imply similar meanings:

"I bought a couch" / "We need to buy more stamps"
"I purchased a couch" / "We need to purchase more stamps"

Latent Semantic Analysis

Basic idea: factor the term-document matrix, reconstruct a low-rank approximation from the leading factors.

(what?)

Given term-document matrix X, factor by singular value decomposition:

    X = U D Vᵀ

(what?)

Latent Semantic Analysis

U = words to concepts
D = diagonal matrix of concept strengths in the document set
V = concepts to documents

Polysemy and synonymy

Synonymy = many rows in U (many words) map to the same column of D (one concept)
Polysemy = one row in U (one word) maps to multiple columns of D (different concepts)

Throw out trailing concepts

Technically: a low-rank approximation to X.

Intuition: the trailing concepts don't contribute much to term weights. LSA tends to do best on retrieval metrics with ~100 factors.
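A minimal LSA sketch using numpy's SVD; the random matrix stands in for a real tf-idf term-document matrix, and k = 10 is an arbitrary illustration of the rank cutoff:

```python
import numpy as np

# X: term-document matrix (terms x docs), e.g. tf-idf weighted
rng = np.random.default_rng(0)
X = rng.random((500, 40))          # toy stand-in for a real corpus

U, D, Vt = np.linalg.svd(X, full_matrices=False)

k = 10                             # keep only the k leading "concepts"
X_k = U[:, :k] @ np.diag(D[:k]) @ Vt[:k, :]   # low-rank approximation of X

doc_coords = Vt[:k, :].T           # each doc as a k-dim concept vector
```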

Probabilistic Latent Semantic Indexing

Fixed set of topics Z. For each topic z ∈ Z, there is a distribution of words p(w|z).

Assume each document is generated by:
1. Choose a mixture of topics p(z|d)
2. Choose each word from the topic mixture: p(w|z) p(z|d)

PLSI in practice

Start with the term-document matrix X = p(w|d).

Find topics p(w|z) and document coordinates p(z|d) by maximizing the probability that they produced the observed X, aka the likelihood:

    p(w|d) = Σ_{z ∈ Z} p(w|z) p(z|d),  for each d ∈ D
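A compact sketch of that maximization via EM, the standard fitting procedure for PLSI (the vectorized numpy layout and variable names are mine; real implementations add smoothing and convergence checks):

```python
import numpy as np

def plsi(X, K, iters=50, seed=0):
    """Fit PLSI by EM. X is a term-document count matrix (terms x docs)."""
    rng = np.random.default_rng(seed)
    W, D = X.shape
    p_w_z = rng.random((W, K)); p_w_z /= p_w_z.sum(axis=0)   # p(w|z)
    p_z_d = rng.random((K, D)); p_z_d /= p_z_d.sum(axis=0)   # p(z|d)
    for _ in range(iters):
        # E-step: posterior p(z|w,d) proportional to p(w|z) p(z|d), shape (K, W, D)
        joint = p_w_z.T[:, :, None] * p_z_d[:, None, :]
        post = joint / joint.sum(axis=0, keepdims=True)
        # M-step: re-estimate parameters from expected counts X[w,d] * p(z|w,d)
        counts = X[None, :, :] * post
        p_w_z = counts.sum(axis=2).T
        p_w_z /= p_w_z.sum(axis=0, keepdims=True)
        p_z_d = counts.sum(axis=1)
        p_z_d /= p_z_d.sum(axis=0, keepdims=True)
    return p_w_z, p_z_d
```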

Extracted topics p(w|z)

[Table: one row per topic z, listing the highest-probability words in that topic under p(w|z)]

Latent Dirichlet Allocation

Same underlying idea, but now generate a document by:
1. For each doc d, choose a mixture of topics p(z|d)
2. For each word w in d, choose a topic z from p(z|d)
3. Then choose the word from p(w|z)

Difference: each word in a doc can come from a different topic.
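In practice you would fit LDA with an off-the-shelf library. A minimal sketch with scikit-learn's LatentDirichletAllocation (the toy corpus and topic count are placeholders):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["I like databases", "I hate hate databases",
        "my car is old", "this car runs fast"]

X = CountVectorizer().fit_transform(docs)   # term counts, docs x terms

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)           # one topic mixture p(z|d) per doc

# lda.components_ holds per-topic word weights, proportional to p(w|z)
```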

Dimensionality reduction

The output of LSA, PLSI, or LDA is a much lower-dimensional vector for each document. The dimensions are "concepts" or "topics" instead of words. We can measure cosine distance, cluster, etc. in this new space.
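For instance, pairwise document distances can be computed in topic space just as in word space (a sketch; the doc_topics matrix stands in for output from LSA/PLSI/LDA):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_distances

# rows = documents, columns = topics (e.g. from the LDA sketch above)
doc_topics = np.array([[0.9, 0.1],
                       [0.8, 0.2],
                       [0.1, 0.9]])

print(cosine_distances(doc_topics))   # pairwise cosine distances in topic space
```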

Which method is best?

LSA, PLSI, and LDA all improve performance in information retrieval applications (area under the precision/recall curve). But are they useful for journalism? Not clear. The smoothing they apply may destroy interesting outliers.

Homework: let's find out.
