You are on page 1of 3

Text mining: Application and Theory

Introduction
Keywords provide a compact representation of a documents content.
It represents in condensed form the essential content of a document
Keywords are widely used in queries within information retrieval
systems that are easy to:
Define
Revise
Remember
Share
Also applied to improve the functionality of IR systems
Enrich the presentation of search results
Keyword Extraction Methods
Most documents dont have assigned keywords
Research has focused on methods to automatically extract
keywords from documents as an aid either to suggest keywords for a professional
indexer or to generate summeary features for documents that would otherwise be
innaccessible
Early approaches
Evaluating corpus-oriented statistics of individual words
Later on, keyword extrractoin research applies these
methods to select discriminating words as keywords for individual
documents
However this method only work on a single word
Further limits the measurement of
statiscally discriminating words because single words are often
used in multiple and different context
As a result, they focus on keyword extraction on individual
documents regardless of the current state of the corpus
This will provide independent document features,
enablling additional analytic methods
This can applie to many other contexts in order to
enrich IR systems and analysis tools
This will involve
Machine learning
Natural language processing
POS tags
Statistical methods
Rapid Automatic keyword Extraction (RAKE)
Motivation
To develop a keyword extraction method that is

extremely efficient,
operates on individual documents to enable
application to dynamic collection
Easily applied to new domains
Operates well on multiple types of documents
Particululary those that dont follo
specific grammer conventions
RAKE is based on observation that keywords contains multiple words but rarely
contain standard punctuation or stop words or others with minimal lexical meaning
The input parameters for RAKE comprise a list of stop words, a set
of phrase delimiters, and a set ogf word delimiters.
Uses stop words and phrase delimiters to partition
the document text into candidate keywords, which are sequences of
content words as they occur in the text
Co-occurances help us identify word
co-occurance w/o the application of an arbitrarily sized sliding
window
Candidate keywords
RAKE starts by parsing text in a document into a
set of candidate keywords
The text is split itno an array of
words by specificed word delimiter.
Then the array is split into sequences
of contgous words at phrase delimiters and stop word positions
Words witin a
sequence are assifgned the same position in the text and
together are considered a candiditate keyword
Keyword Scores
Each candidate keyword found is recoreded as a
score
It uses graphs to calculate word
score. Its based on:
Word frequency
(freq(w))
Word degree (deg(w))
Ration of degree to
frequency (deg(w)/freq(w))
Adjoining Keywords
Extracted keywords dont contain interiror stop
words
RAKE splits candidate keywords by
stop words

Axis of evil
Look for pairs of keywords that
adjoin one another at least twice the same docuiment and in the
same order.
A new candidate
keyword is then created as a combination of those
keywords and their interior stop words
Scoe is
added to its member keyboard scores
Extracted Keywords
Once candidate keywords are scored, the top T
scoring candidates are selected as keywords for the document
Benchmark Evaluation