Agenda
• The KNIME Textprocessing Extension
– Preliminaries
– Philosophy & Usage
• Classification of Tripadvisor Reviews
– Tripadvisor data
– Classification of reviews
http://tech.knime.org/knime-text-processing
• Documentation
• Examples
• Forum
• White Papers
• Settings (knime.ini)
– Set maximum memory for KNIME
– -Xmx3G
Prepare KNIME
• Go to KNIME directory
• Change knime.ini file (optional)
– -Xmx3G
• Start KNIME
• Install Textprocessing Extension
– (or better have it already installed)
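The memory setting can also be changed with a small script. The following is a minimal sketch, assuming knime.ini is a plain-text file with one JVM option per line; the function name set_max_heap is hypothetical and not part of KNIME.

```python
def set_max_heap(ini_text: str, size: str = "3G") -> str:
    """Return ini_text with its -Xmx line set to the given size.

    If no -Xmx line exists, one is appended. This assumes JVM options
    sit at the end of the file, which should be checked by hand.
    """
    lines = ini_text.splitlines()
    new_option = f"-Xmx{size}"
    replaced = False
    for i, line in enumerate(lines):
        if line.strip().startswith("-Xmx"):
            lines[i] = new_option  # overwrite the existing heap limit
            replaced = True
    if not replaced:
        lines.append(new_option)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Illustrative input only; a real knime.ini contains more options.
    print(set_max_heap("-vmargs\n-Xmx1024m\n"))
```

Editing the file by hand works just as well; back up knime.ini first in either case.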
• Document Cell
– Encapsulates a document
• Title, sentences, terms, words
• Authors, category, source
• Generic meta data (key, value pairs)
• Term Cell
– Encapsulates a term
• Words, tags
• Document table
– List of documents
• Bag of words
– Tuples of documents and terms
• Document vectors
– Numerical representations of documents
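To make these data types concrete, here is a rough Python analogy. It is a sketch only: the class and field names are assumptions chosen to mirror the bullets above, not the KNIME Java API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Term:
    """A term: one or more words plus any tags assigned to it."""
    words: tuple
    tags: tuple = ()


@dataclass
class Document:
    """A document with its terms and descriptive meta data."""
    title: str
    terms: list
    authors: list = field(default_factory=list)
    category: str = ""
    source: str = ""
    meta: dict = field(default_factory=dict)  # generic key/value pairs


def bag_of_words(documents):
    """Return (document title, term) tuples, one per distinct term per document."""
    rows = []
    for doc in documents:
        seen = set()
        for term in doc.terms:
            if term not in seen:
                seen.add(term)
                rows.append((doc.title, term))
    return rows


if __name__ == "__main__":
    pizza = Term(("pizza",), ("NN",))
    pasta = Term(("pasta",), ("NN",))
    review = Document("Lovely dinner", [pizza, pasta, pizza], category="italian")
    for row in bag_of_words([review]):
        print(row)
```

A document table is then simply a list of Document objects, and the bag of words is the list of tuples returned by bag_of_words.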
Review fields: Rating, Title, Author, Fulltext
Goal:
• Build a classifier to distinguish between Chinese and Italian restaurants, based on their reviews.
• Question to answer: is a review about an Italian or a Chinese restaurant?
Reading
• Read Tripadvisor data (.table file)
• Filter rows with missing restaurant value
• Convert strings to documents
• Filter all but the document column
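Outside KNIME, the reading steps can be approximated as follows. This is a sketch under stated assumptions: the rows are plain dictionaries, and the keys "title", "author", "fulltext" and "restaurant" are hypothetical column names, not taken from the actual .table file.

```python
def to_documents(rows):
    """Drop rows without a restaurant value and keep only the document."""
    documents = []
    for row in rows:
        if not row.get("restaurant"):
            continue  # missing restaurant value: row is filtered out
        documents.append({
            "title": row.get("title", ""),
            "author": row.get("author", ""),
            "text": row.get("fulltext", ""),
            "category": row["restaurant"],  # later used as the class
        })
    return documents


if __name__ == "__main__":
    # Made-up rows for illustration only.
    rows = [
        {"title": "Great pasta", "author": "a", "fulltext": "Fresh pasta.", "restaurant": "italian"},
        {"title": "Unknown", "author": "b", "fulltext": "Okay food.", "restaurant": None},
    ]
    print(to_documents(rows))
```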
Enrichment / Tagging
• Apply POS Tagger node
• Use Bag of Words node to inspect tagging result
Preprocessing
• Filter
– Numbers
– Punctuation marks
– Stop Words
• Convert to lower case
• Stemming
• Keep only nouns, verbs, adjectives
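The preprocessing chain above can be sketched in plain Python. Several parts are assumptions: the input is a list of (word, tag) pairs with Penn Treebank style tags, the stop word list is a tiny illustrative one, and naive_stem is a crude stand-in for a real stemmer such as Porter's, not the algorithm used by any KNIME node.

```python
import string

STOP_WORDS = {"the", "a", "an", "and", "was", "is", "of", "to", "in"}  # illustrative only
KEPT_TAG_PREFIXES = ("NN", "VB", "JJ")  # nouns, verbs, adjectives (assumed tag set)
PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)


def naive_stem(word):
    """Crude suffix stripping; a placeholder for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(tagged_terms):
    """Apply the filters in the order listed above to (word, tag) pairs."""
    result = []
    for word, tag in tagged_terms:
        if word.replace(".", "").replace(",", "").isdigit():
            continue  # number filter
        word = word.translate(PUNCTUATION_TABLE)
        if not word:
            continue  # term consisted of punctuation only
        if word.lower() in STOP_WORDS:
            continue  # stop word filter
        word = naive_stem(word.lower())  # lower case, then stem
        if tag.startswith(KEPT_TAG_PREFIXES):
            result.append((word, tag))  # keep nouns, verbs, adjectives
    return result


if __name__ == "__main__":
    sentence = [("The", "DT"), ("waiters", "NNS"), ("served", "VBD"),
                ("2", "CD"), ("tasty", "JJ"), ("pizzas", "NNS"), (".", ".")]
    print(preprocess(sentence))
```

The tag filter relies on tags assigned before preprocessing, which is why enrichment comes first in the workflow.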
Transformation
• Transform to bag of words
• Compute TF value for terms
• Transform to document vectors
• Extract category (class) value
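The transformation steps can be illustrated with a short sketch. It assumes relative term frequency (term count divided by the number of terms in the document), which is one common definition of TF, and it represents each document as a hypothetical (category, terms) pair.

```python
from collections import Counter


def term_frequencies(terms):
    """Relative TF: count of each term divided by the document length."""
    if not terms:
        return {}
    counts = Counter(terms)
    total = len(terms)
    return {term: count / total for term, count in counts.items()}


def to_document_vectors(documents):
    """Turn (category, terms) pairs into TF vectors plus class labels.

    Returns (vocabulary, vectors, classes); every vector has one
    entry per vocabulary term, in vocabulary order.
    """
    vocabulary = sorted({t for _, terms in documents for t in terms})
    vectors, classes = [], []
    for category, terms in documents:
        tf = term_frequencies(terms)
        vectors.append([tf.get(t, 0.0) for t in vocabulary])
        classes.append(category)  # extracted class value
    return vocabulary, vectors, classes


if __name__ == "__main__":
    docs = [("italian", ["pizza", "pasta", "pizza", "wine"]),
            ("chinese", ["noodle", "rice"])]
    vocabulary, vectors, classes = to_document_vectors(docs)
    print(vocabulary)
    for vector, label in zip(vectors, classes):
        print(label, vector)
```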
Classification
• Append color based on class
• Partition data into training and test set
• Train decision tree model on training data
• Apply decision tree model to test data
• Score model, measure accuracy
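For readers who want to see the partition, train, apply and score steps spelled out, here is a dependency-free sketch. It is not the KNIME decision tree learner: it trains a depth-one tree (a decision stump) and picks the split by training error rather than by a criterion such as Gini index or gain ratio. The coloring step is purely visual and is omitted. All function names are hypothetical.

```python
import random
from collections import Counter


def partition(rows, train_fraction=0.7, seed=0):
    """Shuffle rows reproducibly and split into training and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def majority(labels):
    """Most frequent label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def train_stump(rows):
    """Choose the feature whose 'value > 0' split makes the fewest training errors.

    rows: list of (vector, label). Returns (feature index, label when
    the feature is present, label when it is absent).
    """
    default = majority([label for _, label in rows])
    best = None
    for j in range(len(rows[0][0])):
        present = [label for vec, label in rows if vec[j] > 0]
        absent = [label for vec, label in rows if vec[j] <= 0]
        label_present = majority(present) if present else default
        label_absent = majority(absent) if absent else default
        errors = (sum(l != label_present for l in present)
                  + sum(l != label_absent for l in absent))
        if best is None or errors < best[0]:
            best = (errors, j, label_present, label_absent)
    return best[1:]


def predict(stump, vector):
    j, label_present, label_absent = stump
    return label_present if vector[j] > 0 else label_absent


def accuracy(stump, rows):
    """Fraction of rows whose predicted label equals the true label."""
    return sum(predict(stump, vec) == label for vec, label in rows) / len(rows)


if __name__ == "__main__":
    # Made-up TF vectors over the features [pizza, noodle].
    data = [([0.5, 0.0], "italian"), ([0.3, 0.0], "italian"), ([0.2, 0.1], "italian"),
            ([0.0, 0.4], "chinese"), ([0.0, 0.6], "chinese"), ([0.0, 0.2], "chinese")]
    train, test = partition(data)
    stump = train_stump(train)
    print("accuracy on test set:", accuracy(stump, test))
```

Scoring on the held-out test set rather than on the training set is what makes the reported accuracy an estimate of performance on unseen reviews.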
Questions
• http://tech.knime.org/forum
• Kilian.Thiel@knime.com
Follow us
• Twitter: @KNIME
• LinkedIn: https://www.linkedin.com/groups?gid=2212172