
Boston KNIME Users

Text Processing Applications


Kilian Thiel
KNIME

Copyright © 2014 KNIME.com AG


Agenda

• KNIME Crash Course


• Text Mining with KNIME: Mining Tripadvisor Data
• Text Mining with KNIME: Mining Amazon Reviews
(Anil Tarachandani)
• Networking Apero



Text Mining with KNIME: Mining Tripadvisor Data

Agenda
• The KNIME Textprocessing Extension
– Preliminaries
– Philosophy & Usage
• Classification of Tripadvisor Reviews
– Tripadvisor data
– Classification of reviews



Resources

http://tech.knime.org/knime-text-processing
• Documentation
• Examples
• Forum
• White Papers



Installation

[Figure: installation screenshots, steps 1.) and 2.)]



Requirements

Requirements to import and run demo workflows


• KNIME 2.10
• Textprocessing (labs)
• Distance Matrix (KNIME)
• Palladian (Community)



Tips

• Settings (knime.ini)
– Set the maximum memory (Java heap size) for KNIME
– For example: -Xmx3G



Demo

Prepare KNIME
• Go to KNIME directory
• Change knime.ini file (optional)
– -Xmx3G
• Start KNIME
• Install Textprocessing Extension
– (or better have it already installed)



Philosophy

[Figure: text such as "… perhaps your name is Rumpelstiltskin[Person]? …" is converted into bit vectors (111010011, 011001000, 001110110), which feed classification, clustering, and visualization.]



Additional Data Types

• Document Cell
– Encapsulates a document
• Title, sentences, terms, words
• Authors, category, source
• Generic meta data (key, value pairs)
• Term Cell
– Encapsulates a term
• Words, tags
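
As a rough illustration of what these two cell types hold, the sketch below models them as plain Python data classes. The class and field names are assumptions chosen to mirror the bullets above; this is not the KNIME API, and the sample values are made up.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Term:
    """A term: one or more words plus the tags attached to them."""
    words: Tuple[str, ...]
    tags: Tuple[str, ...] = ()

    def text(self) -> str:
        return " ".join(self.words)


@dataclass
class Document:
    """A document: text content plus meta data."""
    title: str
    sentences: List[List[Term]]                        # each sentence is a list of terms
    authors: List[str] = field(default_factory=list)
    category: str = ""
    source: str = ""
    meta: Dict[str, str] = field(default_factory=dict)  # generic key/value pairs

    def terms(self) -> List[Term]:
        """All terms of the document, in reading order."""
        return [term for sentence in self.sentences for term in sentence]


# Made-up example document.
doc = Document(
    title="Great pasta",
    sentences=[[Term(("Great",)), Term(("pasta",), ("NN",))]],
    authors=["anonymous"],
    category="italian",
    source="Tripadvisor",
)
print([t.text() for t in doc.terms()])
```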



Data Table Structures

• Document table
– List of documents

• Bag of words
– Tuples of documents
and terms

• Document vectors
– Numerical
representations of
documents
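
The three table structures can be pictured with a small Python sketch. The two documents are invented, and plain lists and dicts stand in for KNIME data tables; binary (0/1) vector entries are just one possible numerical representation.

```python
# Made-up documents, keyed by a document id.
docs = {"d1": "pizza pasta pizza", "d2": "noodles rice"}

# Document table: one row per document.
document_table = [(doc_id,) for doc_id in docs]

# Bag of words: one row per (document, term) pair, each pair listed once.
bag_of_words = [
    (doc_id, term)
    for doc_id, text in docs.items()
    for term in sorted(set(text.split()))
]

# Document vectors: one row per document, one numeric column per term.
vocabulary = sorted({term for _, term in bag_of_words})
document_vectors = {
    doc_id: [1 if (doc_id, term) in bag_of_words else 0 for term in vocabulary]
    for doc_id in docs
}

print(vocabulary)
print(document_vectors)
```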



Philosophy and Data Table Structures

[Figure: Documents → (Enrichment) → Documents → (Preprocessing) → Documents → Bow → Vectors (1110, 1001)]



Tripadvisor Data

[Figure: example Tripadvisor review with its fields labeled: Rating, Title, Author, Fulltext]



Tripadvisor Data

Reviews about Italian and Chinese restaurants in Boston
• Chinese: 272
• Italian: 268



Tripadvisor Data

Goal:
• Build a classifier that distinguishes between Chinese and
Italian restaurants, based on their reviews.
• Question to answer: is a review about an Italian or a
Chinese restaurant?





1.) Reading

Read/Parse textual data



Demo

Reading
• Read Tripadvisor data (.table file)
• Filter rows with missing restaurant value
• Convert strings to documents
• Filter all but the document column
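
Outside KNIME, the reading steps above could look roughly like the following Python sketch. It assumes the data is available as CSV text; the column names (restaurant, rating, title, author, fulltext) and the three rows are invented stand-ins, since the real demo reads a KNIME .table file.

```python
import csv
import io

# Made-up stand-in for the Tripadvisor table.
RAW = """restaurant,rating,title,author,fulltext
italian,5,Great pasta,alice,The lasagne was excellent.
,3,No label,bob,This row has no restaurant value.
chinese,4,Tasty noodles,carol,Good dumplings and fast service.
"""


def read_rows(text):
    """Parse CSV text into a list of dicts, one per review."""
    return list(csv.DictReader(io.StringIO(text)))


def to_document(row):
    """Combine the string columns of one row into a single document record."""
    return {
        "title": row["title"],
        "text": row["fulltext"],
        "author": row["author"],
        "category": row["restaurant"],
    }


rows = read_rows(RAW)
labeled = [r for r in rows if r["restaurant"]]   # drop rows with a missing restaurant value
documents = [to_document(r) for r in labeled]    # strings -> documents; only documents are kept

print(len(rows), len(documents))
```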



2.) Enrichment

Enrich documents with semantic information



Demo

Enrichment / Tagging
• Apply POS Tagger node
• Use Bag of Words node to inspect tagging result
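
To make the idea of tagging concrete, here is a toy Python sketch. It tags tokens by lookup in a tiny hand-written lexicon, which is only a stand-in: a real POS tagger relies on a trained model. The Penn Treebank-style tag names, the lexicon, and the sample sentence are all assumptions for illustration.

```python
# Tiny hand-written lexicon (assumption); a real POS tagger uses a trained model.
LEXICON = {"the": "DT", "pasta": "NN", "was": "VBD", "delicious": "JJ"}


def pos_tag(tokens, lexicon=LEXICON, default="NN"):
    """Attach a tag to each token by dictionary lookup, falling back to `default`."""
    return [(tok, lexicon.get(tok.lower(), default)) for tok in tokens]


def bag_of_words(doc_id, tagged):
    """List each distinct (term, tag) pair of a document once, for inspection."""
    seen = []
    for pair in tagged:
        if pair not in seen:
            seen.append(pair)
    return [(doc_id, term, tag) for term, tag in seen]


tagged = pos_tag("The pasta was delicious".split())
for row in bag_of_words("review-1", tagged):
    print(row)
```

Listing the (document, term, tag) rows plays the same role as inspecting the Bag of Words output: it shows at a glance which tag each term received.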



3.) Preprocessing

Preprocess documents and filter words



Demo

Preprocessing
• Filter
– Numbers
– Punctuation marks
– Stop Words
• Convert to lower case
• Stemming
• Keep only nouns, verbs, adjectives
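
The preprocessing chain above can be sketched in plain Python as follows. The stop word list is a tiny illustrative sample, the suffix-stripping function is a crude stand-in for a real stemmer (for example Porter), and the Penn Treebank-style tag prefixes for nouns, verbs, and adjectives are an assumption about the tag set.

```python
import string

STOP_WORDS = {"the", "a", "and", "was", "is", "of"}   # tiny illustrative list
KEEP_TAG_PREFIXES = ("NN", "VB", "JJ")                # nouns, verbs, adjectives


def naive_stem(word):
    """Crude suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(tagged_terms):
    """Apply the filters in the order listed on the slide to (word, tag) pairs."""
    result = []
    for word, tag in tagged_terms:
        if any(ch.isdigit() for ch in word):
            continue                                  # filter numbers
        word = word.strip(string.punctuation)         # remove punctuation marks
        if not word:
            continue
        if word.lower() in STOP_WORDS:
            continue                                  # filter stop words
        word = naive_stem(word.lower())               # lower case, then stem
        if tag.startswith(KEEP_TAG_PREFIXES):
            result.append((word, tag))                # keep nouns, verbs, adjectives
    return result


sample = [("The", "DT"), ("2", "CD"), ("waiters", "NNS"), ("served", "VBD"),
          ("fresh", "JJ"), ("pasta", "NN"), (",", ","), ("quickly", "RB")]
print(preprocess(sample))
```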



4.) Transformation

Creation of numerical representation of documents



Demo

Transformation
• Transform to bag of words
• Compute TF (term frequency) value for terms
• Transform to document vectors
• Extract category (class) value
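
The transformation steps can be illustrated with a short Python sketch. The two preprocessed documents are made up, and relative term frequency (count divided by document length) is assumed here; the slides do not say whether the demo uses relative or absolute TF.

```python
from collections import Counter

# Made-up preprocessed documents: (category, list of terms).
DOCS = [
    ("italian", ["pasta", "pizza", "pasta", "wine"]),
    ("chinese", ["noodl", "rice", "noodl", "noodl"]),
]


def bag_of_words(docs):
    """One (doc index, term, count) row per distinct term of each document."""
    return [
        (i, term, count)
        for i, (_, terms) in enumerate(docs)
        for term, count in sorted(Counter(terms).items())
    ]


def relative_tf(docs, bow):
    """Relative term frequency: term count divided by the document's length."""
    return [(i, term, count / len(docs[i][1])) for i, term, count in bow]


def document_vectors(docs, tf_rows):
    """One vector per document with one TF column per vocabulary term."""
    vocabulary = sorted({term for _, term, _ in tf_rows})
    vectors = [[0.0] * len(vocabulary) for _ in docs]
    for i, term, tf in tf_rows:
        vectors[i][vocabulary.index(term)] = tf
    return vocabulary, vectors


bow = bag_of_words(DOCS)
tf_rows = relative_tf(DOCS, bow)
vocabulary, vectors = document_vectors(DOCS, tf_rows)
labels = [category for category, _ in DOCS]   # extracted class column

print(vocabulary)
print(vectors, labels)
```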



5.) Classification

Training of a model (decision tree) and scoring



Demo

Classification
• Append color based on class
• Partition data into training and test set
• Train decision tree model on training data
• Apply decision tree model to test data
• Score model, measure accuracy
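
For readers who want to see the partition, train, apply, and score steps spelled out, here is a minimal pure-Python sketch. It uses a small Gini-based decision tree written for this illustration, not the algorithm of the KNIME Decision Tree Learner node, and the feature vectors (TF of two hypothetical terms) are made up. The color step is omitted because it only affects visualization. Class labels must not be tuples, since tuples represent inner nodes.

```python
import random
from collections import Counter


def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(X, y):
    """Return the (feature, threshold) that lowers weighted Gini the most, or None."""
    best, best_score = None, gini(y)
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if score < best_score:
                best, best_score = (f, t), score
    return best


def train_tree(X, y, depth=0, max_depth=3):
    split = best_split(X, y) if depth < max_depth else None
    if split is None:
        return Counter(y).most_common(1)[0][0]        # leaf: majority class
    f, t = split
    lo = [i for i, row in enumerate(X) if row[f] <= t]
    hi = [i for i, row in enumerate(X) if row[f] > t]
    return (f, t,
            train_tree([X[i] for i in lo], [y[i] for i in lo], depth + 1, max_depth),
            train_tree([X[i] for i in hi], [y[i] for i in hi], depth + 1, max_depth))


def predict(tree, row):
    while isinstance(tree, tuple):                    # inner node: (feature, threshold, left, right)
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree


def accuracy(tree, X, y):
    return sum(predict(tree, row) == lab for row, lab in zip(X, y)) / len(y)


# Made-up document vectors: [TF of "pasta", TF of "noodl"].
X = [[0.5, 0.0], [0.4, 0.0], [0.3, 0.1], [0.6, 0.0],
     [0.0, 0.5], [0.0, 0.4], [0.0, 0.3], [0.0, 0.6]]
y = ["italian"] * 4 + ["chinese"] * 4

# Partition: shuffle, then 75% training and 25% test.
indices = list(range(len(X)))
random.Random(42).shuffle(indices)
cut = int(0.75 * len(indices))
train_idx, test_idx = indices[:cut], indices[cut:]

tree = train_tree([X[i] for i in train_idx], [y[i] for i in train_idx])
test_accuracy = accuracy(tree, [X[i] for i in test_idx], [y[i] for i in test_idx])
print(tree, test_accuracy)
```

The toy data is cleanly separable, so the score says nothing about how well a tree would do on the real reviews.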



Additional Workflows

• Multi Word Tagging


– Detection of frequent n-grams
– Creation of a dictionary from n-grams
– Applying Dictionary Tagger
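
The multi-word tagging idea can be sketched in a few lines of Python: count n-grams, keep the frequent ones as a dictionary, then merge matching word sequences into single tagged terms. The sample sentences, the frequency threshold, and the tag name MULTI_WORD are assumptions made for this illustration.

```python
from collections import Counter


def frequent_ngrams(token_lists, n=2, min_count=2):
    """Count n-grams over all documents; keep those seen at least min_count times."""
    counts = Counter(
        tuple(tokens[i:i + n])
        for tokens in token_lists
        for i in range(len(tokens) - n + 1)
    )
    return {gram for gram, c in counts.items() if c >= min_count}


def dictionary_tag(tokens, dictionary, n=2, tag="MULTI_WORD"):
    """Merge dictionary n-grams into single tagged terms, scanning left to right."""
    out, i = [], 0
    while i < len(tokens):
        gram = tuple(tokens[i:i + n])
        if gram in dictionary:
            out.append((" ".join(gram), tag))
            i += n
        else:
            out.append((tokens[i], None))             # untagged single word
            i += 1
    return out


# Made-up, already tokenized documents.
docs = [
    "the fried rice was great".split(),
    "fried rice and hot tea".split(),
    "hot tea with dim sum".split(),
]
dictionary = frequent_ngrams(docs)
tagged = dictionary_tag(docs[1], dictionary)
print(sorted(dictionary))
print(tagged)
```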

• Classification with Multi Words


• Clustering of documents



Thank You

Questions
• http://tech.knime.org/forum
• Kilian.Thiel@knime.com
Follow us
• Twitter: @KNIME
• LinkedIn: https://www.linkedin.com/groups?gid=2212172
• KNIME Blog: http://www.knime.org/blog
