
CS771: GROUP-19

Project Report

Sentiment Analysis in Movie Reviews


Submitted in partial fulfillment of
the requirements of the course CS771: Machine Learning Tools and Techniques

Submitted by

Jaimita Bansal (14111017), Dept. of CSE
Zahira Nasrin (14111047), Dept. of CSE
Rajat Kumar (14111028), Dept. of CSE
Sharin KG (14111033), Dept. of CSE

Under the guidance of


Dr. Harish Karnick

Department of Computer Science and Engineering


INDIAN INSTITUTE OF TECHNOLOGY KANPUR
Kanpur, Uttar Pradesh, India 208016


Acknowledgments
We would like to express our sincere appreciation to our
supervisor Prof. Harish Karnick. This project would not have been
possible without his guidance. We owe our knowledge in Machine
Learning to his course CS771: Machine Learning Tools and
Techniques.



Abstract
In this project, we tackle the problem of Sentiment Analysis, which has recently been an active area of research. We experiment with different machine learning approaches, ranging from a simple Bag of Words model to more complex models such as GloVe, with the aim of predicting the sentiment of unseen reviews. We also discuss the performance of the different classifiers used and compare their results to obtain the best classifier configuration for the given dataset. For this, we use the IMDB dataset, which provides 25,000 highly polar movie reviews for training, 25,000 reviews for testing, and an additional 50,000 unlabeled reviews.



Table of Contents
1. Introduction
2. Dataset
3. Pre-processing
4. Approaches
   4.1 Bag of Words
       4.1.1 Results
   4.2 Word2vec
       4.2.1 Results
   4.3 Doc2vec
       4.3.1 Results
   4.4 GloVe
       4.4.1 Results
   4.5 Perceptron
       4.5.1 Results
5. References


1. Introduction
Sentiment analysis and classification is an area of text classification that has recently been receiving a lot of attention from researchers. It is formally defined as "the process of computationally identifying and categorizing opinions expressed in a piece of text, especially in order to determine whether the writer's attitude towards a particular topic, product, etc. is positive, negative, or neutral."

Sentiment analysis involves analyzing textual datasets which may contain opinions (e.g., social media posts, movie reviews) with the aim of classifying the opinions as positive, negative, or neutral. Classification of textual objects according to sentiment is considered more difficult than classification according to content, because people express their emotions in language that is often obscured by sarcasm, ambiguity, and plays on words, all of which can be very misleading for both humans and computers.

Figure 1. Learning Task Diagram



2. Dataset
The labeled dataset consists of 50,000 IMDB movie reviews, specially selected for sentiment analysis. The sentiment of a review is binary: an IMDB rating < 5 results in a sentiment score of 0, and a rating >= 7 results in a sentiment score of 1. No individual movie has more than 30 reviews. The 25,000-review labeled training set does not include any of the same movies as the 25,000-review test set. In addition, another 50,000 IMDB reviews are provided without any rating labels.

The data fields are:

id           Unique ID of each review
sentiment    Sentiment of the review; 1 - positive, 0 - negative
review       Text of the review
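As a rough sketch of how the data can be read with pandas (the file names follow the Kaggle release of this IMDB dataset and are an assumption, not necessarily the exact files we used):

import pandas as pd

# File names follow the Kaggle "Bag of Words Meets Bags of Popcorn" release
# of the IMDB dataset (an assumption about the exact files used).
train = pd.read_csv("labeledTrainData.tsv", header=0, delimiter="\t", quoting=3)
test = pd.read_csv("testData.tsv", header=0, delimiter="\t", quoting=3)
unlabeled = pd.read_csv("unlabeledTrainData.tsv", header=0, delimiter="\t", quoting=3)

print(train.shape)           # (25000, 3): id, sentiment, review
print(train.columns.values)  # ['id' 'sentiment' 'review']

Code Snippet: Loading the dataset (illustrative sketch)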



3. Pre-processing
As the reviews in the dataset contained many HTML tags, stop words, and punctuation symbols, we were required to clean the data and do some pre-processing on the reviews to make them suitable for analysis.


We used the BeautifulSoup library to remove HTML tags and markup. Punctuation and other non-letter characters were dealt with using regular expressions from the re package. Finally, we obtained the list of English-language stop words from the Python Natural Language Toolkit (NLTK), filtered all of them from our data corpus, and used a stemming package to treat words with the same meaning as one.

from bs4 import BeautifulSoup
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

wordnet = WordNetLemmatizer()

def review_to_wordlist(review, remove_stopwords=False):
    # 1. Remove HTML
    review_text = BeautifulSoup(review, "html.parser").get_text()
    # 2. Remove non-letters
    review_text = re.sub("[^a-zA-Z]", " ", review_text)
    # 2.1 Remove single-letter tokens
    review_text = re.sub(r"(?<!\S).(?!\S)\s*", "", review_text)
    # 3. Convert words to lower case, split them, and drop very short tokens
    words = review_text.lower().split()
    newwords = [word for word in words if len(word) > 2]
    # 4. Optionally remove stop words (false by default)
    if remove_stopwords:
        stops = set(stopwords.words("english"))
        newwords = [w for w in newwords if w not in stops]
    # 5. Return a list of words
    return newwords

def stemPorter(review_text):
    porter = PorterStemmer()
    preprocessed_docs = []
    for doc in review_text:
        final_doc = []
        for word in doc:
            # Keep both the stemmed and the lemmatized form of each word
            final_doc.append(porter.stem(word))
            final_doc.append(wordnet.lemmatize(word))
        preprocessed_docs.append(final_doc)
    return preprocessed_docs
Code Snippet 1. Functions for preprocessing

4. Approaches
4.1 Bag of Words


The bag-of-words model is a simplifying representation used in


natural language processing and information retrieval. In this
model, a text (such as a sentence or a document) is represented
as the bag of its words, disregarding grammar and even word
order but keeping multiplicity.
The bag-of-words model is commonly used in methods of
document classification, where the occurrence of each word is
used as a feature for training a classifier. This model focuses
completely on the words, or sometimes a string of words, but
usually pays no attention to the context. The Bag of Words model
learns a vocabulary from all of the documents, then models each
document by counting the number of times each word appears. To
get our bags of words, we count the number of times each word
occurs in each sentence.
In the IMDB data, since we have a very large number of reviews, the vocabulary becomes very large. Thus, to limit the size of the feature vectors, we choose a maximum vocabulary size (and vary it across experiments), build different models on these features, and calculate their accuracy. We hold these results as a baseline for further analysis.
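A rough sketch of this pipeline is shown below (assuming scikit-learn; this is not necessarily the exact code we ran). Here max_features controls the vocabulary size, and clean_train_reviews / clean_test_reviews are assumed to be lists of preprocessed review strings.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier

# Learn a vocabulary of at most 5,000 words and count word occurrences per review.
vectorizer = CountVectorizer(analyzer="word", max_features=5000)
train_features = vectorizer.fit_transform(clean_train_reviews)
test_features = vectorizer.transform(clean_test_reviews)

# Any of the classifiers below (Random Forest, Perceptron, SVM) can be trained
# on these count features; Random Forest is shown here.
forest = RandomForestClassifier(n_estimators=100)
forest.fit(train_features, train_sentiments)
predictions = forest.predict(test_features)

Code Snippet: Bag-of-words features with a Random Forest (illustrative sketch)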

4.1.1 Results

S.No   Classifier      Accuracy (%)
1      Random Forest   83.98
2      Perceptron      81.658
3      SVM             85.91


4.2 Word2vec
Word2vec, released by Tomas Mikolov and his colleagues at Google in 2013, is an unsupervised algorithm for learning the meaning behind words. It is a deep-learning-inspired method that focuses on the meaning of words: it learns word vectors and word context vectors and attempts to capture the meaning of, and semantic relationships among, words.
How it works
The word2vec tool takes a text corpus as input and produces the
word vectors as output. It first constructs a vocabulary from the
training text data and then learns vector representation of words.
The resulting word vector file can be used as features in many
natural language processing and machine learning applications.
Word2Vec Algorithm
We use the documents to train a neural network model that maximizes the conditional probability of context given the word. The goal is to optimize the parameters θ that maximize the conditional probability of a context word c given a word w, where D is the set of all (w, c) pairs:

θ* = arg max_θ ∏_{(w, c) ∈ D} p(c | w; θ)

Then we apply the model to each word to get its corresponding vector.



We then calculate the vector of a sentence by averaging the vectors of its words. These sentence vectors give a similarity matrix between the sentences, and PageRank can then be used to score the sentences in the resulting graph.



We used two different techniques on the vectors produced by the word2vec model (a sketch of both is given below):
1. Vector Averaging
Since each word is represented as a vector in 300-dimensional space, one method we tried was to simply average the word vectors in a given review.
2. Clustering
Word2Vec creates clusters of semantically related words, so another approach we tried was to exploit the similarity of words within a cluster.
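This sketch is not our exact code; parameter names follow gensim 4.x, and clean_reviews is assumed to be a list of token lists produced by the pre-processing step.

import numpy as np
from gensim.models import Word2Vec
from sklearn.cluster import KMeans

# Train 300-dimensional word vectors on the (labeled + unlabeled) reviews.
model = Word2Vec(clean_reviews, vector_size=300, window=10, min_count=40, workers=4)

# 1. Vector averaging: represent a review by the mean of its word vectors.
def average_vector(words, model, num_features=300):
    vec = np.zeros(num_features, dtype="float32")
    count = 0
    for word in words:
        if word in model.wv:          # skip out-of-vocabulary words
            vec += model.wv[word]
            count += 1
    return vec / max(count, 1)

# 2. Clustering: run k-means over the word vectors, then represent a review
#    as a histogram over clusters (a "bag of centroids").
kmeans = KMeans(n_clusters=500, n_init=10).fit(model.wv.vectors)
word_to_cluster = dict(zip(model.wv.index_to_key, kmeans.labels_))

def bag_of_centroids(words, num_clusters=500):
    hist = np.zeros(num_clusters, dtype="float32")
    for word in words:
        if word in word_to_cluster:
            hist[word_to_cluster[word]] += 1
    return hist

Code Snippet: Word2vec feature construction (illustrative sketch)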



4.2.1 Results

1. Change in accuracy with the number of trees (k-means features)

[Figure: Random Forest accuracy vs. number of trees; accuracy stays roughly between 83% and 85% as the number of trees varies from 60 to 160.]

2. Vector Averaging

S.No   Classifier      Accuracy (%)
1      Random Forest   84.748
2      Perceptron      83.456
3      SVM             86.424



3. Clustering

S.No   Classifier      Accuracy (%)
1      Random Forest   84.10
2      Perceptron      82.47
3      SVM             86.10

4. Change in accuracy with the number of features (k-means clusters)

[Figure: Random Forest accuracy vs. number of features; accuracy ranges roughly from 80% to 85% as the number of features varies from 200 to 400.]


4.3 Doc2vec
Doc2vec (also known as paragraph2vec or sentence embeddings) extends the word2vec algorithm to the unsupervised learning of continuous representations for larger blocks of text, such as sentences, paragraphs, or entire documents.
Algorithm
Our approach for learning paragraph vectors is inspired by the
methods for learning the word vectors. The inspiration is that the
word vectors are asked to contribute to a prediction task about
the next word in the sentence. So despite the fact that the word
vectors are initialized randomly, they can eventually capture
semantics as an indirect result of the prediction task. We will use
this idea in our paragraph vectors in a similar manner. The
paragraph vectors are also asked to contribute to the prediction
task of the next word given many contexts sampled from the
paragraph.
In our Paragraph Vector framework every paragraph is mapped to
a unique vector, represented by a column in matrix D and every
word is also mapped to a unique vector, represented by a column
in matrix W. The paragraph vector and word vectors are averaged
or concatenated to predict the next word in a context. In the
experiments, we use concatenation as the method to combine the
vectors.


The paragraph token can be thought of as another word. It acts as


a memory that remembers what is missing from the current
context or the topic of the paragraph. For this reason, we often
call this model the Distributed Memory Model of Paragraph
Vectors (PV-DM).
The contexts are fixed-length and sampled from a sliding window
over the paragraph. The paragraph vector is shared across all
contexts generated from the same paragraph but not across
paragraphs. The word vector matrix W, however, is shared across
paragraphs; i.e., the vector for "powerful" is the same for all paragraphs.
The paragraph vectors and word vectors are trained using
stochastic gradient descent and the gradient is obtained via backpropagation. At every step of stochastic gradient descent, one
can sample a fixed-length context from a random paragraph,
compute the error gradient from the network and use the gradient
to update the parameters in our model.
At prediction time, one needs to perform an inference step to
compute the paragraph vector for a new paragraph. This is also
obtained by gradient descent. In this step, the parameters for the
rest of the model, the word vectors W and the softmax weights,
are fixed. Suppose that there are N paragraphs in the corpus, M
words in the vocabulary, and we want to learn paragraph vectors
such that each paragraph is mapped to p dimensions and each

word is mapped to q dimensions; then the model has a total of N × p + M × q parameters (excluding the softmax parameters).
Even though the number of parameters can be large when N is
large, the updates during training are typically sparse and thus
efficient.
After being trained, the paragraph vectors can be used as
features for the paragraph. We can feed these features directly to
conventional machine learning techniques such as logistic
regression, support vector machines or random forest.
Advantages of paragraph vectors:
An important advantage of paragraph vectors is that they are
learned from unlabeled data and thus can work well for tasks that
do not have enough labeled data.
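As a rough sketch of how paragraph vectors can be trained and used as features (not our exact code; class and parameter names follow gensim 4.x, and train_reviews / test_reviews are assumed to be lists of token lists):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Each review becomes a TaggedDocument whose tag identifies its paragraph vector.
tagged = [TaggedDocument(words=words, tags=[i]) for i, words in enumerate(train_reviews)]

# dm=1 selects the Distributed Memory (PV-DM) model described above.
model = Doc2Vec(tagged, dm=1, vector_size=300, window=10, min_count=2, epochs=20)

# Training vectors are looked up by tag; unseen reviews require an inference
# step (gradient descent with the word vectors and softmax weights held fixed).
train_X = [model.dv[i] for i in range(len(train_reviews))]
test_X = [model.infer_vector(words) for words in test_reviews]

Code Snippet: Training paragraph vectors with Doc2Vec (illustrative sketch)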

4.3.1 Results
1. Change in accuracy with the Random Forest classifier (Vector Averaging)


[Figure: Random Forest accuracy vs. number of trees (vector averaging); accuracy ranges roughly from 64% to 69% as the number of trees varies from 60 to 160.]

2. Vector Averaging

S.No   Classifier      Accuracy (%)
1      Random Forest   67.824
2      Perceptron      67.356
3      SVM             67.332

3. Vector Averaging

S.No   Classifier      Accuracy (%)
1      Random Forest   67.723
2      Perceptron      67.351
3      SVM             68.91

4.4 GloVe
GloVe (Global Vectors for Word Representation) is a tool recently released by Stanford NLP Group researchers Jeffrey Pennington, Richard Socher, and Chris Manning for learning continuous-space vector representations of words.
The GloVe authors present results which suggest that their tool is competitive with Google's popular word2vec package.
Theory
The GloVe model learns word vectors by examining word co-occurrences within a text corpus. Before we train the actual model, we need to construct a co-occurrence matrix X, where a cell Xij is a strength which represents how often the word i appears in the context of the word j.
We will produce vectors with a soft constraint that for each pair of words i and j,

w_i^T w_j + b_i + b_j = log(X_ij),

where b_i and b_j are scalar bias terms associated with words i and j, respectively. Intuitively speaking, we want to build word vectors that retain some useful information about how every pair of words i and j co-occurs.

We will do this by minimizing an objective function J, which evaluates the sum of all squared errors based on the above equation, weighted with a function f:

J = Σ_{i,j} f(X_ij) (w_i^T w_j + b_i + b_j − log X_ij)^2

We choose an f that helps prevent common word pairs (i.e., those with large Xij values) from skewing our objective too much:

f(X_ij) = (X_ij / x_max)^α if X_ij < x_max, and 1 otherwise.



When we encounter extremely common word pairs (where Xij > x_max), this function cuts off its normal output and simply returns 1. For all other word pairs, we return a weight in the range (0, 1), where the distribution of weights in this range is decided by α.
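As a small illustration, the weighting function can be written directly; x_max = 100 and α = 0.75 are the values suggested in the GloVe paper, not values we tuned ourselves.

def weight(x, x_max=100.0, alpha=0.75):
    # Cap the contribution of very frequent co-occurrences at 1 and
    # down-weight rarer ones with a power law.
    return 1.0 if x >= x_max else (x / x_max) ** alpha

Code Snippet: GloVe weighting function f (illustrative sketch)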
For example, if we take nine short preprocessed sentences and set window=5, the resulting co-occurrence matrix is very sparse and symmetrical.

The GloVe algorithm then transforms such raw integer counts into a matrix where the co-occurrences are weighted based on their distance within the window (word pairs farther apart get less co-occurrence weight).

GloVe vs. Word2Vec

Where word2vec was a bit hazy about what is going on underneath, GloVe explicitly names the objective matrix, identifies the factorization, and provides some intuitive justification as to why this should give us working similarities.
Basically, GloVe precomputes the large word × word co-occurrence matrix in memory and then quickly factorizes it, whereas word2vec sweeps through the sentences in an online fashion, handling each co-occurrence separately. So there is a trade-off between taking more memory (GloVe) and taking longer to train (word2vec).
GloVe can re-use the co-occurrence matrix to quickly factorize with any dimensionality, whereas word2vec has to be trained from scratch after changing its embedding dimensionality.
Implementation of GloVe
Maciej Kula implemented GloVe in Python, using Cython for
performance. Using his neat implementation, we can try to make
sense of the performance and accuracy ourselves.



Fig. Sample Code to train GloVe in Python
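Since the original listing was an image, the sketch below shows how Maciej Kula's glove-python package is typically used (class and parameter names are assumed from that package, and corpus is assumed to be a list of tokenized sentences):

from glove import Corpus, Glove

# Build the word-word co-occurrence matrix with a symmetric context window.
corpus_model = Corpus()
corpus_model.fit(corpus, window=10)

# Factorize the co-occurrence matrix into 100-dimensional word vectors.
glove = Glove(no_components=100, learning_rate=0.05)
glove.fit(corpus_model.matrix, epochs=30, no_threads=4, verbose=True)
glove.add_dictionary(corpus_model.dictionary)

print(glove.most_similar("movie", number=5))

Code Snippet: Training GloVe with glove-python (illustrative sketch)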

4.4.1 Results
The GloVe paper (http://nlp.stanford.edu/projects/glove/glove.pdf) suggests that "for the same corpus, vocabulary, window size, and training time, GloVe consistently outperforms word2vec". Our findings, however, suggest that GloVe does not really outperform the original word2vec.
S.No   Method Used   Classifier      Accuracy (%)
1      Clustering    Random Forest   69.66
2      Clustering    Perceptron      79.9
3      Clustering    SVM             73.16
4      Vector Avg    Random Forest   74.71
5      Vector Avg    Perceptron      79.1
6      Vector Avg    SVM             79.59

Figure: Comparing Word2Vec with GloVe.


4.5 Perceptron

Invented in 1957 by cognitive psychologist Frank Rosenblatt, the perceptron algorithm was the first artificial neural network implemented in hardware. It is an algorithm for the supervised learning of a binary classifier. It is a type of linear classifier, i.e. a classification algorithm that makes its predictions based on a linear predictor function combining a set of weights with the feature vector. The algorithm allows for online learning, in that it processes elements in the training set one at a time.
Weights
A perceptron works by assigning weights to incoming connections. With the McCulloch-Pitts neuron we took the sum of the values from the incoming connections and then checked whether it was above or below a certain threshold. With the perceptron we instead take the dot product.
def dot_product(features, weights):
    # features is a list of (index, value) pairs; weights is indexed by feature index
    dotp = 0
    for f in features:
        dotp += weights[f[0]] * f[1]
    return dotp
Code Snippet: Function for Dot Product

Learning

A perceptron is a supervised classifier. It learns by first making a prediction: is the dot product above or below the threshold? If it is above the threshold it predicts a 1; if it is below the threshold it predicts a 0.
Then the perceptron looks at the label of the sample. If the prediction was correct, the error is 0 and it leaves the weights alone. If the prediction was wrong, the error is either -1 or 1, and the perceptron will update the weights accordingly.

dp = dot_product(features, weights) > 0.5
error = label - dp  # error is 1 if misclassified as 0, error is -1 if misclassified as 1
if error != 0:
    error_counter += 1
    # Updating the weights
    for index, value in features:
        weights[index] += opts["learning_rate"] * error * log(1. + value)

Code Snippet : Updating Weights

Online learning
The perceptron is capable of online learning (learning from samples one at a time). This is useful for larger datasets since you do not need the entire dataset in memory.
Hashing trick
The vectorizing hashing trick originated with Vowpal Wabbit. It sets the number of incoming connections to the perceptron to a fixed size: we hash all the raw features to a number lower than this fixed size.
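A minimal sketch of the idea (not our exact code): every raw feature is hashed into one of a fixed number of slots, so the weight vector keeps a fixed size no matter how large the vocabulary grows.

NUM_SLOTS = 2 ** 20   # fixed number of incoming connections (assumed value)

def hash_features(tokens, num_slots=NUM_SLOTS):
    counts = {}
    for tok in tokens:
        idx = hash(tok) % num_slots        # map each raw token to a fixed-size index
        counts[idx] = counts.get(idx, 0) + 1
    return list(counts.items())            # (index, value) pairs, as dot_product expects

Code Snippet: Hashing trick (illustrative sketch)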
Progressive Validation Loss
Learning from samples one at a time also gives us a progressive training loss. When the model encounters a sample, it first makes a prediction without looking at the target. We then compare this prediction with the target label and calculate an error rate. A low error rate tells us we are close to a good model.

Normalization
We use a simple log(feature value + 1) for online feature value normalization, just like Vowpal Wabbit.
Multiple passes
We can also run through the dataset multiple times, for as long as the error rate keeps decreasing. If the training data is linearly separable, a perceptron should converge to 0 errors on the training set.
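Putting these pieces together, a sketch of the full online training loop with multiple passes might look like the following (labeled_data is assumed to be a list of (tokens, label) pairs, hash_features and dot_product are as sketched above, and the learning rate and number of passes are illustrative values):

from math import log
import random

weights = [0.0] * NUM_SLOTS     # zero initialization (random initialization was also tried)
learning_rate = 0.1             # illustrative value

for pass_nr in range(5):        # multiple passes over the training data
    error_counter = 0
    random.shuffle(labeled_data)
    for tokens, label in labeled_data:
        features = hash_features(tokens)
        dp = dot_product(features, weights) > 0.5
        error = label - dp      # 0 if correct, +1 / -1 if misclassified
        if error != 0:
            error_counter += 1
            for index, value in features:
                weights[index] += learning_rate * error * log(1. + value)
    print("pass", pass_nr, "errors", error_counter)

Code Snippet: Online perceptron training loop (illustrative sketch)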

4.5.1 Results

S.No   Weight Initialization   N-Gram   Accuracy (%)
1      Random                  1-Gram   93.24
2      Zero                    1-Gram   93.67
3      Random                  2-Gram   95.10
4      Zero                    2-Gram   95.67


5. References
1. Kaggle tutorials: http://www.kaggle.com/c/word2vec-nlp-tutorial
2. Word2Vec implementation tutorials: http://radimrehurek.com/2014/12/making-sense-of-word2vec/ and http://deeplearning4j.org/word2vec.html
3. Quoc Le and Tomas Mikolov. "Distributed Representations of Sentences and Documents."
4. Perceptron tutorial: mlwave.com/online-learning-perceptron/
5. Doc2Vec implementation tutorial: radimrehurek.com/2014/12/doc2vec-tutorial/
