Project Report
Submitted by
Jaimita Bansal (14111017), Dept. of CSE
Zahira Nasrin (14111047), Dept. of CSE
Rajat Kumar (14111028), Dept. of CSE
Sharin KG (14111033), Dept. of CSE
Acknowledgments
We would like to express our sincere appreciation to our supervisor, Prof. Harish Karnick; this project would not have been possible without his guidance. We owe our knowledge of Machine Learning to his course CS771: Machine Learning Tools and Techniques.
Abstract
In this project, we aim to tackle the problem of sentiment analysis, which has recently been an active area of research. We experiment with different machine learning approaches, ranging from a simple Bag of Words model to more complex models such as GloVe, with the aim of predicting the sentiment of unseen reviews. We also discuss the performance of the different classifiers used and compare their results to obtain the best classifier configuration for the given dataset. For this we use the IMDB dataset, which provides a set of 25,000 highly polar movie reviews for training, 25,000 reviews for testing, and an additional 50,000 unlabeled reviews.
Table of Contents
1. Introduction
2. Dataset
3. Pre-processing
4. Approaches
4.1 Bag of Words
4.2 Word2vec
4.2.1 Results
4.3 Doc2vec
4.3.1 Results
4.4 GloVe
4.4.1 Results
4.5 Perceptron
4.5.1 Results
5. References
1. Introduction
Sentiment analysis and classification is an area of text mining that has attracted considerable interest from researchers. It is formally defined as the process of computationally identifying the sentiment expressed in a piece of text, in our case classifying movie reviews as positive or negative.
2. Dataset
The labeled dataset consists of 50,000 IMDB movie reviews, specially selected for sentiment analysis. The sentiment of the reviews is binary: an IMDB rating below 5 results in a sentiment score of 0, and a rating of 7 or above results in a sentiment score of 1, so the reviews are highly polar. No individual movie has more than 30 reviews. The 25,000-review labeled training set does not include any of the same movies as the 25,000-review test set. In addition, there are another 50,000 IMDB reviews provided without any rating labels.
3. Pre-processing
As the reviews in the dataset contained many HTML tags, stop words and punctuation symbols, we had to clean the data and pre-process the reviews to make them suitable for analysis. We stripped the HTML tags, removed punctuation and stop words, and converted the text to lower case so that the model treats differently cased words as one.
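A minimal sketch of such a cleaning step, assuming BeautifulSoup and NLTK's English stop-word list (the report does not include its actual code):

import re
from bs4 import BeautifulSoup
from nltk.corpus import stopwords

STOP_WORDS = set(stopwords.words("english"))

def clean_review(raw_review):
    # Strip HTML tags left over from the scraped reviews
    text = BeautifulSoup(raw_review, "html.parser").get_text()
    # Keep letters only, dropping punctuation symbols and digits
    text = re.sub(r"[^a-zA-Z]", " ", text)
    # Lower-case so differently cased words are treated as one
    words = text.lower().split()
    # Remove English stop words
    words = [w for w in words if w not in STOP_WORDS]
    return " ".join(words)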
4. Approaches

4.1 Bag of Words
In the Bag of Words model, each review is represented as a vector of word counts over the vocabulary of the training corpus, ignoring word order. These count vectors are then fed to the classifiers below.

4.1.1 Results

S.No   Classifier       Accuracy (%)
1      Random Forest    83.98
2      Perceptron       81.658
3      SVM              85.91
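A pipeline of this kind can be sketched with scikit-learn (a minimal sketch; the vocabulary size and classifier settings are assumptions, as the report does not state them):

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Count-vectorize the cleaned reviews, capping the vocabulary size
vectorizer = CountVectorizer(max_features=5000)
train_features = vectorizer.fit_transform(clean_train_reviews)  # cleaned strings
test_features = vectorizer.transform(clean_test_reviews)

# Fit one of the classifiers compared above, e.g. a random forest
clf = RandomForestClassifier(n_estimators=100)
clf.fit(train_features, train_labels)
predictions = clf.predict(test_features)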
4.2 Word2vec
Tomas Mikolov and his team at Google made some ripples by releasing word2vec, an unsupervised algorithm for learning the meaning behind words. Word2vec is a deep-learning-inspired method that focuses on the meaning of words: it learns word vectors and word context vectors, and attempts to capture the meaning of, and the semantic relationships among, words.
How it works
The word2vec tool takes a text corpus as input and produces word vectors as output. It first constructs a vocabulary from the training text and then learns vector representations of the words. The resulting word-vector file can be used as features in many natural language processing and machine learning applications.
Word2vec Algorithm
The documents are used to train a neural network model that maximizes the conditional probability of the context words given the centre word (the skip-gram formulation).
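For illustration, training such a model with the gensim library (an assumed choice, using the gensim 4.x API; the hyperparameters are placeholders) looks roughly like:

from gensim.models import Word2Vec

# sentences: list of tokenized reviews, e.g. [["great", "movie"], ...]
model = Word2Vec(
    sentences,
    vector_size=300,   # dimensionality of the word vectors
    window=10,         # context window size
    min_count=40,      # ignore words rarer than this
    workers=4,
)
# The learned vectors capture semantic relatedness
print(model.wv.most_similar("awful"))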
4.2.1 Results

1. Change in Accuracy with Random Forest Classifier
[Figure: Random Forest accuracy (83.0-85.0%) against the number of trees (60-160)]
2. Vector Averaging

S.No   Classifier       Accuracy (%)
1      Random Forest    84.748
2      Perceptron       83.456
3      SVM              86.424
3. Clustering

S.No   Classifier       Accuracy (%)
1      Random Forest    84.10
2      Perceptron       82.47
3      SVM              86.10

[Figure: Random Forest accuracy (80.0-85.0%) against the number of features (200-400)]
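In the vector-averaging scheme above, each review is represented by the average of the word2vec vectors of its words. A minimal sketch, assuming a trained gensim model:

import numpy as np

def average_review_vector(tokens, model):
    # Mean of the word2vec vectors of the review's in-vocabulary words
    vectors = [model.wv[word] for word in tokens if word in model.wv]
    if not vectors:
        return np.zeros(model.vector_size)
    return np.mean(vectors, axis=0)

The clustering scheme instead groups the word vectors with k-means and represents each review as a bag of cluster counts, following the Kaggle tutorial cited in the references.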
4.3 Doc2vec
Doc2vec (a.k.a. paragraph2vec, a.k.a. sentence embeddings) modifies the word2vec algorithm for the unsupervised learning of continuous representations of larger blocks of text, such as sentences, paragraphs or entire documents.
Algorithm
Our approach for learning paragraph vectors is inspired by the
methods for learning the word vectors. The inspiration is that the
word vectors are asked to contribute to a prediction task about
the next word in the sentence. So despite the fact that the word
vectors are initialized randomly, they can eventually capture
semantics as an indirect result of the prediction task. We will use
this idea in our paragraph vectors in a similar manner. The
paragraph vectors are also asked to contribute to the prediction
task of the next word given many contexts sampled from the
paragraph.
In our Paragraph Vector framework every paragraph is mapped to
a unique vector, represented by a column in matrix D and every
word is also mapped to a unique vector, represented by a column
in matrix W. The paragraph vector and word vectors are averaged
or concatenated to predict the next word in a context. In the
experiments, we use concatenation as the method to combine the
vectors.
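A sketch of this setup with gensim's Doc2Vec class, using the distributed-memory model with concatenation as described above (the library choice and hyperparameters are assumptions):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Tag each tokenized review with a unique id; the tag's learned
# vector is the paragraph vector for that review
documents = [TaggedDocument(words=tokens, tags=[i])
             for i, tokens in enumerate(tokenized_reviews)]

model = Doc2Vec(
    documents,
    vector_size=100,  # dimensionality of the paragraph vectors
    dm=1,             # PV-DM: the distributed-memory model
    dm_concat=1,      # concatenate (rather than average) the vectors
    window=5,
    min_count=2,
    epochs=20,
)
review_vector = model.dv[0]  # paragraph vector of the first review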
4.3.1 Results

1. Change in Accuracy with Random Forest Classifier (Vector Averaging)
[Figure: Random Forest accuracy (64.0-69.0%) against the number of trees (60-160)]
2. Vector Averaging

S.No   Classifier       Accuracy (%)
1      Random Forest    67.824
2      Perceptron       67.356
3      SVM              67.332
3. Clustering

S.No   Classifier       Accuracy (%)
1      Random Forest    67.723
2      Perceptron       67.351
3      SVM              68.91
4.4 GloVe
GloVe (Global Vectors for Word Representation) is a tool released by the Stanford NLP Group researchers Jeffrey Pennington, Richard Socher and Christopher Manning. It first builds a word-word co-occurrence matrix by counting, for each word, how often the other words of the vocabulary appear within a fixed-size context window around it. The GloVe algorithm then transforms such raw integer counts into a matrix where the co-occurrences are weighted based on their distance within the window (word pairs farther apart get less co-occurrence weight):
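For reference, the weighted least-squares objective that GloVe minimizes over the co-occurrence matrix X (as given in the GloVe paper cited in 4.4.1) is

J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2

where w_i and \tilde{w}_j are the word and context-word vectors, b_i and \tilde{b}_j are bias terms, and f(x) = (x / x_{\max})^{\alpha} for x < x_{\max} (and 1 otherwise) is a weighting function that limits the influence of very frequent pairs.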
4.4.1 Results
The paper (http://nlp.stanford.edu/projects/glove/glove.pdf) suggests that, for the same corpus, vocabulary, window size, and training time, GloVe consistently outperforms word2vec. Our findings, however, suggest that GloVe does not really outperform the original word2vec on this dataset.
S.No   Method Used   Classifier       Accuracy (%)
1      Clustering    Random Forest    69.66
2      Clustering    Perceptron       79.9
3      Clustering    SVM              73.16
4      Vector Avg    Random Forest    74.71
5      Vector Avg    Perceptron       79.1
6      Vector Avg    SVM              79.59
4.5 Perceptron

Online learning
The perceptron is capable of online learning (learning from samples one at a time). This is useful for larger datasets, since you do not need to hold the entire dataset in memory.
Hashing trick
The vectorizing hashing trick originated with Vowpal Wabbit. It fixes the number of incoming connections to the perceptron: all raw features are hashed to an index below a fixed size, so the weight vector never grows with the vocabulary.
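A minimal illustration of the trick (hashed_features is a hypothetical helper, not the report's code): each token is hashed into one of a fixed number of buckets, which index the perceptron's weights.

def hashed_features(tokens, n_buckets=2 ** 20):
    # Map a variable-length token list to a fixed-size sparse
    # representation: bucket index -> count
    features = {}
    for token in tokens:
        index = hash(token) % n_buckets
        features[index] = features.get(index, 0) + 1
    return features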
Progressive Validation Loss
Learning from samples one at a time also gives us a progressive training loss. When the model encounters a sample, it first makes a prediction without looking at the target; we then compare this prediction with the target label and update a running error rate. A low error rate tells us we are close to a good model.
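Putting online learning, hashing and progressive validation together, a training loop might look like this sketch (the structure and names are assumptions; hashed_features is the hypothetical helper above):

def train_online(stream, n_buckets=2 ** 20, learning_rate=0.1):
    # stream yields (tokens, label) pairs with label in {0, 1}
    weights = [0.0] * n_buckets
    errors, seen = 0, 0
    for tokens, label in stream:
        features = hashed_features(tokens, n_buckets)
        # Predict before looking at the target (progressive validation)
        score = sum(count * weights[i] for i, count in features.items())
        prediction = 1 if score > 0 else 0
        errors += prediction != label
        seen += 1
        # Standard perceptron update on mistakes
        if prediction != label:
            direction = 1 if label == 1 else -1
            for i, count in features.items():
                weights[i] += learning_rate * direction * count
    print("progressive error rate:", errors / seen)
    return weights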
Normalization
4.5.1 Results

S.No   Weight Initialization   N-Gram   Accuracy (%)
1      Random                  1-gram   93.24
2      Zero                    1-gram   93.67
3      Random                  2-gram   95.10
4      Zero                    2-gram   95.67
5. References
1. Kaggle tutorial: http://www.kaggle.com/c/word2vec-nlp-tutorial
2. http://radimrehurek.com/2014/12/making-sense-of-word2vec/
3. http://deeplearning4j.org/word2vec.html
4. http://radimrehurek.com/2014/12/doc2vec-tutorial/