
MOVIE SUCCESS PREDICTION USING MULTIPLE ALGORITHMS

A project report for Data Mining Techniques

(ITE2006)

Submitted in partial fulfilment for the award of the degree of

B. Tech

In

Information Technology

By

NEELENDU WADHWA-16BIT0033

AKASH-16BIT0144

UJJWAL KUMAR-16BIT0223

Under the guidance of

Dr. Prabhavathy P

November, 2018

ACKNOWLEDGEMENT

We sincerely thank our Chancellor, Dr. G. Viswanathan, VIT
University, for giving us the opportunity to pursue this course on
Data Mining Techniques and for introducing the J-component as part of our
academic curriculum. We also thank our Professor, Dr. Prabhavathy
P, for the support and guidance that enabled us to
complete the project. We would also like to thank the Dean and HOD of the
School of Information Technology and Engineering (SITE) for
their continued encouragement and motivation to achieve greater
heights.

DECLARATION BY THE CANDIDATES

We, the members of the group, hereby declare that the project report
entitled “Movie Success Prediction using multiple algorithms”
submitted by us to VIT University, Vellore in partial fulfilment of the
requirement for the award of the degree of Bachelor of Technology is
a record of the J-component project work carried out by us under the
guidance of Prof. Dr. Prabhavathy P. We further declare that the work
reported in this project has not been submitted and will not be
submitted, either in part or in full, for the award of any other degree or
diploma in this institute or any other institute or university.

Place: Vellore

Date: 1/11/2018

TEAM MEMBERS:-

NEELENDU WADHWA

AKASH

UJJWAL KUMAR

CERTIFICATE

This is to certify that the project work titled “Movie Success
Prediction Using multiple algorithms” that is being submitted by
“Neelendu Wadhwa”, “Akash” and “Ujjwal Kumar” for Data Mining
Techniques is a record of bonafide work done under my supervision.
The contents of this Project work, in full or in parts, have neither been
taken from any other source nor have been submitted for any other CAL
course.

Place:-Vellore

Date:-1/11/2018

Signature of faculty:

Prof. Dr. Prabhavathy P

TABLE OF CONTENTS

SL.NO   CONTENTS

1       Abstract
2       Introduction
3       Requirements
4       Proposed System
5       Implementation
6       Code and Output
7       Results
8       Conclusion
9       Future Work

ABSTRACT

Predicting the success of a movie plays a vital role in the movie industry because
movie making involves huge investments. However, success cannot be predicted from
any single attribute. In this project, we developed a mathematical model to predict
the success or failure of upcoming movies based on several attributes. The criteria
for estimating movie success include budget, actors, director, producer,
set locations, story writer, movie release day, competing movie releases at the same
time, music, release location and target audience. We have therefore built a model
based on the relationships between these attributes. The movie industry can use this
model to adjust movie criteria and improve the likelihood of a blockbuster. The model
can also be used by movie watchers to identify a likely blockbuster before purchasing
a ticket. We used three algorithms, namely random forest, k-means and k-nearest
neighbours, for movie success prediction and compared their accuracy on the same
dataset.

INTRODUCTION

Producers and directors invest a lot of money in making a good movie. The money
pays for the cast, the right technicians and so on. If the movie is not successful,
much of that money is wasted. It is therefore very important for the stakeholders
to know whether the movie will be successful. In this project we aim to address
this problem by predicting the success of a movie with the help of its IMDB rating.
We apply three algorithms, k-nearest neighbours, k-means and random forest, to
predict the success of the movie.

K-Nearest Neighbors is one of the most basic yet essential classification
algorithms in machine learning. It belongs to the supervised learning domain and is
widely applied in pattern recognition, data mining and intrusion detection.

K-means clustering is a type of unsupervised learning, which is used when the data
is unlabeled (i.e., data without defined categories or groups). The goal of this
algorithm is to find groups in the data, with the number of groups represented by
the variable K. The algorithm works iteratively to assign each data point to one of
K groups based on the features that are provided. Data points are clustered based
on feature similarity.

Random Forest is a flexible, easy-to-use machine learning algorithm that often
produces good results even without hyper-parameter tuning. It is also one of the
most widely used algorithms because of its simplicity and because it can be used
for both classification and regression tasks. We compare the accuracy of the three
algorithms in this project.

REQUIREMENTS

Hardware Requirements:-

8 GB RAM recommended

i5 processor recommended

Software Requirements:-

Colab

Programming Language:-

Python

PROPOSED SYSTEM

In this project we first import a movie dataset that has attributes such as
IMDB score, actor name, actor Facebook likes, director name, director Facebook
likes etc. We then pre-process the data by filling in the missing values
and normalising it with the standard scaler technique. The dataset is then
ready for processing. To assign a class label to each movie in the dataset,
we define a function classify that returns one of 5 values (0-4) based on the
IMDB rating of the movie. The resulting column is added to the dataset. We then
write the code for the KNN, K-means and random forest algorithms and run each on
our dataset to measure its accuracy. Finally, we compare the accuracy of the
algorithms on the chosen dataset.
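
The labelling and pre-processing steps described above can be sketched as follows. This is only an illustration under stated assumptions: the report does not give the score thresholds used by classify, so the 2-point bins below are assumed, and the small data frame and the column name director_facebook_likes are stand-ins for the real movie_metadata.csv.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler


def classify(score):
    """Map an IMDB score (0-10) to a class label 0-4.

    Assumed bins (not taken from the report):
    [0,2) -> 0, [2,4) -> 1, [4,6) -> 2, [6,8) -> 3, [8,10] -> 4
    """
    return int(min(score // 2, 4))


# Tiny made-up frame standing in for the real dataset
df = pd.DataFrame({
    "imdb_score": [7.9, 5.1, 8.4, 3.2],
    "gross": [760e6, np.nan, 448e6, 1.2e6],
    "director_facebook_likes": [0.0, 563.0, np.nan, 12.0],
})

# Add the class label column derived from the IMDB score
df["Success"] = df["imdb_score"].apply(classify)

# Replace missing values with the column median, then standardise
numeric_cols = ["gross", "director_facebook_likes"]
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())
scaled = StandardScaler().fit_transform(df[numeric_cols])

print(df["Success"].tolist())
```

StandardScaler rescales each column to zero mean and unit variance, which matters for distance-based methods such as KNN and K-means.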

IMPLEMENTATION

1. The libraries are imported.

2. The dataset movie_metadata.csv is downloaded from Kaggle.

3. The dataset is loaded into the data frame.

4. The movie IMDB ratings are converted to 5 class labels from 0 to 4, and the
predictions are made using these 5 class labels.

5. A Success column is added to the data frame, holding the class value for
each row.

6. Null values in the dataset are replaced with the median value of the column
and normalization is done.

7. The predictions are validated against the class label values in the Success
column.

8. The attribute values are scaled using the standard scaling algorithm.

9. The data is clustered using k-means clustering and the accuracy score for the
predicted values is calculated.

10. 5 clusters are created, the rows are assigned to the 5 clusters
and a graph is plotted.

11. The gross collection of the movie and its IMDB rating are used as
features for the clustering.

12. Random forest is applied to the dataset and the accuracy score for the
predictions is calculated.

13. The model uses 250 estimators (trees). It is built with XGBoost's
XGBClassifier, which trains gradient-boosted trees rather than a bagged random
forest.

14. The k-nearest neighbours algorithm is applied to the dataset and the accuracy
score for the predictions is calculated.

CODE AND OUTPUT:-

CODE:-

1. KNN Algorithm:-

from sklearn import metrics
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
import numpy as np
import matplotlib.pyplot as plt

# df is the pre-processed data frame; features is the list of feature columns
X_train, X_test, y_train, y_test = train_test_split(df[features], df['Success'],
                                                    test_size=0.2)
# Create KNN Classifier
knn = KNeighborsClassifier(n_neighbors=50)
# Train the model using the training sets
knn.fit(X_train, y_train)
# Predict the response for test dataset
y_pred = knn.predict(X_test)
# Model Accuracy, how often is the classifier correct?
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
# Work on a copy so the predictions column is not written to a slice of df
X_test = X_test.copy()
X_test['Predictions'] = y_pred
X_test

2. K-Means Algorithm:-

from sklearn.cluster import KMeans

# Cluster the movies into 5 groups using gross collection and IMDB score
km = KMeans(n_clusters=5)
P_fit = km.fit(df[['gross', 'imdb_score']])
df['cluster'] = P_fit.labels_

print(len(df))
print(df.iloc[0])

# Count the rows whose predicted cluster index equals the Success label.
# Note: K-means cluster indices are arbitrary, so this direct comparison
# depends on how the clusters happen to be numbered in a given run.
correct = 0
for i in range(len(df)):
    predict_me = np.array(df[['gross', 'imdb_score']].iloc[i].astype(float))
    predict_me = predict_me.reshape(-1, len(predict_me))
    prediction = km.predict(predict_me)
    if prediction[0] == df['Success'].iloc[i]:
        correct += 1
print(correct / len(df))

# Plot each cluster in its own random colour
for i in np.unique(P_fit.labels_):
    temp = df[df['cluster'] == i]
    plt.scatter(temp['gross'], temp['imdb_score'],
                color=np.random.rand(3) * 0.7)
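
The accuracy loop above compares raw cluster indices with the Success labels. Because K-means numbers its clusters arbitrarily, one common alternative is to map each cluster to the most frequent true label among its members before scoring. The sketch below shows that idea on synthetic data; the function name cluster_accuracy and the two-blob dataset are illustrative assumptions, not part of the project code.

```python
import numpy as np
from sklearn.cluster import KMeans


def cluster_accuracy(X, y, n_clusters, seed=0):
    """Score K-means against labels after mapping each cluster
    to the majority true label of its members."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    clusters = km.fit_predict(X)
    mapped = np.empty_like(y)
    for c in np.unique(clusters):
        members = clusters == c
        # Most common true label inside this cluster
        mapped[members] = np.bincount(y[members]).argmax()
    return float((mapped == y).mean())


# Two well-separated synthetic blobs with integer class labels
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, (50, 2)),
               rng.normal(10.0, 0.5, (50, 2))])
y = np.array([1] * 50 + [0] * 50)

acc = cluster_accuracy(X, y, n_clusters=2)
print("Mapped cluster accuracy:", acc)
```

With this mapping the score no longer depends on which index K-means happens to give each cluster, so it is more comparable with the KNN and random forest accuracies.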

3. Random Forest Algorithm:-

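import xgboost as xgb

# Use every column except the IMDB score as a feature
features = list(col)  # copy so that col itself is not modified
features.remove('imdb_score')
X_train, X_test, y_train, y_test = train_test_split(df[features],
                                                    df['Success'], test_size=0.2)
# XGBClassifier trains 250 gradient-boosted trees
rf = xgb.XGBClassifier(n_estimators=250)
rf.fit(X_train, y_train)
predictions = rf.predict(X_test)
predictions = predictions.astype(int)
np.unique(predictions)
accuracy_score(y_test, predictions)
# run_knn and df_knn are defined elsewhere in the notebook (not shown here)
run_knn(df_knn)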
OUTPUT:-

1. K Means Algorithm

Displaying the accuracy of the K Means Algorithm

Displaying the predicted cluster for the class based on IMDB score

Plotting the output of the k means Algorithm

2. Random Forest Algorithm:-

Displaying the Features

Predicted class vs real class by random forest classifier

Random forest algorithm accuracy

Displaying the heatmap

3. KNN Algorithm

Displaying the KNN Predicted class vs. the actual dataset

Displaying the accuracy of the KNN model

RESULTS

Accuracy of the Random forest algorithm:-1

Accuracy of the KNN Algorithm:-0.83

Accuracy of the K Means Algorithm:-0.3805

In this run, the K Means Algorithm has the lowest accuracy, followed by KNN, and
the random forest algorithm has the highest accuracy.

CONCLUSION

Movie success prediction plays an important role in informing stakeholders about
the likely success of a movie. It helps not only the directors and producers who
have put their money into the script but also the viewers, who come to know whether
the movie is going to be successful. By applying k-nearest neighbours, random
forest and k-means to the same movie dataset, we were able to compare the
accuracies of the algorithms and test which one performs better. In our comparison,
the random forest algorithm has the highest accuracy with a score of 1, while
k-means has the lowest with 0.38. However, these accuracy values may change in
every iteration because the training and test data are split randomly each time.
To find the best algorithm, we need to take the mean of the accuracy values over a
large number of iterations.
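
Averaging the accuracy over many random train/test splits, as suggested above, could be sketched with scikit-learn's ShuffleSplit and cross_val_score. The synthetic data, the 20 repetitions and n_neighbors=5 are illustrative assumptions rather than settings from the project.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the movie features and class labels
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)

# 20 independent random 80/20 splits instead of a single split
splitter = ShuffleSplit(n_splits=20, test_size=0.2, random_state=0)
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y,
                         cv=splitter, scoring="accuracy")

mean_acc, std_acc = scores.mean(), scores.std()
print("Mean accuracy:", mean_acc, "Std:", std_acc)
```

Reporting the standard deviation alongside the mean shows how much of the difference between two algorithms could be due to the random split alone.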

FUTURE WORK

The project can be extended by applying more algorithms to the dataset and
comparing their accuracies.
