Movie Success Prediction Using Multiple Algorithms

A project report for Data Mining Techniques
(ITE2006)
Submitted in partial fulfilment for the award of the degree of
B. Tech
In
Information Technology
By
NEELENDU WADHWA-16BIT0033
AKASH-16BIT0144
UJJWAL KUMAR-16BIT0223
Under the guidance of
Dr. Prabhavathy P
November, 2018
ACKNOWLEDGEMENT
We sincerely thank our Chancellor, Dr. G. Viswanathan, VIT
University, for giving us the opportunity to pursue this course on
Data Mining Techniques and for introducing the J-component as part
of our academic curriculum. We also thank our Professor, Dr.
Prabhavathy P, for the support and guidance that enabled us to
complete the project. We would also like to thank the Dean and HOD
of the School of Information Technology and Engineering (SITE) for
their continued encouragement and motivation to achieve greater
heights.
DECLARATION BY THE CANDIDATES
We, the members of the group, hereby declare that the project report
entitled “Movie Success Prediction using multiple algorithms”
submitted by us to VIT University, Vellore in partial fulfilment of the
requirement for the award of the degree of Bachelor of Technology is
a record of J component of project work carried out by us under the
guidance of Prof Dr. Prabhavathy P. We further declare that the work
reported in this project has not been submitted and will not be
submitted, either in part or in full, for the award of any other degree or
diploma in this institute or any other institute or university.
Place: Vellore
Date: 1/11/2018
TEAM MEMBERS:-
NEELENDU WADHWA
AKASH
UJJWAL KUMAR
CERTIFICATE
This is to certify that the project work titled “Movie Success
Prediction Using multiple algorithms” that is being submitted by
“Neelendu Wadhwa”, “Akash” and “Ujjwal Kumar” for Data Mining
Techniques is a record of bonafide work done under my supervision.
The contents of this Project work, in full or in parts, have neither been
taken from any other source nor have been submitted for any other CAL
course.
Place:-Vellore
Date:-1/11/2018
Signature of faculty:
Prof. Dr. Prabhavathy P
TABLE OF CONTENTS
SL.NO CONTENTS
1 Abstract
2 Introduction
3 Requirements
4 Proposed System
5 Implementation
6 Code and Output
7 Results
8 Conclusion
9 Future Work
ABSTRACT
Predicting the success of a movie plays a vital role in the movie industry because
movie making involves huge investments. However, success cannot be predicted from
any single attribute. In this project, we developed a model to predict the success
or failure of upcoming movies based on several attributes. Some of the criteria in
calculating movie success include budget, actors, director, producer, set
locations, story writer, movie release day, competing movie releases at the same
time, music, release location and target audience. We have therefore built a model
based on the relations between such attributes. The movie industry can use this
model to adjust these criteria and improve the likelihood of a blockbuster. Movie
watchers can also use it to judge whether a movie is likely to be a blockbuster
before purchasing a ticket. We have used three algorithms, namely the random forest
algorithm, the k-means algorithm and the k-nearest neighbours algorithm, for movie
success prediction and compared their accuracy on the same dataset.
INTRODUCTION
Producers and directors invest a lot of money in making a good movie. The money
goes into paying the cast, hiring the right technicians etc. If the movie is not
successful, much of that money is wasted. It is therefore very important for the
stakeholders to know whether the movie will be successful or not. In this project
we aim to solve this problem for the movie stakeholders by predicting the success
of a movie, where success is defined with the help of the movie's IMDB rating. We
apply three algorithms, k-nearest neighbours, k-means and random forest, to
predict the success of the movie.
K-Nearest Neighbours (KNN) is one of the most basic yet essential classification
algorithms in machine learning. It belongs to the supervised learning domain and
is widely applied in pattern recognition, data mining and intrusion detection.
K-means clustering is a type of unsupervised learning, which is used when you have
unlabelled data (i.e., data without defined categories or groups). The goal of
this algorithm is to find groups in the data, with the number of groups
represented by the variable K. The algorithm works iteratively to assign each data
point to one of K groups based on the features that are provided. Data points are
clustered based on feature similarity.
Random forest is a flexible, easy-to-use machine learning algorithm that often
produces a good result even without hyper-parameter tuning. It is also one of the
most used algorithms, because of its simplicity and the fact that it can be used
for both classification and regression tasks.
We compare the accuracy of the three algorithms in this project.
REQUIREMENTS
Hardware Requirements:-
8 GB RAM recommended
i5 processor recommended
Software Requirements:-
Colab
Programming Language:-
Python
PROPOSED SYSTEM
In this project we first import a movie dataset which has attributes such as
IMDB score, actor name, actor Facebook likes, director name, director Facebook
likes etc. We then pre-process the data by filling in the missing values and
normalising the attributes with the standard scaler technique (each attribute is
rescaled to zero mean and unit variance). The dataset is then ready for
processing. To assign a class label to each movie in the dataset, we define a
function classify that returns one of 5 values, 0-4, based on the IMDB rating of
the movie. The resulting column is added to the entire dataset. We then write the
code for the KNN, K-means and random forest algorithms, run each of them on our
dataset to find its accuracy, and compare the accuracy of the algorithms on the
chosen dataset.
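The labelling and pre-processing steps described above can be sketched as follows. This is a minimal illustration and not the code used in the project: the cut-off values inside classify, the column names and the tiny example frame are assumptions, since the report does not state the actual bin boundaries.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler


def classify(score):
    """Map an IMDB score (0-10) to a class label 0-4.

    The cut-offs 2, 4, 6 and 8 are assumed for illustration only.
    """
    # np.digitize counts how many bin edges are <= score
    return int(np.digitize(score, [2.0, 4.0, 6.0, 8.0]))


# Tiny stand-in for movie_metadata.csv (values are made up)
df = pd.DataFrame({
    "imdb_score": [7.9, 5.2, 8.4, 3.1, 6.6],
    "gross": [2.0e8, np.nan, 5.5e8, 1.0e6, 4.0e7],
    "director_facebook_likes": [0.0, 563.0, np.nan, 12.0, 475.0],
})

# Add the class label derived from the IMDB score
df["Success"] = df["imdb_score"].apply(classify)

# Fill missing values with the median of each numeric column
feature_cols = ["gross", "director_facebook_likes"]
df[feature_cols] = df[feature_cols].fillna(df[feature_cols].median())

# Standard scaling: subtract the column mean, divide by the column std
scaled = StandardScaler().fit_transform(df[feature_cols])

print(df["Success"].tolist())
```

Deriving the label from imdb_score means that imdb_score itself must be left out of the feature set, otherwise the classifier is handed the answer.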
IMPLEMENTATION
1. Importing the libraries
2. Dataset movie_metadata.csv is downloaded from Kaggle.
3. The dataset is loaded in the data frame.
4. The movie IMDB ratings are converted to 5 class labels from 0 to 4, and the
predictions are made in terms of these 5 class labels.
5. Success column is added to the data frame which consists of the class values
for the different rows.
6. Null values in the dataset are replaced with the median values of the column
and normalization is done.
7. Using the class label values in the success column the predictions are
validated.
8. Scaling of the attribute values is done using the standard scaling algorithm.
9. Data is clustered using k means clustering and accuracy score for the
predicted values is calculated.
10. 5 clusters are created and the rows are clustered into the 5 different clusters
and a graph is plotted.
11. The values of the gross collection of the movie and IMDB rating are used as
features for the clustering.
12. Random forest is applied on the dataset and accuracy score for the
predictions is calculated.
13. The number of estimators (trees) used is 250. The model is built with
XGBoost's XGBClassifier, so the trees are trained by gradient boosting rather
than by the bagging used in a classical random forest.
14. K nearest neighbour’s algorithm is applied on the dataset and accuracy score
for the predictions is calculated.
CODE AND OUTPUT:-
CODE:-
1. KNN Algorithm:-
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# df and features come from the pre-processing steps
# (features is the list of attribute columns, without imdb_score)
# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df['Success'], test_size=0.2)

# Create KNN classifier
knn = KNeighborsClassifier(n_neighbors=50)
# Train the model using the training set
knn.fit(X_train, y_train)
# Predict the response for the test set
y_pred = knn.predict(X_test)
# Model accuracy: how often is the classifier correct?
print("Accuracy:", accuracy_score(y_test, y_pred))
# Show the predictions next to the test features (copy avoids
# writing into a slice of df)
X_test = X_test.copy()
X_test['Predictions'] = y_pred
X_test
2. K-Means Algorithm:-
from sklearn.cluster import KMeans

# Cluster the movies into 5 groups on gross collection and IMDB score
km = KMeans(n_clusters=5)
P_fit = km.fit(df[['gross', 'imdb_score']])
df['cluster'] = P_fit.labels_
print(len(df))
print(df.iloc[0])

# Fraction of rows whose cluster id equals the Success label.
# K-means numbers its clusters arbitrarily, so this figure depends on
# how the cluster ids happen to line up with the class labels.
correct = 0
for i in range(len(df)):
    predict_me = np.array(df[['gross', 'imdb_score']].iloc[i].astype(float))
    predict_me = predict_me.reshape(-1, len(predict_me))
    prediction = km.predict(predict_me)
    if prediction[0] == df['Success'].iloc[i]:
        correct += 1
print(correct / len(df))

# Plot each cluster in its own random colour
for i in np.unique(P_fit.labels_):
    temp = df[df['cluster'] == i]
    plt.scatter(temp['gross'], temp['imdb_score'],
                color=np.random.rand(3) * 0.7)
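K-means assigns cluster ids in arbitrary order, so cluster 0 need not correspond to class 0. One common way to score a clustering against known labels is to map each cluster to the class that occurs most often inside it and only then compute the accuracy. The sketch below shows this on synthetic data; the helper name map_clusters_to_classes is hypothetical, and this is not the evaluation used in the project.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score


def map_clusters_to_classes(cluster_ids, labels):
    """Return {cluster id: most frequent true label in that cluster}."""
    mapping = {}
    for c in np.unique(cluster_ids):
        members = labels[cluster_ids == c]
        values, counts = np.unique(members, return_counts=True)
        mapping[int(c)] = int(values[np.argmax(counts)])
    return mapping


# Synthetic, well-separated 2-D blobs standing in for (gross, imdb_score)
rng = np.random.default_rng(0)
centres = np.array([[0, 0], [10, 0], [0, 10]])
labels = np.repeat([0, 1, 2], 50)
X = centres[labels] + rng.normal(scale=0.5, size=(150, 2))

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Translate each cluster id into its majority class before scoring
mapping = map_clusters_to_classes(km.labels_, labels)
mapped = np.array([mapping[int(c)] for c in km.labels_])

print("raw id match:", accuracy_score(labels, km.labels_))
print("after mapping:", accuracy_score(labels, mapped))
```

The raw id match can be anywhere between 0 and 1 for the same clustering, which is why the mapped score is the more meaningful of the two numbers.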
3. Random Forest Algorithm:-
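import xgboost as xgb
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# col is the list of attribute column names built during pre-processing
# (not shown); copy it so that removing a name does not modify col itself.
# The target column 'Success' must not be among the features.
features = list(col)
features.remove('imdb_score')
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df['Success'], test_size=0.2)
# 250 boosted trees
rf = xgb.XGBClassifier(n_estimators=250)
rf.fit(X_train, y_train)
predictions = rf.predict(X_test)
predictions = predictions.astype(int)
np.unique(predictions)
accuracy_score(y_test, predictions)
# run_knn and df_knn are defined elsewhere in the notebook (not shown)
run_knn(df_knn)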
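The call above uses xgb.XGBClassifier, which trains gradient-boosted trees. For comparison, a bagged random forest with the same number of trees could be trained with scikit-learn's RandomForestClassifier. The sketch below is an illustration on synthetic data generated by make_classification; it does not use the project's dataset and its accuracy says nothing about the project's results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic 5-class problem standing in for the movie data
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5,
    n_classes=5, n_clusters_per_class=1, random_state=0)

# Stratify so that every class appears in both train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# 250 trees, each grown on a bootstrap sample of the training rows
rf = RandomForestClassifier(n_estimators=250, random_state=0)
rf.fit(X_train, y_train)

predictions = rf.predict(X_test)
acc = accuracy_score(y_test, predictions)
print("Accuracy:", acc)
```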
OUTPUT:-
1. K Means Algorithm
Displaying the accuracy of the K Means Algorithm
Displaying the predicted cluster for the class based on IMDB score
Plotting the output of the k means Algorithm
2. Random Forest Algorithm:-
Displaying the Features
Predicted class vs real class by random forest classifier
Random forest algorithm accuracy
Displaying the heatmap
3. KNN Algorithm
Displaying the KNN Predicted class vs. the actual dataset
Displaying the accuracy of the KNN model
RESULTS
Accuracy of the Random Forest Algorithm: 1
Accuracy of the KNN Algorithm: 0.83
Accuracy of the K Means Algorithm: 0.3805
In this run, the accuracy of the K Means Algorithm is the lowest, followed by
KNN, and the accuracy of the random forest algorithm is the highest.
CONCLUSION
Movie success prediction plays an important role in telling the stakeholders about
the likely success of a movie. It helps not only the directors and producers who
have put their money into the script but also the viewers, who come to know whether
the movie is going to be successful or not. By applying k-nearest neighbours,
random forest and k-means to the same movie dataset, we were able to compare the
accuracies of the algorithms and test which algorithm turns out to be better. In
our comparison, the random forest algorithm has the highest accuracy with a score
of 1, while k-means has the lowest accuracy with 0.38. However, these accuracy
values may change from run to run because the training and test data are split
at random each time. To find the best algorithm we need to take the mean of the
accuracy values over a large number of iterations.
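Averaging the accuracy over many train/test splits, as suggested above, can be done with repeated stratified cross-validation. The sketch below uses synthetic data and a KNN classifier as a stand-in for any of the three models; the number of splits, the number of repeats and n_neighbors=5 are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class problem standing in for the movie data
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4,
    n_classes=3, n_clusters_per_class=1, random_state=0)

# Scaling sits inside the pipeline so it is refitted on each training fold
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

# 5 folds repeated 10 times with different shuffles gives 50 scores
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

print(f"mean accuracy over {len(scores)} splits: "
      f"{scores.mean():.3f} (std {scores.std():.3f})")
```

Reporting the standard deviation next to the mean shows how much of the difference between two algorithms could be due to the random split alone.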
FUTURE WORK
The project can be extended by applying more algorithms on the dataset and
comparing their accuracies on the dataset.