(ITE2006)
B. Tech
In
Information Technology
By
NEELENDU WADHWA-16BIT0033
AKASH-16BIT0144
UJJWAL KUMAR-16BIT0223
Dr. Prabhavathy P
November, 2018
ACKNOWLEDGEMENT
DECLARATION BY THE CANDIDATES
We, the members of the group, hereby declare that the project report
entitled “Movie Success Prediction using multiple algorithms”
submitted by us to VIT University, Vellore in partial fulfilment of the
requirement for the award of the degree of Bachelor of Technology is
a record of J component of project work carried out by us under the
guidance of Prof Dr. Prabhavathy P. We further declare that the work
reported in this project has not been submitted and will not be
submitted, either in part or in full, for the award of any other degree or
diploma in this institute or any other institute or university.
Place: Vellore
Date: 1/11/2018
TEAM MEMBERS:-
NEELENDU WADHWA
AKASH
UJJWAL KUMAR
CERTIFICATE
Place:-Vellore
Date:-1/11/2018
Signature of faculty:
TABLE OF CONTENTS
ABSTRACT
Predicting the success of a movie matters to the movie industry because film
production involves huge investments. However, success cannot be predicted from
any single attribute. In this project, we developed a model to predict the success
or failure of upcoming movies based on several attributes. The criteria used in
calculating movie success included budget, actors, director, producer, set
locations, story writer, movie release day, competing movie releases at the same
time, music, release location and target audience. The model is built on the
relations between these attributes. The movie industry can use this model to
adjust a movie's criteria to improve the likelihood of a blockbuster, and
moviegoers can use it to identify a likely blockbuster before purchasing a
ticket. We used three algorithms, namely random forest, k-means and k-nearest
neighbours, for movie success prediction and compared their accuracy on the same
dataset.
INTRODUCTION
Producers and directors invest a lot of money in making a good movie: paying the
cast, hiring the right technicians and so on. If the movie is not successful, much
of that money is wasted. It is therefore very important for the stakeholders to
know whether a movie will be successful. In this project we address this problem
by predicting the success of a movie with the help of its IMDb rating. We apply
three algorithms, k-nearest neighbours, k-means and random forest, to predict the
success of the movie. K-nearest neighbours (KNN) is one of the most basic yet
essential classification algorithms in machine learning. It belongs to the
supervised learning domain and is widely applied in pattern recognition, data
mining and intrusion detection. K-means clustering is a type of unsupervised
learning, used when the data is unlabelled (i.e., data without defined categories
or groups). The goal of this algorithm is to find groups in the data, with the
number of groups represented by the variable K. The algorithm works iteratively
to assign each data point to one of K groups based on the features that are
provided. Data points are clustered based on feature similarity. Random forest is
a flexible, easy-to-use machine learning algorithm that often produces good
results even without hyper-parameter tuning. It is also one of the most used
algorithms because of its simplicity and the fact that it can be used for both
classification and regression tasks. We compare the accuracy of the three
algorithms in this project.
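The iterative assignment that k-means performs can be sketched in a few lines of NumPy. This is only a minimal illustration of the standard k-means iteration (assign each point to its nearest centroid, then move each centroid to the mean of its points), not the code used in the project; the function name kmeans, its parameters and the four toy points are made up for the example.

```python
import numpy as np


def kmeans(points, k, n_iter=100, seed=0):
    """Plain k-means: alternate nearest-centroid assignment and centroid update."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every point to every centroid, shape (n_points, k).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its members (keep it if it has none).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # assignments are stable
        centroids = new_centroids
    return labels, centroids


# Two obvious groups of toy points (illustrative values only).
toy_points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
toy_labels, toy_centroids = kmeans(toy_points, k=2)
print(toy_labels)
```

The cluster numbers returned are arbitrary identifiers: which group is called 0 and which is called 1 depends on the random starting points.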
REQUIREMENTS
Hardware Requirements:-
8 GB RAM recommended
i5 processor recommended
Software Requirements:-
Colab
Programming Language:-
Python
PROPOSED SYSTEM
In this project we first import a movie dataset which has attributes such as
IMDb score, actor name, actor Facebook likes, director name, director Facebook
likes etc. We then pre-process the data by filling in the missing values and
normalising the attributes with the standard scaler technique, after which the
dataset is ready for processing. To obtain the class label of each movie in the
dataset, we define a function, classify, that returns one of 5 values (0 to 4)
based on the IMDb rating of the movie. The resulting column is added to the
dataset. We then write the code for the KNN, K-means and random forest
algorithms, run each on our dataset to find its accuracy, and compare the
accuracies of the algorithms on the chosen dataset.
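The pre-processing and labelling described above could look roughly like the sketch below. The report does not give the rating cut-offs used by classify, so the thresholds 2, 4, 6 and 8 are assumptions chosen only for illustration, as are the helper name preprocess, the column name director_facebook_likes and the toy values.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler


def classify(score):
    """Map an IMDb score (0-10) to a class label 0-4.

    The cut-offs 2, 4, 6 and 8 are assumed for illustration only.
    """
    if score < 2:
        return 0
    if score < 4:
        return 1
    if score < 6:
        return 2
    if score < 8:
        return 3
    return 4


def preprocess(df, numeric_cols):
    """Fill missing values with column medians, add labels, then standardise."""
    out = df.copy()
    out[numeric_cols] = out[numeric_cols].fillna(out[numeric_cols].median())
    out["Success"] = out["imdb_score"].apply(classify)
    # The rating itself is left unscaled because the label is derived from it.
    feature_cols = [c for c in numeric_cols if c != "imdb_score"]
    out[feature_cols] = StandardScaler().fit_transform(out[feature_cols])
    return out


# Tiny made-up frame, only to show the call.
toy = pd.DataFrame({
    "imdb_score": [7.9, 5.1, np.nan, 8.4],
    "gross": [2.0e8, np.nan, 4.0e7, 6.5e8],
    "director_facebook_likes": [0.0, 563.0, np.nan, 22000.0],
})
clean = preprocess(toy, ["imdb_score", "gross", "director_facebook_likes"])
print(clean)
```

Filling with the median rather than the mean keeps a few very large values (for example, the gross of a blockbuster) from distorting the imputed entries.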
IMPLEMENTATION
4. The movie IMDb ratings are converted to 5 class labels from 0 to 4, and the
predictions are then made using these 5 class labels.
5. Success column is added to the data frame which consists of the class values
for the different rows.
6. Null values in the dataset are replaced with the median values of the column
and normalization is done.
7. Using the class label values in the success column the predictions are
validated.
8. Scaling of the attribute values is done using the standard scaling algorithm.
9. Data is clustered using k means clustering and accuracy score for the
predicted values is calculated.
10. 5 clusters are created and the rows are clustered into the 5 different clusters
and a graph is plotted.
11. The values of the gross collection of the movie and IMDB rating are used as
features for the clustering.
12. Random forest is applied on the dataset and accuracy score for the
predictions is calculated.
13. The number of estimators (trees) used in the random forest algorithm is 250.
The XGBoost algorithm is used for boosting the random forest.
14. The k-nearest neighbours algorithm is applied on the dataset and accuracy
score for the predictions is calculated.
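Steps 9 to 11 score the clusters against the class labels. Because k-means assigns cluster numbers arbitrarily, one common way to make such a comparison is to first map every cluster to the class that occurs most often inside it. The sketch below shows that idea under stated assumptions: the function name cluster_accuracy and the synthetic data are hypothetical, and this is not necessarily how the project computed its score.

```python
import numpy as np
from sklearn.cluster import KMeans


def cluster_accuracy(X, y, n_clusters=5, seed=0):
    """Accuracy after relabelling each cluster with its majority class."""
    y = np.asarray(y)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    clusters = km.fit_predict(X)
    mapped = np.empty_like(y)
    for c in np.unique(clusters):
        members = clusters == c
        # Most frequent true label among the points of this cluster.
        mapped[members] = np.bincount(y[members]).argmax()
    return float((mapped == y).mean())


# Synthetic, well-separated groups (not the movie data).
rng = np.random.default_rng(0)
centres = np.array([[0.0, 0.0], [10.0, 10.0], [20.0, 0.0]])
y_demo = np.repeat(np.arange(3), 30)
X_demo = centres[y_demo] + rng.normal(scale=0.5, size=(90, 2))
print(cluster_accuracy(X_demo, y_demo, n_clusters=3))
```

With the project's data, X would hold the gross collection and IMDb rating columns and y the Success labels.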
CODE AND OUTPUT:-
CODE:-
1. KNN Algorithm:-
2. K-Means Algorithm:-
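import numpy as np
import matplotlib.pyplot as plt
import xgboost as xgb
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Cluster the movies into 5 groups using gross collection and IMDb score.
km = KMeans(n_clusters=5)
P_fit = km.fit(df[['gross', 'imdb_score']])
df['cluster'] = P_fit.labels_

# Count the rows whose predicted cluster number equals the Success label.
correct = 0
print(len(df))
print(df.iloc[0])
for i in range(len(df)):
    # predict() expects a 2-D array, so reshape the single row to (1, 2).
    predict_me = np.array(df[['gross', 'imdb_score']].iloc[i].astype(float)).reshape(1, -1)
    prediction = km.predict(predict_me)
    if prediction[0] == df['Success'].iloc[i]:
        correct += 1
print(correct / len(df))

# Plot every cluster in its own random colour.
print(np.unique(P_fit.labels_))
for i in np.unique(P_fit.labels_):
    temp = df[df['cluster'] == i]
    plt.scatter(temp['gross'], temp['imdb_score'], color=np.random.rand(3) * 0.7)

# Random forest step: col is the list of column names defined earlier in the notebook.
features = list(col)
features.remove('imdb_score')
X_train, X_test, y_train, y_test = train_test_split(df[features], df['Success'], test_size=0.2)

# Tree ensemble with 250 estimators, built with XGBoost's XGBClassifier.
rf = xgb.XGBClassifier(n_estimators=250)
rf.fit(X_train, y_train)
predictions = rf.predict(X_test)
predictions = predictions.astype(int)
print(np.unique(predictions))
print(accuracy_score(y_test, predictions))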
run_knn(df_knn)
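The body of run_knn is not shown in the report. A hypothetical stand-in, assuming it splits the data, scales the features and scores a k-nearest neighbours classifier, might look like the sketch below. The name run_knn_sketch, the choice k=5 and the synthetic data are all assumptions; unlike run_knn, it takes a feature matrix and a label vector rather than a data frame.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler


def run_knn_sketch(X, y, k=5, seed=0):
    """Train KNN on an 80/20 split and return the test accuracy."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    # Fit the scaler on the training part only, then apply it to both parts.
    scaler = StandardScaler().fit(X_train)
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(scaler.transform(X_train), y_train)
    return accuracy_score(y_test, model.predict(scaler.transform(X_test)))


# Synthetic, well-separated classes (not the movie data).
rng = np.random.default_rng(1)
centres = np.array([[0.0, 0.0], [10.0, 10.0], [20.0, 0.0]])
y_knn = np.repeat(np.arange(3), 30)
X_knn = centres[y_knn] + rng.normal(scale=0.5, size=(90, 2))
knn_accuracy = run_knn_sketch(X_knn, y_knn)
print(knn_accuracy)
```

Scaling matters for KNN because the algorithm relies on distances: without it, a large-valued attribute such as gross collection would dominate attributes measured on smaller scales.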
OUTPUT:-
1. K Means Algorithm
Displaying the predicted cluster for the class based on IMDB score
Displaying the Features
3. KNN Algorithm
RESULTS
In this iteration, the K-Means algorithm has the lowest accuracy, followed by KNN,
and the random forest algorithm has the highest accuracy.
CONCLUSION
Movie success prediction plays an important role in informing the stakeholders
about the likely success of a movie. It helps not only the directors and producers
who have put their money into the script but also the viewers, who come to know
whether the movie is going to be successful. By applying k-nearest neighbours,
random forest and k-means on the same movie dataset, we were able to compare the
accuracies of the algorithms and test which one performs better. In our
comparison, the random forest algorithm has the highest accuracy with a score of
1, while k-means has the lowest with 0.38. However, these accuracy values may
change in every iteration because the training and test data keep changing. To
find the best algorithm, we need to take the mean of the accuracy values over a
large number of iterations.
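Averaging the accuracy over many random train/test splits can be sketched with scikit-learn's ShuffleSplit and cross_val_score. The helper name mean_accuracy, the number of iterations, the model settings and the synthetic data below are assumptions made for illustration; scikit-learn's RandomForestClassifier stands in for the project's classifier.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.neighbors import KNeighborsClassifier


def mean_accuracy(model, X, y, n_iterations=20, seed=0):
    """Mean and standard deviation of accuracy over repeated 80/20 splits."""
    splitter = ShuffleSplit(n_splits=n_iterations, test_size=0.2, random_state=seed)
    scores = cross_val_score(model, X, y, cv=splitter, scoring="accuracy")
    return scores.mean(), scores.std()


# Synthetic, partly overlapping classes (not the movie data).
rng = np.random.default_rng(2)
centres = np.array([[0.0, 0.0], [4.0, 4.0], [8.0, 0.0]])
y_rep = np.repeat(np.arange(3), 50)
X_rep = centres[y_rep] + rng.normal(scale=2.0, size=(150, 2))

results = {
    "random forest": mean_accuracy(
        RandomForestClassifier(n_estimators=25, random_state=0), X_rep, y_rep
    ),
    "knn": mean_accuracy(KNeighborsClassifier(n_neighbors=5), X_rep, y_rep),
}
for name, (mean, std) in results.items():
    print(f"{name}: {mean:.3f} +/- {std:.3f}")
```

Reporting the standard deviation next to the mean shows how much a single split can differ from the average.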
FUTURE WORK
The project can be extended by applying more algorithms to the dataset and
comparing their accuracies.