Bi 501 03 Final Imdb Rating

MIS 6324
Business Analytics with SAS
Fall 2016 - Section: 501
Group 3
Project Report
Instructor: Professor Zhe (James) Zhang
Group Members:
1) Amit Shirsat
2) Bushra Saleem
3) Ram Prakash Reddy Kolli
4) Sagar Sanghavi
5) Sarath Chandra Mandava
6) Yeshwanth Reddy Gummitha
Table of Contents
1. Description
2. Objective
3. Data Source
4. Exploratory Data Analysis
   4.1. Variable Description
   4.2. Variables selection
   4.3. Data Partitioning
   4.4. Data Summary
5. Data Pre-processing
   5.1. Data Cleaning
      5.1.1. Replacement Node
      5.1.2. Impute Node
   5.2. Transforming Data
6. Un-Supervised Analysis
   6.1. Clustering
   6.2. Decision Tree
7. Supervised Analysis
   7.1. Regression
   7.2. Neural Network
8. Model comparison
9. SAS Enterprise data model Diagram
10. Observations
Option 1: Mining Real-Life Data

Data Set: IMDB 5000 Movie Dataset
Type of Dataset: Second-hand Data
Source: https://www.kaggle.com/deepmatrix/imdb-5000-movie-dataset
1. Description:
Thousands of movies are released every year on different platforms.
To decide which movies are the best, many people rely on reviews by
critics, recommendations from friends and family, or their own
instincts. However, these sources are not always immediately available
or reliable. The dataset that we chose contains information from IMDB
(the Internet Movie Database) about 5043 movies described by 28
variables, spanning 100 years and 66 countries. Each record contains
information such as movie title, director, duration, year of release,
genre, IMDB rating and a few other interesting variables, which makes
the dataset multivariate.
2. Objective:
Our project focuses on creating a reliable system that predicts the
IMDB ratings of upcoming movies. Our aim is to predict movie ratings
by applying data mining concepts.
3. Data Source:
Source: https://www.kaggle.com/deepmatrix/imdb-5000-movie-dataset
Title                     Value
Observations in Data Set  5043
Number of Variables       28
4. Exploratory Data Analysis:
4.1. Variable Description
Variable Title             Description                                 Data Type  Sample Values
director_name              The film director controls a film's         Nominal    Hoyt Yeatman
                           artistic and dramatic aspects and                      Jonathan Liebesman
                           visualizes the script while guiding the
                           technical crew and actors in the
                           fulfillment of that vision
num_critic_for_reviews     Professionals rating the movie              Interval   723
                                                                                  302
duration                   Time duration of the movie                  Interval   88min
                                                                                  99min
director_facebook_likes    Facebook likes that define director rating  Interval   0
                           as good or bad                                         563
actor_3_facebook_likes     Facebook likes that define actor rating as  Interval   855
                           good or bad                                            1000
actor_2_name               A person whose profession is acting on the  Nominal    Joel David Moore
                           stage, in movies, or on television                     Orlando Bloom
actor_1_facebook_likes     Facebook likes that define actor rating as  Interval
                           good or bad
gross                      Gross collection, including total income    Interval   760505847
                           earned from theaters                                   309404152
genres                     Categorizes the movie as comedy, action,    Nominal    Action|Adventure|Fantasy|Sci-Fi
                           romantic, etc.                                         Action|Adventure|Fantasy
actor_1_name               A person whose profession is acting on the  Nominal    CCH Pounder
                           stage, in movies, or on television                     Johnny Depp
movie_title                Name of the movie                           Nominal    Avatar
                                                                                  Pirates of the Caribbean: At World's End
num_voted_users            Count of people who voted for/rated the     Interval   886204
                           movie                                                  471220
cast_total_facebook_likes  Total Facebook likes of all the actors in   Interval   4834
                           the movie's cast, defining rating as good              48350
                           or bad
actor_3_name               A person whose profession is acting on the  Nominal    Wes Studi
                           stage, in movies, or on television                     Jack Davenport
facenumber_in_poster       Number of actors on the movie poster, for   Interval   1
                           advertising purposes                                   4
plot_keywords              Keywords defining the movie, by users       Nominal    avatar|future|marine|native|paraplegic
                                                                                  goddess|marriage ceremony|marriage proposal|pirate|singapore
movie_imdb_link            Movie web link on the Internet Movie        Nominal    http://www.imdb.com/title/tt0091415/
                           Database
num_user_for_reviews       Number of people/users rating the movie     Interval   3054
                                                                                  1238
language                   Language of the movie                       Nominal    English
                                                                                  English
country                    Country in which the movie was recorded     Nominal    US, UK
content_rating             Age rating                                  Nominal    PG-13
                                                                                  PG-13
budget                     A particular amount of money in a budget    Interval   237000000
                                                                                  300000000
title_year                 Copyright year of the movie                 Nominal    2009
                                                                                  2016
actor_2_facebook_likes     Facebook likes that define actor rating as  Interval
                           good or bad
imdb_score                 IMDB rating                                 Interval   7.9
                                                                                  7.1
aspect_ratio               The aspect ratio of an image describes the  Interval   1.78
                           proportional relationship between its                  2.35
                           width and its height
movie_facebook_likes       Facebook likes that define rating of movie  Interval   33000
                           as good or bad                                         0
color                      Black and white or colored screen movie     Nominal    Color
                                                                                  Black and White
Target Variable: imdb_score
SAS Enterprise Miner Column Metadata
4.2. Variables selection
The following variables from the data set are excluded from model
creation:
actor_1_name
actor_2_name
actor_3_name
director_name
movie_imdb_link
plot_keywords
These variables are excluded from the data set using the DROP node in
SAS Enterprise Miner.
SAS Enterprise Miner DROP node setting
4.3. Data Partitioning:
For model training, validation and testing, the data set was
partitioned into the following data sets:
Data Set             Composition %  No. of Observations
Training Data Set    60             3026
Validation Data Set  30             1513
Test Data Set        10             504
The data is partitioned using the DATA PARTITION node in SAS
Enterprise Miner.
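As a rough illustration of the 60/30/10 split above, the following
Python sketch shuffles the row indices and cuts them into three parts.
The function name partition_indices and the seed are assumptions made
for this example; the DATA PARTITION node's own sampling method may
differ in detail.

```python
import random

def partition_indices(n_rows, train_frac=0.6, valid_frac=0.3, seed=12345):
    """Shuffle row indices and split them into train/validation/test lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seed is an arbitrary assumption
    n_train = round(n_rows * train_frac)
    n_valid = round(n_rows * valid_frac)
    train = indices[:n_train]
    valid = indices[n_train:n_train + n_valid]
    test = indices[n_train + n_valid:]  # the remainder (about 10%)
    return train, valid, test

train, valid, test = partition_indices(5043)
print(len(train), len(valid), len(test))  # 3026 1513 504
```

Giving the remainder to the test set keeps the three counts summing
to the 5043 observations in the data set.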
SAS Enterprise Miner DATA PARTITION node setting
DATA PARTITION node Output
4.4. Data Summary
The data summary is obtained using the STATEXPLORE node in SAS
Enterprise Miner.
SAS Enterprise Miner Train Data set STATEXPLORE node Output
Train Data Set Data Distribution
5. Data Pre-processing:
5.1. Data Cleaning:
The missing values in the data set are replaced using the REPLACEMENT
and IMPUTE nodes in SAS Enterprise Miner.
The table below describes how the missing values of each variable in
the data set were treated.
Variables                  Missing Value Count  Treated Value
actor_1_facebook_likes     7                    Median of the variable in data set
actor_1_name               0                    ..
actor_2_facebook_likes                          Median of the variable in data set
actor_2_name               1                    UNKNOWN
actor_3_facebook_likes                          Median of the variable in data set
actor_3_name               2                    UNKNOWN
aspect_ratio               329                  1.77
budget                     492                  Median of the variable in data set
cast_total_facebook_likes  0                    ..
color                      19                   Color
content_rating             303
country                    5                    UNKNOWN
director_facebook_likes    104                  Median of the variable in data set
director_name              20                   UNKNOWN
duration                   15                   Median of the variable in data set
facenumber_in_poster       13                   Median of the variable in data set
genres                     0                    ..
gross                      884                  Median of the variable in data set
imdb_score                 0                    ..
language                   12                   UNKNOWN
movie_facebook_likes       0                    ..
movie_imdb_link            0                    ..
movie_title                0                    ..
num_critic_for_reviews     50                   Median of the variable in data set
num_user_for_reviews       21                   Median of the variable in data set
num_voted_users            0                    ..
plot_keywords              0                    ..
title_year                 108                  UNKNOWN
5.1.1. Replacement Node:

The Replacement node is used to replace missing class variable values
in the data set.
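The idea behind this replacement can be sketched in a few lines of
Python. This is only an illustration, not the node's implementation:
the function name and the sample country values are hypothetical, and
the "UNKNOWN" label follows the treated values listed in the table
above.

```python
def replace_missing_class(values, replacement="UNKNOWN"):
    """Return a copy of values with missing entries (None or blank) replaced."""
    return [replacement if v is None or str(v).strip() == "" else v
            for v in values]

# Hypothetical sample of a class (nominal) variable with two missing entries.
countries = ["US", None, "UK", ""]
print(replace_missing_class(countries))  # ['US', 'UNKNOWN', 'UK', 'UNKNOWN']
```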
SAS Enterprise Miner Replacement Node Settings
SAS Enterprise Miner Replacement Node Output
5.1.2. Impute Node:

The Impute node is used to replace missing interval variable values
in the data set.
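A minimal Python sketch of median imputation, the treatment listed for
the interval variables, is given below. The function name and the
duration values are hypothetical; the sketch simply fills each missing
entry with the median of the observed entries.

```python
import statistics

def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]

# Hypothetical movie durations in minutes, with one missing value.
durations = [88, 99, None, 120, 95]
print(impute_median(durations))  # [88, 99, 97.0, 120, 95]
```

The median is less sensitive than the mean to extreme values, which
matters for heavily skewed variables such as budget and gross.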
SAS Enterprise Miner Impute Node Settings
SAS Enterprise Miner Impute Node Output
Data Summary after Replacement and Impute node
5.2. Transforming Data:

The skewness of the variable distributions is reduced by applying the
Log 10 function with the Transform Variables node in SAS Enterprise
Miner.
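The effect of a log transformation on skewness can be illustrated with
the Python sketch below. It assumes an offset of 1 is added before
taking log10 so that zero values stay defined; the offset, the
function names and the sample values are assumptions for this example
and are not taken from the Transform Variables node output.

```python
import math

def skewness(values):
    """Population skewness: third central moment over the cubed std dev."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def log10_transform(values, offset=1.0):
    """Apply log10(x + offset); the offset (assumed) keeps zeros defined."""
    return [math.log10(v + offset) for v in values]

# Hypothetical right-skewed counts, e.g. Facebook likes.
likes = [0, 10, 100, 1000, 100000]
before = skewness(likes)
after = skewness(log10_transform(likes))
print(round(before, 2), round(after, 2))
```

On data like this, where one very large value dominates, the skewness
after the transformation is much closer to zero than before it.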
SAS Enterprise Miner Transform Node Settings
SAS Enterprise Miner Transform Node Output
6. Un-Supervised Analysis:
6.1. Clustering:
Cluster analysis is performed on the data model using the CLUSTER
node in SAS Enterprise Miner. It takes its input data from the DROP
node in the model.
Cluster Node Settings:
1. Clustering Method = Ward
2. CCC cutoff = 3
SAS Enterprise Miner Cluster Node Settings
Cluster Node Output:
Cluster Pie Diagram
CCC Plot
Cluster Hierarchy Structure
Cluster Scatter plot
Variable Importance Table
Cluster Analysis Observations:
1. SAS Enterprise Miner created 20 clusters, with a CCC value of
   389.55.
2. In the CCC plot, the CCC value increases continually as the number
   of clusters increases. This may be due to the following reasons:
      The data was grainy
      The data may have been excessively rounded
      The data was recorded with just a few digits
3. The cluster scatter plot shows that the generated clusters are
   close to each other.
4. According to the cluster analysis, the following variables are
   important:
      cast_total_facebook_likes
      facenumber_in_poster
      actor_2_facebook_likes
      actor_3_facebook_likes
      num_voted_users
      color
      movie_facebook_likes
6.2. Decision Tree:
Decision Tree analysis is performed on the data model using the
Decision Tree node in SAS Enterprise Miner. It takes its input data
from the Replacement node in the model.
Decision Tree Node Settings:

1. Assessment Measure = Average Square Error
2. Target Variable = imdb_score
SAS Enterprise Miner Decision Tree Node Settings
Decision Tree Output:
Tree diagram
Subtree Assessment Plot: Average Square Error
Variable Importance Table
Fit Statistics
Decision Tree Analysis Observations:
1. SAS Enterprise Miner created a decision tree with 35 leaves.
2. Decision tree Average Square Error:
      Train Data: 0.6928
      Validation Data: 0.901
      Test Data: 0.7237
3. According to the decision tree analysis, the following variables
   are important:
      num_voted_users
      duration
      genres
      budget
      gross
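For reference, the Average Square Error reported above is the mean of
the squared differences between actual and predicted ratings. The
Python sketch below shows the calculation on a handful of hypothetical
ratings; the function name and the numbers are illustrative only and
are not taken from the model output.

```python
def average_squared_error(actual, predicted):
    """Mean of squared differences between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical IMDB scores and model predictions.
actual = [7.9, 7.1, 6.8, 8.5]
predicted = [7.4, 7.0, 7.3, 8.0]
print(round(average_squared_error(actual, predicted), 4))  # 0.19
```

Applying the same calculation separately to the train, validation and
test partitions gives one figure per partition, as in the list above.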
7. Supervised Analysis:
7.1. Regression:
Regression analysis is performed on the data model using the
Regression node in SAS Enterprise Miner. It takes its input data from
the Transform Variables node in the model.
Since the target variable is an interval variable, SAS Enterprise
Miner performs linear regression.
Regression Node Settings:
1. Selection mode: Stepwise
2. Target variable: imdb_score
SAS Enterprise Miner Regression Node Settings
Regression Node Output:
Fit Statistics
Residual
Residual Box plot
Regression Analysis Observations:
1. The R Square value of the regression model was 0.5552. This means
   the variables in the regression model explain 55.52% of the
   variation in the target variable.
2. The box plot of the residuals shows that 50% of the train data lies
   close to the regression line, with residuals between -0.3448 and
   0.4057.
3. The Analysis of Variance output gives a p-value < 0.0001, so we can
   reject the null hypothesis and conclude that the following
   variables are required for the regression analysis:
      LG10_IMP_actor_2_facebook_likes
      LG10_IMP_budget
      LG10_IMP_duration
      LG10_IMP_num_critic_for_reviews
      LG10_num_voted_users
      REP_color
      REP_content_rating
      REP_country
      REP_genres
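The R Square statistic in observation 1 is one minus the ratio of the
residual sum of squares to the total sum of squares. The Python sketch
below works through the formula on hypothetical values; the function
name and the numbers are assumptions for illustration and do not
reproduce the regression output.

```python
def r_squared(actual, predicted):
    """R Square = 1 - (residual sum of squares / total sum of squares)."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Hypothetical IMDB scores and fitted values.
actual = [6.0, 7.0, 8.0, 9.0]
predicted = [6.5, 6.5, 8.5, 8.5]
print(r_squared(actual, predicted))  # 0.8
```

Here the total sum of squares is 5.0 and the residual sum of squares
is 1.0, so the fitted values account for 80% of the variation.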
7.2. Neural Network
We performed two Neural Network analyses: one in which the Neural
Network node takes its input from the Variable Selection node in SAS
Enterprise Miner, and one without it.
Neural Network Node Settings:

1. Model Selection Criteria: Average Error
2. Remaining are default settings
SAS Enterprise Miner Neural Network Node Settings
7.2.1. Neural Network without variable selection node

The Variable Selection node removes the variables with very low R
Square values from the input data to the Neural Network node.
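A much simplified Python sketch of this kind of R Square filter is
given below: each input is kept only if its squared correlation with
the target reaches a minimum threshold. The function names, the 0.005
threshold and the sample values are assumptions for illustration; the
Variable Selection node itself applies a more elaborate procedure.

```python
def squared_correlation(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column explains nothing
    return sxy ** 2 / (sxx * syy)

def select_variables(inputs, target, min_r_square=0.005):
    """Keep inputs whose R Square with the target meets the threshold."""
    return [name for name, column in inputs.items()
            if squared_correlation(column, target) >= min_r_square]

# Hypothetical values; the threshold above is an assumed setting.
target = [6.0, 7.0, 8.0, 9.0]
inputs = {
    "num_voted_users": [100, 200, 300, 400],
    "facenumber_in_poster": [1, 0, 0, 1],
}
print(select_variables(inputs, target))  # ['num_voted_users']
```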
Variable Selection Output:
Neural Network Iteration Plot: Average Square Error
Fit Statistics
7.2.2. Neural Network with variable selection node
Neural Network Iteration Plot: Average Square Error
Fit Statistics
Observations from Neural Networks:

1. Average Square Error of train data:
      Neural Network without variable selection = 0.59558
      Neural Network with variable selection = 0.744519
2. Number of training iterations:
      Neural Network without variable selection = 48
      Neural Network with variable selection = 0
8. Model comparison:
The Model Comparison node in SAS Enterprise Miner is used to compare
all analysis methods.
Input:
      Output from Decision Tree node
      Output from Regression node
      Output from Neural Network without variable selection
      Output from Neural Network with variable selection
Selection Criteria = Average Square Error
Fit Statistics
Model Comparison observations:

Neural network without variable selection is the best model,
with Average Square Error = 0.779947
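Selecting a model by this criterion amounts to picking the candidate
with the lowest Average Square Error. The Python sketch below shows
the selection step only; the function name, the model names and the
error values are hypothetical placeholders, not the figures from the
Fit Statistics table.

```python
def best_model(ase_by_model):
    """Return the (name, error) pair with the lowest Average Square Error."""
    return min(ase_by_model.items(), key=lambda item: item[1])

# Hypothetical placeholder values, not results from this project.
candidates = {"model_a": 0.91, "model_b": 0.85, "model_c": 0.78}
print(best_model(candidates))  # ('model_c', 0.78)
```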
9. SAS Enterprise data model Diagram:
10. Observations:
1. The neural network without variable selection is the best data
   analysis method for predicting IMDB rating.
2. The variables that play a key role in IMDB rating prediction are:
      Number of Facebook likes for actor 2
      Movie budget
      Duration of the movie
      Number of critics who reviewed the movie
      Number of users who voted for the movie
      Whether the movie was in color or black and white
      Movie content rating
      Country where the movie was made
      Genre of the movie