Group- 3
Project Report
Group Members:
1) Amit Shirsat
2) Bushra Saleem
3) Ram Prakash Reddy Kolli
4) Sagar Sanghavi
5) Sarath Chandra Mandava
6) Yeshwanth Reddy Gummitha
Table of Contents
1. Description
2. Objective
3. Data Source
4. Exploratory Data Analysis
4.1. Variable Description
4.2. Variables selection
4.3. Data Partitioning
4.4. Data Summary
5. Data Pre-processing
5.1. Data Cleaning
5.1.1. Replacement Node
5.1.2. Impute Node
5.2. Transforming Data
6. Un-Supervised Analysis
6.1. Clustering
6.2. Decision Tree
7. Supervised Analysis
7.1. Regression
7.2. Neural Network
8. Model comparison
9. SAS Enterprise data model Diagram
10. Observations
Type of Dataset: Secondary (second-hand) data
Source: https://www.kaggle.com/deepmatrix/imdb-5000-movie-dataset
1. Description:
Thousands of movies are released every year on different platforms. To
decide which movies are the best, many people rely on reviews by critics,
recommendations from friends and family, or their own instincts. However,
these sources are not always timely or reliable. The dataset that we
chose contains information from IMDB (the Internet Movie Database)
about 5043 movies described by 28 variables, spanning 100 years and
66 countries. The data is stored in a table that contains information
such as movie title, director, duration, year of release, genre, IMDB
rating, and a few other variables, which makes the dataset multivariate.
2. Objective:
Our project focuses on building a reliable system that predicts the
IMDB ratings of upcoming movies by applying data mining concepts.
3. Data Source:
Source: https://www.kaggle.com/deepmatrix/imdb-5000-movie-dataset
Title                       Value
Observations in Data Set    5043
Number of Variables         28
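4. Exploratory Data Analysis:
4.1. Variable Description
director_name (Data Type: Nominal)
    Description: Name of the film director, who controls a film's artistic and dramatic aspects and visualizes the script while guiding the technical crew and actors in the fulfillment of that vision.
    Sample Values: Hoyt Yeatman; Jonathan Liebesman
num_critic_for_reviews (Data Type: Interval)
    Description: Number of professional critics who reviewed the movie.
    Sample Values: 723; 302
duration (Data Type: Interval)
    Description: Running time of the movie.
    Sample Values: 88 min; 99 min
director_facebook_likes (Data Type: Interval)
    Description: Number of Facebook likes for the director, used to indicate whether the director is rated as good or bad.
    Sample Values: 0; 563
actor_3_facebook_likes (Data Type: Interval)
    Description: Number of Facebook likes for actor 3, used to indicate whether the actor is rated as good or bad.
    Sample Values: 855; 1000
actor_2_name (Data Type: Nominal)
    Description: Name of actor 2. An actor is a person whose profession is acting on the stage, in movies, or on television.
    Sample Values: Joel David Moore; Orlando Bloom
actor_1_facebook_likes (Data Type: Interval)
    Description: Number of Facebook likes for actor 1, used to indicate whether the actor is rated as good or bad.
    Sample Values: 1000; 40000
gross (Data Type: Interval)
    Description: Gross collection, i.e. the total income earned from theaters.
    Sample Values: 760505847; 309404152
genres (Data Type: Nominal)
    Description: Categories the movie belongs to, such as comedy, action, or romance.
    Sample Values: Action|Adventure|Fantasy|Sci-Fi; Action|Adventure|Fantasy
actor_1_name (Data Type: Nominal)
    Description: Name of actor 1. An actor is a person whose profession is acting on the stage, in movies, or on television.
    Sample Values: CCH Pounder; Johnny Depp
movie_title (Data Type: Nominal)
    Description: Name of the movie.
    Sample Values: Avatar; Pirates of the Caribbean: At World's End
num_voted_users (Data Type: Interval)
    Description: Number of people who voted on (rated) the movie.
    Sample Values: 886204; 471220
cast_total_facebook_likes (Data Type: Interval)
    Description: Total number of Facebook likes for the cast (all the actors in the movie), used to indicate whether the cast is rated as good or bad.
    Sample Values: 4834; 48350
actor_3_name (Data Type: Nominal)
    Description: Name of actor 3. An actor is a person whose profession is acting on the stage, in movies, or on television.
    Sample Values: Wes Studi; Jack Davenport
facenumber_in_poster (Data Type: Interval)
    Description: Number of actors shown on the movie poster used for advertising.
    Sample Values: 1; 4
plot_keywords (Data Type: Nominal)
    Description: Keywords assigned by users to describe the movie.
    Sample Values: avatar|future|marine|native|paraplegic; goddess|marriage ceremony|marriage proposal|pirate|singapore
movie_imdb_link (Data Type: Nominal)
    Description: Web link to the movie's page on the Internet Movie Database.
    Sample Values: http://www.imdb.com/title/tt0091415/
num_user_for_reviews (Data Type: Interval)
    Description: Number of users who reviewed the movie.
    Sample Values: 3054; 1238
language (Data Type: Nominal)
    Description: Language of the movie.
    Sample Values: English; English
country (Data Type: Nominal)
    Description: Country in which the movie was recorded.
    Sample Values: US; UK
content_rating (Data Type: Nominal)
    Description: Age rating.
    Sample Values: PG-13; PG-13
budget (Data Type: Interval)
    Description: Amount of money budgeted for the movie.
    Sample Values: 237000000; 300000000
title_year (Data Type: Nominal)
    Description: Copyright year of the movie.
    Sample Values: 2009; 2016
actor_2_facebook_likes (Data Type: Interval)
    Description: Number of Facebook likes for actor 2, used to indicate whether the actor is rated as good or bad.
    Sample Values: 936; 5000
imdb_score (Data Type: Interval)
    Description: IMDB rating.
    Sample Values: 7.9; 7.1
aspect_ratio (Data Type: Interval)
    Description: The aspect ratio of an image describes the proportional relationship between its width and its height.
    Sample Values: 1.78; 2.35
movie_facebook_likes (Data Type: Interval)
    Description: Number of Facebook likes for the movie, used to indicate whether the movie is rated as good or bad.
    Sample Values: 33000; 0
color (Data Type: Nominal)
    Description: Whether the movie is in color or black and white.
    Sample Values: Color; Black and White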
4.2. Variables selection
The following variables from the data set are excluded from model
creation:
actor_1_name
actor_2_name
actor_3_name
director_name
movie_imdb_link
plot_keywords
These variables are excluded from the dataset using the DROP node in SAS
Enterprise Miner.
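Outside SAS Enterprise Miner, the same exclusion can be sketched in a few lines of Python. This is only an illustration of the idea, not the DROP node itself: the function name `drop_variables` is hypothetical, and the sample record is assembled from the sample values listed in the variable description.

```python
# Variables excluded from model creation (taken from the list above).
EXCLUDED = {
    "actor_1_name", "actor_2_name", "actor_3_name",
    "director_name", "movie_imdb_link", "plot_keywords",
}

def drop_variables(rows, excluded=EXCLUDED):
    """Return copies of the rows without the excluded columns.

    Hypothetical helper; it only mimics what the DROP node does.
    """
    return [{k: v for k, v in row.items() if k not in excluded} for row in rows]

# Illustrative record built from sample values in the variable description.
sample = [{
    "movie_title": "Avatar",
    "imdb_score": 7.9,
    "actor_1_name": "CCH Pounder",
    "plot_keywords": "avatar|future|marine|native|paraplegic",
}]

kept = drop_variables(sample)
print(sorted(kept[0]))  # ['imdb_score', 'movie_title']
```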
4.3. Data Partitioning:
For model training, validation, and testing, the data was partitioned
into the following data sets:
Data Set               Composition %    No. of Observations
Training Data Set      60               3026
Validation Data Set    30               1513
Test Data Set          10               504
The data is partitioned using the DATA PARTITION node in SAS Enterprise
Miner.
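The observation counts in the table follow directly from applying the 60/30/10 proportions to 5043 rows. The sketch below reproduces them with a simple random split. It is an assumption that a plain shuffle is representative; the sampling method actually used by the DATA PARTITION node (for example, stratification) is not stated in this report, and the function name and seed are illustrative.

```python
import random

def partition(rows, train=0.60, validation=0.30, seed=42):
    """Shuffle rows and split them into train/validation/test sets.

    The test set receives whatever remains after the first two splits.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_valid = round(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_valid],
        shuffled[n_train + n_valid:],
    )

# Row indices stand in for the 5043 observations.
train_set, valid_set, test_set = partition(range(5043))
print(len(train_set), len(valid_set), len(test_set))  # 3026 1513 504
```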
4.4. Data Summary
The data summary is obtained using the STATEXPLORE node in SAS
Enterprise Miner.
Train Data Set Data Distribution
5. Data Pre-processing:
5.1. Data Cleaning:
Missing values in the data set are replaced using the REPLACEMENT
and IMPUTE nodes in SAS Enterprise Miner.
The table below describes how the missing values of each variable in the
data set were treated.
Variables                    Missing Value Count    Treated value
actor_1_facebook_likes       7                      Median value of the variable in Data set
actor_1_name                 0                      ..
actor_2_facebook_likes       13                     Median value of the variable in Data set
actor_2_name                 1                      UNKNOWN
actor_3_facebook_likes       23                     Median value of the variable in Data set
actor_3_name                 2                      UNKNOWN
aspect_ratio                 329                    1.77
budget                       492                    Median of the variable in Data set
cast_total_facebook_likes    0                      ..
color                        19                     Color
content_rating               303                    Median of the variable in Data set
country                      5                      UNKNOWN
director_facebook_likes      104                    Median of the variable in Data set
director_name                20                     UNKNOWN
duration                     15                     Median of the variable in Data set
facenumber_in_poster         13                     Median of the variable in Data set
genres                       0                      ..
gross                        884                    Median of the variable in Data set
imdb_score                   0                      ..
language                     12                     UNKNOWN
movie_facebook_likes         0                      ..
movie_imdb_link              0                      ..
movie_title                  0                      ..
num_critic_for_reviews       50                     Median of the variable in Data set
num_user_for_reviews         21                     Median of the variable in Data set
num_voted_users              0                      ..
plot_keywords                0                      ..
title_year                   108                    UNKNOWN
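The two treatments in the table, median imputation for numeric variables and an UNKNOWN placeholder for text variables, can be sketched as follows. This is a minimal illustration under our own assumptions, not the output of the REPLACEMENT or IMPUTE nodes: the helper names are hypothetical, missing values are represented as None or an empty string, and the data is a toy sample.

```python
from statistics import median

def impute_median(values):
    """Replace missing (None) entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

def replace_unknown(values, placeholder="UNKNOWN"):
    """Replace missing (None or empty) text entries with a placeholder."""
    return [placeholder if v in (None, "") else v for v in values]

# Toy data: 88 and 99 are sample durations from the variable description;
# 120 and the missing entries are made up for illustration.
durations = [88, None, 99, 120, None]
languages = ["English", None, ""]

print(impute_median(durations))    # [88, 99, 99, 120, 99]
print(replace_unknown(languages))  # ['English', 'UNKNOWN', 'UNKNOWN']
```

The median is used rather than the mean because variables such as budget and gross are heavily skewed, and the median is less sensitive to extreme values.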
SAS Enterprise Miner Replacement Node Output
SAS Enterprise Miner Impute Node Output
Data Summary after Replacement and Impute node
SAS Enterprise Miner Transform Node Settings
6. Un-Supervised Analysis:
6.1. Clustering:
Cluster analysis is performed on the data model using the CLUSTER
node in SAS Enterprise Miner, which takes its input data from the DROP
node in the model.
Cluster Node Settings:
1. Clustering Method = Ward
2. CCC cutoff = 3
CCC Plot
Cluster Scatter plot
Cluster Analysis Observations:
1. SAS Enterprise Miner created 20 clusters, with a CCC value of 389.55.
2. In the CCC plot, the CCC value increases continually as the number of
clusters increases. This may be due to the following reasons:
The data was grainy.
The data may have been excessively rounded.
The data was recorded with just a few digits.
3. The cluster scatter plot shows that the generated clusters are close
to each other.
4. According to the cluster analysis, the following variables are important:
cast_total_facebook_likes
facenumber_in_poster
actor_2_facebook_likes
actor_3_facebook_likes
num_voted_users
color
movie_facebook_likes
6.2. Decision Tree:
Decision tree analysis is performed on the data model using the Decision
Tree node in SAS Enterprise Miner, which takes its input data from the
Replacement node in the model.
Decision Tree Output:
Tree diagram
Variable Importance Table
Fit Statistics
Decision Tree Analysis Observations:
1. SAS Enterprise Miner created a decision tree with 35 leaves.
2. The average squared error of the decision tree was:
Train Data: 0.6928
Validation Data: 0.901
Test Data: 0.7237
3. According to the decision tree analysis, the following variables are important:
num_voted_users
duration
genres
budget
gross
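The average squared error reported above is the mean of the squared differences between the actual and predicted IMDB scores, computed separately on each partition. The sketch below shows the calculation; the function name is our own and the values are toy numbers, not results from the model.

```python
def average_squared_error(actual, predicted):
    """Mean of squared differences between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# Toy scores for illustration only.
actual = [8.0, 7.0, 6.0, 9.0]
predicted = [7.5, 7.0, 6.5, 8.5]

print(average_squared_error(actual, predicted))  # 0.1875
```

Computing this statistic on the train, validation, and test sets separately shows how well the tree generalizes. A validation error noticeably above the training error, as in the figures above, can be a sign of overfitting.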
7. Supervised Analysis:
7.1. Regression:
Regression analysis is performed on the data model using the Regression
node in SAS Enterprise Miner, which takes its input data from the
Transform Variables node in the model.
Since the target variable is an interval variable, SAS Enterprise Miner
performs linear regression.
Regression Node Settings:
1. Selection mode: Stepwise
2. Target variable: imdb_score (IMDB rating)
SAS Enterprise Miner Regression Node Settings
Fit Statistics
Residual
Regression Analysis Observations:
1. The R-squared value of the regression model was 0.5552. This means the
variables in the regression model explain approximately 55.5% of the
variation in the IMDB score.
2. The box plot of the residuals shows that 50% of the training data
lies close to the regression line, with residuals between -0.3448 and 0.4057.
3. In the Analysis of Variance output the p-value is < 0.0001, so we
reject the null hypothesis and conclude that the following variables are
required for the regression analysis:
LG10_IMP_actor_2_facebook_likes
LG10_IMP_budget
LG10_IMP_duration
LG10_IMP_num_critic_for_reviews
LG10_num_voted_users
REP_color
REP_content_rating
REP_country
REP_genres
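The two statistics quoted in the observations above can be reproduced from the actual and predicted scores. R-squared is one minus the ratio of the residual sum of squares to the total sum of squares, and the box of a residual box plot spans the first and third quartiles of the residuals. The sketch below uses toy values and hypothetical function names. The quartile method of Python's `statistics.quantiles` may also differ slightly from the one SAS uses.

```python
from statistics import mean, quantiles

def r_squared(actual, predicted):
    """R^2 = 1 - SSE / SST, the share of variation explained by the model."""
    y_bar = mean(actual)
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    sst = sum((a - y_bar) ** 2 for a in actual)
    return 1 - sse / sst

def residual_quartiles(actual, predicted):
    """First and third quartiles of the residuals (the box of a box plot)."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    q1, _, q3 = quantiles(residuals, n=4)
    return q1, q3

# Toy scores for illustration only.
actual = [6.0, 7.0, 8.0, 9.0]
predicted = [6.5, 7.0, 7.5, 9.0]

print(round(r_squared(actual, predicted), 4))  # 0.9
q1, q3 = residual_quartiles(actual, predicted)
print(q1, q3)
```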
7.2. Neural Network
We performed two Neural Network analyses: one in which the Neural
Network node receives its input data from the Variable Selection node in
SAS Enterprise Miner, and one without the Variable Selection node.
Variable Selection Output:
Fit Statistics
Fit Statistics
8. Model comparison:
Fit Statistics
9. SAS Enterprise data model Diagram:
10. Observations: