
MIS 6324

Business Analytics with SAS

Fall 2016 - Section: 501

Group 3
Project Report

Instructor: Professor Zhe (James) Zhang

Group Members:

1) Amit Shirsat
2) Bushra Saleem
3) Ram Prakash Reddy Kolli
4) Sagar Sanghavi
5) Sarath Chandra Mandava
6) Yeshwanth Reddy Gummitha

Table of Contents

1. Description
2. Objective
3. Data Source
4. Exploratory Data Analysis
4.1. Variable Description
4.2. Variables selection
4.3. Data Partitioning
4.4. Data Summary
5. Data Pre-processing
5.1. Data Cleaning
5.1.1. Replacement Node
5.1.2. Impute Node
5.2. Transforming Data
6. Un-Supervised Analysis
6.1. Clustering
6.2. Decision Tree
7. Supervised Analysis
7.1. Regression
7.2. Neural Network
8. Model comparison
9. SAS Enterprise data model Diagram
10. Observations

Option 1: Mining Real-Life Data


Data Set: IMDB 5000 Movie Dataset

Type of Dataset: Second-hand Data
Source: https://www.kaggle.com/deepmatrix/imdb-5000-movie-dataset

1. Description:
Every year, thousands of movies are released on different platforms.
To judge which movies are the best, many people rely on reviews by
critics, recommendations from friends and family, or their own
instincts. However, these sources are not always immediately available
or reliable. The dataset that we chose contains information from IMDB
(the Internet Movie Database) about 5043 movies through 28 variables,
spanning 100 years and 66 countries. IMDB stores data in a table which
contains information such as movie title, director, duration, year of
release, genre, IMDB rating and a few other interesting variables, thus
making it multivariate.

2. Objective:
Our project focuses on creating a reliable system that predicts the
IMDB ratings of upcoming movies by applying data mining concepts.

3. Data Source:
Source: https://www.kaggle.com/deepmatrix/imdb-5000-movie-dataset

Title                       Value
Observations in Data Set    5043
Number of Variables         28

4. Exploratory Data Analysis:
4.1. Variable Description
Each entry gives the variable title, its data type in parentheses, a description, and two sample values.

director_name (Nominal): The film director, who controls a film's artistic and dramatic aspects and visualizes the script while guiding the technical crew and actors in the fulfillment of that vision. Sample values: Hoyt Yeatman; Jonathan Liebesman
num_critic_for_reviews (Interval): Number of professional critics who reviewed the movie. Sample values: 723; 302
duration (Interval): Running time of the movie. Sample values: 88 min; 99 min
director_facebook_likes (Interval): Number of Facebook likes for the director, taken as a measure of how well the director is rated. Sample values: 0; 563
actor_3_facebook_likes (Interval): Number of Facebook likes for actor 3, taken as a measure of how well the actor is rated. Sample values: 855; 1000
actor_2_name (Nominal): Name of actor 2, a person whose profession is acting on the stage, in movies, or on television. Sample values: Joel David Moore; Orlando Bloom
actor_1_facebook_likes (Interval): Number of Facebook likes for actor 1, taken as a measure of how well the actor is rated. Sample values: 1000; 40000
gross (Interval): Gross collection, that is, the total income earned from theaters. Sample values: 760505847; 309404152
genres (Nominal): Categories of the movie, such as comedy, action, or romance. Sample values: Action|Adventure|Fantasy|Sci-Fi; Action|Adventure|Fantasy
actor_1_name (Nominal): Name of actor 1, a person whose profession is acting on the stage, in movies, or on television. Sample values: CCH Pounder; Johnny Depp
movie_title (Nominal): Name of the movie. Sample values: Avatar; Pirates of the Caribbean: At World's End
num_voted_users (Interval): Number of people who voted on (rated) the movie. Sample values: 886204; 471220
cast_total_facebook_likes (Interval): Total number of Facebook likes for all the actors in the cast, taken as a measure of how well the cast is rated. Sample values: 4834; 48350
actor_3_name (Nominal): Name of actor 3, a person whose profession is acting on the stage, in movies, or on television. Sample values: Wes Studi; Jack Davenport
facenumber_in_poster (Interval): Number of actors shown on the movie poster used for advertising. Sample values: 1; 4
plot_keywords (Nominal): Keywords that users assigned to describe the movie. Sample values: avatar|future|marine|native|paraplegic; goddess|marriage ceremony|marriage proposal|pirate|singapore
movie_imdb_link (Nominal): Web link to the movie on the Internet Movie Database. Sample value: http://www.imdb.com/title/tt0091415/
num_user_for_reviews (Interval): Number of users who reviewed the movie. Sample values: 3054; 1238
language (Nominal): Language of the movie. Sample values: English; English
country (Nominal): Country in which the movie was recorded. Sample values: US, UK
content_rating (Nominal): Age rating. Sample values: PG-13; PG-13
budget (Interval): Amount of money budgeted for the movie. Sample values: 237000000; 300000000
title_year (Nominal): Copyright year of the movie. Sample values: 2009; 2016
actor_2_facebook_likes (Interval): Number of Facebook likes for actor 2, taken as a measure of how well the actor is rated. Sample values: 936; 5000
imdb_score (Interval): IMDB rating. Sample values: 7.9; 7.1
aspect_ratio (Interval): The aspect ratio of an image describes the proportional relationship between its width and its height. Sample values: 1.78; 2.35
movie_facebook_likes (Interval): Number of Facebook likes for the movie, taken as a measure of how well the movie is rated. Sample values: 33000; 0
color (Nominal): Whether the movie is in black and white or in color. Sample values: Color; Black and White

Target Variable: imdb_score

SAS Enterprise Miner Column Metadata

4.2. Variables selection
The following variables from the data set are excluded from model
creation:
actor_1_name
actor_2_name
actor_3_name
director_name
movie_imdb_link
plot_keywords
These variables are excluded from the data set using the DROP node in
SAS Enterprise Miner.
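
As a rough illustration outside SAS Enterprise Miner, the effect of excluding these variables can be sketched in Python. The function name `drop_variables` and the one-row sample are hypothetical; the sample reuses values from the variable description table. This is a minimal sketch of the idea, not the DROP node's implementation.

```python
# Variables the report excludes from modelling (from the list above).
EXCLUDED = [
    "actor_1_name", "actor_2_name", "actor_3_name",
    "director_name", "movie_imdb_link", "plot_keywords",
]

def drop_variables(rows, excluded=EXCLUDED):
    """Return copies of the row dictionaries without the excluded keys."""
    excluded = set(excluded)
    return [{k: v for k, v in row.items() if k not in excluded} for row in rows]

# Hypothetical one-row sample built from values in the variable table.
sample = [{
    "movie_title": "Avatar",
    "actor_1_name": "CCH Pounder",
    "plot_keywords": "avatar|future|marine|native|paraplegic",
    "imdb_score": 7.9,
}]
kept = drop_variables(sample)
print(kept)  # only movie_title and imdb_score remain
```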

SAS Enterprise Miner DROP node setting

4.3. Data Partitioning:
For model training, validation and testing, the data in the data set
was partitioned into the following data sets:

Data Set               Composition %   No. of Observations
Training Data Set      60              3026
Validation Data Set    30              1513
Test Data Set          10              504

The data is partitioned using the DATA PARTITION node in SAS Enterprise
Miner.
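
The 60/30/10 split can be checked with a short Python sketch. It assumes a simple random shuffle of row indices; the DATA PARTITION node may use a different sampling method, and the seed value and the function name `partition` are arbitrary choices made for this illustration.

```python
import random

def partition(n_rows, train=0.6, valid=0.3, seed=12345):
    """Shuffle row indices and cut them into train/validation/test lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_train = round(n_rows * train)
    n_valid = round(n_rows * valid)
    # Whatever is left after train and validation goes to the test set.
    return (indices[:n_train],
            indices[n_train:n_train + n_valid],
            indices[n_train + n_valid:])

train_idx, valid_idx, test_idx = partition(5043)
print(len(train_idx), len(valid_idx), len(test_idx))  # 3026 1513 504
```

With 5043 observations the rounded sizes match the counts in the table above.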

SAS Enterprise Miner DATA PARTITION node setting

DATA PARTITION node Output

4.4. Data Summary
A data summary is obtained using the STATEXPLORE node in SAS
Enterprise Miner.

SAS Enterprise Miner Train Data set STATEXPLORE node Output

Train Data Set Data Distribution

5. Data Pre-processing:
5.1. Data Cleaning:
Missing values in the data set are replaced using the REPLACEMENT and
IMPUTE nodes in SAS Enterprise Miner.

The table below describes how missing values were treated for each
variable in the data set.

Variable                     Missing Value Count   Treated Value
actor_1_facebook_likes       7                     Median value of the variable in data set
actor_1_name                 0                     ..
actor_2_facebook_likes       13                    Median value of the variable in data set
actor_2_name                 1                     UNKNOWN
actor_3_facebook_likes       23                    Median value of the variable in data set
actor_3_name                 2                     UNKNOWN
aspect_ratio                 329                   1.77
budget                       492                   Median of the variable in data set
cast_total_facebook_likes    0                     ..
color                        19                    Color
content_rating               303                   Median of the variable in data set
country                      5                     UNKNOWN
director_facebook_likes      104                   Median of the variable in data set
director_name                20                    UNKNOWN
duration                     15                    Median of the variable in data set
facenumber_in_poster         13                    Median of the variable in data set
genres                       0                     ..
gross                        884                   Median of the variable in data set
imdb_score                   0                     ..
language                     12                    UNKNOWN
movie_facebook_likes         0                     ..
movie_imdb_link              0                     ..
movie_title                  0                     ..
num_critic_for_reviews       50                    Median of the variable in data set
num_user_for_reviews         21                    Median of the variable in data set
num_voted_users              0                     ..
plot_keywords                0                     ..
title_year                   108                   UNKNOWN

5.1.1. Replacement Node:


The Replacement node is used to replace missing class variable values
in the data set.
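
The replacement of missing class values can be sketched in Python as follows. The fill value "UNKNOWN" comes from the table above; the function name `replace_missing_class` and the two toy rows are hypothetical, and treating blank strings and `None` as missing is an assumption of this sketch rather than a documented behaviour of the Replacement node.

```python
def replace_missing_class(rows, variables, fill="UNKNOWN"):
    """Return copies of rows where blank or None class values become `fill`."""
    cleaned = []
    for row in rows:
        new_row = dict(row)
        for name in variables:
            value = new_row.get(name)
            # Assumption: None and empty/whitespace strings count as missing.
            if value is None or str(value).strip() == "":
                new_row[name] = fill
        cleaned.append(new_row)
    return cleaned

# Toy rows for illustration only.
rows = [
    {"language": "English", "country": ""},
    {"language": None, "country": "UK"},
]
cleaned = replace_missing_class(rows, ["language", "country"])
print(cleaned)
```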

SAS Enterprise Miner Replacement Node Settings

SAS Enterprise Miner Replacement Node Output

5.1.2. Impute Node:


The Impute node is used to replace missing interval variable values in
the data set.
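
Median imputation of an interval variable can be illustrated with a small Python sketch. The function name `impute_median` and the toy duration values are hypothetical, and missing values are represented as `None` for the purpose of the example.

```python
from statistics import median

def impute_median(values):
    """Replace None entries with the median of the observed entries."""
    observed = [v for v in values if v is not None]
    # statistics.median raises StatisticsError if nothing was observed.
    fill = median(observed)
    return [fill if v is None else v for v in values]

# Toy durations in minutes; None marks a missing value.
durations = [88, None, 99, 120]
imputed = impute_median(durations)
print(imputed)  # [88, 99, 99, 120]
```

The median is preferred here over the mean because variables such as budget and gross are heavily skewed, so a few very large values would pull the mean upward.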

SAS Enterprise Miner Impute Node Settings

SAS Enterprise Miner Impute Node Output

Data Summary after Replacement and Impute node

5.2. Transforming Data:


Skewness in the variable distributions is reduced by applying the Log 10
function using the Transform Variables node in SAS Enterprise Miner.
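
A minimal Python sketch of a base-10 log transform is shown below. The offset of 1 is an assumption added so that zero values (for example, a Facebook like count of 0) remain defined; the report does not state how the Transform Variables node handles zeros. The function name `log10_transform` is hypothetical, and the three input values are sample like counts from the variable description table.

```python
import math

def log10_transform(values, offset=1.0):
    """Return log10(value + offset) for every value in the list."""
    transformed = []
    for value in values:
        shifted = value + offset
        if shifted <= 0:
            raise ValueError(f"cannot take log10 of {value} with offset {offset}")
        transformed.append(math.log10(shifted))
    return transformed

# Sample like counts from the variable table: a range of 0 to 40000.
likes = [0, 563, 40000]
logged = log10_transform(likes)
print(logged)  # the transformed values span roughly 0 to 4.6
```

The transform compresses large values far more than small ones, which pulls in the long right tail of count-like variables.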

SAS Enterprise Miner Transform Node Settings

SAS Enterprise Miner Transform Node Output

6. Un-Supervised Analysis:
6.1. Clustering:
Cluster analysis is performed on the data model using the CLUSTER node
in SAS Enterprise Miner. It gets its input data from the DROP node in the
model.

Cluster Node Settings:
1. Clustering Method = Ward
2. CCC cutoff = 3
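
To make the Ward setting concrete, the sketch below implements a naive agglomerative clustering that repeatedly merges the pair of clusters whose merge increases the within-cluster sum of squares the least, which is Ward's criterion. It is an illustration only: the function name `ward_clusters` and the toy points are hypothetical, the cubic-time search is not suited to the full data set, and the CCC-based choice of the number of clusters made by the Cluster node is not reproduced (the number of clusters is passed in directly).

```python
def ward_clusters(points, n_clusters):
    """Agglomerate points into n_clusters groups using Ward's criterion."""
    # Each cluster is stored as (member indices, centroid coordinates).
    clusters = [([i], list(p)) for i, p in enumerate(points)]

    def merge_cost(a, b):
        # Increase in within-cluster sum of squares if a and b are merged.
        (ma, ca), (mb, cb) = a, b
        dist2 = sum((x - y) ** 2 for x, y in zip(ca, cb))
        return len(ma) * len(mb) / (len(ma) + len(mb)) * dist2

    while len(clusters) > n_clusters:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda ij: merge_cost(clusters[ij[0]], clusters[ij[1]]))
        (mi, ci), (mj, cj) = clusters[i], clusters[j]
        n_i, n_j = len(mi), len(mj)
        # Size-weighted average of the two centroids.
        centroid = [(n_i * x + n_j * y) / (n_i + n_j) for x, y in zip(ci, cj)]
        merged = (mi + mj, centroid)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return [sorted(members) for members, _ in clusters]

# Toy 2-D points forming two obvious groups.
toy_points = [(0, 0), (0, 1), (10, 10), (10, 11)]
groups = ward_clusters(toy_points, 2)
print(sorted(groups))  # [[0, 1], [2, 3]]
```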

SAS Enterprise Miner Cluster Node Settings

Cluster Node Output:

Cluster Pie Diagram

CCC Plot

Cluster Hierarchy Structure

Cluster Scatter plot

Variable Importance Table

Cluster Analysis Observations:
1. SAS Enterprise Miner created 20 clusters, with a CCC value of 389.55.
2. In the CCC plot, the CCC value increases continually as the number of
clusters increases. This may be due to the following reasons:
The data was grainy
The data may have been excessively rounded
The data was recorded with just a few digits
3. The cluster scatter plot shows that the generated clusters are close
to each other.
4. According to the cluster analysis, the following variables are important:
cast_total_facebook_likes
facenumber_in_poster
actor_2_facebook_likes
actor_3_facebook_likes
num_voted_users
color
movie_facebook_likes

6.2. Decision Tree:
Decision Tree analysis is performed on the data model using the Decision
Tree node in SAS Enterprise Miner. It gets its input data from the
Replacement node in the model.

Decision Tree Node Settings:


1. Assessment Measure = Average Square Error
2. Target Variable = imdb_score
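
The assessment measure named above, average square error, is the mean of the squared differences between actual and predicted target values. A minimal Python sketch follows; the function name `average_square_error` and the toy scores are hypothetical and are not values from the data set.

```python
def average_square_error(actual, predicted):
    """Mean of squared differences between actual and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# Toy scores for illustration only; not values from the data set.
toy_actual = [8.0, 7.0, 6.0]
toy_predicted = [7.5, 7.0, 7.0]
toy_ase = average_square_error(toy_actual, toy_predicted)
print(toy_ase)  # (0.25 + 0 + 1) / 3
```

A lower value means the predictions sit closer to the actual ratings, so the subtree with the smallest validation error is the one a tree would typically be pruned to.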

SAS Enterprise Miner Decision Tree Node Settings

Decision Tree Output:

Tree diagram

Subtree Assessment Plot: Average Square Error

Variable Importance Table

Fit Statistics

Decision Tree Analysis Observations:
1. SAS Enterprise Miner created a decision tree with 35 leaves.
2. Decision tree average square error:
Train Data: 0.6928
Validation Data: 0.901
Test Data: 0.7237
3. According to the decision tree analysis, the following variables are important:
num_voted_users
duration
genres
budget
gross

7. Supervised Analysis:
7.1. Regression:
Regression analysis is performed on the data model using the Regression
node in SAS Enterprise Miner. It gets its input data from the Transform
Variables node in the model.
Since the target variable is an interval variable, SAS Enterprise Miner
performs linear regression.
Regression Node Settings:
1. Selection mode: Stepwise
2. Target variable: imdb_score
SAS Enterprise Miner Regression Node Settings

Regression Node Output:

Fit Statistics

Residual

Residual Box plot

Regression Analysis Observations:
1. The R-square value of the regression model was 0.5552, meaning that the
variables in the regression model explain about 55.5% of the variation
in imdb_score.
2. The box plot of the residuals shows that 50% of the training data have
residuals between -0.3448 and 0.4057, that is, they lie close to the
regression line.
3. In the Analysis of Variance output the p-value is < 0.0001, so we can
reject the null hypothesis and conclude that the following variables are
required for the regression analysis:
LG10_IMP_actor_2_facebook_likes
LG10_IMP_budget
LG10_IMP_duration
LG10_IMP_num_critic_for_reviews
LG10_num_voted_users
REP_color
REP_content_rating
REP_country
REP_genres

7.2. Neural Network
We performed two Neural Network analyses: one in which the Neural Network
node receives its input data through the Variable Selection node in SAS
Enterprise Miner, and one in which it does not.

Neural Network Node Settings:


1. Model Selection Criteria: Average Error
2. Remaining are default settings

SAS Enterprise Miner Neural Network Node Settings

7.2.1. Neural Network without variable selection node

The Variable Selection node removes the variables with the lowest
R-square values from the input data to the Neural Network node.

Variable Selection Output:

Neural Network Iteration Plot: Average Square Error

Fit Statistics

7.2.2. Neural Network with variable selection node

Neural Network Iteration Plot: Average Square Error

Fit Statistics

Observations from Neural Networks:

1. Average square error on the training data:
Neural Network without variable selection = 0.59558
Neural Network with variable selection = 0.744519
2. Number of training iterations:
Neural Network without variable selection = 48
Neural Network with variable selection = 0

8. Model comparison:

The Model Comparison node in SAS Enterprise Miner is used to compare all
the analysis methods.
Input:
Output from Decision Tree node
Output from Regression node
Output from Neural Network without variable selection
Output from Neural Network with variable selection
Selection Criteria = Average Square Error

Fit Statistics

Model Comparison observations:


The Neural Network without variable selection is the best model,
with an average square error of 0.779947.
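
Selecting the model with the smallest average square error can be sketched in Python. The values below are the training-data errors quoted earlier in this report for the three models that state one; the regression model is left out because its average square error is not given in the text. The report does not say which partition the Model Comparison node assessed, so this sketch does not reproduce the 0.779947 figure. The function name `best_model` is hypothetical.

```python
# Training-data average square errors quoted earlier in this report.
# The regression model is omitted because the report does not state its ASE.
train_ase = {
    "Decision Tree": 0.6928,
    "Neural Network without variable selection": 0.59558,
    "Neural Network with variable selection": 0.744519,
}

def best_model(ase_by_model):
    """Return the (name, ase) pair with the smallest average square error."""
    name = min(ase_by_model, key=ase_by_model.get)
    return name, ase_by_model[name]

best_name, best_ase = best_model(train_ase)
print(best_name, best_ase)  # Neural Network without variable selection 0.59558
```

On these training figures the ranking agrees with the conclusion above, although a comparison on validation or test data is the more reliable guide to how a model will perform on new movies.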

9. SAS Enterprise data model Diagram:

10. Observations:

1. The Neural Network without variable selection is the best data analysis
method for predicting the IMDB rating.
2. The variables that play a key role in IMDB rating prediction are:
Number of Facebook likes for actor 2
Movie budget
Duration of the movie
Number of critics who reviewed the movie
Number of users who voted for the movie
Whether the movie was in color or black and white
Movie content rating
Country where the movie was made
Genre of the movie
