
Business Analytics Foundation with SAS tools and Excel

Lesson 4 Predictive Modeling Techniques

Copyright 2014, Simplilearn, All rights reserved.



Objectives
After completing this lesson, you will be able to:

Understand Regression Analysis

Know the types of Regression Models

Understand Linear Regression

Implement Linear Regression in SAS and Excel

Understand Logistic Regression

Differentiate between Linear and Logistic Regression

Implement Logistic Regression in SAS

Know the basics of Cluster Analysis

Know the types of Cluster Analysis and Clusters

Implement Cluster Analysis in SAS

Understand Time Series and its components

Analyze Time Series in Excel



Regression Analysis
Regression analysis focuses on finding a relationship between a dependent variable and
one or more independent variables.

It predicts the value of a dependent variable from the value of at least one independent variable.
It explains the impact of changes in an independent variable on the dependent variable.
$Y = f(X, \beta)$
where $Y$ is the dependent variable
$X$ is the independent variable
$\beta$ is the unknown coefficient
It is widely used in prediction and forecasting.


Types of Regression Models


[Diagram: regression models are classified as univariate or multivariate, and each class as linear or non-linear. The diagram also distinguishes simple from multiple regression.]

Linear Regression
It is a common technique for determining how one variable of interest is affected by another.
It is used for three main purposes:
To describe the linear dependence of one variable on the other.
To predict values of one variable from values of the other, for which more data are available.
To correct for the linear dependence of one variable on the other.
A line is fitted through the group of plotted data.
$Y = \alpha + \beta X + \varepsilon$
$\alpha$ = intercept coefficient
$\beta$ = slope coefficient
$\varepsilon$ = residuals
The residual is the discrepancy between the actual and the predicted value.
The vertical distance of a plotted point from the line gives its residual.

The procedure for finding the best fit is called the least-squares method.
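
The least-squares fit described above can be sketched in a few lines of Python. This is a minimal illustration for a single explanatory variable, not the procedure SAS or Excel runs internally. The function names and the sample data are made up for the example.

```python
def fit_line(xs, ys):
    """Return (alpha, beta) that minimise the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    beta = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    alpha = mean_y - beta * mean_x
    return alpha, beta


def residuals(xs, ys, alpha, beta):
    """Actual value minus predicted value for every observation."""
    return [y - (alpha + beta * x) for x, y in zip(xs, ys)]


# Hypothetical sample data, roughly following y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

alpha, beta = fit_line(xs, ys)
res = residuals(xs, ys, alpha, beta)
print(f"intercept={alpha:.2f} slope={beta:.2f}")
```

With an intercept in the model, the residuals of a least-squares fit sum to zero (up to rounding), which is a quick sanity check on any implementation.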

Linear regression (contd.)


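[Figure: scatter plot of $Y$ against $X$ with the fitted line $Y = \alpha + \beta X + \varepsilon$, marking the observed value of $y$ for $x_i$, the predicted value of $y$ for $x_i$, the random error $\varepsilon_i$ for this $x$ value, the slope $\beta$, and the intercept $\alpha$.]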


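Coefficient of determination $R^2$:
A measure of goodness of fit: how well does your model fit the data?

For a least-squares fit with an intercept, $R^2$ ranges from 0 to 1.

$R^2 = 0$: the model explains none of the variance (no linear relationship)

$R^2 = 1$: the model explains all of the variance (perfect linear relationship)

The direction of the relationship is given by the correlation coefficient $r$: $r = -1$ is a perfect negative and $r = +1$ a perfect positive linear relationship. In simple linear regression, $R^2 = r^2$.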


How good is the model?


The $R^2$ value tells us how well the model explains the data: it is the proportion of the
variation between observations that the model accounts for.
The variation between observations that is not explained by the model is the error term or
residual.
Suppose we have a case in which the $R^2$ value is 0.74. This means that 74% of the variance in the values of
the dependent variable is explained by the model, and the remaining 26%, which is not explained, is
its residual or error term.
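
As a rough sketch of how this proportion is computed, the snippet below calculates $R^2$ as one minus the ratio of unexplained to total variation. The function name and the data are illustrative assumptions. The two prediction lists are chosen to show the extreme cases.

```python
def r_squared(ys, predicted):
    """Share of the variance in ys explained by the predictions."""
    mean_y = sum(ys) / len(ys)
    # Unexplained variation: squared residuals.
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predicted))
    # Total variation around the mean.
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


ys = [3.0, 5.0, 7.0, 9.0]

# Predictions that match every observation: everything is explained.
r2_perfect = r_squared(ys, [3.0, 5.0, 7.0, 9.0])

# Predicting the mean for every observation: nothing is explained.
r2_mean = r_squared(ys, [6.0, 6.0, 6.0, 6.0])

print(r2_perfect, r2_mean)
```

A value such as 0.74 sits between these extremes: the residual sum of squares is 26% of the total sum of squares.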


Linear Regression in SAS Studio


The steps involved are:
Open SAS Studio.
Load the data into SAS from disk.
Click on Tasks.
Click on Statistics and select the Linear Regression option.
Select the data set on which to perform linear regression.
Select a dependent variable and one or more explanatory variables.
Click on the Methods tab, set the confidence level, and check Include intercept.
Click on Options and check the statistics plots if required.
Click Run.
The regression output is displayed in the Results window.


Linear Regression in Excel


Open the data file in Excel.
Click on the Office button and select Excel Options.
Select Add-Ins, then select Analysis ToolPak and click Go.
Check Analysis ToolPak and click OK.
Click on Data in the menu bar; the Data Analysis tool appears at the far right.
Click on Data Analysis.
Click on Regression.
Select the Input Y Range and the Input X Range.
Check the Labels option if the first row contains column names.
Check Residuals and Normal Probability Plots if required.
Click OK; the results appear in a new worksheet ply.


Logistic Regression
It is a statistical method for analyzing a dataset in which one or more independent variables
determine an outcome.
The dependent variable is binary (True or False).

The goal is to find the best-fitting model to describe the relationship between the dichotomous characteristic and
a set of independent variables.
Logistic regression generates the coefficients of a formula to predict a logit transformation of the
probability of presence of the characteristic of interest:
$\operatorname{logit}(p) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \dots + \beta_n x_n$
where $p$ is the probability of presence of the characteristic of interest.

The logit transformation is defined as the logged odds:

$\text{odds} = \dfrac{p}{1 - p}$
$\operatorname{logit}(p) = \ln\left(\dfrac{p}{1 - p}\right)$
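
The odds and logit definitions above translate directly into code. The sketch below also includes the inverse of the logit, which turns a linear score back into a probability. The function names and the example coefficients are illustrative assumptions and are not fitted to any data.

```python
import math


def odds(p):
    """Odds of an event that has probability p."""
    return p / (1.0 - p)


def logit(p):
    """Natural log of the odds."""
    return math.log(odds(p))


def inverse_logit(z):
    """Map a linear score z back to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(intercept, coefficients, xs):
    """Probability implied by logit(p) = b0 + b1*x1 + ... + bn*xn."""
    z = intercept + sum(b * x for b, x in zip(coefficients, xs))
    return inverse_logit(z)


print(odds(0.8))                   # a probability of 0.8 means odds of about 4 to 1
print(logit(0.5))                  # even odds give a logit of 0
print(predict_probability(-1.0, [0.5, 0.25], [2.0, 4.0]))  # hypothetical coefficients
```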

Method to develop a logistic model


[Diagram: the stages that lead from Data to a Logistic Regression Model]
Observation-performance windows
Data preparation, data treatment, data hygiene
Derived variables identification
Fine and coarse classing
Logistic modeling and diagnostics



Linear Regression vs Logistic Regression


Linear regression is mainly used to establish a relationship between a dependent and an independent
variable. It helps in estimating the impact of an independent variable on a dependent variable.
Example: using linear regression, the relationship between temperature (T) and ice cream sales
(I) is found to be
I = 2T + 4000
This equation says that for every 1-degree rise in temperature, predicted sales increase by 2 ice
creams, on top of a baseline of 4000 at T = 0.
Logistic regression helps in finding the probability of an event, and this event is captured in
binary format, i.e. 0 or 1.
Example: to know whether customers will buy a product or not, run a logistic regression
on the data. The dependent variable would be a binary variable.
In terms of graphical representation, linear regression gives a straight line as output once the
values are plotted on a graph, whereas logistic regression gives an S-shaped curve.
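
A small sketch can make the contrast concrete. For the ice cream equation I = 2T + 4000, a one-degree change in temperature moves the prediction by the slope, while a logistic model squeezes any score into a probability between 0 and 1. The function names and the scores passed to the logistic function are assumptions made for illustration.

```python
import math


def ice_cream_sales(temperature):
    """Linear model from the example: I = 2T + 4000."""
    return 2 * temperature + 4000


def purchase_probability(score):
    """Logistic (S-shaped) curve: probability for a given linear score."""
    return 1.0 / (1.0 + math.exp(-score))


# Effect of a one-degree rise: the prediction changes by the slope.
increase = ice_cream_sales(31) - ice_cream_sales(30)
print("extra ice creams per degree:", increase)

# The logistic output rises monotonically but never leaves (0, 1).
probs = [purchase_probability(s) for s in (-6, -2, 0, 2, 6)]
print([round(p, 3) for p in probs])
```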

Cluster Analysis
Cluster analysis is the process of forming groups of related
objects for the purpose of drawing important conclusions
based on the similarities within each group.
The greater the similarity within a group and the greater the
difference between groups, the more distinct the
clustering.

Often there are no assumptions about the underlying
distribution of the data.
The reason for taking such an approach is that the objects
in a group are similar to one another and different from
the objects in other groups, which makes patterns easier to
find.

Types of Cluster analysis


Hierarchical clustering: also known as nested clustering, as it allows clusters to exist within bigger
clusters to form a tree. It can be either agglomerative or divisive.
Partitional clustering: a division of the set of data objects into non-overlapping clusters such that
each object is in exactly one subset.
Overlapping clustering: used to reflect the fact that an object can simultaneously belong to more
than one group.
Exclusive clustering: assigns each object to a single cluster.
Complete clustering: assigns every object to a cluster.


Types of Clusters
Well separated: the distance between any two points in different groups is greater than the
distance between any two points within a group. The clusters need not be globular.
Prototype based: the prototype of a cluster is often a centroid for data with continuous
attributes. Such clusters tend to be globular.
Graph based: used when data is represented as a graph in which nodes are the objects and links
represent connections among the objects. Such clusters tend to be globular.
Density based: employed when the clusters are irregular and when noise and
outliers are present.

Shared property: also known as conceptual clustering, it is the process of identifying a pattern in
the clusters to successfully segregate them into groups of clusters.

Methods to form clusters.


K-means: a prototype-based clustering technique that attempts to find a user-specified number of
clusters (K), which are represented by their centroids.

Agglomerative hierarchical clustering: a collection of closely related clustering
techniques that produce a hierarchical clustering by starting with each point as a singleton cluster
and repeatedly merging the two closest clusters until a single, all-encompassing cluster remains.

DBSCAN: a density-based clustering algorithm that produces a partitional clustering, in which
the number of clusters is automatically determined by the algorithm.
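
To illustrate the first of these methods, here is a minimal K-means sketch in plain Python. It assumes the caller supplies the initial centroids (real implementations usually pick them randomly or with a seeding heuristic) and it uses squared Euclidean distance. The function names and the sample points are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Centroid (coordinate-wise mean) of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, initial_centroids, max_iter=100):
    """Alternate assignment and update steps until the centroids stop moving."""
    centroids = list(initial_centroids)
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        new_centroids = [mean_point(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


# Two visibly separate groups of hypothetical 2-D points, with K = 2.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
          (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = k_means(points, initial_centroids=[(1.0, 1.0), (8.0, 8.0)])
print(centroids)
```

The number of clusters is fixed by the number of initial centroids, which is the "user-specified K" in the description of K-means.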

Time Series
Time series data is an ordered sequence of observations of a quantitative variable measured at
equally spaced time intervals.

Time series are used in statistics, signal processing, pattern recognition, econometrics, mathematical
finance, weather forecasting, earthquake prediction, electroencephalography, control engineering,
astronomy, communications engineering, and other fields.
Time series analysis is used in:
Analyzing time series data
Forecasting the future value of the variable under consideration
In time series analysis it is assumed that the data consist of a set of identifiable components and
random errors, which usually make the pattern difficult to identify.

E.g. sales of quilts and blankets in a store over a period of five years.

Components of Time Series


Long-term trend: the smooth long-term direction of a time series, in which the data increase or
decrease in some pattern.
Seasonal variation: patterns of change in a time series within a year, which tend to repeat every
year.
Cyclical variation: much like seasonal variation, but the rise and fall of the time series occurs over periods
longer than one year.
Irregular variation: any variation that is not explained by the three components above.
It can be classified into stationary and non-stationary variation.
When the data neither increase nor decrease, i.e. the variation is completely random, it is called stationary
variation.
When the data have some explainable portion remaining that can be analyzed further, it
is called non-stationary variation.


Decomposition of Time Series

[Figure: decomposition of a time series into four panels (observed, trend, seasonal, and random), each plotted against time.]

Moving Average
A moving average is a widely used indicator in technical analysis that helps smooth out price action
by filtering out the noise, i.e. the residuals from random fluctuations.
A moving average is also called a trend follower or lagging indicator because it always
depends on historical data.
Commonly used moving averages are:
Simple moving average (SMA)
Exponential moving average (EMA)
A simple moving average is calculated by adding the values for a number of time periods and then
dividing this total by the same number of time periods.
An exponential moving average gives a higher weighting to recent prices, whereas a simple moving
average assigns equal weighting to all values.
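
The difference between the two averages can be sketched as follows. The simple moving average follows the definition above directly. The exponential version uses one common recursive form, with a smoothing constant `alpha` between 0 and 1 and the first observation as the starting value; both choices are assumptions of this example, and other conventions exist. The price series is made up.

```python
def simple_moving_average(values, window):
    """Mean of each run of `window` consecutive values (equal weights)."""
    return [sum(values[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(values))]


def exponential_moving_average(values, alpha):
    """Recursive average: weight alpha on the newest value, the rest on history."""
    ema = [values[0]]  # seed with the first observation (an assumption)
    for v in values[1:]:
        ema.append(alpha * v + (1 - alpha) * ema[-1])
    return ema


prices = [10.0, 12.0, 14.0, 16.0, 18.0]  # hypothetical price series

sma3 = simple_moving_average(prices, 3)
ema = exponential_moving_average(prices, 0.5)
print(sma3)
print(ema)
```

Because each new EMA value keeps only a fraction `1 - alpha` of the previous one, the weight on an older observation shrinks geometrically with its age.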

Goals of Time Series Analysis


Descriptive
Identify patterns in correlated data, which helps in finding the trend and seasonal
variation.
Explanation
Understand and model the data.
Forecasting
Predict short-term trends from previously existing patterns.
Intervention analysis
How does a single event change the time series?
Quality control
Deviations of a specified size indicate a problem.

Steps for Moving Average in Excel


Open the file on which to compute the moving average.
Make sure that the Analysis ToolPak add-in is installed in Excel.
If not, install it by selecting Add-Ins from the Office button and selecting Manage Add-Ins.
Click on Data Analysis.
Select Moving Average and click OK.
Select the input range by clicking and dragging over the data.
Check the Labels in First Row option if the data has column names in its first row.
Specify the interval value as required.
Check Chart Output and click OK.
By default, the results appear in a new worksheet ply.

Steps for Exponential Smoothing in Excel


Open the file on which to perform exponential smoothing.
Make sure that the Analysis ToolPak add-in is installed in Excel.
If not, install it by selecting Add-Ins from the Office button and selecting Manage Add-Ins.
Click on Data Analysis.
Select Exponential Smoothing and click OK.
Select the input range by clicking and dragging over the data.
Specify the damping factor as required.
Check the Labels option if the data has column names in its first row.
Check Chart Output and click OK.
By default, the results appear in a new worksheet ply.

Summary
Here is a quick recap of what we have learned in this lesson:

Regression Analysis

Regression Models

Basics of Linear Regression

Linear Regression in SAS and Excel

Basics of Logistic Regression

Differences between Linear and Logistic Regression

Logistic Regression in SAS

Cluster analysis and its types

Cluster analysis in SAS

Time series and its components

Time Series Analysis in Excel


Quiz



QUIZ 1
Regression analysis is used for which of the following?
a. Prediction
b. Collection
c. Validation
d. Tabulation
Answer: a.
Explanation: Regression analysis is used for prediction.


QUIZ 2
Simple linear regression is not used for which of the following purposes?
a. Describing the linear dependence of one variable on the other
b. Predicting values of one variable from values of the other, for which more data are available
c. Finding the distance between two variables
d. Correcting for the linear dependence of one variable on the other
Answer: c.
Explanation: Simple linear regression does not determine the distance between two variables.


QUIZ 3
What is a residual value?
a. It is the left-out value.
b. It is the discrepancy between the actual and the predicted value.
c. It is the residing value.
d. It is the redundant value.
Answer: b.
Explanation: The residual value is the discrepancy between the actual and the predicted value.


QUIZ 4
Which procedure is used to find the best fit in linear regression?
a. Mean square method
b. Text square method
c. External square method
d. Least-squares method
Answer: d.
Explanation: The procedure for finding the best fit in linear regression is the least-squares method.


QUIZ 5
Which of the following is not a method for clustering?
a. K-means
b. DBSCAN
c. Agglomerative hierarchical clustering
d. Collective clustering
Answer: d.
Explanation: Collective clustering is not a method for clustering.


QUIZ 6
Which one of the following is a type of cluster?
a. Hierarchical
b. Fuzzy
c. Complete
d. Graph
Answer: d.
Explanation: Graph based is a type of cluster.


QUIZ 7
What are predictors?
a. They are the future values.
b. They tell what is about to happen.
c. They tell about what is upcoming.
d. They are variables assumed to be the cause of the response variable.
Answer: d.
Explanation: The variables that are assumed to be the cause are called predictors, and the
variables that are assumed to be the effect are called response or target variables.


QUIZ 8
The error term in a regression model is denoted by which symbol?
a. θ (theta)
b. β (beta)
c. α (alpha)
d. ε (epsilon)
Answer: d.
Explanation: The error term is represented by epsilon (ε).


QUIZ 9
What happens in exclusive clustering?
a. Each object is assigned to many clusters.
b. Each object is assigned to a single cluster.
c. Many objects are assigned to a single cluster.
d. Many objects are assigned to many clusters.
Answer: b.
Explanation: Exclusive clustering assigns each object to a single cluster.


QUIZ 10
Which moving average assigns equal weights to all values?
a. Simple moving average
b. Exponential moving average
c. Quadratic moving average
d. Modified moving average
Answer: a.
Explanation: The simple moving average assigns equal weights to all values when smoothing.


QUIZ 11
Which moving average assigns more weight to recent values?
a. Simple moving average
b. Exponential moving average
c. Quadratic moving average
d. Modified moving average
Answer: b.
Explanation: The exponential moving average assigns more weight to recent values, and the
weight on older values decreases exponentially.


QUIZ 12
Which regression should be used when the dependent variable is binary?
a. Linear regression
b. Clustered regression
c. Logistic regression
d. Multiple linear regression
Answer: c.
Explanation: In logistic regression the dependent variable is binary, and the independent
variables may be continuous or dichotomous.


QUIZ 13
Which of the following statements is used to display graphical output in SAS?
a. ODS graphics
b. ODS plot
c. MSN graphics
d. ODS diagram
Answer: a.
Explanation: ODS graphics helps in displaying graphical output.


QUIZ 14
Which of the following is not true of the simple moving average?
a. It smooths the time series.
b. It gives equal weight to the window of previous data.
c. It gives exponential weights to the previous data.
d. Historic values outside the window are not taken into account.
Answer: c.
Explanation: The simple moving average gives equal weight to the window of previous data, not
exponential weights.


QUIZ 15
Which of the following time series forecasting methods can be done in Excel?
a. Simple moving average
b. Holt-Winters
c. ARIMA
d. Holt's method
Answer: a.
Explanation: Simple moving average forecasting can be done in Excel.

Thank You
