
FROM

MARKETING ANALYSIS
TO
MARKETING ANALYTICS
(SCOPE AND OVERVIEW)
Dr. Manoj Kumar Dash
M.A.; M.Phil.; M.B.A.; NET; Ph.D.
ABV-Indian Institute of Information Technology and Management, Gwalior
(An Autonomous Institute of the Government of India)
Acknowledgements

This presentation is based on discussions with:

• Prof. S. G. Deshmukh (Director, IIITM Gwalior)
• Prof. N. K. Sharma (IIT Kanpur)
• Prof. B. K. Mohanty (IIM Lucknow)
• Prof. Satya Bhusan Dash (IIM Lucknow)
• Prof. B. K. Panda (BU Orissa)
My Presence in the Research and Academic Fraternity
Teaching
Total experience: 16 years
Teaching feedback: IIM Indore 9.57 (2016), 9.68 (2017) and 9.43 (2018); IIITM Gwalior average 4.36/5 (2010-2018)
Present: Indian Institute of Information Technology and Management, Gwalior (from 2010 onwards); also Visiting/Adjunct Faculty
I. Teaching Interests
Specialization: Behavioural Economics; Econometrics; Micro Economics; Marketing Analytics; Marketing Research; Money and Banking; Research Methodology; Consumer Behaviour; Data Analytics; Retail Analytics; Marketing Modelling; Multivariate and Multi-Criteria Techniques
II. Research Interests
A. Applied Management Science: Behavioural Economics; Sustainable Sectoral Development; Circular Economy and Productivity; Application of Fuzzy Approaches in Consumer Decision-Making Modelling; Multivariate and Multi-Criteria Analysis
B. Interdisciplinary Research: Entrepreneurship
Personal Information
Name: Dr. Manoj Kumar Dash
Teaching Interests: Marketing Science; Big Data Analytics; Marketing Analytics; Retail Analytics; Consumer Decision-Making Modelling; Multi-Criteria Decision Making (MCDM) Optimization Techniques in Marketing; Econometric Modelling in Marketing; Behavioural Economics Experiments
Research Interests: Applied Marketing Science; Consumer Decision-Making Modelling; Digital Marketing
Agenda for Discussion

• Marketing Analytics: Tools and Techniques (30 min)
• Marketing Analytics: Scope of Application (10 min)
• Data Analytics (10 min)
• Big Data (10 min)
• Question and Answer (5 min)
Think and Analyze…….
Some issues……

• A retail outlet wants to know the behavioural pattern of consumers' purchases across two product categories: national brands and local brands.
• Measuring customer satisfaction in a retail mall.
• A retailer is interested in product intimacy and in cross-selling and up-selling strategy.
• How to control the churn rate in a retail mall?
• How to increase customer lifetime value?
• How to prioritize retail strategy to reach the optimum level of profit and sales?
• How to address complex problems and analyze cause and effect in complex situations?
• How to measure competitor strategy and rank the different retail units?
• How to measure customer feedback and develop our strategy?
• How to detect fraud in a mall?
• How to analyze consumers' buying patterns?
Some issues……

• Credit rating agencies want to rate individuals to classify them into good or bad lending risks
• A retail outlet wants to know the behavioural pattern of consumers' purchases across two product categories: national brands and local brands
• A marketing manager wonders what exactly makes a consumer buy his product
• A restaurant wants to know tourists' attitudes towards travelling to a tourist place
• Measuring customer satisfaction in a retail mall
• Finding the dimensions of a service quality model
• Knowing the factors affecting performance appraisal in a particular industry, etc.
Once we have decided what is to be analyzed, the next issue is:
HOW IS IT TO BE DONE?
And IT TAKES US TO……
If you think that you are curious enough to look for answers to these issues, then you have to equip yourself with…
Exploratory Factor Analysis; Confirmatory Factor Analysis; Cluster Analysis; RFM Analysis; Churn Analysis; Customer Lifetime Value Analysis; Text Analytics; Multiple Regression Analysis; Dummy Variable Regression Analysis; Probit Model; Market Basket Analysis; CART Analysis; Conjoint Analysis; Multidimensional Scaling; Balanced Scorecard; Analytic Hierarchy Process (AHP); DEMATEL; TOPSIS; Neural Networks; ANP; DEA; Image Analytics; Video Analytics; Natural Language Processing; Decision Trees; Elasticity of Demand; Break-Even Analysis; MR and MC Analysis; Input-Output Analysis; Cartel Pricing Strategy; CRM; eCRM; mCRM; iCRM
Big Data Analytics?
(Diagram: the shift from Data to Big Data and from Data Analysis to Data Analytics, spanning data collection of structured, unstructured and big data, data mining, estimation against the objectives of research, visualisation, analysis, and decision making.)
Who's Generating Big Data?

• Mobile devices (tracking all objects all the time)
• Social media and networks (all of us are generating data)
• Scientific instruments (collecting all sorts of data)
• Sensor technology and networks (measuring all kinds of data)

Progress and innovation are no longer hindered by the ability to collect data, but by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion.

Four Characteristics of Big Data

• Volume: cost-efficiently processing the growing volume of data (a 50x increase from 2010 to 2020, to 35 ZB)
• Velocity: responding to the increasing velocity (30 billion RFID sensors and counting)
• Variety: collectively analyzing the broadening variety (80% of the world's data is unstructured)
• Veracity: establishing the veracity of big data sources (1 in 3 business leaders don't trust the information they use to make decisions)
Top 10 Analytics Training Institutes

• AnalytixLabs: Delhi/NCR
• Edvancer: Mumbai
• International School of Engineering (INSOFE): Hyderabad
• Imarticus Learning: Mumbai
• Edureka: Bangalore
• IMS Proschool: Mumbai
• Manipal ProLearn: Bangalore
• IVY Pro School: Kolkata
• NIVT: Kolkata
• Orange Tree Global: Kolkata

Tools: hands-on exposure to IBM DB2, IBM Cognos TM1, IBM Cognos Insight, IBM InfoSphere BigInsights, IBM Worklight, IBM Bluemix, R, Python, SAS, Hadoop, MapReduce, EC2, AWS, Weka, etc.
Analytics

Analytics = Logic and Reasoning + Statistics + Software/IT + Econometrics + Mathematics + Data Mining + Research Methodology + Multivariate Methods + Optimization + Multi-Criteria Techniques, delivering:
1. Visualization
2. Inferences
3. Modeling

Marketing Analytics???
Big Data + descriptive, predictive and prescriptive tools and techniques, applied as Customer Analytics, Product Analytics and Promotion Analytics.
Data Analytics: Scope

Pipeline: Data → Visualization → Measure → Insight → Decision
• Data Visualization: graphically; pictures/images/trends; trending
• Measure: inferences; perception; relationships
• Insight: expert opinion
• Decision: deterministic; stochastic; algorithms; complex (fuzzy)

Application areas:
• Marketing: ROI, CLV, Brand Equity, Marketing Mix
• Finance: Price, Revenue, BEP, NPV, IRR
• HR: Competency, Performance, Retention, Compensation
• Operations: Assignment, I/O, TQM, Inventory
• Others
Four Types of Analytics

• Descriptive Analytics: What is happening?
• Diagnostic Analytics: Why did it happen?
• Predictive Analytics: What is likely to happen?
• Prescriptive Analytics: What should I do about it?
Marketing Database Analytics

According to Drucker, the overriding objective of any business is to create a customer. Given that, it follows that marketing has three primary goals:
1. New customer acquisition (persuasion);
2. Current customer retention (persuasion);
3. Marketing mix optimization (economic rationalization).
In view of the above, the goal of marketing database analytics is to contribute to the creation of informational advantage by providing an ongoing flow of decision-guiding, competitively-advantageous knowledge.
Commercial Software for Big Data Analytics

• SQL
• SPSS
• KNIME
• Azure ML
Scope of Marketing Analytics
Our focus will be:
• 10 tools and techniques
• 8 experiments
• 4 case discussions
• Software: SPSS, AMOS and Eviews
Focus on four aspects:
1. Customer Analytics
2. Product Analytics
3. Price Analytics
4. Multivariate Tools
Marketing Analytics

Class | Topic | Process | Requirement
Class-1 | Overview of Data Analytics | |
Class-2 | RFM Analysis | Customer Analytics Approach | SPSS
Class-3 | Exploratory Factor Analysis | Product Analytics Approach | Excel
Class-4 | Multiple Regression Analysis | Product Analytics Approach | SPSS
Class-5 | Dummy Variable Regression Analysis | Product Analytics Approach | SPSS and Eviews
Class-6 | Survival Analysis | Product Analytics Approach | SPSS and Eviews
Class-7 | Cluster Analysis | Customer Analytics Approach | SPSS
Class-8 | Discriminant Analysis | Customer Analytics Approach | SPSS
Class-9 | Conjoint Analysis | Product Analytics Approach | SPSS
Class-10 | Multidimensional Scaling and Correspondence Analysis | Product Analytics Approach | SPSS
Tools and Techniques

Objective | Analytics Tool
Impact | Regression Analysis
Exploring | Exploratory Factor Analysis
Marketing Model | Survival Analysis
Validation of Model | Structural Equation Modeling
Dichotomous Decision | Discriminant Analysis
Segmentation | RFM Analysis
Positioning | Multidimensional Scaling
Attributes and Strategy | Conjoint Analysis
Demographic Impact on Marketing | Dummy Variable Regression
Revenue/Price Analytics | Demand Analysis
Survival Analysis; Exploratory Factor Analysis; RFM Analysis; Regression Analysis; Multidimensional Scaling; CART Analysis; Cluster Analysis; Conjoint Analysis; MDS and Correspondence Analysis; Discriminant Analysis; Sentiment Analysis; Price Analytics

Marketing Analytics: Tools and Techniques

Customer Analytics
Objectives of Customer Analytics:
• Understanding customers
• Sentiments of customers
• Retaining your customers
• Customer lifetime value
• Tracking your customers
• Buying behaviour
• Challenges in the online market
• Consumer decision-making style
Tools and Techniques:
• Exploratory Factor Analysis
• Confirmatory Factor Analysis
• Cluster Analysis
• RFM Analysis
• Churn Analysis
• Customer Lifetime Value Analysis
• Text Analytics
Product Analytics
Objectives:
• Positioning of product
• Analyzing profitability
• Defectives analysis and losses
• Tracking product movement
• Product intimacy
• Cross-selling and up-selling strategy
• Product strategy
Tools and Techniques:
• Multiple Regression Analysis
• Dummy Variable Regression Analysis
• Probit Model
• Market Basket Analysis
• CART Analysis
• Conjoint Analysis
• Multidimensional Scaling
Competitor Analytics
Objectives:
• Benchmarking
• Competitiveness
• Aggressive and defensive strategy
• Entry and exit
• Behaviours of brands
• Product movements
• Prioritization of strategy
• Understanding cause and effect in complex situations
• Ranking of competitors
• Developing structural models for marketing strategy
• Measuring efficiency
Tools and Techniques:
• Balanced Scorecard
• Analytic Hierarchy Process (AHP)
• DEMATEL
• TOPSIS
• CART
• Neural Networks
• ANP
• DEA
Promotion Analytics
Objectives:
• Effectiveness of promotion
• Selecting the right media
• ROI on promotional expenses
• Customer feedback
• Reaction to advertisements
Tools and Techniques:
• Image Analytics
• Video Analytics
• Text Analytics
• Natural Language Processing
• Decision Trees
Service Analytics
Objectives:
• Retaining customers
• Attracting customers
• Monitoring the servicescape
• Competitiveness
Tools and Techniques:
• Churn Analysis
• CRM, eCRM, mCRM, iCRM
Price Analytics
Objectives:
• Demand management
• Becoming more competitive
• Sustainability
• Supply management
• Profitability
Tools and Techniques:
• Elasticity of Demand
• Break-Even Analysis
• MR and MC Analysis
• Input-Output Analysis
• Cartel Pricing Strategy
Exploratory Factor Analysis
Factor analysis is a statistical method used to describe variability among observed, correlated variables in terms of a potentially lower number of unobserved variables called factors. For example, it is possible that variations in four observed variables mainly reflect the variations in two unobserved variables.
• Factor analysis in marketing is important because it reflects the perception of the buyer of the product.
• By testing variables, it is possible for marketing professionals to determine what is important to the customers of the product.
Confirmatory Factor Analysis
Confirmatory factor analysis (CFA) is a more complex approach that tests the hypothesis that the items are associated with specific factors.
1. CFA uses structural equation modeling to test a measurement model whereby loading on the factors allows for evaluation of relationships between observed variables and unobserved variables.
2. Structural equation modeling approaches can accommodate measurement error, and are less restrictive than least-squares estimation.
3. Hypothesized models are tested against actual data, and the analysis demonstrates loadings of observed variables on the latent variables (factors), as well as the correlation between the latent variables.
Structural Equation Modeling
Multiple Regression Analysis
Discriminant Analysis
Cluster Analysis
Multidimensional Scaling
Conjoint Analysis
Churn Analysis……
Churn: a term used to describe customer attrition or loss.
Churn rate: the number of participants who discontinue their use of a service divided by the average number of total participants during a period.
Reasons for churn:
• Easy to switch providers
• Difficult to manage the customer data
• Inadequate services
• Quality of service
• Plenty of attractive offers
• Customer dissatisfaction
(Figure: results of CDA applied on the data set, with churners in black, non-churners in red and "returners" in green.)
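A minimal sketch of the churn-rate definition above in Python; the figures are made up for illustration:

```python
def churn_rate(customers_lost: int, customers_start: int, customers_end: int) -> float:
    """Churn rate = participants who discontinued / average number of participants in the period."""
    average_participants = (customers_start + customers_end) / 2
    return customers_lost / average_participants

# e.g. a mall loses 150 loyalty-card customers in a quarter that starts with 2,000 and ends with 1,900
print(f"Churn rate: {churn_rate(150, 2000, 1900):.1%}")  # Churn rate: 7.7%
```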
RFM Analysis……
RFM stands for Recency, Frequency & Monetary analysis.
 Recency: When did the customer make their last purchase?
 Frequency: How often does the customer make a purchase?
 Monetary: How much money does the customer spend?
Recency value: the date of the last order; the most powerful predictor of who is likely to order.
Frequency value: the next most powerful predictor of response.
Monetary value: use the average order size of the orders counted for frequency.
Tools and Techniques: Cluster Analysis; K-means Algorithm.
RFM solves STP issues and strategy:
 How recently (R) a customer has ordered
 How frequently (F) a customer has ordered
 How much money (M) the customer has spent
A company may also compute a value (D) for the number of days that elapse between invoice and arrival of payment, and a value (P) identifying who pays fastest.
CVM Analysis……
Customer value management (CVM) is a process that refines and leverages the benefits of customer relationship management. It encompasses:
 Customer identification
 Contact management
 Campaign management
 Advanced data modeling
 Customer scoring
Tools and Techniques: Factor Analysis; Profitability Index Analysis; Campaign Analytics
Funnel Analysis……
Funnel analyses "are an effective way to calculate conversion rates on specific user behaviors". This can be in the form of a sale, registration, or other intended action from an audience. The origin of the term comes from the nature of a funnel: individuals enter the funnel, yet only a small number of them perform the intended goal.
In a funnel analysis, you (see the sketch after this list):
 Identify a specific workflow: this is your funnel
 Identify the steps that a user must take to work through that funnel
 Identify the set of users who start on that funnel
 Analyze what fraction of that set of users drop out at each step in the funnel
 Work out what caused users to drop out, with a view to improving your product so that a smaller fraction of users drop out at each step
Tools and Techniques: Multidimensional Scaling mapping; CART techniques; Conjoint Analysis; Cohort Analysis (breaking users down into similar groups to gain a more focused understanding of their behavior)
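A minimal sketch of those steps in Python; the funnel stages and user counts are made up for illustration:

```python
# Each stage of the funnel with the number of users who reached it.
funnel = [
    ("Visited site", 10_000),
    ("Viewed product", 4_000),
    ("Added to cart", 1_200),
    ("Started checkout", 600),
    ("Completed purchase", 420),
]

# Conversion and drop-out between consecutive steps.
for (step, users), (_, prev) in zip(funnel[1:], funnel[:-1]):
    print(f"{step:20s} {users:6d} users | step conversion {users / prev:5.1%} | drop-out {1 - users / prev:5.1%}")

print(f"Overall conversion: {funnel[-1][1] / funnel[0][1]:.1%}")
```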
Market Basket Analysis
Market Basket Analysis (Association Analysis) is a mathematical modeling technique based upon the theory that if you buy a certain group of items, you are likely to buy another group of items. It is used to analyze customer purchasing behavior, and it helps in increasing sales and maintaining inventory by focusing on point-of-sale transaction data.
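A minimal association-analysis sketch; the open-source mlxtend package is used here as one common choice (an assumption, not a tool named in these slides), and the transactions are illustrative:

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Illustrative point-of-sale transactions.
transactions = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "milk"],
    ["bread", "butter"],
    ["bread", "butter", "milk"],
]

# One-hot encode the baskets, then mine frequent itemsets and rules of the form
# "if you buy this group of items, you are likely to buy that group of items".
te = TransactionEncoder()
basket = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)
itemsets = apriori(basket, min_support=0.4, use_colnames=True)
rules = association_rules(itemsets, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])
```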
Text Analytics
Commercial Software
http://www.capterra.com/statistical-analysis-software/

http://www.predictiveanalyticstoday.com/top-data-analysis-software/
THINK BIG, Start Small

• Imagine it: build a culture that infuses analytics everywhere
• Realize it: invest ahead of scale in a big data & analytics platform
• Trust it: be proactive about privacy, security and governance
RFM Analysis
Dr. Manoj Kumar Dash
Indian Institute of Information Technology and Management, Gwalior
Big Question?
 Can you identify your best customers?
 Do you know who your worst customers are? Do you know which
customers you just lost, and which ones you’re about to lose?
 Can you identify loyal customers who buy often, but spend very
little?
 Can you target customers who are willing to spend the most at
your store?
If you answered “no,” then consider RFM Analysis
Today, retailers should use RFM to increase conversion rates,
personalization, relevancy and revenue. Sophisticated shoppers
demand personalized shopping experiences, and RFM Analysis is an
excellent way to provide highly relevant, personalized campaigns
that reflect the preferences of the customers they want to keep.
Responsiveness & Profitability are not the same
Responsive customers may not be the most profitable:
• Not all responsive customers (what RFM identifies) are profitable
• Not all profitable customers (what LTV identifies) will respond when you write to them
RFM-BACKGROUND
• RFM analysis is a technique used to identify existing customers who are most likely
to respond to a new offer
RFM analysis is based on the following simple theory:
 The most important factor in identifying customers who are likely to respond to a
new offer is recency. Customers who purchased more recently are more likely to
purchase again than are customers who purchased further in the past.
 The second most important factor is frequency. Customers who have made more
purchases in the past are more likely to respond than are those who have made
fewer purchases.
 The third most important factor is total amount spent, which is referred to
as monetary. Customers who have spent more (in total for all purchases) in the
past are more likely to respond than those who have spent less.
7 Ways to Use RFM for a Smart Marketing Strategy
Here are seven ways to use RFM to target your marketing campaigns more precisely and utilize your
marketing resources more effectively:
Understand your best customers. Once you’ve identified your best customers, you can create
demographic profiles to gain insights into the characteristics they share. You also can append data to
their records, such as company size or NAICS code, for an even fuller picture.
Find the low-hanging fruit among your next-best customers. Take a careful look at the customers in
deciles 3-7 whose demographic profiles are similar to your best customers. This is likely to be your best
upselling opportunity.
Target the right prospects on rented mailing lists. Armed with information about the characteristics of
your best customers, you can be extremely selective about the names you rent on commercial mailing
lists, which can cut your costs and increase response.
Reallocate sales support. RFM can help you reassess the level of sales support appropriate for each
customer based on their value and potential. Your goal should be to deploy your most expensive sales
resource – your sales team – on customers who already generate the most profit or have the highest
potential to buy more.
Develop tiered direct marketing campaigns. Focus high-end direct marketing campaigns on your
highest-value customers and mail less expensive campaigns to lower-value customers. You might send
best customers a personalized direct mail package with a product sample, for example, while others get
a simple self-mailer offering a free product sample on request.
Test a high-end marketing campaign to high potential customers. Once you’ve identified customers in
deciles 3-7 with the same demographics as your best customers, test a more elaborate direct marketing
campaign to these customers to try to increase their profitability.
Decide which customers to drop from marketing. Customers in deciles 8-10 probably should be
dropped from your mailing lists and marketing campaigns because of their low value. It may be costing
you more to sell to them than they’re worth.
How does it work?
The goal of RFM Analysis is to segment
customers based on buying behavior. To do
this, we need to understand the historical
actions of individual customers for each RFM
factor.
We then rank customers based on each
individual RFM factor, and finally pull all the
factors together to create RFM segments for
targeted marketing
Definition
 Recency is the number of days since the customer's last purchase. Typically, its value is defined in days. For example, if the customer's order was 42 days ago, then their Recency input is "42."
 Frequency is the number of orders placed in a given time period. If a customer has placed seven orders over the course of one year, then their Frequency input is "7."
 Monetary Value is the total amount of money spent by the customer over a given time period. If a customer has made 5 orders of Rs 50 each over the course of one year, their Monetary Value input for the year is Rs 250.

FOR MORE DETAILS, see the ebook on RFM: http://www.e-rfm.com/Libey/Libeybook2.html
How RFM Analysis Works

• Customers are assigned a recency score based on the date of the most recent purchase or the time interval since the most recent purchase. This score is based on a simple ranking of recency values into a small number of categories.
– For example, if you use five categories, the customers with the most recent purchase dates receive a recency ranking of 5, and those with purchase dates furthest in the past receive a recency ranking of 1.
• In a similar fashion, customers are then assigned a frequency ranking, with higher values representing a higher frequency of purchases.
– For example, in a five-category ranking scheme, customers who purchase most often receive a frequency ranking of 5.
• Finally, customers are ranked by monetary value, with the highest monetary values receiving the highest ranking.
– Continuing the five-category example, customers who have spent the most would receive a monetary ranking of 5.
• The result is four scores for each customer: recency, frequency, monetary, and a combined RFM score, which is simply the three individual scores concatenated into a single value. The "best" customers (those most likely to respond to an offer) are those with the highest combined RFM scores. For example, in a five-category ranking, there is a total of 125 possible combined RFM scores, and the highest combined RFM score is 555.
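A minimal sketch of that scoring logic in Python with pandas (the slides themselves use SPSS); the customer table and column names are illustrative:

```python
import pandas as pd

customers = pd.DataFrame({
    "customer_id": range(1, 11),
    "days_since_purchase": [3, 40, 12, 90, 7, 200, 55, 21, 130, 1],
    "n_orders": [12, 3, 7, 1, 9, 2, 4, 6, 1, 15],
    "total_spent": [5400, 800, 2100, 150, 3900, 300, 1100, 1700, 90, 7200],
})

# Rank each factor into quintiles: 5 = most recent / most frequent / highest spend.
# Ties are broken by first occurrence so the five bins stay equal-sized.
customers["R"] = pd.qcut(customers["days_since_purchase"], 5, labels=[5, 4, 3, 2, 1]).astype(int)
customers["F"] = pd.qcut(customers["n_orders"].rank(method="first"), 5, labels=[1, 2, 3, 4, 5]).astype(int)
customers["M"] = pd.qcut(customers["total_spent"].rank(method="first"), 5, labels=[1, 2, 3, 4, 5]).astype(int)

# Combined RFM score: the three digits concatenated, e.g. 555 for the best customers.
customers["RFM"] = customers["R"] * 100 + customers["F"] * 10 + customers["M"]
print(customers.sort_values("RFM", ascending=False))
```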
Recency Value

• Recency: the time of a customer’s most recent purchase.


– A relatively long period of purchase inactivity can signal to the firm that
the customer has ended the relationship.
• Recency values are assigned to each customer and these values
represent the following categories on a scale from 1 to 5:
1. Not recent at all
2. Not recent
3. Somewhat recent
4. Recent
5. Very recent
• The specific cutoff points depend on the specific marketing campaign
and are decided by the marketing team based on the type of purchase.
Frequency value

• Frequency: the number of a customer’s past purchases.


• Frequency values are assigned to each customer and these
values represent the following categories on a scale from 1 to 5:
1. Not frequent at all
2. Not frequent
3. Somewhat frequent
4. Frequent
5. Very frequent
• The specific cutoff points for each category and the number of
frequency categories are decided by the marketing team based
on the type of purchase.
Monetary Value

• Monetary value is based on the average purchase amount per


customer transaction.
• In this chapter the average amount of purchase is used and
categories are defined as:
1. Very small buyer
2. Small buyer
3. Normal buyer
4. Large buyer
5. Very large buyer
• The specific cutoff points can be decided based on the type of
purchases.
– Using the quintile values for the average price can be an alternative
approach for the cutoff points.
RFM Can Predict Responders
• For product launch, select SICs with highest
penetration ratios
• Use RFM to select most likely responders
• Use combination of mail, phone, and sales
visits to responsive relationship buyers.
How to Apply Recency Codes

• Put most recent purchase date into every customer record


• Sort database by that date - newest to oldest
• Divide into five equal parts - Quintiles
• Assign “5” to top group, “4” to next, etc.
• Put quintile number in each customer record
Response by Recency Quintile (response rate by quintile)

Quintile 5: 3.49% | Quintile 4: 1.25% | Quintile 3: 1.08% | Quintile 2: 0.63% | Quintile 1: 0.26%
How to compute a Frequency
Index
• Keep number of transactions in customer
record
• Sort Recency Groups from highest to lowest
• Divide into five equal groups
• Number groups from 5 to 1
• Put Quintile number in each customer record
Response by Frequency Quintile (response rate by quintile)

Quintile 5: 1.99% | Quintile 4: 1.56% | Quintile 3: 1.31% | Quintile 2: 0.92% | Quintile 1: 0.93%
How to compute a Monetary Index
• Store total dollars purchased in each customer
record
• Sort Frequency Groups from highest to lowest
• Divide into 5 equal groups (Quintiles)
• Number Quintiles 5, 4, 3, 2, 1
• Put Quintile number in each record
Response by Monetary Quintile (response rate by quintile)

Quintile 5: 1.61% | Quintile 4: 1.45% | Quintile 3: 1.46% | Quintile 2: 1.22% | Quintile 1: 1.23%
Monetary Response to Rs 5,000 Product (percentage of households promoted who purchased, by monetary quintile)

Quintile 5: 1.68 | Quintile 4: 1.17 | Quintile 3: 0.88 | Quintile 2: 0.66 | Quintile 1: 0.32
Result of Test Mailing to 30,000

#  | RFM | Mailed | Responses | Response Rate
1  | 555 | 240 | 20 | 8.15%
2  | 554 | 240 | 16 | 6.56%
3  | 553 | 240 | 13 | 5.62%
4  | 552 | 240 | 10 | 4.33%
5  | 551 | 240 | 11 | 4.51%
6  | 545 | 240 | 9  | 3.78%
7  | 544 | 240 | 12 | 4.98%
8  | 543 | 240 | 6  | 2.88%
9  | 542 | 240 | 10 | 4.26%
10 | 541 | 240 | 7  | 3.10%
11 | 535 | 240 | 10 | 4.13%
12 | 534 | 240 | 9  | 3.83%
13 | 533 | 240 | 8  | 3.35%
14 | 532 | 240 | 6  | 2.70%
Step

There are three basic steps to RFM analysis:


1. Sort all customers in ascending order based
on Recency, Frequency and Monetary Value.
2. Split customers into quartiles for each factor.
3. Combine factors to group customers into
RFM segments for targeted marketing.
Contd.
• With customers now organized in ascending
order, divide them into FOUR equal groups, for
each RFM factor.
• The customers in the top quartile represent your
best customers for each factor.
• For example, the top quartile for Monetary Value
will have the 25% of your customers who have
spent the most at your store.
• Each quartile also has a name: the top quartile
for Recency is called R-1, the second quartile is
called R-2, and so on.
Segments
With customers now in quartiles, it's time to group them into RFM segments. Let's say you have a customer who purchased an item 17 days ago (R=1), bought 7 times in the last year (F=1), and spent 1568 total in the past year (M=1). As a result, we place this customer in RFM segment "111." Segment 111 contains your "Best Customers."

Dividing into quartiles will create 64 RFM segments: 4 Recency groups x 4 Frequency groups x 4 Monetary Value groups. Common RFM Analysis practice is to start by dividing customers into quintiles, which creates 125 RFM segments. While the finer granularity created by 125 customer segments will help you increase conversion rates, it also reduces the number of customers in each cell. For most online marketers, quartiles will be sufficient.
Marketing to RFM Segments
Data Considerations

• If data rows represent transactions (each row


represents a single transaction, and there may be
multiple transactions for each customer), use
RFM from Transactions.
• If data rows represent customers with summary
information for all transactions (with columns
that contain values for total amount spent, total
number of transactions, and most recent
transaction date), use RFM from Customer Data.
RFM Scores from Transaction Data

• RFM (Recency, Frequency, Monetary) analysis is a technique used to


identify existing customers who are most likely to respond to a new offer.
– This technique is commonly used in direct marketing
Data Considerations
• In a transaction data file, each row represents a separate transaction,
rather than a separate customer, and there can be multiple transaction
rows for each customer.
– If data rows represent customers with summary information for all
transactions (with columns that contain values for total amount spent, total
number of transactions, and most recent transaction date), see RFM Scores
from Customer Data.
The dataset must contain variables that contain the following information:
• A variable or combination of variables that identify each case (customer).
• A variable with the date of each transaction.
• A variable with the monetary value of each transaction.
Creating RFM Scores from Transaction Data
1. From the menus choose:
Direct Marketing > Choose Technique
2. Select Help identify my best contacts (RFM Analysis) and click Continue.
3. Select Transaction data and click Continue.
4. Select the variable that contains transaction dates.
5. Select the variable that contains the monetary amount for each
transaction.
6. Select the method for summarizing transaction amounts for each
customer: Total (sum of all transactions), mean, median, or maximum
(highest transaction amount).
7. Select the variable or combination of variables that uniquely identifies
each customer. For example, cases could be identified by a unique ID code
or a combination of last name and first name.
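Outside SPSS, the same summarization from a transaction file to one row per customer can be sketched with pandas; the column names are illustrative:

```python
import pandas as pd

tx = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3],
    "date": pd.to_datetime(["2018-01-05", "2018-11-20", "2018-06-02",
                            "2018-03-15", "2018-07-09", "2018-11-28"]),
    "amount": [250.0, 400.0, 120.0, 90.0, 60.0, 300.0],
})

customer_summary = tx.groupby("customer_id").agg(
    last_purchase=("date", "max"),       # feeds the recency score
    n_transactions=("date", "count"),    # feeds the frequency score
    total_spent=("amount", "sum"),       # feeds the monetary score (mean/median/max also possible)
).reset_index()
print(customer_summary)
```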
RFM Scores from Customer Data

RFM (Recency, Frequency, Monetary) analysis is a technique used to identify


existing customers who are most likely to respond to a new offer.
Data Considerations
 In a customer data file, each row represents a customer, and there is only
one row (case) for each customer. If data rows represent transactions,
see RFM Scores from Transaction Data.
The dataset must contain variables that contain the following information:
 Most recent purchase date or a time interval since the most recent
purchase date. This will be used to compute recency scores.
 Total number of purchases. This will be used to compute frequency scores.
 Summary monetary value for all purchases. This will be used to compute
monetary scores. Typically, this is the sum (total) of all purchases, but it
could be the mean (average), maximum (largest amount), or other
summary measure.
 If you want to write RFM scores to a new dataset, the active dataset must
also contain a variable or combination of variables that identify each case
(customer).
RFM THROUGH SPSS
1. From the menus choose:
Direct Marketing > Choose Technique
2. Select Help identify my best contacts (RFM Analysis) and click Continue.
3. Select Customer data and click Continue.
4. Select the variable that contains the most recent transaction date or a number
that represents a time interval since the most recent transaction.
5. Select the variable that contains the total number of transactions for each
customer.
6. Select the variable that contains the summary monetary amount for each
customer.
If you want to write RFM scores to a new dataset, select the variable or combination
of variables that uniquely identifies each customer. For example, cases could be
identified by a unique ID code or a combination of last name and first name.
RFM Binning

• The process of grouping a large number of


numeric values into a small number of
categories is sometimes referred to as binning
• In RFM analysis, the bins are the ranked
categories. You can use the Binning tab to
modify the method used to assign recency,
frequency, and monetary values to those bins.
RFM BINNING METHOD
1. Nested.
 In nested binning, a simple rank is assigned to recency values.
 Within each recency rank, customers are then assigned a frequency rank, and within each frequency rank, customers are assigned a monetary rank.
 This tends to provide a more even distribution of combined
RFM scores, but it has the disadvantage of making frequency
and monetary rank scores more difficult to interpret.
 For example, a frequency rank of 5 for a customer with
a recency rank of 5 may not mean the same thing as a
frequency rank of 5 for a customer with a recency rank of 4,
since the frequency rank is dependent on the recency rank.
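A rough sketch of nested binning in pandas, reusing the customers table from the earlier RFM sketch; rank_bins is a simplified equal-count binning helper, not SPSS's exact algorithm:

```python
import numpy as np
import pandas as pd

def rank_bins(series: pd.Series, bins: int = 5) -> pd.Series:
    """Equal-count binning by rank: 1 = lowest values, `bins` = highest."""
    ranks = series.rank(method="first")
    return np.ceil(ranks * bins / len(series)).astype(int)

# Simple rank for recency (negated so the most recent customers get the top bin),
# then frequency ranked *within* each recency bin, and monetary within each (R, F) cell.
customers["R"] = rank_bins(-customers["days_since_purchase"])
customers["F"] = customers.groupby("R")["n_orders"].transform(rank_bins)
customers["M"] = customers.groupby(["R", "F"])["total_spent"].transform(rank_bins)
```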
CONTD.
2. Independent
 Simple ranks are assigned to recency, frequency, and monetary values.
 The three ranks are assigned independently.
 The interpretation of each of the three RFM components is therefore unambiguous; a frequency score of 5 for one customer means the same as a frequency score of 5 for another customer, regardless of their recency scores.
 For smaller samples, this has the disadvantage of resulting in a less even distribution of combined RFM scores.
CONTD.
3. Number of Bins
 The number of categories (bins) to use for each component to create RFM scores.
 The total number of possible combined RFM scores is the product of the three values.
 For example, 5 recency bins, 4 frequency bins, and 3 monetary bins would create a total of 60 possible combined RFM scores, ranging from 111 to 543.
 The default is 5 for each component, which will create 125 possible combined RFM scores, ranging from 111 to 555.
 The maximum number of bins allowed for each score component is nine.
CONTD.
4. Ties
 A "tie" is simply two or more equal recency, frequency, or monetary
values. Ideally, you want to have approximately the same number
of customers in each bin, but a large number of tied values can
affect the bin distribution. There are two alternatives for handling
ties:
 Assign ties to the same bin. This method always assigns tied values
to the same bin, regardless of how this affects the bin distribution.
This provides a consistent binning method: If two customers have
the same recency value, then they will always be assigned the
same recency score.
 In an extreme example, however, you might have 1,000 customers,
with 500 of them making their most recent purchase on the same
date. In a 5-bin ranking, 50% of the customers would therefore
receive a recency score of 5, instead of the ideal value of 20%.
Saving RFM Scores from Transaction
Data
For customer data, you can add the RFM score variables to the active dataset or create a new dataset
that contains the selected scores variables. Use the Save Tab to specify what score variables you
want to save and where you want to save them.
Names of Saved Variables
 Automatically generate unique names. When adding score variables to the active dataset, this
ensures that new variable names are unique. This is particularly useful if you want to add multiple
different sets of RFM scores (based on different criteria) to the active dataset.
 Custom names. This allows you to assign your own variable names to the score variables. Variable
names must conform to standard variable naming rules.
Variables
Select (check) the score variables that you want to save:
 Recency score. The score assigned to each customer based on the value of the Transaction Date or
Interval variable selected on the Variables tab. Higher scores are assigned to more recent dates or
lower interval values.
 Frequency score. The score assigned to each customer based on the Number of Transactions
variable selected on the Variables tab. Higher scores are assigned to higher values.
 Monetary score. The score assigned to each customer based on the Amount variable selected on
the Variables tab. Higher scores are assigned to higher values.
 RFM score. The three individual scores combined into a single value: (recency*100) + (frequency*10) + monetary. For example, recency 5, frequency 4 and monetary 3 combine to 5*100 + 4*10 + 3 = 543.
Location

For customer data, there are three alternatives for where you can save
new RFM scores:
 Active dataset. Selected RFM score variables are added to active
dataset.
 New Dataset. Selected RFM score variables and the ID variables
that uniquely identify each customer (case) will be written to a new
dataset in the current session. Dataset names must conform to
standard variable naming rules. This option is only available if you
select one or more Customer Identifier variables on the Variables
tab.
 File. Selected RFM scores and the ID variables that uniquely identify
each customer (case) will be saved in an external data file. This
option is only available if you select one or more Customer
Identifier variables on the Variables tab.
RFM Procedure
– Transaction file
• Run RFM on transaction file
• Create a new RFM dataset
• Merge with customer file
– Customer file
• File is already prepared
• Run RFM on time since last purchase, number of
purchases, and money spent

SPSS Means Procedure
• Compare Means > Means
– DV: Response Variable (0/1)
– IV: RFM (Means and N)
• Copy output table to Excel
– Sort all columns by mean response
– Descending order
• Determine break-even and economics
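A rough pandas equivalent of this Means step, assuming a customer-level DataFrame df with a 0/1 response column and a combined RFM score column (illustrative names):

```python
# Mean response rate and customer count per RFM cell, sorted in descending order,
# mirroring the sorted SPSS/Excel table described above.
summary = (df.groupby("RFM")["response"]
             .agg(["mean", "count"])
             .sort_values("mean", ascending=False))
print(summary)
```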

Example of a data file in a spreadsheet
What is RFMPD Analysis?
• RFMPD includes 2 additional variables.
• P stands for Payment. This measures when the company receives payment.
• Customers who pay quickly receive a P score of 1, with the slowest paying receiving a score of 5.
• D stands for Date. This is the date of the customer's last payment.
• Customers are sorted by decreasing D values.
• The final score is based on a value for R, F, M, & P.
Uses of RFM Analysis
• Marketing departments of any company
• Customer Service Departments
• Customer Relations Departments
• Ranking Suppliers
• Ranking Salespeople
• Airlines
• Credit Card Companies
Strengths of RFM Analysis
• Companies have data that can be used for target
marketing.
• Marketing budgets will be focused on customers
who are more recent, more frequent and spend
more.
• Specific targeting can increase profit and reduce
costs; companies gain by not spending on
customers who will not add value
• You can offer incentives to middle scoring
customers to increase their purchases
• Analysis is quick and easy to interpret
Weaknesses of RFM Analysis
 It only looks at three variables and there may be others
that are more important
 Customers with low RFM scores may be ignored, even
though they may have legitimate reasons for spending
more with other vendors.
 Opportunities may be missed to solidify business
relationships leading to loss of future sales and
referrals.
 A customer with a low recency value and high spending
could be ranked lower than a customer who made a
recent purchase and spends 10 times less
Effectiveness of RFMP Analysis
• Customers scoring in the top 20% also pay the fastest. Companies will be able to make money faster, and this can be used to reduce other liabilities.
• Customers in the lowest 20% are slow payers
and companies can choose to limit credit or
change payment terms to reduce the amount
of outstanding debt.
1, 1, 1, 5 Customers
• Customer is one that has recently ordered, buys
frequently, spends large amounts of money but
they are a slow payer.
• To speed up the payment process, companies can
change payment terms and offer incentives to
pay earlier.
• For example, if the due date is 30 days but payments are received within 10 days, the buyer will receive a 2% discount off the bill (2/10 net 30).
5, 5, 5, 1 Customer
• This customer has not ordered recently or frequently and spends small amounts of money, but always pays on time.
• This customer is spending more money with
competitors.
• Make an effort to find out why the customer is
spending elsewhere to see if there is anything
the company can improve on.
RFMP or RFM?
• RFMP is a better method because it includes the variable of payment. With more variables, you have a clearer picture of the customer's value to the company.
• RFMP also takes into account the customer’s
payment history.
• If a customer pays on time, you know that there
are no cash flow issues.
• Slow payers may be having financial problems
which may increase in the future.
Using RFM for Salespeople
• RFM Analysis of Salespeople gives managers a
clear picture of how a salesperson is
performing
• You can analyze the amount of revenue
generated per person and compare different
salespeople
• It is also possible to identify opportunities for
additional training, promotion or employment
termination.
RFM or No RFM?
• RFM is best suited for companies who offer a
rewards program. They are able to track
spending and can offer their high profile
clients incentives to spend more.
• RFM is worst suited to companies who provide
products that are unique and will not be
purchased in large quantities.
Case Study
Description …
1. The dataset used in this case study was provided by a sports store and collected through its e-commerce website over a two-year period.
2. The complete dataset included 1584 different product demands in 54 sub-
groups and 6149 purchase orders of 2666 individual customers.
3. The purchase orders included many columns such as transaction id,
product id, customer id, ordering date, quantity, ordering amount (price),
sales type, discount and whether or not promotion was involved.
4. While customer table included demographic variables such as age, gender,
marital status, education level and geographic region; product table
included attributes such as barcode, brand, color, category, subcategory,
usage type and season.
What should you do?
• Maintain a customer database
• Maintain the most recent date, frequency of
orders and total dollar amount
• Put RFM cell codes into your records
• With each mailing, see which cells respond.
• Increase response and profits by NOT MAILING
non responsive cells
Books by Arthur Hughes

From McGraw Hill. Order at


www.dbmarketing.com
Thanks

DR MANOJ KUMAR DASH


Factor Analysis

Dr. Manoj Kumar Dash
M.A. (Eco); M.Phil. (Eco); M.B.A. (Mkt.); NET; Ph.D. (Eco)
Areas where we can apply multivariate analysis
• Credit rating agencies want to rate individuals to classify them into good or bad lending risks
• A retail outlet wants to know the behavioural pattern of consumers' purchases across two product categories: national brands and local brands
• A marketing manager wonders what exactly makes a consumer buy his product
• A restaurant wants to know tourists' attitudes towards travelling to a tourist place
• Measuring customer satisfaction in a retail mall
• Finding the dimensions of a service quality model
• Knowing the factors affecting performance appraisal in a particular industry, etc.



Scope of my presentation

• Concept of multivariate analysis
• Factor analysis
• Rules of thumb for interpreting factor analysis
• Case study of factor analysis through software and its interpretation
Multivariate Analysis
• Many statistical techniques focus on just one or two variables
• Multivariate analysis (MVA) techniques allow more than two variables to be analysed at once
• Multivariate analysis allows the effects of more than one variable to be considered at one time
Multivariate Analysis Methods
• Two general types of MVA technique
– Analysis of dependence
• Where one (or more) variables are dependent variables, to be explained or predicted by others
• One dependent variable (metric scale): Multiple regression
• One dependent variable (non-metric scale): Multiple discriminant analysis
• Several dependent variables (metric scale): MANOVA
• Several dependent variables (non-metric scale): Conjoint analysis
• Multiple independent variables: Canonical analysis
contd.
Analysis of interdependence

• Interdependence with metric scale: Factor analysis, cluster analysis, MDS
• Interdependence with non-metric scale: Non-metric multidimensional scaling
Level of Measurement and Multivariate Statistical Technique

Independent Variable | Dependent Variable | Technique
Numerical | Numerical | Multiple Regression
Nominal or Numerical | Nominal | Logistic Regression
Nominal or Numerical | Numerical (censored) | Cox Regression
Nominal or Numerical | Numerical | ANOVA, MANOVA
Nominal or Numerical | Nominal (2 or more values) | Discriminant Analysis
Numerical | No Dependent Variable | Factor and Cluster Analysis
Factor Analysis

• The purpose of factor analysis is to discover simple patterns in the pattern of relationships among the variables.
• In particular, it seeks to discover whether the observed variables can be explained largely or entirely in terms of a much smaller number of variables called factors.
• It is a data reduction technique designed to represent a wide range of attributes on a smaller number of dimensions.
• Factor analysis was invented nearly 100 years ago by psychologist Charles Spearman.
A typical factor analysis suggests answers to four major questions:

• How many different factors are needed to explain the pattern of relationships among these variables?
• What is the nature of those factors?
• How well do the hypothesized factors explain the observed data?
• How much purely random or unique variance does each observed variable include?
Contd.
• Factor analysis is a technique that is used to reduce a large number of variables into a smaller number of factors.
• Factor analysis extracts the maximum common variance from all variables and puts it into a common score. As an index of all variables, we can use this score for further analysis.
• Factor analysis is part of the general linear model (GLM) and assumes several things: there is a linear relationship, there is no multicollinearity, relevant variables are included in the analysis, and there is true correlation between variables and factors.
• Several types of factor analysis methods are available, but principal component analysis is used most commonly.
Types of factoring:

• Principal component analysis: This is the most common method used by researchers. PCA starts by extracting the maximum variance and putting it into the first factor. After that, it removes the variance explained by the first factor and then starts extracting the maximum variance for the second factor. This process continues to the last factor.

• Common factor analysis: the second most preferred method among researchers. It extracts the common variance and puts it into factors. Common factor analysis does not include the unique variance of the variables. This method is used in SEM modeling.

• Image factoring: This method is based on the correlation matrix. The OLS regression method is used to predict the factors in image factoring.

• Maximum likelihood method: This method also works on the correlation matrix, but it uses the maximum likelihood method to extract factors.
Factor Analysis

• For example, suppose that a bank asked a large number of questions about a given branch. Consider how the characteristics covered might be more parsimoniously represented by just a few constructs (factors).
Steps in Factor Analysis
• Factor analysis usually proceeds in four steps:
– 1st step: the correlation matrix for all variables is computed
– 2nd step: factor extraction
– 3rd step: factor rotation
– 4th step: make final decisions about the number of underlying factors
Steps in Factor Analysis: The Correlation Matrix
• 1st step: the correlation matrix
– Generate a correlation matrix for all variables
– Identify variables not related to other variables
– If the correlations between variables are small, it is unlikely that they share common factors (variables must be related to each other for the factor model to be appropriate)
– Think of correlations in absolute value
– Correlation coefficients greater than 0.3 in absolute value are indicative of acceptable correlations
– Examine visually the appropriateness of the factor model
Steps in Factor Analysis: The Correlation Matrix
– Bartlett Test of Sphericity:
 Used to test the hypothesis that the correlation matrix is an identity matrix (all diagonal terms are 1 and all off-diagonal terms are 0).
 If the value of the test statistic for sphericity is large and the associated significance level is small, it is unlikely that the population correlation matrix is an identity.
– If the hypothesis that the population correlation matrix is an identity cannot be rejected because the observed significance level is large, the use of the factor model should be reconsidered.
Steps in Factor Analysis: The Correlation Matrix
– The Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy:
 An index for comparing the magnitude of the observed correlation coefficients to the magnitude of the partial correlation coefficients.
 The closer the KMO measure is to 1, the more sizeable the sampling adequacy (.8 and higher are great, .7 is acceptable, .6 is mediocre, less than .5 is unacceptable).
 Reasonably large values are needed for a good factor analysis. Small KMO values indicate that a factor analysis of the variables may not be a good idea.
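As a minimal illustration of these two checks outside SPSS, the third-party factor_analyzer Python package exposes both; the survey data below is simulated:

```python
import numpy as np
import pandas as pd
from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity, calculate_kmo

# Simulate 200 respondents answering 6 items driven by 2 latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
items = latent @ rng.normal(size=(2, 6)) + rng.normal(scale=0.5, size=(200, 6))
df = pd.DataFrame(items, columns=[f"item{i}" for i in range(1, 7)])

chi_square, p_value = calculate_bartlett_sphericity(df)  # want large chi-square, small p
kmo_per_item, kmo_total = calculate_kmo(df)              # want overall KMO close to 1
print(f"Bartlett chi-square = {chi_square:.1f}, p = {p_value:.4f}, overall KMO = {kmo_total:.2f}")
```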
Steps in Factor Analysis: Factor Extraction
 2nd step: factor extraction
 The primary objective of this stage is to determine the factors.
 Initial decisions can be made here about the number of factors underlying a set of measured variables.
 Estimates of the initial factors are obtained using principal components analysis.
 Principal components analysis is the most commonly used extraction method. Other factor extraction methods include:
 Maximum likelihood method
 Principal axis factoring
 Alpha method
 Unweighted least squares method
 Generalized least squares method
 Image factoring
Steps in Factor Analysis: Factor Extraction
 In principal components analysis, linear combinations of the observed variables are formed.
 The 1st principal component is the combination that accounts for the largest amount of variance in the sample (1st extracted factor).
 The 2nd principal component accounts for the next largest amount of variance and is uncorrelated with the first (2nd extracted factor).
 Successive components explain progressively smaller portions of the total sample variance, and all are uncorrelated with each other.
Steps in Factor Analysis: Factor Extraction
 To decide how many factors we need to represent the data, we use two statistical criteria: Eigen values and the scree plot.
 The determination of the number of factors is usually done by considering only factors with Eigen values greater than 1.
 Factors with a variance less than 1 are no better than a single variable, since each variable is expected to have a variance of 1.

Total Variance Explained (Extraction Method: Principal Component Analysis)
Component | Initial Eigenvalue (Total) | % of Variance | Cumulative %
1 | 3.046 | 30.465 | 30.465
2 | 1.801 | 18.011 | 48.476
3 | 1.009 | 10.091 | 58.566
4 | .934 | 9.336 | 67.902
5 | .840 | 8.404 | 76.307
6 | .711 | 7.107 | 83.414
7 | .574 | 5.737 | 89.151
8 | .440 | 4.396 | 93.547
9 | .337 | 3.368 | 96.915
10 | .308 | 3.085 | 100.000
(For the three extracted components, the extraction sums of squared loadings equal the initial eigenvalues.)
Steps in Factor Analysis: Factor Extraction
 The examination of the scree plot provides a visual of the total variance associated with each factor.
 The steep slope shows the large factors.
 The gradual trailing off (scree) shows the rest of the factors, usually lower than an Eigen value of 1.
 In choosing the number of factors, in addition to the statistical criteria, one should make initial decisions based on conceptual and theoretical grounds.
 At this stage, the decision about the number of factors is not final.
Steps in Factor Analysis: Factor Extraction
Component Matrix using Principal Component Analysis (Extraction Method: Principal Component Analysis; 3 components extracted)

Item | Component 1 | Component 2 | Component 3
I discussed my frustrations and feelings with person(s) in school | .771 | -.271 | .121
I tried to develop a step-by-step plan of action to remedy the problems | .545 | .530 | .264
I expressed my emotions to my family and close friends | .580 | -.311 | .265
I read, attended workshops, or sought some other educational approach to correct the problem | .398 | .356 | -.374
I tried to be emotionally honest with myself about the problems | .436 | .441 | -.368
I sought advice from others on how I should solve the problems | .705 | -.362 | .117
I explored the emotions caused by the problems | .594 | .184 | -.537
I took direct action to try to correct the problems | .074 | .640 | .443
I told someone I could trust about how I felt about the problems | .752 | -.351 | .081
I put aside other activities so that I could work to solve the problems | .225 | .576 | .272
Steps in Factor Analysis: Factor Rotation
 3rd step: factor rotation
 In this step, factors are rotated.
 Un-rotated factors are typically not very interpretable (most factors are correlated with many variables).
 Factors are rotated to make them more meaningful and easier to interpret (each variable is associated with a minimal number of factors).
 Different rotation methods may result in the identification of somewhat different factors.
Steps in Factor Analysis: Factor Rotation
 The most popular rotational method is Varimax rotation.
 Varimax uses orthogonal rotations, yielding uncorrelated factors/components.
 Varimax attempts to minimize the number of variables that have high loadings on a factor. This enhances the interpretability of the factors.
Steps in Factor Analysis: Factor Rotation
• Another common rotational approach is Oblique rotation, which yields correlated factors.
• Oblique rotations are less frequently used because their results are more difficult to summarize.
• Other rotational methods include:
 Quartimax (orthogonal)
 Equamax (orthogonal)
 Promax (oblique)
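Extraction plus Varimax rotation can likewise be sketched with the factor_analyzer package (an assumption, since the slides use SPSS), reusing the simulated df from the adequacy-check sketch earlier:

```python
from factor_analyzer import FactorAnalyzer

# Principal-components extraction with an orthogonal Varimax rotation.
fa = FactorAnalyzer(n_factors=2, rotation="varimax", method="principal")
fa.fit(df)

eigenvalues, _ = fa.get_eigenvalues()                # keep factors with eigenvalue > 1
print("Eigenvalues:", eigenvalues.round(3))
print("Rotated loadings:\n", fa.loadings_.round(3))  # name factors from the large loadings
print("Communalities:", fa.get_communalities().round(3))
```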
Steps in Factor Analysis: Factor Rotation
• A factor is interpreted or named by examining the largest values linking the factor to the measured variables in the rotated factor matrix.

Rotated Component Matrix (Extraction Method: Principal Component Analysis; Rotation Method: Varimax with Kaiser Normalization; rotation converged in 5 iterations)

Item | Component 1 | Component 2 | Component 3
I discussed my frustrations and feelings with person(s) in school | .803 | .186 | .050
I tried to develop a step-by-step plan of action to remedy the problems | .270 | .304 | .694
I expressed my emotions to my family and close friends | .706 | -.036 | .059
I read, attended workshops, or sought some other educational approach to correct the problem | .050 | .633 | .145
I tried to be emotionally honest with myself about the problems | .042 | .685 | .222
I sought advice from others on how I should solve the problems | .792 | .117 | -.038
I explored the emotions caused by the problems | .248 | .782 | -.037
I took direct action to try to correct the problems | -.120 | -.023 | .772
I told someone I could trust about how I felt about the problems | .815 | .172 | -.040
I put aside other activities so that I could work to solve the problems | -.014 | .155 | .657
Steps in Factor Analysis: Making Final Decisions
• 4th step: making final decisions
– The final decision about the number of factors to choose is the number of factors for the rotated solution that is most interpretable.
– To identify factors, group variables that have large loadings for the same factor.
– Plots of loadings provide a visual for variable clusters.
– Interpret factors according to the meaning of the variables.
• This decision should be guided by:
– A priori conceptual beliefs about the number of factors from past research or theory
– Eigen values computed in step 2
– The relative interpretability of rotated solutions computed in step 3
Factor analysis
What to check:
• Reliability test
• Validity test / sample adequacy test
• Communalities
• Loadings
• Variance explained / anti-image variance matrix
• Factors
• Whether the factors are independent
• Multiple regression to find out the significant factors
Reliability test

• Reliability shows the extent to which a scale produces consistent results if measurements are made repeatedly. This is assessed by determining the association between scores obtained from different administrations of the scale.
• If the association is high, the scale yields consistent results and is therefore reliable.
• Cronbach's alpha is the most widely used method; its value varies from 0 to 1.
• A satisfactory value is required to be more than 0.6 for the scale to be reliable (Cronbach, 1951).
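A minimal sketch of Cronbach's alpha computed directly from its definition; the item data is simulated:

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """alpha = k/(k-1) * (1 - sum of item variances / variance of the total score)."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)
    total_variance = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_variances.sum() / total_variance)

# Simulate 100 respondents on 5 items that measure one construct.
rng = np.random.default_rng(1)
latent = rng.normal(size=(100, 1))
items = latent + rng.normal(scale=0.7, size=(100, 5))
print(f"alpha = {cronbach_alpha(items):.2f}")  # above the 0.6 threshold cited here
```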


Sample required

• Kass and Tinsley (1979): recommended 5 to 10 subjects per variable, up to a total of 300.
• Tabachnick and Fidell (1996): it is comforting to have at least 300 cases for a good factor analysis; 100 is poor and 1000 excellent.
• MacCallum (1999): if all communalities are above 0.6, a relatively small sample of less than 100 is perfectly adequate; if they are 0.5 to 0.6, then 100-200 is good; in case of very low communalities (<0.5), 500 is recommended.
• It is clear that a sample size of 300 or more should be suitable for factor analysis.



KMO test (sample adequacy test)
• The Kaiser-Meyer-Olkin statistic measures sampling adequacy.
• It varies between 0 and 1, and values closer to 1 are better.
• Values greater than 0.5 are acceptable; 0.5-0.7 is mediocre, 0.7-0.8 is good, 0.8-0.9 is great, and above 0.9 is superb.



Bartlett's test (significance test)
• This tests the null hypothesis that the correlation matrix is an identity matrix: a matrix in which all of the diagonal elements are 1 and all off-diagonal elements are 0. You want to reject this null hypothesis.
• This test should be significant, i.e. have a significance value less than 0.05.
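A minimal sketch of both adequacy checks in Python, assuming the third-party factor_analyzer package; the simulated DataFrame `df` is a hypothetical stand-in for real survey items.

```python
# KMO and Bartlett's test via factor_analyzer (simulated data).
import numpy as np
import pandas as pd
from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity, calculate_kmo

rng = np.random.default_rng(1)
latent = rng.normal(size=300)
df = pd.DataFrame({f"x{i}": latent + rng.normal(size=300) for i in range(1, 7)})

chi_square, p_value = calculate_bartlett_sphericity(df)  # want p < 0.05
kmo_per_item, kmo_overall = calculate_kmo(df)            # want overall KMO > 0.5
print(f"Bartlett chi2 = {chi_square:.1f} (p = {p_value:.4f}); KMO = {kmo_overall:.3f}")
```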



Initial Eigenvalues
• Eigenvalues are the variances of the factors.
• Because the factor analysis is conducted on the correlation matrix, the variables are standardized, which means that each variable has a variance of 1 (Henry Kaiser), and the total variance is equal to the number of variables used in the analysis.



Factor loading
• A factor loading is basically the correlation coefficient between the variable and the factor.
• The squared loading shows the variance in the variable explained by that particular factor.
• The communality, noted h2, is the sum of squared factor loadings for a variable: the proportion of that variable's variance that can be explained by the factors (e.g., the underlying latent continua).
• Rules of thumb for loadings: 0.30 - significant; 0.40 - more important; 0.50 or greater - considered most significant.
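In symbols (standard notation; λ_ij is the loading of variable i on factor j, with m retained factors):

```latex
h_i^2 \;=\; \sum_{j=1}^{m} \lambda_{ij}^{2}
```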



Assumptions in Factor analysis:

• 1. No outliers: factor analysis assumes that there are no outliers in the data.
• 2. Adequate sample size: the number of cases must be greater than the number of factors.
• 3. No perfect multicollinearity: factor analysis is an interdependency technique; there should not be perfect multicollinearity between the variables.
• 4. Homoscedasticity: since factor analysis is a linear function of the measured variables, it does not require homoscedasticity between the variables.
• 5. Linearity: factor analysis is based on a linearity assumption. Non-linear variables can also be used; after transformation, however, they change into linear variables.
• 6. Interval data: interval data are assumed for factor analysis.



Factor Analysis



Obtaining a Factor Analysis
• Click Analyze and select:
  – Dimension Reduction
  – Factor
• A Factor Analysis box will appear
164
Obtaining a Factor Analysis
• Move variables/scale items to the Variable box

165
Obtaining a Factor Analysis
• Factor extraction: when the variables are in the Variable box, select Extraction

166
Obtaining a Factor Analysis
• When the Factor Extraction box appears, select:
  – Scree Plot
• Keep all other default selections, including:
  – Principal Components Analysis
  – Based on an eigenvalue of 1, and
  – Un-rotated factor solution

167
Obtaining a Factor Analysis
• During factor extraction, keep the factor rotation default of None
• Press Continue

168
Obtaining a Factor Analysis
• During Factor Rotation:
• Decide on the number of factors based on the factor extraction phase, and enter the desired number by choosing Fixed number of factors and entering the desired number of factors to extract.
• Under Rotation choose Varimax
• Press Continue, then OK

169
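A rough Python counterpart to the SPSS steps above, assuming the third-party factor_analyzer package; note that its default extraction method is minres rather than SPSS's principal components, and the data below are simulated stand-ins for real scale items.

```python
# Extraction (eigenvalue > 1) followed by a Varimax rotation.
import numpy as np
import pandas as pd
from factor_analyzer import FactorAnalyzer

rng = np.random.default_rng(2)
f1, f2 = rng.normal(size=(2, 300))
df = pd.DataFrame(
    {**{f"a{i}": f1 + rng.normal(scale=0.7, size=300) for i in range(1, 4)},
     **{f"b{i}": f2 + rng.normal(scale=0.7, size=300) for i in range(1, 4)}}
)

# Step 1 (extraction): unrotated run, inspect eigenvalues, keep those > 1.
fa = FactorAnalyzer(rotation=None)
fa.fit(df)
eigenvalues, _ = fa.get_eigenvalues()
n_factors = int((eigenvalues > 1).sum())

# Step 2 (rotation): re-fit with the chosen number of factors and Varimax.
fa = FactorAnalyzer(n_factors=n_factors, rotation="varimax")
fa.fit(df)
print(f"{n_factors} factors retained")
print(pd.DataFrame(fa.loadings_, index=df.columns).round(2))  # rotated loadings
```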
Cumulative percent of variance explained.

We are looking for an eigenvalue above 1.0.



Component 1        Component 2         Component 3
Expensive          Appeals to Others   Reliable
Exciting           Attractive Looking  Latest Features
Luxury             Trend Setting       Trust
Distinctive
Not Conservative
Not Family
Not Basic

What shall these components be called?

EXCLUSIVE          TRENDY              RELIABLE


Case Study
Trends of electronic purchase
Why is this study important? (Objectives)
• To determine whether the dimensions of consumers' decision-making styles vary from offline to online
• To find the factors of consumers' electronic purchase decision-making style
• To find significant differences on the basis of age/gender
• To find the influencing factors
• To develop a model of unpremeditated consumers
Methodology

• This study follows the survey research methodology.


• Based on previous research in related areas, a questionnaire was constructed, and a 5-point Likert scale was used for measurement.
• After pilot testing on a small group of individuals, the questionnaire was
administered.
• Sample size:411
• Statistical techniques:
1. Factor analysis (EFA)
2. Analysis of Variance analysis(ANOVA)
3. Confirmatory factor analysis (CFA)
4. Structural Equation Modeling (SEM)
Factor analysis approach
Factor Name and Statements: Mean, S.D, Reliability (α), Communalities, Factor Loading
1. Innovative product consciousness: 2.08, .799, .919, -, -
(28.122 percent of variance explained with 9.843 eigenvalue)
IPC1 2.71 1.24 .916 .784 .846
IPC2 2.72 1.26 .916 .789 .842
IPC3 2.72 1.22 .916 .751 .823
IPC4 2.72 1.25 .917 .711 .815
IPC5 2.81 1.35 .915 .772 .812
IPC6 2.77 1.29 .916 .642 .655
IPC7 2.70 1.17 .917 .655 .523
2.21 .962 .912 - -
2. Brand value consciousness
(10.815 percent of variance explained with 3.785 eigen value)
BVC1 2.64 1.22 .917 .727 .827
BVC2 2.68 1.24 .917 .715 .817
BVC3 2.70 1.26 .917 .748 .815
BVC4 2.81 1.32 .917 .671 .794
BVC5 2.62 1.29 .916 .718 .782
BVC6 2.62 1.29 .916 .575 .730
BVC7 2.72 1.23 .918 .565 .686
BVC8 2.53 1.18 .919 .637 .625
2.07 .831 .869 - -
3. Trendy sophisticated
(8.468 percent of variance explained with 2.964 eigen value)
TS1 2.55 1.23 .919 .765 .851
TS2 2.60 1.24 .919 .739 .831
TS3 2.52 1.16 .919 .696 .798
TS4 2.68 1.25 .918 .659 .737
1.93 .880 .844 - -
4. Country of origin
(6.671 percent of variance explained with 2.335 eigen value)
BO1 2.74 1.30 .917 .717 .770
BO2 2.66 1.29 .918 .674 .716
BO3 2.67 1.26 .917 .684 .681
BO4 2.79 1.35 .916 .703 .664
Factor Name and Statements (Cont.): Mean, S.D, Reliability (α), Communalities, Factor Loading
5. Price sensitive consciousness: 1.93, .880, .689, -, -
(3.860 percent of variance explained with 1.351 eigen value)
PSC1 2.59 1.24 .918 .691 .796
PSC2 2.78 1.95 .921 .551 .704
PSC3 2.57 1.31 .918 .617 .696

1.81 .719 .713 - -


6. Unpremeditated consumer
(3.612 percent of variance explained with 1.264 eigen value)
UC1 2.78 1.94 .920 .744 .815
UC2 2.50 1.24 .919 .678 .767
UC3 2.58 1.21 .917 .660 .566

1.64 .675 .718 - -


7. Misperception by over choice
(3.196 percent of variance explained with 1.119 eigen value)
MOC1 2.43 1.26 .919 .712 .767
MOC2 2.52 1.24 .918 .685 .678
MOC3 2.53 1.27 .918 .634 .546

1.56 .529 .560 - -


8. Socially- consciousness
(3.124 percent of variance explained with 1.098 eigen value)
SC1 2.59 1.25 .918 .562 .641
SC2 2.72 1.22 .917 .641 .586
SC3 2.69 1.20 .919 .482 .534

{Kaiser-Meyer-Olkin Measure of Sampling Adequacy=0.880,


Bartlett's Test of Sphericity = Approx. Chi-Square - 8329.315 (p=0.000)}
Figure 1: Conceptual model for unpremeditated consumers' electronic purchasing decision. The e-CDMS construct is surrounded by eight factors: Innovative product consciousness, Brand value consciousness, Trendy sophisticated, Country of origin, Price sensitive consciousness, Unpremeditated consumer, Misperception by over choice, and Socially-consciousness.
Measurement properties of scales of consumer decision-making styles (CDMS) after model fit in CFA
Factor Name and Statements: EFA Reliability (α), EFA Factor Loading; CFA Reliability (α), CFA Factor Loading
1. Innovative product consciousness: EFA α = .919; CFA α = .819
To get variety, I shop different online retail stores and choose my product. .916 .846 .871 .841
Attractive features in an online buying product are very important to me. .916 .842 Eliminated Eliminated
I can keep my product up-to-date with the changing life style through online purchase. .916 .823 Eliminated Eliminated
I usually choose one or more online products which are innovative style .917 .815 .872 .808
In general, I usually try to buy the best quality product. .915 .812 Eliminated Eliminated
My standard and expectations for a product during online buying are very high. .916 .655 .870 .751
Nice department and specialty online stores offer me the best products. .917 .523 Eliminated Eliminated
.912 - .908 -
2. Brand value consciousness
I go to the same online stores each time to shop my brand. .917 .827 .870 .849
I have favourite brands I buy over and over through online. .917 .817 .870 .848
I can get number of branded companies products in a particular online site. .917 .815 .870 .703
Often I wish to purchases the best brand through online .917 .794 .872 .705
I like to focus shop through online the good value for money. .916 .782 .868 .817
I really give much thought or care of my electronic purchase brands .916 .730 Eliminated Eliminated
Once I find a brand online retail sites I like, I stick with it. .918 .686 .873 .651
The most online advertised brands are usually very good choices. .919 .625 .873 .592
.869 - .815 -
3. Trendy sophisticated
The trendiest online products are usually my choice. .919 .851 .877 .852
A product doesn’t have to be perfect, or the best, to satisfy me. .919 .831 .876 .807
It fun to buy something new and fashionable through online. .919 .798 Eliminated Eliminated
Sometimes it’s hard to choose which online stores to shop for perfect choose. .918 .737 Eliminated Eliminated
4. Country of origin .844 - .710 -

The well-known national brands are best for me to buy online. .917 .770 .874 .668
I buy online as much as possible national brand Cont... .918 .716 Eliminated Eliminated
During online buying I prefer the brand relative to country of origin. .917 .681 .873 .778
I make my shopping fast through online purchasing. .916 .664 Eliminated Eliminated

5. Price sensitivity conscious .689 - - -

I buy online after comparing the price with others service provider. .918 .796 Eliminated Eliminated
I carefully watch how much I spend during online buying. .921 .704 Eliminated Eliminated
I take the time to shop online carefully for best buy high price products. .918 .696 Eliminated Eliminated

6.Unpremeditated consumer .713 - .743 -

I should plan my online shopping more carefully than I do. .920 .815 .878 .572
I am impulsive when purchasing online products. .919 .767 .878 .519
I can change my regularly online buying brands .917 .566 .874 .948

7.Misperception by over-choice .718 - .723 -

There are so many brands available online to choose from that I feel confused .919 .767 .874 .548
The more I learn about online products, the harder it seems to choose the best .918 .678 .871 .922
Sometimes I feel hard to choose which online store to shop. .918 .546 Eliminated Eliminated
Convergent validity measurement
Latent Variable and related items: Standardized Factor Loading (>.70)*, Average Variance Extracted (>.50)*, Cronbach (α) (>.70)*

Innovative Product Consciousness: AVE = 0.64, α = .819

To get variety, I shop different online retail stores and choose my product. .841
I usually choose one or more online products which are innovative style. .808
My standard & expectations for a product during online buying are very high .751

Brand Value Consciousness 0.55 .908

I go to the same online stores each time to shop to my brand. .849


I have favourite brands I buy over and over through online. .848
I can get number of branded companies products in a particular online site. .703
Often I wish to purchases the best brand through online. .705
I like to focus shop through online the good value for money. .817
Once I find a brand online retail sites I like, I stick with it. .651
The most online advertised brands are usually very good choices. .592

Trendy Sophisticated 0.69 .815

The trendiest products are usually my choice. .852


A product doesn’t have to be perfect, or the best, to satisfy me .807

Country of origin 0.53 .710

The well-known national brands are best for me to buy online. .668
During online buying I prefer the brand relative to country of origin. .778

Unpremeditated consumer 0.51 .743

I should plan my online shopping more carefully than I do. .572


I am impulsive when purchasing products through online. .519
I can change my regularly online buying brands .948

Misperception by over- choice 0.58 .723

There are so many online brands to choose from that often I feel confused .548
The more I learn about online products, the harder it seems to choose the best .922

Socially- consciousness 0.51 .701

I prefer to buy from online service provider companies that give something back to society .763
I willing to pay extra for product and services to the companies that give back to society .661
* Indicates an acceptable level of reliability or validity. AVE: Average Variance Extracted, computed by summing the squared factor loadings and dividing by the number of items.
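In symbols (standard notation, with n indicators per construct and standardized loadings λ_i):

```latex
\mathrm{AVE} \;=\; \frac{\sum_{i=1}^{n} \lambda_i^{2}}{n}
```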
Correlation of latent variables and Discriminant validity
Factor: Innovative Product | Brand Value | Trendy Sophisticated | Country of Origin | Unpremeditated Consumer | Misperception by Over choice | Socially Consciousness
Innovative Product: (.80)
Brand Value: .389** (.74)
Trendy Sophisticated: .184** .143** (.83)
Country of Origin: .359** .208** .487** (.73)
Unpremeditated Consumer: .449** .224** .155** .295** (.71)
Misperception by Over choice: .468** .203** .387** .708** .379** (.76)
Socially Consciousness: .455** .389** .257** .350** .247** .312** (.71)
**Indicated that value is significant at p < .01

Diagonal in parentheses: square root of average variance from observed variables (items); off-diagonal: correlation between constructs
The measurement model

Figure- 2: Hypothesized structural and measurement model of unpremeditated


consumers’ electronic buying behaviour
Structural Model Results
Path | Estimate | t-Value
Brand value consciousness → unpremeditated behaviour | 0.034 | .339
Country of origin → unpremeditated behaviour | -0.748 | .563
Innovative product consciousness → brand value consciousness | 0.172 | .725
Innovative product consciousness → unpremeditated behaviour | -0.067 | .154
Innovative product consciousness → country of origin | -0.081 | .391
Trendy sophisticated → brand value consciousness | -1.232 | .355
Trendy sophisticated → unpremeditated behaviour | 1.303 | .377
Trendy sophisticated → country of origin | 1.121 | .456
Misperception by over choice → brand value consciousness | -0.019 | .038
Misperception by over choice → unpremeditated behaviour | 2.489 | .959
Misperception by over choice → country of origin | 1.819 | 4.274***
Socially conscious → brand value consciousness | 1.555 | .380
Socially conscious → unpremeditated behaviour | -1.508 | .365
Socially conscious → country of origin | -1.137 | .396


Chi-square 615.283
d.f. 185
NFI 0.874
NNFI 0.884
CFI 0.907
RMSEA 0.075
***α =.001
Findings
Figure: summary of findings.
• Cause-related marketing gives double-type benefits for the e-service provider: an impact on the unplanned behaviour of consumers without an investment benefit.
• Socially-consciousness: branded companies take more advantage, which is a problem for new entrants; it impacts the unplanned behaviour of consumers and the brand-consciousness and sharing of social consumers.
• Trendy-Sophisticated: novelty and innovative-feature products attract consumers more.
Case study-1
• Factors affecting customer satisfaction in a retail mall in Ghaziabad
• Data-1
Bibliographical References
 Almar, E.C. (2000). Statistical Tricks and traps. Los Angeles, CA: Pyrczak Publishing.
 Bluman, A.G. (2008). Elementary Statistics (6th Ed.). New York, NY: McGraw Hill.
 Chatterjee, S., Hadi, A., & Price, B. (2000) Regression analysis by example. New York: Wiley.
 Cohen, J., & Cohen, P. (1983). Applied multiple regression/correlation analysis for the behavioral sciences (2nd
Ed.). Hillsdale, NJ.: Lawrence Erlbaum.
 Darlington, R.B. (1990). Regression and linear models. New York: McGraw-Hill.
 Einspruch, E.L. (2005). An introductory Guide to SPSS for Windows (2nd Ed.). Thousand Oak, CA: Sage
Publications.
 Fox, J. (1997) Applied regression analysis, linear models, and related methods. Thousand Oaks, CA: Sage
Publications.
 Glassnapp, D. R. (1984). Change scores and regression suppressor conditions. Educational and Psychological
Measurement (44), 851-867.
 Glassnapp. D. R., & Poggio, J. (1985). Essentials of Statistical Analysis for the Behavioral Sciences. Columbus,
OH: Charles E. Merril Publishing.
 Grimm, L.G., & Yarnold, P.R. (2000). Reading and understanding Multivariate statistics. Washington DC:
American Psychological Association.
 Hamilton, L.C. (1992) Regression with graphics. Belmont, CA: Wadsworth.
 Hochberg, Y., & Tamhane, A.C. (1987). Multiple Comparisons Procedures. New York: John Wiley.
 Jaeger, R. M. Statistics: A spectator sport (2nd Ed.). Newbury Park, London: Sage Publications.

196
Bibliographical References
• Keppel, G. (1991). Design and Analysis: A researcher’s handbook (3rd Ed.). Englewood Cliffs, NJ: Prentice Hall.
• Marascuilo, L.A., & Serlin, R.C. (1988). Statistical methods for the social and behavioral sciences. New York:
Freeman and Company.
• Maxwell, S.E., & Delaney, H.D. (2000). Designing experiments and analyzing data: A model comparison
perspective. Mahwah, NJ.: Lawrence Erlbaum.
• Norusis, J. M. (1993). SPSS for Windows Base System User’s Guide. Release 6.0. Chicago, IL: SPSS Inc.
• Norusis, J. M. (1993). SPSS for Windows Advanced Statistics. Release 6.0. Chicago, IL: SPSS Inc.
• Norusis, J. M. (2006). SPSS Statistics 15.0 Guide to Data Analysis. Upper Saddle River, NJ.: Prentice Hall.
• Norusis, J. M. (2008). SPSS Statistics 17.0 Guide to Data Analysis. Upper Saddle River, NJ.: Prentice Hall.
• Norusis, J. M. (2008). SPSS Statistics 17.0 Statistical Procedures Companion. Upper Saddle River, NJ.: Prentice
Hall.
• Norusis, J. M. (2008). SPSS Statistics 17.0 Advanced Statistical Procedures Companion. Upper Saddle River, NJ.:
Prentice Hall.
• Pedhazur, E.J. (1997). Multiple regression in behavioral research, third edition. New York: Harcourt Brace
College Publishers.

197
Bibliographical References
• SPSS Base 7.0 Application Guide (1996). Chicago, IL: SPSS Inc.
• SPSS Base 7.5 For Windows User’s Guide (1996). Chicago, IL: SPSS Inc.
• SPSS Base 8.0 Application Guide (1998). Chicago, IL: SPSS Inc.
• SPSS Base 8.0 Syntax Reference Guide (1998). Chicago, IL: SPSS Inc.
• SPSS Base 9.0 User’s Guide (1999). Chicago, IL: SPSS Inc.
• SPSS Base 10.0 Application Guide (1999). Chicago, IL: SPSS Inc.
• SPSS Interactive graphics (1999). Chicago, IL: SPSS Inc.
• SPSS Regression Models 11.0 (2001). Chicago, IL: SPSS Inc.
• SPSS Advanced Models 11.5 (2002) Chicago, IL: SPSS Inc.
• SPSS Base 11.5 User’s Guide (2002). Chicago, IL: SPSS Inc.
• SPSS Base 12.0 User’s Guide (2003). Chicago, IL: SPSS Inc.
• SPSS 13.0 Base User’s Guide (2004). Chicago, IL: SPSS Inc.
• SPSS Base 14.0 User’s Guide (2005). Chicago, IL: SPSS Inc..
• SPSS Base 15.0 User’s Guide (2007). Chicago, IL: SPSS Inc.
• SPSS Base 16.0 User’s Guide (2007). Chicago, IL: SPSS Inc.
• SPSS Statistics Base 17.0 User’s Guide (2007). Chicago, IL: SPSS Inc.
• Tabachnik, B.G., & Fidell, L.S. (2001). Using multivariate statistics (4th Ed). Boston, MA: Allyn and Bacon.

198
Confirmatory factor analysis
The exploratory factor model
Figure: path diagram of the exploratory factor model. Each observed variable x1 ... x8 loads on every factor ξ1 ... ξq, with measurement errors δ1 ... δ8 and error variances θδ11 ... θδ88.
Why Confirmatory factor analysis?

 Confirm the factor structure
 You specify a model, indicating which variables load on which factors and which factors are correlated
 No (zero) cross-loadings
 A model is constructed in advance
 The number of latent variables is set by the analyst, and whether a latent variable influences an observed variable is specified
 Measurement errors may correlate, the covariances of latent variables can be estimated or set to any value, and parameter identification is required
Why Confirmatory factor analysis?

Figure: path diagram of the confirmatory factor model. x1 ... x4 load only on ξ1 and x5 ... x8 load only on ξ2 (loadings λ11 ... λ82), with factor covariance φ21, measurement errors δ1 ... δ8 and error variances θδ11 ... θδ88.
Why Confirmatory factor analysis ?

(Source: http://en.wikipedia.org/wiki/Confirmatory_factor_analysis)
Evaluating Model Fit

1. Absolute fit indices


 Chi-squared test
 The root mean square error of approximation (RMSEA)
 Root mean square residual (RMR) and standardized root mean square
residual (SRMR)
 Goodness of fit index (GFI) and adjusted goodness of fit index(AGFI)
2. Relative fit indices
 Normed fit index (NFI) and Non-normed fit index (NNFI)
3. Comparative fit index
 Comparative fit index (CFI)
Some Thumb Rules for Model Fit

• Note: For more details please go through study materials (CFA & SEM) ,Specially paper 3 in SEM

(Source: http://en.wikipedia.org/wiki/Confirmatory_factor_analysis)
Confirmatory factor analysis

• CFA Model fit

Criteria | p | χ2/df | AGFI | NFI | RFI | CFI | RMSEA | Overall
Recommended value | <0.05 | between 1 and 5 | >0.9 | >0.9 | >0.9 | >0.9 | <0.05 | -
Actual value | .000 | 3.521 | .898 | .872 | .909 | .885 | 0.032 | Good fit
References

 Hair, J., Black, W., Babin, B., and Anderson, R. (2010). Multivariate data analysis (7th ed.):
Prentice-Hall, Inc. Upper Saddle River, NJ, USA.

 Barbara M. Byrne (2010). Structural Equation Modeling with Amos: Basic Concepts,
Applications and Programming (2nd ed.).

 Bollen, Kenneth A (1998). Structural equation models, Wiley Online Library.

 Field, Andy (2009). Discovering statistics using SPSS, Sage Publications Limited.
Let us do an exercise, my dear friends
Interactive Graphical
Data Analysis through
Tableau Software
Dr. Manoj Kumar Dash
M.A;M.Phil;M.B.A;NET; Ph.D
ABV-Indian Institute of Information Technology and Management, Gwalior
( An Autonomous institute of Government of India)
Agenda for Discussion
• Tableau Software - 30 Min
• Tableau Software Demonstration - 20 Min
• Visualization Software - 10 Min
• Data Visualization - 10 Min
• Big Data - 10 Min
• Question and Answer - 10 Min
Data Visualization
A graphical, animation, or video presentation
of data and the results of data analysis
– The ability to quickly identify important trends in
corporate and market data can provide
competitive advantage
– Check their magnitude of trends by using
predictive models that provide significant business
advantages in applications that drive content,
transactions, or processes
01. Dygraphs
02. ZingChart
03. InstantAtlas
04. Timeline
05. Exhibit
06. Modest Maps
07. Leaflet
08. WolframAlpha
09. Visual.ly
10. Visualize Free
11. Better World Flux
12. FusionCharts
13. jqPlot
14. Highcharts
15. iCharts
Open Source Software
Data Visualization
• D3.js • Plotly • FusionCharts (http://www.fusioncharts.com) • ChartBlocks • Chart.js • Flot • Google Charts • Raphaël • Highcharts • Visual.ly • Leaflet • Crossfilter • Dygraphs • Datawrapper • Tangle • Polymaps • Tableau • Raw • Kartograph • Timeline JS • CartoDB • Infogram • NodeBox • Ggobi • Xmdv • Weka • Gephi
GARTNER MAGIC QUADRANT
FOR BI
Contents to cover
• Step-1 Tableau Introduction
• Step-2 Connecting to Data
• Step-3 Building basic views
• Step-4 Data manipulations and Calculated
fields
• Step-5 Tableau Dashboards
• Step-6 Advanced Data Options
• Step-7 Advanced graph Options
What is Tableau
• Tableau is a rapid BI software
• Great visualizations: Allows anyone to
connect to data, visualize and create
interactive, sharable dashboards in a few clicks
• Ease of use: It's easy enough that any Excel
user can learn it, but powerful enough to
satisfy even the most complex analytical
problems.
• Fast: We can create parallelized dashboards,
quick filters and calculations
What is Tableau?

Tableau enables users to quickly see and


understand their data by automatically turning
data into pictures
TABLEAU TIMELINE
Figure: Tableau milestone timeline, 1991 to 2015: three professors (Chris, Pat, Chabot) in Stanford University start research to build a visual tool (1991); Desktop 1.0 launches (2005); named "Product of the Year" by PC Magazine; 4th, 8th and 20th consecutive quarters of record growth; 1,000 customer accounts added; #400 on the Inc 500; #132 on the Deloitte Technology Fast 500; launches of Desktop & Server 5.0 and Tableau 7.1, 7.2, 7.3, 8.0, 8.1, 8.2, 8.3, and 9.0.1 to 9.0.5.
Tableau's History
See, Understand and Share Data
Figure: history timeline (1997-2003, 2004, 2005-2008, 2009-2011, Future): VizQL™; Release 1.0; Tableau Desktop and Tableau Server; Release 3.1; 3,000+ customers; DARPA; 50+ federal agencies including the Air Force; Deloitte Fast 500 #115; Forrester Report ROI 127%; ~7,500 successful customers; Gartner Magic Quadrant; fastest-growing BI vendor; Tableau Mobile platform; Tableau Public cloud with ~7.5M users.


Tableau Products
• Tableau Desktop (Create): ad hoc analytics, dashboards, reports, graphs; explore, visualize, and analyze your data; create dashboards to consolidate multiple views; deliver interactive data experiences.
• Tableau Server (Share - Web): a business intelligence solution that scales to organizations of all sizes; share visual analytics with anyone with a web browser; publish interactive analytics or dashboards; secure information and manage metadata; collaborate with others.
• Tableau Reader (Share - Local): share visualizations & dashboards on the desktop; filter, sort, and page through the views; "Acrobat for Data"; free download.
• Tableau Public (Share - Everyone): create and publish interactive visualizations and dashboards; embed in websites and blogs; free download and free hosting service.
Market Expectations: Strength of Tableau
• Fast • Easy • Cost Effective • Secure • Everyone • Everywhere


Opening Screen
Basics
• Opening a new sheet
– File>>New
• Connect to data
– Data>>Connect to data
Typical Data

Dimensions Measures
Lab
• Start Tableau
• Open a new workbook
• Add one additional sheet
• Identify data connection tab
• Can we connect to MySQL server?
• Can we connect to txt file?
• How to go back to workbook from connect to data window?
• Add a new dashboard
• Where are the various graph-type options available?
• Can we draw pie chart using Tableau?
Tableau Repository
• The Tableau repository holds workbooks, bookmarks, and data sources.
• It is located in a folder called My Tableau Repository inside your My Documents folder.

Tableau Files and File types
• Workbooks
– Tableau workbook files have the .twb file extension and are marked with the
workbook icon. Workbooks hold one or more worksheets and dashboards.
• Bookmarks
– Tableau bookmark files have the .tbm file extension and are marked with the
bookmark icon. Bookmarks contain a single worksheet and are an easy way to
quickly share your work.
• Packaged Workbooks
– Tableau packaged workbooks have the .twbx file extension and are marked
with the packaged workbook icon. Packaged workbooks contain a workbook
along with any supporting local file data sources and background images. This
format is the best way to package your work for sharing with others who don’t
have access to the data.
Tableau Files and File types
• Data Extract Files
– Tableau data extract files have the .tde file extension and
are marked with the extract icon. Extract files are a local
copy of a subset or entire data source that you can use to
share data, work offline, and improve database
performance.
• Data Connection Files
– Tableau data connection files have the .tds file extension
and are marked with the data connection icon. Data
connection files are shortcuts for quickly connecting to
data sources that you use often.
Step2-Connecting to Data
Demo: Connecting to Desktop files

• Connecting to excel file


– Connecting to superstore sales in sales data folder
• Snapshot view of the data
• Connecting to txt file
– Connecting to survey data
• Connecting to access file
– Connecting to survey data
Data Types

• Sometimes Tableau may identify a field with a data type that is


incorrect.
• For example, a field that contains dates may be identified as an
integer rather than a date.
• You can change the data type in Tableau by right-clicking the field in
the Data window, selecting Change Data Type, and then selecting
the appropriate data type.
Navigating the Tableau work area
Demonstration
Demo: Basic views
• Superstore data
– Sum of order quantity by product category
– Sum of order quantity by month & year
– Changing the graph type
– Price web data day wise count of products
• Import market one data
– Draw a time series chart of number of campaigns by date
– Draw a time series chart of number of campaigns by
month
– Draw a time series chart for budget and identify the month
with maximum budget
– Adding reference line
Demo: GIS graph
• City wise bill on the map
– Count if accounts as size
– Total Bill as colour
• Connect to Sales_by_country_v1.csv(inside super store folder)
– Show number of units sold for each county
– Draw a fill map fill graph
• Connect to world bank data(data by country tab)
– Create a GIS graph to show GDP by country
– Create a GIS graph to show total population by country
Step-4 : Data manipulations and
Calculated fields
Contents
• Calculated fields
• Working with dates
• Logic statements
• Working with filters
Demo: Calculated fields
• Connect to excel>>Sample-Superstore
data>>>orders
Demo: Calculated fields
• New reduced shipping cost to 50%
Working with dates

Select any of the date function


to manipulate dates and use
them in the expression.
Demo: Working with dates
• Delay indicator in Super store data [Ship
Date]-[Order Date]
Working with filters
Dashboards
Step-5
Tableau Dashboards
Practice : Contents
Section 1 – Connecting to Data – Introduction
• 1) Connect to an Excel data source
2) Web Sources
3) MS Access
4) Creating Joins
5) Data Extracts
Section 2 – Basic Training

1) Excel
2) Data Roles – Dimensions and Measures
3) Data Window and Right-click Options
4) Excel Visual With Modified Defaults
5) Tabular View
6) Show Me
7) Formatting Text Visuals
8) Geo-Coding
9) Geo-Coding Filled Maps
10) Scatter Charts
11) Time Series – Trend Lines
12) Visual Filtering
13) Sorting
14) Filtering
15) Map Filters
16) Percentages and Totals
17) First Dashboard
Section 3 – Intermediate
1) Data Roles and Options
2) Changing Data Roles
3) Maps
4) Dates and Times
5) Dates with Calculations –Gantt Chart
6) Grouping
7) Bins and Histograms
8) Sets within Scatter Charts
9) Concatenated Sets
10) Sorting with Sets
11) Quick Table Calculations
12) Secondary Table Calculations
13) Create Calculated Fields
14) Histogram with Running Totals
15) Trend Lines
16) Reference Lines
17) Performance
Section 4 – Advanced Features

1) Combination Charts
2) Trends and Motions
3) Data Blending
4) Parameters
5) Shipping Parameters
6) Area Charts
Section 5 – Masters
1) Heat Maps
2) Box Plots
3) Pareto Or the 20 – 80 Rule
4) Bullet Chart
5) Bar In A Bar
6) Standard Deviations
7) Reference Lines with Banding
8) Groups and Sets
9) Dashboard
Section 6 – Dashboards and
Guided Analytics

A complete lesson on building out a


dashboard with Actions and Filters from
scratch
Section 7 – Calculations
1) Conditional Filters
2) Calculated Fields
Section 8 – Collaboration

1) PDF 2) PowerPoint 3) Export


Images 4) Tableau Reader and Outlook
Thanks
DR. MANOJ KUMAR DASH
9981380256
MANOJCTP@GMAIL.COM
INDIAN INSTITUTE OF INFORMATION TECHNOLOGY AND
MANAGEMENT GWALIOR
Survival Analysis and
Customer Life Time Value
Survival Function Graph Produced
by SPSS
The proportion of the cohort that has survived (still enrolled) at any
term
• Each step of the curve represents an event.
• There is a 90% probability of surviving to the end of the 10th term.
• Surviving = remaining enrolled!
Example of Survival Probability Graph

http://wpfau.blogspot.com/2011/08/safe-withdrawal-rates-and-life.html
One Minus survival function

• There is a 10% probability of not surviving to the end of the 10th term.
• Not surviving = graduating!!
Survival analysis

 What is survival analysis?
− Event history analysis
− Time series analysis
 When to use survival analysis
− When the research interest is about time-to-event occurrence and the event is discrete.
 Examples of survival analysis
− Duration to the hazard of death
− Adoption of an innovation in diffusion research
− Marriage duration
 Characteristics of survival analysis
− At any time point, events may occur.
− Factors that influence events are of two types: time-constant and time-dependent (e.g. age).
Survival analysis

Survival analysis focuses on hazard function


− Hazard: the event of interest occurring
− Hazard might be death, engine breakdown, adoption of
innovation, etc.
− Hazard rate: is the instantaneous probability of the given
event occurring at any point in time. It can be plotted against
time on the X axis, forming a graph of the hazard rate over
time.
− Hazard function: the equation that describe this plotted line is
the hazard function.
− Hazard ratio: also called relative risk: Exp(B) in SPSS.
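In standard notation (T is the time to the event; B is the Cox coefficient reported by SPSS):

```latex
S(t) = \Pr(T > t), \qquad
h(t) = \lim_{\Delta t \to 0} \frac{\Pr(t \le T < t+\Delta t \mid T \ge t)}{\Delta t}, \qquad
\mathrm{HR} = \frac{h_1(t)}{h_0(t)} = \exp(B)
```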
Survival analysis

Type of survival analysis


− Nonparametric: no assumption about the shape of hazard
function. Hazard function is estimated based on empirical data,
showing change over time, for example, Kaplan-Meier survival
analysis.
− Semi-parametric: no assumption about the shape of hazard
function, but make assumption about how covariates affect
the hazard function, for example: Cox regression
− Parametric: specify the shape of baseline hazard function and
covariates effects on hazard function in advance.
− Maximum likelihood method
− Used when time is itself considered a meaningful independent
variable.
− Used for predictive modeling
− Software: Stata
Survival analysis

Terms
− Events: what terminates an episode (such as churn, adoption
of an innovation), it is the change which causes the subject to
transition from one state to another.
− Durations: the number of time units an individual spends in a
given state.
− Dependent: probability of an event.
− Survival function, S(t): the cumulative proportion of the sample NOT experiencing the event by time t. In other words, it is the probability that the event will NOT occur until time t.
− Censored cases: data are censored if events start before (left-
censored) or ended after (right-censored) the period of
observation.
Survival analysis

Censored cases
Survival analysis

Censored cases: unique characteristics of survival


analysis.
− For some cases, the event simply doesn’t occur before the end
of study.
− For some cases, they drop out from the study for reasons
unrelated to the study.
− For some cases, we lost track of their status sometime before
the end of the study.
Survival analysis

Outline of topics
− Life tables
− Kaplan-Meier
− Cox regression
Life tables

Life Tables is a descriptive procedure for examining the distribution of time-to-event variables. We can also compare the distribution by levels of a factor variable.
The basic idea of life tables is to subdivide the period of observation into smaller time intervals. The probability of the event is then estimated for each of the intervals.
Life tables

Variables
− Time variable (duration variable): must be a continuous
variable.
− Status variable: binary or categorical variable, represents the
event of interest.
− Factor variable: categorical variable.
Assumptions
− The probability for the event of interest should depend only on time. Cases that enter the study at different times should behave similarly.
− No systematic differences between censored and uncensored cases.
Life tables

Example (from IBM SPSS 20.0): data file name: telco


− Examine distribution of customer time to churn by customer
category.
− Time variable: tenure (in month)
− Status variable: churn (binary: 1 = Churn, 0 = Not churn)
− Factor: custcat (four categories)
Go to Analyze > Survival > Life Tables
Life tables

Run analysis
Life tables

Click Options

1. Survival: display the


cumulative survival
function on a linear scale
2. Hazard: display the
cumulative hazard
function on a linear
scale.
Life tables

SPSS Outputs: life table


Life tables

SPSS Outputs: life table


− Cumulative Proportion Surviving at End of Interval. The
proportion of cases surviving from the start of the table to the
end of the interval (266-10-17)/266=0.8984 (second row).
− Probability Density. An estimate of the probability of
experiencing the terminal event during the interval.
− Hazard Rate. An estimate of experiencing the terminal event
during the interval, conditional upon surviving to the start of
the interval.
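A minimal life-table-style sketch in Python, assuming the third-party lifelines package; `tenure` and `churn` are simulated stand-ins for the telco fields used in the SPSS example.

```python
# Build an event table (counts per time point) from durations and statuses.
import numpy as np
from lifelines.utils import survival_table_from_events

rng = np.random.default_rng(3)
tenure = rng.exponential(scale=24, size=500).round() + 1  # months of tenure
churn = rng.random(500) < 0.7                             # True = churned, False = censored

table = survival_table_from_events(tenure, churn)
# Columns: removed, observed (events), censored, entrance, at_risk per interval.
print(table.head(10))
```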
Life tables

SPSS Outputs: life table


− The greatest number and proportion of terminal events
occur
within the first year, which suggests that customers should be
monitored more closely during their first year to be sure of
their satisfaction with the company's service.
Life tables

SPSS Outputs

1.Wilcoxon test is used to compare


survival distribution among groups,
with the test statistic based on
differences in group mean scores.
2. Since the significance value of
the test is less than 0.05, we
conclude that the survival curves
are different across the group.
3. Pairwise comparisons show
which two groups are
significantly different in survival
curves.
SURVIVAL ANALYSIS : EXAMPLE
 Data required for analysis:

 Clearly defined event: (death, onset of illness, recovery from illness, marriage,
birth, mechanical failure, success, job loss, employment, graduation).
 Terminal event

 Event status (1 = event occurred, 0 = event did not occur)

 Time variable = Time measured from the entry of a subject into the study until the
defined event. Months, terms, days, years, seconds.

 Covariates:
 To determine if different groups have different survival times

 Gender, age, ethnicity, GPA, treatment, intervention

 Regression models
Survival analysis – SPSS Data layout
Basic student data
• Time variable – terms enrolled
• Event status – graduation status (the censored indicator)
• Gender is a binary/dummy variable; 1st-term GPA is grouped into categories

Student | terms_enrolled | graduate_status | gender | 1st_term_gpa
Student 1 | 5 | 0 | 1 | 3.4
Student 2 | 9 | 1 | 0 | 4.0
Student 3 | 14 | 0 | 1 | 2.9
Student 4 | 7 | 1 | 1 | 3.9
Student 5 | 8 | 1 | 0 | 3.1
Cohort Description

• Undergraduates, one division


• Fall 2006, Fall 2007 entering freshmen, N = 884
• Respondents to 2008 UCUES* survey
• Freshmen admits (transfers excluded)
• 1st term gpa >= 3.0
• Censored = 10 or 1.1%
• Explanatory variables available: gender, URM status, domestic-
foreign status, Pell Grant recipient status, hours worked (survey),
double/triple major

* UCUES = University of California Undergraduate Survey


Survival Analysis – SPSS
SPSS
• Analyze
• Survival
• Life Tables
Sample Data – Working in SPSS
SPSS
• Analyze
• Survival
• Life Tables
Survival Analysis – Life Table produced by SPSS
The primary output of the survival analysis procedure.
• Intervals = terms; the count is from the admit term.
• Count of still-enrolled students at the start of each term.
Survival Analysis – Life Table produced by SPSS
The primary output of the survival analysis procedure.
• # exposed to risk = # entering the interval minus ½ of the censored cases; # withdrawing during interval = censored.
• # terminal events = # graduated.
• Probability Density = estimated probability of graduating in the interval.
• Proportion Terminating = # terminal events ÷ # exposed to risk; example, Term 10: 38 ÷ 829.5 = .05.
• Proportion Surviving = 1 − proportion terminating.
• Cumul. Surviving = cumulative % of those surviving at the end of the interval; example, Term 10: (829.5 − 38) ÷ 884 ≈ 0.90.
• Hazard Rate = instantaneous failure rate: the % chance of graduating given not having graduated at the start of the interval.
Survival Function Graph Produced
by SPSS
The proportion of the cohort that has survived (still enrolled) at any term

• Each step of the curve represents an event.
• There is a 90% probability of surviving to the end of the 10th term.
• Surviving = remaining enrolled!
One Minus survival function

• There is a 10% probability of not surviving to the end of the 10th term.
• Not surviving = graduating!!
Survival Analysis: SPSS, with Covariate
Factor = Gender

SPSS
• Analyze
• Survival
• Life Tables

SURVIVAL TABLE=Terms_enrolled BY Gender(1 2)


/INTERVAL=THRU 15 BY 1
/STATUS=graduated(1)
/PRINT=TABLE
/PLOTS (SURVIVAL OMS)=Terms_enrolled BY Gender.
Survival Analysis – SPSS, Life Table by gender
• Hazard Rate = instantaneous failure rate: the % chance of graduating given not having graduated at the start of the interval.
• Median Survival Time = the time at which 50% of the original cohort have not survived (graduated).
Life Table – Hazard Rate Column; Survival Analysis: Hazard Ratio
• Hazard Ratio = ratio of the hazard rates.
• At the 12th term, Hazard ratio = 1.63 / 1.41 = 1.16: females are 16% more likely to graduate in the 12th term than males.
• At the 13th term, Hazard ratio = .41 / .62 = .66: females are 34% less likely to graduate in the 13th term than males.

Gender | Interval Start Time | Number Entering Interval | Number of Terminal Events | Hazard Rate
Female | 0-3 | 586 | 0 | .00
Female | 4 | 585 | 0 | .00
Female | 5-6 | 584 | 0 | .00
Female | 7-8 | 583 | 0 | .00
Female | 9 | 583 | 38 | .07
Female | 10 | 545 | 22 | .04
Female | 11 | 523 | 73 | .15
Female | 12 | 450 | 404 | 1.63
Female | 13 | 46 | 15 | .41
Female | 14 | 28 | 11 | .49
Female | 15 | 17 | 17 | .00
Male | 0-5 | 298 | 0 | .00
Male | 6 | 298 | 1 | .00
Male | 7 | 296 | 0 | .00
Male | 8 | 296 | 1 | .00
Male | 9 | 295 | 10 | .03
Male | 10 | 285 | 16 | .06
Male | 11 | 268 | 46 | .19
Male | 12 | 222 | 183 | 1.41
Male | 13 | 38 | 18 | .62
Male | 14 | 20 | 6 | .36
Male | 15 | 13 | 13 | .00
Survival functions - SPSS
Factor = gender

Survival Pattern: SPSS will produce a different colored line for each of the
factor’s values
Second Approach
Kaplan-Meier Estimator
• The Kaplan-Meier estimator, independently
described by Edward Kaplan and Paul Meier
and conjointly published in 1958 in the
Journal of the American Statistical Association,
is a non-parametric statistic that allows us to
estimate the survival function.
Kaplan-Meier procedure

The Kaplan-Meier procedure is a method of estimating


time-to-event models in the presence of censored
cases.
A descriptive procedure for examining the distribution
of time-to-event variables. We also can compare the
distribution by levels of a factor variable or produce
separate analyses by levels of a stratification variable.
Censored cases (right-censored cases) are those for
which the event of interest has not yet happened.
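A minimal Kaplan-Meier sketch in Python, assuming the third-party lifelines package; the two simulated groups are hypothetical stand-ins for any two-group comparison (for example, new drug versus old drug).

```python
# Fit a Kaplan-Meier curve and compare two groups with a log-rank test.
import numpy as np
from lifelines import KaplanMeierFitter
from lifelines.statistics import logrank_test

rng = np.random.default_rng(4)
t_a, t_b = rng.exponential(20, 100), rng.exponential(28, 100)  # times to event
e_a, e_b = np.ones(100), np.ones(100)                          # 1 = event observed

kmf = KaplanMeierFitter()
kmf.fit(t_a, e_a, label="group A")
print(kmf.survival_function_.tail())  # estimated S(t), stepping down at events

result = logrank_test(t_a, t_b, event_observed_A=e_a, event_observed_B=e_b)
print(f"log-rank p = {result.p_value:.4f}")  # p < 0.05 -> curves differ
```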
Kaplan-Meier procedure

Assumptions
− Probabilities for the event of interest should depend only on
time after the initial event without covariates effects.
− Cases that enter the study at different times (for example,
patients who begin treatment at different times) should
behave similarly.
− Censored and uncensored cases behave the same. If, for
example, many of the censored cases are patients with more
serious conditions, your results may be biased.
Survival Analysis: Kaplan-meier Method

Assumptions
Censored individual – student who has not experienced the
event (graduated) by the end of the study, e.g. they are no
longer enrolled
 Check for differences between censored and non-censored
groups

Cohorts should behave similarly – groups entering at different


times should be similar

Avoid “selection bias” in data


SURVIVAL FUNCTIONS – SPSS,
KAPLAN_MEIER
FACTOR = GENDER

KM Terms_enrolled BY
Gender
/STATUS=graduated(1)
/PRINT TABLE MEAN
/PLOT SURVIVAL
/TEST LOGRANK BRESLOW
TARONE
/COMPARE OVERALL
POOLED.
Kaplan-Meier Results – Gender

Null Hypothesis: Female Curve = Male Curve


Kaplan-Meier output

• Log Rank weights all graduations equally.
• Breslow gives more weight to earlier graduations.
• Tarone-Ware is a mixture of the two.
Kaplan-Meier Results – Gender

Null Hypothesis: Female Curve = Male Curve

Curves not
significantly
different at p < .05
Kaplan-Meier procedure

Example (from IBM SPSS 20.0): data file: pain_medication
− A pharmaceutical company is developing an anti-inflammatory medication for treating chronic arthritic pain. The research interest is the time it takes for the drug to take effect and how it compares to an existing drug. Shorter times to effect are considered better.
Event: drug takes effect
Kaplan-Meier procedure

Variables
− Time variable (duration variable): must be a continuous
variable
− Status variable: categorical or continuous variable, represents
the event of interest (drug has effect or not).
− Factor variable: categorical variable, represents a causal effect
(type of treatment for example).
− Stratification variable: categorical variable.
Kaplan-Meier procedure

We have Factor variable: Treatment (0 = New drug, 1 =


old drug), Status variable: status ( 0 = censored, 1 = take
effect), Time variable: time
We want to compare the effect of two different drugs.
Null hypothesis: whether survival function is the same
between different levels of the factor variable (between
old and new drug) .
Kaplan-Meier procedure

Go to Analyze > Survival > Kaplan-Meier


Kaplan-Meier procedure

Analyze data
Kaplan-Meier procedure

Click Compare Factor

Log rank: Tests equality of survival functions by weighting all time points the
same.
Breslow: Tests equality of survival functions by weighting all time points by the
number of cases at risk at each time point.
Tarone-Ware: Tests equality of survival functions by weighting all time points by
the square root of the number of cases at risk at each time point.
Kaplan-Meier procedure

Compare factor

Pooled over strata: a single test is computed for all factor levels, testing for
equality of survival function across all levels of the factor variable.
Pairwise over strata: a separate test is computed for each pair of factor levels
when a pooled test shows non-equality of survival functions.
For each stratum: a separate test is computed for group formed by the
stratification variable.
Pairwise for each stratum: a separate test is computed for each pair of factor
variable, for each stratum of the stratification variable.
Kaplan-Meier procedure

Click Options
Kaplan-Meier procedure

Overall comparison

This table provides overall tests of the equality of


survival times across groups. Since the significance
values of the tests are all greater than 0.05, there is no
statistically significant difference between two
treatments in survival time.
COX REGRESSION (PROPORTIONAL HAZARDS)

• Measures influence of explanatory variables

• Most used Survival analysis method

• Only time independent variables are appropriate

• Assumptions: Hazards are proportional


Cox regression

The Cox Regression procedure is useful for modeling the


time to a specified event, based upon the values of
given covariates.
One or more covariates are used to predict a status
(event).
The central statistical output is the hazard ratio.
Data contain censored and uncensored cases.
Similar to logistic regression, but Cox regression
assesses relationship between survival time and
covariates .
Cox regression

Terms
− Status variable: the dependent in Cox regression, should be
binary variable.
− Time variable: measures duration to the event defined by the
status variable (continuous or discrete).
− Covariates: independent/predictor variables. They can be
categorical or continuous. They also can be time-fixed or time-
dependent.
− Interaction terms
− Categorical covariates: SPSS automatically converts them into a set of dummy variables, omitting one category.
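A minimal Cox proportional-hazards sketch in Python, assuming the third-party lifelines package; the DataFrame columns are simulated stand-ins for the telco fields (tenure, churn, covariates).

```python
# Fit a Cox model; exp(coef) plays the role of Exp(B) in the SPSS output.
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(5)
n = 500
df = pd.DataFrame({
    "tenure": rng.exponential(24, n).round() + 1,  # duration (months)
    "churn": (rng.random(n) < 0.7).astype(int),    # status: 1 = event, 0 = censored
    "age": rng.integers(18, 70, n),
    "marital": rng.integers(0, 2, n),              # binary covariate
    "address": rng.integers(0, 30, n),             # years at same address
})

cph = CoxPHFitter()
cph.fit(df, duration_col="tenure", event_col="churn")
cph.print_summary()  # exp(coef) is the hazard ratio, Exp(B) in SPSS
```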
Cox Regression: checking the proportional hazards assumption
SPSS: Analyze > Survival > Cox Regression
• Repeat for each factor!
Cox Regression:
Use log minus log function to check
Proportional Hazards Assumption

Do not use Cox


Regression if the
curves cross. This
means the hazards are
not proportional.
Cox Regression Model – Example, Gender

 SPSS
• Analyze
• Survival
• Cox Regression
• (move gender to
Covariates box)
COX REGRESSION MODEL RESULTS: EXAMPLE, GENDER
Interpretation of SPSS Cox
Regression Results:
• The reference category is
female because I made that
choice for this model
• It is not statistically significant
at p < 0.05 that females and
males have different survival
curves

Exp(B) = Hazard ratio: Female vs. Male. The null hypothesis is that this ratio = 1.

Hazard Ratio = e^B = e^(-0.04) ≈ 0.961


SUMMARY
Survival Analysis provides the following:

• Handles both censored data and a time variable
• Life table
• Graphical representation of trends
• Kaplan-Meier survival function estimator
• Survival comparison between 2 or more groups (a p value is produced that indicates whether the difference between curves is significant or not)
• Regression models: relationships between variables and survival times
Descriptive power of survival analysis :
Terms Enrolled by 1st Term GPA – Using Survival Graph (K-M) to
display data

At end of 12th term:


~ 34% probability of
continued enrollment

~ 9% probability of
continued enrollment
Cox regression

Example (from SPSS): data file: telco


− Use Cox Regression to determine which attributes are
associated with shorter "time to churn“.
− Time variable: tenure (month with services)
− Status variable: churn (0 = No, 1 = Yes).
− Covariates: age, marital, address, employ, retire, gender,
reside, and custcat
Go to Analyze > Survival > Cox Regression
Cox regression

Run Cox regression


Cox regression

Click Categorical
Cox regression

Click Plots
Cox regression

Click Options
Cox regression

SPSS Outputs
Cox regression

SPSS Outputs
Cox regression

SPSS Outputs: variables in the equation


− Exp(B), which can be interpreted as the predicted change in
the hazard for a unit increase in the predictor.
− For binary covariates, hazard ratio is the estimate of the ratio
of the hazard rate in one group to the hazard rate in another
group.
− The value of Exp(B) for marital means that the churn hazard
for an unmarried customer is 1.395 times that of a married
customer.
− The value of Exp(B) for address means that the churn hazard is
reduced by 100%−(100%×0.943)=5.7% for each year a
customer has lived at the same address.
Cox regression

SPSS Outputs

This table displays the average


value of each predictor variable,
plus a pattern for each level of
custcat. The four patterns
correspond to each of the
customer types, each with
otherwise "average" predictor
values.
REFERENCES
Dunn, S. (2002). Kaplan-Meier Survival Probability Estimates. Retrieved from
http://vassarstats.net/survival.html

Harris, S. (2009). Additional Regression techniques, October 2009, Retrieved from


http://www.edshare.soton.ac.uk/id/document/9437

Newell, J. & Hyun, S. (2011). Survival Probabilities With and Without the Use of
Censored Failure Times Retrieved from
https://www.uscupstate.edu/uploadedFiles/Academics/Undergraduate_Research/Reseach_
Journal/2011_007_ARTICLE_NEWELL_HYUN.pdf

Singh, R., Mukhopadhyay, K. (2011). Survival analysis in clinical trials: Basics and must
know areas, Retrieved from http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3227332/t

Wiorkowski, J., Moses, A., & Redlinger, L. (2014).The Use of Survival Analysis to
Compare Student Cohort Data, Presented at the 2014 Conference of the Association of
Institutional Research
Conjoint Analysis

Dr Manoj Kumar Dash

CONJOINT ANALYSIS

• The main objective of conjoint analysis is to find out the attributes


of the product that a respondent prefers most.
• The word ‘conjoint’ refers to the notion that relative value of any
phenomenon (product in most of the cases) can be measured
jointly, which may not be measured when taken individually.
• People tend to be better at giving well-ordered preferences when
evaluating options together (conjointly) rather than in isolation;
this method relieves a respondent from the difficult task of
accurately introspecting the relative importance of individual
attributes for a particular decision (Green & Rao, 1971).

Concept of Performing Conjoint Analysis

To understand the concept of conjoint analysis, let us consider the


colour television example.
Suppose the consumer has got two choices in terms of two different
brands: “Brand A” and “Brand B.”
The consumer is willing to consider three attributes—brand image,
sound quality, and picture quality.
The consumer’s preference is given in Table 17.6.

Concept of Performing Conjoint Analysis (Cont.)

• The conjoint analysis is based on the assumption that subjects


evaluate the value or utility of a product or service or idea
(real or hypothetical) by combining the separate amount of
utility provided by each attribute (Schaupp & Belanger,
2005).
• It works on the simple principal of developing a part-worth or
utility function stating the utility consumers attach to the
levels of each attribute.

12/1/2018
Steps in Conducting Conjoint Analysis

Step 1: Problem Formulation

• To formulate a problem, as a first step, a researcher must identify


the various attributes and attribute levels.
• The number of attributes used in the conjoint analysis should be
selected with care.
• As a thumb rule, the number of attributes used in a typical conjoint
analysis study averages six or seven.
• The general recommendation is to select attribute levels that are somewhat larger than the attribute levels prevailing in the market, but not so large as to make the options unbelievable.

Step 1: Problem Formulation (Cont.)

To understand the concept of conjoint analysis, let us continue


with the colour television example. Suppose the company has
conducted a qualitative research and three attributes, such as
screen, sound quality, and price, have been identified as salient
attributes.
Each attribute is defined in terms of three levels that are given in
Table 17.7.

Step 2: Trade-off Data Collection

To construct conjoint analysis stimuli, two broad approaches are


available: the pair-wise (two-factor) approach and full-profile
approach.

Step 2: Trade-off Data Collection (Cont.)

Both methods, that is, the pair-wise (two-factor) approach and the full-profile approach, have their own utility, but the full-profile approach is the most widely used.

FIGURE 17.18 : Full-profile approach for collecting conjoint data


In the colour television example, three attributes and three levels of
each attribute are given. Hence, following full-profile approach, a total
of 3 × 3 × 3 = 27 profiles can be constructed.

Step 3: Metric Versus Non-Metric Input Data

• Conjoint analysis data can be of both the forms: metric data and
non-metric data.
• For non-metric data, the respondents indicate ranking, and for
metric data, the respondents indicate rating.
• The rating approach has gained popularity in recent times. Obviously, in conjoint analysis, the dependent variable is consumer preference or intention to buy a product (the rating or ranking provided by the customers for buying a product). In the colour television example, ratings are obtained on a 7-point Likert scale with 1 as not preferred and 7 as highly preferred.
• These ratings are given in Table 17.11.

Step 4: Result Analysis and Interpretation (Cont.)

TABLE 17.13 : Result of the conjoint analysis

TABLE 17.11 : Selected profiles of colour television example

Step 4: Result Analysis and Interpretation (Cont.)

A particular dummy variable xd is defined as

To analyse the conjoint analysis data, dummy variables are


treated as independent or explanatory variables and preference
rating obtained from the respondent is treated as dependent
variable.

To estimate utility, regression model can be formed as

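In standard notation, consistent with the two-dummies-per-three-level-attribute coding described in the surrounding text, the dummy definition and the utility regression can be written as:

```latex
x_d =
\begin{cases}
1, & \text{if the attribute level is present in the profile} \\
0, & \text{otherwise}
\end{cases}
\qquad\qquad
U = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 + b_4 x_4 + b_5 x_5 + b_6 x_6
```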
Step 4: Result Analysis and Interpretation (Cont.)

In this model, x1 and x2 are dummy variables representing the


attribute “screen,” x3 and x4 are dummy variables representing the
attribute “sound quality,” and x5 and x6 are dummy variables
representing “price.” For screen, attribute levels are coded as below:

TABLE 17.12 : The colour television data converted into dummy
variables on applying regression technique
Using any statistical software regression equation can be obtained
as

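A minimal sketch of this dummy-variable regression in Python, assuming pandas and statsmodels; the nine profiles (a Latin-square plan, chosen only so the dummies are not collinear) and the ratings are hypothetical.

```python
# Conjoint estimation as OLS on dummy-coded attribute levels.
import pandas as pd
import statsmodels.api as sm

profiles = pd.DataFrame({
    "screen": ["conventional"] * 3 + ["semi-flat"] * 3 + ["flat"] * 3,
    "sound":  ["mono", "stereo", "surround"] * 3,
    "price":  ["low", "medium", "high",
               "medium", "high", "low",
               "high", "low", "medium"],
})
ratings = pd.Series([2, 4, 6, 3, 5, 7, 1, 4, 6], name="preference")

# Two dummies per three-level attribute (drop_first omits a reference level).
X = pd.get_dummies(profiles, drop_first=True).astype(float)
model = sm.OLS(ratings, sm.add_constant(X)).fit()
print(model.params.round(3))  # coefficients estimate part-worth contrasts
```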
Step 4: Result Analysis and Interpretation

The conjoint analysis model can be represented by the following


formula:

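In its usual part-worth form (standard notation: m attributes, attribute i having k_i levels, part-worth utilities α_ij, and dummies x_ij):

```latex
U(X) \;=\; \sum_{i=1}^{m} \sum_{j=1}^{k_i} \alpha_{ij}\, x_{ij}
```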
SPSS output (multiple regression) for conjoint problem

• SPSS for ch 17\Conjoint Analysis Problem.sav


• SPSS for ch 17\Output Conjoint Analysis.spv

Step 4: Result Analysis and Interpretation (Cont.)

For the first attribute, screen

Step 4: Result Analysis and Interpretation (Cont.)

For the second screen level, semi-flat, the following equations can be obtained:

Solving these equations, we get

12/1/2018
Step 4: Result Analysis and Interpretation (Cont.)

For the third attribute, price, the corresponding equations can be obtained:

a31 - a33 = b5
a32 - a33 = b6
a31 + a32 + a33 = 0
Step 4: Result Analysis and Interpretation (Cont.)

TABLE 17.13 : Result of the conjoint analysis
Step 4: Result Analysis and Interpretation (Cont.)

For easy interpretation of the results obtained from conjoint analysis, it is
helpful to plot a graph of the utility functions; a sketch follows.
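A minimal plotting sketch, assuming hypothetical part-worth values (the
book's Table 17.13 values are not reproduced here):

import matplotlib.pyplot as plt

# Hypothetical part-worths for the three levels of each attribute.
levels = ["level 1", "level 2", "level 3"]
part_worths = {
    "screen":        [0.8, 0.2, -1.0],
    "sound quality": [0.5, 0.1, -0.6],
    "price":         [1.1, -0.1, -1.0],
}

fig, axes = plt.subplots(1, 3, figsize=(9, 3), sharey=True)
for ax, (attr, pw) in zip(axes, part_worths.items()):
    ax.plot(levels, pw, marker="o")   # utility function for one attribute
    ax.set_title(attr)
axes[0].set_ylabel("part-worth utility")
plt.tight_layout()
plt.show()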
Step 5: Reliability and Validity Check

• As discussed for regression, the value of R2 indicates how well the model
fits. In our model, R2 is obtained as 0.984 (98.4%), which indicates a
good-fitting model.
• The test–retest method of reliability assessment discussed earlier can also
be applied.
• In an aggregate-level analysis, the total sample can be divided into several
subsamples, and conjoint analysis can be performed on each subsample.
• The results across subsamples are then compared to assess the stability of
the conjoint analysis solution.
Assumptions and Limitations of Conjoint Analysis

• Conjoint analysis is based on the assumption that all the attributes
contributing to the utility of the product can be identified and are
independent.
• It also assumes that the consumers evaluate the alternatives and make
trade-offs.
• One limitation is that in some situations the brand name is so important
that consumers do not evaluate the brands or alternatives in terms of
attributes.
• For a large number of attributes, data collection may become difficult.
Multidimensional Scaling
Dr Manoj Kumar Dash
MULTIDIMENSIONAL SCALING

• The management of a company is always interested in knowing the position of
its products relative to the position of competitors' products in the market.
• Multidimensional scaling is an attempt to answer such questions.
• Multidimensional scaling, commonly known as MDS, is a technique to measure
and represent the perceptions and preferences of respondents in a perceptual
space as a visual display.
Conducting Multidimensional Scaling (Fig. 21.1)

Formulate the Problem
Obtain Input Data
Select an MDS Procedure
Decide on the Number of Dimensions
Label the Dimensions and Interpret the Configuration
MULTIDIMENSIONAL SCALING (Cont.)

• Multidimensional scaling handles two marketing decision parameters.
• First, the dimensions on which respondents evaluate objects must be
determined.
• As a convenient option, only two dimensions are usually worked out, so that
the evaluated objects can be portrayed graphically.
• Second, the objects are to be positioned on these dimensions.
• The output of multidimensional scaling is the location of the objects on
these dimensions, termed a spatial map or perceptual map.
MULTIDIMENSIONAL SCALING (Cont.)

• Multidimensional scaling attempts to infer the underlying dimensions from
the preference judgments provided by the customers.
• This is done by assigning the respondents' responses to specific locations
in a perceptual space such that the distances in the space match the given
dissimilarities as closely as possible.
• Data obtained from the respondents can be metric or non-metric.
• For metric data, ratings of the respondents' preferences are obtained; for
non-metric data, rankings of the respondents' relative preferences are
obtained.
• Multidimensional scaling approaches are available for analyzing metric as
well as non-metric data.
Some Basic Terms Used in Multidimensional Scaling

• Stress: Stress measures lack of fit in multidimensional scaling. A higher
value of stress indicates a poorer fit.
• R-square (squared correlation): The R2 value indicates how much of the
variance in the original dissimilarity matrix can be attributed to the
multidimensional scaling model. A higher R2 is desirable; in fact, R2 is a
goodness-of-fit measure for the model.
• Perceptual map: A perceptual map is a tool to visually display the perceived
relationships among various stimuli or objects in a multidimensional attribute
space.
The Process of Conducting Multidimensional Scaling

Input Data for Multidimensional Scaling (Fig. 21.2):
MDS Input Data
  Perceptions
    Direct (Similarity Judgments)
    Derived (Attribute Ratings)
  Preferences
Step 1: Problem Formulation

• A minimum of 8 to 10 brands should be included to construct a well-defined
perceptual map, and a maximum of 25 to 30 brands can be included in a
multidimensional scaling model.
• The number of brands or stimuli to be included depends on factors such as
the research objective, past research, the researcher's judgment, and the
requirements of management.
• We will take a hypothetical example of 10 edible oil brands to better
understand the concept of multidimensional scaling, with special reference to
obtaining a perceptual map.
• These 10 edible oil brands are Fortune, Sundrop, Saffola, Gemini, Nutrela,
Dhara, Ginni, Maharaja, Vital, and Nature Fresh.
Step 2: Input Data Collection

• The input data used for multidimensional scaling may be similarity data or
preference data.
• Similarity Data: Similarity data are collected by asking the respondents to
note the perceived similarity between two brands or objects.
• These data are often referred to as similarity judgments. Figure 1 provides
a respondent's similarity judgment for the Fortune–Saffola pair of edible oil
brands.
• Alternatively, a derived approach to data collection can be used for
conducting multidimensional scaling.
• Using this approach, the respondents rate the brands on identified
attributes using a rating scale.
• Responses obtained from a single respondent are summarized in Table 18.5.
Step 2: Input Data Collection (Cont.)

• Preference Data: In some cases, a researcher may be interested in knowing
the respondents' preferences for the stimuli or objects.
• In such situations, each respondent is asked to rank order all the objects
or stimuli according to preference.
• The configurations derived from similarity data and from preference data are
generally not the same.
• Two objects may be perceived as different on a map produced from similarity
data and yet lie close together on a map produced from preference data.
• To take care of this, an ideal-object approach is considered.
Input Data

• Perception Data: Direct Approaches: respondents are asked to judge how
similar or dissimilar the various brands are. These data are referred to as
similarity judgments.
• Consider similarity ratings of various toothpaste brands
(1 = Very Dissimilar, 7 = Very Similar):

Crest vs. Colgate        1 2 3 4 5 6 7
Aqua-Fresh vs. Crest     1 2 3 4 5 6 7
Crest vs. Aim            1 2 3 4 5 6 7
...
Colgate vs. Aqua-Fresh   1 2 3 4 5 6 7

• The number of pairs to be evaluated is n(n - 1)/2, where n is the number of
stimuli. For example, with 10 toothpaste brands, 10 × 9/2 = 45 pairs must be
rated.
Similarity Rating Of Toothpaste Brands
Table 21.1 (lower triangle of the similarity matrix; higher = more similar)

             Aqua-Fresh  Crest  Colgate  Aim  Gleem  Plus White  Ultra Brite  Close-Up  Pepsodent
Crest             5
Colgate           6        7
Aim               4        6      6
Gleem             2        3      4       5
Plus White        3        3      4       4     5
Ultra Brite       2        2      2       3     5        5
Close-Up          2        2      2       2     6        5           6
Pepsodent         2        2      2       2     6        6           7           6
Sensodyne         1        2      4       2     4        3           3           4          3
FIGURE: Respondent’s similarity judgment between two pairs
of edible oil brands
TABLE: Similarity score data for different brands of edible oil
(data that will be used for multidimensional scaling)
Step 3: Selection of Multidimensional Scaling Procedure

• The non-metric multidimensional scaling procedure is based on the ordinal
nature of the input data, whereas the metric procedure assumes that the input
data are interval scaled.
• In multidimensional scaling, ordinal (non-metric) input is often preferred.
• It is important to note that the non-metric procedure nevertheless produces
a metric output.
• Obviously, the metric procedure also produces a metric output.
• Note that the metric and non-metric procedures generally produce similar
results.
Step 3: Selection of Multidimensional Scaling Procedure (Cont.)

• As a second issue, a researcher has to determine whether the
multidimensional scaling procedure should be performed on individual-level or
aggregate data.
• An individual-level analysis results in a perceptual map for each
respondent.
• When performing the analysis on aggregate data, a perceptual map based on
the average similarity ratings can be obtained very easily.
• Our discussion of multidimensional scaling is based on the edible oil brands
example, for which the data are rank ordered (ordinal) and the adopted
procedure is non-metric.
Step 4: Determining the Number of Dimensions for the Perceptual Map

• Given the visual interpretation objective of multidimensional scaling, a
two-dimensional or at most a three-dimensional perceptual map is desirable.
• Stress measures lack of fit, and a higher stress value indicates a poorer
fit.
• A widely used criterion to determine the number of dimensions is to
construct a plot of stress values (obtained from the SPSS output) against
dimensionality.
• An elbow in the plot indicates the number of dimensions to include when
constructing the perceptual map; a sketch of this procedure follows.
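Outside SPSS, the same elbow plot can be approximated with scikit-learn; a
minimal sketch, assuming a precomputed 10 × 10 dissimilarity matrix D for the
edible oil brands (random data here as a stand-in for the real ratings):

import numpy as np
from sklearn.manifold import MDS

# Stand-in dissimilarity matrix; replace with the real 10 x 10 brand data.
rng = np.random.default_rng(0)
A = rng.random((10, 10))
D = (A + A.T) / 2          # symmetrize
np.fill_diagonal(D, 0)     # zero self-dissimilarity

# Fit non-metric MDS for 1 to 4 dimensions and record the stress values.
for k in range(1, 5):
    mds = MDS(n_components=k, metric=False, dissimilarity="precomputed",
              random_state=0)
    mds.fit(D)
    print(k, round(mds.stress_, 4))   # look for an elbow in these values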
Step 5: Substantive Interpretation
The SPSS output for the edible oil data is presented in the figures below.

FIGURE : SPSS output exhibiting the iterations for stress value improvement,
the stress value, and the R2 value
Step 5: Substantive Interpretation (Cont.)

The stress index indicates lack of fit in multidimensional scaling.
This stress value is commonly known as S-stress or Kruskal's stress.
The value ranges from the worst fit (stress value of 1) to the best fit
(stress value of 0).
In fact, the stress value depends on the type of multidimensional scaling
procedure adopted and on the data on which multidimensional scaling is
performed.
The acceptable stress values suggested by Kruskal (1964) are given below.
Step 5: Substantive Interpretation (Cont.)

A higher value of R2 (close to 1) is desirable in multidimensional scaling. An
R2 value greater than or equal to 60% is considered acceptable.

FIGURE : (b) SPSS output exhibiting stimulus coordinates
Step 5: Substantive Interpretation (Cont.)

• Figure is the desired SPSS-produced perceptual map (two


dimensional).
• Spatial map may be interpreted by examining the coordinates of
the map and relative position of the brands with respect to these
coordinates.
• The labelling of horizontal axis (X-axis) and vertical axis (Y-axis) is
a matter of researcher’s judgment and depends on factors such as
researcher’s insight, obtained information parameters, and so on.
• In some cases, the respondents are often asked to provide the base
of similarity they have used for judging the different brands or
objects.
FIGURE : (d) SPSS-produced perceptual map (three
dimensional)
FIGURE : Perceptual map (two dimensional) with labelling of
dimensions
Step 5: Substantive Interpretation (Cont.)

• The figure exhibits the perceptual map (two dimensional) with labelling of
the dimensions.
• Brands located close to each other may compete on the related dimensions.
• A brand located in isolation may have a unique image. This perceptual map is
based on the similarity judgments of a single respondent.
• If we take an aggregate score of the responses, a perceptual map based on
multiple responses can also be constructed; it is very helpful for marketing
managers in assessing the positioning of their own brand relative to competing
brands on the defined attributes.
Step 6: Check the Model Fit

• As a first step in checking the reliability and validity of the model, the
value of R2 must be examined. As discussed, an R2 value greater than or equal
to 60% is considered acceptable.
• In the edible oil multidimensional scaling model, the R2 value comes to
0.9707 (97.07%), which is very close to 1; hence the model is well acceptable.
• As a second step, the stress value must be examined.
• In the edible oil model, the stress value comes to 0.0746 (close to 5%), an
indication of a good-fit multidimensional scaling model.
• The original data can also be divided into two or more parts and the
resulting configurations compared; a sketch follows.
• Input data may also be gathered at two different points in time and
test–retest reliability computed.
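A minimal sketch of the split-half comparison, assuming two MDS
configurations (coords_a, coords_b) obtained from the two halves of the data;
scipy's Procrustes analysis gives a disparity score (0 means identical
configurations after rotation and scaling):

import numpy as np
from scipy.spatial import procrustes

# Stand-in 2-D configurations for the 10 brands from the two data halves.
rng = np.random.default_rng(1)
coords_a = rng.random((10, 2))
coords_b = coords_a + rng.normal(scale=0.05, size=(10, 2))  # similar map

# Procrustes superimposes the maps; low disparity = stable solution.
_, _, disparity = procrustes(coords_a, coords_b)
print(round(disparity, 4))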
Example of MDS output – holiday destinations in two dimensions

FIGURE : Common space plot of nine European cities (London, Paris, Berlin,
Amsterdam, Rome, Madrid, Athens, Stockholm, Bruxelles). Dimension 1 may be
interpreted as “climate”; Dimension 2 may be interpreted as how trendy the
city is.

• Each respondent is asked to rank the cities, without necessarily specifying
why one city was preferred to another.
• Similarities in rankings across an adequate number of respondents reflect
perceived similarities between cities (e.g. London is more similar to Berlin
than to Athens).
• Distances on the graph reflect dissimilarities.
• If the two dimensions can be labelled according to some criterion, as for
principal component or factor analysis, then it becomes possible to understand
the main perceived differences.
Example of brand positioning

The two dimensions are the output of some reduction technique:
- PCA or FA for interval (metric) data
- correspondence analysis for non-metric data

Coordinates for brands are obtained by running PCA (or FA) on sensory
assessments (usually through a panel of experts, unless objective measures
exist).

Consumer positions (as individuals or as segments) can be defined in two ways:
1) using their “ideal brand” characteristics
2) by translating preference rankings for brands into coordinates through
unfolding
Brand positioning

FIGURE : Joint map of brands (1–5) and consumer segments (A–D). Key
observations from the map:

• The product should be healthy, as both segments A and C like that dimension.
The thicker the product is, the closer it is to C than to A.
• Segment A chooses brand three, but brand three is not that close to A.
• There is room for a new product aimed at segment C but also close to
segment A.
• Brand five survives because of segment C, but it is far from C's
preferences.
• Consumer segment B is close to brand three.
• Brands 1 and 4 are perceived as similar; consumer segment D is happy with
brand 2.
• Brand repositioning: if brand five had this marketing research information,
one could improve its performance by enhancing the perceived healthiness of
the product (e.g. reducing the salt content) and through a targeted
advertising campaign. This would move brand five closer to segment C.
The MDS data set

• Ratings by consumers
• Evaluations of product characteristics by experts
IPM & unfolding
Unfolding

• Proximities are defined from the subjects' preference rankings.
• A nominal variable provides the labels for the objects (sports).
• When measures for the same set of objects are provided by different sources
(e.g. different groups/scenarios), the data should be stacked.
• The dialog defines the model options, allows one to place restrictions,
defines options for the algorithm, chooses plots, and displays and saves
additional output.
Unfolding options

• Select “identity” as the data come from a single source.
• Rankings are dissimilarities and ordinal data.
• Set the number of dimensions to be explored.
Options

• Set the convergence criterion for the STRESS function and choose the
starting configuration.
• The penalty term helps avoid degenerate solutions (where points can hardly
be distinguished from each other). The weight of the penalty term increases as
the strength becomes smaller.
• When the penalty range is zero, no correction is made to the Stress-I
criterion, while larger range values lead to solutions where the variability
of the transformed proximities is higher.
Plots

• The final common space shows subjects and objects on the same plot.
• Different colors or markers can be applied to different objects.
Outputs

• Output tables can be selected here.
• Output coordinates (distances) can be saved into a new file.
Unfolding output

Measures:
  Iterations                                           992
  Final Function Value                                 .3835645
    Stress Part                                        .0410912
    Penalty Part                                       3.5803705
  Badness of Fit:
    Normalized Stress                                  .0016885
    Kruskal's Stress-I                                 .0410908
    Kruskal's Stress-II                                .1905153
    Young's S-Stress-I                                 .0720164
    Young's S-Stress-II                                .1781156
  Goodness of Fit:
    Dispersion Accounted For                           .9983115
    Variance Accounted For                             .9666225
    Recovered Preference Orders                        .8471837
    Spearman's Rho                                     .8617494
    Kendall's Tau-b                                    .7273984
  Variation Coefficients:
    Variation Proximities                              .5043544
    Variation Transformed Proximities                  .3322572
    Variation Distances                                .5071630
  Degeneracy Indices:
    Sum-of-Squares of DeSarbo's Intermixedness Indices .4694185
    Shepard's Rough Nondegeneracy Index                .5609796

The final STRESS-I value of 0.04 is acceptable. Other measures of
“badness-of-fit” and “goodness-of-fit” are provided and confirm that the
results are acceptable.

The variation coefficient of the transformed proximities can be used to check
for the risk of degenerate solutions (points too close to each other). In this
case, the variation coefficient of the transformed proximities is 0.33 as
compared to 0.50 for the original ones, which means that most of the
variability is retained after transformation. Furthermore, the distances show
a variability that is more or less equal to the original one, indicating that
the points in space should be scattered enough to reflect the initial
distances.

DeSarbo's Intermixedness index and Shepard's Rough Nondegeneracy Index also
provide warning signals for degenerate solutions: the former should be as
close to zero as possible and the latter as close to one as possible. Here
there are no strong signals of a degenerate solution. One may wish to try
different parameters for the penalty term to see whether these indicators
improve.
Plots

• Plot of objects
• Plot of subjects
Joint plot

According to the sample, basketball, baseball and cricket share similarities
in the subjects' perceptions, and so do American football, motor sports and
ice hockey. A third “cluster” is formed by handball, water polo and
volleyball, while football seems to be equidistant from all the other sports.

Consumers are also grouped in clusters according to their preferences, and the
joint representation allows one to show not only which sports (products) are
closer to the preferences of the different segments, but also which sports
need to be repositioned to attract a larger public, like the cluster with
volleyball, water polo and handball.
Repositioning
• If one can attach a meaning to dimensions one and two, it becomes possible
to understand what characteristics of the products should be changed.
• A method to obtain an interpretation of the coordinates consists in looking
at the correlations between the coordinates of the sports and the object
characteristics that can be measured objectively or through the evaluation of
expert panellists.
• The algorithm has created an output file, coord.sav, which contains the two
coordinates for each sport and consumer and can be used to obtain the
bivariate correlations, as in the sketch below.
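A minimal sketch of that correlation step, assuming the coordinates and an
expert-rated characteristic have been exported to plain arrays (the coord.sav
contents are not reproduced here; all values are hypothetical):

import numpy as np

# Stand-in data: 2-D coordinates for ten sports plus one expert-rated
# characteristic (e.g. perceived physical contact) for the same sports.
rng = np.random.default_rng(2)
coords = rng.random((10, 2))   # columns: dimension 1, dimension 2
contact = rng.random(10)       # hypothetical expert ratings

# Correlate each dimension with the characteristic to help label the axes.
for d in range(2):
    r = np.corrcoef(coords[:, d], contact)[0, 1]
    print(f"dimension {d + 1} vs characteristic: r = {r:.2f}")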
Using SPSS for Multidimensional Scaling

• SPSS\SPSS 17\Problem MDS.sav
• SPSS\SPSS 17\Output MDS New.spv
CART

Dr Manoj Dash
Learning Objectives
• Understand basic concepts in decision tree modeling
• Understand how decision trees can be used to solve classification problems
• Understand the risk of model over-fitting and the need to control it via
pruning
• Be able to evaluate the performance of a classification model using
training, validation and test datasets
Decision Tree Induction
• Many algorithms:
  – Hunt's Algorithm (one of the earliest)
  – CART
  – ID3, C4.5
  – SLIQ, SPRINT
What is CART?
• Classification And Regression Trees
• Developed by Breiman, Friedman, Olshen and Stone in the early 1980s.
  – Introduced tree-based modeling into the statistical mainstream
  – Rigorous approach involving cross-validation to select the optimal tree
• One of many tree-based modeling techniques:
  – CART -- the classic
  – CHAID
  – C5.0
  – Software package variants (SPSS)
EXAMPLE:
A bank wants to categorize credit applicants according to whether or not they
represent a reasonable credit risk. Based on various factors, including the
known credit ratings of past customers, you can build a model to predict
whether future customers are likely to default on their loans.
A tree-based analysis provides some attractive features:
a) It allows you to identify homogeneous groups with high or low risk.
b) It makes it easy to construct rules for making predictions about individual
cases.
The Decision Tree procedure creates a tree-based classification model. It
classifies cases into groups or predicts values of a dependent (target)
variable based on the values of independent (predictor) variables. The
procedure provides validation tools for exploratory and confirmatory
classification analysis. A code sketch follows.
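Outside SPSS, the same idea can be sketched with scikit-learn; a minimal
example on made-up credit data (all field names and values are hypothetical):

import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical training data: [income level (0=low, 1=medium, 2=high),
# number of credit cards, age]; target: 1 = bad credit rating.
X = np.array([[0, 5, 25], [0, 2, 40], [1, 5, 27], [1, 2, 45],
              [2, 5, 30], [2, 2, 50], [0, 5, 22], [1, 5, 55]])
y = np.array([1, 1, 1, 0, 0, 0, 1, 0])

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["income", "cards", "age"]))

# Score a new applicant: medium income, 4 cards, 35 years old.
print(tree.predict([[1, 4, 35]]))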
Classification Trees (cont.)
• Business marketing: predict whether a person will buy a computer?
Illustrating Classification Task

Training set:
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Learn Model, then Apply Model to the test set (class to be predicted):
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?
Example of a Decision Tree

Training Data (Tid, Refund, Marital Status, Taxable Income, Cheat):
1   Yes  Single    125K  No
2   No   Married   100K  No
3   No   Single    70K   No
4   Yes  Married   120K  No
5   No   Divorced  95K   Yes
6   No   Married   60K   No
7   Yes  Divorced  220K  No
8   No   Single    85K   Yes
9   No   Married   75K   No
10  No   Single    90K   Yes

Model: Decision Tree (splitting attributes):
Refund?
  Yes → NO
  No  → MarSt?
          Married → NO
          Single, Divorced → TaxInc?
                               < 80K → NO
                               > 80K → YES

(A plain-code rendering of this tree follows.)
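The same model written out as a plain function, to make the rule structure
explicit (a sketch mirroring the tree above, not generated code):

def predict_cheat(refund: str, marital_status: str, taxable_income: float) -> str:
    """Traverse the example tree: Refund -> MarSt -> TaxInc."""
    if refund == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    # Single or Divorced: decide on taxable income (in thousands).
    return "Yes" if taxable_income > 80 else "No"

print(predict_cheat("No", "Married", 80))   # "No", as in the worked example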
Another Example of a Decision Tree

The same training data admit a different tree, starting with marital status:
MarSt?
  Married → NO
  Single, Divorced → Refund?
                       Yes → NO
                       No  → TaxInc?
                               < 80K → NO
                               > 80K → YES

There could be more than one tree that fits the same data!
Decision Tree Classification Task

The same training and test tables as above, with the decision tree as the
learned model: Learn Model, then Apply Decision Tree Model to the test set.
Apply Model to Test Data

Test record: Refund = No, Marital Status = Married, Taxable Income = 80K,
Cheat = ?

Start from the root of the tree and follow the branches that match the record:
Refund = No leads to the MarSt node, and MarSt = Married leads directly to a
leaf. Assign Cheat to “No”.
General Structure of Hunt's Algorithm

• Let Dt be the set of training records that reach a node t.
• General procedure:
  – If Dt contains records that all belong to the same class yt, then t is a
leaf node labeled as yt.
  – If Dt is an empty set, then t is a leaf node labeled by the default
class yd.
  – If Dt contains records that belong to more than one class, use an
attribute test to split the data into smaller subsets, and recursively apply
the procedure to each subset.

(The procedure is illustrated on the Refund / Marital Status / Taxable Income
/ Cheat training data shown earlier.)
Hunt's Algorithm

FIGURE : Stepwise growth of the tree on the training data. The algorithm first
splits on Refund (Yes → Don't Cheat; No → mixed), then splits the No branch on
Marital Status (Married → Don't Cheat; Single, Divorced → mixed), and finally
splits the remaining records on Taxable Income (< 80K → Don't Cheat;
>= 80K → Cheat). A code sketch of the recursion follows.
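A compact, illustrative rendering of Hunt's procedure in Python (a sketch,
with a simple majority-class default and single categorical-attribute splits,
not a production implementation; the data below are hypothetical):

from collections import Counter

def hunt(records, attributes, default="No"):
    """records: list of (features_dict, label). Returns a nested-dict tree."""
    if not records:                        # empty node: use the default class
        return default
    labels = [label for _, label in records]
    if len(set(labels)) == 1:              # pure node: make it a leaf
        return labels[0]
    if not attributes:                     # no tests left: majority class
        return Counter(labels).most_common(1)[0][0]
    attr, rest = attributes[0], attributes[1:]
    tree = {}
    for value in {feats[attr] for feats, _ in records}:
        subset = [(f, l) for f, l in records if f[attr] == value]
        tree[value] = hunt(subset, rest, Counter(labels).most_common(1)[0][0])
    return {attr: tree}

data = [({"Refund": "Yes", "MarSt": "Single"}, "No"),
        ({"Refund": "No", "MarSt": "Married"}, "No"),
        ({"Refund": "No", "MarSt": "Single"}, "Yes")]
print(hunt(data, ["Refund", "MarSt"]))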
Steps to Follow:
1. Select a dependent variable.
2. Select one or more independent variables.
3. Select a growing method.
4. Optionally, you can:
a) Change the measurement level for any variable in the source list.
b) Force the first variable in the independent variables list into the model
as the first split variable.
5. Select an influence variable that defines how much influence a case has on
the tree-growing process. Cases with lower influence values have less
influence; cases with higher values have more. Influence variable values
must be positive.
6. Validate the tree.
7. Customize the tree-growing criteria.
8. Save terminal node numbers, predicted values, and predicted probabilities
as variables.
9. Save the model in XML (PMML) format.
EXAMPLE (recap):
A bank wants to categorize credit applicants according to whether or not they
represent a reasonable credit risk. Based on various factors, including the
known credit ratings of past customers, you can build a model to predict
whether future customers are likely to default on their loans.
A tree-based analysis provides some attractive features:
a) It allows you to identify homogeneous groups with high or low risk.
b) It makes it easy to construct rules for making predictions about individual
cases.
The tree diagram is a graphic representation of the tree model. This tree
diagram shows that:
Using the CHAID method, income level is the best predictor of credit rating.
a. For the low income category, income level is the only significant predictor
of credit rating. Of the bank customers in this category, 82% have defaulted
on loans. Since there are no child nodes below it, this is considered a
terminal node.
b. For the medium and high income categories, the next best predictor is the
number of credit cards.
c. For medium income customers with five or more credit cards, the model
includes one more predictor: age. Over 80% of those customers aged 28 or
younger have a bad credit rating, while slightly less than half of those over
28 have a bad credit rating.
The tree table, as the name suggests, provides most of the essential tree
diagram information in the form of a table. For each node, the table displays:
a. The number and percentage of cases in each category of the dependent
variable.
b. The predicted category for the dependent variable. In this example, the
predicted category is the credit rating category with more than 50% of cases
in that node, since there are only two possible credit ratings.
c. The parent node for each node in the tree. Note that node 1, the low income
level node, is not the parent node of any node. Since it is a terminal node,
it has no child nodes.
The gains for nodes table provides a summary of information about the terminal
nodes in the model.
a. Only the terminal nodes, the nodes at which the tree stops growing, are
listed in this table.
b. Since gain values provide information about target categories, this table
is available only if you specified one or more target categories. In this
example, there is only one target category, so there is only one gains for
nodes table.
c. Node N is the number of cases in each terminal node, and Node Percent is
the percentage of the total number of cases in each node.
d. Gain N is the number of cases in each terminal node in the target category,
and Gain Percent is the percentage of cases in the target category with
respect to the overall number of cases in the target category; in this
example, the number and percentage of cases with a bad credit rating.
e. For categorical dependent variables, Response is the percentage of cases in
the node in the specified target category. In this example, these are the same
percentages displayed for the Bad category in the tree diagram.
f. For categorical dependent variables, Index is the ratio of the response
percentage for the target category compared to the response percentage for the
entire sample.
Case

EXPERIMENT
Cross validation
1. Cross validation divides the sample into a number of subsamples, or folds.
Tree models are then generated, excluding the data from each subsample in
turn.
2. The first tree is based on all of the cases except those in the first
sample fold,
3. the second tree is based on all of the cases except those in the second
sample fold, and so on.
4. For each tree, misclassification risk is estimated by applying the tree to
the subsample excluded in generating it.
5. a. You can specify a maximum of 25 sample folds. The higher the value, the
fewer the number of cases excluded for each tree model.
   b. Cross validation produces a single, final tree model. The cross-validated
risk estimate for the final tree is calculated as the average of the risks for
all of the trees; a sketch follows.
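A minimal scikit-learn analogue of this k-fold risk estimate (hypothetical
data; SPSS's fold mechanics differ in detail):

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(3)
X = rng.random((100, 3))                  # hypothetical predictors
y = (X[:, 0] + X[:, 1] > 1).astype(int)   # hypothetical target

# 10-fold cross validation: each fold is held out once for testing.
scores = cross_val_score(DecisionTreeClassifier(max_depth=3, random_state=0),
                         X, y, cv=10)
print(1 - scores.mean())   # average misclassification risk estimate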
Split-Sample Validation
With split-sample validation, the model is generated using a training sample
and tested on a hold-out sample.
a. You can specify a training sample size, expressed as a percentage of the
total sample size, or a variable that splits the sample into training and
testing samples.
b. If you use a variable to define the training and testing samples, cases
with a value of 1 for the variable are assigned to the training sample, and
all other cases are assigned to the testing sample. The variable cannot be the
dependent variable, weight variable, influence variable, or a forced
independent variable.
c. You can display results for both the training and testing samples or just
the testing sample.
d. Split-sample validation should be used with caution on small data files
(data files with a small number of cases). Small training samples may yield
poor models, since there may not be enough cases in some categories to
adequately grow the tree. A sketch follows.
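The equivalent hold-out split in scikit-learn (a sketch with hypothetical
data):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(4)
X = rng.random((100, 3))
y = (X[:, 0] > 0.5).astype(int)

# 70% training sample, 30% hold-out testing sample.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)
print(model.score(X_te, y_te))   # accuracy on the hold-out sample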
The Growth Limits tab allows you to limit the number of levels in the tree and
control the minimum number of cases for parent and child nodes.
Maximum Tree Depth. Controls the maximum number of levels of growth beneath
the root node. The Automatic setting limits the tree to three levels beneath
the root node for the CHAID and Exhaustive CHAID methods and five levels for
the CRT and QUEST methods.
For the CHAID and Exhaustive CHAID methods, you can control:
Significance Level: You can control the significance value for splitting nodes
and merging categories. For both criteria, the default significance level is
0.05.
a. For splitting nodes, the value must be greater than 0 and less than 1.
Lower values tend to produce trees with fewer nodes.
b. For merging categories, the value must be greater than 0 and less than or
equal to 1. To prevent merging of categories, specify a value of 1. For a
scale independent variable, this means that the number of categories for the
variable in the final tree is the specified number of intervals (the default
is 10).
In CHAID analysis, scale independent (predictor) variables are always banded
into discrete groups (for example, 0–10, 11–20, 21–30, etc.) prior to
analysis. You can control the initial/maximum number of groups (although the
procedure may merge contiguous groups after the initial split):
a. Fixed number. All scale independent variables are initially banded into the
same number of groups. The default is 10.
b. Custom. Each scale independent variable is initially banded into the number
of groups specified for that variable.
The extent to which a node does not represent a homogeneous subset of cases is
an indication of impurity. For example, a terminal node in which all cases
have the same value for the dependent variable is a homogeneous node that
requires no further splitting because it is “pure.” You can select the method
used to measure impurity and the minimum decrease in impurity required to
split nodes.
Impurity Measure. For scale dependent variables, the least-squared deviation
(LSD) measure of impurity is used. It is computed as the within-node variance,
adjusted for any frequency weights or influence values.
For categorical (nominal, ordinal) dependent variables, you can select the
impurity measure:
Gini. Splits are found that maximize the homogeneity of child nodes with
respect to the value of the dependent variable. Gini is based on squared
probabilities of membership for each category of the dependent variable. It
reaches its minimum (zero) when all cases in a node fall into a single
category. This is the default measure; see the sketch after this list.
Twoing. Categories of the dependent variable are grouped into two subclasses.
Splits are found that best separate the two groups.
Ordered twoing. Similar to twoing except that only adjacent categories can be
grouped. This measure is available only for ordinal dependent variables.
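A tiny sketch of the Gini computation for one node (formula: Gini = 1 − Σ
p_i², where the p_i are the category proportions in the node):

def gini(counts):
    """Gini impurity of a node from its per-category case counts."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

print(gini([10, 0]))   # 0.0 -> pure node, no further splitting needed
print(gini([5, 5]))    # 0.5 -> maximally mixed two-category node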
Minimum change in improvement. This is the minimum decrease in
impurity required to split a node. The default is 0.0001. Higher
values tend to produce trees with fewer nodes.
Saved Variables
Terminal node number. The terminal node to which each case is assigned. The
value is the tree node number.
Predicted value. The class (group) or value for the dependent variable
predicted by the model.
Predicted probabilities. The probability associated with the model's
prediction. One variable is saved for each category of the dependent variable.
Not available for scale dependent variables.
Sample assignment (training/testing). For split-sample validation, this
variable indicates whether a case was used in the training or testing sample.
The value is 1 for the training sample and 0 for the testing sample. Not
available unless you have selected split-sample validation.
Export Tree Model as XML
You can save the entire tree model in XML (PMML) format. You can use this
model file to apply the model information to other data files for scoring
purposes.
Training sample. Writes the model to the specified file. For split-sample
validated trees, this is the model for the training sample.
Test sample. Writes the model for the test sample to the specified file. Not
available unless you have selected split-sample validation.
OUTPUT
Tree. By default, the tree diagram is included in the output displayed in the
Viewer. Deselect (uncheck) this option to exclude the tree diagram from the
output.
Display. These options control the initial appearance of the tree diagram in
the Viewer. All of these attributes can also be modified by editing the
generated tree.
a. Orientation. The tree can be displayed top down with the root node at the
top, left to right, or right to left.
b. Node contents. Nodes can display tables, charts, or both. For categorical
dependent variables, tables display frequency counts and percentages, and the
charts are bar charts. For scale dependent variables, tables display means,
standard deviations, numbers of cases, and predicted values, and the charts
are histograms.
c. Scale. By default, large trees are automatically scaled down in an attempt
to fit the tree on the page. You can specify a custom scale percentage of up
to 200%.
d. Independent variable statistics. For CHAID and Exhaustive CHAID, statistics
include the F value (for scale dependent variables) or chi-square value (for
categorical dependent variables) as well as the significance value and degrees
of freedom. For CRT, the improvement value is shown. For QUEST, F, the
significance value, and degrees of freedom are shown for scale and ordinal
independent variables; for nominal independent variables, chi-square, the
significance value, and degrees of freedom are shown.
Risk. Risk estimate and its standard error: a measure of the tree's predictive
accuracy.
a. For categorical dependent variables, the risk estimate is the proportion of
cases incorrectly classified after adjustment for prior probabilities and
misclassification costs.
b. For scale dependent variables, the risk estimate is the within-node
variance.
Classification table. For categorical (nominal, ordinal) dependent variables,
this table shows the number of cases classified correctly and incorrectly for
each category of the dependent variable. Not available for scale dependent
variables.
Gain. Gain is the percentage of total cases in the target category in each
node, computed as: (node target n / total target n) x 100.
The gains chart is a line chart of cumulative percentile gains, computed as:
(cumulative percentile target n / total target n) x 100.
A separate line chart is produced for each target category. Available only for
categorical dependent variables with defined target categories.
The gains chart plots the same values that you would see in the Gain Percent
column in the gains for percentiles table, which also reports cumulative
values.
Index. Index is the ratio of the node response percentage for the target
category compared to the overall target category response percentage for the
entire sample.
The index chart is a line chart of cumulative percentile index values.
Available only for categorical dependent variables.
The cumulative percentile index is computed as: (cumulative percentile
response percent / total response percent) x 100.
A separate chart is produced for each target category, and target categories
must be defined.
Response. The percentage of cases in the node in the specified target
category.
The response chart is a line chart of cumulative percentile response, computed
as: (cumulative percentile target n / cumulative percentile total n) x 100.
Available only for categorical dependent variables with defined target
categories. A small numeric sketch follows.
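A tiny worked sketch of these node summaries with made-up counts (50 target
cases overall in a sample of 500; one node holding 100 cases, 30 of them in
the target category):

node_n, node_target = 100, 30     # hypothetical terminal node
total_n, total_target = 500, 50   # hypothetical sample totals

gain = node_target / total_target * 100           # 60.0% of all target cases
response = node_target / node_n * 100             # 30.0% of the node
overall_response = total_target / total_n * 100   # 10.0% of the sample
index = response / overall_response * 100         # 300.0: node is 3x average

print(gain, response, index)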