MARKETING ANALYSIS
TO
MARKETING ANALYTICS
(SCOPE AND OVERVIEW)
Dr. Manoj Kumar Dash
M.A., M.Phil., M.B.A., NET, Ph.D.
ABV-Indian Institute of Information Technology and Management, Gwalior
(An Autonomous Institute of the Government of India)
Present: Indian Institute of Information Technology and Management, Gwalior (from 2010 to contd.)
Visiting / Adjunct Faculty

I. Teaching Interests:
Specialization: Behavioural Economics; Econometrics; Microeconomics; Marketing Analytics; Marketing Research; Money and Banking; Research Methodology; Consumer Behaviour; Data Analytics; Retail Analytics; Marketing Modelling; Multivariate and Multi-Criteria Techniques

II. Research Interests:
A. Applied Management Science: Behavioural Economics; Sustainable Sectoral Development; Circular Economy and Productivity; Application of Fuzzy Approaches in Consumer Decision-Making Modelling; Multivariate and Multi-Criteria Analysis
B. Interdisciplinary Research: Entrepreneurship
Personal Information
Name: Dr. Manoj Kumar Dash
Teaching Interests: Marketing Science: Big Data Analytics; Marketing Analytics; Retail Analytics; Consumer Decision-Making Modelling; Multi-Criteria Decision Making (MCDM); Optimization Techniques in Marketing; Econometric Modelling in Marketing; Behavioural Economics Experiments
Research Interests: Applied Marketing Science: Consumer Decision-Making Modelling; Digital Marketing
Agenda for Discussion
• Tools and Techniques in Marketing Analytics (30 min)
• Scope of Application of Marketing Analytics (10 min)
• Data Analytics (10 min)
• Big Data (10 min)
• Question and Answer (5 min)
Think and Analyze …….
12/1/2018
Some issues ……..
• A retail outlet wants to know the consumer behavioural pattern of purchases of products in two categories: national brands and local brands.
• A retailer wants to understand product intimacy and devise cross-selling and up-selling strategies.
• How can retail strategy be prioritized to reach the optimum level of profit and sales?
• How can complex problems be addressed, and cause and effect analyzed, in complex situations?
• How can competitor strategy be measured and the different retail units ranked?
Once we have decided what is to be analyzed, then comes the issue of HOW IT IS TO BE DONE.
And, IT TAKES US TO ……
If you think that you are curious enough to look for answers to these issues, then you have to equip yourself with …???
Exploratory Factor Analysis
Confirmatory Factor Analysis
Cluster Analysis
RFM Analysis
Churn Analysis
Customer Lifetime Value Analysis
Text Analytics
Multiple Regression Analysis
Dummy Variable Regression Analysis
Probit Model
Market Basket Analysis
CART Analysis
Conjoint Analysis
Multidimensional Scaling
Balanced Scorecard
Analytic Hierarchy Process (AHP)
DEMATEL
TOPSIS
CART
Neural Networks
ANP
DEA
Image Analytics
Video Analytics
Natural Language Processing
Decision Trees
Elasticity of Demand
Break-Even Analysis
MR and MC Analysis
Input-Output Analysis
Cartel Pricing Strategy
CRM, E-CRM, M-CRM, I-CRM
Big Data Analytics?
From data, data estimation, and data analysis to Big Data, data analytics, and data visualization.
Pipeline: Data Collection → Structured and Unstructured Data → Big Data → Data Mining → Analysis → Analytics → Decision Making
Who’s Generating Big Data
Mobile devices (tracking all objects all the time)
• Progress and innovation are no longer hindered by the ability to collect data
• but by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion
Four Characteristics of Big Data
• Volume: cost-efficiently processing the growing volume of data (from 2010 to 2020, a 50x growth to 35 ZB)
• Velocity: responding to the increasing velocity of data (30 billion RFID sensors and counting)
• Variety: collectively analyzing the broadening variety of data (80% of the world's data is unstructured)
Marketing Analytics???
Research Methodology: Inferences; Modeling; Multivariate, Optimization and Multi-Criteria techniques
Tools and Techniques: Big Data → Descriptive + Predictive + Prescriptive → Decision
Four Types of Analytics:
• Descriptive Analytics: What is happening?
• Diagnostic Analytics: Why did it happen?
• Predictive Analytics: What is likely to happen?
• Prescriptive Analytics: What should I do about it?
Marketing Database Analytics
SQL
SPSS
KNIME
Azure ML
Scope of Marketing Analytics
Our focus will be:
• 10 Tools and Techniques
• 8 Experiments
• 4 Case discussions
• Software: SPSS, AMOS and EViews
Focus on four aspects:
1. Customer Analytics
2. Product Analytics
3. Price Analytics
4. Multivariate Tools
Marketing Analytics
Class | Topics | Process | Requirement
Class-1 | Overview of Data Analytics | |
Class-5 | Dummy Variable Regression Analysis | Product Analytic Approach | SPSS and EViews
Class-6 | Survival Analysis | Product Analytic Approach | SPSS and EViews
Also: MDS, Cluster Analysis, Conjoint Analysis & Correspondence Analysis
Exploratory Factor Analysis
• Factor analysis in marketing is important because it reflects the buyer's perception of the product.
• By testing variables, marketing professionals can determine what is important to the customers of the product.
Confirmatory factor analysis (CFA) is a more complex approach that tests the hypothesis that the items are associated with specific factors.
1. CFA uses structural equation modeling to test a measurement model whereby loading on the factors allows for evaluation of relationships between observed variables and unobserved variables.
2. Structural equation modeling approaches can accommodate measurement error, and are less restrictive than least-squares estimation.
3. Hypothesized models are tested against actual data, and the analysis demonstrates loadings of observed variables on the latent variables (factors), as well as the correlations between the latent variables.
Confirmatory Factor Analysis
Structural Equation Modeling
Multiple Regression Analysis
Discriminant Analysis
Cluster Analysis
Multidimensional Scaling
Conjoint Analysis
Churn Analysis……
Churn Tools and Techniques
• Churn: a term used to describe customer attrition or loss.
• Churn rate: the number of participants who discontinue their use of a service divided by the average number of total participants during the period.
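The churn-rate formula above is simple enough to sketch in a few lines of Python; the customer counts below are hypothetical:

```python
def churn_rate(discontinued, avg_participants):
    """Churn rate = customers lost / average total customers in the period."""
    return discontinued / avg_participants

# Hypothetical example: 250 customers lost out of an average base of 5,000
rate = churn_rate(250, 5000)
print(f"{rate:.1%}")  # 5.0%
```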
Recency Value:
• The date of last order
• The most powerful predictor of who is likely to
order
http://www.predictiveanalyticstoday.com/top-data-analysis-software/
THINK BIG, Start Small
Profitable Customers and Responsive Customers: RFM and LTV
• Customers are assigned a recency score based on date of most recent purchase or time interval since most
recent purchase. This score is based on a simple ranking of recency values into a small number of
categories.
– For example, if you use five categories, the customers with the most recent purchase dates receive
a recency ranking of 5, and those with purchase dates furthest in the past receive a recency ranking
of 1.
• In a similar fashion, customers are then assigned a frequency ranking, with higher values representing a
higher frequency of purchases.
– For example, in a five category ranking scheme, customers who purchase most often receive a
frequency ranking of 5.
• Finally, customers are ranked by monetary value, with the highest monetary values receiving the highest
ranking.
– Continuing the five-category example, customers who have spent the most would receive a
monetary ranking of 5
– The result is four scores for each customer: recency, frequency, monetary, and combined RFM score,
which is simply the three individual scores concatenated into a single value. The "best" customers
(those most likely to respond to an offer) are those with the highest combined RFM scores. For
example, in a five-category ranking, there is a total of 125 possible combined RFM scores, and the
highest combined RFM score is 555.
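The five-category scoring described above can be sketched in plain Python. The customer figures below are hypothetical, and ties are broken by sort order rather than by SPSS's tie-handling rules:

```python
def quintile(values, high_is_best=True):
    """Assign quintile scores 5 (best) to 1 (worst) by simple ranking."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i], reverse=high_is_best)
    scores = [0] * n
    for rank, i in enumerate(order):
        scores[i] = 5 - (rank * 5) // n  # top fifth -> 5, bottom fifth -> 1
    return scores

# Hypothetical customers: days since last purchase, purchase count, total spend
recency_days = [3, 40, 200, 10, 365]
frequency    = [12, 2, 1, 7, 3]
monetary     = [1568, 90, 40, 410, 150]

R = quintile(recency_days, high_is_best=False)  # fewer days = more recent = better
F = quintile(frequency)
M = quintile(monetary)
rfm = [f"{r}{f}{m}" for r, f, m in zip(R, F, M)]
print(rfm[0])  # customer 0 scores 555, the best combined RFM score
```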
Response Rate by Recency Quintile
Quintile 5: 3.49%
Quintile 4: 1.25%
Quintile 3: 1.08%
Quintile 2: 0.63%
Quintile 1: 0.26%
How to compute a Frequency Index
• Keep number of transactions in customer
record
• Sort Recency Groups from highest to lowest
• Divide into five equal groups
• Number groups from 5 to 1
• Put Quintile number in each customer record
Response Rate by Frequency Quintile
Quintile 5: 1.99%
Quintile 4: 1.56%
Quintile 3: 1.31%
Quintile 2: 0.92%
Quintile 1: 0.93%
How to compute a Monetary Index
• Store total dollars purchased in each customer
record
• Sort Frequency Groups from highest to lowest
• Divide into 5 equal groups (Quintiles)
• Number Quintiles 5, 4, 3, 2, 1
• Put Quintile number in each record
Response by Monetary Quintile
Quintile 5: 1.61%
Quintile 4: 1.45%
Quintile 3: 1.46%
Quintile 2: 1.22%
Quintile 1: 1.23%
Monetary Response to Rs 5,000 Product
Percentage of households promoted who purchased, by monetary quintile:
Quintile 5: 1.68
Quintile 4: 1.17
Quintile 3: 0.88
Quintile 2: 0.66
Quintile 1: 0.32
Result of Test Mailing to 30,000
#  | RFM | Mailed | Responses | Response Rate
1  | 555 | 240 | 20 | 8.15%
2  | 554 | 240 | 16 | 6.56%
3  | 553 | 240 | 13 | 5.62%
4  | 552 | 240 | 10 | 4.33%
5  | 551 | 240 | 11 | 4.51%
6  | 545 | 240 | 9  | 3.78%
7  | 544 | 240 | 12 | 4.98%
8  | 543 | 240 | 6  | 2.88%
9  | 542 | 240 | 10 | 4.26%
10 | 541 | 240 | 7  | 3.10%
11 | 535 | 240 | 10 | 4.13%
12 | 534 | 240 | 9  | 3.83%
13 | 533 | 240 | 8  | 3.35%
14 | 532 | 240 | 6  | 2.70%
RFM Segments
Suppose we have a customer who purchased an item 17 days ago (R=1), bought 7 times in the last year (F=1), and spent 1,568 total in the past year (M=1). As a result, we place this customer in RFM segment "111." Segment 111 contains your "Best Customers."
For customer data, there are three alternatives for where you can save new RFM scores:
• Active dataset: selected RFM score variables are added to the active dataset.
• New dataset: selected RFM score variables and the ID variables that uniquely identify each customer (case) will be written to a new dataset in the current session. Dataset names must conform to standard variable naming rules. This option is only available if you select one or more Customer Identifier variables on the Variables tab.
• File: selected RFM scores and the ID variables that uniquely identify each customer (case) will be saved in an external data file. This option is only available if you select one or more Customer Identifier variables on the Variables tab.
RFM Procedure
– Transaction file
• Run RFM on transaction file
• Create a new RFM dataset
• Merge with customer file
– Customer file
• File is already prepared
• Run RFM on time since last purchase, number of
purchases, and money spent
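Running RFM on a transaction file amounts to collapsing transactions to one (recency, frequency, monetary) record per customer before scoring. A minimal sketch, with hypothetical customer IDs, dates, and amounts:

```python
from datetime import date

# Hypothetical transaction file: (customer_id, order_date, amount)
transactions = [
    ("C1", date(2018, 11, 20), 500.0),
    ("C1", date(2018, 6, 2), 250.0),
    ("C2", date(2017, 12, 1), 90.0),
]
as_of = date(2018, 12, 1)

# Collapse to one (recency_days, frequency, monetary) record per customer
summary = {}
for cid, d, amount in transactions:
    recency, freq, monetary = summary.get(cid, (None, 0, 0.0))
    days = (as_of - d).days
    recency = days if recency is None else min(recency, days)
    summary[cid] = (recency, freq + 1, monetary + amount)

print(summary["C1"])  # (11, 2, 750.0)
```

The resulting per-customer dataset can then be merged with the customer file and scored with the quintile logic shown earlier.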
SPSS Means Procedure
• Analyze > Compare Means > Means
• DV: Response variable (0/1)
• IV: RFM (Means and N)
• Copy the output table to Excel; sort all columns by mean response, in descending order
• Determine break-even (BE) and economics
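The same means-and-sort logic can be sketched outside SPSS. The mailing results, cost, and margin figures below are hypothetical:

```python
from collections import defaultdict

# Hypothetical mailing results: RFM cell per customer, response flag 0/1
cells     = ["555", "555", "541", "541", "533", "533", "533"]
responses = [1, 0, 0, 0, 1, 0, 0]

totals, hits = defaultdict(int), defaultdict(int)
for cell, r in zip(cells, responses):
    totals[cell] += 1
    hits[cell] += r

# Mean response per cell, sorted in descending order
rates = sorted(((hits[c] / totals[c], c) for c in totals), reverse=True)

# Break-even response rate: mail a cell only if it at least covers mailing cost
cost_per_piece, profit_per_order = 0.60, 30.0   # hypothetical economics
break_even = cost_per_piece / profit_per_order  # 2%
mail_these = [c for rate, c in rates if rate >= break_even]
print(mail_these)
```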
Example of a data file in a spreadsheet
What is RFMPD Analysis?
• RFMPD includes 2 additional variables.
• P stands for Payment. This measures when the
company receives payment.
• Customers who pay quickly receive a P score of 1
with the slowest paying receiving a score of 5.
• D stands for Date. This is the date of the
customer's last payment.
• Customers are sorted by decreasing D values.
• The final score is based on a value for R, F, M, & P.
Uses of RFM Analysis
• Marketing departments of any company
• Customer Service Departments
• Customer Relations Departments
• Ranking Suppliers
• Ranking Salespeople
• Airlines
• Credit Card Companies
Strengths of RFM Analysis
• Companies have data that can be used for target
marketing.
• Marketing budgets will be focused on customers
who are more recent, more frequent and spend
more.
• Specific targeting can increase profit and reduce
costs; companies gain by not spending on
customers who will not add value
• You can offer incentives to middle scoring
customers to increase their purchases
• Analysis is quick and easy to interpret
Weaknesses of RFM Analysis
It only looks at three variables and there may be others
that are more important
Customers with low RFM scores may be ignored, even
though they may have legitimate reasons for spending
more with other vendors.
Opportunities may be missed to solidify business
relationships leading to loss of future sales and
referrals.
A customer with a low recency value and high spending
could be ranked lower than a customer who made a
recent purchase and spends 10 times less
Effectiveness of RFMP Analysis
• Customers scoring in the top 20% also pay the
fastest. Companies will be able to make money
faster, and this can be used to reduce other
liabilities.
• Customers in the lowest 20% are slow payers
and companies can choose to limit credit or
change payment terms to reduce the amount
of outstanding debt.
1, 1, 1, 5 Customers
• This customer has ordered recently, buys
frequently, and spends large amounts of money,
but is a slow payer.
• To speed up the payment process, companies can
change payment terms and offer incentives to
pay earlier.
• For example, if the due date is 30 days, but
payments are received with 10 days, the buyer
will receive a 2% discount off the bill (2/10 net
30).
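The value of the 2/10 net 30 incentive can be quantified with one standard trade-credit approximation: the implied annualized cost to a buyer of forgoing the discount.

```python
# Terms "2/10 net 30": 2% discount if paid within 10 days, full amount due in 30
discount, discount_days, net_days = 0.02, 10, 30

# Forgoing the discount means paying 2% more for 20 extra days of credit;
# annualized, that implicit interest rate is steep
annualized_cost = (discount / (1 - discount)) * (365 / (net_days - discount_days))
print(f"{annualized_cost:.1%}")  # roughly 37% per year
```

This is why a prompt-payment discount is usually a strong incentive for slow-paying customers.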
5, 5, 5, 1 Customer
• This customer has not ordered recently or
frequently and spends small amounts of money,
but always pays on time.
• This customer is spending more money with
competitors.
• Make an effort to find out why the customer is
spending elsewhere to see if there is anything
the company can improve on.
RFMP or RFM?
• RFMP is a better method because it includes the
variable of payment. With more variables, you
have a clearer picture of the customer's value to
the company.
• RFMP also takes into account the customer’s
payment history.
• If a customer pays on time, you know that there
are no cash flow issues.
• Slow payers may be having financial problems
which may increase in the future.
Using RFM for Salespeople
• RFM Analysis of Salespeople gives managers a
clear picture of how a salesperson is
performing
• You can analyze the amount of revenue
generated per person and compare different
salespeople
• It is also possible to identify opportunities for
additional training, promotion or employment
termination.
RFM or No RFM?
• RFM is best suited for companies who offer a
rewards program. They are able to track
spending and can offer their high profile
clients incentives to spend more.
• RFM is worst suited to companies who provide
products that are unique and will not be
purchased in large quantities.
Case Study
Description …
1. The dataset used in this case study was provided by a sports store and
collected through its e-commerce website over a two-year period.
2. The complete dataset included 1584 different product demands in 54
sub-groups and 6149 purchase orders of 2666 individual customers.
3. The purchase orders included many columns such as transaction id,
product id, customer id, ordering date, quantity, ordering amount (price),
sales type, discount and whether or not promotion was involved.
4. While customer table included demographic variables such as age, gender,
marital status, education level and geographic region; product table
included attributes such as barcode, brand, color, category, subcategory,
usage type and season.
What should you do?
• Maintain a customer database
• Maintain the most recent date, frequency of
orders and total dollar amount
• Put RFM cell codes into your records
• With each mailing, see which cells respond.
• Increase response and profits by NOT MAILING
non-responsive cells
Books by Arthur Hughes
contd.
• Principal component analysis: This is the most common method used by researchers. PCA starts
by extracting the maximum variance and putting it into the first factor. After that, it removes the
variance explained by the first factor and then extracts the maximum remaining variance for the
second factor. This process continues to the last factor.
• Common factor analysis: Common factor analysis is the second most preferred method among
researchers. It extracts the common variance and puts it into factors; common factor analysis
does not include the unique variance of the variables. This method is used in SEM modeling.
• Image factoring: This method is based on the correlation matrix. The OLS regression method is
used to predict the factors in image factoring.
• Maximum likelihood method: This method also works on the correlation matrix, but it uses the
maximum likelihood method to extract factors.
Steps in Factor Analysis:
The Correlation Matrix
• 1st Step: the correlation matrix
– Generate a correlation matrix for all variables
– Identify variables not related to other variables
– If the correlations between variables are small, it is unlikely that they
share common factors (variables must be related to each other for the
factor model to be appropriate).
– Think of correlations in absolute value.
– Correlation coefficients greater than 0.3 in absolute value are indicative
of acceptable correlations.
– Examine visually the appropriateness of the factor model.
Steps in Factor Analysis:
The Correlation Matrix
– Bartlett Test of Sphericity:
used to test the hypothesis that the correlation matrix is an identity matrix (all
diagonal terms are 1 and all off-diagonal terms are 0).
If the value of the test statistic for sphericity is large and the associated
significance level is small, it is unlikely that the population correlation matrix
is an identity.
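Bartlett's statistic has a closed form, chi-square = -(n - 1 - (2p + 5)/6) * ln|R| with p(p-1)/2 degrees of freedom. A NumPy sketch on synthetic data (the dataset below is made up purely for illustration):

```python
import numpy as np

def bartlett_sphericity(X):
    """Bartlett's test of sphericity on a data matrix X (n cases x p variables).

    Returns the chi-square statistic and its degrees of freedom; compare the
    statistic to a chi-square table (small p-value -> reject identity matrix).
    """
    n, p = X.shape
    R = np.corrcoef(X, rowvar=False)
    chi2 = -(n - 1 - (2 * p + 5) / 6) * np.log(np.linalg.det(R))
    df = p * (p - 1) / 2
    return chi2, df

# Synthetic correlated data: 200 cases, 4 variables sharing a common factor
rng = np.random.default_rng(0)
common = rng.normal(size=(200, 1))
X = common + 0.5 * rng.normal(size=(200, 4))
chi2, df = bartlett_sphericity(X)
print(chi2 > 0, df)  # a large chi-square with df = 6 suggests factorable data
```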
Steps in Factor Analysis:
The Correlation Matrix
– The Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy:
an index comparing the magnitudes of the observed correlation
coefficients to the magnitudes of the partial correlation coefficients.
The closer the KMO measure is to 1, the more sizeable the sampling adequacy
(.8 and higher are great, .7 is acceptable, .6 is mediocre, less than .5 is
unacceptable).
Reasonably large values are needed for a good factor analysis. Small KMO
values indicate that a factor analysis of the variables may not be a good idea.
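The KMO index can be computed directly: partial correlations come from the inverse of the correlation matrix, and KMO is the ratio of summed squared correlations to that sum plus the summed squared partials. A sketch on synthetic data:

```python
import numpy as np

def kmo_overall(X):
    """Overall Kaiser-Meyer-Olkin measure for a data matrix X (cases x variables)."""
    R = np.corrcoef(X, rowvar=False)
    inv = np.linalg.inv(R)
    # Partial correlation of i and j, controlling for all other variables
    d = np.sqrt(np.outer(np.diag(inv), np.diag(inv)))
    partial = -inv / d
    np.fill_diagonal(partial, 0.0)
    r = R.copy()
    np.fill_diagonal(r, 0.0)
    return (r**2).sum() / ((r**2).sum() + (partial**2).sum())

# Synthetic data with one strong common factor -> KMO should be respectable
rng = np.random.default_rng(1)
common = rng.normal(size=(300, 1))
X = common + 0.7 * rng.normal(size=(300, 5))
print(round(kmo_overall(X), 2))  # between 0 and 1; higher = more adequate
```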
Steps in Factor Analysis:
Factor Extraction
2nd Step: Factor extraction
The primary objective of this stage is to determine the factors.
Initial decisions can be made here about the number of factors underlying a
set of measured variables.
Estimates of initial factors are obtained using principal components analysis.
Principal components analysis is the most commonly used extraction
method. Other factor extraction methods include:
• Maximum likelihood method
• Principal axis factoring
• Alpha method
• Unweighted least squares method
• Generalized least squares method
• Image factoring
Steps in Factor Analysis:
Factor Extraction
In principal components analysis, linear combinations of the observed
variables are formed.
The 1st principal component is the combination that accounts for the
largest amount of variance in the sample (1st extracted factor).
The 2nd principal component accounts for the next largest amount of
variance and is uncorrelated with the first (2nd extracted factor).
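These components are the eigenvectors of the correlation matrix, ordered by eigenvalue. A NumPy sketch (the data are synthetic, for illustration only):

```python
import numpy as np

def pca_extract(X):
    """Principal components of the correlation matrix of X (cases x variables).

    Returns eigenvalues (descending), proportion of variance explained,
    and loadings (eigenvectors scaled by the square root of the eigenvalues).
    """
    R = np.corrcoef(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(R)       # ascending order
    order = np.argsort(eigvals)[::-1]          # flip to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained = eigvals / eigvals.sum()
    loadings = eigvecs * np.sqrt(eigvals)
    return eigvals, explained, loadings

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 1)) + 0.8 * rng.normal(size=(100, 6))
eigvals, explained, loadings = pca_extract(X)
# The eigenvalues sum to the number of variables; the first component
# explains the largest share (Kaiser rule: retain eigenvalues > 1)
print(round(float(eigvals.sum())), round(float(explained[0]), 2))
```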
Steps in Factor Analysis:
Factor Extraction
To decide how many factors to retain, we examine the Total Variance Explained table.
The determination of the number of factors is usually done by considering the
eigenvalues (e.g., component 2: eigenvalue 1.801, 18.011% of variance, cumulative
48.476%; component 3: eigenvalue 1.009, 10.091% of variance, cumulative 58.566%).
Steps in Factor Analysis:
Factor Extraction
The examination of the scree plot provides a visual of the total variance associated with each factor.
Steps in Factor Analysis:
Factor Extraction
Component Matrix using Principal Component Analysis
Component Matrix (a. 3 components extracted)
Item (loadings on Components 1, 2, 3):
I discussed my frustrations and feelings with person(s) in school: .771, -.271, .121
I tried to develop a step-by-step plan of action to remedy the problems: .545, .530, .264
I read, attended workshops, or sought some other educational approach to correct the problem: .398, .356, -.374
I tried to be emotionally honest with myself about the problems: .436, .441, -.368
I sought advice from others on how I should solve the problems: .705, -.362, .117
I took direct action to try to correct the problems: .074, .640, .443
I told someone I could trust about how I felt about the problems: .752, -.351, .081
I put aside other activities so that I could work to solve the problems: .225, .576, .272
Steps in Factor Analysis:
Factor Rotation
Un-rotated factors are typically not very interpretable (most factors are
correlated with many variables).
Steps in Factor Analysis:
Factor Rotation
The most popular rotational method is Varimax rotation.
Steps in Factor Analysis:
Factor Rotation
• Other common rotational method used include Oblique rotations which
yield correlated factors.
• Oblique rotations are less frequently used because their results are more
difficult to summarize.
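Varimax is an orthogonal rotation that maximizes the variance of squared loadings within each factor. The SVD-based iteration below is the standard algorithm; the loading matrix is hypothetical:

```python
import numpy as np

def varimax(loadings, gamma=1.0, max_iter=100, tol=1e-6):
    """Rotate a (variables x factors) loading matrix with the varimax criterion."""
    p, k = loadings.shape
    R = np.eye(k)          # accumulated orthogonal rotation
    d = 0.0
    for _ in range(max_iter):
        L = loadings @ R
        # Gradient of the varimax criterion, solved via SVD (standard algorithm)
        u, s, vt = np.linalg.svd(
            loadings.T @ (L**3 - (gamma / p) * L @ np.diag((L**2).sum(axis=0)))
        )
        R = u @ vt
        d_new = s.sum()
        if d_new < d * (1 + tol):
            break
        d = d_new
    return loadings @ R

# Hypothetical unrotated loadings for 4 items on 2 factors
L0 = np.array([[0.7, 0.3], [0.6, 0.4], [0.2, 0.8], [0.3, 0.7]])
L1 = varimax(L0)
# Orthogonal rotation preserves each item's communality (row sum of squares)
print(np.allclose((L0**2).sum(axis=1), (L1**2).sum(axis=1)))  # True
```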
Steps in Factor Analysis:
Factor Rotation
• A factor is interpreted or named by examining the largest values linking the factor
to the measured variables in the rotated factor matrix.
Rotated Component Matrix (a)
Item (loadings on Components 1, 2, 3):
I discussed my frustrations and feelings with person(s) in school: .803, .186, .050
I tried to develop a step-by-step plan of action to remedy the problems: .270, .304, .694
I read, attended workshops, or sought some other educational approach to correct the problem: .050, .633, .145
I tried to be emotionally honest with myself about the problems: .042, .685, .222
I sought advice from others on how I should solve the problems: .792, .117, -.038
I took direct action to try to correct the problems: -.120, -.023, .772
I told someone I could trust about how I felt about the problems: .815, .172, -.040
I put aside other activities so that I could work to solve the problems: -.014, .155, .657
Factor analysis
To find out:
• Reliability test
• Validity test (sample adequacy test)
• Communalities
• Loadings
• Variance explained (anti-image correlation matrix)
• Factors
• Whether the factors are independent
• Multiple regression to find out the significant factors
Obtaining a Factor Analysis
• Move variables/scale items to the Variables box.
• Factor extraction: when the variables are in the Variables box, select Extraction.
• When the Factor Extraction box appears, select Scree Plot.
• During factor extraction, keep the factor rotation default of None; press Continue.
• During factor rotation: decide on the number of factors based on the factor extraction phase, choose "Fixed number of factors," and enter the desired number of factors to extract. Under Rotation choose Varimax. Press Continue, then OK.
Cumulative percent of variance explained.
Conceptual model for unpremeditated consumers' electronic purchasing decision
(Constructs: socially-consciousness, e-CDMS, country of origin, unpremeditated consumer)
Factor Name and Statements | EFA (Reliability, Factor loading) | CFA (Reliability, Factor loading)
The well-known national brands are best for me to buy online. .917 .770 .874 .668
I buy national brands online as much as possible. .918 .716 Eliminated Eliminated
During online buying I prefer the brand relative to country of origin. .917 .681 .873 .778
I make my shopping fast through online purchasing. .916 .664 Eliminated Eliminated
I buy online after comparing the price with other service providers. .918 .796 Eliminated Eliminated
I carefully watch how much I spend during online buying. .921 .704 Eliminated Eliminated
I take the time to shop online carefully for best buy high price products. .918 .696 Eliminated Eliminated
I should plan my online shopping more carefully than I do. .920 .815 .878 .572
I am impulsive when purchasing online products. .919 .767 .878 .519
I can change my regularly online buying brands .917 .566 .874 .948
There are so many brands available online to choose from that I feel confused .919 .767 .874 .548
The more I learn about online products, the harder it seems to choose the best .918 .678 .871 .922
Sometimes I feel hard to choose which online store to shop. .918 .546 Eliminated Eliminated
Latent Variable and related items | Standardized Factor Loadings (>.70)* | Average Variance Extracted (>.50)* | Cronbach's alpha (α) (>.70)*
The well-known national brands are best for me to buy online. .668
During online buying I prefer the brand relative to country of origin. .778
There are so many online brands to choose from that often I feel confused .548
The more I learn about online products, the harder it seems to choose the best .922
I prefer to buy from online service provider companies that give something back to society .763
I am willing to pay extra for products and services to the companies that give back to society .661
* Indicates an acceptable level of reliability or validity. AVE: Average Variance Extracted, computed by summing the squared factor loadings and dividing by the number of items.
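As a worked check of the AVE formula in the footnote, using the four loadings of one construct shown above (.668, .778, .548, .922):

```python
# CFA loadings of four retained items (taken from the table above)
loadings = [0.668, 0.778, 0.548, 0.922]

# AVE = sum of squared standardized loadings / number of items
ave = sum(l**2 for l in loadings) / len(loadings)
print(round(ave, 2))  # 0.55, above the 0.50 acceptability threshold
```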
Correlation of latent variables and Discriminant validity
Constructs: Innovative Product, Brand-value consciousness, Trendy-Sophisticated, Country of Origin, Unpremeditated Consumer, Misperception by Over-choice, Socially Consciousness
Innovative Product (.80)
Diagonal in parentheses: square root of average variance from observed variables (items); off-diagonal: correlation between constructs
The measurement model
Findings from the model:
• Socially-consciousness: double benefits for e-service providers; companies benefit more without taking on additional investment; social consumer sharing
• Brand-consciousness: branded products pose a problem for new entrants; impact on unplanned behaviour of consumers
• Trendy-Sophisticated: novelty and innovative product features attract consumers more
Case study-1
• Factors affecting customer satisfaction in a
retail mall in Ghaziabad
• Data-1
Confirmatory factor analysis
The exploratory factor model
[Path diagram: factors ξ1 … ξq, with each observed variable x1–x8 loading on every factor, and error terms δ1–δ8]
(Source: http://en.wikipedia.org/wiki/Confirmatory_factor_analysis)
Why Confirmatory factor analysis?
[Path diagram: two correlated factors ξ1 and ξ2 (correlation φ21); x1–x4 load only on ξ1 and x5–x8 only on ξ2, with error terms δ1–δ8]
Evaluating Model Fit
• Note: For more details, please go through the study materials (CFA & SEM), especially paper 3 in SEM.
(Source: http://en.wikipedia.org/wiki/Confirmatory_factor_analysis)
Confirmatory factor analysis
Hair, J., Black, W., Babin, B., & Anderson, R. (2010). Multivariate data analysis (7th ed.). Upper Saddle River, NJ: Prentice-Hall.
Byrne, B. M. (2010). Structural equation modeling with AMOS: Basic concepts, applications and programming (2nd ed.).
Field, A. (2009). Discovering statistics using SPSS. Sage Publications.
Let us do an exercise, my dear friends
Interactive Graphical
Data Analysis through
Tableau Software
Dr. Manoj Kumar Dash
M.A., M.Phil., M.B.A., NET, Ph.D.
ABV-Indian Institute of Information Technology and Management, Gwalior
(An Autonomous Institute of the Government of India)
Agenda for Discussion
• Tableau Software (30 min)
• Tableau Software Demonstration (20 min)
• Visualization Software (10 min)
• Data Visualization (10 min)
• Big Data (10 min)
• Question and Answer (10 min)
Data Visualization
A graphical, animated, or video presentation of data and the results of data analysis.
– The ability to quickly identify important trends in corporate and market data can provide competitive advantage.
– Check the magnitude of trends by using predictive models that provide significant business advantages in applications that drive content, transactions, or processes.
01. Dygraphs
02. ZingChart
03. InstantAtlas
04. Timeline
05. Exhibit
06. Modest Maps
07. Leaflet
08. WolframAlpha
09. Visual.ly
10. Visualize Free
11. Better World Flux
12. FusionCharts
13. jqPlot
14. Highcharts
15. iCharts
Open Source Software for Data Visualization
• D3.js
• Plotly
• FusionCharts (http://www.fusioncharts.com)
• ChartBlocks
• Chart.js
• Flot
• Google Charts
• Raphaël
• Highcharts
• Visual.ly
• Leaflet
• Crossfilter
• Dygraphs
• Datawrapper
• Tangle
• Polymaps
• Tableau
• Raw
• Kartograph
• Timeline JS
• CartoDB
• Infogram
• NodeBox
• Ggobi
• Xmdv
• Weka
• Gephi
GARTNER MAGIC QUADRANT FOR BI
Contents to cover
• Step-1 Tableau Introduction
• Step-2 Connecting to Data
• Step-3 Building basic views
• Step-4 Data manipulations and Calculated
fields
• Step-5 Tableau Dashboards
• Step-6 Advanced Data Options
• Step-7 Advanced graph Options
What is Tableau
• Tableau is a rapid BI software
• Great visualizations: Allows anyone to
connect to data, visualize and create
interactive, sharable dashboards in a few clicks
• Ease of use: It's easy enough that any Excel
user can learn it, but powerful enough to
satisfy even the most complex analytical
problems.
• Fast: We can create parallelized dashboards,
quick filters and calculations
Venkata Reddy Konasani
Tableau's History (1991–2015)
• Three researchers at Stanford University (Chris, Pat, Chabot) started research to build a visual analysis tool, leading to the launch of Desktop 1.0.
• Named "Product of the Year" by PC Magazine; ranked #400 on the Inc 500 and #132 on the Deloitte Technology Fast 500.
• Achieved its 4th, 8th, and 20th consecutive quarters of record growth and sales; added 1,000 customer accounts, with customers in every industry.
• Launched Tableau Desktop 5.0, 7.1, 7.2, 7.3, 8.0, 8.1, 8.2, 8.3, and 9.0.1 through 9.0.5 (Desktop & Server).
Tableau product editions:
• Desktop: ad hoc analytics, dashboards, reports, and graphs; explore, visualize, and analyze your data; create dashboards to consolidate multiple views; deliver interactive data experiences.
• Server: a business intelligence solution that scales to organizations of all sizes; share visual analytics with anyone with a web browser; publish interactive analytics or dashboards; secure information and manage metadata; collaborate with others.
• Reader: share visualizations and dashboards on the desktop; filter, sort, and page through the views; "Acrobat for Data"; free download.
• Public: create and publish interactive visualizations and dashboards; embed in websites and blogs; free download and free hosting service.
Market Expectations
Strength of Tableau: Fast, Cost Effective, for Everyone
Dimensions and Measures
Lab
• Start Tableau
• Open a new workbook
• Add one additional sheet
• Identify data connection tab
• Can we connect to MySQL server?
• Can we connect to txt file?
• How to go back to workbook from connect to data window?
• Add a new dashboard
• Where are the various graph options available?
• Can we draw pie chart using Tableau?
Tableau Repository
• The Tableau repository holds workbooks, bookmarks, and data sources.
• It is located in a folder called My Tableau Repository inside your My
Documents folder.
1) Excel
2) Data Roles – Dimensions and Measures
3) Data Window and Right-click Options
4) Excel Visual With Modified Defaults
5) Tabular View
6) Show Me
7) Formatting Text Visuals
8) Geo-Coding
9) Geo-Coding Filled Maps
10) Scatter Charts
11) Time Series – Trend Lines
12) Visual Filtering
13) Sorting
14) Filtering
15) Map Filters
16) Percentages and Totals
17) First Dashboard
Section 3 – Intermediate
1) Data Roles and Options
2) Changing Data Roles
3) Maps
4) Dates and Times
5) Dates with Calculations –Gantt Chart
6) Grouping
7) Bins and Histograms
8) Sets within Scatter Charts
9) Concatenated Sets
10) Sorting with Sets
11) Quick Table Calculations
12) Secondary Table Calculations
13) Create Calculated Fields
14) Histogram with Running Totals
15) Trend Lines
16) Reference Lines
17) Performance
Section 4 – Advanced Features
1) Combination Charts
2) Trends and Motions
3) Data Blending
4) Parameters
5) Shipping Parameters
6) Area Charts
Section 5 – Masters
1) Heat Maps
2) Box Plots
3) Pareto (the 80–20 Rule)
4) Bullet Chart
5) Bar In A Bar
6) Standard Deviations
7) Reference Lines with Banding
8) Groups and Sets
9) Dashboard
Section 6 – Dashboards and
Guided Analytics
Example of a Survival Probability Graph
(Source: http://wpfau.blogspot.com/2011/08/safe-withdrawal-rates-and-life.html)
There is a 90% probability of surviving to the end of the 10th term (surviving = remaining enrolled!).
One minus the survival function: there is a 10% probability of not surviving to the end of the 10th term (not surviving = graduating!).
Survival analysis
Terms
− Events: what terminates an episode (such as churn, adoption
of an innovation), it is the change which causes the subject to
transition from one state to another.
− Durations: the number of time units an individual spends in a
given state.
− Dependent: probability of an event.
− Survival function, s(t): the cumulative proportion of the sample NOT
experiencing the event by time t. In other words, it is the probability
that the event will NOT have occurred by time t.
− Censored cases: data are censored if the event started before (left-
censored) or ends after (right-censored) the period of observation.
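The definition of s(t) above can be sketched in a few lines of Python. This is a hedged illustration: the durations are hypothetical, and censoring is ignored here for simplicity (real analyses use life tables or the Kaplan-Meier estimator, covered in the following sections).

```python
# A minimal sketch of the survival function s(t): the proportion of the
# sample NOT experiencing the event by time t. Durations are hypothetical.
def survival_at(durations, t):
    """Share of subjects whose event time is later than t."""
    return sum(1 for d in durations if d > t) / len(durations)

terms_enrolled = [5, 9, 14, 7, 8, 12, 10, 6, 11, 15]  # hypothetical terms until graduation
print(survival_at(terms_enrolled, 10))  # proportion still enrolled beyond term 10 -> 0.4
```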
Survival analysis
Censored cases
Survival analysis
Outline of topics
− Life tables
− Kaplan-Meier
− Cox regression
Life tables
Variables
− Time variable (duration variable): must be a continuous
variable.
− Status variable: binary or categorical variable, represents the
event of interest.
− Factor variable: categorical variable.
Assumptions
− The probability of the event of interest should depend only on time.
Cases that enter the study at different times should behave similarly.
− There should be no systematic differences between censored and
uncensored cases.
Life tables
Run analysis
Life tables
Click Options
SPSS Outputs
Clearly defined event: (death, onset of illness, recovery from illness, marriage,
birth, mechanical failure, success, job loss, employment, graduation).
Terminal event
Time variable = Time measured from the entry of a subject into the study until the
defined event. Months, terms, days, years, seconds.
Covariates:
To determine if different groups have different survival times
Regression models
Survival analysis – SPSS data layout
Basic student data:
• Time variable – terms enrolled
• Event status – graduation status (also the censored indicator: 0 = censored)
• A binary/dummy covariate and a continuous covariate (which can be grouped into categories)

Student | Terms Enrolled | Graduated | Dummy Covariate | GPA
Student 1 | 5 | 0 | 1 | 3.4
Student 2 | 9 | 1 | 0 | 4.0
Student 3 | 14 | 0 | 1 | 2.9
Student 4 | 7 | 1 | 1 | 3.9
Student 5 | 8 | 1 | 0 | 3.1
Cohort description: count of still-enrolled students at the start of each term.
Survival Analysis – the life table is the primary output of the SPSS survival analysis procedure.
There is a 90% probability of surviving to the end of the 10th term (surviving = remaining enrolled!).
One minus the survival function: there is a 10% probability of not surviving to the end of the 10th term (not surviving = graduating!).
Survival Analysis: SPSS, with Covariate (Factor = Gender)
SPSS: Analyze → Survival → Life Tables
Survival Pattern: SPSS will produce a different colored line for each of the
factor’s values
Second Approach
Kaplan-Meier Estimator
• The Kaplan-Meier estimator, independently
described by Edward Kaplan and Paul Meier
and jointly published in 1958 in the
Journal of the American Statistical Association,
is a non-parametric statistic that allows us to
estimate the survival function.
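The product-limit formula behind the Kaplan-Meier estimator can be sketched directly. This is a minimal illustration using the five-student data layout shown earlier, with graduation as the event; it is a sketch, not SPSS's implementation.

```python
def kaplan_meier(durations, events):
    """Product-limit (Kaplan-Meier) estimate: returns (t, S(t)) at each event time.
    events: 1 = event occurred (e.g. graduated), 0 = censored."""
    pairs = sorted(zip(durations, events))
    at_risk = len(pairs)
    survival = 1.0
    curve = []
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        deaths = removed = 0
        while i < len(pairs) and pairs[i][0] == t:
            deaths += pairs[i][1]
            removed += 1
            i += 1
        if deaths:  # the survival curve only drops at observed event times
            survival *= (at_risk - deaths) / at_risk
            curve.append((t, survival))
        at_risk -= removed  # censored cases also leave the risk set
    return curve

# Five students: terms enrolled and graduation status (0 = still enrolled, censored)
curve = kaplan_meier([5, 9, 14, 7, 8], [0, 1, 0, 1, 1])
```

Note that S(t) drops only at observed graduations; the censored students (terms 5 and 14) shrink the risk set without moving the curve.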
Kaplan-Meier procedure
Assumptions
− Probabilities for the event of interest should depend only on
time after the initial event without covariates effects.
− Cases that enter the study at different times (for example,
patients who begin treatment at different times) should
behave similarly.
− Censored and uncensored cases behave the same. If, for
example, many of the censored cases are patients with more
serious conditions, your results may be biased.
Survival Analysis: Kaplan-meier Method
Assumptions
Censored individual – student who has not experienced the
event (graduated) by the end of the study, e.g. they are no
longer enrolled
Check for differences between censored and non-censored
groups
KM Terms_enrolled BY Gender
  /STATUS=graduated(1)
  /PRINT TABLE MEAN
  /PLOT SURVIVAL
  /TEST LOGRANK BRESLOW TARONE
  /COMPARE OVERALL POOLED.
Kaplan-Meier Results – Gender
Breslow gives more weight to earlier graduations; Tarone-Ware is a mixture of the two.
Kaplan-Meier Results – Gender
Curves not
significantly
different at p < .05
Kaplan-Meier procedure
Variables
− Time variable (duration variable): must be a continuous
variable
− Status variable: categorical or continuous variable, represents
the event of interest (drug has effect or not).
− Factor variable: categorical variable, represents a causal effect
(type of treatment for example).
− Stratification variable: categorical variable.
Kaplan-Meier procedure
Analyze data
Kaplan-Meier procedure
Log rank: Tests equality of survival functions by weighting all time points the
same.
Breslow: Tests equality of survival functions by weighting all time points by the
number of cases at risk at each time point.
Tarone-Ware: Tests equality of survival functions by weighting all time points by
the square root of the number of cases at risk at each time point.
Kaplan-Meier procedure
Compare factor
Pooled over strata: a single test is computed for all factor levels, testing for
equality of survival function across all levels of the factor variable.
Pairwise over strata: a separate test is computed for each pair of factor levels
when a pooled test shows non-equality of survival functions.
For each stratum: a separate test is computed for each group formed by the
stratification variable.
Pairwise for each stratum: a separate test is computed for each pair of factor
variable, for each stratum of the stratification variable.
Kaplan-Meier procedure
Click Options
Kaplan-Meier procedure
Overall comparison
Cox regression
Terms
− Status variable: the dependent in Cox regression, should be
binary variable.
− Time variable: measures duration to the event defined by the
status variable (continuous or discrete).
− Covariates: independent/predictor variables. They can be
categorical or continuous. They also can be time-fixed or time-
dependent.
− Interaction terms
− Categorical covariates: SPSS automatically converts them into a set of
dummy variables, omitting one category.
Cox Regression: checking the proportional hazards assumption
SPSS: Analyze → Survival → Cox Regression (repeat for each factor!)
Use the log-minus-log function to check the proportional hazards assumption.
SPSS: Analyze → Survival → Cox Regression (move gender to the Covariates box)
COX REGRESSION MODEL RESULTS: EXAMPLE, GENDER
Interpretation of SPSS Cox regression results:
• The reference category is female because I made that choice for this model.
• It is not statistically significant at p < 0.05 that females and males have different survival curves.
• Exp(B) is the hazard ratio (female vs. male). The null hypothesis is that this ratio = 1; a p value is produced that indicates whether the difference between the curves is significant or not.
• There is a ~9% probability of continued enrollment.
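The Exp(B) column is simply the exponential of the fitted coefficient B. A small sketch with a hypothetical coefficient (the value 0.15 is assumed for illustration, not taken from the slide's output):

```python
import math

B = 0.15                     # hypothetical Cox coefficient for gender (female = reference)
hazard_ratio = math.exp(B)   # the Exp(B) column SPSS reports
# HR > 1 means the non-reference group experiences the event (graduation) at a
# higher rate; HR = 1 corresponds to the null hypothesis of no difference.
print(round(hazard_ratio, 3))  # -> 1.162
```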
Cox regression
Click Categorical
Cox regression
Click Plots
Cox regression
Click Options
Cox regression
SPSS Outputs
Cox regression
SPSS Outputs
Cox regression
SPSS Outputs
Newell, J., & Hyun, S. (2011). Survival probabilities with and without the use of censored failure times. Retrieved from https://www.uscupstate.edu/uploadedFiles/Academics/Undergraduate_Research/Reseach_Journal/2011_007_ARTICLE_NEWELL_HYUN.pdf
Singh, R., & Mukhopadhyay, K. (2011). Survival analysis in clinical trials: Basics and must know areas. Retrieved from http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3227332/t
Wiorkowski, J., Moses, A., & Redlinger, L. (2014). The use of survival analysis to compare student cohort data. Presented at the 2014 Conference of the Association for Institutional Research.
Conjoint Analysis
12/1/2018
Concept of Performing Conjoint Analysis
Steps in Conducting Conjoint Analysis
Step 1: Problem Formulation
Step 2: Trade-off Data Collection
Both methods, the pair-wise (two-factor) approach and the full-profile
approach, have their own utility, but the full-profile approach is the most
widely used.
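In the full-profile approach, each stimulus presented to respondents is a complete combination of attribute levels. A quick sketch with hypothetical television attributes (the attribute names and levels here are illustrative, not the chapter's example):

```python
from itertools import product

# Hypothetical attributes and levels for a full-profile design
attributes = {
    "screen_size": ["large", "small"],
    "price": ["low", "high"],
    "brand": ["A", "B", "C"],
}
# Every combination of levels is one full profile to be ranked or rated
profiles = [dict(zip(attributes, combo)) for combo in product(*attributes.values())]
print(len(profiles))  # 2 * 2 * 3 = 12 profiles
```

In practice, a fractional factorial design is used to cut down the number of profiles respondents must evaluate.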
Step 3: Metric Versus Non-Metric Input Data
• Conjoint analysis data can take both forms: metric and non-metric.
• For non-metric data, respondents provide rankings; for metric data, they
provide ratings.
• The rating approach has gained popularity in recent years. In conjoint
analysis, the dependent variable is consumer preference or intention to buy
a product (the rating or ranking customers provide for buying a product).
In the colour television example, ratings are obtained on a 7-point Likert
scale with 1 as "not preferred" and 7 as "highly preferred".
• These ratings are given in Table 17.11.
TABLE 17.11: Selected profiles of the colour television example
TABLE 17.12: The colour television data converted into dummy variables for
applying the regression technique
Using any statistical software, the regression equation can then be estimated.
Step 4: Result Analysis and Interpretation
SPSS output (multiple regression) for the conjoint problem
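The dummy-variable regression step can be sketched with a tiny hypothetical dataset: four profiles described by two dummy-coded attributes. The numbers are illustrative, not those of Table 17.12.

```python
import numpy as np

# Columns: intercept, large-screen dummy, low-price dummy (hypothetical coding)
X = np.array([
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 1],
    [1, 0, 0],
], dtype=float)
y = np.array([7.0, 5.0, 4.0, 2.0])  # preference ratings on a 7-point scale

# Least squares gives the part-worth (utility) contribution of each level
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)  # intercept and the two part-worths
```

The fitted coefficients are the part-worths: here the large screen adds 3 points and the low price 2 points to the predicted preference.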
Step 5: Reliability and Validity Check
Assumptions and Limitations of Conjoint Analysis
Multi Dimensional Scaling
DR MANOJ KUMAR DASH
FIGURE : Perceptual map (two dimensional) with labelling of
dimensions
Example of MDS output – holiday destinations in two dimensions
[Common-space plot: London, Paris, Berlin, Amsterdam, and Rome plotted against Dimension 2, interpreted as how trendy the city is.]
• Each respondent is asked to rank the cities, without necessarily specifying why one city was preferred to another.
• Similarities in rankings across respondents determine the positions on the map.
Example of brand positioning
The two dimensions are the output of some reduction technique:
- PCA or FA for interval (metric) data
- correspondence analysis for non-metric data
Coordinates for brands are obtained by running PCA (or FA) on sensory
assessments (usually through a panel of experts unless objective measures
exist).
Consumer positions (as individuals or as segments) can be defined in two ways:
1) using their "ideal brand" characteristics
2) by translating preference rankings for brands into coordinates through
unfolding
Brand positioning
• The product should be healthy: both segments A and C like that dimension.
• There is room for a new product for segment C, also close to segment A.
• The thicker the product, the closer it is to C compared to A.
• Brand five survives because of segment C, but it is far from C's preferences.
• Consumer segment B is close to Brand three; segment A chooses Brand three too, but it is not that close.
• Brands 1 and 4 are perceived as similar; consumer segment D is happy with Brand 2.
Brand repositioning: if brand five had this marketing research information, one could improve one's performance by enhancing the perceived healthiness of the product (e.g. reducing the salt content and running a targeted advertising campaign). This would move brand five closer to segment C.
Plots
Joint plot
According to the sample, basketball, baseball and cricket share similarities in subjects' perceptions, as do American football, motor sports and ice hockey. A third "cluster" comprises handball, waterpolo and volleyball, while football seems to be equidistant from all other sports. Consumers are also grouped into clusters according to their preferences, and the joint representation shows not only which sports (products) are closer to the preferences of different segments, but also which sports need to be repositioned to attract a larger public, such as the cluster with volleyball, waterpolo and handball.
MULTIDIMENSIONAL SCALING
Perceptions Preferences
• The input data used for multidimensional scaling may be similarity data or
preference data.
• Similarity data: collected from respondents, who note the perceived
similarity between two brands or objects. These data are often referred to
as similarity judgments. Figure 1 shows a respondent's similarity judgment
for a pair of edible oil brands (Fortune–Saffola).
• Alternatively, the derived approach to data collection for multidimensional
scaling can be used: respondents rate the brands on the identified
attributes using a rating scale. Responses obtained from a single
respondent are summarized in Table 18.5.
Step 2: Input Data Collection (Cont.)
Table 21.1: Similarity Ratings of Toothpaste Brands (lower-triangular matrix)
Aqua-Fresh Crest Colgate Aim Gleem Plus White Ultra Brite Close-Up Pepsodent Sensodyne
Aqua-Fresh
Crest 5
Colgate 6 7
Aim 4 6 6
Gleem 2 3 4 5
Plus White 3 3 4 4 5
Ultra Brite 2 2 2 3 5 5
Close-Up 2 2 2 2 6 5 6
Pepsodent 2 2 2 2 6 6 7 6
Sensodyne 1 2 4 2 4 3 3 4 3
FIGURE : SPSS output exhibiting iteration for stress value improvement, stress
value, and R2 value
Step 5: Substantive Interpretation (Cont.)
• As a first step of checking reliability and validity of the model, the value of
R2 must be examined. As discussed, an R2 value greater than or equal to
60% is considered acceptable.
• In edible oil multidimensional scaling model, R2 value comes to 0.9707
(97.07%), which is very close to 1 and hence the model is very well
acceptable.
• As a second step, stress value must be examined.
• In edible oil multidimensional scaling model, stress value comes to 0.0746
(close to 5%). This is an indication of a good-fit multidimensional scaling
model.
• Original data should be divided in two or parts and obtained results must
be compared.
• Input data must be gathered at two different points of time and test–
retest reliability must be computed.
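Kruskal's stress compares the inter-point distances of the fitted configuration with the disparities (transformed proximities). A minimal sketch with made-up values for three object pairs:

```python
import math

def kruskal_stress1(distances, disparities):
    """Kruskal's Stress-1 = sqrt( sum (d_ij - dhat_ij)^2 / sum d_ij^2 )."""
    num = sum((d - dh) ** 2 for d, dh in zip(distances, disparities))
    den = sum(d ** 2 for d in distances)
    return math.sqrt(num / den)

# Hypothetical fitted distances vs. disparities
stress = kruskal_stress1([1.0, 2.0, 3.0], [1.1, 1.9, 3.1])
print(round(stress, 3))  # small values (roughly below 0.1) indicate a good fit
```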
The MDS data set
IPM & unfolding
Unfolding
Proximities are defined from the subjects' preference rankings.
• Select "identity" as the data come from a single source.
• Rankings are dissimilarities and ordinal data.
• Specify the number of dimensions to be explored.
Options
• Convergence criterion for the STRESS function.
• The final common space shows subjects and objects on the same plot.
• Applies different colors or markers to different objects.
Outputs
• Output tables can be selected here.
• Output coordinates (distances) can be saved into a new file.
Unfolding output

Measures:
• Iterations: 992
• Final Function Value: .3835645 (Stress Part .0410912; Penalty Part 3.5803705)
• Badness of Fit: Normalized Stress .0016885; Kruskal's Stress-I .0410908; Kruskal's Stress-II .1905153; Young's S-Stress-I .0720164; Young's S-Stress-II .1781156
• Goodness of Fit: Dispersion Accounted For .9983115; Variance Accounted For .9666225; Recovered Preference Orders .8471837; Spearman's Rho .8617494; Kendall's Tau-b .7273984
• Variation Coefficients: Variation Proximities .5043544; Variation Transformed Proximities .3322572; Variation Distances .5071630
• Degeneracy Indices: Sum-of-Squares of DeSarbo's Intermixedness Indices .4694185; Shepard's Rough Nondegeneracy Index .5609796

The final STRESS-I value of 0.04 is acceptable. Other measures of "badness-of-fit" and "goodness-of-fit" are provided and confirm that the results are acceptable.
The variation coefficient of the transformed proximities can be used to check for the risk of degenerated solutions (points too close to each other). In this case, the variation coefficient of the transformed proximities is 0.33, compared to 0.50 for the original ones, which means that most of the variability is retained after transformation. Furthermore, the distances show a variability more or less equal to the original one, indicating that the points in space should be scattered enough to reflect the initial distances.
DeSarbo's Intermixedness index and Shepard's RNI also provide warning signals for degenerated solutions: the former should be as close to zero as possible and the latter as close to one as possible. There are no strong signals of a degenerated solution. One may wish to try different parameters for the penalty term to see whether these indicators improve.
Repositioning
• If one can attach a meaning to dimensions one and two, it becomes possible
to understand what characteristics of the products should be changed.
• A method to obtain an interpretation of the coordinates consists in looking
at the correlations between the coordinates of the sports and the object
characteristics that can be measured objectively or through the evaluation
of expert panellists.
• The algorithm has created an output file, coord.sav, which contains the two
coordinates for each sport and consumer and can be used to obtain the
bivariate correlations.
Using SPSS for Multidimensional Scaling
Dr Manoj Dash
Learning Objectives
• Understand basic concepts in decision tree
modeling
• Understand how decision trees can be used
to solve classification problems
• Understand the risk of model over-fitting
and the need to control it via pruning
• Evaluate the performance of a classification model using training,
validation and test datasets
Decision Tree Induction
• Many Algorithms:
– Hunt’s Algorithm (one of the earliest)
– CART
– ID3, C4.5
– SLIQ, SPRINT
What is CART?
• Classification And Regression Trees
• Developed by Breiman, Friedman, Olshen, and Stone in the early 1980s.
– Introduced tree-based modeling into the statistical mainstream
– Rigorous approach involving cross-validation to select the optimal tree
• One of many tree-based modeling techniques.
– CART -- the classic
– CHAID
– C5.0
– Software package variants (SPSS)
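CART chooses splits by impurity reduction, typically using the Gini index. A compact sketch of that criterion (the labels mirror the 7 "No" / 3 "Yes" Cheat example that follows; the split shown is the Refund attribute):

```python
def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def split_gain(parent, left, right):
    """Impurity reduction achieved by a candidate binary split."""
    n = len(parent)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted

parent = ["No"] * 7 + ["Yes"] * 3       # 7 non-cheaters, 3 cheaters
left = ["No"] * 3                       # Refund = Yes branch (all non-cheaters)
right = ["No"] * 4 + ["Yes"] * 3        # Refund = No branch
print(round(split_gain(parent, left, right), 2))  # -> 0.08
```

CART evaluates this gain for every candidate split and picks the best one at each node.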
EXAMPLE:
A bank wants to categorize credit applicants according
to whether or not they represent a reasonable credit
risk. Based on various factors, including the known
credit ratings of past customers, you can build a model
to predict if future customers are likely to default on
their loans.
A tree-based analysis provides some attractive features:
a) It allows you to identify homogeneous groups with
high or low risk.
b) It makes it easy to construct rules for making
predictions about individual cases.
The Decision Tree procedure creates a tree-based
classification model. It classifies cases into groups or
predicts values of a dependent (target) variable
based on values of independent (predictor)
variables. The procedure provides validation tools for
exploratory and confirmatory classification analysis.
Classification Trees (cont.)
• Business marketing: predict whether a person will buy a computer.
Illustrating Classification Task

Training set (Learn Model):
Tid | Attrib1 | Attrib2 | Attrib3 | Class
1 | Yes | Large | 125K | No
2 | No | Medium | 100K | No
3 | No | Small | 70K | No
4 | Yes | Medium | 120K | No
5 | No | Large | 95K | Yes
6 | No | Medium | 60K | No
7 | Yes | Large | 220K | No
8 | No | Small | 85K | Yes
9 | No | Medium | 75K | No
10 | No | Small | 90K | Yes

Test set (Apply Model):
Tid | Attrib1 | Attrib2 | Attrib3 | Class
11 | No | Small | 55K | ?
12 | Yes | Medium | 80K | ?
13 | Yes | Large | 110K | ?
14 | No | Small | 95K | ?
15 | No | Large | 67K | ?
Example of a Decision Tree

Training data:
Tid | Refund | Marital Status | Taxable Income | Cheat
1 | Yes | Single | 125K | No
2 | No | Married | 100K | No
3 | No | Single | 70K | No
4 | Yes | Married | 120K | No
5 | No | Divorced | 95K | Yes
6 | No | Married | 60K | No
7 | Yes | Divorced | 220K | No
8 | No | Single | 85K | Yes
9 | No | Married | 75K | No
10 | No | Single | 90K | Yes

One tree that fits these data (splitting attributes: Refund, MarSt, TaxInc):
Refund = Yes → NO
Refund = No → MarSt:
  MarSt = Married → NO
  MarSt = Single or Divorced → TaxInc:
    TaxInc < 80K → NO
    TaxInc > 80K → YES

There could be more than one tree that fits the same data (e.g. one splitting first on MarSt)!
Decision Tree Classification Task
The tree learned from the training data is then applied to unseen records.
Apply Model to Test Data
Test record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
Walking the tree: Refund = No → go to the MarSt node; MarSt = Married → leaf NO. Assign Cheat to "No".
Hunt's Algorithm
The tree is grown recursively on the training records (Tid 1–10 above): initially all cases form a single node labelled with the majority class (Don't Cheat). Splitting on Refund sends the Yes branch to a "Don't Cheat" leaf, while the No branch is split further until the nodes separate Cheat from Don't Cheat.
Steps to Follow:
1. Select a dependent variable.
2. Select one or more independent variables.
3. Select a growing method.
4. Optionally, you can:
a) Change the measurement level for any variable in the source list.
b) Force the first variable in the independent variables list into the model
as the first split variable.
5. Select an influence variable that defines how much influence a case has on
the tree-growing process. Cases with lower influence values have less
influence; cases with higher values have more. Influence variable values
must be positive.
6. Validate the tree.
7. Customize the tree-growing criteria.
8. Save terminal node numbers, predicted values, and predicted probabilities
as variables.
9. Save the model in XML (PMML) format.
EXAMPLE:
A bank wants to categorize credit applicants according
to whether or not they represent a reasonable credit
risk. Based on various factors, including the known
credit ratings of past customers, you can build a model
to predict if future customers are likely to default on
their loans.
A tree-based analysis provides some attractive features:
a)It allows you to identify homogeneous groups with
high or low risk.
b) It makes it easy to construct rules for making
predictions about individual cases.
The tree diagram is a graphic representation of the tree model. This
tree diagram shows that:
Using the CHAID method, income level is the best predictor of credit
rating.
a. For the low income category, income level is the only significant
predictor of credit rating. Of the bank customers in this category, 82%
have defaulted on loans. Since there are no child nodes below it, this is
considered a terminal node.
b. For the medium and high income categories, the next best predictor
is number of credit cards.
c. For medium income customers with five or more credit cards, the
model includes one more predictor: age. Over 80% of those customers
28 or younger have a bad credit rating, while slightly less than half of
those over 28 have a bad credit rating.
The tree table, as the name suggests, provides most of the essential tree diagram
information in the
form of a table. For each node, the table displays:
a. The number and percentage of cases in each category of the dependent variable.
b. The predicted category for the dependent variable. In this example, the predicted
category is the credit rating category with more than 50% of cases in that node, since
there are only two possible credit ratings.
c. The parent node for each node in the tree. Note that node 1—the low income level
node—is not the parent node of any node. Since it is a terminal node, it has no child
nodes.
The gains for nodes table provides a summary of information about the terminal nodes in the model.
a. Only the terminal nodes—nodes at which the tree stops growing—are listed in this table.
b. Since gain values provide information about target categories, this table is available only if you specified one
or more target categories. In this example, there is only one target category, so there is only one gains for nodes
table.
c. Node N is the number of cases in each terminal node, and Node Percent is the percentage of the total number
of cases in each node.
d. Gain N is the number of cases in each terminal node in the target category, and Gain Percent is the percentage
of cases in the target category with respect to the overall number of cases in the target category—in this
example, the number and percentage of cases with a bad credit rating.
e. For categorical dependent variables, Response is the percentage of cases in the node in the specified target
category. In this example, these are the same percentages displayed for the Bad category in the tree diagram.
f. For categorical dependent variables, Index is the ratio of the response percentage for the target category
compared to the response percentage for the entire sample.
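The Index column can be recomputed by hand from the node and sample counts. A small sketch with hypothetical numbers (82 of 100 node cases in the target category, against 41 of 200 overall; these counts are illustrative):

```python
def node_index(node_target, node_n, total_target, total_n):
    """Gains-table Index: node response % for the target category divided by
    the overall response % for the whole sample."""
    response = node_target / node_n      # e.g. share of 'bad credit' in the node
    overall = total_target / total_n     # share of 'bad credit' overall
    return response / overall

# Hypothetical terminal node vs. whole sample
print(node_index(82, 100, 41, 200))  # the node concentrates the target at 4x the base rate
```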
Case
EXPERIMENT
Cross validation
1. Cross validation divides the sample into a number of subsamples, or folds. Tree models are then generated, excluding the data from each subsample in turn.
2. The first tree is based on all of the cases except those in the first sample fold, the second tree is based on all of the cases except those in the second sample fold, and so on.
3. For each tree, misclassification risk is estimated by applying the tree to the subsample excluded in generating it.
4. You can specify a maximum of 25 sample folds. The higher the value, the fewer the number of cases excluded for each tree model.
5. Cross validation produces a single, final tree model. The cross-validated risk estimate for the final tree is calculated as the average of the risks for all of the trees.
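The five steps above can be sketched with scikit-learn's k-fold cross validation; this is an illustrative assumption, since the text describes SPSS's tree procedure, but the risk-estimation logic is the same.

```python
# Cross-validated risk estimate for a classification tree: a minimal
# scikit-learn sketch (an assumption; the text describes SPSS).
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Each of the 10 folds is held out in turn while a tree is grown on
# the remaining cases, mirroring steps 1-3 above.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)

# Step 5: the cross-validated risk is the average misclassification
# rate across the 10 held-out folds.
risk = 1 - scores.mean()
```

Raising `cv` toward the maximum of 25 folds excludes fewer cases per tree, at the cost of fitting more trees.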
Split-Sample Validation
With split-sample validation, the model is generated using a training sample and tested
on a hold-out sample.
a. You can specify a training sample size, expressed as a percentage of the total sample size, or a variable that splits the sample into training and testing samples.
b. If you use a variable to define training and testing samples, cases with a value of 1 for the variable are assigned to the training sample, and all other cases are assigned to the testing sample. The variable cannot be the dependent variable, weight variable, influence variable, or a forced independent variable.
c. You can display results for both the training and testing samples or just the testing sample.
d. Split-sample validation should be used with caution on small data files. Small training samples may yield poor models, since there may not be enough cases in some categories to adequately grow the tree.
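A hold-out split of this kind can be sketched as follows; scikit-learn is an assumption here, standing in for SPSS's split-sample option.

```python
# Split-sample (hold-out) validation: a minimal scikit-learn sketch
# (an assumption; the text describes SPSS). The tree is grown on the
# training sample and its risk is assessed on the hold-out sample.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 70% training / 30% testing, analogous to specifying the training
# sample size as a percentage of the total sample.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=0, stratify=y)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Risk estimate on the hold-out sample = 1 - accuracy.
test_risk = 1 - tree.score(X_test, y_test)
```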
The Growth Limits tab allows you to limit the number of levels in the tree and control the minimum number of cases for parent and child nodes.
Maximum Tree Depth. Controls the maximum number of levels of growth beneath the root node. The Automatic setting limits the tree to three levels beneath the root node for the CHAID and Exhaustive CHAID methods and five levels for the CRT and QUEST methods.
For the CHAID and Exhaustive CHAID methods, you can control the significance level for splitting nodes and merging categories. For both criteria, the default significance level is 0.05.
a. For splitting nodes, the value must be greater than 0 and less than 1. Lower values tend to produce trees with fewer nodes.
b. For merging categories, the value must be greater than 0 and less than or equal to 1. To prevent merging of categories, specify a value of 1. For a scale independent variable, this means that the number of categories for the variable in the final tree is the specified number of intervals (the default is 10).
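How a significance level drives a CHAID-style split decision can be illustrated with a chi-square test of independence; this is a simplified sketch with hypothetical counts, not SPSS's exact CHAID algorithm.

```python
# A CHAID-style split decision: split only if the predictor is
# significantly associated with the outcome (illustrative sketch,
# not SPSS's exact algorithm; the counts are hypothetical).
from scipy.stats import chi2_contingency

# Hypothetical cross-tabulation: rows = two predictor categories,
# columns = Good / Bad credit-rating counts.
table = [[80, 20],
         [40, 60]]

chi2, p_value, dof, expected = chi2_contingency(table)

# Split at the default 0.05 significance level; lowering alpha
# tends to produce trees with fewer nodes.
alpha = 0.05
split_node = p_value < alpha
```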
In CHAID analysis, scale independent (predictor) variables are always banded into discrete groups (for example, 0–10, 11–20, 21–30, etc.) prior to analysis. You can control the initial/maximum number of groups (although the procedure may merge contiguous groups after the initial split):
a. Fixed number. All scale independent variables are initially banded into the same number of groups. The default is 10.
b. Custom. Each scale independent variable is initially banded into the number of groups specified for that variable.
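The banding step can be sketched with pandas; SPSS performs this internally, so the code below is an assumption for illustration, using 5 bands instead of the default 10.

```python
# Banding a scale predictor into discrete groups before a
# CHAID-style analysis: a minimal pandas sketch (an assumption;
# SPSS performs this banding internally).
import pandas as pd

age = pd.Series([18, 23, 27, 34, 41, 45, 52, 58, 63, 70])

# Fixed number of initial equal-width bands (5 here; the SPSS
# default is 10). Contiguous bands may later be merged.
bands = pd.cut(age, bins=5)

counts = bands.value_counts().sort_index()
```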
The extent to which a node does not represent a homogeneous subset of cases is an indication of impurity. For example, a terminal node in which all cases have the same value for the dependent variable is a homogeneous node that requires no further splitting because it is “pure.” You can select the method used to measure impurity and the minimum decrease in impurity required to split nodes.
Impurity Measure. For scale dependent variables, the least-squared deviation (LSD) measure of impurity is used. It is computed as the within-node variance, adjusted for any frequency weights or influence values.
For categorical (nominal, ordinal) dependent variables, you can
select the impurity measure:
Gini. Splits are found that maximize the homogeneity of child
nodes with respect to the value of the dependent variable. Gini is
based on squared probabilities of membership for each category of
the dependent variable. It reaches its minimum (zero) when all cases
in a node fall into a single category. This is the default measure.
Twoing. Categories of the dependent variable are grouped into two
subclasses. Splits are found that best separate the two groups.
Ordered twoing. Similar to twoing except that only adjacent
categories can be grouped. This measure is available only for ordinal
dependent variables.
Minimum change in improvement. This is the minimum decrease in
impurity required to split a node. The default is 0.0001. Higher
values tend to produce trees with fewer nodes.
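The default Gini measure described above is simple to compute directly from a node's category counts:

```python
# Gini impurity of a node, the default CRT impurity measure:
# 1 - sum of squared membership probabilities over the categories.
def gini(counts):
    """Gini impurity from category counts in a node."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

# A pure node (all cases in one category) has impurity 0.
pure = gini([50, 0])        # -> 0.0

# A maximally mixed two-category node has impurity 0.5.
mixed = gini([25, 25])      # -> 0.5
```

Splits are chosen to maximize the decrease in this quantity; if the best available decrease falls below the minimum change in improvement (default 0.0001), the node is not split.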
Saved Variables
Terminal node number. The terminal node to which each case is assigned. The value is the tree node number.
Predicted value. The class (group) or value for the dependent variable predicted by the model.
Predicted probabilities. The probability associated with the model’s prediction. One variable is saved for each category of the dependent variable. Not available for scale dependent variables.
Sample assignment (training/testing). For split-sample validation, this variable indicates whether a case was used in the training or testing sample. The value is 1 for the training sample and 0 for the testing sample. Not available unless you have selected split-sample validation.
Export Tree Model as XML
You can save the entire tree model in XML (PMML) format. You can use this model file to apply the model information to other data files for scoring purposes.
Training sample. Writes the model to the specified file. For split-sample validated trees, this is the model for the training sample.
Test sample. Writes the model for the test sample to the specified file. Not available unless you have selected split-sample validation.
OUTPUT
Tree. By default, the tree diagram is included in the output displayed in the Viewer.
Deselect (uncheck) this option to exclude the tree diagram from the output.
Display. These options control the initial appearance of the tree diagram in the Viewer.
All of these attributes can also be modified by editing the generated tree.
a. Orientation. The tree can be displayed top down with the root node at the top, left to
right, or right to left.
b. Node contents. Nodes can display tables, charts, or both. For categorical dependent
variables, tables display frequency counts and percentages, and the charts are bar charts.
For scale dependent variables, tables display means, standard deviations, number of
cases, and predicted values, and the charts are histograms.
c. Scale. By default, large trees are automatically scaled down in an attempt to fit the
tree on the page. You can specify a custom scale percentage of up to 200%.
d. Independent variable statistics. For CHAID and Exhaustive CHAID, statistics include F
value (for scale dependent variables) or chi-square value (for categorical dependent
variables) as well as significance value and degrees of freedom. For CRT, the
improvement value is shown. For QUEST, F, significance value, and degrees of freedom
are shown for scale and ordinal independent variables; for nominal independent
variables, chi-square, significance value, and degrees of freedom are shown.
Risk. Risk estimate and its standard error; a measure of the tree’s predictive accuracy.
a. For categorical dependent variables, the risk estimate is the proportion of cases incorrectly classified after adjustment for prior probabilities and misclassification costs.
b. For scale dependent variables, the risk estimate is the within-node variance.
Classification table. For categorical (nominal, ordinal) dependent variables, this table shows the number of cases classified correctly and incorrectly for each category of the dependent variable. Not available for scale dependent variables.
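The classification table and the categorical risk estimate can be sketched with a confusion matrix; scikit-learn and the hypothetical ratings below are assumptions for illustration (equal priors and misclassification costs).

```python
# Classification table and risk estimate: a sketch using a
# confusion matrix (an assumption; the text describes SPSS output,
# and the observed/predicted ratings here are hypothetical).
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = ["Good", "Good", "Bad", "Bad", "Good", "Bad", "Good", "Good"]
y_pred = ["Good", "Bad", "Bad", "Bad", "Good", "Good", "Good", "Good"]

# Rows = observed category, columns = predicted category.
table = confusion_matrix(y_true, y_pred, labels=["Good", "Bad"])

# Risk estimate (equal priors and costs) = proportion misclassified:
# off-diagonal cases divided by all cases.
risk = 1 - np.trace(table) / table.sum()
```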
Gain. Gain is the percentage of total cases in the target category in each node, computed as: (node target n / total target n) x 100. The gains chart is a line chart of cumulative percentile gains, computed as: (cumulative percentile target n / total target n) x 100. A separate line chart is produced for each target category. Available only for categorical dependent variables with defined target categories. The gains chart plots the same values that you would see in the Gain Percent column in the gains for percentiles table, which also reports cumulative values.
Index. Index is the ratio of the node response percentage for the target category compared to the overall target category response percentage for the entire sample.
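The Gain, Response, and Index definitions above can be worked through directly; the node and sample counts below are illustrative, not taken from the credit-rating example.

```python
# Gain, Response, and Index for one terminal node, computed exactly
# as defined above (the counts are illustrative assumptions).
node_n = 100            # cases in the terminal node
node_target_n = 40      # target-category (e.g. Bad) cases in the node
total_n = 1000          # cases in the whole sample
total_target_n = 200    # target-category cases in the whole sample

# Gain: the node's share of all target-category cases.
gain_pct = node_target_n / total_target_n * 100

# Response: percentage of the node's cases in the target category.
response_pct = node_target_n / node_n * 100

# Index: node response relative to the overall response
# (200 / 1000 = 20% here), expressed as a percentage.
overall_response_pct = total_target_n / total_n * 100
index_pct = response_pct / overall_response_pct * 100
```

With these counts the node captures 20% of all target cases (Gain), 40% of its own cases are in the target category (Response), and it responds at twice the sample-wide rate (Index = 200%).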