July 2014
Slide 2
Exercises
- Self-study on boxplots
- Data transformation
- Checking the dataset
Slide 3
Slide 4
Exercises
- Regression Analysis
- Analysis of Variance (ANOVA)
Table of Contents

Some notes about …
  Types of Scales
    Nominal scale
    Ordinal scale
    Metric scales (interval and ratio scales)
    Hierarchy of scales
    Properties of scales
    Summary: Types of scales
  Exercise in class: Scales
  Distributions
    Measure of the shape of a distribution
  Transformation of data
    Why transform data?
    Type of transformation
    Linear transformation
    Logarithmic transformation
    Summary: Data transformation
  Data trimming
    Finding outliers and extremes
    Boxplot
    Boxplot and error bars
    Q-Q plot
    Example

Linear Regression
  Example
  General purpose of regression
  Key Steps in Regression Analysis
  Regression model
    Mathematical model
    Stochastic model
  Gauss-Markov Theorem, Independence and Normal Distribution
  Regression analysis with SPSS: Some examples
    Simple example (EXAMPLE02)
    Step 1: Formulation of the model
    Step 2: Estimation of the model
    Step 3: Verification of the model
    Step 3: Verification of the model: t-tests
    Step 6: Interpretation of the model
    Back to Step 3: Verification of the model
    Step 5: Testing of assumptions
    Violation of the homoscedasticity assumption
  Multiple regression
    Many similarities with simple Regression Analysis from above
    What is new?
  Multicollinearity
    Outline
    How to identify multicollinearity
  Multiple regression analysis with SPSS: Some detailed examples
    Example of multiple regression (EXAMPLE04)
    Step 1: Formulation of the model
    Step 3: Verification of the model (without dummy for gender)
    SPSS Output regression analysis (EXAMPLE04)
    Dummy coding of categorical variables
    Gender as dummy variable
    Step 1: Formulation of the model (with dummy for gender)
    Step 3: Verification of the model (with dummy for gender)
    SPSS Output regression analysis (EXAMPLE04)
    Example of multicollinearity
    Step 1: Formulation of the model
    SPSS Output regression analysis (Example of multicollinearity) I

Exercises 03: ANOVA
Types of scales: Categorical (SPSS: Ordinal, Nominal) vs. Metric (SPSS: Scale)

Stevens, S.S. (1946): On the Theory of Scales of Measurement. Science, 103(2684), 677-680.
Slide 10
Nominal scale
Consists of "names" (categories). Names have no specific order.
Must be measured with an unique (statistical) procedure.
Each category is assigned a number (code can be arbitrary but must be unique).
Ordinal scale
Consists of a series of values
Each category is associated with a number which represents the category's order.
The Likert scale (rating scale) is a special kind of ordinal scale.
Slide 12
[Table: hierarchy of scales. Categorical scales (nominal, ordinal) vs. metric scales (interval, ratio); the nominal level supports only statements of (in)equality (=, ≠).]
Slide 14
Properties of scales
Statistical analysis assumes that the variables have specific levels of measurement.
Variables measured on a nominal or ordinal scale are also called categorical variables.
Exact measurements on a metric scale are statistically preferable.
Slide 16
Distributions
Get a visual impression. Source: http://en.wikipedia.org (date of access: July 2014)
Normal: widely used in statistics (statistical inference).
Poisson: law of rare events (origin 1898: number of soldiers killed by horse-kicks each year).
Exponential: queuing model (e.g. average time spent in a queue).
Pareto: allocation of wealth among individuals of a society ("80-20 rule").
Slide 18
Analyze → Descriptive Statistics → Frequencies...
Measure of the shape of a distribution
Example
Dataset "Data_07.sav" (Tschernobyl fallout of radioactivity, measured in becquerel)
BQ LNBQ
N Valid 23 23
Missing 0 0
Skewness 2.588 .224
Std. Error of Skewness .481 .481
Kurtosis 7.552 -.778
Std. Error of Kurtosis .935 .935
Logarithmic transformation
COMPUTE lnbq = LN(bq).
FREQUENCIES bq lnbq.
Slide 20
Transformation of data
Why transform data?
1. Many statistical models require that the variables (in fact: the errors) are
approximately normally distributed.
2. Linear least squares regression assumes that the relationship between two variables is linear.
Often we can "straighten" a non-linear relationship by transforming the variables.
3. In some cases it can help you better examine a distribution.
Type of transformation
Linear Transformation
Does not change shape of distribution.
Non-linear Transformation
Changes shape of distribution.
Analyze → Descriptive Statistics → Descriptives...
Slide 21
Linear transformation
Transformation rule (z-standardization):

$$z_i = \frac{x_i - \bar{x}}{s}$$

($\bar{x}$ = mean of sample, $s$ = standard deviation of sample)
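In SPSS, such z-scores can be saved directly; a minimal sketch, assuming a metric variable income as in the later boxplot example:

DESCRIPTIVES VARIABLES=income
  /SAVE.
* /SAVE appends the standardized variable Zincome to the active dataset.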
Slide 22
Logarithmic transformation
Works for data that are skewed right.
Works for data where residuals get bigger for bigger values of the dependent variable.
Such trends in the residuals often occur if the error in the value of an outcome variable is a percentage of the value rather than an absolute amount. For the same percentage error, a bigger value of the variable means a bigger absolute error, so the residuals are bigger too.
Taking logs "pulls in" the residuals for the bigger values.
log(Y*error) = log(Y) + log(error)
Example: Body size against weight

Transformation rule:
$f(x) = \log(x)$, $x \ge 1$
$f(x) = \log(x + 1)$, $x \ge 0$

[Scatter plot: weight (in kg, 40 to 100) against body size (150 to 200)]
Logarithmic transformation I

Symmetry: A logarithmic transformation reduces positive skewness because it compresses the upper tail of the distribution while stretching out the lower tail. This is because the distances between 0.1 and 1, 1 and 10, 10 and 100, and 100 and 1000 are all the same on the logarithmic scale.

This is illustrated by the histogram of data simulated with salary (hourly wages) in a sample of nurses*. In the original scale, the data are long-tailed to the right, but after a logarithmic transformation is applied, the distribution is symmetric. The lines between the two histograms connect original values with their logarithms to demonstrate the compression of the upper tail and the stretching of the lower tail.

[Figure: histogram of original data and histogram of transformed data, with lines connecting original values to their logarithms]
Slide 24
Logarithmic transformation II

[Figure: histogram of original data (skewed right), transformation y = log10(x), histogram of transformed data]
Slide 25
Other transformations

Root functions: $f(x) = x^{1/2}, x^{1/3}$; $x \ge 0$ (usable for right-skewed distributions)

Hyperbola function: $f(x) = x^{-1}$; $x \ge 1$ (usable for right-skewed distributions)

Box-Cox transformation: $f(x) = x^{\lambda}$; $\lambda > 1$ (usable for left-skewed distributions)

Probit & logit functions (cf. logistic regression): $f(p) = \ln\left(\frac{p}{1-p}\right)$; $p \in [0,1]$ (usable for proportions and percentages)
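These transformations can be produced with COMPUTE; a minimal sketch, with placeholder variable names x and p:

* Square root (x >= 0), reciprocal (x >= 1) and logit (0 < p < 1) transformations.
COMPUTE sqrtx = SQRT(x).
COMPUTE recx = 1/x.
COMPUTE logitp = LN(p/(1 - p)).
EXECUTE.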
Slide 26
Data trimming
Methods?
Use basic statistics: <Analyze> with <Frequencies>, <Explore> and <Descriptives>.
Outliers: e.g. z-scores above/below 2 standard deviations; extremes: above/below 3 standard deviations.
Use graphical techniques: histogram, boxplot, Q-Q plot, ...
Outliers: e.g. as indicated in the boxplot.
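A minimal SPSS sketch for flagging such cases, assuming the income variable from the example below:

DESCRIPTIVES VARIABLES=income
  /SAVE.
* Zincome now holds the z-scores; flag outliers and extremes.
COMPUTE outlier = (ABS(Zincome) > 2).
COMPUTE extreme = (ABS(Zincome) > 3).
EXECUTE.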
Slide 27
Boxplot
A Boxplot displays the center (median), spread and outliers of a distribution.
See exercise for more details about whiskers, outliers etc.
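Boxplots by group can be requested via Analyze → Descriptive Statistics → Explore, or in syntax; a minimal sketch using the income and educ variables of this example:

EXAMINE VARIABLES=income BY educ
  /PLOT=BOXPLOT
  /STATISTICS=NONE.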
[Boxplots: income by education (educ, levels 2 to 7), with whisker and outlier case 92 marked]
Slide 28
[Figures: boxplot of income by educ with outlier and extreme cases marked (65, 83, 88, 92, 168, 190, 191, 196), and plot of the 95% CI of income by educ]
Slide 29
Q-Q plot
The quantile-quantile (Q-Q) plot is a graphical technique for deciding whether two samples come from populations with the same distribution.
Quantile: the fraction (or percentage) of data points below a given value.
For example, the 0.5 (or 50%) quantile is the position below which 50% of the data fall and above which 50% fall. In fact, the 50% quantile is the median.
Slide 30
In the Q-Q plot, the quantiles of the first sample are plotted against the quantiles of the second sample. If the two sets come from populations with the same distribution, the points should fall approximately along a 45-degree reference line. The greater the displacement from this reference line, the greater the evidence that the two data sets come from populations with different distributions.
A Q-Q plot is better suited for assessing the goodness of fit in the tails of the distributions: the normal Q-Q plot is more sensitive to deviations from normality in the tails of the distribution, whereas the normal P-P plot is more sensitive to deviations near the mean of the distribution.
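A normal Q-Q plot can also be produced in syntax; a minimal sketch, assuming the bq variable from the Chernobyl example (further PPLOT subcommands left at their defaults):

PPLOT /VARIABLES=bq
  /TYPE=Q-Q
  /DIST=NORMAL.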
Slide 31
Quantiles of the first sample are set against the quantiles of the second sample.
Slide 32
[Figures: histograms of two test distributions (SPSS) with their normal Q-Q plots; German axis labels: Häufigkeit = frequency, Erwarteter Wert von Normal = expected value of normal, Beobachteter Wert = observed value]
Example
Dataset "Data_07.sav" (Tschernobyl fallout of radioactivity)
Distribution of original data Distribution of log transformed data
Slide 34
Linear Regression
Example

Medical research: dependence of systolic blood pressure on age

[Scatter plot: systolic blood pressure (140 to 180) against age (35 to 90 years)]

Typical questions
- Is there a linear relation between age and systolic blood pressure?
- What is the predicted mean blood pressure for men aged 67?
Slide 36
The questions
Question in everyday language:
Is there a linear relation between age and systolic blood pressure?
Research question:
What is the relation between age and systolic blood pressure?
What kind of model is best for showing the relation? Is regression analysis the right model?
Statistical question:
Forming hypotheses
H0: "No model" (= no overall model and no significant coefficients)
HA: "Model" (= overall model and significant coefficients)
Can we reject H0?
The solution

Linear regression equation of systolic blood pressure on age:

$$pressure = \beta_0 + \beta_1 \cdot age + u$$

pressure = dependent variable
age = independent variable
$\beta_0$, $\beta_1$ = coefficients
u = error term
Slide 37
"How-to" in SPSS
Scales
240
Dependent variable: metric
220
SPSS
Analyze Regression Linear... 210
200
Result
190
Significant linear model
180
Significant coefficient
170
pressure = 135.2 + 0.956 age
160
Age [years]
Typical statistical statement in a paper:
There is a linear relation between age and systolic blood pressure.
(Regression: F = 102.763, R2 = .93, p = .000).
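The same model in syntax; a minimal sketch, assuming the variables pressure and age:

REGRESSION
  /STATISTICS COEFF OUTS R ANOVA
  /DEPENDENT pressure
  /METHOD=ENTER age.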
Slide 38
Impact analysis
Assess the impact of the independent variable on the dependent variable.
Example
If age increases, blood pressure also increases:
How strong is the impact? By how much will pressure increase with each additional year?
Prediction
Predict the values of a dependent variable using new values for the independent variable.
Example
What is the predicted mean systolic blood pressure of men aged 67?
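Using the estimated equation from the previous slide, the prediction is:

$$\widehat{pressure} = 135.2 + 0.956 \cdot 67 \approx 199.3$$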
Slide 39
Text in italics: only important in the case of multiple regression (see next chapter).
Slide 40
Regression model
More details about mathematics
in Christof Luchsinger's part
Mathematical model
The linear model describes y as a function of x:

$$y = \beta_0 + \beta_1 \cdot x$$

$\beta_0$ = intercept (constant)
$\beta_1$ = regression coefficient: the increase in the dependent variable per unit change in the independent variable (also known as "the rise over the run", i.e. the slope)
Stochastic model
$$y = \beta_0 + \beta_1 \cdot x + u$$
The error term u comprises all factors (other than x) that affect y.
These factors are treated as being unobservable.
u stands for "unobserved"
Slide 41
Slide 42
1. Linear in coefficients: $y = \beta_0 + \beta_1 x + u$
Slide 44
Empirical F-value and the appropriate p-value ("Sig.") are computed by SPSS.
In the example, we can reject H0 in favor of HA (Sig. < 0.05).
The overall model is significant (F(1,97) = 116.530, p = .000).
The estimated model is not only a theoretical construct but one that exists in a statistical sense.
Slide 46
The t statistic for the height variable ($\beta_1$) is associated with a p-value of .000 ("Sig.").
This indicates that the null hypothesis can be rejected.
Thus, the coefficient is significantly different from zero.
This also holds for the constant ($\beta_0$) with Sig. = .000.
Slide 47
$$\widehat{weight}_i = \beta_0 + \beta_1 \cdot height_i$$
Slide 48
[Figure: for each data point $y_i$, the total gap to the sample mean $\bar{y}$ splits into a regression part ($\hat{y}_i - \bar{y}$) and an error part ($y_i - \hat{y}_i$)]

$y_i$ = data point
$\hat{y}_i$ = estimation (model)
$\bar{y}$ = sample mean

$$\sum_{i=1}^{n}(y_i - \bar{y})^2 = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$

$$R^2 = \frac{SS_{Regression}}{SS_{Total}}, \qquad 0 \le R^2 \le 1$$
Slide 50
3. Zero conditional mean: The mean values of the residuals do not differ visibly from 0 across the range of standardized estimated values. OK
5. Homoscedasticity: The residual plot is trumpet-shaped; the residuals do not have constant variance. This Gauss-Markov requirement is violated: there is heteroscedasticity.
Independence: There is no obvious pattern indicating that the residuals influence one another (for example, a "wavelike" pattern). OK
Slide 52
Corrections
Transformation of the variable:
A possible correction in the case of this example is a log transformation of the variable weight.
Use of robust standard errors (not implemented in SPSS)
Use of Generalized Least Squares (GLS):
The estimator is provided with information about the variance and covariance of the errors.
(The last two options are not pursued further in this course.)
Slide 54
Multiple regression
Many similarities with simple Regression Analysis from above
All concepts from simple regression analysis also apply to multiple regression analysis.
What is new?
Concept of multicollinearity
Concept of stepwise regression analysis
Dummy coding of categorical variables
Standardized regression coefficients
Adjustment of the coefficient of determination ("Adjusted R Square")
Slide 55
Multicollinearity
Outline
Multicollinearity means there is a strong correlation between the independent variables.
Perfect collinearity means a variable is a linear combination of other variables.
=> A unique estimate of the coefficients is not possible because of the infinite number of combinations.
Perfect collinearity is rare in real-life data (unless you have made a mistake).
However, correlations, or even strong correlations, between variables are unavoidable.
Symptoms of multicollinearity
When correlation is strong, the standard errors of the parameters become large, and thus t-tests and confidence intervals become inaccurate.
The probability is increased that a good predictor will be found non-significant and rejected.
In stepwise regression, coefficient estimation is subject to large changes.
There might be coefficients with a sign opposite of that expected.
Multicollinearity is ...
... a severe problem when the research purpose includes causal modelling.
... less important where the research purpose is prediction, since the predicted values of the dependent variable remain stable relative to each other.
Slide 56
In addition, SPSS reports the Variance Inflation Factor (VIF), which is simply the inverse of the Tolerance (1/Tolerance). The VIF ranges from 1 to infinity.
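Both measures are based on the $R_j^2$ from regressing predictor j on all other predictors; stated here as the standard definitions for reference:

$$Tolerance_j = 1 - R_j^2, \qquad VIF_j = \frac{1}{Tolerance_j}$$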
Slide 57
Slide 58
The unstandardized B coefficients show the absolute change of the dependent variable weight
if the respective independent variable, height or age, changes by one unit.
The Beta coefficients are the standardized regression coefficients.
Their magnitudes reflect their relative importance in predicting weight.
Beta coefficients are only comparable within a model, not between. Moreover, they are highly
influenced by misspecification of the model.
Adding or leaving out variables in the equation will affect the size of the beta coefficients.
Slide 59
$$\text{Adjusted } R^2 = R^2 - \frac{m\,(1 - R^2)}{n - m - 1}$$

n = number of observations
m = number of independent variables
n − m − 1 = degrees of freedom (df)
Slide 60
For example, seasonal effects may be captured by creating dummy variables for each of the
seasons. Also gender effects may be treated with dummy coding.
The number of dummy variables is always one less than the number of categories.
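A minimal SPSS sketch for such a dummy, assuming gender is coded 1 = male and 2 = female:

* Create dummy variable female (0 = male, 1 = female).
RECODE gender (1=0) (2=1) INTO female.
VALUE LABELS female 0 'male' 1 'female'.
EXECUTE.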
Slide 62
Switching from male (female = 0) to female (female = 1) lowers weight by 8.345 kg.
Model fits better (Adjusted R square .894 vs. .832) due to "separation" of gender.
Slide 63
Example of multicollinearity
Human resources research in hospitals: Survey of nurse satisfaction and commitment
Slide 64
Slide 66
Data (EXAMPLE05.sav)
Subsample of n = 96 nurses
Among other variables: work experience (3 levels), salary (hourly wage in CHF/h)
Typical questions
Does experience have an effect on the level of salary?
Are the results only due to chance?
What is the relation between work experience and salary?
Slide 68
Boxplot
- - - grand mean
The boxplot indicates that salary may differ significantly depending on levels of experience.
Slide 69
Questions
Question in everyday language:
Does work experience have an effect on salary?
Research question:
Is there a relation between work experience and salary?
What kind of model is suitable for the relation?
Is analysis of variance the right model?
Statistical question:
Forming hypotheses
H0: "No model" (= no significant factors)
HA: "Model" (= significant factors)
Can we reject H0?
Solution
Linear model with salary as the dependent variable ($y_{gk}$ = wage of nurse k in group g):

$$y_{gk} = \bar{y} + \alpha_g + \varepsilon_{gk}$$

$\bar{y}$ = grand mean
$\alpha_g$ = effect of group g
$\varepsilon_{gk}$ = random term
Slide 70
"How-to" in SPSS
Scales
Dependent Variable: metric
Independent Variable(s): categorical, part of them metric (called covariates)
SPSS
Analyze General Linear Model Univariate...
Results
Overall model significant ("Corrected Model": F(2, 93) = 46.193, p = .000).
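The same analysis in syntax; a minimal sketch using the salary and experien variables of this example:

UNIANOVA salary BY experien
  /METHOD=SSTYPE(3)
  /PRINT=ETASQ DESCRIPTIVE
  /DESIGN=experien.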
1. Design of experiments
ANOVA is typically used for analyzing the findings of experiments:
One-way ANOVA, repeated-measures ANOVA,
multi-factorial ANOVA (analysis of variance with two or more factors), mixed ANOVA
2. Calculating differences and sum of squares
Differences between group means, individual values and the grand mean are squared and summed up. This leads to the fundamental equation of ANOVA.
The test statistic for the significance test is calculated from the means of the sums of squares.
3. Prerequisites
Independent data
Normally distributed variables
Homogeneity of variance between groups
4. Verification of the model and the factors
Is the overall model significant (F-test)? Are the factors significant?
Are prerequisites met?
5. Checking measures
Adjusted R squared / partial Eta squared
Slide 72
Designs of ANOVA
One-way ANOVA: one factor analysis of variance
1 dependent variable and 1 independent factor
Multi-factorial ANOVA: two or more factor analysis of variance
1 dependent variable and 2 or more independent factors
MANOVA: multivariate analysis of variance
Extension of ANOVA used to include more than one dependent variable
Slide 74
If the group means $\bar{y}_1$, $\bar{y}_2$, $\bar{y}_3$ differ, then $SS_{between}$ is large relative to $SS_{within}$.
Basic idea of ANOVA
The total sum of squared differences $SS_{total}$ is separated into two parts
(SS is short for Sum of Squares):
$SS_{between}$: part of the sum of squared differences due to groups ("between groups", treatments)
(here: between levels of experience)
$SS_{within}$: part of the sum of squared differences due to randomness ("within groups", also $SS_{error}$)
(here: within each experience group)
$$\sum_{g=1}^{G}\sum_{k=1}^{K_g}(y_{gk} - \bar{y})^2 = \sum_{g=1}^{G} K_g\,(\bar{y}_g - \bar{y})^2 + \sum_{g=1}^{G}\sum_{k=1}^{K_g}(y_{gk} - \bar{y}_g)^2$$

$$MS_t = \frac{SS_t}{K_{total} - 1} \quad \text{(mean of } SS_{total}\text{)}$$

$$MS_b = \frac{SS_b}{G - 1} \quad \text{(mean of } SS_{between}\text{)}$$

$$MS_w = \frac{SS_w}{K_{total} - G} \quad \text{(mean of } SS_{within}\text{)}$$

Calculating the test statistic F and significance testing for the global model:

$$F = \frac{MS_b}{MS_w}$$

F follows an F-distribution with (G − 1) and (K_total − G) degrees of freedom.

The F-test verifies the hypothesis that the group means are equal:

$$H_0: \bar{y}_1 = \bar{y}_2 = \bar{y}_3 \qquad H_A: \bar{y}_i \neq \bar{y}_j \text{ for at least one pair } i \neq j$$
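In the running example there are G = 3 experience groups and K_total = 96 nurses, so the F statistic has (2, 93) degrees of freedom, matching the SPSS result reported above (F(2, 93) = 46.193, p = .000).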
Slide 76
Slide 78
"Grand mean"
SSbetween
SSwithin (= SSerror)
SStotal
Partial eta squared compares the amount of variation explained by a particular factor (all other variables fixed) to the amount of variation that is not explained by any other factor in the model. This means we are only considering variation that is not explained by other variables in the model. Partial $\eta^2$ indicates what percentage of this variation is explained by a variable.
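As a formula (the standard definition, added here for reference):

$$\eta^2_{partial} = \frac{SS_{effect}}{SS_{effect} + SS_{error}}$$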
Slide 80
Two-Way ANOVA
Research in human resource management: Survey of nurse salary
[Table: salary by position (rows) and level of experience (columns: 1, 2, 3, all)]
Typical questions
Do work position and experience have an effect on salary? (→ main effects)
What "interaction" exists between work position and experience? (→ interaction effects)
Slide 81
Main effects
The direct effect of an independent variable on the dependent variable is called a main effect.
In the example:
The main effect of experien reveals that the nurses' salaries depend on their level of professional experience.
The main effect of position reveals that the nurses' salaries depend on whether they work in the office or the hospital.
Profile plots are used as visualization:

[Profile plots: main effect of experien (mean salary across experience levels 1, 2, 3) and main effect of position (mean salary for office vs. hospital)]
If the profile plot shows a (nearly) horizontal line, the main effect in question is presumably not significant. (Attention: SPSS cuts off the lower area of the graph; the Y-axis often does not start at 0!)
Slide 82
Interaction effects
An interaction between experience and position means there is a dependency between the two variables.
The independent variables have a complex influence on the dependent variable.
The factors do not just function additively but act together in a different manner.
An interaction means that the effect of one factor depends on the value of another factor.
[Diagram: experience (factor A) and position (factor B) each influence salary directly; in addition, the interaction (factor A x B) influences salary]
Slide 83
Interaction effects
In the example: The interaction between experien and position means ...
that the effect of work experience on salary is not the same for nurses who work in offices
and for nurses who work in the hospital.
that the difference in salary between nurses working in the hospital and nurses working in
the office depends on the level of experience.
Profile plots:

[Profile plots: salary by experien with separate lines for position (office, hospital), and salary by position with separate lines for experien (1, 2, 3)]
Slide 84
Slide 86
Interaction I

Do different levels of experience influence the impact of different levels of position differently?
Yes: if experience has the value 2 or 3, the influence of position is raised.

[Profile plot: salary by experience with separate lines for office and hospital]
Slide 88
More on interaction
[Profile plots: three scenarios of salary contrasting the presence and absence of a main effect of experien, a main effect of position, and an interaction]
Requirements of ANOVA
0. Robustness
ANOVA is relatively robust against violations of its prerequisites.
1. Sampling
Random sample, no treatment effects
A well-designed study avoids violation of this assumption
2. Distribution of residuals
Residuals (= errors) are normally distributed
Correction: transformation
3. Homogeneity of variances
Residuals (= errors) have constant variance
Correction: weight variances
4. Balanced design
Same sample size in all groups
Correction: weight means
SPSS automatically corrects unbalanced designs by Sum of Squares "Type III"
Syntax: /METHOD = SSTYPE(3)
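For the two-way example, the syntax could look as follows; a minimal sketch, assuming the variables salary, experien and position:

UNIANOVA salary BY experien position
  /METHOD=SSTYPE(3)
  /PRINT=ETASQ
  /DESIGN=experien position experien*position.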
Slide 90
Independent Variable (IV) → Dependent Variable(s) (DV), examples:
Price of product → Employee satisfaction, Customer satisfaction
Quality of products → Customer satisfaction
Quality of customer service → Motivation of employee
Slide 92
Choice of Method

[Decision tree: Data Analysis splits into Descriptive and Inductive; Inductive splits into Dependence and Interdependence methods, chosen according to metric vs. categorical scales]
Slide 94
LDA is closely related to ANOVA and logistic regression analysis, which also attempt to express
one dependent variable as a linear combination of other variables.
Data: body measurements of pumas from two subspecies (Species 1 = North America, 2 = South America; x1 = body length from nose to top of tail, in cm; x2 = body length from nose to root of tail, in cm):

Species  x1   x2
1        191  131
1        185  134
1        200  137
1        173  127
1        171  118
1        160  118
1        188  134
1        186  129
1        174  131
1        163  115
2        186  107
2        211  122
2        201  114
2        242  131
2        184  108
2        211  118
2        217  122
2        223  127
2        208  125
2        199  124

Other names for the puma: cougar, mountain lion, catamount, panther.

[Scatter plot: x2 [cm] against x1 [cm] for both species]
Slide 96
Sketch of LDA
Very short introduction to linear discriminant analysis
Goal
Discrimination between groups
Puma's example: discrimination between two subspecies
$Y_i$ = discriminant variable
$\alpha$, $\beta_1$, $\beta_2$ = coefficients
$x_{i,1}$, $x_{i,2}$ = measurements of body length
$u_i$ = error term
Slide 97
[Two scatter plots: x2 [cm] against x1 [cm] for the puma data]
Slide 98
SPSS-Example of linear discriminant analysis (EXAMPLE07)
* LDA of the puma data: grouping variable species (groups 1 and 2),
* predictors x1 and x2, prior probabilities proportional to group sizes.
DISCRIMINANT
  /GROUPS=species(1 2)
  /VARIABLES=x1 x2
  /ANALYSIS ALL
  /PRIORS SIZE
  /STATISTICS=MEAN STDDEV UNIVF BOXM COEFF RAW TABLE
  /CLASSIFY=NONMISSING POOLED MEANSUB.
Slide 99
$$Y_i = \alpha + \beta_1 x_{i,1} + \beta_2 x_{i,2} + u_i$$
Slide 100
"Found" two pumas A and B:

Puma  x1   x2
A     175  120
B     200  110

What subspecies are they?

Use the discriminant variable Y to determine their subspecies.

[Plot: discriminant variable Y for the animals of subspecies 1 and 2, with pumas A and B marked]