
Lecture #1 - 7/18/2011 Slide 1 of 28

Introduction to Multivariate Analysis


Lecture 1
July 18, 2011
Advanced Multivariate Statistical Methods
ICPSR Summer Session #2
Overview
Today's Lecture
Course Overview
Data Organization
Descriptive Statistics
Graphical Techniques
Distance Measures
Wrapping Up
Today's Lecture
- Introductions
- Syllabus and course overview
- Chapter 1 (a brief review, really):
  - Data organization/notation
  - Graphical techniques
  - Distance measures
- Introduction to SAS
Multivariate Statistics and Thinking
- Although titled "Advanced Multivariate Statistical Methods," this course is an overview of thinking about data and methods through a multivariate lens:
  - Many methods fall under the label "multivariate statistics" (e.g., Multivariate ANOVA, Discriminant Analysis, Principal Component Analysis)
  - Many multivariate statistical distributions exist (e.g., Multivariate Normal, Wishart)
  - Many modern (univariate) statistical methods rely on these multivariate distributions, especially the multivariate normal distribution
- This course will focus on multivariate thinking: not just the methods, but also the foundations of multivariate statistical analysis
Course Structure
- The course is organized around a central topic each week:
  1. Foundations of Multivariate Thinking and the Multivariate Normal Distribution
     - Matrix algebra
     - Multivariate normal distribution
  2. Multivariate Normal and Linear Mixed Models
     - Multivariate ANOVA
     - Discrimination/classification
     - Linear models
  3. Multivariate Data Reduction Procedures
     - Principal components analysis
     - Factor analysis and structural equation modeling
  4. Generalized Multivariate Techniques
     - Distance methods
     - Finite mixture models
     - Categorical distributions
Multivariate Statistics
A taxonomy of multivariate statistical analyses shows that most
techniques fall into one of the following categories:
1. Data reduction or structural simplification
2. Sorting and grouping
3. Investigation of the dependence among variables
4. Prediction
5. Hypothesis construction and testing
Data Organization
- As a precursor of things to come, here is a preview of the ways data are organized in this book/course
- Multivariate data are a collection of observations (or measurements) of:
  - p variables (k = 1, ..., p)
  - n items (j = 1, ..., n)
- Items can also be thought of as subjects/examinees/individuals, or as entities (when people are not under study)
- In some disciplines (such as educational measurement), "items" are instead the variables collected per individual
Data Organization (Continued)
- $x_{jk}$ = measurement of the $k$th variable on the $j$th entity

          Variable 1   Variable 2   ...   Variable k   ...   Variable p
Item 1:   x_{11}       x_{12}       ...   x_{1k}       ...   x_{1p}
Item 2:   x_{21}       x_{22}       ...   x_{2k}       ...   x_{2p}
  ...     ...          ...                ...                ...
Item j:   x_{j1}       x_{j2}       ...   x_{jk}       ...   x_{jp}
  ...     ...          ...                ...                ...
Item n:   x_{n1}       x_{n2}       ...   x_{nk}       ...   x_{np}
Arrays
- To represent the entire collection of measurements (all n items on all p variables), a rectangular array can be constructed:

$$
\mathbf{X} =
\begin{bmatrix}
x_{11} & x_{12} & \cdots & x_{1k} & \cdots & x_{1p} \\
x_{21} & x_{22} & \cdots & x_{2k} & \cdots & x_{2p} \\
\vdots & \vdots &        & \vdots &        & \vdots \\
x_{j1} & x_{j2} & \cdots & x_{jk} & \cdots & x_{jp} \\
\vdots & \vdots &        & \vdots &        & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{nk} & \cdots & x_{np}
\end{bmatrix}
$$

- In the next class, we will learn how arrays like this have an algebra that makes life somewhat easier
- All arrays will be symbolized in boldface
Array Example
- So, putting things all together, envision standing outside of the bookstore, asking people for receipts
- You are interested in looking at two variables:
  - Variable 1: the total amount of the purchase
  - Variable 2: the number of books purchased
- You find four people, and here is what you observe (with notation):

$$
\begin{aligned}
x_{11} &= 42 & x_{21} &= 52 & x_{31} &= 48 & x_{41} &= 58 \\
x_{12} &= 4  & x_{22} &= 5  & x_{32} &= 4  & x_{42} &= 3
\end{aligned}
$$
Array Example (Continued)
- The data array would then look like:

$$
\mathbf{X} =
\begin{bmatrix}
x_{11} & x_{12} \\
x_{21} & x_{22} \\
x_{31} & x_{32} \\
x_{41} & x_{42}
\end{bmatrix}
=
\begin{bmatrix}
42 & 4 \\
52 & 5 \\
48 & 4 \\
58 & 3
\end{bmatrix}
$$

- Notice that for any entry $x_{jk}$:
  - The first subscript (j) represents the ROW location in the data array
  - The second subscript (k) represents the COLUMN location in the data array
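The row/column convention can be tried out directly. The sketch below stores the bookstore data as a plain Python list of rows; the helper `x(j, k)` is a hypothetical name introduced here only to mimic the 1-based subscripts on the slide (Python itself indexes from 0).

```python
# Bookstore example: rows are items (people), columns are variables.
# Column 1 = total amount of the purchase, column 2 = number of books.
X = [
    [42, 4],
    [52, 5],
    [48, 4],
    [58, 3],
]

n = len(X)     # number of items (rows)
p = len(X[0])  # number of variables (columns)


def x(j, k):
    """Return x_jk using 1-based subscripts (hypothetical helper)."""
    return X[j - 1][k - 1]


print(n, p)     # 4 2
print(x(3, 1))  # 48: purchase total for the third person
print(x(4, 2))  # 3: number of books bought by the fourth person
```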
Descriptive Statistics Review
- When we have a large amount of data, it is often hard to get a manageable description of the nature of the variables under study
- For this reason (and as a way of reviewing topics from previous courses), descriptive statistics are used
- Such descriptive statistics include:
  - Means
  - Variances
  - Covariances
  - Correlations
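Sample Mean
- For the $k$th variable, the sample mean is:

$$
\bar{x}_k = \frac{1}{n} \sum_{j=1}^{n} x_{jk}
$$

- An array of the means for all p variables then looks like this (which we will come to know as the mean vector; shown here for $p = 4$):

$$
\bar{\mathbf{x}} =
\begin{bmatrix}
\bar{x}_1 \\
\bar{x}_2 \\
\bar{x}_3 \\
\bar{x}_4
\end{bmatrix}
$$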
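As a quick check of the formula, here is a minimal sketch that computes the mean vector for the bookstore example (so p = 2 here). The function name `sample_means` is an illustrative choice, not something from the course materials.

```python
# Bookstore data: column 1 = purchase total, column 2 = number of books.
X = [[42, 4], [52, 5], [48, 4], [58, 3]]


def sample_means(data):
    """Return the list of column means: (1/n) * sum over j of x_jk."""
    n = len(data)
    p = len(data[0])
    return [sum(row[k] for row in data) / n for k in range(p)]


x_bar = sample_means(X)
print(x_bar)  # [50.0, 4.0]
```

The first entry is (42 + 52 + 48 + 58) / 4 = 50 and the second is (4 + 5 + 4 + 3) / 4 = 4.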

Sample Variance
- For the $k$th variable, the sample variance is:

$$
s_k^2 = s_{kk} = \frac{1}{n} \sum_{j=1}^{n} (x_{jk} - \bar{x}_k)^2
$$

- Note the kk subscript: it is important because the variance of a single variable is a special case of the covariance of a pair of variables (a variable paired with itself)
- Also note the division by n
  - Reasons for this will become apparent in the near future (hint: it's a type of estimate)
- For a pair of variables, i and k, the sample covariance is:

$$
s_{ik} = \frac{1}{n} \sum_{j=1}^{n} (x_{ji} - \bar{x}_i)(x_{jk} - \bar{x}_k)
$$
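The point about the kk subscript can be seen in code: one covariance function, called with i = k, returns the variance. This is a minimal sketch using the divisor n from the slide; `sample_cov` and its 1-based column arguments are illustrative choices.

```python
# Bookstore data: column 1 = purchase total, column 2 = number of books.
X = [[42, 4], [52, 5], [48, 4], [58, 3]]


def sample_cov(data, i, k):
    """s_ik with divisor n; i and k are 1-based column indices."""
    n = len(data)
    mean_i = sum(row[i - 1] for row in data) / n
    mean_k = sum(row[k - 1] for row in data) / n
    return sum((row[i - 1] - mean_i) * (row[k - 1] - mean_k) for row in data) / n


s_11 = sample_cov(X, 1, 1)  # variance of the purchase totals
s_22 = sample_cov(X, 2, 2)  # variance of the book counts
s_12 = sample_cov(X, 1, 2)  # covariance between the two variables

print(s_11, s_22, s_12)  # 34.0 0.5 -1.5
```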
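Sample Covariance Matrix
- Making an array of all sample covariances gives us:

$$
\mathbf{S}_n =
\begin{bmatrix}
s_{11} & s_{12} & \cdots & s_{1p} \\
s_{21} & s_{22} & \cdots & s_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
s_{p1} & s_{p2} & \cdots & s_{pp}
\end{bmatrix}
$$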
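One way to build the whole matrix at once is to center each column and then average the cross-products of the centered columns. The sketch below does this for the bookstore example with plain lists; the variable names are illustrative, and the divisor is n as in the definition of the sample covariance.

```python
# Bookstore data: column 1 = purchase total, column 2 = number of books.
X = [[42, 4], [52, 5], [48, 4], [58, 3]]
n, p = len(X), len(X[0])

# Column means, then the data with each column mean subtracted.
means = [sum(row[k] for row in X) / n for k in range(p)]
centered = [[row[k] - means[k] for k in range(p)] for row in X]

# Entry (i, k) is the average cross-product of centered columns i and k.
S_n = [
    [sum(r[i] * r[k] for r in centered) / n for k in range(p)]
    for i in range(p)
]

for row in S_n:
    print(row)
# [34.0, -1.5]
# [-1.5, 0.5]
```

The matrix is symmetric because s_ik = s_ki, and its diagonal holds the variances.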

Sample Correlation
- Sample covariances are dependent upon the scale of the variables under study
- For this reason, the correlation is often used to describe the association between two variables
- For a pair of variables, i and k, the sample correlation is found by dividing the sample covariance by the product of the standard deviations of the two variables:

$$
r_{ik} = \frac{s_{ik}}{\sqrt{s_{ii}} \, \sqrt{s_{kk}}}
$$

- The sample correlation:
  - Ranges from -1 to 1
  - Measures linear association
  - Is invariant under linear transformations of variables i and k (its sign flips if exactly one of the two slopes is negative)
  - Is a biased statistic
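The invariance property is easy to verify numerically. The following sketch computes the correlation for the bookstore example and again after rescaling both variables; the function name `sample_corr` and the particular rescalings (2a + 10 and 3b - 1) are arbitrary choices made for illustration.

```python
import math


def sample_corr(u, v):
    """r = s_uv / (sqrt(s_uu) * sqrt(s_vv)), all with divisor n."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    s_uv = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v)) / n
    s_uu = sum((a - mean_u) ** 2 for a in u) / n
    s_vv = sum((b - mean_v) ** 2 for b in v) / n
    return s_uv / (math.sqrt(s_uu) * math.sqrt(s_vv))


amount = [42, 52, 48, 58]  # variable 1: purchase totals
books = [4, 5, 4, 3]       # variable 2: number of books

r = sample_corr(amount, books)
print(round(r, 4))  # -0.3638

# Linear transformations with positive slopes leave r unchanged.
r_rescaled = sample_corr([2 * a + 10 for a in amount], [3 * b - 1 for b in books])

# A negative slope on one variable flips the sign of r.
r_flipped = sample_corr([-a for a in amount], books)
```

Because the divisor n appears in both the numerator and the denominator, it cancels: the correlation is the same whether n or n - 1 is used.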
Sample Correlation Matrix
- Making an array of all sample correlations gives us:

$$
\mathbf{R} =
\begin{bmatrix}
1      & r_{12} & \cdots & r_{1p} \\
r_{21} & 1      & \cdots & r_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
r_{p1} & r_{p2} & \cdots & 1
\end{bmatrix}
$$
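A correlation matrix can be obtained from a covariance matrix by dividing each entry by the two matching standard deviations. The sketch below starts from the covariance matrix of the bookstore example (values worked out by hand with divisor n, so treat them as an assumption of this example) and rescales it.

```python
import math

# Covariance matrix for the bookstore example (divisor n), computed by hand.
S_n = [[34.0, -1.5], [-1.5, 0.5]]
p = len(S_n)

# Standard deviations are the square roots of the diagonal entries.
sd = [math.sqrt(S_n[k][k]) for k in range(p)]

# r_ik = s_ik / (sd_i * sd_k)
R = [[S_n[i][k] / (sd[i] * sd[k]) for k in range(p)] for i in range(p)]

for row in R:
    print([round(value, 4) for value in row])
# [1.0, -0.3638]
# [-0.3638, 1.0]
```

The diagonal is 1 (up to floating-point rounding) because every variable correlates perfectly with itself.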

Graphical Techniques
- Displaying multivariate data can be difficult due to our natural limitations of seeing the world in three dimensions
- Several simple ways of displaying data include:
  - Bivariate scatterplots
  - Three-dimensional scatterplots
Bivariate Scatterplots
[Figure: example of a bivariate scatterplot]
Trivariate Scatterplots
[Figure: example of a trivariate (three-dimensional) scatterplot]
Graphical Techniques (Continued)
- But you have likely already seen those plots
- Some plots that can be produced by multivariate methods include:
  - Stars
  - Chernoff faces
  - Dendrograms
  - Bivariate plots, but of the variable space
  - Network graphs
Stars
[Figure: example of a star plot]
Chernoff Faces
[Figure: example of Chernoff faces]
Dendrograms
[Figure: example of a dendrogram]
Variable Space Plots
[Figure: example of a variable space plot]
Network Diagrams
[Figure: network diagram drawn with Pajek; 64 nodes labeled P000000 through P111111]
Distance Measures
- A great number of multivariate techniques revolve around the computation of distances:
  - Distances between variables
  - Distances between entities (people, objects, etc.)
- The Euclidean distance between the point $P = (x_1, x_2)$ and the origin $O = (0, 0)$ is:

$$
d(O, P) = \sqrt{(x_1 - 0)^2 + (x_2 - 0)^2}
$$
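A minimal sketch of this formula follows. The second function, `euclidean`, generalizes it to the distance between any two points with the same number of coordinates; that generalization and both function names are additions made here for illustration.

```python
import math


def euclidean_from_origin(x1, x2):
    """Distance from the origin O = (0, 0) to the point P = (x1, x2)."""
    return math.sqrt((x1 - 0) ** 2 + (x2 - 0) ** 2)


def euclidean(P, Q):
    """Distance between two points given as equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(P, Q)))


print(euclidean_from_origin(3, 4))  # 5.0

# Distance between the first two people in the bookstore example,
# treating each row (amount, books) as a point.
print(round(euclidean((42, 4), (52, 5)), 4))  # 10.0499
```

Note that the second distance is dominated by the purchase totals simply because they are on a larger scale than the book counts.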
Distance Measures (Continued)
- Elaborate discussions of distance measures will be found later in the class
- There are also some statistical analogs to distance measures, taking the variability of variables into account
- Also be aware that there are literally an infinite number of distance measures!
- To be considered an actual distance, a distance measure must satisfy the following:
  - $d(P, Q) = d(Q, P)$
  - $d(P, Q) > 0$ if $P \neq Q$
  - $d(P, Q) = 0$ if $P = Q$
  - $d(P, Q) \leq d(P, R) + d(R, Q)$ (known as the triangle inequality)
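These four properties can be spot-checked on a handful of points. The sketch below is only a numerical check on a finite set of distinct points, not a proof; the function names are illustrative. It also shows that squared Euclidean distance, a common "distance-like" measure, fails the triangle inequality.

```python
import math
from itertools import permutations


def euclidean(P, Q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(P, Q)))


def squared_euclidean(P, Q):
    return sum((a - b) ** 2 for a, b in zip(P, Q))


def satisfies_distance_properties(d, points, tol=1e-12):
    """Check the four properties on every combination of distinct points."""
    for P in points:
        if d(P, P) != 0:                 # d(P, Q) = 0 if P = Q
            return False
    for P, Q in permutations(points, 2):
        if d(P, Q) != d(Q, P):           # symmetry
            return False
        if not d(P, Q) > 0:              # d(P, Q) > 0 if P != Q
            return False
    for P, Q, R in permutations(points, 3):
        if d(P, Q) > d(P, R) + d(R, Q) + tol:  # triangle inequality
            return False
    return True


people = [(42, 4), (52, 5), (48, 4), (58, 3)]  # bookstore rows as points
print(satisfies_distance_properties(euclidean, people))  # True

line = [(0,), (1,), (2,)]
print(satisfies_distance_properties(squared_euclidean, line))  # False
```

The second check fails because the squared distance from 0 to 2 is 4, which exceeds 1 + 1, the sum of the squared distances through the midpoint.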
Final Thoughts
- We introduced what this course will be about: the wild world of multivariate statistics
- Things will become increasingly relevant as time progresses... but do not hesitate to ask "why?"
- We will now head down to the lab for a SAS introduction session
- Tomorrow's class: Matrix algebra (Chapter 2 and Supplement 2A)
