
MATH 6303

Multivariate Statistical Analysis


Final Project
By
Marius M. Mihai

December 3, 2014

1. Do a discriminant analysis and write a report for the data of Table 5.8.

Y1: Length of Cycle, Y2: Percentage of Rising Prices, Y3: Cyclical Amplitude, Y4: Rate of Change
The purpose of the discriminant analysis is to identify a linear combination of the variables
described above that shows the separation between consumer goods and producer goods.
Before finding the discriminant function, we need to check whether the univariate differences
between the groups are significant; thus, we conduct four separate ANOVAs and analyze the results.
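Since the original SAS code is not shown, the following is a minimal sketch of how all four
univariate ANOVAs can be produced in one PROC GLM call; the data set name (cycles) and the
variable names (group, y1-y4) are assumptions. PROC GLM prints a separate ANOVA table for
each response variable.

proc glm data=cycles;
   class group;                  /* consumer vs. producer goods */
   model y1 y2 y3 y4 = group;    /* one univariate ANOVA per response */
run;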
For variable y1 (length of cycle), the p-value is less than 5%, so the difference is significant.


For variable y2 (% of rising prices), the difference between consumer and producer goods was
not significant.


For variable y3 (cyclical amplitude), the difference between groups is significant.

For variable y4 (rate of change), the difference between groups is also not significant.


The univariate ANOVAs found significant group differences in y1 and y3, but not in y2 and y4.
The next step is to run a MANOVA to test for an overall difference between consumer goods and
producer goods. All of the multivariate test statistics were significant, with p-values below
5% in every case, so there is a significant overall difference between the two groups.
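The overall test can be obtained by adding a MANOVA statement to the same (assumed) PROC GLM call:

proc glm data=cycles;
   class group;
   model y1 y2 y3 y4 = group;
   manova h=group;   /* Wilks' lambda, Pillai's trace, Hotelling-Lawley trace, Roy's root */
run;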


Because the difference between consumer and producer goods was significant, the discriminant
analysis will help identify which variables contribute most to the difference between groups.
The analysis proceeds as follows:
a) First, the discriminant function and its coefficients are obtained and tested for significance.
b) Second, the coefficients are standardized to remove differences in units, so that the
contribution of each variable can be compared.
c) Third, a stepwise selection of variables is applied to identify any redundancies.
The analysis was carried out assuming that the covariance matrices of the two groups are equal.
The discriminant function coefficients are computed as a = Spooled^(-1) (ȳ1 - ȳ2), which gives
a = (-0.05689, -0.00971, -0.24213, -0.0713). To test the significance of this discriminant
function, Hotelling's T2 was computed; in the two-group case, the discriminant function is
significant exactly when T2 is significant. Using the relationship
T2 = (n1 + n2 - 2)(1 - Λ)/Λ, where Λ is Wilks' lambda, and taking Λ = 0.48 from the MANOVA
output, T2 = (9 + 10 - 2)(1 - 0.48)/0.48 = 18.42. The critical value is
T2(0.05; p = 4, n1 + n2 - 2 = 17) = 15.117. The test statistic exceeds the table value, so T2
is significant, and the discriminant function is significant as well. The linear discriminant
function can therefore be written as z = -0.06 y1 - 0.0097 y2 - 0.24 y3 - 0.071 y4.
The standardized coefficients of the discriminant function are computed as
a* = [diag(Spooled)]^(1/2) a; that is, each raw coefficient is multiplied by the pooled standard
deviation of its variable. Computed in SAS, a* = (-1.390, -0.083, -1.025, -0.032). The absolute
values of these coefficients indicate the contribution of each variable to the model. The
ranking from most to least important is: y1 (length of cycle), y3 (cyclical amplitude),
y2 (percentage of rising prices), and y4 (rate of change). These results are consistent with
the individual ANOVAs, where the strongest differences were in y1 and y3 (according to their
p-values), while y2 and y4 did not exhibit significant differences.
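The following PROC IML sketch reproduces these computations. The data set name and the group
labels are assumptions, and the built-in mean and cov functions require a recent SAS release;
this is a sketch of the method, not the code actually used in the report.

proc iml;
   use cycles;
   read all var {y1 y2 y3 y4} where(group="consumer") into Y1;
   read all var {y1 y2 y3 y4} where(group="producer") into Y2;
   close cycles;
   n1 = nrow(Y1);  n2 = nrow(Y2);                     /* 9 and 10 */
   d = mean(Y1)` - mean(Y2)`;                         /* difference of mean vectors */
   spl = ((n1-1)*cov(Y1) + (n2-1)*cov(Y2)) / (n1+n2-2);
   a = inv(spl) * d;                                  /* discriminant coefficients */
   t2 = (n1*n2/(n1+n2)) * d` * inv(spl) * d;          /* Hotelling's T2 */
   astd = sqrt(vecdiag(spl)) # a;                     /* standardized coefficients */
   print a t2 astd;
quit;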


The last step of the discriminant analysis is a stepwise procedure, conducted to identify any
redundancies in the data. The output from the stepwise procedure in SAS can be found below.
As expected, variable y1 (length of cycle) was entered first because it had the highest
F-value, followed by y3 (cyclical amplitude) with the second-highest F-value. After step 2,
no remaining variables were significant. This was expected: y2 (percentage of rising prices)
and y4 (rate of change) were not significant in the individual ANOVAs run at the start of the
analysis, and they also contribute least according to the standardized discriminant function.
Thus the reduced model with only y1 and y3 is as good as the full model, in which y2 and y4
appear to be redundant.
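A sketch of the stepwise run (same assumed names; the entry/stay significance levels are set
to the 5% level used throughout):

proc stepdisc data=cycles method=stepwise sle=0.05 sls=0.05;
   class group;
   var y1 y2 y3 y4;
run;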


2. Do a classification analysis and write a report on Table 5.6.

Y1: Intelligence, Y2: Form Relations, Y3: Dynamometer, Y4: Dotting, Y5: Sensory Motor
Coordination, Y6: Perseveration
Classification analysis takes discriminant analysis one step further. Its purpose is to tell us
where to place future subjects with given scores on intelligence, form relations, dynamometer,
dotting, sensory motor coordination, and perseveration. In our case we have a two-group
classification problem: engineer apprentices versus pilots. It is important to note that the
preliminary analyses on these data (ANOVA, MANOVA, and tests for normality) were done
previously in the midterm exam. As in the previous problem, the analysis proceeds as follows:
a) First, the discriminant function and its coefficients are obtained and tested for significance.
b) Second, the coefficients are standardized to remove differences in units, so that the
contribution of each variable can be compared.
c) Third, the equality of the covariance matrices is tested.
d) Fourth, each observation is classified using both the linear and the quadratic discriminant
function, and the error rates are estimated.
e) Finally, the holdout method is used to see how its error rate compares with the previous two.


The discriminant function was computed in SAS. As in problem 1, the vector
a = (0.0075, 0.1933, -0.129, -0.043, 0.072, -0.049) contains the coefficients of the linear
discriminant function. T2 = 66.7 was computed in the midterm exam and was significant; hence
the linear discriminant function is also significant.
The next step was to compute the standardized coefficients, to identify the contribution of
each variable to the overall model. The standardized coefficients were (0.174, 1.496, -1.391,
-1.280, 1.131, -1.440). Taking absolute values, the ranking from most to least important is:
y2 (form relations), y6 (perseveration), y3 (dynamometer), y4 (dotting), y5 (sensory motor
coordination), and y1 (intelligence). These results agree with the midterm, where sensory
motor coordination and intelligence appeared to be redundant in the full model; in the
discriminant analysis these two variables rank last in importance.

An assumption that was not tested in the midterm is particularly important for this analysis: the
equality of the covariance matrices. This assumption was tested in a previous homework (see
problem 7.22), and the covariance matrices appeared to be equal.
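In SAS, the homogeneity test and the linear classification can be requested together with
PROC DISCRIM; a sketch, where the data set name table56 is an assumption:

proc discrim data=table56 pool=test list;
   class group;                /* engineer apprentices vs. pilots */
   var y1 y2 y3 y4 y5 y6;
run;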
With the covariance matrices equal, the first classification is based on the linear
discriminant function with a = (0.0075, 0.1933, -0.129, -0.043, 0.072, -0.049). The table
below shows this computation, which was done in Microsoft Excel. There were two
misclassifications among the engineer apprentices and two among the pilots, giving an apparent
error rate of (2 + 2)/(20 + 20) = 0.10 (10%). This error rate will be compared with the other
error rates obtained later in the analysis in order to judge the rule's ability to predict
group membership.
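The rule implemented in the spreadsheet is the standard two-group linear rule: compute
z = a'y for each subject and assign the subject to the group whose mean score z̄i = a'ȳi is
closer, i.e., compare z with the cutoff (z̄1 + z̄2)/2. A minimal PROC IML sketch; the data set
name and the assumption that the first 20 rows are the engineer apprentices are mine:

proc iml;
   use table56;
   read all var {y1 y2 y3 y4 y5 y6} into Y;   /* all 40 subjects */
   close table56;
   a = {0.0075, 0.1933, -0.129, -0.043, 0.072, -0.049};  /* LDF coefficients from above */
   z = Y * a;                                 /* discriminant score for each subject */
   zbar1 = mean(z[1:20]);                     /* engineer apprentices (assumed rows 1-20) */
   zbar2 = mean(z[21:40]);                    /* pilots (assumed rows 21-40) */
   cutoff = (zbar1 + zbar2) / 2;
   /* engineers have the larger mean score here (see the table below) */
   decision = choose(z > cutoff, "engineer apprentices", "pilots");
   print z decision;
quit;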


Engineer Apprentices

y1    y2   y3    y4    y5   y6    z = a'y    Decision
121   22   74    223   54   254   -22.5221   pilots
108   30   80    175   40   300   -23.0577   pilots
122   49   87    266   41   223   -20.2287   engineer apprentices
77    37   66    178   80   209   -12.9194   engineer apprentices
140   35   71    175   38   261   -18.9185   engineer apprentices
108   37   57    241   59   245   -17.4942   engineer apprentices
124   39   52    194   72   242   -13.2495   engineer apprentices
130   34   89    200   85   242   -18.2732   engineer apprentices
149   55   91    198   50   277   -18.4763   engineer apprentices
129   38   72    162   47   268   -17.6913   engineer apprentices
154   37   87    170   60   244   -17.8653   engineer apprentices
145   33   88    208   51   228   -20.3226   engineer apprentices
112   40   60    232   29   279   -20.7101   engineer apprentices
120   39   73    159   39   233   -16.424    engineer apprentices
118   21   83    152   88   233   -17.3899   engineer apprentices
141   42   80    195   36   241   -18.739    engineer apprentices
135   49   73    152   42   249   -14.6486   engineer apprentices
151   37   76    223   74   268   -18.9062   engineer apprentices
97    46   83    164   31   243   -17.8151   engineer apprentices
109   42   82    188   57   267   -18.7053   engineer apprentices

Mean: 124.5   38.1   76.2   192.75   53.65   250.3

Pilots

y1    y2   y3    y4    y5   y6    z = a'y    Decision
132   17   77    232   50   249   -24.2204   pilots
123   32   79    192   64   315   -22.1684   pilots
129   31   96    250   55   319   -27.8371   pilots
131   23   67    291   48   310   -27.4374   pilots
110   24   96    239   42   268   -27.2938   pilots
47    22   87    231   40   217   -24.2903   pilots
125   32   87    227   30   324   -27.5684   pilots
129   29   102   234   58   300   -27.1664   pilots
130   26   104   256   58   270   -27.4665   pilots
147   47   82    240   30   322   -24.3164   pilots
159   37   80    227   58   317   -23.0874   pilots
135   41   83    216   39   306   -23.2369   pilots
100   35   83    183   57   242   -18.8145   engineer apprentices
149   37   94    227   30   240   -23.2048   pilots
149   38   78    258   42   271   -22.9299   pilots
153   27   89    283   66   291   -26.7734   pilots
136   31   83    257   31   311   -27.7362   pilots
97    36   100   252   30   225   -24.8977   pilots
141   37   105   250   27   243   -26.0327   pilots
164   32   76    187   30   264   -21.1992   engineer apprentices

Mean: 129.3   31.7   87.4   236.6   44.25   280.2

The next classification is based on a quadratic classification function. Although the test
found no significant difference between the sample covariance matrices, for the purpose of
this analysis we compare the error rates of the two rules. The SAS results are below: the
quadratic discriminant function yields the same apparent error rate of 10%.
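A sketch of the quadratic run; POOL=NO makes PROC DISCRIM use the individual within-group
covariance matrices, which gives a quadratic classification rule (same assumed names as above):

proc discrim data=table56 pool=no list;
   class group;
   var y1 y2 y3 y4 y5 y6;
run;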

A third classification analysis uses the holdout method; the SAS results are copied below.
There were 4 misclassifications among the engineer apprentices and 2 among the pilots, for a
reported error rate of 0.1750. As expected, the error rate increased compared with the
previous two methods, giving a more realistic expectation of how the linear discriminant
function will perform on future subjects.
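The holdout (leave-one-out) error rates come from the CROSSVALIDATE option; a sketch with the
same assumed names:

proc discrim data=table56 pool=yes crossvalidate crosslisterr;
   class group;
   var y1 y2 y3 y4 y5 y6;
run;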


3. Do a regression analysis and write a report on Table 3.4.

The first step in this multivariate regression analysis is to estimate the parameter matrix
B̂ = (X'X)^(-1) X'Y. The parameter matrix was computed in SAS, and the output is below. The
first table gives the parameter estimates for y1 (relative weight) and the second the
estimates for y2 (fasting plasma glucose).

The overall regression is significant at α = 5%, as indicated by all four multivariate tests
shown below. However, the R2 values are relatively low for the overall model: only 25% of the
variability in y1 (relative weight) is explained by x1 (glucose intolerance), x2 (insulin
response to oral glucose), and x3 (insulin resistance), and only 1.6% of the variability in
y2 (fasting plasma glucose) is explained by x1, x2, and x3.
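A minimal SAS sketch of this fit; the data set name diabetes is an assumption. The MTEST
statement with no arguments gives the overall multivariate tests (Wilks, Pillai,
Hotelling-Lawley, Roy):

proc reg data=diabetes;
   model y1 y2 = x1 x2 x3;
   mtest;            /* overall multivariate test of x1, x2, x3 */
run; quit;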


The relatively low values of R2 suggest that more explanatory variables may be needed in order
to improve the model. Nevertheless, for the purpose of this analysis we run a backward
elimination to identify redundancies and find a subset of the x's.
To find the subset we compute a conditional Wilks' lambda using formula 10.72 in the book: for
example, Λ(x1 | x2, x3) = Λ(x1, x2, x3) / Λ(x2, x3), which is the first value under X1 in the
table below; the values for the other x's are computed similarly. No variable could be
eliminated at step 1, because the largest conditional Wilks' lambda (0.93) was still
significant at α = 5%, so the backward elimination stops there.
Step   X1 (Glucose Intolerance)   X2 (Insulin Response to Oral Glucose)   X3 (Insulin Resistance)
1      Wilks' lambda = 0.93       Wilks' lambda = 0.89                    Wilks' lambda = 0.76

It appears that none of the independent variables can be eliminated, and all three x's are
needed in the full model. However, given the small values of R2, they explain only a small
portion of the variability in the y's. Thus, as mentioned earlier, this model needs more
explanatory variables in order to increase its accuracy.
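The conditional (partial) Wilks' lambdas can be obtained in SAS with labeled MTEST statements,
each testing one x adjusted for the others in the model (same assumed names as above):

proc reg data=diabetes;
   model y1 y2 = x1 x2 x3;
   x1given23: mtest x1;   /* Lambda(x1 | x2, x3) */
   x2given13: mtest x2;   /* Lambda(x2 | x1, x3) */
   x3given12: mtest x3;   /* Lambda(x3 | x1, x2) */
run; quit;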


12.8. Carry out a principal component analysis on all six variables of the glucose data of Table
3.8. Use both S and R. Which do you think is more appropriate here? Show the percent of
variance explained. Based on the average eigenvalue or a scree plot, decide how many
components to retain. Can you interpret the components of either S or R?
The purpose of principal component analysis is to reduce the dimensionality of the data: the
original variables are replaced by a smaller number of uncorrelated linear combinations that
retain most of the variance. For the data in Table 3.8, principal components were computed
from both S and R.
First I present the run on the correlation matrix. The first four eigenvalues account for
about 85% of the variance, which exceeds the 80% guideline, so we retain the first four
components.
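Both runs can be produced with PROC PRINCOMP, which analyzes R by default and S when the COV
option is given; a sketch with assumed data set and variable names:

proc princomp data=glucose out=pcscores;
   var y1 y2 y3 x1 x2 x3;   /* analysis of R (default) */
run;

proc princomp data=glucose cov;
   var y1 y2 y3 x1 x2 x3;   /* analysis of S */
run;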


In the case of the covariance matrix, the first three eigenvalues account for 89% of the
variability, which is above 80%, so we can keep the first three principal components. This
makes sense because the components of S are dominated by the larger variances of x1, x2,
and x3.


In this case, because of the disparate variances in S, choosing the principal components from
R is more appropriate.
To interpret the principal components obtained from R, we compute the correlations between the
retained components and the original variables. The runs were done in SAS and are circled in
the output below. Only correlations above 0.5 in absolute value were treated as important.
With the first four principal components retained, the first component is correlated (above
0.5) with y1, y3, x1, x2, and x3; the second component with y2; the third component with x2;
and y1 and y3 are strongly correlated with the fourth principal component.
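A sketch of this correlation step, using the component scores saved in the OUT= data set from
the PROC PRINCOMP run above (names assumed):

proc corr data=pcscores;
   var Prin1-Prin4;             /* scores on the first four components */
   with y1 y2 y3 x1 x2 x3;      /* original variables */
run;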


12.12 Carry out a principal component analysis on the engineer data of Table 5.6 as follows:
(a) Use the pooled covariance matrix.
(b) Ignore groups and use a covariance matrix based on all 40 observations.
(c) Which of the approaches in (a) or (b) appears to be more successful?
Here we run a principal component analysis using both the pooled covariance matrix and the
unpooled covariance matrix based on all 40 observations, ignoring groups. The two matrices
were computed in SAS and are shown in the figures below: the first figure is the pooled
covariance matrix and the second the unpooled covariance matrix.


First we run the component analysis on the unpooled covariance matrix. The following results
were obtained in SAS: the first three components account for 87% of the variance, so it is
enough to keep them. The first three components are
a1 = (0.212, -0.039, 0.08, 0.775, 0.956, 0.580),
a2 = (0.389, 0.064, -0.066, -0.608, 0.01, 0.686), and
a3 = (0.889, 0.096, 0.08, 0.08, 0.01, -0.434).

For the pooled matrix I could not use a standard SAS procedure, so I computed the eigenvalues
and eigenvectors in PROC IML; the output is copied below. The table under the figures gives
the proportion and cumulative proportion of the total variance accounted for by each
eigenvalue. Similar to the analysis done for the unpooled covariance matrix, the first three
eigenvalues account for about 85% of the total variance, so the first three eigenvectors
(components) can be kept in the model.
Given that the two analyses are very similar, it appears that neither approach is more
successful; the results are essentially independent of whether the pooled or the unpooled
covariance matrix is used.
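A sketch of the PROC IML computation for the pooled matrix; the data set name and group labels
are assumptions, and the pooled covariance matrix is built from the two group covariance
matrices as in problem 2:

proc iml;
   use table56;
   read all var {y1 y2 y3 y4 y5 y6} where(group="engineer") into Y1;
   read all var {y1 y2 y3 y4 y5 y6} where(group="pilot") into Y2;
   close table56;
   n1 = nrow(Y1);  n2 = nrow(Y2);                         /* 20 and 20 */
   spl = ((n1-1)*cov(Y1) + (n2-1)*cov(Y2)) / (n1+n2-2);   /* pooled covariance */
   call eigen(lambda, e, spl);          /* eigenvalues and eigenvectors */
   prop = lambda / sum(lambda);         /* proportion of variance */
   cumprop = cusum(prop);               /* cumulative proportion */
   print lambda prop cumprop;
quit;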


Eigenvalue     Proportion   Cumulative Proportion
1,050.5963     38.6%        38.6%
858.3158       31.6%        70.2%
398.9035       14.7%        84.9%
259.1484       9.5%         94.4%
108.0892       4.0%         98.4%
43.3535        1.6%         100.0%
