December 3, 2014
Y1: Length of Cycle, Y2: Percentage of Rising Prices, Y3: Cyclical Amplitude, Y4: Rate of Change
The purpose of the discriminant analysis is to identify a linear combination of the variables
described above that would show the separation between consumer goods and producer goods.
But before finding the discriminant function, we need to check whether the univariate differences
between the groups are significant, so we conduct four separate ANOVAs and analyze the results.
For variable y1 (length of cycle), the p-value is less than 5%, so the difference between groups is significant.
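One of these univariate tests can be sketched as follows. The two samples below are placeholders standing in for the real cycle-length data (group sizes 9 and 10, matching the sample sizes used later in the T² computation):

```python
from scipy.stats import f_oneway

# Placeholder cycle-length samples for the two groups (9 consumer goods,
# 10 producer goods); these are illustrative values, not the real data
consumer = [3.1, 4.0, 2.8, 3.6, 4.2, 3.3, 3.9, 3.0, 3.7]
producer = [5.2, 4.8, 5.6, 5.0, 4.5, 5.9, 5.3, 4.7, 5.1, 5.4]

# One-way ANOVA on the single variable y1
f_stat, p_value = f_oneway(consumer, producer)
print(p_value < 0.05)  # True: the univariate group difference is significant
```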
For variable y2 (% of rising prices), the difference between consumer and producer goods was
not significant.
For variable y3 (cyclical amplitude), the p-value is also below 5%, so the difference between groups is significant. For variable y4 (rate of change), however, the difference between groups is not significant.
The univariate ANOVAs found significant group differences for y1 and y3, but not for y2 and y4.
The next step is to run a MANOVA to test for an overall difference between consumer goods and
producer goods. The results were significant for all four test statistics: the p-value was below 5%
in every case, so there is a significant overall difference between the two groups, consumer goods
and producer goods.
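The test statistic behind these results, Wilks' lambda, can be sketched from the within- and between-group SSCP matrices. The two groups below are simulated placeholders (9 and 10 observations on 4 variables, as in the text), not the actual price-cycle data:

```python
import numpy as np

rng = np.random.default_rng(1)
# Placeholder observations on 4 variables for the two groups
g1 = rng.normal(0.0, 1.0, size=(9, 4))
g2 = rng.normal(0.8, 1.0, size=(10, 4))

X = np.vstack([g1, g2])
grand = X.mean(axis=0)

# Within-group (E) and between-group (H) SSCP matrices
E = sum((g - g.mean(0)).T @ (g - g.mean(0)) for g in (g1, g2))
H = sum(len(g) * np.outer(g.mean(0) - grand, g.mean(0) - grand) for g in (g1, g2))

# Wilks' lambda = |E| / |E + H|; values near 0 indicate strong separation
wilks = np.linalg.det(E) / np.linalg.det(E + H)
print(0.0 < wilks < 1.0)  # True
```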
Because the difference between consumer and producer goods was significant, the discriminant
analysis will help identify which variables contribute most to the separation between groups. The
analysis will proceed as follows:
a) First, the discriminant function will be identified along with its coefficients, and its
significance will be tested.
b) Second, the coefficients will be standardized in order to eliminate any unit issues, so that
we can analyze the contribution of each variable.
c) Third, a stepwise selection of variables will be applied to identify any redundancies.
The analysis was carried out assuming that the covariance matrices are equal. The discriminant
function coefficients can be computed as a = (S_pooled)^-1 * (ȳ1 - ȳ2), which gives
a = (-0.05689, -0.00971, -0.24213, -0.0713). To test the significance of this discriminant
function, Hotelling's T² was computed; in the case of two groups, the discriminant function is
significant exactly when T² is significant. Using the identity T² = (n1 + n2 - 2)*(1 - Λ)/Λ, where Λ
is Wilks' lambda, and Λ = 0.48 from my MANOVA output, T² = (9 + 10 - 2)*(1 - 0.48)/0.48 = 18.42.
The corresponding critical value is T²(α = 0.05, p = 4, ν = n1 + n2 - 2 = 17) = 15.117. The test
statistic exceeds the table value, so T² is significant, and the discriminant function is significant
as well. The linear discriminant function can therefore be written as
z = -0.06*y1 - 0.0097*y2 - 0.24*y3 - 0.071*y4
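As a sanity check, the T² value and its critical point can be reproduced with scipy, using the standard conversion between the T² distribution and the F distribution:

```python
from scipy.stats import f

n1, n2, p = 9, 10, 4
nu = n1 + n2 - 2          # 17 degrees of freedom
wilks = 0.48              # Wilks' lambda from the MANOVA output

# Hotelling T^2 from Wilks' lambda: T^2 = nu * (1 - lambda) / lambda
t2 = nu * (1 - wilks) / wilks

# Critical value via the T^2-to-F relationship:
#   T^2(alpha; p, nu) = nu * p / (nu - p + 1) * F(alpha; p, nu - p + 1)
alpha = 0.05
t2_crit = nu * p / (nu - p + 1) * f.ppf(1 - alpha, p, nu - p + 1)

print(round(t2, 2), round(t2_crit, 2))  # 18.42 15.12
```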
The standardized coefficients of the discriminant function can be computed as
a_standardized = (diag(S_pooled))^(1/2) * a, i.e., each raw coefficient is multiplied by the pooled
within-group standard deviation of its variable. Computed in SAS,
a_standardized = (-1.390, -0.083, -1.025, -0.032). Taking absolute values, these coefficients
indicate each variable's contribution to the model, giving the following ranking from most
important to least important: y1 (length of cycle), y3 (cyclical amplitude), y2 (percentage of
rising prices), and y4 (rate of change). These results are consistent with the individual ANOVAs,
where the strongest differences (by p-value) were for y1 and y3, while y2 and y4 did not exhibit
any significant differences.
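The standardization step can be sketched as follows. The pooled within-group variances below are illustrative values chosen to reproduce the reported standardized coefficients (the real diagonal of S_pooled comes from the SAS output, which is not reproduced here), and the third raw coefficient is taken as negative to match the standardized values:

```python
import numpy as np

# Raw discriminant coefficients (from the text)
a = np.array([-0.05689, -0.00971, -0.24213, -0.0713])

# Illustrative diagonal of S_pooled (pooled within-group variances);
# placeholder values, not the actual SAS output
s_diag = np.array([597.0, 73.1, 17.9, 0.20])

# a_standardized = diag(S_pooled)^(1/2) * a
a_std = np.sqrt(s_diag) * a

# Rank variables by |standardized coefficient|, most important first
order = np.argsort(-np.abs(a_std))
print([f"y{i + 1}" for i in order])  # ['y1', 'y3', 'y2', 'y4']
```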
The last step of the discriminant analysis is the stepwise procedure, conducted to identify any
redundancies in the data. The output from the stepwise procedure in SAS can be found below.
As expected, variable y1 (length of cycle) was entered first because it had the highest F-value,
followed by y3 (cyclical amplitude) with the second highest F-value. After step 2, no remaining
variable was significant. This was expected, because y2 (percentage of rising prices) and y4 (rate
of change) were not found significant in the individual ANOVAs run in the initial step of the
analysis, and they also had the lowest contributions in the standardized discriminant function.
Thus the reduced model with only y1 and y3 is as good as the full model, in which y2 and y4
appear to be redundant.
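The selection logic can be sketched with a simplified forward pass. The F-to-enter values below are placeholders, not the SAS output, and a real stepwise run would recompute the partial F-statistics after each entry:

```python
from scipy.stats import f

# Illustrative F-to-enter values for each variable (placeholders);
# a real stepwise procedure recomputes partial F after every entry
f_to_enter = {"y1": 12.4, "y3": 6.1, "y2": 1.3, "y4": 0.4}
df1, df2, alpha = 1, 17, 0.05

# Enter variables from largest F downwards while they remain significant
selected = []
for var, f_val in sorted(f_to_enter.items(), key=lambda kv: -kv[1]):
    if f.sf(f_val, df1, df2) < alpha:
        selected.append(var)
    else:
        break
print(selected)  # ['y1', 'y3']
```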
Y1: Intelligence, Y2: Form Relations, Y3: Dynamometer, Y4: Dotting, Y5: Sensory Motor
Coordination, Y6: Perseveration
Classification analysis takes discriminant analysis one step further. Its purpose is to tell us where
to place future subjects with various scores in intelligence, form relations, dynamometer, dotting,
sensory motor coordination, and perseveration. In our case, we have a two-group classification
analysis: engineer apprentices and pilots. It is important to note that the preliminary analysis of
these data (ANOVA, MANOVA, and tests for normality) was done previously in the midterm exam.
Similarly to the previous problem, the analysis will be carried out as follows:
a) First, the discriminant function will be identified along with its coefficients, and its
significance will be tested.
b) Second, the coefficients will be standardized in order to eliminate any unit issues, so that
we can analyze the contribution of each variable.
c) Third, the equality of the covariance matrices will be tested.
d) Fourth, each observation will be classified using both the linear and the quadratic
discriminant functions, and the error rates will be estimated.
e) Fifth, the holdout method will be used to see how it compares with the previous two.
The discriminant function was computed in SAS. As in problem 1, the vector
a = (0.0075, 0.1933, -0.129, -0.043, 0.072, -0.049) contains the coefficients of the linear
discriminant function. T² = 66.7 was computed in the midterm exam and was significant;
hence the linear discriminant function is also significant.
The next step was to compute the standardized coefficients, to identify the contribution of each
variable to the overall model. The standardized coefficients were (0.174, 1.496, -1.391, -1.280,
1.131, -1.440). Taking their absolute values, the ranking from most important to least important
is as follows: y2 (form relations), y6 (perseveration), y3 (dynamometer), y4 (dotting), y5 (sensory
motor coordination), and y1 (intelligence). These results agree with the midterm, where sensory
motor coordination and intelligence appeared to be redundant in the full model; in the
discriminant analysis, these two variables rank last in importance.
An assumption that was not tested in the midterm is particularly important for this analysis: the
equality of the covariance matrices. This assumption was tested in a previous homework (see
problem 7.22), and the covariance matrices appeared to be equal.
With the covariance matrices being equal, the first classification will be done based on the linear
discriminant function a= (0.0075, 0.1933, -0.129, -0.043, 0.072,-0.049). The next page contains
the analysis, which was done in Microsoft Excel. Based on this analysis, there were two
misclassifications among engineer apprentices and two among pilots, yielding an apparent error
rate of (2 + 2)/(20 + 20) = 0.1 (10%). This error rate will be compared with those obtained later
in the analysis in order to judge the ability to predict group membership.
Engineer Apprentices

y1    y2   y3    y4    y5   y6    Value      Decision
121   22   74    223   54   254   -22.5221   pilots
108   30   80    175   40   300   -23.0577   pilots
122   49   87    266   41   223   -20.2287   engineer apprentices
77    37   66    178   80   209   -12.9194   engineer apprentices
140   35   71    175   38   261   -18.9185   engineer apprentices
108   37   57    241   59   245   -17.4942   engineer apprentices
124   39   52    194   72   242   -13.2495   engineer apprentices
130   34   89    200   85   242   -18.2732   engineer apprentices
149   55   91    198   50   277   -18.4763   engineer apprentices
129   38   72    162   47   268   -17.6913   engineer apprentices
154   37   87    170   60   244   -17.8653   engineer apprentices
145   33   88    208   51   228   -20.3226   engineer apprentices
112   40   60    232   29   279   -20.7101   engineer apprentices
120   39   73    159   39   233   -16.424    engineer apprentices
118   21   83    152   88   233   -17.3899   engineer apprentices
141   42   80    195   36   241   -18.739    engineer apprentices
135   49   73    152   42   249   -14.6486   engineer apprentices
151   37   76    223   74   268   -18.9062   engineer apprentices
97    46   83    164   31   243   -17.8151   engineer apprentices
109   42   82    188   57   267   -18.7053   engineer apprentices
Mean: 124.5   38.1   76.2   192.75   53.65   250.3

Pilots

y1    y2   y3    y4    y5   y6    Value      Decision
132   17   77    232   50   249   -24.2204   pilots
123   32   79    192   64   315   -22.1684   pilots
129   31   96    250   55   319   -27.8371   pilots
131   23   67    291   48   310   -27.4374   pilots
110   24   96    239   42   268   -27.2938   pilots
47    22   87    231   40   217   -24.2903   pilots
125   32   87    227   30   324   -27.5684   pilots
129   29   102   234   58   300   -27.1664   pilots
130   26   104   256   58   270   -27.4665   pilots
147   47   82    240   30   322   -24.3164   pilots
159   37   80    227   58   317   -23.0874   pilots
135   41   83    216   39   306   -23.2369   pilots
100   35   83    183   57   242   -18.8145   engineer apprentices
149   37   94    227   30   240   -23.2048   pilots
149   38   78    258   42   271   -22.9299   pilots
153   27   89    283   66   291   -26.7734   pilots
136   31   83    257   31   311   -27.7362   pilots
97    36   100   252   30   225   -24.8977   pilots
141   37   105   250   27   243   -26.0327   pilots
164   32   76    187   30   264   -21.1992   engineer apprentices
Mean: 129.3   31.7   87.4   236.6   44.25   280.2
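The Excel computation above can be sketched in a few lines: the cutoff is the midpoint of the two projected group means, and an observation is assigned to the group whose projected mean lies on the same side (equal priors assumed):

```python
import numpy as np

# Coefficients of the linear discriminant function (from the SAS output)
a = np.array([0.0075, 0.1933, -0.129, -0.043, 0.072, -0.049])

# Group mean vectors (bottom rows of the classification tables)
mean_eng = np.array([124.5, 38.1, 76.2, 192.75, 53.65, 250.3])
mean_pil = np.array([129.3, 31.7, 87.4, 236.6, 44.25, 280.2])

# Cutoff: midpoint of the two projected group means
cutoff = (a @ mean_eng + a @ mean_pil) / 2

def classify(y):
    """Assign y to the group whose projected mean is on the same side."""
    return "engineer apprentices" if a @ np.asarray(y) > cutoff else "pilots"

# The first engineer apprentice in the table falls on the pilot side,
# i.e., it is misclassified
print(classify([121, 22, 74, 223, 54, 254]))  # pilots
```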
The next classification is based on a quadratic classification function. Although the test on the
sample covariance matrices did not find any significant differences, the error rates of the two
approaches will be compared for the purposes of this analysis. The SAS results are below: the
quadratic discriminant function yields a similar error rate of 10%.
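Under unequal covariance matrices, the quadratic rule compares, for each group i, the score Q_i(y) = -1/2 ln|S_i| - 1/2 (y - ȳ_i)' S_i^-1 (y - ȳ_i) (equal priors). A two-dimensional sketch with made-up means and covariances, not the six-variable SAS estimates:

```python
import numpy as np

def quad_score(y, mean, cov):
    """Quadratic classification score for one group (equal priors)."""
    diff = np.asarray(y) - mean
    return (-0.5 * np.log(np.linalg.det(cov))
            - 0.5 * diff @ np.linalg.solve(cov, diff))

# Made-up 2-variable group means and covariance matrices (placeholders)
m1, m2 = np.array([0.0, 0.0]), np.array([2.0, 2.0])
S1, S2 = np.eye(2), 2.0 * np.eye(2)

# Assign the observation to the group with the larger quadratic score
y = np.array([0.3, -0.1])
group = 1 if quad_score(y, m1, S1) > quad_score(y, m2, S2) else 2
print(group)  # 1
```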
A third approach is the holdout method; again, the SAS results are copied on the following page.
There were 4 misclassifications among engineer apprentices and 2 among pilots, for a total error
rate of 0.1750. As expected, the error rate increased compared to the previous two methods,
giving a more realistic expectation of how the linear discriminant function will perform on future
subjects.
The overall regression appears to be significant at α = 5%, as indicated by all four tests shown
below. However, the R² values are relatively low for the overall model: only 25% of the variability
in y1 (relative weight) can be explained by x1 (glucose intolerance), x2 (insulin response to oral
glucose), and x3 (insulin resistance), and only 1.6% of the variability in y2 (fasting plasma glucose)
can be explained by x1, x2, and x3.
The relatively low values of R² suggest that more explanatory variables may be needed in order
to improve the model. For the purpose of this analysis, however, we will run a stepwise procedure
to identify redundancies: a backward elimination will be applied to find a subset of the x's.
To find the subset, we compute a conditional Wilks' lambda by formula 10.72 in the book. For
example, Λ(x1 | x2, x3) = Λ(x1, x2, x3) / Λ(x2, x3); this is the first value under X1 in the table
below, and the remaining x's are computed similarly. No variable could be eliminated at step 1,
because even the largest partial Wilks' lambda (0.93, for x1) was significant at α = 5%; since the
weakest contributor is still significant, the backward elimination process stops there.
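The conditional ratio from formula 10.72 can be sketched directly. The two overall lambdas below are illustrative placeholders chosen to reproduce the tabulated partial value, not the actual SAS output:

```python
# Conditional (partial) Wilks' lambda, formula 10.72:
#   Lambda(x1 | x2, x3) = Lambda(x1, x2, x3) / Lambda(x2, x3)
# Both inputs below are hypothetical placeholder values.
lam_full = 0.70       # hypothetical Lambda(x1, x2, x3)
lam_reduced = 0.753   # hypothetical Lambda(x2, x3)

lam_partial = lam_full / lam_reduced
print(round(lam_partial, 2))  # 0.93
```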
Step #   X1 (Glucose Intolerance)   X2 (Insulin Response to Oral Glucose)   X3 (Insulin Resistance)
1        Wilks' Lambda = 0.93       Wilks' Lambda = 0.89                    Wilks' Lambda = 0.76
It appears that none of the independent variables can be eliminated, and all three x's are needed
in the full model. However, given the small R² values, they explain only a very small portion of
the variability in the y's. Thus, as mentioned earlier, this model needs more explanatory variables
in order to increase its accuracy.
12.8. Carry out a principal component analysis on all six variables of the glucose data of Table
3.8. Use both S and R. Which do you think is more appropriate here? Show the percent of
variance explained. Based on the average eigenvalue or a scree plot, decide how many
components to retain. Can you interpret the components of either S or R?
The purpose of principal component analysis is to reduce the dimensionality of the data by
summarizing the original variables with a smaller number of uncorrelated components. For the
data in Table 3.8, principal components were computed on both S and R.
First, I present the run on the correlation matrix. The first four eigenvalues account for about
85% of the variance, which exceeds the 80% threshold, so we can keep the first four components.
In the case of the covariance matrix, the first three eigenvalues account for 89% of the variability,
which is above 80%, so we can keep the first three principal components. Fewer components are
needed here because the decomposition of S is dominated by the large variances of x1, x2, and x3.
In this case, because of the disparate variances in S, choosing the principal components from R
will be more appropriate.
To interpret the principal components in the case of R, we compute the correlations between
the chosen components and the original variables. The runs were done in SAS and can be seen
circled in the figure below. The correlations between the principal components and the variables
differ, and only those above 0.5 were considered meaningful. For example, after selecting the
first four principal components, a strong correlation (over 0.5) can be identified between the
first principal component and variables y1, y3, x1, x2, and x3. The second principal component
is strongly correlated with y2, x2 has a strong correlation with the third component, and y1 and
y3 are strongly correlated with the fourth principal component.
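This correlation step can be sketched as follows. The data matrix is a random placeholder standing in for Table 3.8; for PCA on R, the correlation between component i and variable j equals the loading a_ji * sqrt(λ_i):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6))  # placeholder data, not Table 3.8

# Standardize, then do PCA on the correlation matrix R
Z = (X - X.mean(0)) / X.std(0, ddof=1)
R = np.corrcoef(Z, rowvar=False)
eigval, eigvec = np.linalg.eigh(R)
idx = np.argsort(eigval)[::-1]          # largest eigenvalue first
eigval, eigvec = eigval[idx], eigvec[:, idx]

# Correlation between component i and variable j: eigvec[j, i] * sqrt(eigval[i])
loadings = eigvec * np.sqrt(eigval)

# Cross-check against the direct correlation of the scores with the variables
scores = Z @ eigvec
direct = np.array([np.corrcoef(scores[:, 0], Z[:, j])[0, 1] for j in range(6)])
print(np.allclose(direct, loadings[:, 0]))  # True
```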
12.12 Carry out a principal component analysis on the engineer data of Table 5.6 as follows:
(a) Use the pooled covariance matrix.
(b) Ignore groups and use a covariance matrix based on all 40 observations.
(c) Which of the approaches in (a) or (b) appears to be more successful?
Here, we run a principal component analysis using both the pooled covariance matrix and the
unpooled covariance matrix (ignoring groups). The two matrices were computed in SAS and are
shown in the figures below: the first figure is the pooled covariance matrix and the second is the
unpooled covariance matrix.
First, we run the component analysis on the pooled covariance matrix. The following results
were obtained in SAS. The first three components account for 87% of the variance, so it is
enough to keep them:
a1 = (0.212, -0.039, 0.08, 0.775, 0.0956, 0.580)
a2 = (0.389, 0.064, -0.066, -0.608, 0.01, 0.686)
a3 = (0.889, 0.096, 0.08, 0.08, 0.01, -0.434)
For the unpooled matrix, I could not use a SAS procedure directly, so I computed the
decomposition in PROC IML; the output is copied on the next page. The table under the figures
gives the cumulative proportions of the eigenvalues. Similar to the analysis done for the pooled
covariance matrix, the first three eigenvalues account for about 85% of the total variance, so the
first three eigenvectors (components) can be kept in the model.
Given that the two analyses are very similar, it appears that neither approach is clearly more
successful: the results are essentially independent of whether the pooled or the unpooled
covariance matrix is used.
Eigenvalue     Proportion   Cumulative Proportion
1,050.5963     38.6%        38.6%
858.3158       31.6%        70.2%
398.9035       14.7%        84.9%
259.1484       9.5%         94.4%
108.0892       4.0%         98.4%
43.3535        1.6%         100.0%
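As a quick check, the number of components to retain follows directly from the eigenvalues listed above:

```python
import numpy as np

# Eigenvalues from the IML run (unpooled covariance matrix)
eig = np.array([1050.5963, 858.3158, 398.9035, 259.1484, 108.0892, 43.3535])

prop = eig / eig.sum()
cum = np.cumsum(prop)

# Keep the smallest number of components whose cumulative proportion
# of explained variance reaches 80%
k = int(np.searchsorted(cum, 0.80)) + 1
print(k)  # 3
```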