
multivariate laboratory exercise iii

Student:

Asaad, Al-Ahmadgaid B.
website: www.alstat.weebly.com
email: alstated@gmail.com

Instructor: Prof. Baguio, Carolina B.
email: carolina.baguio@yahoo.com

A. Obtain data with two and three response variables.

1. Data with three response variables

Source: Rencher, A. C. (2002), Methods of Multivariate Analysis, 2nd Edition. pg. 56

Table A.1 Calcium in Soil and Turnip Greens

Location Number    y1    y2      y3
1                  35    3.5     2.80
2                  35    4.9     2.70
3                  40    30.0    4.38
4                  10    2.8     3.21
5                  6     2.7     2.73
6                  20    2.8     2.81
7                  35    4.6     2.88
8                  35    10.9    2.90
9                  35    8.0     3.28
10                 30    1.6     3.20

Table A.1 gives partial data from Kramer and Jensen (1969). Three variables were measured (in milliequivalents per 100 g) at 10 different locations in the South. The variables are y1 = available soil calcium, y2 = exchangeable soil calcium, and y3 = turnip green calcium. Test the normality of the data using R software. Solution:
Res3Data <- read.table(header = TRUE, text = "
y1 y2   y3
35 3.5  2.80
35 4.9  2.70
40 30.0 4.38
10 2.8  3.21
6  2.7  2.73
20 2.8  2.81
35 4.6  2.88
35 10.9 2.90
35 8.0  3.28
30 1.6  3.20
")
# mshapiro.test() expects one variable per row, hence the transpose
Res3Data.Mat <- t(Res3Data)
library(mvnormtest)
mshapiro.test(Res3Data.Mat)

Shapiro-Wilk normality test
data: Z
W = 0.5855, p-value = 3.718e-05

The p-value of 3.718e-05 is less than the level of significance, 0.05. Thus the null hypothesis is rejected, and we conclude that the data is not normally distributed. What about the quantile-quantile plot, will it coincide with the test used? Let's check it. Using the usual quantile-quantile plot in R, the following goes,
Res3Data.Mat <- as.matrix(Res3Data)
qqnorm(Res3Data.Mat)
qqline(Res3Data.Mat)

Figure 1 shows the output of the code above: the points do not follow the normal line, which coincides with the Shapiro-Wilk test done above. This is expected, however, since the data has three variables measured on different scales, and the plot pools all of their values into a single sample. The large values of variable y1 therefore lie far from the observations of variables y2 and y3, and this contributes to the apparent outliers on the plot. To avoid this, it is better to test the normality and draw the quantile-quantile plot of each variable separately, so that the units of the observations are homogeneous.

Figure 1: Normal Quantile-Quantile Plot of the Table A.1.

Testing the Normality of each variable

a. y1 = available soil calcium
# shapiro.test() is in base R (package stats); with() looks up y1 inside Res3Data
with(Res3Data, shapiro.test(y1))

Shapiro-Wilk normality test
data: y1
W = 0.7874, p-value = 0.0102

The p-value of variable y1 is 0.0102, which is less than the level of significance 0.05, and thus it is not normally distributed. Check out the quantile-quantile plot, Figure 2. Observe that in the normal probability plot of variable y1 there is a single point that is not inside the 95% confidence interval, which agrees with the test.

Figure 2: Normal Probability Plot of variable y1. The green line is the 95% confidence interval of the data, and the purple line is the normal line. For the codes of the plot refer to the appendix.

b. y2 = exchangeable soil calcium
# with() looks up y2 inside Res3Data, so attach() is not needed
with(Res3Data, shapiro.test(y2))

Shapiro-Wilk normality test
data: y2
W = 0.6405, p-value = 0.0001687

The observed p-value of variable y2 is also less than 0.05, and thus it is not normally distributed. Refer also to the quantile-quantile plot of variable y2, Figure 3. In the plot there are outliers which lie outside the 95% confidence interval, which coincides with the test performed on variable y2.

Figure 3: The normal probability plot of variable y2. The green line is the 95% confidence interval of the data, and the purple line is the normal line. For the codes of the plot refer to the appendix.

c. y3 = turnip green calcium
# with() looks up y3 inside Res3Data, so attach() is not needed
with(Res3Data, shapiro.test(y3))

Shapiro-Wilk normality test
data: y3
W = 0.7294, p-value = 0.002001

Again, the observations on the third variable are not normally distributed, since the p-value 0.002001 is less than the level of significance 0.05. The quantile-quantile plot of variable y3, Figure 4, also departs from normality, since there are again outliers that lie outside the 95% confidence interval.

Figure 4: The normal probability plot of variable y3. The green line is the 95% confidence interval of the data, and the purple line is the normal line. For the codes of the plot refer to the appendix.

Summing up the decisions for the three variables tested, the decision of the Shapiro-Wilk test which was first applied to the data combining the three variables holds: the observations in the data are not normally distributed. Since the data is not normally distributed, it is difficult to estimate the appropriate probability density function of the data with such a small sample size n.

2. Data with two response variables

Source: Härdle, W., et al. (2007), Multivariate Statistics: Exercises and Solutions. pg. 336

Table A.2 Sales Data

Period    Sales    Price    Advert    Ass. Hours
1         230      125      200       109
2         181      99       55        107
3         165      97       105       98
4         150      115      85        71
5         97       120      0         82
6         192      100      150       103
7         181      80       85        111
8         189      90       120       93
9         172      95       110       86
10        170      125      130       78

This is a data set consisting of 10 measurements of 4 variables. The story: A textile shop manager is studying the sales of "classic blue" pullovers over 10 periods. He uses three different marketing methods and hopes to understand his sales as a fit of these variables using statistics. The variables measured are
X1: Numbers of sold pullovers,
X2: Price (in EUR),
X3: Advertisement costs in local newspapers (in EUR),
X4: Presence of a sales assistant (in hours per period).
Only the first two variables, Sales (X1) and Price (X2), are used as the two response variables below.


Test the normality of the data using R software. Solution:


Res2Data <- read.table(header = TRUE, text = "
Sales Price
230   125
181   99
165   97
150   115
97    120
192   100
181   80
189   90
172   95
170   125
")
# mshapiro.test() expects one variable per row, hence the transpose
Res2Data.Mat <- t(Res2Data)
library(mvnormtest)
mshapiro.test(Res2Data.Mat)

Shapiro-Wilk normality test
data: Z
W = 0.8834, p-value = 0.1429

Since the p-value is greater than the level of significance 0.05, the null hypothesis that the data is normal is not rejected. Now we will plot the quantile-quantile plot of the data.
Res2Data.Mat <- as.matrix(Res2Data)
qqnorm(Res2Data.Mat)
qqline(Res2Data.Mat)

Most of the points are along the line (refer to Figure 5), but there are still outliers. For the plot to coincide with the test used, the points should follow the normal line more closely. Just as before, it is better to test the normality of each variable, to ensure the homogeneity of the measurements.

Figure 5: Normal Quantile-Quantile Plot of the Table A.2.

Testing the Normality of each variable

a. Sales - products sold
# shapiro.test() is in base R (package stats); with() looks up Sales inside Res2Data
with(Res2Data, shapiro.test(Sales))

Shapiro-Wilk normality test
data: Sales
W = 0.9067, p-value = 0.2591

The p-value generated is greater than the level of significance 0.05, and thus normality is not rejected for the observations on variable Sales. Furthermore, the quantile-quantile plot in Figure 6 is also consistent with normality, since all of the points fluctuate within the 95% confidence interval.

Figure 6: The normal probability plot of variable Sales. The green line is the 95% confidence interval of the data, and the purple line is the normal line. For the codes of the plot refer to the appendix.

b. Price - Price of the products sold


# with() looks up Price inside Res2Data, so attach() is not needed
with(Res2Data, shapiro.test(Price))

Shapiro-Wilk normality test
data: Price
W = 0.9187, p-value = 0.346

For this variable the p-value is also greater than the level of significance 0.05, which means that the null hypothesis H0, that the data is normally distributed, is not rejected. As seen in Figure 7, the points are within the 95% confidence interval, implying that the observations in variable Price of the Sales data are consistent with a normal distribution.

Figure 7: The normal probability plot of variable Price. The green line is the 95% confidence interval of the data, and the purple line is the normal line. For the codes of the plot refer to the appendix.

Thus, summing up the conclusions for the two variables above (Sales and Price), the Sales data is treated as normally distributed, which coincides with the Shapiro-Wilk test for multivariate data, in which the two variables were combined and tested.

If Normal, what is the probability density function of the data?

Let the variable Sales be S, and Price be P. If the data set Res2Data is X, then

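$$
f(X) = \frac{1}{2\pi\,(|\Sigma|)^{1/2}}
\exp\left\{ -\frac{1}{2}
\begin{bmatrix} (S-\mu_S) & (P-\mu_P) \end{bmatrix}
\begin{bmatrix} \sigma_S^2 & \sigma_{SP} \\ \sigma_{PS} & \sigma_P^2 \end{bmatrix}^{-1}
\begin{bmatrix} (S-\mu_S) \\ (P-\mu_P) \end{bmatrix}
\right\} \qquad (1)
$$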

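Note that the $\Sigma$ is just equal to

$$
\Sigma = \begin{bmatrix} \sigma_S^2 & \sigma_{SP} \\ \sigma_{PS} & \sigma_P^2 \end{bmatrix}
$$

And these are the values of each matrix above,

a.
$$
\begin{bmatrix} \sigma_S^2 & \sigma_{SP} \\ \sigma_{PS} & \sigma_P^2 \end{bmatrix}
= \begin{bmatrix} 1152.46 & -88.91 \\ -88.91 & 244.27 \end{bmatrix}
$$

b.
$$
\begin{bmatrix} \sigma_S^2 & \sigma_{SP} \\ \sigma_{PS} & \sigma_P^2 \end{bmatrix}^{-1}
= \begin{bmatrix} 0.00089 & 0.00032 \\ 0.00032 & 0.0042 \end{bmatrix}
$$

c.
$$
(|\Sigma|)^{1/2}
= \begin{vmatrix} \sigma_S^2 & \sigma_{SP} \\ \sigma_{PS} & \sigma_P^2 \end{vmatrix}^{1/2}
= 523.0691
$$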

d. $\begin{bmatrix} (S-\mu_S) & (P-\mu_P) \end{bmatrix}$ = Evaluated for every observation, this gives a matrix with one row per observation, which is too large to reproduce here. However, the following code will generate it.
# Deviations of each variable from its sample mean
FVar <- Res2Data$Sales - 172.7   # Sales minus mean(Sales)
SVar <- Res2Data$Price - 104.6   # Price minus mean(Price)
MatrixA <- cbind(FVar, SVar)     # 10 x 2, one row of deviations per observation

e. $\begin{bmatrix} (S-\mu_S) \\ (P-\mu_P) \end{bmatrix}$ = As in (d), the result is too large to reproduce here. The following code will generate it.
FVar <- Res2Data$Sales - 172.7
SVar <- Res2Data$Price - 104.6
MatrixA <- cbind(FVar, SVar)
MatrixB <- t(MatrixA)            # 2 x 10, one column of deviations per observation
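The pieces (a) to (e) can be combined to evaluate equation (1) at each observation. The following is a minimal sketch in base R, assuming the Sales and Price columns of Table A.2; the names mu, Sigma, SigmaInv, rootDet, Dev, Q and f are illustrative only and are not part of the solution above.

```r
# Sales and Price columns of Table A.2
Sales <- c(230, 181, 165, 150, 97, 192, 181, 189, 172, 170)
Price <- c(125, 99, 97, 115, 120, 100, 80, 90, 95, 125)
X <- cbind(Sales, Price)

mu <- colMeans(X)              # sample means of Sales and Price
Sigma <- cov(X)                # item (a): sample covariance matrix
SigmaInv <- solve(Sigma)       # item (b): its inverse
rootDet <- sqrt(det(Sigma))    # item (c): square root of the determinant

Dev <- sweep(X, 2, mu)         # item (d): deviations, one row per observation

# Quadratic form d' Sigma^{-1} d for each observation (each row d of Dev)
Q <- rowSums((Dev %*% SigmaInv) * Dev)

# Bivariate normal density of equation (1) at each observation
f <- exp(-Q / 2) / (2 * pi * rootDet)

round(Sigma, 2)
round(rootDet, 4)
```

Working row by row avoids forming the full product of the 10 x 2 matrix in (d) with the 2 x 10 matrix in (e), of which only the diagonal is needed.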

Plot the probability density function.

Two types of plot will be generated from the density function: the contour plot and the three-dimensional plot. To see the changes in the smoothing of the density, three bandwidths will be used: 5, 10, and 15.

a. Bandwidth = (5,5)
library(KernSmooth)
est <- bkde2D(Res2Data, bandwidth = c(5, 5))
persp(est$fhat, xlab = "X", ylab = "y", theta = 45, phi = 45, col = "lightblue", shade = 0.1)
persp(est$fhat, xlab = "X", ylab = "y", theta = 45, phi = 90, col = "lightblue", shade = 0.1)


Figure 8: 2D Binned Kernel Density Estimate, with bandwidth of (5,5).

contour(est$x1,est$x2,est$fhat,col = "blue")

Figure 9: Contour plot of 2D Binned Kernel Density Estimate, with bandwidth of (5,5).

b. Bandwidth = (10,10)
library(KernSmooth)
est <- bkde2D(Res2Data, bandwidth = c(10, 10))
persp(est$fhat, xlab = "X", ylab = "y", theta = 45, phi = 45, col = "red", shade = 0.1)
persp(est$fhat, xlab = "X", ylab = "y", theta = 45, phi = 90, col = "red", shade = 0.1)

Figure 10: 2D Binned Kernel Density Estimate, with bandwidth of (10,10).

contour(est$x1,est$x2,est$fhat,col = "red")

Figure 11: Contour plot of 2D Binned Kernel Density Estimate, with bandwidth of (10,10). This is a contour plot of Figure 10.


c. Bandwidth = (15,15)
library(KernSmooth)
est <- bkde2D(Res2Data, bandwidth = c(15, 15))
persp(est$fhat, xlab = "X", ylab = "y", theta = 45, phi = 45, col = "green", shade = 0.1)
persp(est$fhat, xlab = "X", ylab = "y", theta = 45, phi = 90, col = "green", shade = 0.1)

Figure 12: 2D Binned Kernel Density Estimate, with bandwidth of (15,15).

contour(est$x1,est$x2,est$fhat,col = "green")

Figure 13: Contour plot of 2D Binned Kernel Density Estimate, with bandwidth of (15,15).
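The effect of the bandwidth on the smoothness can also be put into numbers. The sketch below is a plain product-Gaussian kernel density estimate written in base R; it is not the binned algorithm used by bkde2D, and the helper name kde2 and the 51-point grid are assumptions made for illustration. It reports the height of the highest peak of the estimate for each of the three bandwidths.

```r
# Sales and Price columns of Table A.2
Sales <- c(230, 181, 165, 150, 97, 192, 181, 189, 172, 170)
Price <- c(125, 99, 97, 115, 120, 100, 80, 90, 95, 125)

# Unbinned 2-D kernel estimate with a product Gaussian kernel.
# h is the common bandwidth; ngrid is the number of grid points per axis.
kde2 <- function(x, y, h, ngrid = 51) {
  gx <- seq(min(x) - 1.5 * h, max(x) + 1.5 * h, length.out = ngrid)
  gy <- seq(min(y) - 1.5 * h, max(y) + 1.5 * h, length.out = ngrid)
  # Kernel weights of every grid point with respect to every observation
  Kx <- outer(gx, x, function(g, xi) dnorm(g, mean = xi, sd = h))
  Ky <- outer(gy, y, function(g, yi) dnorm(g, mean = yi, sd = h))
  # Average of the product kernels over the observations
  list(x1 = gx, x2 = gy, fhat = Kx %*% t(Ky) / length(x))
}

bandwidths <- c(5, 10, 15)
peaks <- sapply(bandwidths, function(h) max(kde2(Sales, Price, h)$fhat))
names(peaks) <- bandwidths
peaks

# Contour plot of the unbinned estimate for the largest bandwidth
est15 <- kde2(Sales, Price, 15)
contour(est15$x1, est15$x2, est15$fhat, col = "green")
```

A larger bandwidth spreads each observation's kernel over a wider area, so the highest peak becomes lower and the separate bumps merge.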


It is observed that the three-dimensional plot of the data is not very smooth in the first plot with bandwidth (5,5), but with bandwidth (10,10) it becomes a little smoother. The third bandwidth makes it smoother than the other two, but there are still two circles seen on the plot that keep it from being perfectly smooth. With a further increase in the bandwidth, however, the plot forms a smooth, normal-looking surface. For example, using the bandwidth (45,45) below produces a smooth surface over the mesh induced by the grid points.
est <- bkde2D(Res2Data, bandwidth = c(45, 45))
persp(est$fhat, xlab = "X", ylab = "y", theta = 45, phi = 45, col = "yellow", shade = 0.4)
persp(est$fhat, xlab = "X", ylab = "y", theta = 45, phi = 90, col = "yellow", shade = 0.4)

Figure 14: 2D Binned Kernel Density Estimate, with bandwidth of (45,45).

Notice that the plot is now smooth. Moreover, referring also to Figure 15, the two circles no longer appear; the contours now form a single ellipse.

Figure 15: Contour plot of 2D Binned Kernel Density Estimate, with bandwidth of (45,45).

Codes of the plot,
contour(est$x1,est$x2,est$fhat,col = "yellow2")
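For comparison with the kernel estimates, the bivariate normal density of equation (1) can itself be evaluated on a grid and drawn in the same way. This is a sketch in base R under the assumption that the sample means and the sample covariance matrix are used as the parameter values; the function name dbvnorm and the grid limits of three standard deviations are hypothetical choices, not part of the solution above.

```r
# Sales and Price columns of Table A.2
Sales <- c(230, 181, 165, 150, 97, 192, 181, 189, 172, 170)
Price <- c(125, 99, 97, 115, 120, 100, 80, 90, 95, 125)
X <- cbind(Sales, Price)
mu <- colMeans(X)
Sigma <- cov(X)

# Bivariate normal density for vectors s and p of equal length
dbvnorm <- function(s, p, mu, Sigma) {
  Dev <- cbind(s - mu[1], p - mu[2])
  Q <- rowSums((Dev %*% solve(Sigma)) * Dev)
  exp(-Q / 2) / (2 * pi * sqrt(det(Sigma)))
}

# Grid covering three standard deviations on each side of the means
gs <- seq(mu[1] - 3 * sqrt(Sigma[1, 1]), mu[1] + 3 * sqrt(Sigma[1, 1]),
          length.out = 51)
gp <- seq(mu[2] - 3 * sqrt(Sigma[2, 2]), mu[2] + 3 * sqrt(Sigma[2, 2]),
          length.out = 51)
dens <- outer(gs, gp, dbvnorm, mu = mu, Sigma = Sigma)

contour(gs, gp, dens, xlab = "Sales", ylab = "Price", col = "blue")
persp(gs, gp, dens, xlab = "Sales", ylab = "Price", zlab = "density",
      theta = 45, phi = 45, col = "lightblue", shade = 0.1)
```

The contours of this fitted density are ellipses centred at the two means, which is the shape the kernel estimate approaches as the bandwidth grows.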


Appendix

A. R Codes for figures 2, 3, 4, 6, and 7.

By using "Variable" as a placeholder, the code can be modified for different variables of the data set. Since there are two data sets (three and two response variables), the placeholder for the data set is "DataSet". Thus, when using the two-response data, simply replace "DataSet" with "Res2Data". For the Sales variable of the Sales data, replace "Variable" with Sales and "DataSet" with "Res2Data".
library(ggplot2)
library(MASS)    # fitdistr(): maximum-likelihood fitting of univariate distributions
attach(DataSet)

# Sorted data against theoretical normal quantiles
df <- data.frame(x = sort(Variable), y = qnorm(ppoints(length(Variable))))
probs <- c(0.01, 0.05, seq(0.1, 0.9, by = 0.1), 0.95, 0.99)
qprobs <- qnorm(probs)

# Normal line through the first and third quartiles
xl <- quantile(Variable, c(0.25, 0.75))
yl <- qnorm(c(0.25, 0.75))
slope <- diff(yl) / diff(xl)
int <- yl[1] - slope * xl[1]

fd <- fitdistr(Variable, "normal")
# Estimated percentiles for the fitted normal
xp_hat <- fd$estimate[1] + qprobs * fd$estimate[2]
# Variance of the estimated percentiles
v_xp_hat <- fd$sd[1]^2 + qprobs^2 * fd$sd[2]^2 +
  2 * qprobs * fd$vcov[1, 2]
xpl <- xp_hat + qnorm(0.025) * sqrt(v_xp_hat)   # lower bound
xpu <- xp_hat + qnorm(0.975) * sqrt(v_xp_hat)   # upper bound
df.bound <- data.frame(xp = xp_hat, xpl = xpl, xpu = xpu, nquant = qprobs)

# theme() with element_text()/element_rect() replaces the removed
# opts(), theme_text() and theme_rect().
# linewidth needs ggplot2 >= 3.4.0; older versions use size instead.
ggplot(data = df, aes(x = x, y = y)) +
  geom_point(colour = "darkred", size = 3) +
  geom_abline(intercept = int, slope = slope,
              colour = "purple", linewidth = 2, alpha = 0.5) +
  scale_y_continuous(limits = range(qprobs), breaks = qprobs,
                     labels = 100 * probs) +
  geom_line(data = df.bound, aes(x = xpl, y = nquant),
            colour = "darkgreen", alpha = 0.5, linewidth = 1) +
  geom_line(data = df.bound, aes(x = xpu, y = nquant),
            colour = "darkgreen", alpha = 0.5, linewidth = 1) +
  xlab(expression(bold("Variable"))) +
  ylab(expression(bold("Normal % Probability"))) +
  ggtitle(expression(bold("Normal Probability Plot of Variable"))) +
  theme_bw() +
  theme(plot.title = element_text(size = 20, colour = "darkblue"),
        panel.border = element_rect(linewidth = 2, colour = "red", fill = NA))
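The green confidence lines in the code above come from the variance of an estimated normal percentile, x_p = mean + z_p * sd. The sketch below redoes that calculation in base R as a function, so that no placeholder has to be replaced by hand. It assumes the closed-form maximum-likelihood standard errors sd/sqrt(n) for the mean and sd/sqrt(2n) for the standard deviation, with zero covariance between them, which is taken here to match what fitdistr reports for the normal case; the function name percentile_band is hypothetical.

```r
# Confidence band for the percentiles of a fitted normal distribution.
# x: numeric vector; probs: percentile levels; level: confidence level.
percentile_band <- function(x,
                            probs = c(0.01, 0.05, seq(0.1, 0.9, by = 0.1),
                                      0.95, 0.99),
                            level = 0.95) {
  n <- length(x)
  m <- mean(x)
  s <- sqrt(mean((x - m)^2))             # maximum-likelihood sd (divisor n)
  z <- qnorm(probs)
  xp_hat <- m + z * s                    # estimated percentiles
  v <- s^2 / n + z^2 * s^2 / (2 * n)     # variance; covariance term assumed zero
  half <- qnorm(1 - (1 - level) / 2) * sqrt(v)
  data.frame(percent = 100 * probs, xp = xp_hat,
             lower = xp_hat - half, upper = xp_hat + half)
}

# Example with the Sales variable of Table A.2
Sales <- c(230, 181, 165, 150, 97, 192, 181, 189, 172, 170)
band <- percentile_band(Sales)
band
```

The band is narrowest at the 50th percentile, where only the uncertainty of the mean contributes, and widens toward the tails.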
