
Problem 3.


b)
> library(MASS)

> datv = scan() # copy the given data

1: 0.07 9.0

3: 0.09 9.0

5: 0.08 9.0

7: 0.16 7.0

9: 0.17 7.0

11: 0.21 7.0

13: 0.49 5.0

15: 0.58 5.0

17: 0.53 5.0

19: 1.22 3.0

21: 1.15 3.0

23: 1.07 3.0

25: 2.84 1.0

27: 2.57 1.0

29: 3.10 1.0

31:

Read 30 items

> dat = matrix(datv,15,2,byrow=T)

> y = dat[,1]

> x = dat[,2]

>

> fit1 = lm(y~x)

> box.c = boxcox(fit1, lambda=seq(-2,2,0.5),plotit=T) # possible lambda values

> box.opt = box.c$x[which.max(box.c$y)] # choose the x value which has the maximum y value

> ci = range(box.c$x[box.c$y>max(box.c$y)-qchisq(0.95,1)/2]) # get the lower and upper interval

> c(best=box.opt,"lower"=ci[1],"upper"=ci[2])[c(2,1,3)] # report the lower, best and upper values

lower best upper

-0.10101010 0.02020202 0.10101010


Figure 1. Box-Cox plot for the estimation of λ.

For the study of the concentration of a solution (Y) over time (X), the transformation that leads to the
smallest SSE corresponds to λ̂ = 0.02. The lower and upper limits of the approximate 95% confidence
interval for λ are -0.1 and 0.1, respectively.

Based on this interval and the ladder of powers table (which lists the most common powers used in the
Box-Cox transformation), the logarithmic transformation was chosen (λ = 0 corresponds to the logarithm
by definition).
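
As a quick numerical check (not part of the original output), the Box-Cox family (y^λ - 1)/λ approaches
log(y) as λ → 0, which is why λ = 0 is defined as the logarithmic transformation:

> y0 = 2.84 # one concentration value from the data
> lam = 1e-6 # lambda very close to zero
> (y0^lam - 1)/lam # Box-Cox transform near lambda = 0
> log(y0) # natural log; the two values agree to several decimals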

Table 1. Box-Cox ladder of power transformations.

λ y′ = y^λ Name
2 y^2 Square
1 y Original (no transformation)
1/2 √y Square root
0 log(y) Logarithmic
-1/2 1/√y Reciprocal square root
-1 1/y Reciprocal
c)
### Model logarithmic fit ###

> slogfit1 = lm(log10(y)~x) # logarithmic transformation

> summary(slogfit1)

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 0.654880 0.026181 25.01 2.22e-12 ***

x -0.195400 0.004557 -42.88 2.19e-15 ***

Residual standard error: 0.04992 on 13 degrees of freedom

Multiple R-squared: 0.993, Adjusted R-squared: 0.9924

F-statistic: 1838 on 1 and 13 DF, p-value: 2.188e-15

β̂₀ = 0.6549, from R "Estimate", (Intercept) row

β̂₁ = -0.1954, from R "Estimate", x row

Estimated regression function: log₁₀(ŷᵢ) = 0.6549 - 0.1954xᵢ
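
Since the model was fit on the log₁₀ scale, a predicted concentration on the original scale can be
recovered by exponentiating (a quick sketch using slogfit1 from above; x = 5 is just an illustrative time):

> 10^predict(slogfit1, newdata=data.frame(x=5)) # back-transformed prediction at x = 5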

### ANOVA ###

> slogfit0 = lm(log10(y)~1) # intercept-only model, used to obtain the total row of the ANOVA table

> rbind(anova(slogfit1), anova(slogfit0))

Analysis of Variance Table

Response: log10(y)

Df Sum Sq Mean Sq F value Pr(>F)

x 1 4.5818 4.5818 1838.2 2.188e-15 ***

Residuals 13 0.0324 0.0025

Residuals1 14 4.6142 0.3296

Table 2. ANOVA table layout.

Source df SS MS F
Regression 1 SSR MSR F*
Error n-2 SSE MSE -
Total n-1 SST - -
Table 3. ANOVA table for the problem.

Source df SS MS F
Regression 1 4.582 4.582 1838.2
Error 13 0.032 0.0025 -
Total 14 4.614 - -

Hypotheses:

Test H₀: β₁ = 0 versus Hₐ: β₁ ≠ 0 at level α = 0.05

Test statistic value:

F* = MSR/MSE = 4.582/0.0025 ≈ 1838.2

Statistical decision:

For the ANOVA F-test, reject H₀ if F* > F(1 - α; 1, n - 2). Assuming α = 0.05 and n = 15,

F(1 - α; 1, n - 2) = F(0.95; 1, 13) = 4.667

Reject H₀ since F* = 1838.2 > F(0.95; 1, 13) = 4.667.
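
The critical value and P-value used in this decision can be obtained directly in R (a small sketch, not
part of the original transcript):

> qf(0.95, 1, 13) # critical value F(0.95; 1, 13), about 4.667
> 1 - pf(1838.2, 1, 13) # P-value of the F-test, effectively zero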
Practical conclusion:

Since H₀ is rejected in favor of Hₐ (conclude β₁ ≠ 0) at the significance level α = 0.05, there is
evidence of a linear association in the population between the logarithm of the solution concentration
and time.

e)
> pred = predict(slogfit1) # prediction

> student = rstandard(slogfit1) # studentized residuals

> plot(pred,student,pch=16); abline(h=0)


Figure 2. Studentized residuals versus predicted values.
> qqnorm(student, pch=16); abline(a=0,b=1) # qqplot

Figure 3. Normal qqplot for the SLR model between the logarithm of the solution concentration and time.
Based on Figure 2., the points show some ups and downs (curvature), but it is hard to conclude that a
real trend exists. There may be slight non-constancy of the variances, but this could only be confirmed
by performing the Breusch-Pagan test. Likewise, it is hard to conclude that there is any negative
autocorrelation of the error terms: some alternating increases and decreases can be seen, but the pattern
is not strong enough to conclude that the errors are dependent in the population. A Durbin-Watson test
would be required to draw a firmer conclusion. As for outliers, no value deviates strongly from the
studentized residual = 0 line. Only two points have studentized residuals larger than 1.5 in absolute
value, which is not enough to consider them outliers. An outlier test using the Bonferroni correction
would be needed for a more reliable conclusion.
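
The two follow-up tests mentioned above can be sketched in R (dwtest comes from the lmtest package and
outlierTest from the car package; the use of car is an assumption here, not part of the original
transcript):

> library(lmtest); library(car)
> dwtest(slogfit1) # Durbin-Watson test for autocorrelation of the errors
> outlierTest(slogfit1) # Bonferroni-adjusted outlier test on the studentized residuals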

According to Figure 3., although the judgment is somewhat subjective, it is reasonable to assume that the
points follow the straight line. The points that deviate most are the three near -1, but even there the
discrepancies are small, and all the other points lie almost exactly on the line. Therefore, no concerns
are raised about the normality assumption, and it is reasonable to conclude that the model errors are
normally distributed.
Problem 3.1

a)
> ###Data###

> dat = read.table("C:/soccer.txt",header=T)

> x=dat$x; y = dat$y;

>

> ### Model fit###

> fit1 = lm(y~x)

>

> ### Residual plots ###

> student = rstandard(fit1)

> fitted = predict(fit1) # with no new predictors supplied, predict() returns the fitted values

> plot(fitted,student,pch=16); abline(h=0)

Figure 4. Studentized residuals versus fitted values of the soccer data.


Based on Figure 4., there are two points with studentized residuals close to 2 and one near -2. These can
be considered potential outliers that deviate from the model. Keeping these points in the model may give
a misleading fit, since least squares minimizes the sum of squared residuals and is therefore sensitive
to extreme points. The Bonferroni correction method can be used to identify outliers more reliably. If
there were direct evidence against these observations (for example, recording errors), removing them from
the data set could be considered for further analyses. However, the apparent outliers might also reflect
the influence of predictor variables not analyzed in this problem; other independent variables might
include age, number of assists, passing accuracy, shots on target, etc.

According to the shape of the plot, there seems to be no reason for concern about the equal variance
assumption: the points are randomly scattered with no visible pattern. Likewise, since there is no
particular trend, neither negative nor positive autocorrelation is apparent. As a result, it is
reasonable to treat the model errors as independent.

b)
> qqnorm(student,pch=16); abline(a=0,b=1)

Figure 5. Normality plot of the studentized residuals for the soccer data.

> shapiro.test(student)

Shapiro-Wilk normality test

data: student

W = 0.91898, p-value = 0.1859


The normality plot of the 15 studentized residuals shows that they lie fairly close to the theoretical
normal quantile line, except for the two studentized residuals near 2. Even those values do not deviate
far from the straight line, though there is a slight chance they affect the normality assumption of the
model errors.

Test for normality

Hypotheses:

H₀: all model errors {eᵢ} are normally distributed

Hₐ: not all model errors {eᵢ} are normally distributed
Test statistic value:

From R code

W* = 0.9190
P-value = 0.1859
Statistical decision:

Since the Shapiro-Wilk test statistic is W* = 0.9190 with a P-value of 0.1859, which is greater than 0.05
(assumed), we fail to reject H₀ at the 0.05 level.

Practical conclusion:

Since H₀ is not rejected in favor of Hₐ at the significance level α = 0.05, there is no evidence against
the assumption that the model errors {eᵢ} are normally distributed.

c)
> library(lmtest)

> chi2crit = qchisq(0.95,1)


> bptest(fit1,studentize=T) # modified BP test

studentized Breusch-Pagan test

data: fit1

BP = 1.5926, df = 1, p-value = 0.207


Test for constancy of variance

Statistical hypotheses:

H₀: σᵢ² = σ² for all i versus Hₐ: σᵢ² ≠ σ² for some i


χ²_BPm = SSR* ÷ (SST*/n), which approximately follows χ²(p - 1), where SSR* and SST* are the regression
and total sums of squares from regressing the squared residuals on X.
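
As a sketch (using objects already defined in this session), the statistic above can be reproduced by
hand through the auxiliary regression of the squared residuals on x:

> e2 = resid(fit1)^2 # squared residuals from the original fit
> aux = lm(e2 ~ x) # auxiliary regression
> n = length(e2)
> n * summary(aux)$r.squared # n times R-squared of the auxiliary fit, the studentized BP statistic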
From R code
χ²_BPm = 1.5926
P-value = 0.207

Statistical decision:

Since χ²_BPm = 1.5926 < 3.841 = χ²(0.95; 1), fail to reject H₀ at the 0.05 level.

Statistical conclusion:

Since H₀ is not rejected in favor of Hₐ at the significance level α = 0.05, there is no evidence against
the homogeneity of variance assumption for the model errors.

The residuals shown in Figure 4. are randomly scattered without any specific tendency (no linear or
polynomial trend), which is a good indication that the error variance is constant and is consistent with
the Breusch-Pagan test result. It can be noticed that there are fewer points above the error mean = 0
line than below it; nevertheless, the vertical spread seems to be about the same above and below the line
and does not appear to increase or decrease. More data might be needed to draw a firmer conclusion from
the plot alone.
