
AMA Individual Assignment

Utsav Vadgama
2014IPM106 Section B
1. Use regression data. This dataset talks about the sales of different cereals and also explains the
amount of calories, protein, fats, etc. in each cereal. Further, it provides the insights on where
these cereals are located (variable-shelf) and the advertising amount spent on them. We also
know the weight and the cups available. (Total points=40)

a. Estimate and interpret a regression model with sales as DV and shelf, calories, protein, fat,
sodium, fiber, carbo, sugars, potass, vitamins, weight, cups, and adv as IVs (consider 0.05
significance level). Report the significance and the performance of the model.

After checking the normality of dependent variable sales, the following results were obtained:
Shapiro-Wilk normality test

data: reg_data$sales
W = 0.94507, p-value = 0.002304
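Because the p-value (0.002304) is below 0.05, the null hypothesis of normality is rejected: sales
departs from a normal distribution even though the W statistic is fairly high. The regression is
still reported below, since the normality assumption of a linear model concerns the residuals
rather than the dependent variable itself.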
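Since the assumption applies to the residuals, the same test can be run on them after fitting a model. The lines below are only a minimal sketch on simulated data (the data frame sim and the names sim_fit and res_test are made up for illustration and are not part of the assignment data):

```r
# Simulated stand-in data; the assignment's reg_data is not used here
set.seed(1)
sim <- data.frame(x = rnorm(80))
sim$y <- 2 + 3 * sim$x + rnorm(80)

sim_fit <- lm(y ~ x, data = sim)              # Fit a simple linear model
res_test <- shapiro.test(residuals(sim_fit))  # Shapiro-Wilk test on the residuals
print(res_test)

# TRUE would mean normality of the residuals is not rejected at the 0.05 level
res_test$p.value > 0.05
```

For the assignment itself, the analogous call would be shapiro.test(residuals(reg_mod)) once reg_mod has been fitted.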

The regression model for the above specification was done and the results obtained are as follows:
Call:
lm(formula = sales ~ shelf + calories + protein + fat + sodium +
fiber + carbo + sugars + potass + vitamins + weight + cups +
adv, data = reg_data)

Residuals:
Min 1Q Median 3Q Max
-443561 -165896 -10392 116442 409273

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 182906.0 269565.0 0.679 0.50009
shelfMiddle 117127.0 84512.0 1.386 0.17099
shelfTop 71963.7 81271.4 0.885 0.37950
calories -12187.0 5798.8 -2.102 0.03986 *
protein 101834.2 41679.7 2.443 0.01757 *
fat 149428.6 60355.6 2.476 0.01618 *
sodium 575.0 383.2 1.501 0.13876
fiber 63904.1 35734.5 1.788 0.07886 .
carbo 84887.3 27295.3 3.110 0.00288 **
sugars 84335.0 25929.1 3.253 0.00189 **
potass -1305.8 1285.2 -1.016 0.31378
vitamins -1059.6 1483.5 -0.714 0.47786
weight -804621.2 420162.9 -1.915 0.06034 .
cups -152483.3 149148.5 -1.022 0.31078
adv 805.4 701.5 1.148 0.25561
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 225000 on 59 degrees of freedom
(3 observations deleted due to missingness)
Multiple R-squared: 0.2409, Adjusted R-squared: 0.06082
F-statistic: 1.338 on 14 and 59 DF, p-value: 0.2139

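The model has a low R^2 of 0.2409 and an even lower adjusted R^2 of 0.06082. The F-statistic of
1.338 (on 14 and 59 DF) has a p-value of 0.2139, which is above 0.05, so the model as a whole is not
significant and does not establish a relationship between sales and the independent variables taken
together. Individually, calories, protein, fat, carbo and sugars have p-values below 0.05, but these
should be read with caution because the overall model is not significant.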
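
The F-statistic and adjusted R^2 reported in the summary can be cross-checked from R^2 and the degrees of freedom alone. This is a small sketch using only the numbers printed above; the variable names are chosen here for illustration:

```r
r2     <- 0.2409          # Multiple R-squared from the summary
k      <- 14              # Numerator DF (number of estimated slopes)
df_res <- 59              # Residual DF
n      <- df_res + k + 1  # Observations used after dropping the 3 incomplete rows

f_stat <- (r2 / k) / ((1 - r2) / df_res)             # Overall F-statistic
p_val  <- pf(f_stat, k, df_res, lower.tail = FALSE)  # Its p-value
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_res            # Adjusted R-squared

round(c(F = f_stat, p = p_val, adj_R2 = adj_r2), 4)
```

The values agree with the summary up to rounding of the printed R^2.
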
b. How is “fat” related to sales? (consider 0.1 significance level).

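Fat is positively related to sales. Its coefficient (149428.6, p-value 0.01618) is significant at
the 0.1 level, so a 1 unit increase in fat is associated with an increase of 149428.6 units in sales,
holding the other variables constant. Since the overall model is not significant, this estimate
should be interpreted with caution.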

c. Is there any difference in sales if a product is kept in different shelf?

There is no significant difference in sales when a product is kept on different shelves. In the
regression, neither shelfMiddle (p-value 0.17099) nor shelfTop (p-value 0.37950) is significant, so
sales on these shelves do not differ significantly from sales on the reference shelf level.

This is confirmed by a one-way Anova of sales on shelf. The F value of 1.037 has a p-value of 0.359,
so the null hypothesis of equal mean sales across shelves is not rejected:
Df Sum Sq Mean Sq F value Pr(>F)
shelf 2 1.149e+11 5.746e+10 1.037 0.359
Residuals 74 4.099e+12 5.539e+10
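
The F value in the Anova table is the ratio of the two mean squares, and its p-value comes from the F distribution with 2 and 74 degrees of freedom. A short check using the printed (rounded) values, with variable names chosen for illustration:

```r
ms_shelf <- 5.746e10  # Mean Sq for shelf from the table
ms_res   <- 5.539e10  # Mean Sq for residuals from the table

f_val     <- ms_shelf / ms_res                               # F value
p_val_aov <- pf(f_val, df1 = 2, df2 = 74, lower.tail = FALSE) # Pr(>F)

round(c(F = f_val, p = p_val_aov), 3)
```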

d. Please provide the R code for (a), (b), and (c).

library(ggpubr) # Provides ggqqplot(); loaded before it is used

d <- file.choose() # To choose the file path

reg_data <- read.csv(d)

str(reg_data) # To check if all the variables are specified correctly or not

ggqqplot(reg_data$sales) # Q-Q plot to check the normality assumption

shapiro.test(reg_data$sales) # Shapiro-Wilk test to check the normality assumption

reg_mod <- lm(sales ~ shelf + calories + protein + fat + sodium + fiber + carbo + sugars + potass +
              vitamins + weight + cups + adv, data = reg_data) # Initial model

summary(reg_mod) # Coefficients, R-squared and F-statistic used in (a), (b) and (c)

res.aov <- aov(sales ~ shelf, data = reg_data) # Anova for sales and shelf level

summary(res.aov) # Summary of the analysis

2. Use logit data.

a. Estimate and interpret a logit model (Model a) where dv=coke_selection and rest are IVs.

The following results were obtained after performing logit regression:


Call:
glm(formula = coke.selection ~ ., family = binomial(link = "logit"),
data = log_data)

Deviance Residuals:
Min 1Q Median 3Q Max
-2.9664 -0.4746 -0.3500 0.6083 2.4225

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.7125 0.6021 1.183 0.236625
gender1 2.9010 0.2204 13.160 < 2e-16 ***
occupation1 0.8143 0.2203 3.696 0.000219 ***
country_of.origin1 -2.6236 0.3501 -7.493 6.71e-14 ***
price -0.1598 0.3846 -0.416 0.677771
distribution 0.4770 0.3714 1.284 0.199036
adv_ratio -0.3347 0.3670 -0.912 0.361731
satisfaction_avg -0.6086 0.3752 -1.622 0.104775
competition 0.2463 0.3724 0.661 0.508386
storevisit_perweek -0.4684 0.3707 -1.264 0.206378
health.conciousness -0.3605 0.3782 -0.953 0.340464
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 1025.81 on 749 degrees of freedom
Residual deviance: 570.67 on 739 degrees of freedom
AIC: 592.67

Number of Fisher Scoring iterations: 5
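
The coefficients above are on the log-odds scale, so exponentiating them gives odds ratios that are easier to read. A minimal sketch for the three significant terms, with the estimates copied by hand from the summary (the vector name coefs is chosen here for illustration):

```r
# Estimates copied from the Model A summary (significant terms only)
coefs <- c(gender1 = 2.9010, occupation1 = 0.8143, country_of.origin1 = -2.6236)

odds_ratios <- exp(coefs)  # Odds ratio for level 1 against level 0 of each factor
round(odds_ratios, 3)
```

An odds ratio above 1 means higher odds of selecting coke for level 1 of the factor, and a ratio below 1 means lower odds. With the fitted model available, exp(coef(log_mod)) would give the same quantities for every term.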

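In the model, gender, occupation and country of origin are the significant independent variables
(p-values below 0.001). Gender and occupation have positive coefficients, so level 1 of each raises
the log-odds of selecting coke, while country of origin has a negative coefficient and lowers them.
None of the remaining variables is significant at the 0.05 level. The model has an AIC of 592.67.
The confusion matrix below gives a prediction accuracy of 84%.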
> table(log_mod ,log_data$coke.selection)

log_mod 0 1
No 352 46
Yes 74 278
> (352+278)/750
[1] 0.84
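
Accuracy alone hides how the errors are split between the two classes. The sketch below rebuilds the Model A confusion matrix from the counts above and adds sensitivity and specificity, assuming that 1 means coke was selected and that a "Yes" prediction corresponds to 1 (the names cm, accuracy, sensitivity and specificity are chosen here for illustration):

```r
# Confusion matrix for Model A: rows are predictions, columns are actual values
cm <- matrix(c(352, 74, 46, 278), nrow = 2,
             dimnames = list(predicted = c("No", "Yes"), actual = c("0", "1")))

accuracy    <- sum(diag(cm)) / sum(cm)            # Share of correct predictions
sensitivity <- cm["Yes", "1"] / sum(cm[, "1"])    # Actual 1s predicted as "Yes"
specificity <- cm["No", "0"] / sum(cm[, "0"])     # Actual 0s predicted as "No"

round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 4)
```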

b. Estimate a logit model (Model b) with dv=coke_selection and IVs=gender, occupation,
country_of_origin, price, distribution, and adv_ratio. Compare the performance of the (Model a)
and (Model b).

Model B:
Call:
glm(formula = coke.selection ~ gender + occupation + country_of.origin +
price + distribution + adv_ratio, family = binomial(link = "logit"),
data = log_data)

Deviance Residuals:
Min 1Q Median 3Q Max
-2.8476 -0.4558 -0.3696 0.6734 2.4219

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.1272 0.4658 0.273 0.784810
gender1 2.8555 0.2169 13.165 < 2e-16 ***
occupation1 0.8065 0.2163 3.728 0.000193 ***
country_of.origin1 -2.6284 0.3490 -7.531 5.05e-14 ***
price -0.1793 0.3803 -0.472 0.637211
distribution 0.5174 0.3707 1.396 0.162750
adv_ratio -0.2997 0.3634 -0.825 0.409519
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 1025.81 on 749 degrees of freedom
Residual deviance: 576.32 on 743 degrees of freedom
AIC: 590.32

Number of Fisher Scoring iterations: 5

As in Model A, gender, occupation and country of origin are the significant independent variables.
The model has an AIC of 590.32. The confusion matrix below gives a prediction accuracy of 84.27%.
> table(log_mod2 ,log_data$coke.selection)

log_mod2 0 1
No 348 40
Yes 78 284
> (348+284)/750
[1] 0.8426667

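The two models differ only slightly. Model B has a marginally lower AIC (590.32 against 592.67) and a
marginally higher accuracy (84.27% against 84%), so it is preferred as the more parsimonious model.
An Anova (likelihood ratio) comparison of the two nested models points the same way: the residual
deviance rises by only 5.65 (576.32 - 570.67) on 4 degrees of freedom, which is below the 0.05
chi-square critical value of 9.49, so the four extra variables in Model A do not improve the fit
significantly.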
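
The comparison can be reproduced by hand from the two summaries. This sketch uses only the printed deviances and degrees of freedom, and it relies on the fact that for 0/1 response data the residual deviance equals -2 times the log-likelihood, so AIC is the deviance plus twice the number of coefficients (variable names are chosen here for illustration):

```r
dev_a <- 570.67; df_a <- 739  # Model A: residual deviance and residual DF
dev_b <- 576.32; df_b <- 743  # Model B: residual deviance and residual DF

lr_stat <- dev_b - dev_a      # Likelihood ratio statistic
lr_df   <- df_b - df_a        # Number of variables dropped in Model B
lr_p    <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

aic_a <- dev_a + 2 * 11       # Model A estimates 11 coefficients
aic_b <- dev_b + 2 * 7        # Model B estimates 7 coefficients

round(c(LR = lr_stat, df = lr_df, p = lr_p, AIC_A = aic_a, AIC_B = aic_b), 3)
```

A p-value above 0.05 here means the smaller Model B is not rejected in favour of Model A.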

c. Please provide the R code.

p <- file.choose() ## To choose the file path

log_data <- read.csv(p, header = TRUE)

log_data$gender <- as.factor(log_data$gender) ## To convert gender into factor variable

log_data$occupation <- as.factor(log_data$occupation)

log_data$country_of.origin <- as.factor(log_data$country_of.origin)

log_data$coke.selection <- as.factor(log_data$coke.selection)

log_mod <- glm(coke.selection ~ ., data = log_data, family = binomial(link = "logit")) ## Model A

log_mod2 <- glm(coke.selection ~ gender + occupation + country_of.origin + price + distribution +
                adv_ratio, data = log_data, family = binomial(link = "logit")) ## Model B

summary(log_mod) ## Results reported for Model A

summary(log_mod2) ## Results reported for Model B

anova(log_mod, log_mod2, test = "Chisq") ## To compare two models; run before the names are reused below

mod.probs <- predict(log_mod, type = "response") ## Predicted probabilities for Model A

log_mod <- rep("No", 750) ## From here on log_mod holds predicted classes, not the fitted model

log_mod[mod.probs > .5] <- "Yes" ## Assuming more than 50% probability means the person will buy coke

table(log_mod, log_data$coke.selection) ## To generate confusion matrix

mod.probs2 <- predict(log_mod2, type = "response") ## To check for classification efficiency

log_mod2 <- rep("No", 750) ## From here on log_mod2 holds predicted classes, not the fitted model

log_mod2[mod.probs2 > .5] <- "Yes" ## Assuming more than 50% probability means the person will buy coke

table(log_mod2, log_data$coke.selection) ## To generate confusion matrix
