Statistics - Assignment

CBA Batch 8/ TERM- II/ AY 2017-18
Statistical Analysis 2: Assignment 1

Ravinderpal S Wasu (71710004)
1. (a)
Reasons for disagreeing with Jack's comments:
• Although the R-squared value is low, the other variables can still provide some explanation.
• The coefficients of the other predictor variables could still be statistically significant.
• Other variables could still indicate the trend and provide good predictions.
• The precision of the predictions could be affected, however.
(b) Situations such as the following could show a high R-squared:

• Weather forecasting data – temperature rising
• Speed of vehicles in the city – average speed decreasing due to increased traffic
• Plotting the daily rise in temperature against the decrease in average speed might show a
correlation and a good fit to a linear regression line, yet there need not be any
cause-and-effect relationship between the two.
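The point in (b) can be illustrated with a minimal Python sketch. The two series below are synthetic and purely hypothetical (they are not data from the assignment): each simply trends over time, one upward and one downward, and the squared correlation between them still comes out high although neither causes the other.

```python
import math

# Synthetic, purely illustrative series (assumed, not from the assignment):
# 100 days of a rising temperature and a falling average city speed.
days = range(100)
temperature = [20 + 0.1 * d + 0.5 * math.sin(d) for d in days]
avg_speed = [50 - 0.2 * d + 1.0 * math.cos(1.7 * d) for d in days]


def r_squared(x, y):
    """R-squared of a simple linear regression of y on x (squared Pearson r)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)


r2 = r_squared(temperature, avg_speed)
print(f"R-squared = {r2:.3f}")
```

Both series depend only on the day index, so the high R-squared reflects the shared time trend rather than any effect of temperature on traffic speed.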
(c) Various techniques are:
• Residuals vs Fitted plot
• Plot to check whether the data is normal or not
• Residuals plot – Root values
• Residuals plot – Standardized values

2. (a)
Name     SAT    Age   Tenure   MBA (Yes = 1, No = 0)   GRI
Bob      1042   35    5        0                       1
Putney   1355   32    2        1                       1
Summary from linear model regression

For 95% CI: SAT and GRI are significant
Name RET
Bob -2.642+(0.005735*1042)+(-2.110*1) = 1.2287
Putney -2.642+(0.005735*1355)+(-2.110*1) = 3.0189
For 80% CI: GRI, SAT, AGE and TENURE are significant

Name RET
Bob -2.642+(0.005735*1042)+(-0.0688*35)+(-0.1187*5) +(-2.110*1) = -1.78181
Putney -2.642+(0.005735*1355)+(-0.0688*32)+(-0.1187*2) +(-2.110*1) = 0.5799
(b) Putney is expected to obtain higher returns than Bob, as the predicted RET values for
Putney are higher than those for Bob.
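The predictions in 2 (a) can be re-checked with a short Python sketch (Python is an assumption here; the assignment itself used R). It follows the approach of the answer above, reusing the same printed intercept and rounded coefficients in both equations. Figures quoted above may differ slightly where they came from unrounded regression output.

```python
# Intercept and coefficients as printed in the answer (rounded output).
INTERCEPT = -2.642
COEF = {"SAT": 0.005735, "AGE": -0.0688, "TENURE": -0.1187, "GRI": -2.110}

# Manager attributes from the table in 2 (a).
managers = {
    "Bob": {"SAT": 1042, "AGE": 35, "TENURE": 5, "GRI": 1},
    "Putney": {"SAT": 1355, "AGE": 32, "TENURE": 2, "GRI": 1},
}


def predict_ret(manager, predictors):
    """Predicted RET using only the listed predictors."""
    return INTERCEPT + sum(COEF[p] * manager[p] for p in predictors)


# Equation with the predictors significant at 95% CI.
ret_95 = {name: predict_ret(m, ["SAT", "GRI"]) for name, m in managers.items()}
# Equation with the predictors significant at 80% CI.
ret_80 = {
    name: predict_ret(m, ["SAT", "AGE", "TENURE", "GRI"])
    for name, m in managers.items()
}

for name in managers:
    print(f"{name}: RET(95%) = {ret_95[name]:.4f}, RET(80%) = {ret_80[name]:.4f}")
```

In both equations the predicted RET for Putney is above the one for Bob, which is the comparison that 2 (b) relies on.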
3. (a) As seen above, at 95% CI (5% significance level), SAT and GRI have a significant
effect on RET.
If Bob had attended Princeton, his SAT score would probably have been 1355.
His RET would then be better, at 3.0189 compared with his current RET of 1.2287.
(b) At the 10% significance level, GRI and SAT have a significant effect on RET.
If Bob managed a Growth fund instead of a Growth and Income fund, his return would be:
RET BOB = -2.642+(0.005735*1042)+(-2.110*0) = 3.3387
Hence, it is 3.3387 – 1.2287 (at 95%) = 2.11 higher.
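As a hedged check of 3 (b), the sketch below (Python, assumed for illustration) shows that switching GRI from 1 to 0 changes the predicted RET by exactly the negative of the GRI coefficient, whatever the SAT score. The function name is hypothetical; the coefficients are the ones printed above.

```python
# Intercept and coefficients as printed in the answer (rounded output).
INTERCEPT = -2.642
B_SAT = 0.005735
B_GRI = -2.110


def predict_ret(sat, gri):
    """Predicted RET from SAT and the GRI dummy (1 = Growth and Income)."""
    return INTERCEPT + B_SAT * sat + B_GRI * gri


bob_growth_income = predict_ret(1042, 1)
bob_growth = predict_ret(1042, 0)
gain = bob_growth - bob_growth_income
print(f"Gain from managing a Growth fund: {gain:.3f}")
```

The gain of 2.11 is therefore a property of the GRI coefficient and would be the same for any manager under this equation.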
4. (a)
The coefficient of MBA is 18.1% and has a negative impact.
Compared with the other coefficients this is a high percentage, but on its own it is not very high.
It does indicate that managers with an MBA degree would perform worse than those without
one, with other factors such as Age, Tenure and SAT score held constant.
(b) No, non-MBA managers are not taking comparatively more risks to get higher returns. The
coefficient would not be negative in the table.
5. (a)
As per the regression, the p-value of the Age coefficient = 10.005 %
So the lowest level of significance is 10.006%
(b)
As per the regression, Age has a negative effect. Hence the probability of younger managers
delivering better returns is high.
Surely, survivor bias would influence / dampen the effect seen in 5 (a).
6. (a) As MBA and Tenure are not significant at the 15% significance level, they are eliminated.
The new regression line is y = -2.5839+(-2.111*GRI)+(0.006242*SAT)+(-0.09595*AGE)
Linearity: In the Residuals vs Fitted plot, the red line is almost parallel to the axis, indicating
linearity.
Homoskedasticity: The spread of the residuals also varies very little across the fitted values. Hence our
assumption of constant variance in the residuals is correct.
(b) Regression 1: with all 5 variables
Regression 2: without MBA and Tenure

Observations:
i. Age: negative effect on Returns. Here the p-value is lower, 0.01, compared with the
original 0.106.
ii. It is more impactful with the 5 variables.
7. (a)
As per the R code output, Growth funds have higher returns by 2.312% compared with Growth
and Income funds.
The variation in the residuals is also constant about the fitted line, with no trend.
Therefore the assumption of homoskedasticity is supported.
(b) T-test using Excel
t-Test: Two-Sample Assuming Equal Variances

                               GFund      GIFund
Mean                           0.395924   -1.91593
Variance                       79.86433   57.7226
Observations                   327        213
Pooled Variance                71.13933
Hypothesized Mean Difference   0
df                             538
t Stat                         3.112946
P(T<=t) one-tail               0.000975
t Critical one-tail            1.647691
P(T<=t) two-tail               0.001951
t Critical two-tail            1.964383
For 2 tailed test
H0: mu GRI – mu GR = 0
H1: mu GRI – mu GR != 0
The null hypothesis is that the average returns of Growth funds and Growth and Income
funds are the same.
Here we reject the null hypothesis, as the p-value of the two-tailed test is 0.0019, which is less
than the significance level of 5% (0.05).
For 1 tailed test
H0: mu GR – mu GRI <= 0
H1: mu GR – mu GRI > 0
Here the null hypothesis is that the average return of Growth funds is no greater than the average
return of Growth and Income funds.
Here too, we reject the null hypothesis at the 5% significance level, as the p-value is 0.0009,
implying that Growth funds have better returns than Growth and Income funds.
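The Excel figures can be reproduced from the summary statistics alone. The sketch below is a minimal Python version of the pooled (equal-variance) two-sample t statistic; Python is an assumption here, since the answer used Excel. The critical values are taken from the Excel output, because the standard library has no t distribution.

```python
import math

# Summary statistics from the Excel output.
mean_g, var_g, n_g = 0.395924, 79.86433, 327      # Growth funds
mean_gi, var_gi, n_gi = -1.91593, 57.7226, 213    # Growth and Income funds

# Pooled variance and degrees of freedom under the equal-variance assumption.
df = n_g + n_gi - 2
pooled_var = ((n_g - 1) * var_g + (n_gi - 1) * var_gi) / df

# Standard error of the difference in means, then the t statistic.
std_err = math.sqrt(pooled_var * (1 / n_g + 1 / n_gi))
t_stat = (mean_g - mean_gi) / std_err

# Critical values copied from the Excel output.
T_CRIT_ONE_TAIL = 1.647691
T_CRIT_TWO_TAIL = 1.964383
reject_two_tail = abs(t_stat) > T_CRIT_TWO_TAIL
reject_one_tail = t_stat > T_CRIT_ONE_TAIL

print(f"df = {df}, pooled variance = {pooled_var:.5f}, t = {t_stat:.5f}")
print(f"reject H0 (two-tail): {reject_two_tail}, (one-tail): {reject_one_tail}")
```

Comparing the statistic with the critical value leads to the same decision as comparing the p-value with 0.05.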
8. (a)
SAT score of the Princeton alum = 1355; GRI = 0
Estimated RET: y = -6.461+(0*-2.278)+(0.006*1355) = 1.669
(b) Yes. At 95% CI, with Standard Error = 8.401:

DF = 537
Tcrit = 1.964 (TINV(0.05,537))
LL = 1.669 – (1.964*8.40) = -14.8286
UL = 1.669 + (1.964*8.40) = 18.166
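The limits in 8 (b) are the point estimate plus or minus the critical t value times the standard error. A minimal Python sketch of that arithmetic follows (Python is assumed for illustration), using the rounded inputs from the LL and UL lines above.

```python
# Inputs as used in the answer (rounded).
estimate = 1.669   # estimated RET from 8 (a)
std_error = 8.40   # standard error as used in the LL / UL lines
t_crit = 1.964     # two-tailed 5% critical value for 537 df, from TINV(0.05, 537)

margin = t_crit * std_error
lower = estimate - margin
upper = estimate + margin
print(f"95% interval: [{lower:.4f}, {upper:.4f}]")
```

The interval is wide relative to the estimate because the standard error is several times larger than the estimated return.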
(c)
Below are the calculations for the probability of a return above 1.5%:
t value for 1.5 = (1.5 - 1.673)/0.7232 = -0.239

P(t >= -0.239) = 1 - P(t < -0.239) = 0.5944
There is a 0.5944 probability (59.44%) that the mean return of the funds is greater than the 1.5%
benchmark.
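The calculation in 8 (c) can be sketched in Python as follows. As a stated assumption, the sketch uses the standard normal distribution in place of the t distribution: with 537 degrees of freedom the two are very close, and the Python standard library provides only the normal.

```python
from statistics import NormalDist

benchmark = 1.5       # benchmark return in percent
mean_return = 1.673   # estimated mean return used in the answer
std_error = 0.7232    # standard error used in the answer

# Standardised distance of the benchmark from the estimated mean.
t_value = (benchmark - mean_return) / std_error

# P(T >= t_value), using the normal approximation to the t distribution.
p_above = 1 - NormalDist().cdf(t_value)
print(f"t = {t_value:.3f}, P(return > {benchmark}) = {p_above:.4f}")
```

The parentheses around the numerator matter: the benchmark minus the mean is divided by the standard error, which gives the negative t value.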
9. A larger sample implies more degrees of freedom. The residual error reduces and the precision of
prediction improves. Overall this leads to more efficiency in capturing the variance in the response
variable and in prediction.
10. (a)
Regressing AGE on GRI as the predictor gives a GRI significance of 0.05:
Age = 43.2446+1.732*GRI
With GRI = 0 for Growth funds and 1 for Growth and Income funds, this implies that the
average age of fund managers managing Growth and Income funds is higher by 1.732 years. (All
other variables kept constant.)
(b)
With GRI, TENURE and SAT held constant (GRI=0; TENURE=0; SAT=0) and MBA=1:
Age = 32.186 + (1.424*GRI) + (-1.879*MBA) + (0.9424*TENURE) + (0.0077*SAT)
Average Age of Fund Manager with MBA = 32.186 - 1.879 = 30.307
Observation: managers with an MBA are younger than those without by 1.879 years.
Held constant: GRI, SAT, TENURE
11.
As seen in the above regression, MBA is excluded as it is not significant at 80% CI.
Tenure and Age have a negative impact on Return.
SAT has a positive impact on Return.
CONCLUSION:
Ms. Putney is the right choice and is expected to deliver a higher rate of
Return.
Q2. Nano Project
Identify a small problem related to day-to-day work, in which you want to either understand the
relationship between two variables or predict one of the variables. In either case, formulate
the problem you want to attack. Collect the necessary dataset to answer the question.
Apply the tools and techniques discussed in class (Regression Analysis). You have to discuss the
results in both a statistical and a business framework.
Please submit the problem description, data description, and R file used to analyse the data, along with
results and discussion. You may write the problem description, results, and discussion on paper
and submit a scanned copy of it. But you have to submit the data description and data file (in Excel,
csv or txt format) along with running R code.
Solution:
Driving from Hyderabad to Pune and from Pune to Hyderabad.
Description: The idea was to observe whether distance travelled depends on:
• Road conditions
• Traffic conditions
• Driver gender
• Stops taken on the route
Data captured and legend:
1) Driver – Male = 1, Female = 0
2) Road condition – Good = 1, Bad = 0
3) Traffic condition – Heavy = 1, Light = 0
4) Stops taken – Stopped = 1, No stop = 0
While driving to and fro, I collected data for the following metrics:
1) KM reading at every time interval (generally 15 mins)
2) Max speed touched in that 15-min stretch
3) Average speed and distance covered in that 15-min stretch (calculated)
Data Excel file: PuneTrip.xlsx
Linear regression performed with R code: SA2_Assignment1_71710004_NanoProject.R
Response Variable = DistanceTravelled
Predictor Variables: Driver + TrafficCond + RoadCond + Stop + X.Avg.Speed

Observations
Trip1 distance = 0.86020+(-2.22841*Stop)+(0.2205*X.Avg.Speed)

At the 10% significance level, only Stop and average speed are statistically significant.
The assumptions of linearity and homoscedasticity are satisfied.
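To show how the fitted Trip1 equation would be used, here is a minimal Python sketch (Python is assumed; the project itself used R). The function name and the example speed of 60 are hypothetical; the coefficients are the Trip1 values reported above.

```python
# Trip1 coefficients as reported above.
INTERCEPT = 0.86020
B_STOP = -2.22841
B_AVG_SPEED = 0.2205


def predict_distance(stop, avg_speed):
    """Predicted distance for one 15-minute stretch.

    stop: 1 if a stop was taken in the stretch, 0 otherwise.
    avg_speed: average speed over the stretch (units as in the data file).
    """
    return INTERCEPT + B_STOP * stop + B_AVG_SPEED * avg_speed


# Hypothetical stretch at an average speed of 60, with and without a stop.
no_stop = predict_distance(0, 60)
with_stop = predict_distance(1, 60)
print(f"no stop: {no_stop:.4f}, with stop: {with_stop:.4f}")
```

At a given average speed, taking a stop lowers the predicted distance by the size of the Stop coefficient, about 2.23.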

The data is normal.
The regression is re-run after dropping the coefficients that are not significant.
Conclusion:
The plots for distance travelled are linear.
Average speed and stops are significant.
The variance of the residuals is constant, so homoscedasticity is satisfied.

Trip2 distance = 1.859+(-3.493*Stop)+(0.22635*X.Avg.Speed)
