
Carter Moebius

Prof. Boyd
5/2/16
GPA Project
I. Before Estimation
A) In addition to SAT score (SAT), which is a good predictor of one's
college GPA, I chose High School GPA (HSGPA), Average hours
spent studying per week (HRSTD), Double Major (DOUBLE), Average
hours spent in extracurricular activities per week (HREXTRA), Average
hours of sleep per week (SLEEP), Greek affiliation (GREEK), Excessive
drinker (EXCESS_DRINK), and Intercollegiate Athlete (ATHLETE) as the
other independent variables for my model. I made dummy variables for
whether a student is a double major (1 = double major, 0 = single major),
participates in Greek life (1 = in Greek life, 0 = not in Greek life),
is an excessive drinker (1 = excessive (15 or more drinks in a week),
0 = not excessive (under 15 drinks in a week)), and is an
intercollegiate athlete (1 = athlete, 0 = non-athlete).
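As a minimal sketch of how the four dummy variables described above could be coded from raw survey answers: the field names and the two sample rows below are hypothetical and are not taken from the project's data set.

```python
# Illustrative only: field names and sample rows are hypothetical.
EXCESS_DRINK_THRESHOLD = 15  # drinks per week, the cutoff defined in the text


def excess_drink_dummy(drinks_per_week):
    """Return 1 for 15 or more drinks a week, otherwise 0."""
    return 1 if drinks_per_week >= EXCESS_DRINK_THRESHOLD else 0


def yes_no_dummy(answer):
    """Return 1 for a 'yes' answer and 0 for anything else."""
    return 1 if answer.strip().lower() == "yes" else 0


def code_student(row):
    """Turn one raw survey row into the four 0/1 dummy variables."""
    return {
        "DOUBLE": yes_no_dummy(row["double_major"]),
        "GREEK": yes_no_dummy(row["greek"]),
        "EXCESS_DRINK": excess_drink_dummy(row["drinks_per_week"]),
        "ATHLETE": yes_no_dummy(row["athlete"]),
    }


sample_rows = [
    {"double_major": "no", "greek": "yes", "drinks_per_week": 3, "athlete": "no"},
    {"double_major": "yes", "greek": "no", "drinks_per_week": 15, "athlete": "yes"},
]
coded = [code_student(row) for row in sample_rows]
for student in coded:
    print(student)
```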
B) Predicted model:
COLLGPAi = Constant + ln(SATi) + HSGPAi + HRSTDi - DOUBLEi
- HREXTRAi + SLEEPi - GREEKi - EXCESS_DRINKi - ATHLETEi
SAT: The coefficient for SAT will be positive. This is because the
SAT is considered a good indicator of college preparedness, so a higher
SAT score should lead to a higher predicted College GPA. I took the
natural logarithm of SAT because the relationship between College GPA
and SAT is not linear.
HSGPA: The coefficient for High School GPA will also be positive. A
student's High School GPA indicates their work ethic and skill at
absorbing information. A student with a good High School GPA will tend
to also have a good College GPA since their study habits and other
positive GPA factors carry over into college. Also, from personal
observation, my classmates' High School GPAs and College GPAs do not
differ significantly.
HRSTD: The coefficient for average hours spent studying per week will
be positive. The more a student studies, the better prepared they are
and the better they understand the material. This leads to doing better
in courses and on exams, and thus to a higher overall College GPA.

DOUBLE: The coefficient for whether or not a student is a double major
will be negative. This is because a double major entails more high-level
courses compared to a single major, who only completes the degree
requirements for one major. Students who double major take more
difficult courses throughout their college careers than single-major
students. In addition, since most of our sample consists of juniors or
seniors, these high-level courses for double majors will come into effect
in a student's final two years at Denison.
HREXTRA: The coefficient for average hours spent on extracurricular
activities per week will be negative. If a student spends more time on
extracurricular activities, they will have less time to study and to
spend on their homework, leading to a lower overall College GPA.
SLEEP: The coefficient for average hours of sleep per week will be
positive. A student who gets more sleep will be better rested. This
should lead to stronger academic performance because they are not
hindered by a lack of sleep.
GREEK: The coefficient for Greek affiliation will be negative. This is
because a student involved in Greek life will have less time to work on
schoolwork. In addition, Greek life stereotypically involves more
partying, which will decrease a student's College GPA overall. However,
I do not expect it to be a large negative coefficient because Greek life
tends to create a community of scholars that gives students a
support group in their academic pursuits (e.g., group studying).
EXCESS_DRINK: The coefficient for excessive drinking will be negative. I
created a dummy variable for drinking because, up to a certain
amount, drinking does not dramatically affect a student's GPA.
Therefore, I set 15 or more drinks a week as the threshold at which
drinking starts to negatively affect GPA. Those who consume fewer than
15 drinks a week should not see a detrimental effect on their
GPAs.
ATHLETE: The coefficient for intercollegiate varsity athletes will be
negative. Students who are involved in intercollegiate sports have less
time to spend on studying and on stress-relieving activities that
would normally boost GPA. However, athletes tend to have more discipline
and stronger time-management skills than non-athletes, which could
increase College GPA. Therefore, I believe that our Athlete dummy
variable will have a negative coefficient, but a small one.
II. Estimation (Table)

Equation 1: CollGPAi = -3.988 + 0.809ln(SATi) + 0.353HSGPAi +
0.008HRSTDi + 0.049DOUBLEi + 0.00003HREXTRAi + 0.002SLEEPi
- 0.029GREEKi - 0.119EXCESS_DRINKi - 0.084ATHLETEi
III/IV. Discussion of Results/Modifications on the Model:
A) This model is acceptable because my adjusted R2 (0.405) is above 0.3,
which makes it a good fit for cross-sectional data. There is also a large
number of observations (165), which makes the sample big enough for
the results to be meaningful. The MSE is relatively low (0.312),
which is what we are looking for in our model, and the F-ratio is
significant at the 1% level, which means our overall model is
statistically significant. However, there were some issues with the
coefficients. Our Double Major and Hours spent on extracurricular
activities variables had the opposite signs from what was originally
predicted. For Double Major, this could be because double majors have a
better work ethic than other students: their heavier course load forces
them to manage their time around academic assignments, so they are
better students in general and receive higher GPAs. For extracurricular
activities, the coefficient could be positive because extracurricular
activities allow a student to de-stress from their workload and then
return to their work with a clear head. Our variables Hours spent
studying per week and average hours of sleep per week have the
coefficient signs we expected but yield very small coefficients, showing
that these variables do not have a strong effect in our overall model.
In addition, we have 5 variables that are not statistically significant
because they fail at the conventional levels. These include Double Major,
Hours spent on extracurricular activities per week, average hours spent
sleeping per week, and Greek affiliation. All of the tolerance values are
above 0.2 and the VIF values are below 5, so our model shows no signs of
multicollinearity. I will therefore remove Hours spent on
Extracurricular activities first because it is the least statistically
significant variable (it fails even at the 95% level) and has the
opposite sign from what was predicted. After removing this variable, I
expect to continue removing the variables Double Major, Sleep, and Greek
one at a time if they do not show signs of becoming more statistically
significant or if they end up showing signs of multicollinearity.
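The tolerance and VIF rule of thumb used above can be sketched as follows. Tolerance is 1 minus the R2 from regressing one independent variable on all the others, and VIF is its reciprocal. The auxiliary R2 values below are hypothetical placeholders, not numbers from my regression output.

```python
def tolerance(aux_r2):
    """Tolerance = 1 - R2 of the auxiliary regression for one variable."""
    return 1.0 - aux_r2


def vif(aux_r2):
    """Variance inflation factor = 1 / tolerance."""
    return 1.0 / tolerance(aux_r2)


def passes_rule_of_thumb(aux_r2, min_tolerance=0.2, max_vif=5.0):
    """True when tolerance is above 0.2 and VIF is below 5."""
    return tolerance(aux_r2) > min_tolerance and vif(aux_r2) < max_vif


# Hypothetical auxiliary R2 values, for illustration only.
aux_r2_by_variable = {"ln(SAT)": 0.30, "HSGPA": 0.25, "GREEK": 0.10}

for name, aux_r2 in aux_r2_by_variable.items():
    print(name, round(tolerance(aux_r2), 3), round(vif(aux_r2), 3),
          passes_rule_of_thumb(aux_r2))
```

Because VIF is just 1 divided by tolerance, the two cutoffs (0.2 and 5) describe the same boundary.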
B) I am removing the variable HrExtra, which corresponds to average
hours spent on extracurricular activities in a week, because it is the
least statistically significant variable in my model; it is not
significant even at the 95% level. It also has the opposite sign from
what I predicted.

Equation 2: CollGPAi = -4.044 + 0.819ln(SATi) + 0.350HSGPAi +
0.008HRSTDi + 0.053DOUBLEi + 0.002SLEEPi - 0.031GREEKi
- 0.121EXCESS_DRINKi - 0.086ATHLETEi
A) Our new model is slightly better than our old one. The value of
adjusted R2 increased (0.405 to 0.410), which suggests that hours spent
on extracurricular activities was indeed a bad variable. Furthermore,
the MSE fell very slightly (0.312 to 0.311), indicating that this model is
better than the first. The F-ratio is significant at the 1% level; therefore
the overall model is highly significant. The problem with our updated
model is that there are still variables that are not significant at the
conventional levels. Although our excessive drinker dummy variable
improved from significance at the 0.10 level to significance at the 0.05
level, making it a more reliable variable, our Double Major variable,
average hours slept in a week variable, Greek affiliation variable, and
Athlete variable were still not significant at the 0.10, 0.05, or 0.01
levels. These variables are not only insignificant at the conventional
levels but also fail at the 25% level, showing that they are still bad
variables in this second model. In addition, our Double Major variable has
the opposite sign from what we predicted, which further marks it as a bad
variable for this model. When checking for multicollinearity, our VIF
values and tolerance values are acceptable for all our variables. Since our
Greek dummy variable has the highest p-value (0.557), compared with
Sleep (0.533) and Double Major (0.301), we will remove this variable
from our next model. If our Sleep variable and Double Major variable
do not improve in significance, or their signs do not match my
predictions, then I will remove them as well in the fourth model.
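The removal rule I follow from one model to the next (drop the least significant variable first) can be sketched like this. The three p-values are the ones quoted above for the second model; the function name and the 0.10 cutoff argument are my own illustrative choices.

```python
# p-values quoted in the text for the second model (other variables omitted).
p_values_model2 = {"DOUBLE": 0.301, "SLEEP": 0.533, "GREEK": 0.557}


def least_significant(p_values, alpha=0.10):
    """Return the variable with the highest p-value above alpha, or None."""
    candidates = {name: p for name, p in p_values.items() if p > alpha}
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


variable_to_drop = least_significant(p_values_model2)
print(variable_to_drop)
```

This only automates the choice of which variable to drop; whether to drop it still depends on the sign and on the theory behind the variable.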
B) I am removing the Greek affiliation dummy variable, which
corresponds to whether or not the student is part of Greek life, because
it is insignificant even at the 50% level (p-value of 0.557). This makes
it the least statistically significant of our independent variables, and
by removing it we should see an improvement in our next model.
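To show why adjusted R2 can rise when a weak variable is removed, here is a small sketch of the standard adjusted R2 formula, 1 - (1 - R2)(n - 1)/(n - k - 1). The sample size of 165 comes from my first model; the two unadjusted R2 values are hypothetical, chosen only to illustrate the penalty for an extra regressor.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R2 for n observations and k independent variables."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


n_obs = 165  # observations reported for the first model

# Hypothetical unadjusted R2 values: dropping a weak variable lowers R2 a
# little, but the smaller penalty for k can still raise adjusted R2.
with_weak_variable = adjusted_r2(0.4380, n_obs, k=9)
without_weak_variable = adjusted_r2(0.4379, n_obs, k=8)

print(round(with_weak_variable, 4), round(without_weak_variable, 4))
```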
Equation 3: CollGPAi = -4.131 + 0.829ln(SATi) + 0.349HSGPAi +
0.008HRSTDi + 0.054DOUBLEi + 0.002SLEEPi - 0.135EXCESS_DRINKi
- 0.085ATHLETEi
A) Our third model is again slightly better than our second, though not
by much, and it is our best model to date. The value of adjusted
R2 increased (0.410 to 0.412), suggesting that Greek affiliation was a bad
variable in our model. The MSE decreased (0.311 to 0.310), which
shows that our overall model improved. The F-ratio is
still significant at the 1% level, further supporting the statistical
significance of our new model. In addition, when checking for
multicollinearity, the VIF values and tolerance values are acceptable for
all variables, so there are no issues of multicollinearity. However, our
Double Major dummy variable and our Sleep variable are still
insignificant at the conventional levels, and our Double Major
dummy variable still shows the opposite sign from what we predicted.
That said, Double Major is becoming more significant, with its p-value
falling from 0.301 to 0.294. Thus we will keep it in our model and look at
our Sleep variable, which has a p-value of 0.512 and is not even
significant at the 50% level. All the other variables in our model are
significant at the 10% level except for Athlete (p-value of 0.122).
Athlete is quite close, so we will treat it as significant as well, and I
predict it will reach 10% significance in our next model. Since
our Sleep variable is not even significant at the 50% level, we will
remove it in our next model. In addition, when looking at our Sleep
data I notice that a good number of individuals get at least 49
hours of sleep a week, which is considered a healthy amount. This could
mean that our sample is not varied enough for Sleep to have a
statistically significant impact in our model.
B) I am removing the Sleep variable because it is still highly
insignificant (p-value of 0.512). In addition, I find that our Sleep data
are not varied enough, which makes Sleep a poor variable to include in
our model. We hope to see our Athlete variable become significant at the
10% level and our Double Major dummy variable become more
statistically significant.

Equation 4: CollGPAi = -4.172 + 0.848ln(SATi) + 0.352HSGPAi +
0.008HRSTDi + 0.060DOUBLEi - 0.143EXCESS_DRINKi - 0.111ATHLETEi
A) For this model our adjusted R2 increased by more than in the other
model changes (0.412 to 0.417), which supports our prediction that
average hours slept a week was a bad variable. However, our MSE
increased (0.310 to 0.318), meaning that our overall model is worse
than all of our previous models, including model 1 (model1_MSE
= 0.312 and model4_MSE = 0.318). The F-ratio for this equation is still
significant at the 1% level and, in addition, all the variables except for
our Double Major dummy variable, which is barely significant at the 25%
level, are significant at the 5% level or better. This leaves Excess Drink
and Athlete as the only other variables not significant at the 1% level.
Again there are no issues of multicollinearity. Our sign for Double Major
is still not what we expected it to be (it is positive rather than
negative); however, the justification for this unexpected sign was
explained in our first model's discussion and makes sense (double majors
have a higher work ethic with harder classes and thus are better students
overall). Nevertheless, I will remove this variable for my final model
because it is barely significant at the 25% level and is the only variable
in our model without any asterisks.
B) I am choosing to remove the dummy variable Double Major because
it is the only variable without statistical significance at the
conventional levels and has the opposite sign from what we originally
predicted. I expect our final model to have a higher adjusted R2 than our
previous model; however, I do not know if this model will be better than
the rest because we were surprised by a higher MSE in our 4th model
compared to all our others. Our number of observations has increased
across models, so this might be a cause of the higher MSE. I also expect
my other variables to become more significant and their coefficients to
change slightly. If our Excess Drink and Athlete variables become
significant at the 1% level, I expect our constant to do the same, which
will result in all our variables, including our constant, being
significant at all the conventional levels.
Equation 5: CollGPAi = -4.365 + 0.877ln(SATi) + 0.351HSGPAi +
0.008HRSTDi - 0.145EXCESS_DRINKi - 0.113ATHLETEi
A) This model is a good model but is not our best; it has the same
MSE as our 4th model (0.318), which was the worst model we had
estimated thus far. Our adjusted R2 decreased (0.417 to
0.416), meaning the dummy variable Double Major that we removed
was not necessarily a bad variable but did not add much to
the model either. The F-ratio is still significant at the 1% level, which
means that the model is still a good overall fit for predicting College
GPA. In addition, our constant went from being significant at the 5%
level in all our other models to being significant at the 1% level in our
final model. This shows that the removal of the dummy variable Double
Major improved the significance of our constant. Excess Drink and Athlete
did not improve from the 5% significance level to the 1% level as I had
predicted. Finally, there are no issues of multicollinearity in our model,
further supporting our conclusion that this is a good model but not
better than our previous models.
B) At this point I have no other variables to remove that I believe would
improve my model as all my variables are statistically significant at the
5% level with only Excess Drink and Athlete not being significant at the
1% level. However, as explained above, this is not our best model and
will not be chosen as the final.
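The fit statistics reported for the five equations can be collected in one place to compare them. This is only a convenience sketch: the adjusted R2 and MSE figures are the ones stated in the discussion above, and the dictionary layout is my own.

```python
# Adjusted R2 and MSE as reported in the discussion of each equation.
models = {
    1: {"adj_r2": 0.405, "mse": 0.312},
    2: {"adj_r2": 0.410, "mse": 0.311},
    3: {"adj_r2": 0.412, "mse": 0.310},
    4: {"adj_r2": 0.417, "mse": 0.318},
    5: {"adj_r2": 0.416, "mse": 0.318},
}

best_by_mse = min(models, key=lambda m: models[m]["mse"])
best_by_adj_r2 = max(models, key=lambda m: models[m]["adj_r2"])

print("Lowest MSE: Equation", best_by_mse)
print("Highest adjusted R2: Equation", best_by_adj_r2)
```

The two criteria point to different equations here, which is why the choice of the final model depends on which statistic is given priority.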

V. Post Estimation:
A) The best model that I created to predict College GPA is Equation 3:
CollGPAi = -4.131 + 0.829ln(SATi) + 0.349HSGPAi + 0.008HRSTDi +
0.054DOUBLEi + 0.002SLEEPi - 0.135EXCESS_DRINKi - 0.085ATHLETEi
This is because Equation 3 has the lowest MSE of all our models (0.310).
Most of its variables are significant at the conventional levels; as
discussed above, the exceptions are Double Major (p-value of 0.294),
Sleep (p-value of 0.512), and Athlete (p-value of 0.122), with Athlete
close to the 10% level. Therefore we are reasonably confident that the
variables predict College GPA well. Moreover, there are no issues of
multicollinearity in any of our models, including our best model,
Equation 3. Lastly, the F-ratio for this model, as for all our other
models, is significant at the 1% level, making it a statistically
significant model overall.
B) The variable that has the greatest impact on college GPA is our
ln(SAT) variable. This variable has the highest standardized beta
coefficient; its estimated coefficient in our 3rd model is 0.829.
C) ln(SAT): (lin-log) College GPA increases by 0.008 points for every 1%
increase in SAT score holding high school GPA, average hours studied
per week, Double Major, average hours of sleep per week, Excess
drinking, and intercollegiate varsity athlete constant.
HSGPA: (lin-lin) College GPA increases by 0.349 points for every one
point increase in High school GPA holding ln(SAT), average hours
studied per week, Double Major, average hours of sleep per week,
Excess drinking, and intercollegiate varsity athlete constant.
HRSTD: (lin-lin) College GPA increases by 0.008 points for every one
hour increase in time spent studying per week holding ln(SAT), High
School GPA, Double Major, average hours of sleep per week, Excess
drinking, and intercollegiate varsity athlete constant.
DOUBLE: (lin-lin) Double majors have a 0.054 point increase in College
GPA compared to single majors holding ln(SAT), High School GPA,
average hours spent studying per week, average hours of sleep per
week, Excess drinking, and intercollegiate varsity athlete constant.
SLEEP: (lin-lin) College GPA increases by 0.002 points for every one
hour increase in sleep per week holding ln(SAT), High School GPA,
average hours spent studying per week, Double Major, Excess drinking,
and intercollegiate varsity athlete constant.

EXCESS_DRINK: (lin-lin) Students who consume 15 or more drinks a
week have a 0.135 point decrease in College GPA compared to
students who consume less than 15 drinks a week holding ln(SAT),
High School GPA, average hours spent studying per week, Double
Major, average hours of sleep per week, and intercollegiate varsity
athlete constant.
ATHLETE: (lin-lin) Students who are involved in an intercollegiate
varsity sport have a 0.085 point decrease in College GPA compared to
students who are not athletes holding ln(SAT), High School GPA,
average hours spent studying per week, Double Major, average hours
of sleep per week, and excess drinking constant.
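The lin-log interpretation of ln(SAT) in the list above divides the coefficient by 100. A quick sketch of that rule next to the exact change implied by the logarithm follows; the function name is my own.

```python
import math


def lin_log_effect(beta, pct_change=1.0):
    """Change in the dependent variable for a pct_change % rise in X.

    Returns (approximation beta * pct/100, exact beta * ln(1 + pct/100)).
    """
    approx = beta * pct_change / 100.0
    exact = beta * math.log(1.0 + pct_change / 100.0)
    return approx, exact


approx, exact = lin_log_effect(0.829)  # ln(SAT) coefficient from Equation 3
print(round(approx, 3), round(exact, 3))
```

Both versions round to the 0.008 points used in the interpretation of ln(SAT).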
D) CollGPAi = -4.131 + 0.829(7.244) + 0.349(3.4) + 0.008(7) +
0.054(0) + 0.002(49) - 0.135(1) - 0.085(0)
CollGPAi = -4.131 + 6.005 + 1.187 + 0.056 + 0.098 - 0.135
CollGPAi = 3.08
My model predicts my college GPA to be 3.08, which is quite close to
my actual college GPA.
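The same prediction can be reproduced in a few lines. The coefficients are those of Equation 3 and the inputs are the values plugged in above (7.244 is the natural log of the SAT score); the variable names are only labels I chose for the sketch.

```python
# Equation 3 coefficients (negative signs for the two GPA-lowering dummies).
intercept = -4.131
coefficients = {
    "ln_SAT": 0.829,
    "HSGPA": 0.349,
    "HRSTD": 0.008,
    "DOUBLE": 0.054,
    "SLEEP": 0.002,
    "EXCESS_DRINK": -0.135,
    "ATHLETE": -0.085,
}

# Values used in the worked prediction above.
my_values = {
    "ln_SAT": 7.244,
    "HSGPA": 3.4,
    "HRSTD": 7,
    "DOUBLE": 0,
    "SLEEP": 49,
    "EXCESS_DRINK": 1,
    "ATHLETE": 0,
}

predicted_gpa = intercept + sum(
    coefficients[name] * my_values[name] for name in coefficients
)
print(round(predicted_gpa, 2))
```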
VI. Boyd Model:
A) CollGPAi = 0.141 + 0.001SATi + 1.179ln(HSGPAi) + 0.022HRSTDi
- 0.0002(HRSTDi^2) - 0.086DECISIONi + 0.682FEMALEi
- 0.0004(SATi*FEMALEi) - 0.003ECONi
B) SAT: (lin-lin) College GPA increases by 0.001 points for every one
point increase in SAT score holding ln(HSGPA), average hours spent
studying per week, average hours spent studying per week squared,
Early Decision, Gender, and major constant.
High School GPA: (lin-log) College GPA increases by 0.012 points for
every 1% increase in High School GPA holding SAT, average hours
spent studying per week, average hours spent studying per week
squared, Early Decision, Gender, and major constant.
Hours Studied:
Slope = bHRSTD + 2(bHRSTD^2)(HRSTDmean)
Avg. of HRSTD = 23.131, or about 23 hours
0.022 + 2(-0.0002)(23.131) = 0.013

The first beta coefficient is for average hours spent studying a week,
while the second beta coefficient is for average hours spent studying a
week squared. The value multiplied by the second coefficient is the mean
of average hours studied per week (23.131).
For every additional hour spent studying a week, college GPA
increases by 0.013 at the mean of hours studied, holding SAT,
ln(HSGPA), Early Decision, Gender, and Major constant.
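The slope formula for a variable that enters both linearly and squared can be checked with a short sketch. It uses the rounded coefficients printed in the Boyd equation (0.022 and -0.0002) and the mean of 23.131 hours; with those rounded inputs the expression evaluates to about 0.013. The function name is my own.

```python
def quadratic_slope(b_linear, b_squared, x):
    """Marginal effect of X when the model contains X and X squared."""
    return b_linear + 2.0 * b_squared * x


b_hrstd = 0.022       # coefficient on HRSTD (rounded, as printed)
b_hrstd_sq = -0.0002  # coefficient on HRSTD squared (rounded, as printed)
mean_hrstd = 23.131   # sample mean of hours studied per week

slope_at_mean = quadratic_slope(b_hrstd, b_hrstd_sq, mean_hrstd)

# Hours at which the estimated effect of studying reaches zero.
turning_point = -b_hrstd / (2.0 * b_hrstd_sq)

print(round(slope_at_mean, 4), round(turning_point, 1))
```

Because the squared term is negative, each extra hour of studying adds a little less to predicted GPA than the hour before it.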
C)
H0: β1 = 0
HA: β1 != 0
T-ratio = (bSAT - 0)/SE(bSAT)
= (0.001 - 0)/0.0002
T-ratio = 5.0

DF = 161

tc0.05 = 1.972 (We use the tc0.05 value for DF = 200 because we round up
from 161)
|5.0| > 1.972
Since 5.0 > 1.972, we can reject our null hypothesis. Thus we are 95%
confident that a student's SAT score significantly affects their college
GPA.
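The test above comes down to one division and one comparison, sketched here. The estimate, standard error, and critical value are the ones given in the test; the critical value is taken from the t-table rather than computed, and the function name is my own.

```python
def t_ratio(estimate, std_error, null_value=0.0):
    """t-ratio for testing H0: coefficient = null_value."""
    return (estimate - null_value) / std_error


T_CRITICAL = 1.972  # two-tailed 5% critical value from the table (DF = 200)

t_sat = t_ratio(0.001, 0.0002)  # SAT coefficient and its standard error
reject_null = abs(t_sat) > T_CRITICAL

print(round(t_sat, 1), reject_null)
```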
D) The Boyd model is overall worse than our best model, Equation 3.
The Boyd model has a larger MSE (0.314) than our model
(0.310), meaning that ours is the better model overall.
The adjusted R2 in the Boyd model is higher (0.427)
than the adjusted R2 in our model (0.412). Even so, the
MSE is the best indicator of how well a model fits and, since
our model's MSE is lower than the Boyd model's, our model is the better
model. Furthermore, not all of the independent variables in the Boyd
model are statistically significant at the conventional levels. This can
be said about Equation 3 as well; however, the Boyd model has more
independent variables that are not statistically significant at the
conventional levels (Boyd Model = 4 variables not statistically
significant at conventional levels; My Model = 3 variables not
statistically significant at conventional levels). Lastly, the Boyd model
encounters problems with multicollinearity between the SAT*Female
variable and the Female variable, as well as between the average hours
spent studying per week and the average hours spent studying per
week squared. In contrast, our best model encounters no problems
with multicollinearity, as can be seen in my regression output.
