Prof. Boyd
5/2/16
GPA Project
I. Before Estimation
A) SAT score (SAT) is a good predictor of one's college GPA. As my
other independent variables I chose High School GPA (HSGPA), average
hours spent studying per week (HRSTD), Double Major (DOUBLE), average
hours spent in extracurricular activities per week (HREXTRA), average
hours of sleep per week (SLEEP), Greek affiliation (GREEK), Excessive
drinker (DRINK), and Intercollegiate Athlete (ATHLETE). I made dummy
variables for whether a student is a double major (1 = double major,
0 = single major), participates in Greek life (1 = in Greek life,
0 = not in Greek life), is an excessive drinker (1 = excessive, 15 or
more drinks in a week; 0 = not excessive, under 15 drinks in a week),
and is an intercollegiate athlete (1 = athlete, 0 = non-athlete).
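The dummy coding described above can be sketched in a few lines of Python. This is only an illustration of the 0/1 encoding, not the software actually used for the project; the function name `encode_dummies` and its argument names are hypothetical, while the 15-drink threshold comes from the definition above.

```python
# Threshold from the text: 15 or more drinks in a week counts as excessive.
EXCESSIVE_DRINKS_PER_WEEK = 15


def encode_dummies(double_major, greek, drinks_per_week, athlete):
    """Return the four 0/1 dummy variables for one student.

    The function and argument names are hypothetical; only the coding
    rules (1 = yes, 0 = no; 15+ drinks = excessive) come from the text.
    """
    return {
        "DOUBLE": int(bool(double_major)),
        "GREEK": int(bool(greek)),
        "DRINK": int(drinks_per_week >= EXCESSIVE_DRINKS_PER_WEEK),
        "ATHLETE": int(bool(athlete)),
    }


# A made-up student: single major, in Greek life, 15 drinks a week, non-athlete.
student = encode_dummies(double_major=False, greek=True,
                         drinks_per_week=15, athlete=False)
print(student)
```

A student with exactly 15 drinks in a week is coded as an excessive drinker, since the definition says "15 or more".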
B) Predicted model:
COLLGPAi = Constant + ln(SATi) + HSGPAi + HRSTDi - DOUBLEi
- HREXTRAi + SLEEPi - GREEKi - EXCESS_DRINKi - ATHLETEi
SAT: The coefficient sign for SAT will be positive. This is because the
SAT is considered a good indication of college preparedness, so a
higher SAT score should lead to a higher predicted College GPA. I took
the natural logarithm of SAT because the relationship between College
GPA and SAT is not linear.
HSGPA: The coefficient for High School GPA will also be positive. A
student's High School GPA indicates their work ethic and skill at
absorbing information. A student with a good High School GPA will tend
to also have a good College GPA, since their study habits and other
positive GPA factors carry over into college. Also, from personal
observation, a classmate's High School GPA and College GPA do not
differ significantly.
HRSTD: The coefficient for average hours spent studying per week will
be positive. The more you study, the better prepared you are and the
better you understand the material. This leads to doing better in the
course and on exams, and so to a higher overall College GPA.
multicollinearity the VIF values and tolerance values are acceptable for
all variables, so there are no issues of multicollinearity. However, our
Double Major dummy variable and our Sleep variable are still
insignificant at the conventional levels. In addition, our Double Major
dummy variable still shows the opposite sign from what we predicted.
However, our Double Major dummy variable is becoming more significant,
approaching the 20% level with a p-value of 0.294. Thus we will keep it
in our model and look at our Sleep variable, which has a p-value of
0.512 and is not even significant at the 50% level. All the other
variables in our model are significant at the 10% level except for
Athlete (p-value of 0.122). Athlete is quite close, so we will treat it
as significant as well, and I predict it will reach the 10% level in our
next model. Therefore, since our Sleep variable is not even significant
at the 50% level, we will remove it in our next model. In addition, when
looking at our Sleep sample I notice that a good number of individuals
get at least 49 hours of sleep per week, which is considered a healthy
amount. This could mean that the Sleep values in our sample are not
varied enough for the variable to have a statistically significant
impact on our model.
B) I am removing the Sleep variable because it is still highly
insignificant (p-value of 0.512). In addition, our Sleep sample is not
varied enough, which makes it a poor variable for our model. We hope to
see our Athlete variable become significant at the 10% level and our
Double Major dummy variable become more statistically significant.
higher work ethic with harder classes thus are overall better students).
However, I will choose to remove this variable for my final model
because it is barely significant at the 25% level and is the only
variable in our model without any significance asterisks.
B) I am choosing to remove the dummy variable Double Major because it
is the only variable without statistical significance at the
conventional levels and has the opposite sign from what we originally
predicted. I expect our final model to have a higher R² than our
previous model; however, I do not know if this model will be better
than the rest, because we were surprised by a higher MSE in our 4th
model compared to all our others. Our number of observations has
increased across models, so this might be a cause. I also expect my
other variables to become more significant and their coefficients to
change slightly. If our Excess Drink and Athlete variables become
significant at the 1% level, I expect our constant to do the same,
which would leave all our variables, including our constant,
significant at all the conventional levels.
Equation 5: CollGPAi = -4.365 + 0.877ln(SATi) + 0.351HSGPAi +
0.008HRSTDi - 0.145EXCESS_DRINKi - 0.113ATHLETEi
A) This is a good model but not our best: it has the same MSE as our
4th model (0.318), which was the worst model we have estimated thus
far. Our adjusted R² has decreased (from 0.417 to 0.416), meaning the
dummy variable Double Major that we removed was not necessarily a bad
variable but was not significantly good for the model either. The
F-ratio is still significant at the 1% level, which means the model is
still an overall good fit for predicting College GPA. In addition, our
constant went from being significant at the 5% level in all our other
models to being significant at the 1% level in our final model. This
shows that removing the dummy variable Double Major caused our constant
to better fit our model. Excess Drink and Athlete did not move from the
5% significance level to the 1% level as I had predicted. Finally,
there are no issues of multicollinearity in our model, further
solidifying our conclusion that this is a good model but not better
than our previous models.
B) At this point I have no other variables whose removal I believe
would improve my model, as all my variables are statistically
significant at the 5% level, with only Excess Drink and Athlete not
significant at the 1% level. However, as explained above, this is not
our best model and will not be chosen as the final model.
V. Post Estimation:
A) The best model that I created to predict CollegeGPA is Equation 3.
CollGPAi = -4.131 + 0.829ln(SATi) + 0.349HSGPAi + 0.008HRSTDi +
0.054DOUBLEi + 0.002SLEEPi - 0.135EXCESS_DRINKi - 0.085ATHLETEi
This is because Equation 3 has the lowest MSE (0.310). This model
predicts College GPA best because all variables are significant at the
5% level, and most are significant at the 1% level, the exceptions
being Excess Drink and Athlete. Therefore we are confident that the
variables predict College GPA well. Moreover, there are no issues of
multicollinearity in any of our models, including our best model,
Equation 3. Lastly, the F-ratio for this model, as for all our other
models, is significant at the 1% level, making it an overall
statistically significant model.
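To show how Equation 3 turns inputs into a predicted College GPA, here is a minimal Python sketch. The coefficients are the ones reported for Equation 3; the negative signs on EXCESS_DRINK and ATHLETE are assumed from the signs predicted in the proposed model. The function name `predict_college_gpa` and the example student are hypothetical and are not taken from the data set.

```python
import math

# Coefficients of Equation 3 as reported in the text.
# Assumption: EXCESS_DRINK and ATHLETE enter with negative signs,
# matching the signs predicted for them in the proposed model.
EQ3 = {
    "const": -4.131,
    "ln_sat": 0.829,
    "hsgpa": 0.349,
    "hrstd": 0.008,
    "double": 0.054,
    "sleep": 0.002,
    "excess_drink": -0.135,
    "athlete": -0.085,
}


def predict_college_gpa(sat, hsgpa, hrstd, double, sleep, excess_drink, athlete):
    """Predicted College GPA from Equation 3 (hypothetical helper)."""
    return (EQ3["const"]
            + EQ3["ln_sat"] * math.log(sat)      # SAT enters in natural logs
            + EQ3["hsgpa"] * hsgpa
            + EQ3["hrstd"] * hrstd
            + EQ3["double"] * double             # 1 = double major
            + EQ3["sleep"] * sleep               # hours of sleep per week
            + EQ3["excess_drink"] * excess_drink # 1 = excessive drinker
            + EQ3["athlete"] * athlete)          # 1 = intercollegiate athlete


# A made-up student, not an observation from the sample.
base = predict_college_gpa(sat=1200, hsgpa=3.5, hrstd=15, double=0,
                           sleep=49, excess_drink=0, athlete=0)
drinker = predict_college_gpa(sat=1200, hsgpa=3.5, hrstd=15, double=0,
                              sleep=49, excess_drink=1, athlete=0)
print(round(base, 3), round(drinker - base, 3))
```

The difference between the two predictions isolates the Excess Drink coefficient, which is how a dummy variable's effect is read off a model like this.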
B) The variable that has the greatest impact on college GPAs is our
ln(SAT) variable. This variable has the highest standardized beta
coefficient, which is 0.829 in our 3rd model.
C) ln(SAT): (lin-log) College GPA increases by 0.008 points for every 1%
increase in SAT score holding high school GPA, average hours studied
per week, Double Major, average hours of sleep per week, Excess
drinking, and intercollegiate varsity athlete constant.
HSGPA: (lin-lin) College GPA increases by 0.349 points for every one
point increase in High school GPA holding ln(SAT), average hours
studied per week, Double Major, average hours of sleep per week,
Excess drinking, and intercollegiate varsity athlete constant.
HRSTD: (lin-lin) College GPA increases by 0.008 points for every one
hour increase in time spent studying per week holding ln(SAT), High
School GPA, Double Major, average hours of sleep per week, Excess
drinking, and intercollegiate varsity athlete constant.
DOUBLE: (lin-lin) Double majors have a 0.054 point increase in College
GPA compared to single majors holding ln(SAT), High School GPA,
average hours spent studying per week, average hours of sleep per
week, Excess drinking, and intercollegiate varsity athlete constant.
SLEEP: (lin-lin) College GPA increases by 0.002 points for every one
hour increase in sleep per week holding ln(SAT), High School GPA,
average hours spent studying per week, Double Major, Excess drinking,
and intercollegiate varsity athlete constant.
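The lin-log interpretation of ln(SAT) rests on a small piece of arithmetic that can be checked directly. The sketch below compares the usual approximation (coefficient divided by 100) with the exact change for a 1% increase in SAT, using the Equation 3 coefficient of 0.829. The variable names are illustrative only.

```python
import math

BETA_LN_SAT = 0.829  # ln(SAT) coefficient from Equation 3

# Rule of thumb for a lin-log term: a 1% rise in X changes Y by beta / 100.
approx_change = BETA_LN_SAT / 100

# Exact change: beta * [ln(1.01 * SAT) - ln(SAT)] = beta * ln(1.01),
# which does not depend on the starting SAT score.
exact_change = BETA_LN_SAT * math.log(1.01)

print(round(approx_change, 3), round(exact_change, 3))
```

Both values round to 0.008, which is the figure used in the ln(SAT) interpretation above.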
The first beta coefficient is for average hours spent studying per week,
while the second beta coefficient is for average hours spent studying
per week squared. The exponent is just the mean of average hours
studied per week rounded to the nearest half hour.
For every additional hour spent studying per week, college GPA
increases by 0.022, holding SAT, ln(HSGPA), Early Decision, Gender,
and Major constant.
C)
H0: β1 = 0
HA: β1 ≠ 0
T-ratio = (β1 - 0)/SE(β1), where β1 is the estimated SAT coefficient
= (0.001 - 0)/0.0002
T-ratio = 5.0
DF = 161
tc0.05 = 1.972 (We use the tc0.05 value for DF = 200 because we round up
from 161)
|5.0| > 1.972
Since |5.0| > 1.972, we can reject our null hypothesis. Thus, at the 5%
significance level, we conclude that a student's SAT score
significantly affects their college GPA.
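The same t-test can be reproduced in a few lines of Python. This is a minimal sketch using only the numbers quoted above (coefficient 0.001, standard error 0.0002, table critical value 1.972); the variable names are illustrative, and the critical value is taken from the t-table rather than computed.

```python
# Values quoted in the hypothesis test above.
beta_sat = 0.001      # estimated SAT coefficient
se_sat = 0.0002       # its standard error
hypothesized = 0.0    # value of the coefficient under H0
t_critical = 1.972    # two-sided 5% table value for DF = 200

# t-ratio = (estimate - hypothesized value) / standard error
t_ratio = (beta_sat - hypothesized) / se_sat

# Reject H0 when the absolute t-ratio exceeds the critical value.
reject_null = abs(t_ratio) > t_critical

print(round(t_ratio, 2), reject_null)
```

Because the table row for DF = 200 is used in place of DF = 161, the critical value is very slightly smaller than the exact one, but a t-ratio of 5.0 clears either threshold comfortably.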
D) The Boyd model is overall worse than our best model, Equation 3.
The Boyd model has a larger MSE (0.314) than our model (0.310), meaning
ours is the better model overall. However, the adjusted R² in the Boyd
model is higher (0.427) than the adjusted R² in our model (0.412). Even
so, I treat the MSE as the best indicator of a model's fit, and since
our model's MSE is lower than the Boyd model's, our model is the better
model. Furthermore, not all of the independent variables in the Boyd
model are statistically significant at the conventional levels. The
same can be said of Equation 3; however, the Boyd model has more
independent variables that are not statistically significant at the
conventional levels (Boyd model = 4 variables not statistically
significant at conventional levels; my model = 3 variables not
statistically significant at conventional levels). Lastly, the Boyd
model encounters problems with multicollinearity between the
SAT*Female variable and the Female variable, as well as
multicollinearity issues between the average hours spent studying per
week and the average hours spent studying per