5B, AP Stats
Abstract
This project set out to determine whether hours spent doing homework correlate with hours spent
doing extracurricular activities among students of varying grade levels at LASA, and whether that
correlation differs across levels of stress. The analysis found little to no correlation between
hours spent doing homework and hours spent doing activities, and stress levels did not change the
correlation in any visible way.
Introduction
I chose to compare the average daily time spent working on homework with the average daily time
spent doing extracurricular activities, using stress as a categorical variable, so that I could
see how the time spent on homework and on other activities affected the amount of stress someone
experienced. My original hypothesis was that more hours of homework and fewer hours of other
activities per day would correlate with high levels of stress. Homework can be tedious,
challenging or boring, so more time spent on it could lead to more stress and would leave less
time for other activities that could be fun, interesting and less stressful. I thought this would
be interesting to teachers or even administration, as it could help them decide whether to reduce
the amount of homework assigned depending on how much stress that work causes their students. If
teachers reduced the amount of homework assigned, students could spend more time doing other
activities that could lower their stress levels.
Data Collection
I collected data for this project by first creating an anonymous Google Form with 3 questions on
it: “How many hours on average do you spend working on homework daily? (1-10)”, “How many
hours on average do you spend doing extracurricular activities daily? (0-10)”, and “On a scale of
teachers in the LASA Math Department and I chose five teachers. I then emailed all five of them
the link to the Google Form and requested that each of them assign their classes a number from
1-6 and roll a die to see which class would take the survey. This was meant to prevent the bias of
choosing the most studious class, since the academic standing of a class could affect both the
amount of time spent doing homework and the amount of stress experienced, thereby introducing a
potential lurking variable. However, one of the teachers offered to provide the survey to all of
his/her classes, which also prevents that bias, as giving the survey to every class ensures that
not only the most studious students take it. In the end, I threw out 4 survey responses because
they were unreasonable and would have distorted my data.
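The die-roll selection described above can be sketched in R. This is only an illustration of the procedure, not the rolls the teachers actually made: the teacher labels and the seed are assumptions introduced here.

```r
# Hypothetical labels for the five teachers; the report does not name them.
teachers <- paste("Teacher", 1:5)

set.seed(1)  # arbitrary seed, only so the sketch is repeatable

# One fair die roll per teacher: each roll picks one of the class numbers 1-6.
rolls <- sample(1:6, size = length(teachers), replace = TRUE)

selection <- data.frame(teacher = teachers, class_number = rolls)
print(selection)
```

Because every class number is equally likely, no class is favored for being more or less studious.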
Analysis
The quantitative variable homework hours is fairly symmetrical with a single peak and a slight
skew to the right. The variable has no clear outliers but has some noise in the form of gaps
between 3 and 4 hours and between 5 and 6 hours. The center of this data set is at
around 3 hours with a spread of 5 hours (1 to 6 hours). Fractional hours in between whole numbers
(such as 2.5) have far fewer counts than the whole number counts themselves
to the right and has a few outliers with values of 6 and 7 hours. There is noise in the form of
several gaps between 2 and 3, 3 and 4, 5 and 6, and 6 and 7 hours. The center of this
variable is around 2 hours with a spread of 7 hours. The count of fractional hours in between
whole numbers is also far lower than the count of whole numbers, such as in between 0 and 1, 1
The value of the intercept, 2.2228, means that if the hours spent doing homework is 0, then the
model predicts that the hours spent doing activities will be 2.2228 hours. The value of the slope,
-0.1063, means that for every extra hour spent doing homework, the model predicts that the
There are a few outliers at the data points (5, 5), (2.5, 6) and (1, 7). Removing these outliers
would most likely change the correlation value from -0.112 to something larger in magnitude,
thereby showing a stronger relation between homework hours and activity hours, and the linear
model would become less steep, as the outliers would be pulling the model upwards. After
removing the outliers and recalculating, the correlation value became -0.118, which is barely
larger in magnitude than the correlation value with the outliers. This suggests that the outliers
had minimal effect on the overall correlation and barely made the linear model less steep, as seen
in the graph below, where the model has equation y=2.04803 - 0.09277x whose slope of -0.09277
The value of r (or correlation) for Activity Hours vs Homework Hours is -0.112, which means that
there is a very weak negative correlation between hours spent doing activities and hours spent
doing homework. The value of r^2 is about 0.0125, which means that only around 1.25% of the
variation in activity hours can be explained by the
homework hours, it suggests that the linear model was the best model as it fit the data points well
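As a rough check on the interpretation above, the reported values can be plugged in directly. The sketch below uses only the coefficients and correlation printed in the R output in the Appendix (intercept 2.2228, slope -0.1063, r = -0.1121026). The helper name predict_activity is an assumption introduced here for illustration.

```r
# Values copied from the report's R output.
intercept <- 2.2228
slope     <- -0.1063
r         <- -0.1121026

# Hypothetical helper: predicted activity hours for a given number of homework hours.
predict_activity <- function(homework_hours) {
  intercept + slope * homework_hours
}

predict_activity(0)   # equals the intercept
predict_activity(3)   # prediction near the center of the homework hours

# r^2: proportion of the variation in activity hours explained by the line.
r_squared <- r^2
round(100 * r_squared, 2)   # expressed as a percentage
```

The squared correlation comes out close to the reported value of about 0.0125, so the line accounts for only a little over 1% of the variation in activity hours.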
Conclusion
After completing an analysis of the variables homework hours and activity hours, I found little
to no correlation between the two, so there is no evidence of a linear relationship between them.
The varying levels of stress also did not cause any visible changes in the relation between
activity hours and homework hours, as the stress levels were very scattered according to the
scatter plot. Since my scatter plot was extremely random with no clear pattern, the linear model
that was used had an r^2 value of 0.0125, which is very weak, as it explains only about 1.25% of
the variation in activity hours. However, since the residual plot was very random and scattered,
a linear model was an appropriate choice, which allows me to conclude that there is no clear
relationship between activity hours and homework hours. Both histograms most likely had lower
values or no values at all for fractional numbers (such as 2.5), creating gaps or dips in between
whole numbers, because my survey didn’t specify that fractional hours could be reported. As a
result, most people might have rounded hours to the nearest whole number instead of reporting
fractional hours, causing the much lower counts of fractional hours in comparison to whole
numbers, which could have contributed to the low correlation between homework hours and activity
hours.
Appendix
Spreadsheet link: (Red Rows are data points that I threw out, Blue Rows are outliers that I
https://docs.google.com/spreadsheets/d/1TEg00y1N_9kS2R5eNa0j9XA6C7AV4tIlG5ji0bBOSMs/edit?usp=sharing
Form Link:
https://goo.gl/forms/QLA8mPky2kgpBRTL2
R Code:
> library("ggplot2")
>
>
> #Histogram for Activity Hours
> b + geom_histogram(binwidth = 0.5) + labs(x = "Activity Hours", title = "Histogram for Activity
>
> #Scatterplot comparing Activity Hours to Homework Hours with the categorical variable of Stress Levels
"Homework Hours"))
= "Activity Hours", title = "Activity Hours vs. Homework Hours", colour = "Stress Levels")+
>
= "Activity Hours", title = "Activity Hours vs. Homework Hours (Outliers Removed)", colour =
= seq(0, 7, by = 1))
>
Call:
(Intercept) Survey$Homework.Hours
2.2228 -0.1063
> lm(SurveyNoOutliers$Activity.Hours~SurveyNoOutliers$Homework.Hours)
#y=2.04803-0.09277x
Call:
Coefficients:
(Intercept) SurveyNoOutliers$Homework.Hours
2.04803 -0.09277
[1] -0.1121026
[1] -0.1179951
Call:
Residuals:
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>
> abline(0,0)
>
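The transcript above is incomplete, so the following is a minimal sketch of the same kind of analysis (linear fit, correlation, and residual plot) and not the original script. The column names Homework.Hours and Activity.Hours match those in the transcript. The helper fit_activity_model and the toy data frame are assumptions introduced here, and the toy values are NOT the survey responses.

```r
# Hypothetical helper: fits activity hours on homework hours and
# returns the fitted model together with r and r^2.
fit_activity_model <- function(survey) {
  model <- lm(Activity.Hours ~ Homework.Hours, data = survey)
  r <- cor(survey$Homework.Hours, survey$Activity.Hours)
  list(model = model, r = r, r_squared = r^2)
}

# Toy values for illustration only; these are NOT the survey responses.
toy <- data.frame(
  Homework.Hours = c(1, 2, 2.5, 3, 4, 5, 6),
  Activity.Hours = c(3, 1, 2, 2, 0.5, 2, 1)
)

fit <- fit_activity_model(toy)
coef(fit$model)   # intercept and slope of the fitted line
fit$r             # correlation coefficient
fit$r_squared     # share of variation explained by the line

# Residual plot: residuals against homework hours, with a reference line at 0.
plot(toy$Homework.Hours, resid(fit$model),
     xlab = "Homework Hours", ylab = "Residuals")
abline(0, 0)
```

For a one-predictor linear model, the squared correlation equals the R-squared reported by summary(), which is why r^2 can be read as the share of variation the line explains.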
Works Cited
“Create Elegant Data Visualisations Using the Grammar of Graphics.” Create Elegant Data