Stats Fall Semester Project

Robin Sam
5B, AP Stats
Fall Semester Project
Abstract
This project set out to determine whether there is a correlation between hours spent doing
homework and hours spent doing extracurricular activities among students of varying grade
levels at LASA, and whether different levels of stress correspond to different correlations.
The analysis found little to no correlation between hours spent doing homework and hours spent
doing activities, and levels of stress did not affect the correlation in any visible way.
Introduction
I chose to compare average daily time spent on homework with average daily time spent on
extracurricular activities, using stress as a categorical variable, so that I could see how the
time spent on homework and on other activities relates to the amount of stress someone
experiences. My original hypothesis was that more hours spent on homework and fewer hours spent
on other activities would correlate with high levels of stress. It makes sense that more time
spent on homework, which can be tedious, challenging or boring, could lead to more stress and
would leave less time for other activities that could be fun, interesting and less stressful.
I thought this would be of interest to teachers or even administration, as it could help them
decide whether to reduce the amount of homework assigned to students depending on how much
stress that work is causing. If teachers reduced the amount of homework assigned, students
could spend more time on other activities that could lower their stress levels.
Data Collection
I collected data for this project by first creating an anonymous Google Form with 3 questions:
“How many hours on average do you spend working on homework daily? (1-10)”, “How many hours on
average do you spend doing extracurricular activities daily? (0-10)”, and “On a scale of 1 to 5,
how much stress do you experience on average?” where 1 corresponds to “Little to no stress” and
5 corresponds to “Unbearable amounts of stress”. I then contacted teachers in the LASA Math
Department and chose five of them. I emailed all five the link to the Google Form and requested
that each of them number their classes from 1 to 6 and roll a die to decide which class would
take the survey. This was meant to prevent the bias of choosing the most studious class, since
the academic standing of a class could affect both the amount of time spent doing homework and
the amount of stress experienced, which would introduce an unwanted lurking variable. However,
one of the teachers offered to give the survey to all of their classes, which also prevents
that bias, because surveying every class ensures that less studious students take the survey
as well as studious ones. In the end I threw out 4 survey responses, as they were unreasonable
and would only have distorted my data.
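The die-roll selection described above can be mimicked in R. This is only a sketch of the idea: the teacher labels are placeholders, and the seed is there just so the example is repeatable, which a real die roll of course is not.

```r
# Hypothetical sketch: each teacher numbers their classes 1-6 and rolls one fair die.
set.seed(1)                          # only so the example is repeatable
teachers <- paste("Teacher", 1:5)    # placeholder labels, not the real teachers
chosen_class <- sample(1:6, size = length(teachers), replace = TRUE)
selection <- data.frame(teacher = teachers, class = chosen_class)
print(selection)
```

Because every class number has the same chance of being rolled, the teacher's opinion of which class is most studious plays no part in the choice.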
Analysis
The quantitative variable homework hours is fairly symmetrical, with a single peak and a slight
skew to the right. The variable has no clear outliers but has some noise in the form of gaps
between 3 and 4 hours and between 5 and 6 hours. The center of this data set is at around
3 hours, with a spread of 5 hours (1 to 6 hours). Fractional hours (such as 2.5) have far
fewer counts than the whole numbers themselves, such as in between 1 and 3, 2 and 3, and 4 and
5 hours.

The quantitative variable activity hours is not symmetrical. It has a single peak and a strong
skew to the right, with a few outliers at 6 and 7 hours. There is noise in the form of several
gaps in between 2 and 3, 3 and 4, 5 and 6, and 6 and 7 hours. The center of this variable is
around 2 hours, with a spread of 7 hours. The count of fractional hours is also far lower than
the count of whole numbers, such as in between 0 and 1, 1 and 2, and 4 and 5 hours.
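The center, spread, and whole-versus-fractional counts used in these descriptions can be computed directly. The sketch below uses a small made-up vector, not the survey responses, and takes the median as the center and the range width as the spread, which is an assumption about how those two numbers were read off the histograms.

```r
# Made-up responses for illustration only; not the survey data.
hours <- c(0, 1, 1, 1.5, 2, 2, 2, 3, 3, 4, 5, 7)

center <- median(hours)          # 2
spread <- diff(range(hours))     # 7 (from 0 to 7 hours)
whole  <- hours == round(hours)  # TRUE where a whole number of hours was reported

center
spread
table(whole)                     # 11 whole-number responses, 1 fractional
```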

Best-fit linear model equation (blue line on scatterplot): y = 2.2228 - 0.1063x (correlation of
-0.112), where y is the predicted activity hours given x homework hours.
The intercept of 2.2228 means that if the hours spent doing homework is 0, the model predicts
that the hours spent doing activities will be 2.2228 hours. The slope of -0.1063 means that for
every extra hour spent doing homework, the model predicts that the number of hours spent doing
activities will decrease by 0.1063 hours.
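As a quick worked example of reading the model, the sketch below plugs a few homework values into the fitted line. The coefficients are copied from the lm() output in the appendix (intercept 2.2228, slope -0.1063); the function name predict_activity is just a label chosen for this illustration.

```r
# Coefficients copied from the lm() output in the appendix.
intercept <- 2.2228
slope     <- -0.1063

# Hypothetical helper: predicted activity hours for a given number of homework hours.
predict_activity <- function(homework_hours) intercept + slope * homework_hours

predict_activity(0)                        # 2.2228, the intercept
predict_activity(3)                        # 1.9039
predict_activity(4) - predict_activity(3)  # -0.1063, the slope
```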
There are a few outliers at the data points (5, 5), (2.5, 6) and (1, 7). I expected that
removing these points would strengthen the correlation from -0.112 and thereby show a stronger
relation between homework hours and activity hours, and that the linear model would become less
steep, since the outliers pull the model upwards. After removing the outliers and recalculating,
the correlation became -0.118, which is only barely stronger (more negative) than the
correlation with the outliers included. This suggests that the outliers had minimal effect on
the overall correlation. They also barely changed the steepness of the linear model, as seen in
the graph below, where the model has equation y = 2.04803 - 0.09277x, whose slope of -0.09277
is slightly smaller in magnitude than the original equation’s slope of -0.1063.
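The with-and-without-outliers comparison can be sketched as follows. The data frame here is made up for illustration, and the cutoff of 6 or more activity hours is a hypothetical rule chosen for this toy data, not the rule used to pick the outliers in the project.

```r
# Made-up data for illustration; the real data are in the linked spreadsheet.
demo <- data.frame(
  homework = c(1, 2, 2, 3, 3, 4, 4, 5, 1, 2.5),
  activity = c(2, 2, 1, 2, 1, 1, 2, 1, 7, 6)
)

# Hypothetical cutoff for this toy data set only.
is_outlier <- demo$activity >= 6
trimmed    <- demo[!is_outlier, ]

# Correlation with and without the flagged points.
r_all  <- cor(demo$homework, demo$activity)
r_trim <- cor(trimmed$homework, trimmed$activity)

# Fitted lines with and without the flagged points.
fit_all  <- lm(activity ~ homework, data = demo)
fit_trim <- lm(activity ~ homework, data = trimmed)

c(r_all = r_all, r_trimmed = r_trim)
rbind(all = coef(fit_all), trimmed = coef(fit_trim))
```

Printing both rows side by side makes it easy to see how much the intercept and slope move when the flagged points are dropped.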
The value of r (the correlation) for Activity Hours vs. Homework Hours is -0.112, which means
that there is a very weak negative correlation between hours spent doing activities and hours
spent doing homework. The value of r^2 is about 0.0126, which means that only around 1.26% of
the variation in activity hours is explained by the linear model given by the equation
y = 2.2228 - 0.1063x.
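For a one-predictor linear model, r^2 is simply the square of the correlation, so the value can be checked by hand from the cor() result printed in the appendix:

```r
r <- -0.1121026            # correlation reported by cor() in the appendix
r_squared <- r^2

round(r_squared, 5)        # 0.01257, matching "Multiple R-squared" in the summary output
round(100 * r_squared, 2)  # 1.26, the percent of variation in activity hours explained
```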

Since this residual plot is fairly random and scattered, with only a very slight dip at around
4 homework hours, it suggests that a linear model is an appropriate form for this data, as the
residuals don’t form a visible pattern that would point to a curved model instead.
Conclusion
After analyzing the variables homework hours and activity hours, I found little to no
correlation between them, so the two variables do not appear to be linearly related. The
varying levels of stress also did not cause any visible change in the relation between activity
hours and homework hours, as the stress levels were very scattered on the scatter plot. The
scatter plot showed no clear pattern, and the linear model fitted to it had an r^2 value of
0.0126, which is very weak, as it explains only about 1.26% of the variation in activity hours.
However, since the residual plot was random and scattered, a linear model was a reasonable
choice, which allows me to conclude that there is no clear relationship between activity hours
and homework hours. Both histograms most likely had low or zero counts for fractional values
(such as 2.5), creating gaps or dips in between whole numbers, because my survey didn’t specify
that fractional hours could be reported. As a result, people might have rounded to the nearest
whole hour instead of reporting fractional hours. This would explain the much lower counts of
fractional hours in comparison to whole numbers, and it could have contributed to the low
correlation between homework hours and activity hours.
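One way to probe the rounding explanation is a small simulation that compares the correlation before and after rounding to whole hours. Everything below is simulated under stated assumptions (uniform homework hours between 1 and 6, a weak negative trend with noise loosely modeled on the fitted line); only the sample size of 92 comes from the project, since the summary output reports 90 residual degrees of freedom for a two-coefficient model.

```r
set.seed(42)   # arbitrary seed so the sketch is repeatable
n <- 92        # 90 residual degrees of freedom + 2 estimated coefficients

# Simulated "true" hours; these are assumptions, not survey data.
homework_true <- runif(n, min = 1, max = 6)
activity_true <- pmax(0, 2.2 - 0.1 * homework_true + rnorm(n, sd = 1.3))

# What respondents would report if they rounded to the nearest whole hour.
homework_rounded <- round(homework_true)
activity_rounded <- round(activity_true)

cor(homework_true, activity_true)        # correlation of the unrounded values
cor(homework_rounded, activity_rounded)  # correlation after rounding
```

Running this with several different seeds gives a sense of how much of a change in correlation rounding alone could plausibly produce.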
Appendix
Spreadsheet link: (Red Rows are data points that I threw out, Blue Rows are outliers that I
removed to make a second data set to calculate correlation without outliers)
https://docs.google.com/spreadsheets/d/1TEg00y1N_9kS2R5eNa0j9XA6C7AV4tIlG5ji0bBOSMs/edit?usp=sharing
Form Link:
https://goo.gl/forms/QLA8mPky2kgpBRTL2
R Code:
> library("ggplot2")
>
> #Histogram for Homework Hours
> a <- ggplot(Survey, aes(Homework.Hours))
> a + geom_histogram(binwidth = 0.5) +
+   labs(x = "Homework Hours", title = "Histogram for Homework Hours") +
+   scale_x_continuous(breaks = seq(0, 6, by = 1))
>
> #Histogram for Activity Hours
> b <- ggplot(Survey, aes(Activity.Hours))
> b + geom_histogram(binwidth = 0.5) +
+   labs(x = "Activity Hours", title = "Histogram for Activity Hours") +
+   scale_x_continuous(breaks = seq(0, 7, by = 1))
>
> #Scatterplot comparing Activity Hours to Homework Hours with the categorical variable of Stress Levels
> p <- ggplot(Survey, aes(Homework.Hours, Activity.Hours, colour = Average.Stress))
> p + geom_point() + geom_smooth(se = FALSE, method = lm) +
+   labs(x = "Homework Hours", y = "Activity Hours",
+        title = "Activity Hours vs. Homework Hours", colour = "Stress Levels") +
+   scale_x_continuous(breaks = seq(0, 6, by = 1)) +
+   scale_y_continuous(breaks = seq(0, 7, by = 1))
>
> #Same scatterplot with the outliers removed
> p <- ggplot(SurveyNoOutliers, aes(Homework.Hours, Activity.Hours, colour = Average.Stress))
> p + geom_point() + geom_smooth(se = FALSE, method = lm) +
+   labs(x = "Homework Hours", y = "Activity Hours",
+        title = "Activity Hours vs. Homework Hours (Outliers Removed)", colour = "Stress Levels") +
+   scale_x_continuous(breaks = seq(0, 6, by = 1)) +
+   scale_y_continuous(breaks = seq(0, 7, by = 1))
>
> #Linear Model description and r and r^2 values
> lm(Survey$Activity.Hours~Survey$Homework.Hours) #y=2.2228-0.1063x

Call:
lm(formula = Survey$Activity.Hours ~ Survey$Homework.Hours)

Coefficients:
          (Intercept)  Survey$Homework.Hours
               2.2228                -0.1063

> #Linear Model description when outliers are removed
> lm(SurveyNoOutliers$Activity.Hours~SurveyNoOutliers$Homework.Hours) #y=2.04803-0.09277x

Call:
lm(formula = SurveyNoOutliers$Activity.Hours ~ SurveyNoOutliers$Homework.Hours)

Coefficients:
                    (Intercept)  SurveyNoOutliers$Homework.Hours
                        2.04803                         -0.09277

> cor(Survey$Activity.Hours,Survey$Homework.Hours) #correlation between homework hours and activity hours
[1] -0.1121026
> cor(SurveyNoOutliers$Activity.Hours,SurveyNoOutliers$Homework.Hours) #correlation between homework hours and activity hours with outliers removed
[1] -0.1179951
> summary(lm(Survey$Activity.Hours~Survey$Homework.Hours)) #residual standard error=1.303, r^2=0.01257

Call:
lm(formula = Survey$Activity.Hours ~ Survey$Homework.Hours)

Residuals:
    Min      1Q  Median      3Q     Max
-2.1165 -0.9039 -0.0368  0.3086  4.8835

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)
(Intercept)             2.2228     0.3060   7.265  1.3e-10 ***
Survey$Homework.Hours  -0.1063     0.0993  -1.070    0.287
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.303 on 90 degrees of freedom
Multiple R-squared: 0.01257, Adjusted R-squared: 0.001596
F-statistic: 1.145 on 1 and 90 DF, p-value: 0.2874

>
> #Residual Plot
> plot(Survey$Homework.Hours, resid(lm(Survey$Activity.Hours~Survey$Homework.Hours)),
+      xlab = "Homework Hours", ylab = "Residual", main = "Residual Plot")
> abline(0,0)
>
Works Cited
“Create Elegant Data Visualisations Using the Grammar of Graphics.” Create Elegant Data
Visualisations Using the Grammar of Graphics • ggplot2, ggplot2.tidyverse.org/.
