
Robin Sam

5B, AP Stats

Fall Semester Project

Abstract

This project investigated whether there is a correlation between hours spent on homework and hours spent on extracurricular activities among students of varying grade levels at LASA, and whether different levels of stress correspond to different correlations. The analysis found little to no correlation between homework hours and activity hours, and stress levels did not affect the correlation in any visible way.

Introduction

I chose to compare average daily time spent on homework with average daily time spent on extracurricular activities, using stress as a categorical variable, so that I could see how time spent on homework and on other activities relates to the amount of stress someone experiences. My original hypothesis was that more daily homework hours and fewer daily activity hours would correlate with higher levels of stress. This seemed reasonable because homework can be tedious, challenging, or boring, which could lead to more stress, and it reduces the time left for other activities that could be fun, interesting, and less stressful. I thought the results would interest teachers and administrators, since they could help them decide whether to reduce the amount of homework assigned based on how much stress that work causes their students. If teachers assigned less homework, students could spend more time on other activities that might lower their stress levels.

Data Collection

I collected data for this project by first creating an anonymous Google Form with 3 questions: “How many hours on average do you spend working on homework daily? (1-10)”, “How many hours on average do you spend doing extracurricular activities daily? (0-10)”, and “On a scale of 1 to 5, how much stress do you experience on average?”, where 1 corresponds to “Little to no stress” and 5 corresponds to “Unbearable amounts of stress”. I then chose five teachers in the LASA Math Department, emailed each of them the link to the Google Form, and asked each to number their classes from 1 to 6 and roll a die to decide which class would take the survey. Choosing the class at random prevents the bias of picking the most studious class: the academic standing of a class could affect both the time spent on homework and the stress experienced, which would introduce a lurking variable. One of the teachers instead offered to give the survey to all of his or her classes, which also avoids that bias because it ensures that students from every class, not only the most studious one, take the survey. Finally, I threw out 4 survey responses because they were unreasonable and would have distorted my data.
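
The die roll described above can be mimicked in R. This is only a minimal sketch of the idea: the class labels 1 to 6 come from the text, while the variable names are hypothetical and the teachers used a physical die, not this code.

```r
# Hypothetical sketch: pick one of a teacher's class periods at random,
# mirroring the roll of a six-sided die described in the text.
class_periods <- 1:6                               # assumed labels for one teacher's classes
chosen_period <- sample(class_periods, size = 1)   # each class is equally likely
cat("Class period chosen for the survey:", chosen_period, "\n")
```

Because every class has the same chance of being picked, the teacher's opinion of which class is most studious cannot influence the choice.
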
Analysis

The quantitative variable homework hours is fairly symmetric, with a single peak and a slight skew to the right. It has no clear outliers but shows some noise in the form of gaps between 3 and 4 hours and between 5 and 6 hours. The center of the distribution is around 3 hours, with a range of 5 hours (1 to 6 hours). Fractional hours (such as 2.5) have far fewer counts than the neighboring whole numbers, such as in between 1 and 3, 2 and 3, and 4 and 5 hours.


The quantitative variable activity hours is not symmetric: it has a single peak and a strong skew to the right, with a few outliers at 6 and 7 hours. There is noise in the form of several gaps, in between 2 and 3, 3 and 4, 5 and 6, and 6 and 7 hours. The center of the distribution is around 2 hours, with a range of 7 hours. As with homework hours, fractional hours have far fewer counts than the neighboring whole numbers, such as in between 0 and 1, 1 and 2, and 4 and 5 hours.


Best-fit linear model equation (blue line on scatterplot): y = 2.2228 - 0.1063x (correlation of -0.112), where y is the predicted activity hours given x homework hours.

The intercept of 2.2228 means that if the hours spent doing homework is 0, the model predicts 2.2228 hours spent doing activities. The slope of -0.1063 means that for every extra hour spent doing homework, the model predicts that the hours spent doing activities will decrease by 0.1063 hours.
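
As a small illustration of how the intercept and slope are read, the sketch below plugs a few homework values into the fitted equation. The coefficients are the ones reported by lm() in the Appendix; predict_activity() is a hypothetical helper written only for this example.

```r
# Coefficients as reported by lm() in the Appendix.
intercept <- 2.2228
slope <- -0.1063

# predict_activity() is a hypothetical helper, not part of the original analysis.
predict_activity <- function(homework_hours) {
  intercept + slope * homework_hours
}

predict_activity(0)                         # the intercept: prediction at 0 homework hours
predict_activity(3)                         # prediction near the center of homework hours
predict_activity(4) - predict_activity(3)   # change per extra homework hour (the slope)
```

For a student reporting 3 hours of homework, the model predicts about 1.90 hours of activities, only about a third of an hour less than the prediction at 0 homework hours.
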

There are a few outliers at the data points (5, 5), (2.5, 6), and (1, 7). I expected that removing these points would strengthen the correlation from -0.112 and thereby show a stronger relationship between homework hours and activity hours, and that the linear model would become less steep because the outliers were pulling the model upwards. After removing the outliers and recalculating, the correlation became -0.118, which is only barely stronger than the correlation with the outliers included. This suggests that the outliers had minimal effect on the overall correlation. The linear model also became only slightly less steep, as seen in the graph below: the new model has equation y = 2.04803 - 0.09277x, whose slope of -0.09277 is slightly smaller in magnitude than the original slope of -0.1063.
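
The comparison with and without outliers can be sketched as follows. The data frame here is hypothetical: only the three flagged points (5, 5), (2.5, 6), and (1, 7) come from the text, and the remaining rows are invented for illustration, so the printed numbers will not match the survey results.

```r
# Hypothetical data: the last three rows are the outliers named in the text;
# the other rows are invented for illustration only.
survey <- data.frame(
  homework = c(1, 2, 2, 3, 3, 4, 4, 5, 6, 5, 2.5, 1),
  activity = c(2, 3, 1, 2, 1, 2, 1, 1, 2, 5, 6, 7),
  outlier  = c(rep(FALSE, 9), rep(TRUE, 3))
)
survey_no_outliers <- subset(survey, !outlier)

# Correlation with and without the flagged rows.
r_all  <- cor(survey$activity, survey$homework)
r_trim <- cor(survey_no_outliers$activity, survey_no_outliers$homework)

# Slope of the least-squares line with and without the flagged rows.
slope_all  <- coef(lm(activity ~ homework, data = survey))[["homework"]]
slope_trim <- coef(lm(activity ~ homework, data = survey_no_outliers))[["homework"]]

print(c(r_all = r_all, r_trim = r_trim))
print(c(slope_all = slope_all, slope_trim = slope_trim))
```

Keeping a logical outlier column, rather than deleting rows by hand, makes it easy to rerun both versions of the analysis from one data set.
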

The value of r (the correlation) for Activity Hours vs. Homework Hours is -0.112, which means that there is a very weak negative correlation between hours spent on activities and hours spent on homework. The value of r^2 is about 0.0126, which means that only around 1.26% of the variation in activity hours is explained by the linear model y = 2.2228 - 0.1063x.
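
The r^2 value follows directly from the correlation reported by cor() in the Appendix. As a quick check of the arithmetic, using the unrounded value of r:

```latex
r = -0.1121026, \qquad r^2 = (-0.1121026)^2 \approx 0.01257 \approx 1.26\%
```

This agrees with the "Multiple R-squared: 0.01257" line in the summary() output in the Appendix.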


Since this residual plot is fairly random and scattered, with only a very slight dip at around 4 homework hours, it suggests that a linear model is an appropriate form for these data: the residuals do not form a visible pattern that would point to a curved relationship.
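
To make the idea of a residual concrete, here is a minimal sketch on hypothetical data (the vectors below are invented, not the survey responses). It fits a line, extracts the residuals, and draws the same kind of plot as the one produced in the Appendix.

```r
# Hypothetical data for illustration only; not the survey responses.
homework <- c(1, 1.5, 2, 2, 3, 3, 4, 4, 5, 6)
activity <- c(2, 3, 1, 2.5, 2, 1, 0.5, 3, 2, 1)

fit <- lm(activity ~ homework)
res <- resid(fit)   # actual activity hours minus predicted activity hours

# Least-squares residuals always sum to (numerically) zero and are
# uncorrelated with the predictor, so curvature or fanning has to be
# judged by eye from the plot rather than from these two summaries.
sum(res)
cor(res, homework)

plot(homework, res, xlab = "Homework Hours", ylab = "Residual", main = "Residual Plot")
abline(h = 0)
```

A patternless residual plot supports the choice of a straight line over a curve, but it does not by itself say that the line explains much of the variation; that is what r^2 measures.
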

Conclusion

After analyzing the variables homework hours and activity hours, I found little to no correlation between them, so the two variables do not appear to be linearly related. The varying levels of stress also did not correspond to any visible change in the relationship between activity hours and homework hours, as the stress levels were scattered throughout the scatterplot. Because the scatterplot showed no clear pattern, I fit a linear model, which had an r^2 value of about 0.0126; this is very weak, as the model explains only about 1.26% of the variation in activity hours. However, since the residual plot was random and scattered, a linear model was a reasonable choice of form, which allows me to conclude that there is no clear relationship between activity hours and homework hours. Both histograms had low or zero counts for fractional values (such as 2.5), creating gaps or dips between whole numbers, most likely because my survey didn't specify that fractional hours could be reported. As a result, people might have rounded their hours to the nearest whole number instead of reporting fractional hours, causing the large drop in fractional counts compared to whole numbers, which could have contributed to the low correlation between homework hours and activity hours.

Appendix

Spreadsheet link (red rows are data points that I threw out; blue rows are outliers that I removed to make a second data set for calculating the correlation without outliers):

https://docs.google.com/spreadsheets/d/1TEg00y1N_9kS2R5eNa0j9XA6C7AV4tIlG5ji0bBOSMs/edit?usp=sharing

Form Link:

https://goo.gl/forms/QLA8mPky2kgpBRTL2

R Code:

> library("ggplot2")
>
> # Histogram for Homework Hours
> a <- ggplot(Survey, aes(Homework.Hours))
> a + geom_histogram(binwidth = 0.5) +
+   labs(x = "Homework Hours", title = "Histogram for Homework Hours") +
+   scale_x_continuous(breaks = seq(0, 6, by = 1))
>
> # Histogram for Activity Hours
> b <- ggplot(Survey, aes(Activity.Hours))
> b + geom_histogram(binwidth = 0.5) +
+   labs(x = "Activity Hours", title = "Histogram for Activity Hours") +
+   scale_x_continuous(breaks = seq(0, 7, by = 1))
>
> # Scatterplot comparing Activity Hours to Homework Hours with the categorical variable of Stress Levels
> # (axis labels are set in labs(); xlab is not an aesthetic, so it is not passed to aes())
> p <- ggplot(Survey, aes(Homework.Hours, Activity.Hours, colour = Average.Stress))
> p + geom_point() + geom_smooth(se = FALSE, method = lm) +
+   labs(x = "Homework Hours", y = "Activity Hours", title = "Activity Hours vs. Homework Hours", colour = "Stress Levels") +
+   scale_x_continuous(breaks = seq(0, 6, by = 1)) +
+   scale_y_continuous(breaks = seq(0, 7, by = 1))
>
> # Same scatterplot with the outliers removed
> p <- ggplot(SurveyNoOutliers, aes(Homework.Hours, Activity.Hours, colour = Average.Stress))
> p + geom_point() + geom_smooth(se = FALSE, method = lm) +
+   labs(x = "Homework Hours", y = "Activity Hours", title = "Activity Hours vs. Homework Hours (Outliers Removed)", colour = "Stress Levels") +
+   scale_x_continuous(breaks = seq(0, 6, by = 1)) +
+   scale_y_continuous(breaks = seq(0, 7, by = 1))
>
> # Linear Model description and r and r^2 values
> lm(Survey$Activity.Hours~Survey$Homework.Hours) # y = 2.2228 - 0.1063x

Call:
lm(formula = Survey$Activity.Hours ~ Survey$Homework.Hours)

Coefficients:
          (Intercept)  Survey$Homework.Hours
               2.2228                -0.1063

> # Linear Model description when outliers are removed
> lm(SurveyNoOutliers$Activity.Hours~SurveyNoOutliers$Homework.Hours) # y = 2.04803 - 0.09277x

Call:
lm(formula = SurveyNoOutliers$Activity.Hours ~ SurveyNoOutliers$Homework.Hours)

Coefficients:
                    (Intercept)  SurveyNoOutliers$Homework.Hours
                        2.04803                         -0.09277

> cor(Survey$Activity.Hours,Survey$Homework.Hours) # correlation between homework hours and activity hours
[1] -0.1121026
> cor(SurveyNoOutliers$Activity.Hours,SurveyNoOutliers$Homework.Hours) # correlation between homework hours and activity hours with outliers removed
[1] -0.1179951
> summary(lm(Survey$Activity.Hours~Survey$Homework.Hours)) # residual standard error = 1.303, r^2 = 0.01257

Call:
lm(formula = Survey$Activity.Hours ~ Survey$Homework.Hours)

Residuals:
    Min      1Q  Median      3Q     Max
-2.1165 -0.9039 -0.0368  0.3086  4.8835

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)
(Intercept)             2.2228     0.3060   7.265  1.3e-10 ***
Survey$Homework.Hours  -0.1063     0.0993  -1.070    0.287
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.303 on 90 degrees of freedom
Multiple R-squared:  0.01257, Adjusted R-squared:  0.001596
F-statistic: 1.145 on 1 and 90 DF,  p-value: 0.2874

>
> # Residual Plot
> plot(Survey$Homework.Hours, resid(lm(Survey$Activity.Hours~Survey$Homework.Hours)),
+      xlab = "Homework Hours", ylab = "Residual", main = "Residual Plot")
> abline(0,0)
>

Works Cited

“Create Elegant Data Visualisations Using the Grammar of Graphics.” Create Elegant Data Visualisations Using the Grammar of Graphics • ggplot2, ggplot2.tidyverse.org/.
