
Analyzing New York Metro

Udacity Intro to Data Science


Final Project

by Kevin Hung

kevhung11@gmail.com
Kevin Hung 2015

Image credits
"NYC subway-4D" by CountZ at English Wikipedia. Licensed under CC BY-SA 3.0 via Commons - https://commons.wikimedia.org/wiki/File:NYC_subway4D.svg#/media/File:NYC_subway-4D.svg
https://c1.staticflickr.com/3/2188/2363525822_9285c44cd7_b.jpg

Question of Interest |
How to Model Hourly Ridership Entries?

Image credits
http://i.dailymail.co.uk/i/pix/2012/11/05/article-2227401-15DC2645000005DC-20_634x407.jpg

sec3: Clue #1 | Hourly Schedule

Do people follow a predictable timetable or itinerary in ridership?


Peak hours seem intuitive for plausible reasons
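
One way to check the timetable intuition is to total entries by hour of day and look for the peak. A minimal sketch, assuming each record is an (hour, entries) pair; the function name and the sample values are made up for illustration and are not taken from the MTA data.

```python
from collections import defaultdict


def entries_by_hour(records):
    """Sum ridership entries for each hour of the day."""
    totals = defaultdict(int)
    for hour, entries in records:
        totals[hour] += entries
    return dict(totals)


# Hypothetical (hour, entries) records, for illustration only.
sample = [(8, 1200), (8, 900), (12, 400), (17, 1500), (17, 1100), (3, 50)]

totals = entries_by_hour(sample)
# The hour with the largest total is the candidate peak hour.
peak_hour = max(totals, key=totals.get)
print(totals, peak_hour)
```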

sec3: Clue #2 | Work Week

Whopping 10 Million Difference in our subset!

Found a Great Feature

sec3: Clue #3 | Is it Raining?

Do people tend to ride less on rainy days, or does ridership only look lower because our particular dataset has more non-rainy days than rainy ones?
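
A quick way to separate the two explanations is to compare averages per record instead of totals, since totals grow with the number of days observed. A minimal sketch, assuming each record is a (rain flag, entries) pair; the names and numbers are hypothetical.

```python
def mean_entries_by_rain(records):
    """Return (mean entries on rainy records, mean entries on non-rainy records)."""
    rainy = [entries for rain, entries in records if rain]
    dry = [entries for rain, entries in records if not rain]
    return sum(rainy) / len(rainy), sum(dry) / len(dry)


# Hypothetical (rain, entries) records: 2 rainy, 4 non-rainy.
sample = [(1, 900), (1, 1100), (0, 1000), (0, 1200), (0, 1100), (0, 1300)]

rainy_mean, dry_mean = mean_entries_by_rain(sample)
total_rainy = sum(e for r, e in sample if r)
total_dry = sum(e for r, e in sample if not r)

# Totals differ by 2.3x here, but the per-record means differ by only 1.15x:
# most of the gap in totals comes from having more non-rainy records.
print(total_rainy, total_dry, rainy_mean, dry_mean)
```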

Next: Let's test this!

sec1 | Statistical Test


Q: Why use a statistical significance test?
A1: Draw valid inferences!
A2: Formal framework to compare & evaluate data
A3: Tell us if perceived effects in our sample reflect the population as a whole

Single-Sided Mann-Whitney U-Test[1]: Are 2 populations the same?


H0: The distributions of rainy and non-rainy ridership populations are equal!

HA: No! Ridership of one population tends to be bigger than the other
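
To make the test concrete, here is a hand-rolled sketch of the Mann-Whitney U statistic with a normal approximation. It is not the code used for the project: the function name and sample values are hypothetical, it assumes samples large enough for the normal approximation, and it reports a two-sided p-value without a continuity correction. For a one-sided alternative in the observed direction, the approximation gives half that p-value.

```python
import math


def mann_whitney_u(x, y):
    """Return (U statistic for x, two-sided p-value via normal approximation).

    Tied values receive average ranks, and the variance uses the standard
    tie correction. No continuity correction is applied.
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    # Pool both samples, remembering which sample each value came from.
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])

    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        t = j - i                      # size of this group of tied values
        avg_rank = (i + 1 + j) / 2.0   # average of ranks i+1 .. j
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        tie_term += t ** 3 - t
        i = j

    u_x = rank_sum_x - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    z = (u_x - mean_u) / math.sqrt(var_u)
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return u_x, p_two_sided


# Hypothetical hourly entries, for illustration only.
rainy = [1, 2, 3, 4, 5]
dry = [6, 7, 8, 9, 10]
u, p = mann_whitney_u(rainy, dry)
print(u, p)  # reject H0 at the 5% level if p < 0.05
```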

sec1 | Mann-Whitney U-Test


Result

Reject the Null Hypothesis!


H0: The distributions of rainy and non-rainy ridership populations are equal!
HA: No! Ridership of one population tends to be bigger than the other
Rain may be a good feature

sec2 | Building Our Model

We'll use the Normal Equation to Find our Solution!


Easy as 1, 2, 3:

1. Design (Data Features Matrix)
2. Target (Ridership Entries as Integer Vector)
3. Parameters (Solution Vector that Minimizes Squared Error)
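
The three pieces fit together in the normal equation, theta = (X^T X)^-1 X^T y, where X is the design matrix, y the target vector, and theta the parameters. A minimal NumPy sketch under assumed toy data (the matrix and target below are hypothetical, not the project's features); it solves the linear system directly rather than forming the inverse.

```python
import numpy as np


def normal_equation(X, y):
    """Solve (X^T X) theta = X^T y for the least-squares parameters theta."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.linalg.solve(X.T @ X, X.T @ y)


# Hypothetical design matrix: a column of ones (intercept) plus one feature.
X = np.array([[1, 0], [1, 1], [1, 2], [1, 3]])
# Hypothetical target built so that y = 1 + 2 * feature exactly.
y = np.array([1, 3, 5, 7])

theta = normal_equation(X, y)
print(theta)  # approximately [1. 2.]
```

If X^T X is singular or close to it (for example, with redundant dummy columns), `np.linalg.lstsq` is the more robust choice.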

sec2 | Linear Regression


Our Model

Coefficient of Determination

Interpretation
~53% of the variation in ridership entries is explained by our model
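
For reference, the coefficient of determination is R^2 = 1 - SS_res / SS_tot: one minus the ratio of the squared prediction error to the squared deviation from the mean. A minimal sketch with made-up numbers (not the project's predictions):

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Hypothetical values, for illustration only.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 1.5, 3.5, 3.5]

r2 = r_squared(actual, predicted)
print(r2)  # SS_res = 1.0, SS_tot = 5.0, so R^2 is about 0.8
```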

sec2 | Model Appropriateness

Residual Plots Show that our model often underpredicts ridership for entries 2000+

Using Hour, Weekday, UNIT, Rain may not be adequate!

Suggestions: High-Bias Model, so Find More Features
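
The underprediction can also be checked numerically by averaging residuals (actual minus predicted) on each side of a cutoff. A minimal sketch, assuming the 2000 cutoff applies to the actual entries; the function name and the numbers are hypothetical.

```python
def mean_residuals(actual, predicted, threshold):
    """Mean of (actual - predicted) for actual < threshold and actual >= threshold."""
    low = [a - p for a, p in zip(actual, predicted) if a < threshold]
    high = [a - p for a, p in zip(actual, predicted) if a >= threshold]
    return sum(low) / len(low), sum(high) / len(high)


# Hypothetical actual and predicted entries, for illustration only.
actual = [500, 1000, 1500, 2500, 4000, 6000]
predicted = [600, 900, 1500, 2000, 2800, 3500]

low_mean, high_mean = mean_residuals(actual, predicted, 2000)
# A clearly positive mean residual in the high range means the model
# underpredicts there, while the low range averages out near zero.
print(low_mean, high_mean)
```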

sec4 | Conclusion

Mann-Whitney U-Test & Paired Histogram Show Possibility of People Tending to Ride the Metro More on Non-Rainy Days

Rainy Feature Contributes to 1% Increase in R^2

Need More Features: Incorporate Weather Factors? Foggy? Thunder? Temperature Values?

Image credits
http://pix.avaxnews.com/avaxnews/6a/ab/0001ab6a_medium.jpeg

sec0 | References
[1] "Mann-Whitney U Test." Wikipedia. Wikimedia Foundation, n.d. Web.
[2] "CS229 Lecture notes." Andrew Ng. http://cs229.stanford.edu/notes/cs229notes1.pdf
[3] Frost, Jim. "Regression Analysis: How Do I Interpret R-squared and Assess the Goodness-of-Fit?" Minitab, 30 May 2013. Web. 15 Sept. 2015.
[4] NIST/SEMATECH e-Handbook of Statistical Methods, http://www.itl.nist.gov/div898/handbook/pri/section2/pri24.htm
[5] "GraphPad Curve Fitting Guide." GraphPad, n.d. Web. 16 Sept. 2015. <http://www.graphpad.com/guides/prism/6/curvefitting/index.htm?reg_analysischeck_linearreg.htm>
* data sources: <http://web.mta.info/developers/developer-dataterms.html#data>
