NYC Vision Zero Dataset
The NYC Vision Zero Open Dataset was used for this analysis and for the machine learning model.
The data contains details of motor vehicle collisions in New York City, provided by the New York City Police
Department (NYPD). At the time of download there were 999020 data points (rows) with 29 attributes (columns).
The dataset is updated weekly on the website. Only 1162 of the 999020 collisions (about 0.12%) resulted in
fatalities.
Location: BOROUGH, ZIP CODE, LATITUDE, LONGITUDE, LOCATION, ON STREET NAME, CROSS STREET NAME, OFF STREET NAME
Injuries/Fatalities: NUMBER OF PERSONS INJURED, NUMBER OF PERSONS KILLED, NUMBER OF PEDESTRIANS INJURED, NUMBER OF
PEDESTRIANS KILLED, NUMBER OF CYCLIST INJURED, NUMBER OF CYCLIST KILLED, NUMBER OF MOTORIST INJURED, NUMBER OF
MOTORIST KILLED
Contributing Factors: CONTRIBUTING FACTOR VEHICLE 1, CONTRIBUTING FACTOR VEHICLE 2, CONTRIBUTING FACTOR VEHICLE 3,
CONTRIBUTING FACTOR VEHICLE 4, CONTRIBUTING FACTOR VEHICLE 5, UNIQUE KEY
Vehicle Type in Collision: VEHICLE TYPE CODE 1, VEHICLE TYPE CODE 2, VEHICLE TYPE CODE 3, VEHICLE TYPE CODE 4, VEHICLE
TYPE CODE 5.
Geocoding & reverse-geocoding
A significant amount of information was missing from the dataset, including borough names and latitude and
longitude values. As fatalities were our priority, we geocoded the intersections where fatalities occurred to
obtain their latitudes and longitudes. The address of each intersection was obtained by joining the ON STREET
NAME and the CROSS STREET NAME wherever the CROSS STREET NAME was non-empty. Once the latitude and
longitude values were geocoded, we reverse geocoded them to obtain the corresponding borough names.
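The address-building step can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `intersection_address`, the " & " separator, and the appended city string are assumptions, and the geocoding service itself is left out because the slides do not name one.

```python
def intersection_address(row, city="New York, NY"):
    """Build a geocodable intersection string from one collision record.

    `row` is a dict keyed by the dataset's column names. The separator and
    the city suffix are illustrative assumptions, not taken from the slides.
    """
    # Normalise missing values and stray whitespace.
    on_street = " ".join((row.get("ON STREET NAME") or "").split())
    cross_street = " ".join((row.get("CROSS STREET NAME") or "").split())

    # Only rows with both street names describe an intersection.
    if not on_street or not cross_street:
        return None
    return f"{on_street} & {cross_street}, {city}"


rows = [
    {"ON STREET NAME": "BROADWAY ", "CROSS STREET NAME": "WEST 42 STREET"},
    {"ON STREET NAME": "ATLANTIC AVENUE", "CROSS STREET NAME": ""},
]
addresses = [intersection_address(r) for r in rows]
```

Rows that return `None` have no usable intersection and would need another route to a location, such as existing coordinates.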
A few rows still lacked borough names even though latitude and longitude values were available. We reverse
geocoded these rows as well to obtain their borough names. Once this was done, we visualized the injuries
and fatalities for different years.
Injury and fatality visualizations
The visualizations below show the injuries and fatalities that occurred in 2016.
Fatalities resulting from vehicle collisions
Visualization of the fatalities since 2012 indicates that Manhattan has the highest fatality occurrence per unit area, while Queens has a
number of locations where more than one fatality has occurred at the same spot.
31 most frequent collision spots in NYC in 2017
The 31 most frequent collision locations (from Jan 1, 2017 to Mar 24, 2017) were manually reviewed.
DBSCAN algorithm
DBSCAN is a density-based clustering algorithm.
For our implementation, we selected a radius of 2 kilometers and a minimum of 3 fatal accidents occurring
within this radius.
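To make the two parameters concrete, here is a small, dependency-free sketch of DBSCAN over latitude/longitude points. It is an illustration under stated assumptions rather than the implementation used for the slides: distances use the haversine formula with an Earth radius of 6371 km, a point counts toward its own neighborhood, and the function names and sample coordinates are made up for the example. Neighborhood search here is quadratic, so it only suits small inputs.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed value)


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1 = map(math.radians, p)
    lat2, lon2 = map(math.radians, q)
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def dbscan(points, eps_km=2.0, min_pts=3):
    """Return one label per point: 0, 1, ... for clusters, -1 for outliers."""
    noise, unvisited = -1, None
    labels = [unvisited] * len(points)

    def neighbors(i):
        # All points within eps_km of point i, including i itself.
        return [j for j in range(len(points))
                if haversine_km(points[i], points[j]) <= eps_km]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not unvisited:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = noise  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == noise:
                labels[j] = cluster  # border point: joins but is not expanded
            if labels[j] is not unvisited:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:  # core point: keep growing the cluster
                queue.extend(nbrs)
    return labels


# Three nearby points and one distant point (illustrative coordinates).
fatalities = [(40.7580, -73.9855), (40.7590, -73.9845),
              (40.7570, -73.9865), (40.5795, -74.1502)]
labels = dbscan(fatalities, eps_km=2.0, min_pts=3)
```

The three nearby points form one cluster and the distant point is labeled -1, which corresponds to an outlier in the clustering plots.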
Clustering visualizations are shown below.
Note: The black points represent the outliers (i.e. they do not lie in any cluster).
Clustering Fatalities in Boroughs
We clustered traffic fatalities within each borough for each year, using a radius of 1.2 kilometers and a
minimum of 2 fatal accidents.
Crimes and collisions - correlated?
Is there a correlation between crimes and traffic accidents in NYC during 2015?
An example of what we aim to see: do many traffic collisions happen within 30 minutes and a 10-mile radius
of a crime scene?
Assumption
The haversine formula is used to compute the distance between a collision and a crime scene.
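The proximity test in the example question can be sketched as below. The record layout (`lat`, `lon`, `time` keys), the function names, the Earth radius of 3958.8 miles, and the choice to count collisions up to 30 minutes before or after the crime are all assumptions made for illustration; the slides do not specify them.

```python
import math
from datetime import datetime, timedelta

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius (assumed value)


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def is_near(crime, collision, max_miles=10.0, max_gap=timedelta(minutes=30)):
    """True if the collision is close to the crime in both time and space."""
    # Assumption: "within 30 minutes" means before or after the crime.
    close_in_time = abs(collision["time"] - crime["time"]) <= max_gap
    close_in_space = haversine_miles(
        crime["lat"], crime["lon"], collision["lat"], collision["lon"]
    ) <= max_miles
    return close_in_time and close_in_space


# Illustrative records, not taken from either dataset.
crime = {"lat": 40.7580, "lon": -73.9855, "time": datetime(2015, 6, 1, 12, 0)}
nearby = {"lat": 40.7484, "lon": -73.9857, "time": datetime(2015, 6, 1, 12, 20)}
too_late = {"lat": 40.7484, "lon": -73.9857, "time": datetime(2015, 6, 1, 13, 0)}
too_far = {"lat": 39.9526, "lon": -75.1652, "time": datetime(2015, 6, 1, 12, 20)}
```

Counting, for each crime, how many collisions satisfy `is_near` gives one simple quantity to compare against a baseline when looking for a correlation.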
Police reports on traffic accidents
Police reports on all traffic accidents in New York City were collected. Word clouds were generated from
the reports to find the most frequent causes of collisions and fatalities.
Figures: word cloud of reasons for all traffic collisions; word cloud of reasons for all traffic fatalities.
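A word cloud sizes each term by how often it occurs, so the underlying step is a frequency count. The sketch below shows that count only; the function name, the sample reason strings, and the decision to ignore empty and "Unspecified" entries are assumptions for illustration.

```python
from collections import Counter


def reason_frequencies(reasons, ignore=("", "Unspecified")):
    """Count how often each reported reason appears, skipping placeholders."""
    cleaned = (r.strip() for r in reasons if r is not None)
    return Counter(r for r in cleaned if r not in ignore)


# Illustrative values, not actual counts from the reports.
sample = [
    "Driver Inattention/Distraction",
    "Failure to Yield Right-of-Way",
    "Driver Inattention/Distraction",
    "Unspecified",
    None,
    " Driver Inattention/Distraction ",
]
counts = reason_frequencies(sample)
top_reason, top_count = counts.most_common(1)[0]
```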
Top contributing factors of Collisions
Annual collision trends in different boroughs
The trends show declining injuries from collisions in Brooklyn, Manhattan and Staten Island since 2013,
with Brooklyn having a significant 15% reduction in injuries.
Injury frequency vs. streets & boroughs
Broadway and Atlantic Avenue have the most collision injuries.
Monthly injury trends in different boroughs
Brooklyn and Queens seem to be more injury-prone than the other boroughs.
Vehicle type contributing to collisions
Machine learning on traffic data
Data on traffic collisions -> predictive model -> predict fatality given a collision
Handling class imbalance using SMOTE
Only a very small share of the collisions led to a fatality: 1162 of 999020, or roughly 1 fatality in every
860 collisions. This imbalance makes it difficult to train an accurate model, because a classifier can score
well by predicting the majority class while rarely identifying a fatality.
THE SOLUTION?
Synthetic Minority Oversampling Technique (SMOTE) was adopted to handle this imbalance by creating
synthetic samples of the fatalities. The algorithm selects two or more similar instances from the minority
class (using a distance measure) and perturbs an instance one attribute at a time by a random amount
within the difference to the neighboring instances.
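The oversampling step described above can be sketched in a few lines. This is a simplified illustration that follows the description on this slide, not the library or code actually used: the function name `smote_sketch`, the Euclidean distance measure, `k=5`, the fixed seed, and the toy feature vectors are all assumptions.

```python
import math
import random


def smote_sketch(minority, n_synthetic, k=5, seed=0):
    """Create synthetic minority samples by interpolating toward neighbors.

    `minority` is a list of equal-length numeric feature vectors and must
    contain at least two samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbors of the base sample (Euclidean distance).
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbor = rng.choice(others[:k])
        # Move each attribute a random fraction of the way toward the neighbor.
        synthetic.append(
            [b + rng.random() * (n - b) for b, n in zip(base, neighbor)]
        )
    return synthetic


# Toy minority class with two numeric features (illustrative values).
fatal = [[1.0, 10.0], [2.0, 12.0], [1.5, 11.0], [3.0, 15.0]]
new_samples = smote_sketch(fatal, n_synthetic=20, k=2)
```

Every synthetic value lies between the corresponding values of two real minority samples, so the new points stay inside the region the fatal collisions already occupy. Synthetic samples are normally added to the training data only, so that evaluation still reflects the real class balance.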
Accuracy of logistic regression
We use the Receiver Operating Characteristic (ROC) curve to measure the performance of the model.
The area under the curve (AUC) is 0.77 (77%) for this model.
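For readers unfamiliar with the metric, the area under the ROC curve equals the probability that a randomly chosen positive case (a fatal collision) receives a higher score than a randomly chosen negative case. The sketch below computes it directly from that definition; the function name and the toy labels and scores are assumptions, and the pairwise loop is meant for small examples only.

```python
def roc_auc(labels, scores):
    """AUC as the share of positive/negative pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    Requires at least one positive (1) and one negative (0) label.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy example: 1 = fatal, 0 = non-fatal; scores are predicted probabilities.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, y_score)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which gives a scale for reading the reported value.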