
NYC VISION ZERO

Anirudh Madhusudan, Harshad Rai, Kaushik Krishnan


Faculty Sponsor: Prof. Richard Sowers
University of Illinois at Urbana-Champaign

NYC Vision Zero Dataset
The NYC Vision Zero Open Dataset was used for this analysis and for the machine learning model.
The data contains details of motor vehicle collisions in New York City, provided by the New York City
Police Department (NYPD). At the time of download there were 999,020 data points (rows) with 29
attributes (columns). The dataset is updated weekly on the website. Only 1,162 of the 999,020 collisions
(about 0.12%) resulted in fatalities.

The attributes available in the dataset are:

Date & Time: DATE, TIME

Location: BOROUGH, ZIP CODE, LATITUDE, LONGITUDE, LOCATION, ON STREET NAME, CROSS STREET NAME, OFF STREET NAME

Injuries/Fatalities: NUMBER OF PERSONS INJURED, NUMBER OF PERSONS KILLED, NUMBER OF PEDESTRIANS INJURED, NUMBER OF
PEDESTRIANS KILLED, NUMBER OF CYCLIST INJURED, NUMBER OF CYCLIST KILLED, NUMBER OF MOTORIST INJURED, NUMBER OF
MOTORIST KILLED

Contributing Factors: CONTRIBUTING FACTOR VEHICLE 1, CONTRIBUTING FACTOR VEHICLE 2, CONTRIBUTING FACTOR VEHICLE 3,
CONTRIBUTING FACTOR VEHICLE 4, CONTRIBUTING FACTOR VEHICLE 5

Identifier: UNIQUE KEY

Vehicle Type in Collision: VEHICLE TYPE CODE 1, VEHICLE TYPE CODE 2, VEHICLE TYPE CODE 3, VEHICLE TYPE CODE 4, VEHICLE
TYPE CODE 5.
Geocoding & reverse-geocoding
Many rows in the dataset were missing information, including borough names and latitude and
longitude values. As fatalities were our priority, we geocoded the intersections where fatalities
occurred to obtain their latitudes and longitudes. The address of each intersection was
obtained by joining the ON STREET NAME and the CROSS STREET NAME wherever the CROSS
STREET NAME was non-empty. Once the latitude and longitude values were geocoded, we reverse
geocoded them to obtain the corresponding borough names.
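
The address-building step can be sketched as follows. This is a minimal illustration, not the code used for the project: the function name `intersection_address`, the " & " separator and the "New York, NY" suffix are all assumptions, and the geocoding call itself is left out because the slides do not name the service that was used.

```python
def intersection_address(on_street, cross_street, city="New York, NY"):
    """Build a geocodable intersection string, or return None if it cannot be built.

    Assumption: street names may be missing (None) or padded with extra spaces.
    """
    # Normalise whitespace; treat None as an empty string
    on_street = " ".join((on_street or "").split())
    cross_street = " ".join((cross_street or "").split())
    # Only rows with both an ON STREET NAME and a CROSS STREET NAME are usable
    if not on_street or not cross_street:
        return None
    # Hypothetical address format; a real geocoder may prefer a different one
    return f"{on_street} & {cross_street}, {city}"
```

Rows for which the function returns None would be skipped rather than geocoded.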

There were still a few rows where Borough names were not available even though the latitude and
longitude values were available. We reverse geocoded to obtain the Borough names for these rows. Once
this was done, we visualized the Injuries and Fatalities for different years.

Injury and fatality visualizations
The visualizations below show the injuries and fatalities that have occurred in 2016.

Fatalities resulting from vehicle collision
Visualization of the fatalities since 2012 indicates that Manhattan has the highest fatality occurrence per unit area, while Queens has a
number of locations where more than one fatality has occurred at the same spot.

31 most frequent collision spots in NYC in 2017
The 31 most frequent collision locations (from Jan 1, 2017 to Mar 24, 2017) were manually reviewed:

61% of these occurred at intersections

55% of these have an overpass near them (a bridge, road or similar structure that crosses over another road)

41% of these have park(s) near them

Image rendered using Tableau for the 31 most frequent collision spots. The
larger the circle, the higher the number of collisions at the location.

Fatality clustering
Visualizations of injuries do not reveal much, because injuries are spread out all across the city.
Fatalities, however, can be clustered to identify the danger zones (i.e. places where fatal accidents
occur most often).

We used the DBSCAN algorithm to cluster fatalities for different years.

DBSCAN algorithm
DBSCAN is a density based clustering algorithm.

1. Start with a set of points $S = \{x_1, \dots, x_N\} \subset \mathbb{R}^2$

2. Select a value for $r > 0$ and $\varepsilon > 0$

3. For each $x_n$, we consider a disc of radius $\varepsilon$ and set $A_n = \{x \in S : |x - x_n| \le \varepsilon\}$

4. If $|A_n| < r$, then we shall not involve $A_n$ in any further step

5. Take the union of $A_n$ and $A_m$ if $A_n \cap A_m \neq \emptyset$

6. Repeat until no unions take place

For our implementation, we selected a radius of 2 kilometers and a minimum of 3 fatal accidents occurring
within this radius.

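
The six steps listed above can be sketched directly in plain Python. This follows the steps as written on the slide rather than any library implementation, so it may treat border points differently from a standard DBSCAN routine. The names `haversine_km` and `cluster_by_neighborhoods` are hypothetical, and measuring the radius with the haversine distance on latitude/longitude pairs is an assumption; the defaults mirror the 2 km radius and the minimum of 3 fatal accidents stated above.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1 = map(radians, p)
    lat2, lon2 = map(radians, q)
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

def cluster_by_neighborhoods(points, eps_km=2.0, min_points=3):
    """Return one cluster label per point; -1 marks an outlier."""
    n = len(points)
    # Step 3: neighbourhood of every point (a point counts as its own neighbour)
    neighborhoods = [
        {j for j in range(n) if haversine_km(points[i], points[j]) <= eps_km}
        for i in range(n)
    ]
    # Step 4: drop neighbourhoods with fewer than min_points members
    dense = [a for a in neighborhoods if len(a) >= min_points]
    # Steps 5-6: merge overlapping sets; `clusters` always stays pairwise disjoint
    clusters = []
    for a in dense:
        merged = set(a)
        for c in [c for c in clusters if c & merged]:
            merged |= c
            clusters.remove(c)
        clusters.append(merged)
    labels = [-1] * n
    for label, cluster in enumerate(clusters):
        for idx in cluster:
            labels[idx] = label
    return labels
```

The pairwise distance computation is quadratic in the number of points, which is workable for roughly a thousand fatalities but would not scale to the full collision table.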
Clustering Visualizations are shown below:

Note: The black points represent the outliers (i.e. they do not lie in any cluster)

Clustering Fatalities in Boroughs
We clustered traffic fatalities within each borough for each year. We selected a radius of 1.2
kilometers and a minimum of 2 fatal accidents.

Crimes and collisions - correlated?
Is there a correlation between crimes and traffic accidents in NYC during the year 2015?

An example of what we aim to see: do many traffic collisions happen within 30 minutes and a
10 mile radius of a crime scene?

Data (Year 2015):

Location (latitude and longitude) of crimes and collisions

Time of crimes and collisions

Assumption

The haversine formula is used to compute the distance from a crime scene
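
A minimal sketch of the proximity check, assuming each crime and collision is a dict with `lat`, `lon` and `time` keys (a hypothetical record layout, not the dataset's own). The function names are illustrative, and counting the 30 minutes on either side of the crime is an assumption, since the slide does not say whether the window is before, after, or both.

```python
from datetime import datetime, timedelta
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))

def collisions_near_crime(crime, collisions, max_miles=10.0, window=timedelta(minutes=30)):
    """Return the collisions close to the crime in both time and space."""
    return [
        c for c in collisions
        # Assumption: the time window applies on either side of the crime
        if abs(c["time"] - crime["time"]) <= window
        and haversine_miles(crime["lat"], crime["lon"], c["lat"], c["lon"]) <= max_miles
    ]
```

Counting the length of the returned list for every crime gives one number per crime scene that can then be compared against a baseline.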

Police reports on traffic accidents
The police reports on all traffic accidents in New York City were collected. Word clouds were generated
from the police reports to find the most frequent causes of collisions and fatalities.

Wordcloud of reasons for all traffic collisions

Wordcloud of reasons for all traffic fatalities

Top contributing factors of Collisions

Annual collision trends at different boroughs
The trends show declining injuries from collisions in Brooklyn, Manhattan and Staten Island from 2013, with Brooklyn having a significant 15% reduction in injuries.

All 5 boroughs had a high number of injury incidents in 2013. There has been a reasonable reduction in incidents since then.

Injury frequency vs. streets & boroughs
Broadway and Atlantic Avenue have the most collision injuries

Most of the highly injury-prone streets pass through Brooklyn and/or Queens

The streets that pass through more than one borough (and are thus typically longer) tend to have more injuries from collisions.

Monthly injury trends at different boroughs
Brooklyn & Queens seem to be more injury-prone.

The trends show that injuries are more likely during spring and summer months than during winter months. This difference is more significant in Brooklyn and Queens. The decrease in injuries during winter months could be explained by fewer vehicles on the streets, or by more cautious driving (as a result of snow).

Matching trend lines in all the boroughs suggest that the seasons (across the months) have a potential role to play in the occurrence of collisions.
Collision occurrences based on hour/day of week
The heat map records the collisions over the span of years from 2012 to 2017 based on the day of week and hour of day.

We infer that collisions are more likely on weekdays during the morning peak hours of 7-10am and the evening peak hours of 3-10pm.

The collision rates are at their highest at 8-9am and 5-7pm on weekdays.

On weekends (Saturday and Sunday), collisions between 12 midnight and 6am are more frequent than on weekdays.
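
The counts behind such a heat map can be sketched as a tally of collisions per (day of week, hour) cell. This is an illustrative sketch only: `hour_weekday_counts` is a hypothetical name, and it assumes the DATE and TIME columns have already been parsed into `datetime` objects.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

def hour_weekday_counts(timestamps):
    """Count collisions per (weekday name, hour of day) cell."""
    counts = Counter()
    for ts in timestamps:
        # datetime.weekday() returns 0 for Monday through 6 for Sunday
        counts[(WEEKDAYS[ts.weekday()], ts.hour)] += 1
    return counts
```

Each key of the returned counter corresponds to one cell of the heat map; cells with no collisions are simply absent and read as zero.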

Vehicle type contributing to collisions

Machine learning on traffic data
Predictive model: predict fatality given a collision

Data on Traffic Collisions -> Logistic Regression -> Did collision result in a fatality? (y/n)

Inputs:

Time of day

Day of week

Month

Number of vehicles involved in collision

Accident at an intersection? (y/n)

Inappropriate turn? (y/n)

Very low event rate: 1 fatality per 1000 collisions!

SMOTE (Synthetic Minority Oversampling Technique) used to get a balanced dataset
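
One way the six model inputs might be derived from a single dataset row is sketched below. Everything about the derivation is an assumption, since the slides list the inputs but not how they were computed: the date and time formats, counting non-empty VEHICLE TYPE CODE columns as the number of vehicles, treating a non-empty CROSS STREET NAME as "at an intersection", and matching the contributing factor text "Turning Improperly". The function name `collision_features` is hypothetical.

```python
from datetime import datetime

def collision_features(row):
    """Turn one collision record (a dict keyed by column name) into model inputs."""
    # Assumed formats: DATE as MM/DD/YYYY and TIME as H:MM on a 24-hour clock
    when = datetime.strptime(f'{row["DATE"]} {row["TIME"]}', "%m/%d/%Y %H:%M")
    # Assumption: one non-empty VEHICLE TYPE CODE column per vehicle involved
    vehicles = sum(
        1 for i in range(1, 6) if (row.get(f"VEHICLE TYPE CODE {i}") or "").strip()
    )
    factors = [
        (row.get(f"CONTRIBUTING FACTOR VEHICLE {i}") or "").strip().lower()
        for i in range(1, 6)
    ]
    return {
        "hour": when.hour,
        "day_of_week": when.weekday(),  # 0 = Monday
        "month": when.month,
        "num_vehicles": vehicles,
        # Assumption: a cross street is only recorded for intersection collisions
        "at_intersection": int(bool((row.get("CROSS STREET NAME") or "").strip())),
        # Assumption: the factor label for an inappropriate turn
        "improper_turn": int("turning improperly" in factors),
    }
```

The target variable would then be whether NUMBER OF PERSONS KILLED is greater than zero for the same row.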

Handling class imbalance using SMOTE
For the machine learning approach adopted, only a very small fraction of the collisions led to a fatality
(about 1 fatality in every 1000 collisions). This imbalance makes the model difficult to train and its
predictions inaccurate.

THE SOLUTION?

Synthetic Minority Oversampling Technique (SMOTE) was adopted to handle this imbalance by creating
synthetic samples of the fatalities. The algorithm selects two or more similar instances from the minority
class (using a distance measure) and perturbs an instance one attribute at a time by a random amount
within the difference to the neighboring instances.
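
The interpolation idea described above can be illustrated with a short plain-Python sketch. It is not the implementation used for the project, and `smote_sample` and its parameters are hypothetical names. It follows the description given here, drawing a separate random fraction for each attribute; some implementations draw a single fraction per synthetic sample instead.

```python
import random

def smote_sample(minority, k=2, rng=None):
    """Create one synthetic minority-class point from a list of numeric tuples."""
    rng = rng or random.Random()
    base = rng.choice(minority)
    # k nearest minority neighbours of the base point (squared Euclidean distance)
    others = [p for p in minority if p is not base]
    neighbors = sorted(
        others, key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p))
    )[:k]
    neighbor = rng.choice(neighbors)
    # Move each attribute a random fraction of the way towards the neighbour
    return tuple(a + rng.random() * (b - a) for a, b in zip(base, neighbor))
```

Calling the function repeatedly until the fatal and non-fatal classes are the same size would give the balanced dataset; the synthetic points should be added to the training split only, so that evaluation still reflects real collisions.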

Accuracy of logistic regression
We use the Receiver Operating Characteristic (ROC) curve to measure the accuracy of the model. The Area Under the Curve (AUC) is 77% for this model.
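
For readers unfamiliar with the metric, the area under the ROC curve equals the probability that a randomly chosen fatal collision receives a higher predicted score than a randomly chosen non-fatal one, with ties counted as half. The sketch below computes it that way; `roc_auc` is a hypothetical helper written for illustration, not the routine used to obtain the 77% figure.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve from 0/1 labels and predicted scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    # Compare every positive score with every negative score
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))
```

A value of 0.5 corresponds to random guessing and 1.0 to perfect ranking. The pairwise comparison is quadratic, so a library routine would be preferable on the full dataset.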
