Corelational analysis of the phenomenon “road accident” using Data Mining techniques

January 2010 Eng. Răzvan PENEŞEL

Corelational analysis of the phenomenon “road accident” using Data Mining techniques

Eng. Răzvan PENEŞEL

Prof. Dr. Eng. Stefan HOLBAN – Scientific Coordinator
Faculty of Automation and Computers, University Politehnica Timisoara

Abstract
The purpose of this paper is to use data exploration techniques (the construction of
correlational models) to analyze the data that can be collected about a road accident, and to
show that this type of analysis can provide useful information when deciding whether to grant
the full or a partial amount for which the vehicle was insured. The paper presents how this
type of analysis can be applied to a road accident. To carry out this task we used Weka, a
data mining tool that is currently used in universities around the world for studies in data
mining.

Introduction
Recent years have marked a shift in how the volumes of data accumulated during a research
process are used: from looking back at what has happened to addressing questions about the
future. These changes followed the maturation of the technologies related to data mining.
The concept of Data Mining, in essence, defines a process of extracting new information
from existing data collections, information that can answer not only the question of what
happens, but also why it happens. A Data Mining study generally has two objectives: a
descriptive one, which determines the significant variables and their influences, and a
predictive one. In contrast to the usual database queries written in a query language like
SQL, Data Mining classifies and groups data from different and possibly incompatible systems,
seeking new associations. The techniques used in data mining enable users to analyze data at
different levels of detail and abstraction, which facilitates decision making.
The study Corelational analysis of the phenomenon “road accident” using Data Mining
techniques proposed two methods: one for determining the profile of the person who is prone
to cause a traffic accident, and one for determining the relationships that occur between
people who have caused accidents, in order to find the most relevant causes that led to those
events.

Creating the prototypes

We start from an existing data structure in which each instance (object of the database)
describes a road traffic accident, using data obtained from the forms that are completed at
the insurance company where the car is insured. The database contains a total of 19
accidents, each characterized by 18 attributes. The statistics computed for each attribute in
the database allow us to draw conclusions about the road traffic accident (Figure 1).
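
As a rough illustration of the per-attribute statistics summarized in Figure 1, the sketch below computes the minimum, maximum, mean and standard deviation of each numeric attribute. The attribute names and the three records are invented for the example and are not taken from the paper's database; Weka reports comparable statistics for numeric attributes in its Preprocess panel.

```python
from statistics import mean, stdev

# Hypothetical toy records: names and values are invented for illustration
# and are NOT the 19 accidents analyzed in the paper.
accidents = [
    {"hour": 8, "speed_kmh": 50, "driver_age": 25},
    {"hour": 17, "speed_kmh": 70, "driver_age": 40},
    {"hour": 23, "speed_kmh": 90, "driver_age": 31},
]


def summarize(records):
    """Return min, max, mean and sample standard deviation per attribute."""
    summary = {}
    for name in records[0]:
        values = [record[name] for record in records]
        summary[name] = {
            "min": min(values),
            "max": max(values),
            "mean": mean(values),
            "stdev": stdev(values),  # sample standard deviation (n - 1)
        }
    return summary


stats = summarize(accidents)
for name, values in stats.items():
    print(name, values)
```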

Figure 1. Statistical analysis of the attributes.

First we had to determine the center of gravity of the sample, so the K-Means algorithm was
configured as follows:

o We used an unsupervised method of work, in which we took into account all the attributes
that characterize an instance of the sample;
o The data set is viewed as a single cluster whose center of gravity is calculated;
o To calculate the center of gravity (the prototype sought), a Euclidean metric is used to
calculate the distances.
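
With a single cluster and a Euclidean metric, K-Means reduces to taking the attribute-wise mean of the sample, because the mean is the point that minimizes the sum of squared Euclidean distances. The minimal sketch below shows this, assuming all attributes have already been encoded as numbers; the sample values are invented for illustration.

```python
import math


def centroid(points):
    """Center of gravity: the attribute-wise arithmetic mean of the points."""
    n = len(points)
    dims = len(points[0])
    return [sum(p[d] for p in points) / n for d in range(dims)]


def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical instances with two numeric attributes (hour, speed in km/h).
sample = [[8.0, 50.0], [17.0, 70.0], [23.0, 90.0]]

prototype = centroid(sample)
distances = [euclidean(p, prototype) for p in sample]
```

Here `prototype` plays the role of the single-cluster prototype, and `distances` shows how far each instance lies from it.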
We will obtain a regression model of the phenomenon road accident; we will also obtain the
prototypes of people who cause road traffic accidents. One such prototype, obtained by
applying the clustering algorithms, is as follows (Figure 2):

Figure 2. Clustering the data

The data obtained was then analyzed for later adjustments to the algorithm.

In order to obtain a prototype of the men and of the women who cause a traffic accident, we
had to determine the two centers of gravity of the sample, for which purpose the K-Means
algorithm was configured as follows:

o We used a supervised method of work, in which we took into account all the attributes that
characterize an instance of the sample except the attribute "sex", which is the class around
which we formed the groups;
o The data set is seen as consisting of two clusters whose centers of gravity are calculated;
o To calculate a center of gravity (the prototype sought), we used a Euclidean metric to
calculate the distances.

Following the calculations we obtained two prototypes. The prototypes that result from
applying the K-Means algorithm are two generic persons, a man and a woman, whose
characteristics summarize the features of the data set and ultimately define the prototype
of the man or woman who will have an accident (Figure 3).
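
The two-cluster setup can be sketched as a plain K-Means loop in which the class attribute is left out of the distance computation and used only afterwards to name each cluster by its majority class. This is a minimal sketch, not the Weka implementation used in the study: the data points, class labels and initial centroids are invented, and attributes on different scales would normally be normalized first.

```python
import math
from collections import Counter


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_point(points):
    """Attribute-wise mean of a non-empty list of points."""
    return [sum(col) / len(points) for col in zip(*points)]


def k_means(points, initial_centroids, max_iter=100):
    """Alternate assignment and centroid update until assignments are stable."""
    centroids = [list(c) for c in initial_centroids]
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(len(centroids)), key=lambda k: euclidean(p, centroids[k]))
            for p in points
        ]
        if new_assignment == assignment:
            break
        assignment = new_assignment
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[k] = mean_point(members)
    return centroids, assignment


def label_clusters(assignment, classes):
    """Name each cluster after the most frequent class among its members."""
    counts = {}
    for cluster, cls in zip(assignment, classes):
        counts.setdefault(cluster, Counter())[cls] += 1
    return {k: c.most_common(1)[0][0] for k, c in counts.items()}


# Hypothetical data: (hour, speed in km/h); "sex" is held out as the class.
points = [[8, 50], [9, 60], [22, 110], [23, 120]]
sex = ["male", "male", "female", "female"]

centroids, assignment = k_means(points, [points[0], points[-1]])
labels = label_clusters(assignment, sex)
```

Each entry of `centroids` is then the prototype of one cluster, and `labels` records which class dominates that cluster.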

Figure 3. Prototypes: Cluster 0 – Male; Cluster 1 – Female

Corelational analysis

The next step in the study was the correlational analysis of the phenomenon road accident.
It produced abstract elements that characterize the entire data set, this time by determining
the relations that appear in the sample under analysis.
A correlational model will be built in the form:

Damage Score = f (accident characteristics)

It was considered that the result of an accident is the amount of damage produced to
the car, which is why it was selected as the dependent variable.
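
Assuming that f is linear in the encoded accident characteristics, which is what a multilinear regression fits, the model can be written as below. The symbols are notation introduced here for illustration and do not appear in the original study: x_i is the i-th encoded characteristic, \beta_i its coefficient, \beta_0 the intercept, \varepsilon the residual, and n the number of characteristics included in the model.

```latex
\[
  \text{Damage Score} \;=\; \beta_0 + \sum_{i=1}^{n} \beta_i \, x_i + \varepsilon
\]
```

The sign of each \beta_i indicates whether the Damage Score increases or decreases with the corresponding characteristic.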
We obtained a correlation matrix from which we can detect the relationships that exist in the
dataset. These relations can be extended to define relations that characterize the phenomenon
of "road traffic accident"; their credibility is that of the built model.
To achieve this goal we had to encode the parameters involved in creating the model.
To build the model we used Weka in Classify mode with the functions – LeastMedSq scheme,
which implements a multilinear regression (Figure 4). We used a graphical visualization to
inspect the result and the accuracy obtained. We analyzed the correlation coefficients, which
reflect the importance of each parameter in the adopted model, and thus the contribution of
each of the 14 parameters taken into account in determining the Damage Score of an accident.
Another very important step is the analysis of the intercorrelation coefficients. In this
stage we looked at each attribute, with the remark that the analysis is made only for
contributions above 25% that one parameter has in modifying another parameter.
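
The intercorrelation step can be sketched as follows: compute the Pearson correlation coefficient for every pair of encoded attributes and keep only the pairs whose coefficient exceeds the threshold. This sketch assumes that the 25% threshold applies to the absolute value of the Pearson coefficient, which the text does not state explicitly; the column names and values are invented for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists.

    Raises ZeroDivisionError if either list has zero variance.
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def strong_pairs(columns, threshold=0.25):
    """Return (attribute_a, attribute_b, r) for every pair with |r| > threshold."""
    names = list(columns)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(columns[a], columns[b])
            if abs(r) > threshold:
                pairs.append((a, b, r))
    return pairs


# Hypothetical encoded attributes, one list per column.
columns = {
    "speed_kmh": [50, 70, 90, 110],
    "hour": [10, 14, 14, 10],
    "damage_score": [1, 2, 3, 4],
}

pairs = strong_pairs(columns)
for a, b, r in pairs:
    print(f"{a} ~ {b}: r = {r:+.2f}")
```

Keeping the sign of r matters for interpretation: a negative coefficient means that one attribute decreases as the other increases.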

Figure 4. Linear regression model obtained

Conclusions

The resulting conclusions concern the emergence of a hierarchy of factors that contribute to
a given Damage Score. In descending order of contribution we can observe that:

o The largest contribution to damage with a high Damage Score comes from vehicle speed
(70%). The amount of damage increases with the speed;
o In second place is the driving age, at a rate of 40%. A higher driving age leads to
greater damage in the event of an accident;
o In third place is the time when the accident occurred, at a rate of 39%. The '-' sign in
front of this parameter is interesting to analyze. Its meaning is that the damage increases
as the hour at which the accident occurs decreases; in other words, the damage increases
towards the morning hours;
o In fourth place is the ninth parameter, Accident occurred involving. It shows whether only
one car was involved in the accident, or two or more. This parameter indicates that the
damage increases if there was a collision in traffic rather than when a car is parked;
o In fifth place, with a 20% rate, is the parameter Vehicle that was hit. The extent of the
damage increases with the speed of the second vehicle involved in the accident.

We would like to present some of the relationships that we discovered between the different
attributes.
1. Attribute 1 – Sex
• The earlier the hour, the greater the chance that the accident was caused by a man;
• Accidents involving women are more frequent during the weekend.
2. Attribute 2 – Hour of the accident
• The earlier the hour, the higher the chance that only one car was involved in the
accident;
• The later the hour, the greater the chance of a collision;
• Accidents in the early hours of the morning tend to happen towards the weekend.
3. Attribute 3 – Head lights
• Cars that caused crashes with their headlights on were involved in high-speed crashes.
4. Attribute 4 – Speed of the crashed vehicle
• The higher the speed, the greater the chance of a front/back collision;
• The greater the driving experience, the greater the speed at which the crash occurs.
5. Attribute 5 – Speed of the vehicle crashed into
• The lower the driving experience, the greater the speed of the car crashed into.
6. Attribute 6 – Encoding the cause of the crash
• The higher the cause code (quantified from 1 = didn’t give way to 6 = incorrect
overtaking), the higher the chance of a frontal collision.
7. Attribute 7 – Visibility conditions
• Accidents in poor visibility conditions tend to happen towards the weekend.
8. Attribute 8 – Road conditions
• In poor road conditions, younger people are more prone to accidents.
9. Attribute 9 – Accident involved with the damage of …
• The higher the cause code (quantified from 1 = didn’t give way to 6 = incorrect
overtaking), the greater the chance of 2 cars being involved in the crash.
10. Attribute 10 – Type of collision
• The greater the driving experience, the greater the chance that the crash will be a
collision.
11. Attribute 11 – Age
• The higher the age of the driver, the greater the chance of the crash occurring at the
end of the week.
12. Attribute 12 – Day of the crash
• Accidents that happen towards the end of the week become more frequent towards the end
of the year.
These conclusions have been compared with the Timisoara Traffic Police office’s annual
review (2008), and most of them were consistent with it.
