Statistics
Abstract
The meaning of robustness and its interpretation in different settings
are explained. The importance of robust statistics is shown by comparing
classical statistical methods with their basic robust counterparts on the
same data. A few examples using such methods are mentioned to highlight
their use in current research in various fields of study.
Keywords: Robustness, Robust Statistics
2. Robustness in Statistics
Robustness in statistics, as described by Huber, signifies insensitivity
to small deviations from the assumptions [4]. When using classical
statistical methods, we make a few assumptions. These assumptions
include: 1) the normality assumption (all the populations being analysed
are considered to be normally distributed); 2) the normal distributions
have the same variance and standard deviation. The problem with these
assumptions is that they are frequently violated by real-world data. The
data could be skewed, contain outliers and have thick tails, and when such
data sets are studied using classical statistical methods, false results
are obtained which are highly influenced by these departures from the
assumptions of normality and symmetry. So, when using robust statistics,
we try to reduce the influence of outliers on the estimators representing
the location and the scale of the data points. The other kinds of
violations have not been studied as deeply as the presence of outliers.
When representing a data set, we use two primary statistical values, i.e.
the mean and the variance. The mean is used to show the central tendency
or location, and the variance or standard deviation is used to represent
the dispersion, spread or scale of the data points about the location.
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i ; \qquad S^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$$
Here we can see that a single value can have a massive impact on the mean,
or on any other classical statistic for that matter. If the value of the
last point were +∞ or −∞, then the mean would also have been plus infinity
or minus infinity. Hence the mean can be called an unbounded statistic,
and even a single value can make it fluctuate wildly. The median, in
contrast, is impervious to such effects of bad data [6].
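The contrast between the two location estimates can be checked numerically. The following is a minimal sketch in Python; the sample values are hypothetical and are not the data set discussed in the text, and `arithmetic_mean` is a helper name introduced only for this illustration.

```python
from statistics import median

def arithmetic_mean(values):
    """Plain arithmetic mean: sum of the values divided by their count."""
    return sum(values) / len(values)

# Hypothetical sample (an assumption for illustration, not the source data)
clean = [2.1, 2.4, 2.2, 2.6, 2.3]
# Same sample with the last point replaced by a gross error
contaminated = clean[:-1] + [230.0]
# Same sample with the last point pushed to +infinity
unbounded = clean[:-1] + [float("inf")]

for name, sample in [("clean", clean),
                     ("contaminated", contaminated),
                     ("unbounded", unbounded)]:
    print(name, "mean =", arithmetic_mean(sample), "median =", median(sample))
```

In this sketch the mean follows the corrupted point all the way to infinity, while the median only moves from one central observation to its neighbour.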
$$\frac{\mathrm{MAD}(x)}{0.6745}$$
Comparing the variation of the scale parameters again, the MAD value for
the dataset is 1.47, and 1.35 after removal of the large data point.
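As a rough illustration of how little the MAD moves when one large point is removed, the sketch below computes the MAD, the MAD(x)/0.6745 scale estimate and the sample standard deviation with and without an outlier. The sample is hypothetical (it is not the data set that produced the 1.47 and 1.35 values), and the function names are introduced here for illustration only.

```python
from statistics import median, stdev

def mad(values):
    """Median absolute deviation: median of |x_i - median(x)|."""
    centre = median(values)
    return median([abs(x - centre) for x in values])

def normalized_mad(values):
    """MAD divided by 0.6745, which makes it comparable to the
    standard deviation when the data are normally distributed."""
    return mad(values) / 0.6745

# Hypothetical sample with one large data point (not the source data)
sample = [2.1, 2.4, 2.2, 2.6, 2.3, 230.0]
trimmed = sample[:-1]  # the same sample after removing the large point

for name, data in [("with outlier", sample), ("outlier removed", trimmed)]:
    print(name,
          "| stdev =", round(stdev(data), 3),
          "| MAD =", round(mad(data), 3),
          "| MAD/0.6745 =", round(normalized_mad(data), 3))
```

The standard deviation changes by orders of magnitude between the two runs, whereas the MAD stays of the same order.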
The problem arises when we try to use these robust methods on data that
actually follow a normal distribution. There, the efficiency of the robust
methods is low compared to the classical methods. Robust methods behave
optimally when the distribution has thicker tails or when outliers exist
in the data, whereas the classical methods are the better option for a
normal distribution, although even a small deviation from normality can
make them suboptimal [6].
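The efficiency loss under exact normality can be seen in a small simulation. The sketch below is only an illustration under stated assumptions: the sample size, number of trials and seed are arbitrary choices, and the mean and the median stand in for "classical" and "robust" location estimators respectively.

```python
import random
from statistics import fmean, median, pvariance

def sampling_variances(n_obs=25, n_trials=2000, seed=0):
    """Draw repeated standard normal samples and return the variance of
    the sample means and of the sample medians across the trials."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    means, medians = [], []
    for _ in range(n_trials):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
        means.append(fmean(sample))
        medians.append(median(sample))
    return pvariance(means), pvariance(medians)

var_mean, var_median = sampling_variances()
print("variance of the mean   :", var_mean)
print("variance of the median :", var_median)
print("ratio median / mean    :", var_median / var_mean)
```

For normal data the median should scatter more than the mean from sample to sample, which is the efficiency penalty described above.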
Next we discuss an important part of statistics, regression. Regression
analysis is used to understand the relation between two variables and to
fit a curve to the data in order to predict future values. The most common
method is Ordinary Least Squares (OLS), which minimizes the sum of squared
residuals between the data points and the fitted curve or line. The
simplest case is fitting the data to a straight line. Again the problem
arises due to outliers. Because the model is based on minimizing the sum
of squares of the residuals of the data points from the fitted line, it is
highly sensitive to outliers.
For an example, we take a data set containing the weights and the heights
of 10 children [8] and create a linear regression model.
Heights -> (65.78, 71.52, 69.40, 68.22, 67.79, 68.70, 69.80, 70.01, 67.90,
66.78)
Weights -> (112.99, 136.49, 153.03, 142.34, 144.30, 123.30, 141.49, 136.46,
112.37, 120.67)
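A minimal sketch of the OLS fit on the ten points listed above follows. It assumes height is the predictor and weight the response, which the text does not state explicitly. The second fit uses a hypothetical corrupted copy of the data (the last weight with a misplaced decimal point) purely to illustrate the sensitivity to outliers; that corrupted value is not part of the source.

```python
heights = [65.78, 71.52, 69.40, 68.22, 67.79, 68.70, 69.80, 70.01, 67.90, 66.78]
weights = [112.99, 136.49, 153.03, 142.34, 144.30, 123.30, 141.49, 136.46,
           112.37, 120.67]

def ols_line(x, y):
    """Closed-form least-squares fit of y = intercept + slope * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return slope, intercept

slope, intercept = ols_line(heights, weights)
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(heights, weights)]
print("slope =", slope, "intercept =", intercept)

# Hypothetical gross error: last weight recorded as 1206.7 instead of 120.67
weights_bad = weights[:-1] + [1206.7]
slope_bad, intercept_bad = ols_line(heights, weights_bad)
print("slope with one corrupted point =", slope_bad)
```

With the original points the fitted slope is positive; the single corrupted weight is enough to reverse its sign in this sketch.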
$$\sum_{i=1}^{n} \rho(r_i)$$
Huber's M-estimator [9]:
$$\rho(z_i) = \begin{cases} \dfrac{1}{2} z_i^2, & \text{if } |z_i| < c \\[4pt] c\,|z_i| - \dfrac{1}{2} c^2, & \text{if } |z_i| \ge c \end{cases}$$
Where
$$\rho(z_i) = \frac{1}{6}\left( c^6 - \left( c^2 - z_i^2 \right)^3 \right), \quad \text{if } |z_i| < c$$
(Or)
$$\rho(z_i) = 0, \quad \text{if } |z_i| \ge c$$
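Huber's function from this section can be sketched in a few lines. The default tuning constant c = 1.345 is an assumption (a commonly used choice, not a value given in the text), and `squared_loss` is included only to contrast with the least-squares penalty.

```python
def huber_rho(z, c=1.345):
    """Huber's loss: quadratic for |z| < c, linear for |z| >= c.
    The default c is an assumed, commonly used tuning constant."""
    if abs(z) < c:
        return 0.5 * z * z
    return c * abs(z) - 0.5 * c * c

def squared_loss(z):
    """Least-squares penalty, shown for comparison."""
    return 0.5 * z * z

# Small residuals are penalised identically; large ones grow only linearly
for z in [0.5, 1.0, 3.0, 10.0]:
    print("z =", z, "| squared =", squared_loss(z), "| Huber =", huber_rho(z))
```

The two branches meet at |z| = c, so the loss is continuous, and a large residual contributes far less to the objective than it would under least squares.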
$$\frac{Y_i - \bar{Y}}{s}$$
$$M_i = \frac{0.6745\,(x_i - \tilde{x})}{\mathrm{MAD}}$$
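The MAD-based score can be computed as in the sketch below, which takes the centre to be the sample median. The sample is hypothetical, and the cut-off of 3.5 is an assumption (a commonly quoted rule of thumb for this score, not a value stated in the text).

```python
from statistics import median

def modified_z_scores(values):
    """Return 0.6745 * (x_i - median) / MAD for every observation."""
    centre = median(values)
    mad = median([abs(x - centre) for x in values])
    if mad == 0:
        raise ValueError("MAD is zero; the score is undefined for this sample")
    return [0.6745 * (x - centre) / mad for x in values]

# Hypothetical sample (not the source data)
sample = [2.1, 2.4, 2.2, 2.6, 2.3, 230.0]
scores = modified_z_scores(sample)

THRESHOLD = 3.5  # assumed cut-off for labelling a point as a potential outlier
flagged = [x for x, m in zip(sample, scores) if abs(m) > THRESHOLD]
print("scores :", [round(m, 2) for m in scores])
print("flagged:", flagged)
```

Because both the centre and the scale are medians, the extreme point cannot mask itself by inflating the scale estimate, as it can with the mean and standard deviation.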
A few basic formal outlier tests are given below [11]. All the following
tests assume that the data approximately follow the normal distribution.
1) Grubbs' Test:
This test is used when we are trying to detect a single outlier. It
works through hypothesis testing, with the null hypothesis
representing no outliers and the alternative hypothesis being the
presence of an outlier.
$$\frac{\bar{Y} - Y_{\min}}{s} \quad \text{or} \quad \frac{Y_{\max} - \bar{Y}}{s}$$
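The two one-sided statistics can be computed as in the sketch below, where s is taken to be the sample standard deviation. The sample is hypothetical and the names `g_min` and `g_max` are introduced here for illustration. The sketch stops at the statistics themselves: deciding whether to reject the null hypothesis requires comparing them with a critical value, which is not implemented here.

```python
from statistics import mean, stdev

def grubbs_statistics(values):
    """Return (g_min, g_max): the distance of the smallest and the largest
    observation from the sample mean, in units of the sample standard
    deviation."""
    y_bar = mean(values)
    s = stdev(values)
    g_min = (y_bar - min(values)) / s
    g_max = (max(values) - y_bar) / s
    return g_min, g_max

# Hypothetical sample (not the source data)
sample = [2.1, 2.4, 2.2, 2.6, 2.3, 230.0]
g_min, g_max = grubbs_statistics(sample)
print("g_min =", g_min)
print("g_max =", g_max)
```

In this sample the statistic for the maximum is much larger than the one for the minimum, pointing to the largest value as the candidate outlier.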
Figure 6: Scatter plots of the modelling results and observations of total phosphorus load in
classical analysis and in robust statistics for the Middle Warta river in the validation process. [15]
References:
[1]
[2]
[3]
[4]
[5]
[6]
[7]
[8]
[9]
[10]
[11]
[12]
[13]
[14]
[15]
[16]