O U T L I e R S

O U T L I E R S ¹
Occasionally, a dataset may contain a value that is far
greater (or less) than, or doesn't display the same
characteristics as, the other values. This anomalous value
is termed an influential observation. If the influential
observation is not representative of the population being
sampled, it is called an outlier.
CAUSES
Influential observations and outliers occur for a variety
of reasons. Some are straightforward data generation or
reporting errors. Lab results that are off by a factor of
ten are often identified as outliers. Outliers occur in
business data for a variety of reasons. Reporting
deadlines may be missed, weather or construction may
prevent customers from shopping, and there may be one-time corrections for past errors.
Sometimes, there are deterministic influences that skew some measurements. For
example, aberrant measurements may be caused by instrument error or miscalibration.
Some outliers aren't errors but instead are the result of inherent variability or a natural
cause. So, if you run into outliers, try to figure out why they exist. They may mean
nothing, so that you can delete them from the analysis, or they may be critical to your
interpretation of a dataset. You'll probably find that, most of the time, the causes of
outliers will be unknown.
IDENTIFICATION
Influential observations and outliers are generally not difficult to detect. Sorting and
listing the data will often reveal questionable values, though the best way to identify a
potential outlier is by graphing the data. Histograms, box plots, probability plots, time-series
plots, or scatter plots of the data will usually reveal any aberrant values.
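As a rough companion to sorting and plotting, here is a minimal sketch that sorts a dataset and flags values beyond the whiskers a box plot would typically draw. The 1.5 x IQR fence is a common box-plot convention assumed here, not something this article prescribes, and the data are made up (one lab result entered a factor of ten too high). The function names are hypothetical.

```python
import statistics

def boxplot_fences(values, k=1.5):
    """Return (lower, upper) fences placed k * IQR beyond the quartiles.

    k=1.5 is the usual box-plot convention (an assumption, not a rule).
    """
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def flag_suspects(values, k=1.5):
    """List, in sorted order, the values that fall outside the fences."""
    lower, upper = boxplot_fences(values, k)
    return [v for v in sorted(values) if v < lower or v > upper]

# Made-up lab results; 121.0 looks like 12.1 off by a factor of ten.
data = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 121.0, 12.0, 11.7]

print(sorted(data))         # sorting alone already exposes the odd value
print(flag_suspects(data))  # values a box plot would draw as separate points
```

A flagged value is only a candidate. Whether it is an outlier still depends on why it occurred.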
Graphs are particularly effective in identifying three patterns of outliers:
Cross-trend Outliers. Cross-trend outliers lie a substantial distance away from the
rest of the data in positions that do not fall on the trend of the data. As a
consequence, they can substantially reduce R², inflate variances, and change
regression model coefficients. They are usually easy to identify in graphs; however,
their cause is usually difficult to ascertain.
In-Trend Outliers. In-trend outliers lie a substantial distance away from the rest of
the data in positions that do fall on the trend of the data. Like cross-trend outliers,
they are usually easy to identify in graphs. They can substantially inflate R² but not
change regression equations, which leads some analysts to include the outlier despite
evidence of questionable validity. Their cause is often easy to ascertain because of
the unique conditions the outlier represents.
Fringe Outliers. Fringe outliers lie a relatively small distance away from the rest of
the data in positions that parallel the trend of the data. They are not always easy to
identify in graphs. They can deflate R² and change regression equations. Their cause
is usually difficult to ascertain but may be the result of some bias in the data
collection.
¹ The idea for this title came from: Beckman, R.J. and Cook, R.D. (1983). Outlier..........s, Technometrics, 25,
119-163. If you don't get the joke, don't worry about it.
I'm being an outlier.
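To make the effects on R² and the regression line concrete, here is a small sketch that fits a least-squares line to made-up data three ways: as is, with a cross-trend outlier added, and with an in-trend outlier added. The data, the outlier positions, and the helper name fit_line are all assumptions for illustration. Fringe outliers are not shown.

```python
def fit_line(xs, ys):
    """Ordinary least-squares line; returns (slope, intercept, r_squared)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx, sxy ** 2 / (sxx * syy)

# Made-up data scattered around the trend y = 2x.
base_x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
base_y = [4, 1, 9, 6, 11, 9, 16, 15, 21, 18]

slope_base, _, r2_base = fit_line(base_x, base_y)

# Cross-trend outlier: far from the data and off the trend.
slope_cross, _, r2_cross = fit_line(base_x + [2], base_y + [30])

# In-trend outlier: far from the data but on the trend.
slope_in, _, r2_in = fit_line(base_x + [30], base_y + [60])

for label, slope, r2 in [
    ("no outlier", slope_base, r2_base),
    ("cross-trend", slope_cross, r2_cross),
    ("in-trend", slope_in, r2_in),
]:
    print(f"{label:12s} slope={slope:.2f}  R2={r2:.2f}")
```

With these particular numbers, the cross-trend point pulls R² down and changes the slope, while the in-trend point raises R² and leaves the slope nearly where it was.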
There are many statistical tests for identifying outliers. Outlier tests follow one of several
strategies. Deviation/spread tests are like simple t-tests. They are calculated as the difference
between the outlier value and the mean (or other measure of central tendency), divided by
the standard deviation (or other measure of data dispersion). Excess/spread tests, also called
Dixon-type tests, are calculated as the difference between the outlier and the next closest
value (or other observation in the dataset), divided by the dataset range (or other dispersion
statistic). Some statisticians prefer this type of approach because it is not necessary to have
good estimates of the mean and variance. Other outlier tests examine sums-of-squares,
skewness, and location relative to the center of the dataset.
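The two main strategies can be sketched in a few lines. This is only the arithmetic of the test statistics, under the assumption that the suspect is the largest value and that the mean and standard deviation are computed with the suspect included. Deciding whether a statistic is significant requires published critical values for the specific test, which are not reproduced here. The function names and data are made up.

```python
import statistics

def deviation_spread(values, suspect):
    """Deviation/spread statistic: (suspect - mean) / standard deviation."""
    return (suspect - statistics.mean(values)) / statistics.stdev(values)

def excess_spread(values):
    """Excess/spread (Dixon-type) statistic for the largest value:
    gap to the next closest value divided by the dataset range."""
    s = sorted(values)
    return (s[-1] - s[-2]) / (s[-1] - s[0])

# Made-up measurements with one high value.
data = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 15.9, 12.0, 11.7]

print(round(deviation_spread(data, max(data)), 2))
print(round(excess_spread(data), 2))
```

Note that the excess/spread ratio uses only three values from the dataset (the suspect, its neighbor, and the opposite extreme), which is why it does not depend on good estimates of the mean and variance.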
The truth of the matter is that outlier tests are often superfluous. If you can see it in a graph,
the test will usually confirm what you see. Tests are often convenient for convincing
reviewers that what you think is an outlier really is one. If you can't see it in a graph but an
outlier test is significant, it may be an outlier, or it may not. The real issue, in most cases, is what
you do if you find a value you think is an outlier.
TREATMENT
There are five options for treating outliers:
Inclusion - Inclusion involves keeping the outlier in the dataset. This approach
would make sense to use if you're looking to assess the effects of the anomalies.
Sometimes you're forced to take this approach because an unenlightened reviewer
thinks you are trying to "pull something." In cases like this, it might be beneficial
to run your analyses both with and without the outlier so that everyone can
understand its effect.
Correction - Correction involves changing the outlier to the correct value. This
doesn't happen often. You might find an outlier to be an error but you can't
correct it because you don't know what the true value should be. In that case,
deletion is probably a better option. If you're lucky, though, you might find an
outlier to be an error and be able to correct it.
Replacement - Replacement involves changing the outlier to a contingency
value. This approach is like the replacement options for missing data. Using the
mean or median in place of an outlier will bias the dataset, but not nearly as much
as the outlier. This is often the best approach to use for complex statistical
calculations.
Accommodation - Accommodation involves keeping the outlier in the dataset
but using "robust" statistical procedures that are less sensitive to outliers.
Nonparametric statistics are often used for this purpose.
Deletion - Deletion is simply removing the outlier from the dataset. This
approach would make sense if you're looking to assess general trends. Once
again, it might be beneficial to run your analyses both with and without the
outlier.
The option you select should depend on whether you believe the aberrant observation is
representative of the population you are investigating. Your objective and the type of
analysis you plan to do will also be considerations in this decision.
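Here is a minimal sketch of how four of the five options change a simple summary of made-up data. Correction is left out because it needs the true value, which you usually don't have. Using the median of the remaining values as the contingency value, and using the median as the "robust" statistic, are assumptions chosen for illustration.

```python
import statistics

# Made-up data; 121.0 is the suspect value.
data = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 121.0, 12.0, 11.7]
suspect = 121.0

# Inclusion: keep everything.
included = list(data)

# Deletion: drop the suspect value.
deleted = [v for v in data if v != suspect]

# Replacement: swap in a contingency value (here, the median of the rest).
fill = statistics.median(deleted)
replaced = [fill if v == suspect else v for v in data]

summary = {
    "inclusion (mean)": statistics.mean(included),
    "deletion (mean)": statistics.mean(deleted),
    "replacement (mean)": statistics.mean(replaced),
    # Accommodation: keep the value but use a statistic that resists it.
    "accommodation (median)": statistics.median(included),
}

for option, value in summary.items():
    print(f"{option:24s} {value:.2f}")
```

Replacement keeps the sample count intact while deletion shrinks it by one, which matters when the dataset has several variables per sample.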
WHAT SHOULD YOU DO?
If a statistical graphic or an outlier test suggests that a data value may be an influential
observation or an outlier, follow these steps:
1. Examine a variety of graphical depictions of all the data
points, including box plots, probability plots, bivariate plots,
time-series plots, and contour maps, to assess possible reasons
for the aberrant observation.
2. Review notes and metadata concerning the sample or
measurement to determine if any irregularities in the
sampling or data collection processes may be responsible for
the discordant value.
3. Review documentation related to data quality for the sample or
measurement to determine if any irregularities in the collection,
packaging, transport, and analysis or measurement and recording
processes may be responsible for the discordant value.
4. If any information indicates that the sample is probably not representative
of the population being sampled, consider the sample or measurement to
be an outlier and replace or delete it from further analysis. If possible,
collect a new sample or measurement.
5. If any information indicates that the sample should be representative of
the population, review results for other measurements from the same
source to determine if other results support the legitimacy of the
suspected outlier. Also, review results for the same variable that may have
been generated during previous sampling efforts.
6. If prior results for the variable or results for other variables are consistent
with results for the suspect sample or measurement, retain the value and
evaluate it as an influential observation.
7. If prior results for the variable or results for other variables are not
consistent with results for the suspect sample or measurement, consider
the value to be an outlier and replace or delete it from further analysis. If
possible, collect a new sample or measurement.
This procedure works best if both data analysts and reviewers can somehow be involved
in the examination process. Be sure to document all findings and decisions during this
process.
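The decision logic in steps 4 through 7 can be written down as a small record that also keeps the documentation trail. This is only a sketch: the class name, field names, and decision strings are hypothetical, and the two yes/no inputs stand in for judgments that come out of the graphs and record reviews in steps 1 through 3 and step 5.

```python
from dataclasses import dataclass, field

@dataclass
class SuspectReview:
    """Hypothetical record of the review of one suspect value."""
    value: float
    representative: bool                         # judgment from steps 1-3
    consistent_with_other_results: bool = False  # judgment from step 5
    notes: list = field(default_factory=list)    # documentation trail

    def decide(self):
        """Return the action suggested by steps 4, 6, or 7 and log it."""
        if not self.representative:
            decision = "outlier: replace or delete, resample if possible"  # step 4
        elif self.consistent_with_other_results:
            decision = "influential observation: retain"                   # step 6
        else:
            decision = "outlier: replace or delete, resample if possible"  # step 7
        self.notes.append(decision)
        return decision

review = SuspectReview(value=121.0, representative=True,
                       consistent_with_other_results=False)
review.notes.append("field log shows no sampling irregularities")
print(review.decide())
print(review.notes)
```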
If you decide to retain the outlier, consider using a nonparametric alternative to the
procedure you planned to conduct. If for some reason this is not feasible, consider
analyzing the dataset twice, once with the outlier and once without the outlier. Caveat
your conclusions on the basis of the outlier and recommend collecting additional samples
or measurements to assess its validity. Consultants always recommend additional work
anyway, so this should come as no surprise to either clients or reviewers.
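As an illustration of both suggestions, the sketch below computes a Pearson correlation and a rank-based (Spearman) correlation on made-up data, once with a suspect point and once without. Spearman's coefficient is used here as one example of a nonparametric alternative, which is an assumption. The right substitute depends on the procedure you had planned. The function names are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))

# Made-up data; the last point (2, 30) is the suspect value.
x_all = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 2]
y_all = [4, 1, 9, 6, 11, 9, 16, 15, 21, 18, 30]
x_trim, y_trim = x_all[:-1], y_all[:-1]

pearson_with, pearson_without = pearson(x_all, y_all), pearson(x_trim, y_trim)
spearman_with, spearman_without = spearman(x_all, y_all), spearman(x_trim, y_trim)

print(f"Pearson   with={pearson_with:.2f}  without={pearson_without:.2f}")
print(f"Spearman  with={spearman_with:.2f}  without={spearman_without:.2f}")
```

Reporting both rows side by side shows reviewers how much of the conclusion rests on the one suspect point.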
If you are assessing data trends, you will probably want to delete or replace any outliers.
Even a single outlier can mask significant trends. Be aware, however, that this action
could bias predicted values and the prediction error if the cause of the outlier is natural.
Given the choice to replace or delete an outlier, consider the number of samples you have
and the importance of the variable the outlier measures. Remember, if you delete
the outlier you will end up having to delete either the sample or the variable to conduct
your statistical analysis. If the variable is important and you don't have many samples,
consider replacing the outlier.
There is also a psychological component to consider when replacing or deleting outliers.
Scientists and engineers are taught that it is unethical to delete or change data that might not
fit with their expectations. Outliers challenge that notion. Statisticians and reviewers become
highly suspicious of each other when the need to judge an outlier arises. Consequently, it is
sensible to have a procedure for evaluating outliers in place that everyone agrees to before
the need arises. Even so, somebody will criticize you no matter what you do. It's the way
things work.
Read more about using statistics at the Stats with Cats blog. Join other fans at
the Stats with Cats Facebook group and the Stats with Cats Facebook page.
Order Stats with Cats: The Domesticated Guide to Statistics, Models, Graphs,
and Other Breeds of Data Analysis at amazon.com, barnesandnoble.com, or
other online booksellers.
