
Computer Networks 135 (2018) 15–31


A BasisEvolution framework for network traffic anomaly detection


Hui Xia a,1, Bin Fang b,∗, Matthew Roughan c, Kenjiro Cho d, Paul Tune e

a School of Accounting, Chongqing University of Technology, Chongqing 400044, China
b School of Computer Science, Chongqing University, Chongqing 400044, China
c ARC Centre of Excellence for Mathematical & Statistical Frontiers (ACEMS), School of Mathematical Science, University of Adelaide, Adelaide, SA 5005, Australia
d IIJ, Japan
e Image Intelligence, Australia

∗ Corresponding author. E-mail addresses: summertulip@126.com (H. Xia); fb@cqu.edu.cn, fangb@comp.hkbu.edu.hk (B. Fang); matthew.roughan@adelaide.edu.au (M. Roughan); kjc@wide.ad.jp (K. Cho); paul@imageintelligence.com (P. Tune).
1 The majority of this work was conducted while H. Xia was visiting the University of Adelaide.
https://doi.org/10.1016/j.comnet.2018.01.025

Article history: Received 13 January 2017; Revised 16 October 2017; Accepted 17 January 2018; Available online 31 January 2018.

Keywords: Anomaly detection; Basis; Evolution; Low false-alarm probability; SVD.

Abstract

Traffic anomalies arise from network problems, and so detection and diagnosis are useful tools for network managers. A great deal of progress has been made on this problem so far, but most approaches can be thought of as forcing the data to fit a single mould. Existing anomaly detection methods largely work by separating traffic signals into "normal" and "anomalous" types using historical data, but do so inflexibly, either requiring a long description for "normal" traffic, or a short, but inaccurate description. In essence, preconceived "basis" functions limit the ability to fit data, and the static nature of many algorithms prevents true adaptivity despite the fact that real Internet traffic evolves over time. In our approach we allow a very general class of functions to represent traffic data, limiting them only by invariant properties of network traffic such as diurnal and weekly cycles. This representation is designed to evolve so as to adapt to changing traffic over time. Our anomaly detection uses thresholding of the approximation residual error, combined with a generic clustering technique, to report a group of anomalous points as a single anomaly event. We evaluate our method against orthogonal matching pursuit, principal component analysis, robust principal component analysis and a back-propagation neural network, using both synthetic and real world data, and obtain very low false-alarm probabilities in comparison.

© 2018 Elsevier B.V. All rights reserved.

1. Introduction

Conceptually, a network traffic anomaly can be defined as an intentional or unintentional attack, fault, or defect, which perturbs "normal" network traffic behaviour [1]. The causes could be traffic outages, configuration changes, network attacks (e.g., DDoS attacks), flash crowds, network worms, and so on [2–4]. As the Internet continues to grow in size and complexity, traffic anomalies are becoming more common and diverse. They need to be detected and analysed effectively.

Most Intrusion Detection Systems (IDS) perform knowledge-based detection, searching for instances of known network problems by attempting to match pre-determined representations [5]. Such systems can be trained against known records, but cannot detect unknown problem types, i.e., anomalies [6]. Thus, there is a need for anomaly detection, which estimates the typical behaviour of network traffic and detects deviations from it [1,6].

Anomaly detection schemes need to have low false-alarm probabilities, at least of the order of 10^{-4}, because of the large number of tests running in even a moderately sized network, and the amount of human work required to process each anomaly. Most existing techniques do not achieve such low false-alarm probabilities [7,8], at least in part because they are restricted in their representation of typical traffic. Existing techniques represent typical traffic in terms of a set of functions of historical data – we refer to this as a "basis", though it may not be a strict mathematical basis for the space. The requirement that we accurately model typical traffic results either in a large error in representation, or in a large set of functions, which is undesirable for reasons described best under the heading of the "Principle of Parsimony" [9]; principally, problems of estimation and generalization.

In our work, we also search for basis functions, but we do so from a potentially infinite set, allowing a very accurate representation with a small number of functions. However, unlike Principal Components Analysis (PCA), which also performs such a search, we do so under constraints that match known behaviour of real

traffic: for instance, cyclicity. Current anomaly detection approaches usually ignore the nature of network traffic, resulting in a derived basis that is hard to interpret.

Also, real network traffic patterns do change over time, but slowly. For example, the peak time of traffic might shift from 8pm to 7pm due to the change to summer time; and traffic will change as an ISP recruits or loses customers [10]. We therefore also present an update algorithm that improves the basis with new data, allowing the algorithm to adapt to network changes without a computational explosion. The evolved basis usually remains similar to the old basis, but if the reconstruction error of the evolved basis is outside a preset threshold, we reinitialize.

Once we have a basis for typical traffic, anomalous points are easy to detect by examining the reconstruction error.

An additional contribution of this paper is to go beyond simple metrics for anomaly detection performance (such as false-alarm and missed-detection probabilities). Real anomalies persist, and better metrics for anomaly detection also determine the extent or duration of the anomaly. In this paper, we use a clustering algorithm and associated quality metrics to compare detection algorithms. Hence, anomaly detection in our framework includes two components: anomalous-point detection and anomaly clustering. The first is used for detecting anomalous points, and the second is for analysing anomalies.

We use both synthetic and real world data to assess the performance of our framework. The synthetic data is less real, but is necessary to provide accurate metrics [11]: there are intrinsic difficulties in obtaining large enough sets of accurately marked anomaly data, given that the nature of an "anomaly" is intrinsically defined only by its absence in typical traffic. We use real data to illustrate the relationship between anomalies detected by our approach and those of the PCA, Robust Principle Component Analysis (RPCA), Orthogonal Matching Pursuit (OMP), and Neural Network using Back Propagation (NNBP) detection models. We show that our approach has substantially better detection characteristics than these alternatives, including false-alarm probabilities as low as 7.8 × 10^{-5}.

2. Background and related work

In many cases, network traffic records contain a number of properties, such as source IP addresses and destination IP addresses, which expand the context of anomaly detection to a high-dimensional space. In view of this, anomaly detection approaches first process the attributes of the data, because they are of different types. Examples include calculating the entropy of some features of the data, e.g., the number of observed host addresses [4]; removing mean or median values of the data [12,13]; and measuring the mutual information of each feature for selection [14].

In real detection scenarios, higher-dimensional data require more resources for calculating parameters, and may cause overfitting in the training stage, even though the results are more explainable. In view of this, researchers adopt attribute selection (or extraction) to reduce dimensionality. For example, Huang et al. [15] chose the Self-Organizing Map (SOM) to extract principal components as a compressed representation of a "normal picture"; Chen et al. [16] filtered irrelevant attributes by utilizing the maximal information coefficient (MIC) to increase the classification accuracy.

One explicit characteristic of high-dimensional network traffic data is that almost all anomalies are embedded in some lower-dimensional subspaces [17]. Existing detection methods for those "subspace anomalies" include the subspace projection-based model [17] and PCA [13,18]. Other techniques such as Artificial Neural Network (ANN) classifiers [5,15], Fourier analysis [19], wavelet analysis [20] and "BasisDetect" [21] perform better in lower- or single-dimensional environments. Moreover, Teodoro et al. [6] provided a comprehensive review of commonly used techniques, platforms, systems and projects, while Ahmed et al. [22] presented an in-depth analysis of four major anomaly detection techniques: classification, statistical, information theory and clustering. In addition, Chandola et al. [23] discussed anomaly types and analyzed anomaly detection algorithms in six classes: classification, nearest neighbor, clustering, statistics, information theoretic and spectral-based techniques. As conventional distance and similarity measurements lose efficacy as the dimensionality increases, Weller-Fahy et al. [24] discussed various types of distance and similarity measurements so that the most appropriate evaluation metrics can be used in detection.

Instead of reviewing the above detection methods, we seek to describe a framework within which we can place various approaches, and we use a few exemplar techniques to illustrate this framework, and against which we compare our approach.

Fig. 1. A simplified picture of the problem of anomaly detection. Conceptually, there is a space of traffic signals which are "typical", τ, within the larger space of all traffic signals, Ω. We seek to estimate this region from historical data, and approximate it by τ̂, so that we can classify new signals. Inaccuracies in the approximation result in errors of the two types indicated.

In general, anomaly detection algorithms use historical records to establish a baseline for typical behaviour, and then separate unusual, or anomalous, traffic from this baseline. Conceptually this can be illustrated by Fig. 1, which shows how we form an approximation for classification. Some techniques treat this as a simple classification problem, while others project into the typical space and look for outliers in the residual, but the underlying problem is the same.

There is a tension in all techniques in the number of functions or features used to approximate the region τ. If more features are used, we might obtain a more accurate approximation; however, this works against the "Principle of Parsimony" [9], which argues for smaller numbers of features to be used. The principle expresses the fact that larger numbers of features are harder to accurately estimate (from the historical data), and more importantly, larger numbers of features may not be generalizable. That is, they may have great explanatory power for the historical data, but do not work well for data on which they have not been trained. A common expression is that we are "fitting the noise".

The result is that although some approaches can approximate a region very accurately, the region they match is polluted by noise (for instance, undiagnosed anomalies in the training data, which are quite common). An illustrative example is the poor performance of PCA when the traffic measurements are polluted with anomalies [25], thereby distorting the approximation region τ̂. So there is a tradeoff for any one technique, but better approaches can make unilateral improvements by choosing a better set of features or functions with which to represent τ.

Fig. 2. The general anomaly detection process. (Data transformation) is a preprocessing procedure. (Basis derivation) generates a basis from the historical data. (Data reconstruction) approximates the new data with the basis. (Anomaly detection) detects anomalies from the residual series, which records the difference between the new data and its approximation.

The tradeoff can be seen in most approaches described above. For instance, Fourier analysis often either directly filters low-frequency components of the signal in the Fourier domain, or approximates this through a low-pass filter, which requires trading off the number of frequencies used in the approximation. A larger number leads to a better approximation, but the approach then becomes more subject to noise. Similarly, PCA constructs a "normal subspace" to approximate the region τ, which requires trading off the number of principal components [26].

Many approaches choose a hybrid way to address the tradeoff problem. They combine PCA with multi-scale methods so that a moderate number of features can approximate the typical traffic space τ more accurately without losing generalization and efficiency. For example, some researchers applied Empirical Mode Decomposition (EMD) to the extracted principal components and collected Intrinsic Mode Functions (IMFs) as multiscale features for network traffic [27]. Similarly, Jiang et al. [8] applied the wavelet transformation as a multiscale analysis, through which anomalous traffic can be identified using mapping functions on the extracted principal components at each scale. Novakov et al. [26] introduced a cooperative detection approach with two methods, PCA- and wavelet-based detection algorithms, in which the former examines the entire set of data to roughly detect anomalous time bins, while the latter applies a detailed, localized frequency analysis to identify potential anomalous time bins. Additionally, Chen et al. [16] proposed a multiscale PCA (MSPCA) model to decrease the typical traffic reconstruction error by combining PCA and the Discrete Wavelet Transformation (DWT). An improved version involves Bayesian PCA [28].

Within these approaches BasisDetect [21] chooses an interesting path. It seeks to use a broader set of potential features, selecting a small number of those that match the data to create representations. By allowing a broader class of feature vectors, it allows a better representation with fewer features, and this is a general property. However, if we allow that class of features to become too wide, we hit computational barriers. Techniques such as PCA select from a much larger (infinite) set of possible feature vectors for representation, but do so in a manner that assumes an underlying model. We argue here that the patterns extracted by PCA do not represent what we see in real traffic data: it assumes independent and identically distributed Gaussian samples, and it has long been known that Internet traffic measurements (at least) show strong cyclical dependencies.

We describe these processes more formally by suggesting that the region τ̂ is described by a set of basis functions. They are not truly a basis, as the region may not be a vector space (it may not even be convex), but this terminology seems to be helpful for a uniform description of approaches. Given this terminology, Fig. 2 shows how approaches can be roughly separated into four steps: data transformation, basis derivation, data reconstruction and anomaly detection. We use this structure to explain what is novel about our approach.

We will not dwell on the initial transformation process, as it is about the type of anomalous behaviour observable, not the technique for observing it. In the basis derivation process, techniques seek ways to pick better functions, so that the reconstructed traffic τ̂ is closer to τ. Our approach tackles these problems by:

1. allowing a wider class of basis functions, constrained by known traffic features in order to avoid an explosion of possibilities; and
2. an adaptive mechanism to allow the representation to easily evolve.

The latter has two advantages. Firstly, it reduces the computational overhead. Secondly, and perhaps more importantly, it also constrains the basis functions. The constraint forces them to be more meaningful than if we trained the features arbitrarily. We have deliberately chosen four exemplar approaches that cover a very wide range of ideas for anomaly detection, as shown in Table 1, and in doing so we aim to illustrate the general improvement that our approach obtains in the following section.

3. BasisEvolution framework

Techniques in Table 1 derive a basis that preserves characteristics in either the time, frequency or spatial domain. The underlying assumption is that we know little about the input data, and must discover its characteristics in these domains. However, we know that in the frequency domain, typical traffic has dominant cycles with 24 h and 1 week periods [29]. Moreover, it has less variation over short periods than these dominant cycles. We do not need Fourier analysis, or the like, to tell us this. While techniques such as PCA may appear to be sensitive to these cycles, PCA is actually quite agnostic to the period, since it only responds to the strength of variation across the period, a fact which can be seen if we permute the traffic in time (PCA will produce identical, albeit permuted, results).

Roughly speaking, the idea of BasisEvolution is to allow for great flexibility in discovering basis functions so as to be able to find a very small set of them, but to use cyclicity and other traffic properties to restrict our search to make it practical. We do this through two approaches:

1. Initializing the basis from the typical traffic.
Given a small set of typical (cleaned) data, we apply Singular Value Decomposition (SVD) [30] in each cycle to find a set of functions (referred to as a function family) preserving the largest energy. After finishing all cycles, the function families are gathered together, forming a final basis. Through initialization, the functions in the basis are all "normal patterns", and are easier to understand, since each element in the basis matches a specific cycle. Besides, the size of the basis is usually very small. Note that small anomalies can be hidden by a larger anomaly, and that anomalies occur at different scales. The above initialization finds basis functions of different scales (periods). This multi-scale representation of the traffic leads to better detection performance, with more anomalies of different scales being found.
2. Evolving the old basis for the new traffic.
Approaches from Table 1 adapt to new data either by using a fixed basis, which may enlarge the approximation error, or, in the case of PCA-like approaches, by deriving the basis again for the new data at extra computational cost. BasisEvolution addresses this issue by adapting the old basis into a new one. More specifically, once the basis has been initialized, for any new (dirty) traffic we can obtain the corresponding new basis through evolution. The new basis minimizes the approximation error for the new traffic while remaining similar to the old basis. As time goes on, the continuous evolution of the basis can better represent the statistical changes of the "normal patterns" in the new traffic. Compared with the basis derivation process, the evolution approach can effectively reduce the computational cost, especially for large-scale data.
Table 1
Description of anomaly detection in terms of candidate basis functions and techniques of finding a representation for a given data set. Here, the candidate solution space describes the set of functions from which we can construct a typical traffic signal, and No. is the number of functions in this space for a signal of length n. The basis derivation refers to the technique used to refine this to a basis for the typical traffic. The basis size is the number of functions then required to accurately represent the historical traffic.

Technique        | Candidate solution space                                                | No. | Basis derivation                      | Basis size
Fourier          | Sinusoidal basis                                                        | n   | Predetermined low-frequency functions | Large
Wavelet [20]     | Over-complete wavelet functions                                         | kn  | Soft thresholding (denoising)         | Large
BasisDetect [21] | Over-complete dictionary of sinusoids, deltas and/or wavelet functions  | kn  | Improved OMP                          | Moderate
PCA [14]         | Arbitrary                                                               | ∞   | Eigenanalysis                         | Small

To define the algorithm formally, we start by noting that we sample traffic at times $t_j$, separated by intervals $\Delta t$. The variable $x_{t_j}$ refers to the total traffic during the interval $[t_j, t_{j+1})$. We group these into blocks of $b$ measurements, such that we obtain a new process $X = \{x_0, \ldots, x_n\}$, where

$$x_m = [x_{t_j}, \ldots, x_{t_{j+b-1}}]^T,$$

where $j = bm$, and the time intervals in this series are $\Delta T = b\,\Delta t$. In BasisEvolution, we use this data segment series as the input for the whole framework, where $\Delta T$ is usually several cycles long, e.g., it might be several weeks.
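As an illustration of this segmentation, the following is a minimal Python sketch (our experiments used Matlab; this re-rendering and the function name segment_traffic are ours, not part of the implementation):

    import numpy as np

    def segment_traffic(x, b):
        # Cut a 1-D traffic series into consecutive blocks x_m of b samples;
        # samples beyond the last full block are discarded.
        n_blocks = len(x) // b
        return [x[m * b:(m + 1) * b] for m in range(n_blocks)]

    # Example: hourly measurements, two-week segments (b = 14 * 24 = 336).
    x = np.random.rand(8 * 7 * 24)            # eight weeks of toy hourly data
    segments = segment_traffic(x, b=14 * 24)  # four two-week data segments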
In the framework, a small set of clean data is required for the initialization of the basis. However, the collected traffic measurements are contaminated with anomalies, and it would cost a large amount of resources to manually separate the typical part. We therefore design a data cleaning process on a data block as an alternative. Usually, we choose the first collected data block x0 for cleaning.

With the cleaned data segment x0′, the framework derives a basis for initialization. We name this procedure the basis generation process, and the initialized basis H0.

For any new traffic (e.g., x1), we update the old basis (e.g., H0) into a new one. This basis evolution process keeps repeating as long as we have new traffic. However, if the approximation error of the evolved basis for the new traffic is outside a given bound, BasisEvolution will clean the data and initialize the basis instead of evolving the old one.

The framework then performs anomaly detection on the approximation error (e.g., r1). We identify anomalous points by thresholding, and cluster them together, with each cluster being referred to as an anomaly.

The complete algorithm is illustrated in Fig. 3, in which BasisEvolution is composed of four components: data cleaning, basis generation, basis update and anomaly detection (anomalous-point detection and anomaly clustering). We describe each in detail below.

Fig. 3. A bird's eye view of the BasisEvolution framework for one single link. At time index T0, we clean the data segment x0, and get the "anomaly free" data segment x0′. Then we derive a basis set H0 from x0′. The projection of x0 on H0 is W0, and r0 is the difference between x0 and the approximation series W0H0. Since x0′ is totally normal, no anomaly is expected to be found in its residual, so we utilize x0 instead of x0′ for anomaly detection. Through the anomaly detection process, we get anomalous points P0, and the clustered anomalies A0. At subsequent time indices (e.g., Tm), we evolve the old basis H(m−1) into Hm, calculate the reconstruction error rm, and repeat the anomaly detection process to find anomalous points and anomalies.

3.1. Data cleaning

For the whole BasisEvolution framework, we first need a small set of "normal" historical data, so that the basis generation process can access the "normal" patterns and generate the primal basis functions. However, in practice the data we collect are contaminated by anomalies.

One common method for this issue is to label the data based on experience and professional advice. This is time- and effort-consuming, especially when the size of the data grows large, which makes it difficult to implement in reality.

Since the dataset we used has not been labeled, we apply standard, existing anomaly detectors as an alternative, to automatically label the target dataset. However, several issues remain. We address them as listed below:

1. Cleaning the data with multiple iterations.
Some small anomalies may be hidden by a larger anomaly. Examples can be found in Fig. 4, in which the simulated time series includes three types of anomalies: a spike (at time t1), a contextual anomaly (at time t2) and a collective anomaly (at time interval t3) [23]. If we delimit the normal region by thresholding (such as [−2, 2] in Fig. 4), only the spike anomaly can be found, while the other two anomalies remain hidden.
Thus, successive layers of cleaning may be required, as shown in Fig. 5, in which the anomaly detection techniques are repeated many times. In each iteration, the detected points are removed and filled with "normal values" after detection. Then, in the next iteration, the modified traffic data are more "normal", which results in more hidden anomalies being found. The whole process repeats until no more anomalies are found or the iteration count exceeds the preset threshold (we set maxC = 5 according to Section 4.3.3).
In each iteration, we remove all detected anomalous points to eliminate their impact on the following iterations, but this leaves gaps that must be filled for most algorithms. We fill the gaps using a weighted average of periodically-adjusted linear interpolates and the predictions provided by NNBP. Appropriate weights were chosen by comparison of the interpolated data with non-missing points at equivalent times of day and week in other segments of the data.
2. Applying a combination detector to find anomalies.
For each iteration in Fig. 5, we test the performance of three detection approaches: PCA, ANN and their combination. The results indicate that no single anomaly detector is perfect, and that the combination detector works better (more details can be seen in Section 4.3.2). So we combine the detection results of several anomaly detection algorithms in each iteration. Suppose the set of anomalies detected by the dth detection method is $A_d$, for $d \in \{1, 2, \ldots, D\}$; then we choose our anomalies to be the set $\bigcup_{d=1}^{D} A_d$, the goal being to ensure that we have the largest possible intersection with the true set of anomalies. In this sense, we collect as many types of anomalies as possible so that the cleaning process is less sensitive to the polluted data, and can revise the whole traffic more effectively.
Besides, the mechanism of the combination detector is similar to the pooling strategy in information retrieval [31]. With enough systems and a sufficiently deep pool, each system contributes a number of documents to the pool for assessment, and the whole pool can provide more reliable and relevant documents for each query. As for our combination detector, points in the pool are more likely to be real anomalies under manual judgement, since the pool is the collection of the detection results of each single method. Meanwhile, the rest of the traffic is regarded as "typical".

Fig. 4. Different anomaly types in the simulated time series. The traffic point at time t1 is a spike. The context of the point at t2 is different from that at t4, even though the value at t2 is similar to that at t4, so it is a contextual anomaly. Any single point in the time interval t3 is normal; however, all the points together form a collective anomaly [23]. For most anomaly detection methods, e.g., threshold-based techniques, a spike is easier to find than the rest, which indicates that the other anomaly types can hide behind a larger anomaly.

Fig. 5. The data cleaning algorithm. In practice, we apply the combination detector (PCA+NNBP, see Section 4.3.2) on x0 to find anomalies. If anomalies are detected, we remove them, replace the gaps using interpolation and prediction, and then repeat detection until no more anomalies are found or the iteration count exceeds the preset maxC (maxC = 5, according to Section 4.3.3).

We do not have ground truth data in real network conditions, so it is hard to measure the performance of the data cleaning process directly. However, we can assess it through the performance of the overall algorithm (see Section 4.3.1).
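A minimal Python sketch of this cleaning loop is shown below. The detector and gap-filling functions are hypothetical stand-ins (our implementation used PCA- and NNBP-based detectors and periodically-adjusted interpolation); only the loop structure — detect, pool the detectors' results, fill the gaps, repeat up to maxC iterations — follows Fig. 5:

    import numpy as np

    def clean_segment(x, detectors, fill_gaps, max_iters=5):
        # detectors: functions mapping a series to a set of anomalous indices
        # (stand-ins for the PCA- and NNBP-based detectors).
        # fill_gaps: function replacing the given indices with "normal" values
        # (stand-in for periodically-adjusted interpolation plus NNBP prediction).
        x = np.asarray(x, dtype=float).copy()
        for _ in range(max_iters):
            # Union of the individual detectors' results (the "pooling" idea).
            anomalous = set().union(*(det(x) for det in detectors))
            if not anomalous:
                break
            x = fill_gaps(x, sorted(anomalous))
        return x

    # Toy usage: a 3-sigma detector and a median-fill gap filler.
    detect = lambda x: set(np.flatnonzero(np.abs(x - x.mean()) > 3 * x.std()))
    fill = lambda x, idx: np.where(np.isin(np.arange(x.size), idx), np.median(x), x)
    rng = np.random.default_rng(1)
    x0 = rng.normal(0, 1, 336); x0[100] = 12.0   # one injected spike
    x0_clean = clean_segment(x0, [detect], fill)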

3.2. Basis generation

Our goal here is to find a small set of representative functions to use as a basis for our approximation of the space of typical traffic. The basis generation problem can be expressed as the computation of a basis set that minimizes the reconstruction error:

$$\operatorname*{argmin}_{L,\, w,\, h^{(l)}} \left\| x - \sum_{l=1}^{L} w^{(l)} h^{(l)} \right\|, \qquad (1)$$

where x is the input signal, L is the number of basis vectors, w is a vector of weights, and the $h^{(l)}$ are the basis vectors.

The problem is too open, however, in that we have complete choice of basis, and therefore could achieve a trivial perfect reconstruction with $h^{(1)} = x$, $w^{(1)} = 1$, and $L = 1$.

We know that traffic exhibits approximate periodicity (with known periods), and so we insist that the basis vectors be periodic. We list the periods $c_i$, $i = 1, 2, \ldots, I$, in increasing order in terms of the number of measurements. Theoretically, the $c_i$ should represent integral multiples of the cycles of network traffic. Here we only use daily and weekly patterns for simplification. For instance, if there were 24 measurements per day, the typical 24 h and 1 week cycles would be represented by $c_1 = 24$ and $c_2 = 168$. We select the third and last element, $c_I$, to cover the largest number of whole weeks that the data segment includes. We denote by $h^{(c_i, j)}$, $j = 1, 2, \ldots, J_i$, the jth basis vector with period $c_i$, i.e., for each element

$$h_k^{(c_i, j)} = h_{k + a c_i}^{(c_i, j)},$$

for all valid $k$ and $a \in \mathbb{Z}^+$. We write

$$H^{(c_i)} = \left[ h^{(c_i,1)} \mid h^{(c_i,2)} \mid \cdots \mid h^{(c_i,J_i)} \right],$$

and the basis $H = [H^{(c_1)} \mid \cdots \mid H^{(c_I)}]$.

Then the reconstruction problem is defined by a new objective function:

$$\operatorname*{argmin}_{L,\, w,\, h^{(l)}} \left\| x - \sum_{i=1}^{I} \sum_{j=1}^{J_i} w^{(c_i, j)} h^{(c_i, j)} \right\|, \qquad (2)$$

where the size of the basis is $L = \sum_{i=1}^{I} J_i$.

As in OMP, we progressively refine the traffic measurement approximation with an iterative procedure across all cycles that pursues a good solution. Our algorithm does so starting with the shortest period, and progressing through the periods in order. At each phase, we fix all of the variables except those of the period we are working on. With the higher-period functions implicitly set to zero, the residual in phase n is defined by $r^{(1)} = x$ and, for $n > 1$,

$$r^{(n)} = x - \sum_{i=1}^{n-1} \sum_{j=1}^{J_i} w^{(c_i, j)} h^{(c_i, j)},$$

and we optimize the $h^{(c_n, j)}$ and their respective weights by solving

$$\operatorname*{argmin}_{J_n,\, w^{(c_n, j)},\, h^{(c_n, j)}} \left\| r^{(n)} - \sum_{j=1}^{J_n} w^{(c_n, j)} h^{(c_n, j)} \right\|. \qquad (3)$$

One tool for solving Eq. (3) is the SVD. The first step in applying the SVD is to transform the residual signal $r^{(n)}$ into a matrix, by taking the new matrix's rows to be sliding windows (of width one period $c_n$) of the signal, i.e.,

$$R^{(n)} = \begin{bmatrix} r^{(n)}_{t_0} & r^{(n)}_{t_1} & \cdots & r^{(n)}_{t_{c_n - 1}} \\ r^{(n)}_{t_1} & r^{(n)}_{t_2} & \cdots & r^{(n)}_{t_{c_n}} \\ \vdots & \vdots & \ddots & \vdots \\ r^{(n)}_{t_{b - c_n}} & r^{(n)}_{t_{b - c_n + 1}} & \cdots & r^{(n)}_{t_{b - 1}} \end{bmatrix}. \qquad (4)$$

The new matrix will be $(b - c_n + 1) \times c_n$. In all scenarios considered here $b - c_n + 1 > c_n$.

We then apply the SVD to $R^{(n)}$ to obtain the decomposition

$$R^{(n)} = U \Sigma V^T, \qquad (5)$$

where $U$ and $V$ are unitary matrices, and $\Sigma$ is a diagonal matrix containing the singular values $\sigma_j$ of $R^{(n)}$ in the cycle $c_n$. The singular values are arranged in decreasing order.

To understand the SVD's use in matrix approximations, consider the following interpretation. The matrix $\Sigma$ is diagonal, so the SVD of a matrix $R^{(n)}$ can be rewritten

$$R^{(n)} = \sum_{j=1}^{\ell} u_j \sigma_j v_j^T, \qquad (6)$$

where $u_j$ and $v_j$ are the jth columns of $U$ and $V$ respectively, and $\ell = \min\{c_n, b - c_n + 1\}$. We then create an approximation $\tilde{R}$ by retaining only the terms corresponding to the largest $J_n$ singular values:

$$\tilde{R} = \sum_{j=1}^{J_n} u_j \sigma_j v_j^T. \qquad (7)$$

The energy proportion of $u_j$ is

$$E_j = \frac{\sigma_j^2}{\sum_{j'=1}^{\ell} \sigma_{j'}^2}, \qquad (8)$$

and we choose $J_n$ such that $E_j$ is closest to, but still larger than, the preset threshold value $\gamma = 0.3$.

Since the vectors of the first $J_n$ columns of $U$ capture the largest energy of the signal, we directly associate $h^{(c_n, j)} = u_j$, and obtain the corresponding weight

$$w^{(c_n, j)} = \frac{h^{(c_n, j)} \cdot r^{(n)}}{\left\| r^{(n)} \right\|}. \qquad (9)$$

Then we construct a new residual for the next period, and solve the new problem as above.

We call this algorithm SVD on Specific Cycles (SVD_SC); it is described in detail in Algorithm 1. One thing we should note is that in Step 6, and also in subsequent algorithms, we use Matlab's notation for vector concatenation.

Algorithm 1 SVD on specific cycles algorithm (SVD_SC).
Input: The traffic segment, x; the traffic aggregation time, Δt days; the size of x, b; the threshold of basis energy proportion, γ.
Output: The basis H and weights W; the final residual r.
1: Initialize the cycle set c ← {1/Δt, 7/Δt, b};
2: Set r^(1) ← x;
3: for n = 1, ..., I do
4:   // Create R^(n) from r^(n) according to Eq. (4)
5:   for row = 1, 2, ..., b − c_n + 1 do
6:     Slide a window of length c_n over r^(n) to form the matrix R^(n) row by row: R^(n) ← [R^(n); [r^(n)_{t_{row−1}}, ..., r^(n)_{t_{row+c_n−2}}]];
7:   end for
8:   Apply the SVD to the matrix R^(n), i.e., find U, Σ and V such that R^(n) = UΣV^T;
9:   Set j ← 1;
10:  while the energy rate E_j ≥ γ do
11:    j ← j + 1;
12:  end while
13:  Set J_n ← j;
14:  The first J_n columns of U constitute the function family H^(c_n) ← [u_1, ..., u_{J_n}];
15:  Calculate the corresponding weights according to Eq. (9): w^(c_n) ← [w^(c_n,1), ..., w^(c_n,J_n)]^T;
16:  r^(n+1) ← r^(n) − H^(c_n) w^(c_n);
17: end for
18: r ← r^(I+1);

The entire algorithm is applied to the first cleaned signal, i.e., x = x0′. Moreover, when the traffic patterns of a data segment vary too much, we will also generate a new basis through the basis generation process instead of updating the old one.

Algorithm 1 is based on the SVD, and its computation time is O(I × c_n² × (b − c_n + 1)). In practice, I is the number of levels of cyclicity (I = 3 for a two-week data segment). According to Section 4.3.1, the data cleaning process determines the performance of the derived basis, and we should clean the data segment before generating a new one. For a two-week data segment, a new basis needs about 6.008 s on average, of which the data cleaning process takes about 6 s (the NNBP algorithm in the data cleaning process takes a large amount of time), and the basis generation process needs about 0.008 s. The experiment was repeated 100 times in Matlab R2012b on Mac OS X 10.9.5 with a 2.7 GHz Intel Core i5.
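To make the procedure concrete, the following Python sketch implements a simplified variant of SVD_SC: it uses non-overlapping windows (K = b/c rows) rather than the sliding windows of Eq. (4), takes the dominant right-singular vectors as the periodic patterns, and computes weights by least squares rather than Eq. (9). It illustrates the idea, not the exact algorithm:

    import numpy as np

    def svd_sc_simplified(x, cycles, gamma=0.3):
        # Simplified SVD-on-specific-cycles: one function family per cycle.
        r = np.asarray(x, dtype=float).copy()
        families = []
        for c in cycles:
            K = len(r) // c                      # non-overlapping windows
            R = r[:K * c].reshape(K, c)          # one window per row
            U, s, Vt = np.linalg.svd(R, full_matrices=False)
            energy = s**2 / np.sum(s**2)
            J = max(1, int(np.sum(energy >= gamma)))   # dominant patterns
            # Tile each length-c pattern over the segment to get a periodic basis.
            H = np.column_stack([np.tile(Vt[j], K) for j in range(J)])
            w, *_ = np.linalg.lstsq(H, r[:K * c], rcond=None)
            r[:K * c] -= H @ w                   # residual for the next cycle
            families.append((H, w))
        return families, r

    # E.g., hourly data, two-week segment: daily, weekly, whole-segment cycles.
    # families, resid = svd_sc_simplified(x0_clean, cycles=[24, 168, 336])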

3.3. Basis update

For a new data segment $x_m$ we evolve the old basis $H_{(m-1)}$ into the new one $H_m$. By evolving it we avoid additional computational cost; also, we want the basis to change only slowly, reflecting slow changes in traffic patterns, thus avoiding the impact of new anomalies in the data, and also avoiding the need to clean this new data.

We formulate the basis update problem by retaining the same number of basis vectors L, and then considering each in turn, keeping the others fixed. That is, for $k = 1, \ldots, L$ we solve the problems

$$\operatorname*{argmin}_{w_m^{(k)},\, h_m^{(k)}} \left\| x_m - \sum_{l \neq k} w_m^{(l)} h_m^{(l)} - w_m^{(k)} h_m^{(k)} \right\| + \lambda_d\, d\!\left( h_m^{(k)} \,\middle|\, H_{(m-1)}^{(k)} \right), \qquad (10)$$

where $h_m^{(k)}$ is the new basis vector we are seeking, $H_{(m-1)}^{(k)}$ represents the history of basis vectors $\{h_1^{(k)}, \ldots, h_{(m-1)}^{(k)}\}$, $d$ represents a distance metric from the historical value to the present value, and the parameter $\lambda_d$ allows a trade-off between a precise fit to the measured data and the goal of remaining similar to the old basis.

The first term in Eq. (10) is the representation error $e_k = x_m - \sum_{l \neq k} w_m^{(l)} h_m^{(l)}$, which indicates the error when the kth basis vector is removed (and the other terms are kept fixed).

The second term penalizes changes of the new basis. One good way of measuring this change is the deviation from the historical mean. In addition, the new basis should also be similar to the immediately previous basis, since it evolves directly from it. In practice, we therefore define the second term's distance to include two parts: the similarity with the immediately previous basis, and the deviation from the historical mean. The mathematical form of d is given in Eq. (11), where the first term calculates the similarity with the previous basis, the second term evaluates the deviation from the historical mean, and the parameters λ1 and λ2 are weighting coefficients trading off the similarity and the deviation:

$$d\!\left( h_m^{(k)} \,\middle|\, H_{(m-1)}^{(k)} \right) = \lambda_1 \left\| h_m^{(k)} - h_{(m-1)}^{(k)} \right\| + \lambda_2 \left\| h_m^{(k)} - \bar{h}_{(m-1)}^{(k)} \right\|, \qquad (11)$$

where

$$\bar{h}_{(m-1)}^{(k)} = \frac{1}{m} \sum_{i=0}^{m-1} h_i^{(k)}.$$

The jth element of $h_m^{(k)}$ is also constrained to maintain smoothness by enforcing

$$\min_i \left( h_i^{(k)}(j) \right) \le h_m^{(k)}(j) \le \max_i \left( h_i^{(k)}(j) \right), \qquad (12)$$

where $i \in [0, m-1]$.

Define $H^{\min}_{(m-1)}$ and $H^{\max}_{(m-1)}$ as the collections of historical minima and maxima of the basis vectors respectively. Hence,

$$H^{\min}_{(m-1)} = \{\, \mathrm{minh}^{(k)}_{(m-1)} \,\}_{k \in [1,L]}; \qquad H^{\max}_{(m-1)} = \{\, \mathrm{maxh}^{(k)}_{(m-1)} \,\}_{k \in [1,L]},$$

where

$$\mathrm{minh}^{(k)}_{(m-1)} = \left[ \min_{i \in [0, m-1]} \left( h_i^{(k)}(j) \right) \right]_{j \in [1,b]} \quad \text{and} \quad \mathrm{maxh}^{(k)}_{(m-1)} = \left[ \max_{i \in [0, m-1]} \left( h_i^{(k)}(j) \right) \right]_{j \in [1,b]}.$$

The result is a new basis that evolves slowly to match changes in traffic patterns, but is unresponsive to anomalies, which are inherently rare and erratic (or else they are not anomalous).

We derive $h_m^{(k)}$ and $w_m^{(k)}$ by alternating the fixed variables iteratively [32,33]. We transform the final residual $e_k$ into a matrix form E whose rows have the same length as $h_m^{(k)}$, i.e., c. The data segments cover traffic for several cycles, so the new matrix E has $K = b/c$ rows, and we can write it

$$E = \begin{bmatrix} e_k(1) & \cdots & e_k(c) \\ e_k(c+1) & \cdots & e_k(2c) \\ \vdots & \ddots & \vdots \\ e_k((K-1)c + 1) & \cdots & e_k(Kc) \end{bmatrix}. \qquad (13)$$

For each basis vector $h_m^{(k)}$, we initialize a vector $w = w_m^{(k)} y$, where y is an IID standard normal variate of length K. We then iteratively apply the following rules:

$$h_m^{(k)} \leftarrow h_m^{(k)} \circ \frac{w^T E + \lambda_d \left( \lambda_1 h_{(m-1)}^{(k)} + \lambda_2 \bar{h}_{(m-1)}^{(k)} \right)}{w^T w\, h_m^{(k)} + \lambda_d \left( \lambda_1 h_m^{(k)} + \lambda_2 h_m^{(k)} \right)}, \qquad (14)$$

$$w \leftarrow w \circ \frac{E\, h_m^{(k)}}{w\, h_m^{(k)} \left( h_m^{(k)} \right)^T}, \qquad (15)$$

where ∘ and division are element-wise (Hadamard) multiplication and division, respectively. After the algorithm has converged, we take the average of w as the weight

$$w_m^{(k)} \leftarrow \frac{1}{\kappa} \sum_{i=1}^{\kappa} w_i, \qquad (16)$$

where $w_i \in w$. Algorithm 2 shows more detail.

Algorithm 2 Basis update algorithm.
Input: A new traffic segment, x_m; the size of x_m, b; the historical basis set, H_(m−1); the historical minimum, H^min_(m−1); the historical maximum, H^max_(m−1); the historical mean, H̄_(m−1); the old weight set, W_(m−1); the threshold of residuals, α; the maximum numbers of iterations, β and ζ.
Output: The new basis H_m and weight set W_m; the final residual r.
1: H_m ← H_(m−1), W_m ← W_(m−1);
2: Set the numbers of iterations n1 ← 0 and n2 ← 0;
3: Set the residual e_1 ← x_m;
4: while n1 ≤ β and r̄ > α do
5:   for k = 1, 2, ..., L do
6:     Compute the residual e_k ← x_m − Σ_{l≠k} w_m^(l) h_m^(l);
7:     while n2 ≤ ζ do
8:       Calculate the residual matrix E according to Eq. (13);
9:       Update h_m^(k) according to rule (14);
10:      c ← the length of h_m^(k);
11:      for j = 1, 2, ..., c do
12:        if h_m^(k)(j) < minh^(k)_(m−1)(j) then
13:          h_m^(k)(j) ← minh^(k)_(m−1)(j);
14:        else if h_m^(k)(j) > maxh^(k)_(m−1)(j) then
15:          h_m^(k)(j) ← maxh^(k)_(m−1)(j);
16:        end if
17:      end for
18:      Update w_m^(k) according to rules (15) and (16);
19:      n2 ← n2 + 1;
20:    end while
21:    H_m ← H_m ∪ h_m^(k), W_m ← W_m ∪ w_m^(k);
22:  end for
23:  Compute the approximation error r ← x_m − Σ_{k=1}^{L} w_m^(k) h_m^(k), and r̄ ← (1/b) Σ_{i=1}^{b} r(i);
24:  n1 ← n1 + 1;
25: end while

Finally, if the residual error r̄ cannot be reduced to a preset bound (e.g., the average approximation error using the previous basis on new data) after a certain number of iterations, the update algorithm cannot converge. This non-convergence indicates that the behaviour patterns of the new data segment vary greatly from past behaviour, and we should apply the data cleaning and basis generation processes to produce a totally new basis set instead of updating the old one.
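The multiplicative updates (14)–(15) have the same flavour as NMF-style update rules [32,33]. Below is a minimal NumPy sketch of one inner iteration for a single basis vector, with the clamping of Eq. (12) applied via np.clip; the λ values here are illustrative, not tuned settings from our experiments, and the rule assumes non-negative entries as in NMF:

    import numpy as np

    def update_step(E, h, w, h_prev, h_mean, h_min, h_max,
                    lam_d=0.1, lam1=0.5, lam2=0.5, eps=1e-12):
        # One multiplicative update of basis vector h (length c) and window
        # weights w (length K), with E the K x c residual matrix of Eq. (13).
        num_h = w @ E + lam_d * (lam1 * h_prev + lam2 * h_mean)
        den_h = (w @ w) * h + lam_d * (lam1 + lam2) * h + eps
        h = h * num_h / den_h                   # rule (14)
        h = np.clip(h, h_min, h_max)            # smoothness constraint, Eq. (12)
        w = w * (E @ h) / (w * (h @ h) + eps)   # rule (15)
        return h, w

    # Toy usage with K = 4 windows of length c = 6:
    rng = np.random.default_rng(0)
    E = rng.random((4, 6)); h = rng.random(6); w = rng.random(4)
    h, w = update_step(E, h, w, h_prev=h.copy(), h_mean=h.copy(),
                       h_min=np.zeros(6), h_max=np.ones(6))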

The time complexity of the basis update algorithm is O(L × c × β × ζ). For any two-week data segment, L is small (generally 3, corresponding to three different levels of cyclicity), and c records the length of the basis (no larger than two weeks). Meanwhile, we set β and ζ at small or moderate levels (in later experiments, β and ζ are equal to 10 for simplicity). Hence, the overall time cost is not large. In practice, the basis update process takes about 0.6 s to evolve the old basis. The experimental environment is the same: the simulation code ran 100 times in Matlab R2012b on Mac OS X 10.9.5 with a 2.7 GHz Intel Core i5. This indicates that updating the old basis takes less time than generating a new one, which is more suitable for online anomaly detection.

3.4. Anomaly detection

Many researchers stop their work after detecting the anomalous points by thresholding the residual. However, many detected points may be associated with the same underlying anomaly. It is confusing to network operators to present a single event as many detections, and there are distinct advantages as well in combining individual detections before deciding which anomalies are most important: for instance, an anomaly that develops more slowly, but over a long time, may be more important than a sharp but singular event.

In this paper, we try to associate detections so that we can present a single anomalous event, along with information about the duration of the event. We do so through a clustering process that we describe below.

3.4.1. Anomalous-point detection

We detect anomalous points by thresholding. For any approximation residual error r, we set the upper and lower limits [p − qσ, p + qσ], where p and σ are the mean and standard deviation of r, and q is a parameter chosen to fix the false-alarm probability (assuming a normal distribution of the residual). Any points of r that fall outside these thresholds are declared to be anomalous.
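In code, this point detector amounts to a few lines; a minimal sketch, assuming r is a NumPy array:

    import numpy as np

    def anomalous_points(r, q):
        # Indices of residuals outside [p - q*sigma, p + q*sigma].
        p, sigma = r.mean(), r.std()
        return np.flatnonzero(np.abs(r - p) > q * sigma)

    # e.g., anomalous_points(residual, q=3.0); Section 4 sweeps q from 1.0 to 4.5.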
3.4.2. Anomaly clustering

Once we have anomalous data points, we use a bottom-up hierarchical clustering algorithm to group them, as shown in Algorithm 3. As in Algorithm 1, we use Matlab notation to represent the concatenation of the time series t^(m), as shown in steps 12, 18 and 25.

Algorithm 3 Bottom-up hierarchical clustering based anomaly detection.
Input: The time series of the anomalous traffic measurements, t = [t_1, ..., t_n]; a distance threshold, ι; the maximum number of iterations, η.
Output: The set of anomalies (clusters) and the corresponding centre set, V and U.
1: t^(0) ← t;
2: k^(0) ← n;
3: m ← 1;
4: while m ≤ η do
5:   k ← 1;
6:   for i = 1, ..., k^(m−1) − 1 do
7:     j ← i + 1;
8:     The distance d^(m−1)_(i,j) ← |t_i^(m−1) − t_j^(m−1)|;
9:     if d^(m−1)_(i,j) ≤ ι then
10:      t_k^(m) ← (t_i^(m−1) + t_j^(m−1)) / 2;
11:      V^(m) ← V^(m) ∪ {t_i^(m−1), t_j^(m−1)};
12:      t^(m) ← [t^(m), t_k^(m)];
13:      k ← k + 1;
14:      i ← i + 1;
15:    else
16:      V^(m) ← V^(m) ∪ {t_i^(m−1)};
17:      t_k^(m) ← t_i^(m−1);
18:      t^(m) ← [t^(m), t_k^(m)];
19:      k ← k + 1;
20:    end if
21:  end for
22:  if d^(m−1)_(i,j) > ι then
23:    V^(m) ← V^(m) ∪ {t_j^(m−1)};
24:    t_k^(m) ← t_j^(m−1);
25:    t^(m) ← [t^(m), t_k^(m)];
26:    k ← k + 1;
27:  end if
28:  k^(m) ← k;
29:  m ← m + 1;
30: end while
31: V ← V^(m); U ← t^(m);

For most anomaly types, the times of anomalous points caused by the same event are close to each other, which means they appear as short-time anomalies. In clustering, we measure the distance between clusters by the absolute time gap between their centres. Anomalous points with smaller time gaps will tend to be grouped into one large cluster. This clustering method therefore has the ability to find short-time anomalies.

Since we have not included any specific properties of network traffic in the clustering, this clustering algorithm is equally applicable to other detectors, given the time locations of anomalous points, and we shall test it as part of the alternative algorithms.
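A compact Python rendering of the same bottom-up idea is given below. For sorted one-dimensional time stamps it merges a point into the current cluster when the gap to the cluster's last member is at most ι — a single-pass simplification of the iterative centre-merging in Algorithm 3, not an exact transcription:

    def cluster_times(times, iota):
        # Group sorted anomalous time stamps whose successive gaps are <= iota.
        clusters = []
        for t in sorted(times):
            if clusters and t - clusters[-1][-1] <= iota:
                clusters[-1].append(t)
            else:
                clusters.append([t])
        # Report each anomaly event with its member times and centre.
        return [(c, sum(c) / len(c)) for c in clusters]

    # cluster_times([3, 5, 6, 40], iota=3) -> two events:
    # [3, 5, 6] (centre ~ 4.67) and [40] (centre 40.0)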
However, anomalies such as those of the FISSD type often last a long time and do not have the short-time attribute mentioned above. In this condition, Algorithm 3 is no longer suitable. This long-time anomaly clustering problem is left for future research. One primary idea for addressing it is to combine multiple scales into the clustering, inspired by the extraction of multiscale features for anomaly detection [15,16,26,28].

4. Experiments with synthetic data

In this section, we apply several anomaly detection approaches to synthetic data. The simulation of synthetic data is extremely important for the validation of anomaly detection algorithms where ground truth is difficult to find [25]. In our case we aim to assess very small false-alarm probabilities, necessitating large numbers of tests. In addition, we want to test the ability of the algorithms to determine the duration of the anomalies. As far as we know, this information has been labelled in no dataset.

Our approach to simulation intends to highlight the features of different techniques. We make no claim that the simulation is completely realistic, only that it illustrates the properties of different anomaly detection techniques.

Our analysis uses three pairs of metrics, as described in Section 4.1, intended to explore the different characteristics of the anomaly detection algorithms, and compares our algorithm to four other methods:

1. OMP-on-dictionary: This method, motivated by BasisDetect [21], uses Orthogonal Matching Pursuit (OMP) on a dictionary of discrete cosine time and frequency atoms. Similar to BasisEvolution, this model determines the L atoms, Φ_L, with the largest coefficients α_L on the first data segment x0.

Then, for each new data segment $x_m$, we represent it with $\Phi_L$ by solving

$$\operatorname*{argmin}_{\alpha_L} \left\| x_m - \Phi_L\, \alpha_L \right\|, \qquad (17)$$

where $\alpha_L$ represents the projection magnitudes of $x_m$ on $\Phi_L$. We then get the residual series $r_{\mathrm{OMP}} = x_m - \Phi_L \alpha_L$, and detect anomalous points as Section 3.4 describes.

2. PCA (Principal Components Analysis): This approach finds the k eigenvectors with the largest eigenvalues for the primary data, and reconstructs the data segment $x_m$ with those eigenvectors. The residual series is $r_{\mathrm{PCA}} = x_m - \hat{x}_m$, where $\hat{x}_m$ is the reconstructed signal, and $r_{\mathrm{PCA}}$ is used for detection as described above.

3. RPCA (Robust PCA): This method is the robust version of PCA [12] and uses the median to centre the data in the transformation process. The method we used for RPCA is the Croux–Ruiz (CR) algorithm [12]. Otherwise it is similar to PCA.

4. NNBP (Neural Network using Back Propagation): The NNBP approach uses an artificial neural network algorithm to predict traffic [5], and we calculate residuals as the difference between the predicted and actual traffic. Our principle in choosing comparison algorithms is to include as wide a range of ideas as possible; hence, we include this approach even though it is the least successful (as demonstrated below).

In each case the initial set of data is cleaned as described in our approach, and we use the same clustering process for each, to provide a fair comparison.

We chose these four methods because they cover a wide range of concepts used in the anomaly detection field, and have been discussed in many references. The particular examples are relatively simple exemplars of the concepts they espouse, so that the results are not complicated by myriad implementation details. For example, we do not attempt to compare to wavelet detection methods, as they require much fine tuning, for instance of the particular wavelet basis being used.
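For concreteness, a generic sketch of the PCA baseline's residual computation is given below (this is an illustrative implementation, not the exact code used in our experiments; X is assumed to hold historical data segments as rows):

    import numpy as np

    def pca_residual(X, x_new, k):
        # Residual of x_new after projection onto the top-k principal axes:
        # centre, take k leading right-singular vectors, reconstruct, subtract.
        mu = X.mean(axis=0)
        _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
        P = Vt[:k]                            # top-k principal directions
        x_hat = mu + (x_new - mu) @ P.T @ P   # reconstruction from k components
        return x_new - x_hat                  # residual series r_PCA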
to, and cDx(r ) to be the cluster that x belongs to, detected by
4.1. Detection error metrics method r. Then, we define for each x
 
cGx ∩ cDx(r ) 
In many works on anomaly detection, the algorithms are as- CD_AD(x ) = , (22)
sessed by their false-positive and negative probabilities, considered |cGx |
point-wise on the original sequence, i.e., each point is either an
 
anomaly or not, and we compare our classification to the ground cGx ∩ cDx(r ) 
truth. However, in this work, we seek to understand performance AD_AD(x ) =  (r )  . (23)
in more practical terms: we want to detect the underlying cause of
cDx 
a sequence of anomalous points in the data, and we prefer to re-
We then average this metric over G ∩ D(r) (the intersection of
turn a single detection to an operator to reduce the amount of low-
detected and true anomalous points) to obtain global versions
level work that an operator needs to accomplish before he/she can
of the metrics defined by
start a diagnosis, thus facilitating the diagnosis. We want to tell the
operator about one anomaly rather than a set of anomalous points. 1 
C D_AD =   C D_AD(x ); (24)
To assess the quality of such results we need additional metrics. As G ∩ D(r ) 
x∈G∩D ( r )
with false-negative/positive probabilities, we wish to present these
in pairs that show the trade-off between sensitivity and specificity. 
1
We present two additional pairs of such metrics. AD_AD =   AD_AD(x ). (25)
We label the set of true anomalous time points G. They are
G ∩ D(r ) 
x∈G∩D ( r )
clustered into groups cG, where each group corresponds to one in-
Now, for all truly detected anomalies, CD_AD is a weighted-
jected anomaly.
average detection probability, while AD_AD is a weighted-
The set of detected anomalous points for method r is denoted
average proportion of real anomalous points on the detected
D(r) , and these are clustered into groups cD(r) using Algorithm 3.
anomalies, taking into account the group structure of the
The metrics we consider are listed below:
anomalous points.
1. We still include the standard point-wise metrics: Detection In order to analyze the performance of all five anomaly detec-
Probability on Anomalous Points (DP_AP) and False-Alarm Prob- tion methods and see the trade-offs over these pairs of metrics,
ability on Non-anomalous Points (FAP_NP): we vary the threshold parameter q from 1.0 to 4.5, and generate
 
G ∩ D(r )  Receiver Operator Characteristic (ROC) curves [36], or the equiva-
DP _AP = , (18) lent for the new metrics.
|G|

4.2. Synthetic traffic data

Our goal here is to generate traffic with some of the important observed features of real traffic. This will, of course, be a simplification of real traffic, but we have a great deal more control over its details and can see the effects, for instance, of changing the duration of anomalies.

The simulation has two steps: first, we generate our typical traffic, and second, we inject anomalies. The typical traffic is based around a sine function with a 24-h period, to represent the strong diurnal cycles of Internet traffic [37]. Noise was added to represent statistical fluctuations around the mean. In this study, we use white Gaussian noise. Gaussianity is a reasonable first approximation for Internet traffic on uncongested links [38], and we do not include correlations (though they are often present in traffic), because uncorrelated traffic represents a worst case for detecting anomalies [39]. We also control the relative size of the noise. In what follows, the standard deviation of the noise is 10% of the amplitude of the cycles.

We generate a series containing eight weeks of data, aggregated every 1 h, i.e., 1344 data points for each simulation. Each series is broken into 4 data segments, to be analyzed in blocks as described above. We experimented with other data segment lengths, and although the results and general trends were similar, a two-week segment produced the best results of the lengths tested.

The anomaly model in this work can be represented as A(τ, θ, ϑ, ψ, φ), where τ, θ, ϑ, ψ and φ are parameters for generating and injecting anomalies.

The parameter τ indexes a set of anomaly types. Each anomaly type is defined by a shape function with width W, which represents the minimum width for that anomaly type (otherwise, all anomalies could collapse into spikes), and amplitude M, which is defined by its size relative to the typical traffic.

1. FIFD (Fast Increase, Fast Decrease): these are the classical anomalies most often examined in prior work. They might be caused by system or device outages, or DoS attacks. Often, past literature has focused on "spikes", which are very short FIFD anomalies.
2. SIFD (Slow Increase, Fast Decrease): these might arise through epidemic processes (e.g., worms or viruses [40]), which grow over time, and are eventually fixed, dropping the system back into its standard state.
3. FISSD (Fast Increase, Stable State, Slow Decrease): these arise typically as the result of phenomena such as flash crowds, which appear very suddenly, are maintained for some time, and then gradually disappear.
4. FISD (Fast Increase, Slow Decrease): these represent, for completeness, a special case of the previous anomaly with no stable state.

We also include negative anomalies as well as positive, i.e., traffic decreases as well as traffic increases. Each simulation includes only one type of anomaly. Most of our reported results here focus on FIFD anomalies, because most of the results for other anomaly types mirror the ones reported.

Parameter θ determines the number of anomalies that are injected. We set θ following the Poisson distribution, θ ∼ Poisson(λ), where λ = 3 per data segment, i.e., on average there will be 12 anomalies in each simulation. It is important that there are multiple anomalies per segment, because one of the effects we seek to include in our simulations is the interactions, within the detector, between anomalies.

The parameter ϑ controls the starting locations of anomalies. Anomalies, by definition, could happen at any time with no preference, so ϑ ∼ Uniform(0, b), where b is the length of the data segment.

The parameter ψ changes the width of anomalies. We dilate the anomaly type's shape function by a factor W + ψ. In our experiments we consider a wide range of values, ψ = {0, 4, 10, 20, 40}.

The parameter φ adjusts the magnitude of anomalies: we scale the shape function of the anomaly type by this factor. In our experiments we have considered φ = {0.1, 0.3, 0.5, 0.7, 1}, resulting in magnitudes φM.
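The sketch below generates a toy version of this synthetic traffic — a 24-h sine cycle plus white Gaussian noise at 10% of the cycle amplitude — and injects one FIFD anomaly as a simple rectangular pulse; the pulse is an illustrative shape choice, not the simulator's exact shape function:

    import numpy as np

    rng = np.random.default_rng(0)

    hours = np.arange(8 * 7 * 24)                 # eight weeks, hourly samples
    amplitude = 10.0
    typical = amplitude * np.sin(2 * np.pi * hours / 24)             # diurnal cycle
    traffic = typical + rng.normal(0, 0.1 * amplitude, hours.size)   # 10% noise

    # Inject one FIFD anomaly of width W + psi and magnitude phi * M
    # at a uniformly random start location.
    W, psi, phi, M = 1, 4, 0.5, amplitude
    start = rng.integers(0, hours.size - (W + psi))
    traffic[start:start + W + psi] += phi * M     # rectangular pulse (illustrative)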
The anomaly model in this work can be represented as A(τ , θ ,
ϑ, ψ , φ ), where τ , θ , ϑ, ψ and φ are parameters for generating and 4.3. Investigation on BasisEvolution
injecting anomalies.
The parameter τ indexes a set of anomaly types. Each anomaly Before experiments, we need to have a thorough investigation
type is defined by a shape function with width W, which repre- on the whole BasisEvolution framework. In data cleaning phase,
sents the minimum width for that anomaly type (otherwise, all we need to clarify the importance of this step, and should make
anomalies could collapse into spikes), and amplitude M, which is a proper selection on the parameter and detection method (seen
defined by the relative size to the typical traffic. in Fig. 5). In addition, we have to estimate a set of parameters in
1. FIFD: (Fast Increase, Fast Decrease): these are the classical basis update process.
anomalies most often examined in prior work. They might be As discussed before, in real network traffic it is difficult to find
caused by system or device outages, or DoS attacks. Often, past the ground truth, while the synthetic traffic volumes highlighting
literature has focused on “spikes” which are very short FIFD features of network traffic are suitable for analysis. Hence, all ex-
anomalies. periments at this section are on synthetic data.
2. SIFD (Slow Increase, Fast Decrease): these might arise through We first demonstrate the importance of data cleaning process
epidemic processes (e.g., worms or viruses [40]), which grow on the entire framework through the comparison of the perfor-
over time, and are eventually fixed, dropping the system back mance of BasisEvolution on cleaned and uncleaned data. Then we
into its standard state. confirm the number of maximum iterations and the detection al-
3. FISSD (Fast Increase, Stable State, Slow Decrease): these arise gorithm by examining the degree of cleanliness of the revised
typically as the result of phenomena such as flash crowds, traffic after data cleaning process. As for parameters (shown in
which appear very suddenly, are maintained for some time, and Eqs. (10) and (11)) in basis update process, we have found rough
then gradually disappear. parameter settings that are not too far from optimal.
4. FISD (Fast Increase, Slow Decrease): these represent, for com-
pleteness, a special case of the previous anomaly with no stable 4.3.1. The impact of data cleaning process
state. Fig. 6 compares the detection performance of BasisEvolution
under two conditions: with data cleaning process and without
We also include negative anomalies as well as positive, i.e., traf- data cleaning process. We see that the framework performs better
fic decreases as well as traffic increases. Each simulation includes when the data is cleaned, which means basis vectors derived from
only one type of anomaly. Most of our reported results here focus cleaned data can better describe the “normal region” than those
on FIFD anomalies, because most of the results for other anomaly from uncleaned data. If data is uncleaned, and contains traces of
types mirror the ones reported. intrusions, BasisEvolution may not detect future instances of those
Parameter θ determines the number of anomalies that are in- intrusions, since it will presume that they are normal, which leads
jected. We set θ following the Poisson distribution, θ ∼ Poisson(λ), a higher false-alarm probability and lower detection probability.
where λ = 3 per data segment, i.e., on average there will be 12 Hence, the data cleaning process is necessary for the entire work.
anomalies in each simulation. It is important that there are multi-
ple anomalies per segment, because one of the affects we seek to 4.3.2. Detector selection for data cleaning process
include in our simulations is interactions in the detector, between In our test, we compare the performance of single approaches
anomalies. and a combination method by choosing PCA- and ANN-based de-
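To make the anomaly model concrete, the sketch below injects anomalies following A(τ, θ, ϑ, ψ, φ) as described above. The rectangular FIFD shape and all names here are our own illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def inject_anomalies(segment, shape_fn, W, M, psi, phi, lam=3.0, seed=0):
    """Inject theta ~ Poisson(lam) anomalies into one data segment.
    Each anomaly has width W + psi, magnitude phi * M, a start drawn
    uniformly over the segment, and a random sign (traffic rise or drop)."""
    rng = np.random.default_rng(seed)
    out = segment.copy()
    b = len(segment)
    theta = rng.poisson(lam)                 # number of anomalies in this segment
    for _ in range(theta):
        start = rng.integers(0, b)           # vartheta ~ Uniform(0, b)
        width = W + psi                      # dilated anomaly width
        sign = rng.choice([-1.0, 1.0])       # negative as well as positive anomalies
        end = min(start + width, b)
        x = np.linspace(0.0, 1.0, end - start)
        out[start:end] += sign * phi * M * shape_fn(x)
    return out

# Example: an FIFD anomaly sketched as a rectangular pulse (fast up, fast down).
fifd = lambda x: np.ones_like(x)
noisy = inject_anomalies(np.zeros(336), fifd, W=2, M=1.0, psi=10, phi=0.5)
```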
4.3. Investigation on BasisEvolution

Before the main experiments, we need a thorough investigation of the whole BasisEvolution framework. In the data cleaning phase, we need to clarify the importance of this step, and make a proper selection of the parameter and detection method (see Fig. 5). In addition, we have to estimate a set of parameters in the basis update process.

As discussed before, it is difficult to find ground truth in real network traffic, while synthetic traffic volumes highlighting the features of network traffic are suitable for analysis. Hence, all experiments in this section are on synthetic data.

We first demonstrate the importance of the data cleaning process for the entire framework by comparing the performance of BasisEvolution on cleaned and uncleaned data. Then we confirm the maximum number of iterations and the detection algorithm by examining the degree of cleanliness of the revised traffic after the data cleaning process. As for the parameters (shown in Eqs. (10) and (11)) in the basis update process, we have found rough parameter settings that are not too far from optimal.

4.3.1. The impact of data cleaning process

Fig. 6 compares the detection performance of BasisEvolution under two conditions: with and without the data cleaning process. We see that the framework performs better when the data is cleaned, which means basis vectors derived from cleaned data can better describe the "normal region" than those from uncleaned data. If the data is uncleaned, and contains traces of intrusions, BasisEvolution may not detect future instances of those intrusions, since it will presume that they are normal, which leads to a higher false-alarm probability and lower detection probability. Hence, the data cleaning process is necessary for the entire framework.

4.3.2. Detector selection for data cleaning process

In our test, we compare the performance of single approaches and a combination method, choosing PCA- and ANN-based detectors. The single approaches are thus PCA- and ANN-based detectors, while the combination detector merges the detection results of each detector at each iteration. More specifically, we select the NNBP method as the ANN algorithm.
Table 2
We fix the value of DP_AP, and report the corresponding FAP_NP of five methods. FAP_NP is the False-Alarm Probability on Non-anomalous Points. DP_AP is the Detection Probability on Anomalous Points. DNE means "Does Not Exist". BE is short for BasisEvolution. Note that very small FAP_NP values are reported (shown in bold), as required.

Method   Fixed DP_AP = 0.3   Fixed DP_AP = 0.5   Fixed DP_AP = 0.6
BE       0.00063             0.0073              0.020
OMP      0.0027              0.18                DNE
PCA      0.019               0.047               0.099
RPCA     0.025               0.047               0.064
NNBP     0.27                DNE                 DNE

Fig. 7. The performance of three detectors in the data cleaning process. The three detectors are PCA, NNBP, and the combination of the two. The effectiveness is measured with FAP_NP (False-Alarm Probability on Non-anomalous Points) and DP_AP (Detection Probability on Anomalous Points). The whole simulation experiment is repeated 100 times to obtain statistical results. The performance of the combined approach is better than that of each single one.
In PCA, we define the "normal subspace" with thresholding (e.g., [mean − 3∗std, mean + 3∗std], where mean and std are the average and standard deviation of the principal components), and find anomalous points with the SPE measurement [18]. Since PCA is sensitive to polluted data [25], the "normal subspace" is distorted at the first iteration, and the false alarms would be very high. As Fig. 5 shows, the continued detection and gap padding in the following iterations bring the traffic closer to the normal range.

NNBP finds anomalous points by comparing the differences between real traffic and predicted traffic volumes. If the differences exceed the threshold (e.g., c∗std, where c is a constant value and std is the standard deviation of the test samples), the corresponding points are identified as anomalies. In the first iteration, the data is polluted and the corresponding predictions are not accurate. As with PCA, the successive layers of detection and interpolation (see Fig. 5) can bring the traffic back to the regular level.

As for the combination detector, the set of detected anomalous points is larger because it combines the detection results of both PCA and NNBP. With those anomalous points being revised to "normal values" in each iteration (see Fig. 5), the typical traffic can be reformed more effectively.
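The sketch below gives one plausible reading of this iterative cleaning loop: at each of maxC iterations, points flagged by either a 3σ rule (standing in for the PCA-based detector) or a prediction-residual rule (standing in for NNBP) are replaced by interpolated "normal" values. The predictor is left abstract, and all names are illustrative assumptions, not the paper's code.

```python
import numpy as np

def clean_traffic(x, predict, c=3.0, max_c=5):
    """Iteratively clean a traffic series: flag anomalous points with the
    union of a 3-sigma rule and a prediction-residual rule, then pad the
    gaps by linear interpolation, repeating max_c times."""
    x = x.astype(float).copy()
    idx = np.arange(len(x))
    for _ in range(max_c):
        mean, std = x.mean(), x.std()
        flag_sigma = np.abs(x - mean) > 3.0 * std       # 3-sigma thresholding
        resid = x - predict(x)                          # prediction residuals
        flag_pred = np.abs(resid) > c * resid.std()
        flagged = flag_sigma | flag_pred                # combination detector: union
        if not flagged.any() or flagged.all():
            break
        # Replace flagged points with values interpolated from their neighbours.
        x[flagged] = np.interp(idx[flagged], idx[~flagged], x[~flagged])
    return x

# Example predictor standing in for NNBP: a 24-h moving average.
moving_avg = lambda s: np.convolve(s, np.ones(24) / 24.0, mode="same")
```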
We measure the performance of the three detectors by running detection on the correspondingly modified traffic. More specifically, we detect the residuals between the original traffic and the modified traffic through thresholding (see Section 3.4.1). If the detection accuracy (DP_AP) is high and the false-alarm probability (FAP_NP) is low, a larger part of the modified traffic will be in the normal region. Fig. 7 illustrates the performance of the three detectors in the data cleaning process. The combination detector performs best, with the lowest false-alarm probability. This is in accordance with the conclusion discussed above. The lowest false-alarm probability means the traffic cleaned by the combination detector is closest to the "normal" traffic, from which we can extract the most accurate "normal patterns".

4.3.3. The estimation of parameter maxC in data cleaning process

In the data cleaning process, the parameter maxC is defined as the maximum number of iterations; both the anomaly detection and gap filling methods repeat maxC times to clean the data. We try to estimate a small value of maxC, such that the cleaned traffic is close to "normal" and BasisEvolution achieves the best detection performance.

In the simulation, the candidate set for maxC is {1, 3, 5, 7, 9}. We choose the optimal value of maxC through two comparisons: comparing the performance of the data cleaning process (similar to Section 4.3.2), and comparing the performance of BasisEvolution (similar to Section 4.3.1). The results can be seen in Fig. 8(a) and (b).

In Fig. 8(a), we measure the performance of the data cleaning process by detecting the residuals between the original traffic and the cleaned traffic. The cleaner the traffic, the higher the detection performance (lower false-alarm probabilities and higher detection probabilities) on the residuals. From Fig. 8(a), the false alarms are high when maxC is smaller than 5; however, when maxC is larger than 5, the corresponding curves not only perform better, but are also close to each other. As for Fig. 8(b), when maxC is larger than 3, the corresponding curves are also similar to each other, with higher detection performance. Based on these results, we finally choose maxC = 5 in later experiments.

4.3.4. Parameter estimation in basis update

The parameters we need to assess are λd, λ1 and λ2. Given our motivation for the distance measurement, i.e., that the new basis should both maintain similarity to the immediately previous one and be close to the historical mean, the importance of the two parts in Eq. (11) should be the same, and we set the weighting parameters λ1 = λ2 = 0.5.

Parameter λd determines the trade-off between the measurement constraint and the importance of changes. A larger λd leads to looser approximations, whereas a smaller λd means the approximations are a better fit to the data. We set the candidate solutions for λd to be {0.01, 0.1, 1, 2, 10, 100}. Fig. 9 shows the detection performance results with respect to different λd values. We find that an optimal detection result (a higher detection probability and a lower false-alarm probability) occurs around λd = 2, with the corresponding curve tending to the top left of the plot. Hence, we finally choose λd = 2, λ1 = 0.5, and λ2 = 0.5 in the following experiments.

4.4. Experiment results on synthetic data

We conducted several classes of synthetic experiments. In the first class of experiments, we compare anomaly detectors on fixed-width anomalies, but with anomalies that are not simple spikes. In the second class, we consider the effects of anomaly width on detection performance. In the third and fourth classes, we consider the same issues but for magnitude.

In each case below we generate traffic and inject anomalies as described above. We replicate the experiments 100 times, and report aggregated statistics.

4.4.1. Constant-width results

We start by considering anomalies with random magnitude, but fixed width ψ = 10. The resulting ROC curves are shown in Fig. 10(a), which shows that as the threshold q increases, both the Detection Probability on Anomalous Points (DP_AP) and the False-Alarm Probability on Non-anomalous Points (FAP_NP) decrease, i.e., there is a trade-off between the two types of errors. Noteworthy is the fact that BasisEvolution (BE) outperforms the other methods (we prefer curves towards the top left of the diagram).
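As a concrete reading of these metrics, the sketch below computes one (FAP_NP, DP_AP) point per threshold q by flagging points whose residual exceeds q; the residual and ground-truth inputs are assumed, and the names are ours, not the authors'.

```python
import numpy as np

def roc_points(residual, truth, thresholds):
    """For each threshold q, flag points where |residual| > q and compute
    DP_AP (fraction of true anomalous points detected) and FAP_NP
    (fraction of non-anomalous points falsely flagged)."""
    truth = truth.astype(bool)
    points = []
    for q in thresholds:
        flagged = np.abs(residual) > q
        dp_ap = flagged[truth].mean() if truth.any() else 0.0
        fap_np = flagged[~truth].mean() if (~truth).any() else 0.0
        points.append((fap_np, dp_ap))
    return points  # one (FAP_NP, DP_AP) pair per threshold, as in Fig. 10(a)
```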
Fig. 8. The performance of the data cleaning process (a) and the whole framework (b) under different values of maxC, with maxC ∈ {1, 3, 5, 7, 9}. The effectiveness is measured with FAP_NP (False-Alarm Probability on Non-anomalous Points) and DP_AP (Detection Probability on Anomalous Points). The whole simulation experiment is repeated 100 times to obtain statistical results. A better value for maxC is 5. (a) The performance of the data cleaning process. (b) The performance of BasisEvolution.

Fig. 9. Comparison of detection performance in terms of FAP_NP (False-Alarm Probability on Non-anomalous Points) and DP_AP (Detection Probability on Anomalous Points) when λd ∈ {0.01, 0.1, 1, 2, 10, 100}. The entire simulation was repeated 100 times to obtain statistical results. A better performance can be found at λd = 2.
Table 2 shows, more precisely, a number of FAP_NP values for given DP_AP values. The important feature shown in this table, apart from BasisEvolution's superiority, is that it can achieve very low false-alarm probabilities (as low as 0.00063), which are necessary in practical anomaly detection algorithms.

The trade-off is a relatively low DP_AP, but this is less important. In real anomaly detection settings, it is far more important to avoid reporting a large number of false alarms, while still providing useful detections, than to detect all events at the cost of false alarms that degrade the operator's confidence in any of the detections. This also emphasizes the need for simulation studies, as it would have been impractical for us to measure such a low probability without a large set of data on which to perform tests.

In Fig. 10(b), we show the equivalent of an ROC curve for the pair of metrics AD_AA (Aggregation Degree on Anomaly Amounts) and CD_AA (Consistency Degree on Anomaly Amounts), although in this plot we prefer curves to the top right of the diagram. As the threshold q increases, AD_AA increases while CD_AA decreases, showing the trade-off between the two types of errors. Once again BasisEvolution is universally superior to the other methods across all threshold values.

In Fig. 10(c), apart from NNBP, as the threshold q increases, AD_AD (Aggregation Degree on Anomaly Duration) increases, while CD_AD (Consistency Degree on Anomaly Duration) decreases, showing another trade-off. As in (b), we prefer curves towards the top right. It is beyond doubt that BasisEvolution outperforms the other four methods across all threshold values.

4.4.2. The effect of width

We now test the five anomaly detection methods on anomalies whose magnitudes are randomly distributed, but with fixed width parameter ψ = {0, 4, 10, 20, 40}, to understand the effect of anomaly width on detection. For the FIFD type of anomalies, Fig. 11(a) shows that the performance of BasisEvolution decreases as the width parameter ψ grows. That is, it is more difficult to correctly detect wider anomalies. This is to be expected: as the anomaly width increases, there are more points in the anomaly, and thus these points are more likely to be considered "normal" by any detector, and by any metric. A single spike will always be the easiest type of event to detect; although it might be possible to detect the change into a broad flat anomaly, it will always be difficult to correctly categorize the flat region.

4.4.3. Constant-magnitude results

We now consider the effect of the magnitude of the anomalies on detection, starting with a set of tests with fixed magnitude, φ = 0.5, but random width. The ROC curves for the metrics DP_AP (Detection Probability on Anomalous Points) and FAP_NP (False-Alarm Probability on Non-anomalous Points) are shown in Fig. 12.

Fig. 10. Performance of the 5 methods for detecting FIFD anomalies. Here the width parameter is ψ = 10, and the magnitude is random. BE is short for BasisEvolution. FAP_NP is the False-Alarm Probability on Non-anomalous Points. DP_AP is the Detection Probability on Anomalous Points. AD_AA is the Aggregation Degree on Anomaly Amounts. CD_AA is the Consistency Degree on Anomaly Amounts. AD_AD is the Aggregation Degree on Anomaly Duration. CD_AD is the Consistency Degree on Anomaly Duration. (a) FAP_NP versus DP_AP. (b) AD_AA versus CD_AA. (c) AD_AD versus CD_AD.
Fig. 11. Performance (FAP_NP versus DP_AP) of BasisEvolution in detecting FIFD anomalies with varying anomaly widths or magnitudes. FAP_NP is the False-Alarm Probability on Non-anomalous Points. DP_AP is the Detection Probability on Anomalous Points. For BE-detected results, we note that performance decreases with increasing width, or decreasing magnitude. The additional plots are omitted as they reinforce the same results. (a) Variable width ψ. (b) Variable magnitude φ.

Fig. 12. Performance for detecting FIFD anomalies with fixed magnitude φ = 0.5 and random width. FAP_NP is the False-Alarm Probability on Non-anomalous Points. DP_AP is the Detection Probability on Anomalous Points. AD_AA is the Aggregation Degree on Anomaly Amounts. CD_AA is the Consistency Degree on Anomaly Amounts. AD_AD is the Aggregation Degree on Anomaly Duration. CD_AD is the Consistency Degree on Anomaly Duration. (a) FAP_NP versus DP_AP. (b) AD_AA versus CD_AA. (c) AD_AD versus CD_AD.

As in Fig. 10, both DP_AP and FAP_NP decrease as the threshold q increases. We prefer the curves toward the top left of the diagram. The results are similar to those for constant width, though perhaps more extreme; e.g., Table 3 provides FAP_NP values for the five techniques corresponding to the fixed DP_AP values, and we see that BasisEvolution achieves an even lower false-alarm probability of 7.8 × 10^−5.

Table 3
We fix the value of DP_AP, and report the corresponding FAP_NP of five methods. FAP_NP is the False-Alarm Probability on Non-anomalous Points. DP_AP is the Detection Probability on Anomalous Points. DNE means "Does Not Exist". BE is short for BasisEvolution. Note that very small FAP_NP values are possible (shown in bold), as required.

Method   Fixed DP_AP = 0.3   Fixed DP_AP = 0.5   Fixed DP_AP = 0.7
BE       7.8 × 10^−5         0.0002              0.0025
OMP      0.0026              0.009               0.066
PCA      0.013               0.039               0.16
RPCA     0.033               0.079               DNE
NNBP     0.19                DNE                 DNE

4.4.4. Effect of varying magnitudes

In our final simulation results, we test anomalies whose widths are randomly distributed, with magnitude parameter φ = {0.1, 0.3, 0.5, 0.7, 1}, to analyze the impact of anomaly magnitude on detection.

Fig. 11(b) illustrates the results of BasisEvolution on anomalies with varying magnitudes, showing it is easier to distinguish anomalies with larger magnitudes. This result is reasonable, because anomalies with larger magnitudes lead to the corresponding anomalous points having larger deviations, which are easier for any detector to see. However, there is a diminishing return for anomalies above a threshold around φ = 0.5, where little improvement is seen in comparison to the change in anomaly magnitude.

The figures presented in the above four classes show the results for FIFD anomalies, but we found the same types of results no matter which type of anomaly we considered. We also saw similar results for the alternative performance metrics (in the places where these plots are omitted).

Why is the proposed approach, BasisEvolution, better than the others? Two reasons can be identified by analyzing its mechanism. BasisEvolution starts extracting basis vectors under the smallest period (24 h), and ends with a basis under the largest cycle (e.g., weeks or months). In each cycle, we obtain the directions of principal components as basis vectors. Since these directions preserve the largest energy of the traffic at each cycle, they are easy to understand, and they describe the network traffic at multiple scales. Those multiscale basis vectors can formulate the "typical region" more accurately, which means that BasisEvolution can find more types of anomalies (e.g., subtle anomalies), and is robust in anomaly detection. Moreover, BasisEvolution utilizes the extracted basis vectors to construct traffic volumes through linear combination. This mechanism means that the extracted basis vectors explain the components of the network traffic.

Compared with the commonly used approaches (OMP and PCA), BasisEvolution performs better in the above experiments. Specifically, PCA is sensitive to anomalous points; that is, the "typical region" (or the "normal subspace") that PCA constructs is easily contaminated when the number of anomalous points is large. BasisEvolution alleviates this issue by extracting multiscale features, so that the "typical region" can be better approximated, with a comprehensive description of network traffic from different views. Hence, BasisEvolution is less sensitive and can find more anomalies than PCA (and PCA-type methods). As for OMP, it reconstructs network traffic through the superposition of a finite number of discrete cosine functions. Its individual basis vectors have no real-world interpretation, and a finite number of such functions may lose the details of the "typical region", which decreases the detection probability. Hence, BasisEvolution is better than OMP with respect to both interpretability and accuracy.
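As an illustration of this mechanism, the dominant daily pattern can be extracted as the leading singular vector of a day-by-hour matrix. This is a minimal sketch under our own assumptions about the per-cycle extraction, not the authors' implementation.

```python
import numpy as np

def daily_basis(traffic, samples_per_day=24):
    """Extract a dominant daily-cycle basis vector: reshape the series
    into (days x 24) and take the first right singular vector, i.e., the
    principal direction preserving the most energy at the daily cycle."""
    days = len(traffic) // samples_per_day
    X = traffic[: days * samples_per_day].reshape(days, samples_per_day)
    X = X - X.mean(axis=0)                      # normalize to zero mean
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return vt[0]                                # length-24 daily basis vector

# Reconstruction as a linear combination: project each day onto the basis.
# coeffs = X @ basis; approx = np.outer(coeffs, basis)
```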
5. Experiments with real data

We test BasisEvolution on traffic data from the Internet Initiative Japan (IIJ), a Japanese ISP. The dataset [41] is a collection of time series data obtained from Japanese network backbone routers. The collected data extends from April 4 to December 4, 2014.
Fig. 13. Basis vectors for the first two time segments of the second IIJ link, shown at the cycles c1 = 1 day, c2 = 1 week, and c3 = 2 weeks (values over 14 days). All functions are normalized to have zero mean. (a) The initial basis vectors on the first time segment: April 4th–18th, 2014. (b) The evolved basis on the following time segment.

Fig. 14. Comparisons of the methods on false-alarm probability and detection probability with IIJ data. Anomalies detected by (a) PCA, (b) RPCA, and (c) OMP are treated as "real" anomalies, respectively.

It contains traffic measurements (in and out) on 10 links, at time intervals of 1 h, for a total of 5857 observed data samples for each link in each direction.

An intrinsic problem we have already discussed is that it is difficult to obtain the ground truth for anomaly detection. Even if we adopted expert advice to label the data, some anomalies would remain unseen by experts, because they are not traditional ones. Hence, our false-alarm probabilities would be elevated. Thus, we used simulations to provide quantitative comparisons to understand the merits of the various anomaly detection methods.

However, we acknowledge that there is an artificiality in any simulation. Here we aim to at least comment on the ability of BasisEvolution to find anomalies in real data by using common methods (such as PCA) to find anomalies, which we then treat as if they were real. Using those "real" anomalies, we compare BasisEvolution with other widely used detection methods. In this section, however, we omit NNBP results, as this method performed very poorly in simulations.

We divide each series into segments of two weeks in length, as in the simulations. We repeat the process for each algorithm as described above, for each link. For example, Fig. 13 shows our discovered basis functions for the first two time segments: the initial segment, and its evolved form in the second time segment. In each, we can see that the first two basis vectors show the daily and weekly cycles in the data, with the third containing more noise.

Next, we compare the anomaly detectors against each other by treating the anomalies found by one method as if they were real, and then seeking to detect these using an alternative method.

Take PCA-detected anomalies as an example. Fig. 14(a) illustrates the performance of the remaining three methods (OMP, RPCA, and BasisEvolution) in comparison. Note that here, the False-Alarm Probability of PCA-detected Non-anomalous Points (FAP_NP) and the Detection Probability of PCA-detected Anomalous Points (DP_AP) are not computed with respect to ground truth, and thus the meaning of the graph is different: it shows how consistent the approaches are. Unsurprisingly, RPCA is the most consistent with PCA, in that it finds the largest share of the same anomalies. OMP is the most different from PCA, and our approach is in between the two. Similar conclusions can be found in Fig. 14(b), where RPCA-detected anomalies are regarded as the real ones. However, when OMP-detected anomalies are used instead (Fig. 14(c)), PCA, RPCA and BasisEvolution all share almost the same number of anomalies, which is in accordance with their underlying mechanisms. The alternative measures we considered above reinforce the same story.

One interesting observation in Fig. 14 is that the overall detection probabilities are not high, which means the overlap region among those methods is small. One explanation is that different anomalies are usually found by different methods because of their inherent mechanisms. Hence, RPCA and PCA perform best in Fig. 14(a) and (b), respectively. OMP, PCA and BasisEvolution discover different types of anomalies. This is what we might expect, given the different assumptions and algorithms upon which they are based.

To further analyze the performance of the four methods, we present the detection results for a randomly chosen ten weeks of data, as shown in Fig. 15. Since we do not have the ground truth for the real data, the possible methods of identifying anomalous points are expert advice and voting. In view of real network scenarios and holiday influences in Japan, we can label several anomalous points, such as A(624, 2.12e+08). In addition, we adopt a voting strategy to estimate the reliability of each detected anomalous point: any detected anomalous point wins one vote if another technique also identifies it as abnormal.
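A minimal sketch of this voting scheme, under our own naming assumptions: each detector contributes a set of flagged time indices, and a point's vote count is the number of other detectors that also flag it.

```python
def vote_counts(detections, method):
    """For each point flagged by `method`, count how many of the other
    detectors also flag it (0-3 votes when there are four detectors)."""
    others = [pts for name, pts in detections.items() if name != method]
    return {p: sum(p in pts for pts in others) for p in detections[method]}

detections = {             # flagged time indices per detector (illustrative only)
    "BE":   {624, 983, 1179},
    "OMP":  {624},
    "PCA":  {983, 1179},
    "RPCA": {983},
}
print(vote_counts(detections, "BE"))   # maps each BE point to its vote count,
                                       # e.g. {624: 1, 983: 2, 1179: 1}
```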
Fig. 15. Detection results of the four algorithms on ten weeks of data on link 4, IIJ: Apr 5th, 2014 to Jun 14th, 2014. The four algorithms are OMP, PCA, RPCA, and BasisEvolution (BE). The plot marks the labeled anomalous points A(624, 2.12e+08), B(983, 2.32e+08) and C(1179, 1.411e+08); the x-axis is the date (starting with Sunday, Apr. 5th, 2014) and the y-axis is the network traffic volume.

In our situation, the number of votes that one point can obtain ranges from 0 to 3. A larger number of votes indicates that the corresponding anomalous point is shared by more approaches; hence, the corresponding methods are more reliable.

Take BasisEvolution as an example: a total of 15 anomalous points are detected in this context. Among them, five points have three votes, and can be detected by the other three approaches; four points have two votes; three points have one vote; two points (A and B in Fig. 15) have been labeled as anomalous even though they have not been detected by any other technique; and only the remaining point (C in Fig. 15) is neither detected nor labeled. Similar descriptions of the other methods can be seen in Table 4. Compared with the other three methods, 12 out of 15 BE-detected points can also be detected by others, and only one point receives zero votes, which indicates the strong reliability of BasisEvolution. Regarding PCA, RPCA and OMP, larger proportions of their detected points receive zero votes, which demonstrates that the reliability of BasisEvolution is better than that of the other three. In addition, there are two zero-vote BE-detected points that are labeled as real anomalies, while OMP has one and both PCA and RPCA have zero. These results, as well as the simulations, support the conclusion that BasisEvolution can find anomalous points that the other approaches cannot.

Table 4
Description of the anomaly detection results for the four algorithms, in terms of the reliability of the detected anomalies, based on Fig. 15. Here, "Total anomalies" is the total number of anomalies detected by the approach; "Amount (of real anomalies)" counts the detected points that are real anomalies based on expert advice but fail to get any vote; and "No. (with votes)" records the number of detected anomalous points with a certain number of votes. BasisEvolution is more reliable and can find anomalies that the other three cannot.

Method   Total anomalies   Amount (of real anomalies)   No. (with 3 / 2 / 1 / 0 votes)
BE       15                2                            5 / 4 / 3 / 1
OMP      13                1                            5 / 0 / 2 / 5
PCA      24                0                            5 / 4 / 9 / 6
RPCA     24                0                            5 / 4 / 7 / 8

6. Conclusion and future work

Accurately detecting anomalies without pre-knowledge plays an important role for network operators. However, to be practical, methods need to have a very low false-alarm probability, and there are other challenges. Strong anomalies can pollute the normal space used as a baseline to detect anomalies, and anomalies can themselves have different shapes that make them harder or easier to detect.

In this paper, we address these problems with BasisEvolution, which looks for a concise but accurate, and easy-to-understand basis that represents the typical traffic patterns more precisely. We evolve the basis to maintain similarity with the old one in describing the new data. We show it can then detect anomalous points on the residual series with higher accuracy than common alternatives. It still works even when the anomalies are complex.

However, there are many issues left for future work. For example, the basis generation algorithm we used is based on SVD, which is not feasible for very large data scenarios. The first issue is to find a robust, fast algorithm to replace SVD, such as the pursuit principle used in OMP. Another point is that the size of the data segments used in our framework is fixed and our algorithm proceeds in batches. A more realistic algorithm should be able to tackle arbitrary new segments of data as they arrive. We plan to extend BasisEvolution in the future to tackle these issues.

Acknowledgment

The authors thank IIJ for providing data. This work was supported by the National Key Basic Research Program of China (2013CB329103 of 2013CB329100), the China Scholarship Council (201506050072), as well as Australian Research Council (ARC) grant DP110103505 and the ARC Centre of Excellence for Mathematical and Statistical Frontiers (ACEMS) CE140100049.

References

[1] A. Lazarevic, L. Ertoz, V. Kumar, A. Ozgur, J. Srivastava, A comparative study of anomaly detection schemes in network intrusion detection, in: Proceedings of the SIAM International Conference on Data Mining, 2003, pp. 25–36.
[2] P. Barford, D. Plonka, Characteristics of network traffic flow anomalies, in: Proceedings of the First ACM SIGCOMM Workshop on Internet Measurement (IMW), San Francisco, CA, USA, 2001, pp. 69–73.
[3] A. Delimargas, Iterative Principal Component Analysis (IPCA) for network anomaly detection, Master's thesis, Carleton University, 2015.
[4] K. Xu, F. Wang, L. Gu, Behavior analysis of internet traffic via bipartite graphs and one-mode projections, IEEE/ACM Trans. Netw. 22 (3) (2014) 931–942.
[5] M.H. Bhuyan, D.K. Bhattacharyya, J.K. Kalita, Network anomaly detection: methods, systems and tools, IEEE Commun. Surv. Tutor. 16 (1) (2014) 303–336.
[6] P. García-Teodoro, J. Díaz-Verdejo, G. Maciá-Fernández, E. Vázquez, Anomaly-based network intrusion detection: techniques, systems and challenges, Comput. Secur. 28 (1–2) (2009) 18–28.
[7] L. Huang, L. Nguyen, M.N. Garofalakis, M.I. Jordan, A.D. Joseph, N. Taft, In-network PCA and anomaly detection, in: Proceedings of the International Conference on Neural Information Processing Systems, 2007, pp. 617–624.
[8] D. Jiang, C. Yao, Z. Xu, W. Qin, Multi-scale anomaly detection for high-speed network traffic, Trans. Emerg. Telecommun. Technol. 26 (3) (2015) 308–317.
[9] E. Sober, The principle of parsimony, Br. J. Philos. Sci. 32 (2) (1981) 145–156.
[10] J.S. Chiou, The antecedents of consumers' loyalty toward internet service providers, Inf. Manag. 41 (6) (2004) 685–695.
[11] T. Tavallaee, N. Stakhanova, A.A. Ghorbani, Toward credible evaluation of anomaly-based intrusion-detection methods, IEEE Trans. Syst. Man Cybern. Part C Appl. Rev. 40 (5) (2010) 516–524.
[12] C. Croux, P. Filzmoser, M.R. Oliveira, Algorithms for projection–pursuit robust principal component analysis, Chemom. Intell. Lab. Syst. 87 (2) (2007) 218–225.
[13] A. Lakhina, K. Papagiannaki, M. Crovella, C. Diot, E.D. Kolaczyk, N. Taft, Structural analysis of network traffic flows, in: Proceedings of the ACM SIGMETRICS/Performance, 2004, pp. 61–72.
[14] C. Pascoal, M.R. de Oliveira, R. Valadas, P. Filzmoser, P. Salvador, A. Pacheco, Robust feature selection and robust PCA for internet traffic anomaly detection, in: Proceedings of the IEEE INFOCOM, 2012, pp. 1755–1763.
[15] S. Huang, F. Yu, R. Tsaih, Y. Huang, Network traffic anomaly detection with incremental majority learning, in: Proceedings of the International Joint Conference on Neural Networks (IJCNN), 2015, pp. 1–8.
[16] Z. Chen, C.K. Yeo, B.S.L. Francis, C.T. Lau, Combining MIC feature selection and feature-based MSPCA for network traffic anomaly detection, in: Proceedings of the Third International Conference on Digital Information Processing, Data Mining, and Wireless Communications (DIPDMWC), 2016, pp. 176–181.
[17] J. Zhang, H. Li, Q. Gao, H. Wang, Y. Luo, Detecting anomalies from big network traffic data using an adaptive detection approach, Inf. Sci. 318 (2015) 91–110.
[18] A. Lakhina, M. Crovella, C. Diot, Diagnosing network-wide traffic anomalies, in: Proceedings of the ACM SIGCOMM, 2004, pp. 219–230.
[19] Y. Tsuge, H. Tanaka, Quantification for intrusion detection system using discrete Fourier transform, in: Proceedings of the International Conference on Information Science and Security (ICISS), 2016, pp. 1–6.
[20] P. Barford, J. Kline, D. Plonka, A. Ron, A signal analysis of network traffic anomalies, in: Proceedings of the Second ACM SIGCOMM Workshop on Internet Measurement (IMW), Marseille, France, 2002, pp. 71–82.
[21] B. Eriksson, P. Barford, R. Bowden, N. Duffield, J. Sommers, M. Roughan, BasisDetect: a model-based network event detection framework, in: Proceedings of the Tenth ACM SIGCOMM Conference on Internet Measurement (IMC), Melbourne, Australia, 2010, pp. 451–464.
[22] M. Ahmed, A.N. Mahmood, J. Hu, A survey of network anomaly detection techniques, J. Netw. Comput. Appl. 60 (2016) 19–31.
[23] V. Chandola, A. Banerjee, V. Kumar, Anomaly detection: a survey, ACM Comput. Surv. 41 (3) (2009) 15.
[24] D.J. Wellerfahy, B.J. Borghetti, A.A. Sodemann, A survey of distance and similarity measures used within network intrusion anomaly detection, IEEE Commun. Surv. Tutor. 17 (1) (2015) 70–91.
[25] H. Ringberg, A. Soule, J. Rexford, C. Diot, Sensitivity of PCA for traffic anomaly detection, ACM SIGMETRICS 35 (1) (2007) 109–120.
[26] S. Novakov, C. Lung, I. Lambadaris, N. Seddigh, Studies in applying PCA and wavelet algorithms for network traffic anomaly detection, in: Proceedings of the Fourteenth IEEE International Conference on High Performance Switching and Routing (HPSR), 2013, pp. 185–190.
[27] F. Meng, N. Jiang, B. Liu, R. Li, F. Xia, A real-time detection approach to network traffic anomalies in communication networks, in: Proceedings of the Joint International Conference on Service Science, Management and Engineering (SSME) and International Conference on Information Science and Technology (IST), 2016.
[28] Z. Chen, C.K. Yeo, B.S. Lee, C.T. Lau, Detection of network anomalies using improved-MSPCA with sketches, Comput. Secur. 65 (2017) 314–328.
[29] Z. Wang, K. Hu, K. Xu, B. Yin, X. Dong, Structural analysis of network traffic matrix via relaxed principal component pursuit, Comput. Netw. 56 (7) (2012) 2049–2067.
[30] G.H. Golub, C.F. Van Loan, Matrix Computations, 3, JHU Press, 2012.
[31] J. Zobel, How reliable are the results of large-scale information retrieval experiments, in: Proceedings of the SIGIR, 1998, pp. 307–314.
[32] D.D. Lee, H.S. Seung, Algorithms for non-negative matrix factorization, in: Proceedings of the Advances in Neural Information Processing Systems, 2001, pp. 556–562.
[33] R. Peharz, F. Pernkopf, Sparse nonnegative matrix factorization with l0-constraints, Neurocomputing 80 (2012) 38–46.
[34] Y. Zhao, G. Karypis, D.-Z. Du, Criterion functions for document clustering, Technical Report, Department of Computer Science, University of Minnesota, 2005.
[35] J. Han, J. Pei, M. Kamber, Data Mining: Concepts and Techniques, Elsevier, 2011.
[36] T. Fawcett, An introduction to ROC analysis, Pattern Recognit. Lett. 27 (8) (2006) 861–874.
[37] M. Roughan, A. Greenberg, C. Kalmanek, M. Rumsewicz, J. Yates, Y. Zhang, Experience in measuring backbone traffic variability: models, metrics, measurements and meaning, in: Proceedings of the Second ACM SIGCOMM Workshop on Internet Measurement (IMW), 2002, pp. 91–92.
[38] M. Roughan, J. Gottlieb, Large-scale measurement and modeling of backbone internet traffic, in: Proceedings of the SPIE ITCom, Boston, MA, USA, 2002.
[39] M. Roughan, On the beneficial impact of strong correlations for anomaly detection, Stoch. Models 25 (1) (2009) 1–27.
[40] C.C. Zou, W. Gong, D. Towsley, Code Red worm propagation modeling and analysis, in: Proceedings of the Ninth ACM Conference on Computer and Communications Security (CCS), New York, NY, USA, 2002, pp. 138–147, doi:10.1145/586110.586130.
[41] P. Tune, K. Cho, M. Roughan, A comparison of information criteria for traffic model selection, in: Proceedings of the Tenth International Conference on Signal Processing and Communications Systems (ICSPCS), Gold Coast, Australia, 2016.
Hui Xia is a Ph.D. student in the Department of Computer Science at Chongqing University. She received a Bachelor's degree in Computer Networks from Nanchang University. Her research interests include network traffic analysis, data mining, and personal recommendation.

Bin Fang received the B.S. degree in electrical engineering from Xi'an Jiaotong University, Xi'an, China, the M.S. degree in electrical engineering from Sichuan University, Chengdu, China, and the Ph.D. degree in electrical engineering from the University of Hong Kong, Hong Kong. He is currently a Professor with the College of Computer Science, Chongqing University, Chongqing, China. His research interests include computer vision, pattern recognition, information processing, biometrics applications, and document analysis. He has published more than 120 technical papers and is an Associate Editor of the International Journal of Pattern Recognition and Artificial Intelligence. Prof. Fang has been the Program Chair and a Committee Member for many international conferences.

Matthew Roughan received a Bachelor's degree in mathematical science and a Ph.D. degree in applied mathematics from the University of Adelaide. Currently, he is a professor with the School of Mathematical Sciences, the University of Adelaide. His research interests include Internet measurement and estimation, network management, and stochastic modeling, in particular with respect to network traffic and performance modeling. He has published more than 150 technical papers, and is a member of MASA, AustMS, IEEE and ACM. Prof. Roughan has been a chair of the ACM SIGCOMM Doctoral Dissertation Award Committee, and won the 2013 ACM SIGMETRICS Test of Time Award.

Kenjiro Cho received the B.S. degree in electronic engineering from Kobe University, the M.Eng. degree in computer science from Cornell University, and the Ph.D. degree in media and governance from Keio University. He is Deputy Research Director with the Internet Initiative Japan, Inc., Tokyo, Japan. He is also an Adjunct Professor with Keio University and the Japan Advanced Institute of Science and Technology, Tokyo, Japan, and a board member of the WIDE project. His current research interests include traffic measurement and management and operating system support for networking.

Paul Tune received a Ph.D. degree at the Centre for Ultra-Broadband Information Networks (CUBIN), Department of Electrical & Electronic Engineering, the University of Melbourne. He worked as a postdoctoral research fellow at the School of Mathematical Sciences, the University of Adelaide. He now works at Image Intelligence, Sydney. His research interests are network measurement, information theory, and compressed sensing.
