Computer Networks
journal homepage: www.elsevier.com/locate/comnet
Article history:
Received 13 January 2017
Revised 16 October 2017
Accepted 17 January 2018
Available online 31 January 2018

Keywords:
Anomaly detection
Basis
Evolution
Low false-alarm probability
SVD

Abstract
Traffic anomalies arise from network problems, and so detection and diagnosis are useful tools for network managers. A great deal of progress has been made on this problem, but most approaches can be thought of as forcing the data to fit a single mould. Existing anomaly detection methods largely work by separating traffic signals into "normal" and "anomalous" types using historical data, but do so inflexibly, either requiring a long description of "normal" traffic, or a short but inaccurate description. In essence, preconceived "basis" functions limit the ability to fit data, and the static nature of many algorithms prevents true adaptivity, despite the fact that real Internet traffic evolves over time. In our approach we allow a very general class of functions to represent traffic data, limiting them only by invariant properties of network traffic such as diurnal and weekly cycles. This representation is designed to evolve so as to adapt to changing traffic over time. Our anomaly detection thresholds the approximation residual error, combined with a generic clustering technique to report a group of anomalous points as a single anomaly event. We evaluate our method against orthogonal matching pursuit, principal component analysis, robust principal component analysis and a back-propagation neural network, using both synthetic and real-world data, and obtain very low false-alarm probabilities in comparison.

© 2018 Elsevier B.V. All rights reserved.
https://doi.org/10.1016/j.comnet.2018.01.025
16 H. Xia et al. / Computer Networks 135 (2018) 15–31
niques seek ways to pick better functions, so that the reconstructed traffic τ̂ is closer to τ. Our approach tackles these problems by:

1. allowing a wider class of basis functions, constrained by known traffic features in order to avoid an explosion of possibilities; and
2. an adaptive mechanism that allows the representation to evolve easily.

Fig. 2. The general anomaly detection process. (Data transformation) is a preprocessing procedure. (Basis derivation) generates a basis from the historical data. (Data reconstruction) approximates the new data with the basis. (Anomaly detection) detects anomalies from the residual series, which records the difference between the new data and its approximation.
Table 1
Description of anomaly detection in terms of candidate basis functions and techniques of finding a representation for a given data set. Here, the candidate
solution space describes the set of functions from which we can construct a typical traffic signal; and the No. is the number of functions in this space for a
signal of length n. The basis derivation refers to the technique used to refine this to a basis for the typical traffic. The basis size is the number of functions then
required to accurately represent the historical traffic.
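Before the formal definitions given below, the basic preprocessing step of this framework, blocking a measurement series into segments of b samples, can be illustrated with a minimal NumPy sketch (function and variable names are ours, not the authors' code):

```python
import numpy as np

def segment_traffic(x, b):
    """Group a 1-D traffic series into consecutive blocks of b measurements.

    Returns an array of shape (n_segments, b); trailing samples that do not
    fill a whole block are dropped.
    """
    x = np.asarray(x, dtype=float)
    n_segments = len(x) // b
    return x[:n_segments * b].reshape(n_segments, b)

# Example: hourly measurements, two-week segments (b = 14 * 24 = 336),
# matching the eight-week series used later in the paper.
hourly = np.arange(8 * 7 * 24)           # eight weeks of hourly samples
segments = segment_traffic(hourly, 14 * 24)
print(segments.shape)                     # (4, 336)
```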
tion process, the evolution approach can effectively reduce the computational cost, especially for large-scale data.

To define the algorithm formally, we start by noting that we sample traffic at times t_j, separated by intervals Δt. The variables x_{t_j} refer to the total traffic during the interval [t_j, t_{j+1}). We group these into blocks of b measurements, such that we obtain a new process X = {x_0, ..., x_n} where

x_m = [x_{t_j}, ..., x_{t_{j+b-1}}]^T,

where j = bm, and the time intervals in this series are T = bΔt. In BasisEvolution, we use this data segment series as the input for the whole framework, where T is usually several cycles long, e.g., it might be several weeks.

In the framework, a small set of clean data is required for the initialization of the basis. However, the collected traffic measurements are contaminated with anomalies, and it would cost large amounts of resources to manually separate the typical part. We therefore design a data cleaning process on a data block as an alternative. Usually, we choose the first collected data block x_0 for cleaning.

With the cleaned data segment x_0', the framework derives a basis for initialization. We name this procedure the basis generation process, and the initialized basis H_0.

For any new traffic (e.g., x_1), we update the old basis (e.g., H_0). If the approximation error of the evolved basis on the new traffic is outside a given bound, BasisEvolution will instead clean the data and initialize a new basis rather than evolving the old one.

The framework then performs anomaly detection on the approximation error (e.g., r_1). We identify the anomalous points by thresholding, and cluster them together, with each cluster being referred to as an anomaly.

The complete algorithm is illustrated in Fig. 3, in which BasisEvolution is composed of four components: data cleaning, basis generation, basis update and anomaly detection (anomalous-point detection and anomaly clustering). We describe each in detail below.

Fig. 3. A bird's eye view of the BasisEvolution framework for one single link. At time index T_0, we clean the data segment x_0 and get the "anomaly free" data segment x_0'. Then we derive a basis set H_0 from x_0'. Define r_0 to be the approximation error of x_0 reconstructed by H_0. Since x_0' is totally normal, no anomaly is expected to be found in its residual; we therefore use x_0 instead of x_0' for anomaly detection. The projection of x_0 on H_0 is W_0, and r_0 is the difference between x_0 and the approximation series W_0 H_0. Through the anomaly detection process, we get the anomalous points P_0 and the clustered anomalies A_0. At subsequent time indices (e.g., T_m), we evolve the old basis H_{m-1} into H_m, calculate the reconstruction error r_m = x_m - W_m H_m, and repeat the anomaly detection process to find anomalous points and anomalies.

Fig. 4. Different anomaly types in the simulated time series. The traffic point at time t_1 is a spike. The context of the point at t_2 is different from that at t_4, even though the value at t_2 is similar to that at t_4, so it is a contextual anomaly. Any point in the time interval t_3 is normal; however, all the points together form a collective anomaly [23]. For most anomaly detection methods, e.g., threshold-based techniques, a spike is easier to find than the rest, which indicates that the other anomalies can hide behind a larger anomaly.

3.1. Data cleaning
where V and U are unitary matrices, and Σ is a diagonal matrix containing the singular values σ_j of R^{(n)} in the cycle c_n. The singular values are arranged in decreasing order.

To understand the SVD's use in matrix approximations, consider the following interpretation. The matrix Σ is diagonal, so the SVD of a matrix R^{(n)} can be rewritten

R^{(n)} = \sum_{j=1}^{\ell} u_j σ_j v_j^T,    (6)

where u_j and v_j are the jth columns of U and V respectively, and \ell = \min\{c_n, b - c_n + 1\}. We then create an approximation R̃ by retaining only the terms corresponding to the largest J_n singular values:

R̃ = \sum_{j=1}^{J_n} u_j σ_j v_j^T.    (7)

The energy proportion of u_j is

E_j = \frac{σ_j^2}{\sum_{j=1}^{\ell} σ_j^2},    (8)

and we choose J_n such that E_{J_n} is closest to, but still larger than, the preset threshold value γ = 0.3.

Since the vectors in the first J_n columns of U carry the largest energy of the signal, we directly associate h^{(c_n,j)} = u_j, and obtain the corresponding weight

w^{(c_n,j)} = \frac{h^{(c_n,j)T} r^{(n)}}{\| r^{(n)} \|}.    (9)

Then we construct a new residual for the next period, and solve the new problem as above.

We call this algorithm SVD on Specific Cycles (SVD_SC), which is described in detail in Algorithm 1. One thing we should note is that in Step 6, and also in subsequent algorithms, we use Matlab's notation for vector concatenation.

The entire algorithm is applied to the first cleaned signal, i.e., x = x_0'. Moreover, when the traffic patterns of one data segment vary too much, we also generate a new basis through the basis generation process instead of updating the old one.

Algorithm 1 SVD on specific cycles algorithm (SVD_SC).
Input:
    The traffic segment, x;
    The traffic aggregation time, Δt days;
    The size of x, b;
    The threshold of basis energy proportion, γ.
Output:
    The basis H, and weight W;
    The final residual r.
1: Initialize the cycle set c ← {1/Δt, 7/Δt, b};
2: Set r^{(1)} ← x;
3: for n = 1, ..., I do
4:     // Create R^{(n)} from r^{(n)} according to Eq. (4)
5:     for row = 1, 2, ..., b − c_n + 1 do
6:         Slide a window of length c_n over r^{(n)} to form the matrix R row by row: R^{(n)} ← [R^{(n)}; [r^{(n)}_{t_{row−1}}, ..., r^{(n)}_{t_{row+c_n−2}}]];
7:     end for
8:     Apply the SVD to the matrix R^{(n)}, i.e., find U, Σ and V such that R^{(n)} = U Σ V^T;
9:     Set j ← 1;
10:    while the energy rate E_j ≥ γ do
11:        j ← j + 1;
12:    end while
13:    Set J_n ← j;
14:    The first J_n columns of U constitute the function family H^{(c_n)} ← [u_1, ..., u_{J_n}];
15:    Calculate the corresponding weights according to Eq. (9): w^{(c_n)} ← [w^{(c_n,1)}, ..., w^{(c_n,J_n)}]^T;
16:    r^{(n+1)} ← r^{(n)} − H^{(c_n)} w^{(c_n)};
17: end for
18: r ← r^{(I+1)}.
seen in Eq. (11), where the first term calculates the similarity with the previous basis, the second term evaluates the deviation from the historical mean, and the parameters λ_1 and λ_2 are weighting coefficients trading off similarity against deviation:

d(h_m^{(k)}, H_{(m-1)}^{(k)}) = λ_1 \| h_m^{(k)} − h_{(m-1)}^{(k)} \| + λ_2 \| h_m^{(k)} − \bar{h}_{(m-1)}^{(k)} \|,    (11)

where

\bar{h}_{(m-1)}^{(k)} = \frac{1}{m} \sum_{i=0}^{m-1} h_i^{(k)}.

The jth element of h_m^{(k)} is also constrained to maintain smoothness by enforcing

\min_i (h_i^{(k)}(j)) ≤ h_m^{(k)}(j) ≤ \max_i (h_i^{(k)}(j)),    (12)

where i ∈ [0, m − 1]. Define H^{min}_{(m-1)} and H^{max}_{(m-1)} as the collections of historical minima and maxima of the basis vectors, respectively. Hence,

H^{min}_{(m-1)} = \{ minh^{(k)}_{(m-1)} \}_{k ∈ [1,L]};
H^{max}_{(m-1)} = \{ maxh^{(k)}_{(m-1)} \}_{k ∈ [1,L]},

where

minh^{(k)}_{(m-1)} = \Big( \min_{i ∈ [0,m-1]} h_i^{(k)}(j) \Big)_{j ∈ [1,b]}

and

maxh^{(k)}_{(m-1)} = \Big( \max_{i ∈ [0,m-1]} h_i^{(k)}(j) \Big)_{j ∈ [1,b]}.

The result is a new basis that evolves slowly to match changes in traffic patterns, but is unresponsive to anomalies, which are inherently rare and erratic (or else they would not be anomalous).

We derive h_m^{(k)} and w_m^{(k)} by alternately fixing one variable and optimizing the other [32,33]. We transform the final residual e_k to a matrix form E whose rows have the same length as h_m^{(k)}, i.e., c. The data segments cover traffic for several cycles, so the new matrix E has K = b/c rows, and we can write it

E = \begin{bmatrix}
e_k(1) & \cdots & e_k(c) \\
e_k(c+1) & \cdots & e_k(2c) \\
\vdots & & \vdots \\
e_k((K-1)c+1) & \cdots & e_k(Kc)
\end{bmatrix}.    (13)

For each basis vector h_m^{(k)}, we initialize a vector

w = w_m^{(k)} y,

where y is an IID standard normal variate of length K. We then iteratively apply the following rules:

h_m^{(k)} ← h_m^{(k)} ∘ \frac{w^T E + λ_d (λ_1 h_{(m-1)}^{(k)} + λ_2 \bar{h}_{(m-1)}^{(k)})}{w^T w\, h_m^{(k)} + λ_d (λ_1 h_m^{(k)} + λ_2 h_m^{(k)})},    (14)

w ← w ∘ \frac{E\, h_m^{(k)}}{w\, h_m^{(k)} (h_m^{(k)})^T},    (15)

where ∘ and division are element-wise (Hadamard) multiplication and division, respectively. After the algorithm has converged, we take the average of w as the weight

w_m^{(k)} ← \frac{1}{κ} \sum_{i=1}^{κ} w_i,    (16)

where w_i ∈ w. Algorithm 2 shows more detail.

Algorithm 2 Basis update algorithm.
Input:
    A new traffic segment, x_m;
    The size of x_m, b;
    The historical basis set, H_{(m-1)};
    The historical minimum, H^{min}_{(m-1)};
    The historical maximum, H^{max}_{(m-1)};
    The historical mean, \bar{H}_{(m-1)};
    The old weight set, W_{(m-1)};
    The threshold of residuals, α;
    The maximum numbers of iterations, β and ζ.
Output:
    The new basis H_m, and weight set W_m;
    The final residual r.
1: H_m ← H_{(m-1)}, W_m ← W_{(m-1)};
2: Set the numbers of iterations n_1 ← 0 and n_2 ← 0;
3: Set the residual e_1 ← x_m;
4: while n_1 ≤ β and \bar{r} > α do
5:     for k = 1, 2, ..., L do
6:         Compute the residual e_k ← x_m − \sum_{l ≠ k} w_m^{(l)} h_m^{(l)};
7:         while n_2 ≤ ζ do
8:             Calculate the residual matrix E according to Eq. (13);
9:             Update h_m^{(k)} according to rule (14);
10:            c ← the length of h_m^{(k)};
11:            for j = 1, 2, ..., c do
12:                if h_m^{(k)}(j) < minh^{(k)}_{(m-1)}(j) then
13:                    h_m^{(k)}(j) ← minh^{(k)}_{(m-1)}(j);
14:                else if h_m^{(k)}(j) > maxh^{(k)}_{(m-1)}(j) then
15:                    h_m^{(k)}(j) ← maxh^{(k)}_{(m-1)}(j);
16:                end if
17:            end for
18:            Update w_m^{(k)} according to rules (15) and (16);
19:            n_2 ← n_2 + 1;
20:        end while
21:        H_m ← H_m ∪ h_m^{(k)}, W_m ← W_m ∪ w_m^{(k)};
22:    end for
23:    Compute the approximation error r ← x_m − \sum_{k=1}^{L} w_m^{(k)} h_m^{(k)} and \bar{r} ← \frac{1}{b} \sum_{i=1}^{b} r(i);
24:    n_1 ← n_1 + 1;
25: end while

Finally, if the residual error \bar{r} cannot be reduced to a preset bound (e.g., the average approximation error using the previous basis on new data) after a certain number of iterations, the update algorithm has not converged. This non-convergence indicates that the behaviour pattern of the new data segment varies greatly from past behaviour, and we should apply the data cleaning and basis generation processes to produce a totally new basis set instead of updating the old one.

The time complexity of the basis update algorithm is O(L × c × β × ζ). For any two-week data segment, L is small (generally 3, corresponding to three different levels of cyclicity), and c records the length of the basis (no larger than two weeks). Meanwhile, we set β and ζ at small or moderate levels (in later experiments, β and ζ are both equal to 10 for simplicity). Hence, the time cost is not large. In practice, the basis update process takes about 0.6 s to evolve the old basis (the simulation code was run 100 times in Matlab R2012b on Mac OS X 10.9.5 with a 2.7 GHz Intel Core i5, the same environment as elsewhere). This indicates that updating the old basis takes less time than generating a new one, which makes it more suitable for online anomaly detection.

3.4. Anomaly detection

Many researchers stop their work after detecting the anomalous points by thresholding the residual. However, many detected points may be associated with the same underlying anomaly. It is confusing to network operators to present a single event as many detections, and there are distinct advantages in combining individual detections before deciding which anomalies are most important: for instance, an anomaly that develops more slowly, but over a long time, may be more important than a sharp, but singular event.

In this paper, we try to associate detections so that we can present a single anomalous event, along with information about the duration of the event. We do so through a clustering process that we describe below.

3.4.1. Anomalous-point detection

We detect anomalous points by thresholding. For any approximation residual error r, we set the upper and lower limits [p − qσ, p + qσ], where p and σ are the mean and standard deviation of r, and q is a parameter chosen to fix the false-alarm probability (assuming a normal distribution of the residual). Any points of r that fall outside these thresholds are declared to be anomalous.

3.4.2. Anomaly clustering

Once we have anomalous data points, we use a bottom-up hierarchical clustering algorithm to group points, as shown in Algorithm 3. As in Algorithm 1, we use the Matlab notation to represent the concatenation of the time series t^{(m)}, as shown in Steps 12, 18 and 25.

For most anomaly types, the times of anomalous points caused by the same event are close to each other, which means they occur as short-time anomalies. In clustering, we measure the distance between clusters by the absolute time gap between their centers. Anomalous points with smaller time gaps will tend to be grouped into one large cluster. This clustering method therefore has the ability to find short-time anomalies.

Since we have not included any specific properties of network traffic in the clustering, this clustering algorithm is generally and equally applicable to other detectors given the time locations of anomalous points, and we shall test it as part of the alternative algorithms.

Algorithm 3 Bottom-up hierarchical clustering based anomaly detection.
Input:
    Time series of the anomalous traffic measurements, t = [t_1, ..., t_n];
    A distance threshold, ι;
    The maximum number of iterations, η.
Output:
    The set of anomalies (clusters) and the corresponding center set, V and U;
1: t^{(0)} ← t;
2: k^{(0)} ← n;
3: m ← 1;
4: while m ≤ η do
5:     k ← 1;
6:     for i = 1, ..., k^{(m-1)} − 1 do
7:         j ← i + 1;
8:         The distance d^{(m-1)}_{(i,j)} ← | t_i^{(m-1)} − t_j^{(m-1)} |;
9:         if d^{(m-1)}_{(i,j)} ≤ ι then
10:            t_k^{(m)} ← (t_i^{(m-1)} + t_j^{(m-1)}) / 2;
11:            V^{(m)} ← V^{(m)} ∪ {t_i^{(m-1)}, t_j^{(m-1)}};
12:            t^{(m)} ← [t^{(m)}, t_k^{(m)}];
13:            k ← k + 1;
14:            i ← i + 1;
15:        else
16:            V^{(m)} ← V^{(m)} ∪ {t_i^{(m-1)}};
17:            t_k^{(m)} ← t_i^{(m-1)};
18:            t^{(m)} ← [t^{(m)}, t_k^{(m)}];
19:            k ← k + 1;
20:        end if
21:    end for
22:    if d^{(m-1)}_{(i,j)} > ι then
23:        V^{(m)} ← V^{(m)} ∪ {t_j^{(m-1)}};
24:        t_k^{(m)} ← t_j^{(m-1)};
25:        t^{(m)} ← [t^{(m)}, t_k^{(m)}];
26:        k ← k + 1;
27:    end if
28:    k^{(m)} ← k;
29:    m ← m + 1;
30: end while
31: V ← V^{(m)}; U ← t^{(m)};
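The thresholding of Section 3.4.1 and a one-pass gap-based grouping can be sketched as follows (a simplified illustration: the single-pass grouping is a stand-in for the iterative Algorithm 3, and all names are ours):

```python
import numpy as np

def detect_anomalous_points(r, q=3.0):
    """Flag residual points outside [p - q*sigma, p + q*sigma] (Section 3.4.1)."""
    r = np.asarray(r, dtype=float)
    p, sigma = r.mean(), r.std()
    return np.where((r < p - q * sigma) | (r > p + q * sigma))[0]

def cluster_by_gap(times, iota=2):
    """Group anomalous time indices whose gap to the previous point is <= iota.

    Each maximal run of nearby points becomes one reported anomaly, which is
    the effect the bottom-up clustering of Algorithm 3 aims for.
    """
    clusters = []
    for t in sorted(times):
        if clusters and t - clusters[-1][-1] <= iota:
            clusters[-1].append(t)
        else:
            clusters.append([t])
    return clusters

# Example: a residual with one three-point burst and one isolated point.
r = np.zeros(200)
r[[50, 51, 52, 120]] = 10.0
points = [int(i) for i in detect_anomalous_points(r, q=3.0)]
clusters = cluster_by_gap(points, iota=2)
print(clusters)   # [[50, 51, 52], [120]]
```

The burst is reported as a single anomaly event rather than three separate detections, which is the behaviour the paper argues operators need.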
However, anomalies such as the FISSD type often last a long time and do not have the short-time attribute mentioned above. In that case, Algorithm 3 is no longer suitable. This long-time anomaly clustering problem is left for future research. One promising idea for addressing it is to combine multiple scales in the clustering, inspired by work on extracting multiscale features for anomaly detection [15,16,26,28].

4. Experiments with synthetic data

In this section, we apply several anomaly detection approaches to synthetic data. The simulation of synthetic data is extremely important for the validation of anomaly detection algorithms where ground truth is difficult to find [25]. In our case we aim to assess very small false-alarm probabilities, necessitating large numbers of tests. In addition, we want to test the ability of the algorithms to determine the duration of the anomalies. As far as we know, this information has not been labelled in any dataset.

Our approach to simulation intends to highlight the features of different techniques. We make no claim that the simulation is completely realistic, only that it illustrates the properties of different anomaly detection techniques.

Our analysis uses three pairs of metrics, as described in Section 4.1, intended to explore the different characteristics of the anomaly detection algorithms, and compares our algorithm to four other methods:

1. OMP-on-dictionary: This method, motivated by BasisDetect [21], uses Orthogonal Matching Pursuit (OMP) on a dictionary of discrete cosine time and frequency atoms. Similar to BasisEvolution, this model determines the L atoms with the largest coefficients α_L at the first data segment x_0. Then, for the new data
Our goal here is to generate traffic with some of the important observed features of real traffic. This will, of course, be a simplification of real traffic, but we have a great deal more control over its details and can see the effects, for instance, of changing the duration of anomalies.

The simulation has two steps: first, we generate our typical traffic, and second, we inject anomalies. The typical traffic is based around a sine function with a 24-h period to represent the strong diurnal cycles of Internet traffic [37]. Noise was added to represent statistical fluctuations around the mean. In this study, we use white Gaussian noise. Gaussianity is a reasonable first approximation for Internet traffic on uncongested links [38], and we do not include correlations (though they are often present in traffic), because uncorrelated traffic represents a worst case for detecting anomalies [39]. We also control the relative size of the noise: in what follows the standard deviation of the noise is 10% of the amplitude of the cycles.

We generate a series containing eight weeks of data, aggregated every 1 h, i.e., 1344 data points for each simulation. Each series is broken into 4 data segments, to be analyzed in blocks as described above. We experimented with other data segment lengths, and although the results and general trends were similar, a two-week segment produced the best results of the lengths tested.

The anomaly model in this work can be represented as A(τ, θ, ϑ, ψ, φ), where τ, θ, ϑ, ψ and φ are parameters for generating and injecting anomalies.

The parameter τ indexes a set of anomaly types. Each anomaly type is defined by a shape function with width W, which represents the minimum width for that anomaly type (otherwise, all anomalies could collapse into spikes), and amplitude M, which is defined by its size relative to the typical traffic.

1. FIFD (Fast Increase, Fast Decrease): these are the classical anomalies most often examined in prior work. They might be caused by system or device outages, or DoS attacks. Often, past literature has focused on "spikes", which are very short FIFD anomalies.
2. SIFD (Slow Increase, Fast Decrease): these might arise through epidemic processes (e.g., worms or viruses [40]), which grow over time, and are eventually fixed, dropping the system back into its standard state.
3. FISSD (Fast Increase, Stable State, Slow Decrease): these arise typically as the result of phenomena such as flash crowds, which appear very suddenly, are maintained for some time, and then gradually disappear.
4. FISD (Fast Increase, Slow Decrease): these represent, for completeness, a special case of the previous anomaly with no stable state.

We also include negative anomalies as well as positive, i.e., traffic decreases as well as traffic increases. Each simulation includes only one type of anomaly. Most of our reported results focus on FIFD anomalies, because the results for the other anomaly types mostly mirror those reported.

Parameter θ determines the number of anomalies that are injected. We draw θ from a Poisson distribution, θ ∼ Poisson(λ), where λ = 3 per data segment, i.e., on average there will be 12 anomalies in each simulation. It is important that there are multiple anomalies per segment, because one of the effects we seek to include in our simulations is interactions in the detector between anomalies.

The parameter ϑ controls the starting locations of anomalies. Anomalies, by definition, can happen at any time with no preference, so ϑ ∼ Uniform(0, b), where b is the length of the data segment.

The parameter ψ changes the width of anomalies: we dilate the anomaly shape function by a factor W + ψ. In our experiments we consider a wide range of values ψ = {0, 4, 10, 20, 40}.

The parameter φ scales the magnitude of anomalies: we scale the shape function of the anomaly type by this factor. In our experiments we have considered φ = {0.1, 0.3, 0.5, 0.7, 1}, resulting in magnitudes φM.

Fig. 6. The performance of BasisEvolution on cleaned & uncleaned data. The effectiveness is measured with FAP_NP (False-Alarm Probability on Non-anomalous Points) and DP_AP (Detection Probability on Anomalous Points). The whole simulation experiment is repeated 100 times to get statistical results. The performance of BasisEvolution on cleaned data is better than that on uncleaned data.

4.3. Investigation on BasisEvolution

Before the experiments, we need a thorough investigation of the whole BasisEvolution framework. In the data cleaning phase, we need to clarify the importance of this step, and make a proper selection of the parameter and detection method (see Fig. 5). In addition, we have to estimate a set of parameters in the basis update process.

As discussed before, in real network traffic it is difficult to find the ground truth, while synthetic traffic volumes that highlight the features of network traffic are suitable for analysis. Hence, all experiments in this section are on synthetic data.

We first demonstrate the importance of the data cleaning process to the entire framework by comparing the performance of BasisEvolution on cleaned and uncleaned data. Then we confirm the maximum number of iterations and the detection algorithm by examining the degree of cleanliness of the revised traffic after the data cleaning process. As for the parameters (shown in Eqs. (10) and (11)) in the basis update process, we have found rough parameter settings that are not too far from optimal.

4.3.1. The impact of data cleaning process

Fig. 6 compares the detection performance of BasisEvolution under two conditions: with and without the data cleaning process. We see that the framework performs better when the data is cleaned, which means basis vectors derived from cleaned data describe the "normal region" better than those from uncleaned data. If the data is uncleaned, and contains traces of intrusions, BasisEvolution may not detect future instances of those intrusions, since it will presume that they are normal, which leads to a higher false-alarm probability and a lower detection probability. Hence, the data cleaning process is necessary for the entire framework.

4.3.2. Detector selection for data cleaning process

In our test, we compare the performance of single approaches and a combination method, choosing PCA- and ANN-based detectors. The single approaches are the PCA- and ANN-based detectors, while the combination detector merges the detection results of each detector at each iteration. More specifically, we select the NNBP method as the ANN algorithm.
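The simulation recipe described above, a 24-h sine cycle, white Gaussian noise at 10% of the cycle amplitude, and injected anomalies, can be sketched as follows (a simplified illustration: we use rectangular FIFD shapes and a fixed anomaly count rather than the paper's Poisson draw, and all names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def typical_traffic(n_hours, amplitude=1.0, noise_frac=0.1):
    """Sine with a 24-h period plus white Gaussian noise (10% of amplitude)."""
    t = np.arange(n_hours)
    diurnal = amplitude * np.sin(2 * np.pi * t / 24)
    return diurnal + rng.normal(0, noise_frac * amplitude, n_hours)

def inject_fifd(traffic, n_anomalies=12, width=10, magnitude=0.5):
    """Add rectangular FIFD (fast increase, fast decrease) anomalies at
    uniformly random start locations; signs are chosen at random so that
    both traffic increases and decreases occur."""
    x = traffic.copy()
    starts = rng.integers(0, len(x) - width, size=n_anomalies)
    for s in starts:
        x[s:s + width] += magnitude * rng.choice([-1, 1])
    return x, starts

# Eight weeks of hourly data (1344 points), as in the paper's simulations,
# with on average 12 anomalies per run.
clean = typical_traffic(8 * 7 * 24)
noisy, starts = inject_fifd(clean, n_anomalies=12, width=10, magnitude=0.5)
print(noisy.shape)   # (1344,)
```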
In PCA, we define the "normal subspace" with thresholding (e.g., [mean − 3·std, mean + 3·std], where mean and std are the average and standard deviation of the principal components), and find anomalous points with the SPE measurement [18]. Since PCA is sensitive to polluted data [25], the "normal subspace" is distorted at the first iteration, and the false alarms would be very high. As Fig. 5 shows, the continuous detection and gap padding in the following iterations bring the traffic closer to the normal range.

NNBP finds anomalous points by comparing the differences between real traffic and predicted traffic volumes. If the differences fall outside the threshold (e.g., c·std, where c is a constant value and std is the standard deviation of the test samples), the corresponding points are identified as anomalies. In the first iteration, the data is polluted and the corresponding predictions are not accurate. As with PCA, the successive layers of detection and interpolation (see Fig. 5) bring the traffic back towards the regular level.

As for the combination detector, the set of detected anomalous points is larger because it merges the detection results of both PCA and NNBP. With those anomalous points revised to "normal" values in each iteration (see Fig. 5), the typical traffic can be recovered more effectively.

We measure the performance of the three detectors by detecting on the corresponding modified traffic. More specifically, we detect the residuals between the original traffic and the modified traffic through thresholding (see Section 3.4.1). If the detection accuracy (DP_AP) is high and the false-alarm probability (FAP_NP) is low, a larger part of the modified traffic lies in the normal region. Fig. 7 illustrates the performance of the three detectors in the data cleaning process. The combination detector performs best, with the lowest false-alarm probability, in accordance with the discussion above. The lowest false-alarm probability means the traffic cleaned by the combination detector is closest to the "normal" traffic, from which we can extract the most accurate "normal patterns".

Fig. 7. The performance of three detectors in the data cleaning process: PCA, NNBP, and their combination. The effectiveness is measured with FAP_NP (False-Alarm Probability on Non-anomalous Points) and DP_AP (Detection Probability on Anomalous Points). The whole simulation experiment is repeated 100 times to get statistical results. The combined approach performs better than either single detector.

4.3.3. The estimation of parameter maxC in data cleaning process

In the data cleaning process, the parameter maxC is defined as the maximum number of iterations: both the anomaly detection and gap filling methods repeat maxC times to clean the data. We try to estimate a small value of maxC such that the cleaned traffic is close to "normal", and BasisEvolution achieves its best detection performance.

In the simulation, the candidate set for maxC is {1, 3, 5, 7, 9}. We choose the optimal value of maxC through two comparisons: comparing the performance of the data cleaning process (similar to Section 4.3.2), and comparing the performance of BasisEvolution (similar to Section 4.3.1). The results can be seen in Fig. 8(a) and (b).

In Fig. 8(a), we measure the performance of the data cleaning process by detecting on the residuals between the original traffic and the cleaned traffic. The cleaner the traffic, the higher the detection performance (lower false-alarm probabilities and higher detection probabilities) on the residuals. From Fig. 8(a), the false alarms are high when maxC is smaller than 5; however, when maxC is larger than 5, the corresponding curves not only perform better, but are also close to each other. As for Fig. 8(b), when maxC is larger than 3, the corresponding curves are also similar to each other, with higher detection performance. Based on these observations, we choose maxC = 5 in later experiments.

4.3.4. Parameter estimation in basis update

The parameters we need to assess are λ_d, λ_1 and λ_2. Given our motivation for the distance measurement, i.e., that the new basis should both maintain similarity to the immediately previous one and stay close to the historical mean, the importance of the two parts in Eq. (11) should be the same, and we set the weighting parameters λ_1 = λ_2 = 0.5.

Parameter λ_d determines the trade-off between the measurement constraint and the importance of changes. A larger λ_d gives the constraint more weight, so the approximations fit the data less closely, whereas a smaller λ_d means the approximations are a better fit to the data. We set the candidate solutions for λ_d to be {0.01, 0.1, 1, 2, 10, 100}. Fig. 9 shows the detection performance with respect to different λ_d values. We find that an optimal detection result (a higher detection probability and a lower false-alarm probability) is obtained around λ_d = 2, with the corresponding curve tending to the top left of the plot. Hence, we finally choose λ_d = 2, λ_1 = 0.5, and λ_2 = 0.5 in the following experiments.

4.4. Experiment results on synthetic data

We conducted several classes of synthetic experiments. In the first class, we compare anomaly detectors on fixed-width anomalies, but with anomalies that are not simple spikes. In the second class, we consider the effect of anomaly width on detection performance. In the third and fourth classes, we consider the same issues for magnitude.

In each case below we generate traffic and inject anomalies as described above. We replicate the experiments 100 times, and report aggregated statistics.

Table 2
We fix the value of DP_AP, and report the corresponding FAP_NP of five methods. FAP_NP is the False-Alarm Probability on Non-anomalous Points. DP_AP is the Detection Probability on Anomalous Points. DNE means "Does Not Exist". BE is short for BasisEvolution. Note that very small FAP_NP values are reported (shown in bold), as required.

Method    Fixed DP_AP
          0.3        0.5       0.6
BE        0.00063    0.0073    0.020
OMP       0.0027     0.18      DNE
PCA       0.019      0.047     0.099
RPCA      0.025      0.047     0.064
NNBP      0.27       DNE       DNE

4.4.1. Constant-width results

We start by considering anomalies with random magnitude, but fixed width ψ = 10. The resulting ROC curves are shown in Fig. 10(a), which shows that as the threshold q increases, both the Detection Probability on Anomalous Points (DP_AP) and the False-Alarm Probability on Non-anomalous Points (FAP_NP) decrease, i.e., there is a trade-off between the two types of errors. Noteworthy is the fact that BasisEvolution (BE) outperforms the other methods (we prefer curves towards the top left of the diagram). Table 2 shows,
26 H. Xia et al. / Computer Networks 135 (2018) 15–31
DP_AP
0.5 0.4
TP_AP
maxC=9
0.45 0.3 maxC=7
maxC=5
0.4 0.2 maxC=3
maxC=1
0.35 0.1
0.14 0.15 0.16 0.17 0.18 0.19 0 0.01 0.02 0.03 0.04 0.05
FAP_NP FAP_NP
(a) The performance of the data cleaning (b) The performance of BasisEvolution.
process.
Fig. 8. The performance of data cleaning process (a) and the whole framework (b) under different values of maxC. maxC ∈ {1, 3, 5, 7, 9}. The effectiveness is measured with
FAP_NP (False-Alarm Probability on Non-anomalous Points) and DP_AP (Detection Probability on Anomalous Points). The whole simulation experiment repeats 100 times to
get statistical results. A better value for maxC is 5.
=0.1 In Fig. 10(c), apart from the NNBP, as the threshold q in-
d
0.4
d
=1 creases, AD_AD (Aggregation Degree on Anomaly Duration) in-
=2 creases, while CD_AD (Consistency Degree on Anomaly Duration)
0.2 d
decreases, showing another trade-off. As in (b), we prefer curves
=10
d towards the top right. It is beyond any doubt that BasisEvolution
=100
0 d outperforms the rest four methods across all threshold values.
0 0.01 0.02 0.03 0.04 0.05 0.06
FAP_NP 4.4.2. The effect of width
We now test the five anomaly detection methods on anoma-
Fig. 9. Comparison of detection performance in terms of FAP_NP (False-Alarm Prob-
ability on Non-anomalous Points) and DP_AP (Detection Probability on Anomalous lies whose magnitudes are randomly distributed, but with fixed
Points) when λd ∈ {0.01, 0.1, 1, 2, 10, 100}. The entire simulation was repeated 100 width parameter ψ = {0, 4, 10, 20, 40}, to understand the affect of
times to get statistical results. A better performance can be found at λd = 2. anomaly width on detection. As for detecting the FIFD type of
anomalies, Fig. 11(a) shows the performance of BasisEvolution de-
creases as the width parameter ψ grows. That is, it is more diffi-
more precisely, a number of FAP_NP values for given DP_AP val- cult to correctly detect wider anomalies. This is to be expected:
ues. The important feature shown in this table, apart from Basi- as the anomaly width increases, there are more points in the
sEvolution’s superiority, is that it can achieve very low false-alarm anomaly, and thus these points are more likely to be considered
probabilities (as low as 0.0 0 063), which are necessary in practical “normal” by any detector, and by any metric. A single spike will
anomaly detection algorithms. always be the easiest type of event to detect, while although it
The trade-off is a relative low DP_AP, but this is less important. might be possible to detect the change into a broad flat anomaly,
In real anomaly detection settings, it is far more important to avoid it will always be difficult to correctly categorize the flat region.
reporting a large number of false alarms, while still providing use-
ful detections, than to detect all events at the cost of false alarms 4.4.3. Constant-magnitude results
that degrade the operators confidence in any of the detections. This We now consider the effect of magnitude of the anomalies on
also emphasizes the need for simulation studies, as it would have detection, starting with a set of tests with fixed magnitude, φ =
been impractical for us to measure such a low probability without 0.5, but random width. The ROC curves for metrics DP_AP (De-
a large set of data on which to perform tests. tection Probability on Anomalous Points) and FAP_NP (False-Alarm
In Fig. 10(b), we show the equivalent of an ROC curve for the Probability on Non-anomalous Points) are shown in Fig. 12.
pair of metrics AD_AA (Aggregation Degree on Anomaly Amounts) As in Fig. 10, both DP_AP and FAP_NP decrease with the incre-
and CD_AA (Consistency Degree on Anomaly Amounts), although ment of the threshold q. We prefer the curves toward top left of
in this plot we prefer curves to the top right in the diagram. As the the diagram. The results are similar to those for constant width,
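The point-level metrics behind these ROC-style curves can be computed directly from binary anomaly masks, and sweeping the threshold q traces out the trade-off described above. A sketch, where the Gaussian score is a toy stand-in for the paper's residual-based detector:

```python
import numpy as np

def dp_ap_fap_np(predicted, truth):
    """DP_AP: fraction of truly anomalous points that are detected.
    FAP_NP: fraction of non-anomalous points that are falsely flagged."""
    predicted = np.asarray(predicted, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    dp_ap = predicted[truth].mean() if truth.any() else 0.0
    fap_np = predicted[~truth].mean() if (~truth).any() else 0.0
    return float(dp_ap), float(fap_np)

# Toy residual scores: anomalous points receive larger scores on average.
rng = np.random.default_rng(0)
truth = np.zeros(1000, dtype=bool)
truth[rng.choice(1000, size=50, replace=False)] = True
score = rng.normal(0.0, 1.0, size=1000) + 4.0 * truth

# Sweeping the threshold q shrinks the detected set, so both
# DP_AP and FAP_NP decrease, tracing the trade-off between them.
curve = [dp_ap_fap_np(score > q, truth) for q in (0.0, 1.0, 2.0, 3.0)]
```

Because raising q can only shrink the detected set, both metrics are non-increasing in q, which is exactly the trade-off the curves above display.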
(a) FAP_NP versus DP_AP. (b) AD_AA versus CD_AA. (c) AD_AD versus CD_AD.
Fig. 10. Performance of the 5 methods for detecting FIFD anomalies. Here the width parameter is ψ = 10, and the magnitude is random. BE is short for BasisEvolution. FAP_NP is the False-Alarm Probability on Non-anomalous Points. DP_AP is the Detection Probability on Anomalous Points. AD_AA is the Aggregation Degree on Anomaly Amounts. CD_AA is the Consistency Degree on Anomaly Amounts. AD_AD is the Aggregation Degree on Anomaly Duration. CD_AD is the Consistency Degree on Anomaly Duration.
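For reference, the kind of injection these experiments rely on can be sketched as follows: an anomaly of width ψ points and relative magnitude φ is added to a clean diurnal signal. The flat square-pulse shape and the relative-magnitude convention are illustrative assumptions, not the paper's exact generator:

```python
import numpy as np

def inject_anomaly(traffic, start, width, magnitude):
    """Add a flat anomaly of `width` points whose height is `magnitude`
    times the local traffic level (a square pulse for illustration)."""
    out = np.asarray(traffic, dtype=float).copy()
    end = min(start + max(int(width), 1), len(out))
    out[start:end] += magnitude * out[start:end]
    return out

# Two weeks of hourly traffic with a diurnal cycle.
t = np.arange(24 * 14)
clean = 10.0 + np.sin(2 * np.pi * t / 24)
dirty = inject_anomaly(clean, start=120, width=10, magnitude=0.5)
```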
Fig. 11. DP_AP (Detection Probability on Anomalous Points) versus FAP_NP (False-Alarm Probability on Non-anomalous Points) for anomaly widths ψ ∈ {0, 4, 10, 20, 40} and magnitudes φ ∈ {0.1, 0.3, 0.5, 0.7, 1}.

(a) FAP_NP versus DP_AP. (b) AD_AA versus CD_AA. (c) AD_AD versus CD_AD.
Fig. 12. Performance for detecting FIFD anomalies with fixed magnitude φ = 0.5 and random width. FAP_NP is the False-Alarm Probability on Non-anomalous Points. DP_AP is the Detection Probability on Anomalous Points. AD_AA is the Aggregation Degree on Anomaly Amounts. CD_AA is the Consistency Degree on Anomaly Amounts. AD_AD is the Aggregation Degree on Anomaly Duration. CD_AD is the Consistency Degree on Anomaly Duration.
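The duration-based metrics above assume that runs of flagged points are reported as single anomaly events rather than as isolated points. A simple gap-based grouping sketch; the `gap` parameter is an illustrative choice, not the paper's clustering method:

```python
import numpy as np

def group_events(anomalous, gap=1):
    """Group flagged time indices into anomaly events: indices whose
    spacing is <= `gap` belong to one (start, end) event."""
    idx = np.flatnonzero(anomalous)
    if idx.size == 0:
        return []
    events, start, prev = [], idx[0], idx[0]
    for i in idx[1:]:
        if i - prev > gap:          # too far from the previous point:
            events.append((start, prev))  # close the current event
            start = i
        prev = i
    events.append((start, prev))
    return events

mask = np.zeros(50, dtype=bool)
mask[[5, 6, 7, 20, 31, 32]] = True
events = group_events(mask)  # three events
```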
(a) The initial basis vectors on the first time segment: April 4th-18th, 2014. (b) Evolved basis on the following time segment.
Fig. 13. Basis vectors for the first two time segments of the second IIJ link, with periods c1 = 1 day, c2 = 1 week, and c3 = 2 weeks. All functions are normalized to have zero mean.
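Basis vectors such as those in Fig. 13 come from an SVD-based basis generation step. A bare-bones illustration on synthetic data; the signal, the days-by-hours matrix shape, and the choice of three components are assumptions for the sketch, not the paper's exact procedure:

```python
import numpy as np

# Two weeks of hourly traffic with daily and weekly cycles plus noise.
rng = np.random.default_rng(1)
t = np.arange(24 * 14)
x = (np.sin(2 * np.pi * t / 24)                # daily cycle
     + 0.5 * np.sin(2 * np.pi * t / (24 * 7))  # weekly cycle
     + 0.1 * rng.normal(size=t.size))

# Arrange the segment as a (days x hours) matrix and take the SVD;
# the leading right-singular vectors act as basis functions, and
# their per-day weights capture the weekly structure.
X = x.reshape(14, 24)
X = X - X.mean()                 # normalize to zero mean
U, s, Vt = np.linalg.svd(X, full_matrices=False)
basis = Vt[:3]                   # the first three basis vectors
```

A rank-3 reconstruction from this basis recovers the signal up to the noise, which is why a concise basis can stand in for "normal" traffic.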
Fig. 14. DP_AP versus FAP_NP for the remaining methods when the anomalies detected by (a) PCA, (b) RPCA, and (c) OMP are treated as ground truth. BE is short for BasisEvolution.
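Treating one method's detections as ground truth for another, as in Fig. 14, amounts to measuring the overlap between two binary anomaly masks. A sketch with hypothetical masks:

```python
import numpy as np

def consistency(detected_a, detected_b):
    """Fraction of the points flagged by method A that method B also
    flags, i.e. B's detection probability when A's detections are
    treated as ground truth."""
    a = np.asarray(detected_a, dtype=bool)
    b = np.asarray(detected_b, dtype=bool)
    return float((a & b).sum() / a.sum()) if a.any() else 0.0

# Hypothetical anomaly masks for two detectors over 12 time points.
pca_mask = np.array([0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0], dtype=bool)
rpca_mask = np.array([0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1], dtype=bool)
share = consistency(pca_mask, rpca_mask)  # 3 of PCA's 4 points shared
```

Note that the measure is not symmetric: swapping the roles of the two methods normalizes by the other method's detection count.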
contains traffic measurements (in and out) on 10 links, at time intervals of 1 h, for a total of 5857 observed data samples for each link in each direction.

An intrinsic problem we have already discussed is that it is difficult to obtain ground truth for anomaly detection. Even if we adopted expert advice to label the data, there would remain some anomalies not seen by the experts, because they are not traditional ones. Hence, our false-alarm probabilities would be elevated. Thus, we used simulations to provide quantitative comparisons to understand the merits of the various anomaly detection methods.

However, we acknowledge that there is an artificiality in any simulation. Here we aim to at least comment on the ability of BasisEvolution to find anomalies in real data by using common methods (such as PCA) to find anomalies, which we then treat as if they were real. Using those "real" anomalies, we compare BasisEvolution with other widely used detection methods. In this section, however, we omit NNBP results, as this method performed very poorly in the simulations.

We divide each series into segments two weeks in length, as in the simulations. We repeat the process for each algorithm as described above, for each link. For example, Fig. 13 shows our discovered basis functions for the first two time segments: the initial segment, and its evolved form in the second time segment. In each, we can see that the first two basis vectors show the daily and weekly cycles in the data, with the third containing more noise.

Next, we compare the anomaly detectors against each other by treating the anomalies found by one method as if they were real, and then seeking to detect these using an alternative method.

Take PCA-detected anomalies as an example. Fig. 14(a) illustrates the performance of the other three methods (OMP, RPCA, and BasisEvolution) in comparison. Note that here, the False-Alarm Probability of PCA-detected Non-anomalous Points (FAP_NP) and the Detection Probability of PCA-detected Anomalous Points (DP_AP) are not with respect to ground truth, and thus the meaning of the graph is different: it shows how consistent the approaches are. Unsurprisingly, RPCA is the most consistent with PCA, in that it finds the largest share of the same anomalies. OMP is the most different from PCA, and our approach is in between the two. Similar conclusions can be found in Fig. 14(b), where RPCA-detected anomalies are regarded as the real ones. However, for OMP-detected anomalies (Fig. 14(c)), PCA, RPCA and BasisEvolution all share almost the same number of anomalies, which is in accordance with their underlying mechanisms. The alternative measures we considered above reinforce the same story.

One interesting aspect of Fig. 14 is that the overall detection probabilities are not high, which means the overlap among the methods is small. One explanation is that different anomalies are usually found by different methods because of their inherent mechanisms. Hence, RPCA and PCA perform best in Fig. 14(a) and (b), respectively. OMP, PCA and BasisEvolution discover different types of anomalies. This is what we might expect, given the different assumptions and algorithms upon which they are based.

Fig. 15. Detection results of the four algorithms (OMP, PCA, RPCA, and BasisEvolution) on ten weeks of data on link 4, IIJ: Apr 5th, 2014 to Jun 14th, 2014. BE is short for BasisEvolution. The labeled points are A(624, 2.12e+08), B(983, 2.32e+08), and C(1179, 1.411e+08).

To further analyze the performance of the four methods, we randomly present the detection results of ten weeks of data, as shown in Fig. 15. Since we do not have ground truth for the real data, the possible methods of identifying anomalous points are expert advice and voting. In view of real network scenarios and holiday influences in Japan, we can label several anomalous points, such as A(624, 2.12e+08). In addition, we adopt a voting strategy to estimate the reliability of each detected anomalous point. For any detected anomalous point, it will win one vote if another tech-
nique also identifies it as abnormal. In our situation, the number of votes that one point can obtain ranges from 0 to 3. A larger number of votes indicates that the corresponding anomalous point is shared by more approaches, and hence that the corresponding methods are more reliable.

Take BasisEvolution as an example. A total of 15 anomalous points are detected in this context. Among them, five points have three votes, and can be detected by all of the other three approaches; four points have two votes; three points have one vote; two points (A and B in Fig. 15) have been labeled as anomalous even though they have not been detected by any other technique; and only the remaining point (C in Fig. 15) is neither detected nor labeled. Similar descriptions of the other methods can be seen in Table 4. Compared with the other three methods, 12 of the 15 BE-detected points can also be detected by others, and only one point has neither a vote nor an expert label, which indicates the strong reliability of BasisEvolution. Regarding PCA, RPCA and OMP, larger proportions of their detected points receive zero votes, which suggests that the reliability of BasisEvolution is better than that of the other three. In addition, there are two zero-vote BE-detected points that are authorized as real anomalies, while OMP has one and both PCA and RPCA have zero. These results, as well as the simulation, support the conclusion that BasisEvolution can find anomalous points that other approaches cannot.

Table 4
Description of anomaly detection results for the four algorithms in terms of the reliability of the detected anomalies, based on Fig. 15. Here, total anomalies is the total number of anomalies detected by each approach; amount of real anomalies counts the detected points that are real anomalies based on expert advice but fail to get any vote; and No. with votes records the number of detected anomalous points with a given number of votes. BasisEvolution is more reliable and can find anomalies that the other three cannot.

Method   Total anomalies   Amount of real anomalies   No. with 3 votes   2 votes   1 vote   0 votes
BE       15                2                          5                  4         3        1
OMP      13                1                          5                  0         2        5
PCA      24                0                          5                  4         9        6
RPCA     24                0                          5                  4         7        8

6. Conclusion and future work

Accurately detecting anomalies without pre-knowledge plays an important role for network operators. However, to be practical, methods need to have a very low false-alarm probability, and there are other challenges. Strong anomalies can pollute the normal space used as a baseline to detect anomalies, and anomalies can themselves have different shapes that make them harder or easier to detect.

In this paper, we address these problems with BasisEvolution, which looks for a concise, accurate, and easy-to-understand basis that represents the typical traffic patterns more precisely. We evolve the basis so as to maintain similarity with the old one in describing the new data. We show it can then detect anomalous points on the residual series with higher accuracy than common alternatives, and it still works even when the anomalies are complex.

However, there are many issues left for future work. For example, the basis generation algorithm we used is based on SVD, which is not feasible for very large data scenarios. The first issue is to find a robust, fast algorithm to replace SVD, such as the pursuit principle used in OMP. Another point is that the size of the data segments used in our framework is fixed and our algorithm proceeds in batches. A more realistic algorithm should be able to tackle arbitrary new segments of data as they arrive. We plan to extend BasisEvolution in the future to tackle these issues.

Acknowledgment

The authors thank IIJ for providing data. This work was supported by the National Key Basic Research Program of China (2013CB329103 of 2013CB329100), the China Scholarship Council (201506050072), as well as Australian Research Council (ARC) grant DP110103505 and the ARC Centre of Excellence for Mathematical and Statistical Frontiers (ACEMS) CE140100049.
Hui Xia is a Ph.D. student in the Department of Computer Science at Chongqing University. She received a Bachelor's degree in Computer Networks from Nanchang University. Her research interests include network traffic analysis, data mining, and personal recommendation.

Bin Fang received the B.S. degree in electrical engineering from Xi'an Jiaotong University, Xi'an, China, the M.S. degree in electrical engineering from Sichuan University, Chengdu, China, and the Ph.D. degree in electrical engineering from the University of Hong Kong, Hong Kong. He is currently a Professor with the College of Computer Science, Chongqing University, Chongqing, China. His research interests include computer vision, pattern recognition, information processing, biometrics applications, and document analysis. He has published more than 120 technical papers and is an Associate Editor of the International Journal of Pattern Recognition and Artificial Intelligence. Prof. Fang has been the Program Chair and a Committee Member for many international conferences.

Matthew Roughan received a Bachelor's degree in mathematical science from Adelaide University and the Ph.D. degree in applied mathematics from Adelaide University. Currently, he is a professor with the School of Mathematical Sciences, the University of Adelaide. His research interests include Internet measurement and estimation, network management, and stochastic modeling, in particular with respect to network traffic and performance modeling. He has published more than 150 technical papers, and is a member of the MASA, AustMS, IEEE and ACM. Prof. Roughan has been a chair of the ACM SIGCOMM Doctoral Dissertation Award Committee, and won the 2013 ACM SIGMETRICS Test of Time Award.

Kenjiro Cho received the B.S. degree in electronic engineering from Kobe University, the M.Eng. degree in computer science from Cornell University, and the Ph.D. degree in media and governance from Keio University. He is Deputy Research Director with the Internet Initiative Japan, Inc., Tokyo, Japan. He is also an Adjunct Professor with Keio University and the Japan Advanced Institute of Science and Technology, Tokyo, Japan, and a board member of the WIDE project. His current research interests include traffic measurement and management, and operating system support for networking.

Paul Tune received a Ph.D. degree at the Centre for Ultra-Broadband Information Networks (CUBIN), Dept. of E&E Engineering, the University of Melbourne. He worked as a postdoctoral research fellow at the School of Mathematical Sciences, the University of Adelaide. He now works at Image Intelligence, Sydney. His research interests are network measurement, information theory, and compressed sensing.