1. Introduction
The world is drowning in a deluge of data while starving for knowledge at the same time. Anomalous events occur relatively infrequently. However, when they do occur, their consequences can be dramatic, and quite often negative. Detecting such events is one of the major research fields in data mining. Recent research in the field has highlighted the essentials of a secure system in terms of sensitive data and networks. The security of networked systems has grown in importance as more sensitive information is stored and manipulated online. Such systems need to be protected by an intrusion detection system (IDS). An intrusion can be any set of actions that attempt to compromise the integrity, confidentiality or availability of a resource. Intrusion detection is the process of monitoring and analysing the events occurring in a computer system in order to detect such actions. Over the past ten years, intrusion detection and other security technologies such as cryptography, authentication and firewalls have gained increasing importance. Intrusion detection has given data mining an opportunity to make several contributions to the field. Intrusion detection can also be accomplished by soft computing and inductive learning [3] [4]. Anomaly detection, also known as outlier detection, refers to detecting patterns in a given data set that do not conform to an established behaviour [1] [3]. The detected patterns are called anomalies. They are crucial and need attention in several application domains, where they sometimes indicate serious threats. Anomaly detection has been a hot topic over the past years. It has been proposed as a method for intrusion detection systems, where it can be accomplished with thresholds and statistics [2] [3]. The method attempts to identify behaviour that does not conform to normal behaviour in a large data set.
This approach has a better chance of detecting novel attacks and is applicable in various domains such as intrusion detection, fraud detection, fault detection, event detection, system monitoring, detection of eco-system disturbances, health, and insurance. Intrusion prevention techniques, such as user authentication (e.g., using passwords or biometrics), avoiding programming errors, and information protection (e.g., encryption), have been used to protect computer systems. IDSs can be distinguished by their audit data source. Most current IDSs are based on hand-crafted signatures that are developed by manually encoding expert knowledge [4]. In the past, such IDSs failed to generalize to new attacks or to attacks without signatures. Recently, there has been increased interest in data mining approaches to building detection models for IDSs. These models generalize from both known attacks and normal behaviour in order to detect unknown attacks [1] [4]. They can also be built from audit data in a faster and more automated way than models encoded manually by domain experts. The term data mining is frequently used to designate the process of extracting useful patterns from large databases. Historically, the notion of data mining has covered finding useful patterns in data, knowledge discovery in data sets, information harvesting, and data pattern analysis. There are several approaches in data mining towards intrusion detection.
International Journal of Computational Intelligence and Information Security, September 2011 Vol. 2, No. 9
The phrase knowledge discovery in databases (KDD) denotes the process of extracting useful knowledge from large data sets. In this usage, data mining refers to one particular step, which extracts patterns from the given databases. This step is followed by a series of further KDD steps that turn the extracted patterns into knowledge. Indeed, the KDD process makes use of data mining approaches to extract high-level information [3]. This survey provides a structured and comprehensive overview of anomaly detection. We have grouped existing techniques into categories based on the underlying approach adopted by each technique. For each category we have identified the key assumptions made to differentiate between normal and anomalous behaviour.
2. Background
In this section, we present the challenges that arise in identifying non-conforming patterns in a large body of data. Defining a representative normal region that encompasses every possible normal behaviour is very difficult. Several further issues are identified in the context of anomalies: the boundary between normal and outlying behaviour is often not precise; the exact notion of an outlier differs between application domains; labelled data for training and validation is often unavailable; malicious adversaries try to make anomalous activity appear normal; the data may contain noise that resembles actual anomalies; and normal behaviour keeps evolving. The types of anomalies considered in this survey are point anomalies, contextual anomalies, and collective anomalies.
Figure 2: Sample graph plotted to show the behavioral and contextual attributes
[Figure 6: sample records (Id 1, 2, 3) with categorical, continuous and binary attributes and an Attack label (No, No, Yes)]
Figure 6: Nature of the data as Binary data, Categorical data, Continuous data and hybrid data [2].
Relationship among data: it is necessary to capture the relationships among data instances, such as sequential, temporal, spatial, spatio-temporal and graph-based relationships [2]. Some data is complex and must be examined for such relationships, like the motion of a spacecraft, as in Figure 7.
Figure 7: Complex data such as entrance of a car or plane with its updates in traffic
Labels: a label is assigned to each data instance to denote whether it is normal or anomalous, and it can help characterise the anomaly as a point, contextual or collective type. There are three basic approaches to this assignment, namely supervised, semi-supervised and unsupervised learning, which describe the normal and anomalous classes and are discussed later in this section. Once the anomaly type is known, it is easier to apply a method for learning and removing it. The use of labels in these approaches is summarized as follows:
1) Supervised anomaly detection: labels are available for both normal data and anomalies.
2) Semi-supervised anomaly detection: labels are available only for normal data.
3) Unsupervised anomaly detection: no labels are needed; the approach is based on the assumption that anomalies are very rare compared to normal data [2] [10].
Outcome and evaluation: this is the output to be achieved; the technique is evaluated after these components have been assembled into a working model. To improve efficiency, the technique proceeds in three distinct phases [2]:
1. Abstract local information, i.e. build a profile.
2. Evolve background information from the local abstractions.
3. Establish as anomalies the instances whose characteristics differ from the normal abstraction.
Output of anomaly detection: typically, the output generated by an anomaly detection technique is one of two types, scores or labels. Scoring techniques assign an anomaly score to each instance in the test data, according to the degree to which that instance is considered an anomaly. The output is therefore a ranked list of anomalies. Labelling techniques assign a label, either normal or anomalous, to each instance.
The next component of the system depicted in Figure 4 is the application domain. Domains that use anomaly detection techniques include intrusion detection systems, network intrusion detection systems, fraud detection, fault detection, and health systems [3].
We conclude this section with the general procedure for an anomaly detection system shown in Figure 8. Anomalies, or outliers, are the data points that are considerably different from the remainder of the data, and the data contains considerably more normal observations than abnormal ones. The general strategy for anomaly/outlier detection problems is given below.
1. Given a database D, find all the data points x ∈ D with anomaly scores f(x) greater than some threshold t.
2. Given a database D, find all the data points x ∈ D having the top n largest anomaly scores f(x).
3. Given a database D containing mostly normal (but unlabelled) data points, and a test point x, compute the anomaly score of x with respect to D.
Figure 8: General strategy of an anomaly detection problem
Research approaches include data mining, machine learning, statistical learning, and misuse techniques, among which data mining is the one illustrated in this survey.
Data mining techniques for anomaly detection: the label associated with a data instance denotes whether that instance is normal or anomalous. Depending on the extent to which labels are available, data mining techniques for anomaly detection operate in one of the following three modes [2]:
Supervised mode: methods in this mode assume that the training data has labels for both the normal class and the anomaly class. Any unseen data instance is compared against the model to determine which class it belongs to. Techniques have been proposed that inject artificial anomalies into a normal data set to obtain a labelled training data set. Two major issues arise. Firstly, the anomalous instances are far fewer than the normal instances in the training data. Secondly, obtaining accurate and representative labels for the anomaly class is difficult. This mode is similar to building predictive models.
Semi-supervised mode: methods in this mode assume that the training data has labels only for the normal class. Because they need no labels for the anomaly class, these methods are more widely applicable than supervised methods. A model is constructed only for the class corresponding to normal behaviour, and this model is used to identify anomalies in the test data. The converse approach, modelling only the anomaly class, is difficult in practice because it is very hard to build a model that covers every anomalous behaviour that can occur in the data.
Unsupervised mode: methods in this mode do not require labelled training data at all. They make the implicit assumption that normal instances are far more frequent than anomalous instances in the test data. If this assumption does not hold, such techniques suffer from a high false alarm rate. Many semi-supervised techniques can be adapted to operate in an unsupervised mode by using a sample of the unlabelled data set as training data. Such adaptation assumes that the test data contains very few anomalies and that the model learnt during training is robust to these few anomalies. This mode is elaborated with network-based anomaly detection in a later section.
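As an illustration of the semi-supervised mode, the sketch below learns a profile from normal-class data only and flags test values that deviate strongly from it. The z-score profile, the single numeric feature, the cut-off of 3.0 and all names are assumptions made for the example; the survey does not prescribe a specific model.

```python
import statistics

def fit_normal_profile(normal_values):
    """Learn a profile (mean, standard deviation) from normal-class data only."""
    return statistics.mean(normal_values), statistics.stdev(normal_values)

def is_anomalous(value, profile, z_max=3.0):
    """Flag a value whose distance from the normal mean exceeds z_max standard deviations."""
    mean, std = profile
    return abs(value - mean) / std > z_max

# Hypothetical training data: one numeric feature observed under normal behaviour.
normal_training = [10.0, 11.0, 9.0, 10.5, 9.5, 10.0]
profile = fit_normal_profile(normal_training)

# No anomaly labels were needed to build the profile.
flags = [is_anomalous(v, profile) for v in [10.2, 25.0]]
```

The same code can be read as the unsupervised adaptation described above if `normal_training` is replaced by an unlabelled sample that is assumed to contain very few anomalies.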
The unsupervised anomaly detection approach overcomes the dependence of supervised learning on labelled data by using data clustering algorithms, which make no assumptions about labels or classes for the data patterns. Clustering identifies dense and sparse regions and therefore reveals how the patterns are distributed across the data attributes. The patterns are grouped, and the anomalies are found in the small clusters. Two assumptions are made: normal patterns or connections far outnumber attacks, and attacks differ from the normal patterns. Clustering does not rely on predefined classes or on a labelled training data set; it is therefore a form of learning by observation [10] [9].
Step 1: Pattern representation refers to the number of classes, the number of patterns and the features of the data.
Step 2: Feature selection and extraction prune the data by reducing its dimensionality. Feature selection identifies the most effective subset of the original features. Feature extraction applies a transformation to the original features in order to generate new features. The data is normalized to prevent large-scale features from dominating, and a distance measure is used to quantify the clusters.
Step 3: Clustering groups the patterns into clusters based on similarity.
Figure 10: Clustering outline
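The normalization and distance measure mentioned in Step 2 can be sketched as follows. Min-max scaling and Euclidean distance are assumed choices for the illustration, as the survey does not name a specific normalization or metric; the two-feature patterns are hypothetical.

```python
import math

def min_max_normalize(patterns):
    """Rescale every feature to [0, 1] so that no large-scale feature dominates the distance."""
    columns = list(zip(*patterns))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        tuple((v - lo) / (hi - lo) if hi > lo else 0.0
              for v, lo, hi in zip(row, lows, highs))
        for row in patterns
    ]

# Hypothetical patterns: the second feature is on a much larger scale than the first.
raw = [(2.0, 1000.0), (4.0, 3000.0), (6.0, 5000.0)]
scaled = min_max_normalize(raw)

# After scaling, both features contribute equally to the Euclidean distance.
distance = math.dist(scaled[0], scaled[2])
```

Without the scaling step, the distance between the first and last raw patterns would be determined almost entirely by the second feature.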
Once the cluster assignment is fixed, the mean distance of an example to the cluster centres is used as the score. Using the K-means clustering algorithm, different clusters were specified and generated for each output class [6]. K-means clustering is a well-known data mining algorithm that has been used to detect anomalous user behaviour as well as unusual behaviour in network traffic. Two problems are inherent to K-means clustering algorithms. The first is determining the initial partition, and the second is determining the optimal number of clusters [7]. Figure 11 below depicts the K-means algorithm. As the algorithm iterates through the training data, each cluster's architecture is updated. In updating clusters, elements are moved from one cluster to another, which ultimately causes the values of the centroids to change. This change reflects the current cluster elements. Once there are no more changes to any cluster, the training of the K-means algorithm is complete.
1. Choose and define the number of clusters, K
2. Initialize the K centroids
3. Assign each pattern to the cluster with the nearest centroid
4. Calculate the mean of each cluster to be its new centroid
5. Repeat steps 3 and 4 until a criterion is met where no pattern can move to another cluster
6. Repeat n times and choose the best clustering solution
Figure 11: A sample K Means algorithm
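Steps 1 to 5 of Figure 11 can be sketched in a few lines. This is a minimal illustration and not the implementation evaluated in the cited work: initializing the centroids with K randomly sampled patterns, the fixed seed, the iteration cap and the toy patterns are all assumptions, and the restarts of step 6 are omitted.

```python
import math
import random

def k_means(patterns, k, max_iter=100, seed=0):
    """Return (centroids, assignment) for a list of numeric tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(patterns, k)  # step 2: assumed initialization from the data
    assignment = None
    for _ in range(max_iter):
        # Step 3: assign each pattern to the cluster with the nearest centroid.
        new_assignment = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in patterns
        ]
        # Step 5: stop once no pattern moves to another cluster.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 4: the mean of each cluster becomes its new centroid.
        for j in range(k):
            members = [p for p, a in zip(patterns, assignment) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assignment

# Toy patterns forming two well-separated groups (step 1: K = 2).
patterns = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
            (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, assignment = k_means(patterns, k=2)
```

Because the result depends on the initial centroids, step 6 would call `k_means` with several seeds and keep the solution with the smallest total distance of patterns to their centroids.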
The idea of this algorithm is to train a K-means model on normal data sets so that it clusters the points of normal behaviour. Initially, the number of clusters k is chosen and the patterns are assigned to clusters at random; the centres of the k clusters are then computed, and each pattern is assigned to the cluster with the closest centre. For each instance in the test data set, the probability of its belonging to the most probable cluster was computed. If this probability was below a threshold, the instance was flagged as anomalous. The mean distance of a cluster to its centre is used as the score [2] [13].
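The flagging step can be illustrated with a distance-based variant: instead of a membership probability below a threshold, the sketch flags a test instance whose distance to the nearest centroid is above a threshold. The centroids are assumed to come from a K-means model already trained on normal data; their values, the threshold and the function names are hypothetical.

```python
import math

def nearest_centroid_distance(x, centroids):
    """Anomaly score: distance from x to the closest centroid of the normal clusters."""
    return min(math.dist(x, c) for c in centroids)

def flag_anomalies(test_points, centroids, threshold):
    """Flag every test point that lies farther than threshold from all normal clusters."""
    return [nearest_centroid_distance(x, centroids) > threshold for x in test_points]

# Assumed output of K-means training on normal data (hypothetical values).
centroids = [(0.0, 0.0), (10.0, 10.0)]

test_points = [(0.5, 0.5), (9.0, 10.0), (5.0, 5.0)]
flags = flag_anomalies(test_points, centroids, threshold=2.0)
```

The threshold trades detection rate against false alarms; a common choice, not taken from the survey, is to derive it from the distances observed on the normal training data.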
6. Conclusions
Historically, intrusion detection systems were signature based and sometimes failed to detect attacks that had no signature. This gave data mining an opportunity, with increased effort devoted to detection models for such systems. The concept behind these systems is anomaly detection, which refers to distinguishing conforming from non-conforming patterns in the data. Anomaly detection is based on the general assumption that normal data instances occur in dense neighbourhoods, while anomalies occur far from their closest neighbours with respect to distance, density and cluster. To trace anomalies, unsupervised methods are used to identify new threats without labels. Clustering was the first choice among these methods, since it identifies dense and sparse regions in the distribution of the patterns. Unsupervised learning methods are referred to in the data mining domain as data clustering methods; they group the data into clusters based on similarity. The data consists of unlabelled patterns. K-means is a centroid-based technique in which every cluster is represented by its centre of gravity, so that intra-cluster similarity is high. The technique is scalable and efficient in processing large data sets. The K-means algorithm, a hard-partitioning clustering algorithm, was chosen for its simplicity and speed. The labels of the patterns were used for the evaluation, but never for the clustering procedure. Figure 12 concludes the study of existing methods of IDS.
Technique IDS Signature based Hand craft Signatures Models the normal operation of system/network and detects deviations from them Identification of confirming and nonconfirming patterns in data set Identifying distinct network connections or associations (connectionless traffic) and including enough detail to collect suspicious traffic Network packets Configuration about sensors Misuse detection Anomaly detection Anomaly detection is the main component Attacker profile System profile NIDES(Network intrusion detection expert system) enhancement of IDES IDES(Intrusion detection expert system) SRI International Metric Requirements Over past 5 years
Security profile DIDS(Distribution intrusion detection system) University of California at Davis, Lawrence Livermore Laboratory, Haystack Laboratory and the U.S. Air Force. Input : training data set
Process :build a model and use it in the test data Most Widely applicable (De) Checks an instance lies within a nearest low density then the instance is declared as anomalous or normal. widely applicable Output : identifying anomalies Checks data instances if closer/far to nearest cluster centroid as anomalous or normal. Input : training data set
Less widely applicable (D) -Data pruning -threshold to compute anomaly nearest neighbour with respect to distance
STAT/USTAT(State Transition Analysis Tool ) for the UNIX operating system environment
Process : uses the KNN measure to compute an anomaly Output : anomalous and normal class Pattern representation Quantify the clusters Clustering
K Means
Anomalies and attacks can be detected based on cluster sizes: large clusters correspond to normal data, and the remaining data points are outliers that correspond to attacks.
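The cluster-size rule can be sketched as follows. The cut-off separating large from small clusters is not given in the survey; the `min_fraction` parameter, its value and the example assignment are assumptions for illustration only.

```python
from collections import Counter

def label_by_cluster_size(assignment, min_fraction=0.2):
    """Label instances in large clusters as normal and instances in small clusters as attacks."""
    sizes = Counter(assignment)
    total = len(assignment)
    return ["normal" if sizes[c] / total >= min_fraction else "attack"
            for c in assignment]

# Hypothetical cluster ids for ten instances: clusters of size 6, 3 and 1.
assignment = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2]
labels = label_by_cluster_size(assignment)
```

The rule relies on the assumption stated earlier that normal instances far outnumber attacks; if attacks form a large cluster of their own, they are labelled normal.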
References
[1] D. E. Denning. An intrusion detection model. IEEE Transactions on Software Engineering, SE-13:222-232, 1987.
[2] Varun Chandola, Arindam Banerjee, and Vipin Kumar. Anomaly Detection: A Survey. ACM Computing Surveys, 09 2009.
[3] Klaus Julisch. Data Mining for Intrusion Detection: A Critical Review. IBM Research, Zurich Research Laboratory.
[4] Wenke Lee, Salvatore J. Stolfo, Philip K. Chan, E. Eskin, Wei Fan, Matthew Miller, Shlomo Hershkop and Junxin Zhang. Real Time Data Mining-based Intrusion Detection. Computer Science department, NC
[5] Peng Ning, North Carolina SU, and Sushil Jajodia, George Mason University. Intrusion Detection Techniques.
[6] Stolfo, S. J., Wei, F., Lee, W., Prodromidis, A., Chan, P. K. KDD Cup: knowledge discovery and data mining competition (1999).
[7] H. Gunes Kayacik, Nur Zincir-Heywood. Analysis of Three Intrusion Detection System Benchmark Datasets Using Machine Learning Algorithms. Dalhousie University, Faculty of Computer Science, 6050 University Avenue, Halifax, Nova Scotia, B3H 1W5.
[8] Kayacik, G. H., Zincir-Heywood, A. N. "Generating Representative Traffic for Intrusion Detection System Benchmarking". Proceedings of the IEEE CNSR 2005, Halifax, Canada, May 2005.
[9] Lazarevic, A. Ozgur, L. Ertoz, J. Srivastava, and V. Kumar. A Comparative Study of Anomaly Detection Schemes in Network Intrusion Detection. Proc. SIAM Intl Conf. Data Mining, May 2003.
[10] W. Lee, S. J. Stolfo. Data Mining Approaches for Intrusion Detection.
[11] Kumar, S., Spafford, E. H. (1994). An Application of Pattern Matching in Intrusion Detection. Technical Report CSD-TR-94-013, Purdue University.
[12] Mukkamala, S., Janoski, G., Sung, A. Intrusion detection using neural networks and support vector machines. In: Proceedings of IEEE International Joint Conference on Neural Networks. (2002) 1702
[13] Technical Report, Department of Computer Sciences, University of Wisconsin at Madison, Madison, WI, December, 2007.