
International Journal of Computational Intelligence and Information Security, September 2011, Vol. 2, No. 9

A Novel Approach to Detect Anomalies with Unsupervised Learning


Suhasini Sodagudi, Suneetha Manne and Prof. K. Rajasekahara Rao
Assistant Professor, Department of IT, VR Siddhartha Engineering College, Vijayawada
Assistant Professor, Department of IT, VR Siddhartha Engineering College, Vijayawada
Principal, KL University, Vijayawada, Andhra Pradesh, India
ssuhasini09@gmail.com, suneethamanne74@gmail.com, krr@kluniversity.in

Abstract
Data mining techniques are useful for searching large amounts of data to discover consistent and useful patterns. Applied to monitoring data recorded on a host or in a network, they can detect intrusions. Intrusion detection is needed to maintain a security management system; it analyses information within a computer or a network to identify possible security breaches. This paper emphasizes research on developing general and systematic methods for intrusion detection. System features can be observed by computing outliers, which reveal anomalies. This paper presents the k-means algorithm to compute outliers in normal datasets and to cluster the normal behaviour points; for a test data set, the most probable cluster is computed. In this study we provide a detailed description of the anomaly detection process.

Keywords: Intrusion Detection System, Anomaly Detection

1. Introduction
The world is drowning in a deluge of data while starving for knowledge. Anomalous events occur relatively infrequently; however, when they do occur, the consequences can be dramatic, and quite often in a negative sense. Detecting such events is one of the major research fields in data mining. Recent research in the field has highlighted the essentials of a secured system in terms of sensitive data and networks. The security of network systems has gained importance as more and more sensitive information is stored and manipulated online. Such systems need to be protected by constructing an intrusion detection system (IDS). An intrusion is any set of actions that attempts to compromise the integrity, confidentiality or availability of a resource, and intrusion detection is the process of monitoring and analysing the events occurring in a computer system. Over the past ten years, intrusion detection has received less attention than other security technologies such as cryptography, authentication and firewalls. The concept of intrusion detection has given data mining an opportunity to make several contributions to the field; it can also be pursued with soft computing and inductive learning [3][4].

Anomaly detection, also known as outlier detection, refers to detecting patterns in a given data set that do not conform to an established behaviour [1][3]. The detected patterns, called anomalies, are crucial and need attention in several application domains, where they can sometimes pose serious threats. Anomaly detection has been a hot topic over the past years. It is a method proposed for intrusion detection systems that can be accomplished with thresholds and statistics [2][3]. The method attempts to identify behaviour that does not conform to normal behaviour in a large data set. This approach has a better chance of detecting novel attacks and is applicable in domains such as intrusion detection, fraud detection, fault detection, event detection, system monitoring, detection of ecosystem disturbances, health care and insurance. Intrusion prevention techniques, such as user authentication (e.g., passwords or biometrics), avoidance of programming errors and information protection (e.g., encryption), have also been used to protect computer systems.

IDSs can be distinguished by their audit data source. Nowadays, most IDSs are based on hand-crafted signatures developed by manual encoding of expert knowledge [4]; in the past, such IDSs failed to generalize in detecting new attacks or attacks without signatures. Recently there has been increased interest in data mining approaches to building detection models for IDSs. These models are able to generalize from both known attacks and normal behaviour in order to detect unknown attacks [1][4]. They can also be used to generate audit data in a faster and more automated way than manual encoding by domain experts.

The term data mining is frequently used to designate the process of extracting useful patterns from large databases. Historically, the notion covers finding useful patterns in data, knowledge discovery in datasets, information harvesting and data pattern analysis. There are several data mining approaches to intrusion detection. The phrase knowledge discovery in databases (KDD) denotes the process of extracting useful knowledge from large data sets; in this denotation, data mining refers to the particular step that extracts patterns from the given databases, followed by further KDD steps that turn the patterns into useful knowledge. Indeed, the KDD process makes use of data mining approaches to extract high-level information [3]. This paper aggregates a survey to provide a structured and comprehensive overview of anomaly detection. We have grouped existing techniques into categories based on the underlying approach adopted by each technique, and for each category we have identified the key assumptions made to differentiate between normal and anomalous behaviour.

1.1 Intrusion Detection System and its Issues


IDSs use one of two detection techniques. A statistics based (anomaly) IDS includes various statistics, such as profiles of users, workstations, networks, remote hosts and groups of users, to determine normal network activity and to identify anomalous behaviour of the system [12]. A signature based IDS examines the packets in the network and compares them with known attack patterns called signatures [4]. Detection models can generalize from both the known attacks and the normal behaviour in order to detect unknown attacks.

An IDS detects malicious activities such as DoS attacks and port scans by monitoring network traffic and system activities. In this direction, the process maps the data to a feature space and labels points in the sparse regions of the feature space as anomalies. IDSs in a network environment can be distinguished as host based and network based [5]. A host based intrusion detection system identifies intrusions on a host machine by analysing system calls, application logs, file system modifications and other system activities; the data is collected through sensors with a software agent and then analysed. A network intrusion detection system is an independent platform that identifies intrusions by monitoring and analysing network traffic; the data is collected by sensors that can be located at choke points or at the borders of the network to be monitored.

In a passive system, the IDS uses sensors to detect information in logs and signals an alert to the console, to the owner, or to both. In a reactive system, also referred to as an intrusion prevention system, the IDS automatically responds to suspicious activity by resetting the connection or by reprogramming the firewall to block network traffic from the suspected source. A firewall is related to network security in the same way as an IDS [8][9]: a firewall aims to stop intrusions from taking place and limits access between networks, but it does not signal an alert for an attack originating inside the network, whereas an IDS evaluates a suspected intrusion once it happens, signals an alarm, and also watches for attacks that originate within the system by examining network traffic and communications and by matching heuristics and patterns against signatures [5].

Issues to be addressed in a statistics based IDS include data collection, data reduction, classification, behaviour reporting and response. The first issue is to use sensors to collect data and perform pre-processing to extract features, denoted feature extraction [11]. The second issue, data reduction, consists of analysing a collection of data in order to identify its most important components, thereby reducing processing time, communication overhead and storage requirements. The classification issue is the process of forming clusters; based on the behaviour of the clusters, attackers and intruders can be identified.

1.2 Anomaly Detection


The main idea is the identification of conforming and non-conforming patterns in data. The non-conformities are known as anomalies, outliers, discordant observations, exceptions, aberrations, surprises, peculiarities or rare cases in different application domains; the two terms anomalies and outliers are the most commonly used in this context [8][10]. Anomaly detection is used in a wide range of applications, such as fraud detection for credit cards, insurance or health care, intrusion detection for cyber security, fault detection in safety-critical systems, and military surveillance of enemy activities [2][3]. Its importance stems from the fact that anomalies in data translate to significant real-life events in a wide variety of application domains. For example, anomalies in credit card transaction data could indicate credit card or identity theft, and anomalous readings from a spacecraft sensor could signify a fault in some component of the spacecraft.

The rest of the paper is organized as follows. Section 2 presents the background on intrusion detection systems, Section 3 describes the key challenges and aspects of anomaly detection, Section 4 surveys intrusion detection techniques, and Section 5 presents unsupervised learning methods for network intrusion. We conclude our study with a comparative study of IDSs in Section 6.



2. Background
In this section we present certain challenges that arise when declaring non-conforming patterns in a large body of data. Defining a representative normal region that encompasses every normal behaviour is quite difficult, and is treated as a challenge in itself. Further intellectual issues in the context of anomalies are that the boundary between normal and outlying behaviour is often imprecise, the exact notion of an outlier differs between application domains, labelled data for training and validation may be unavailable, malicious adversaries adapt, data may contain noise that tends to resemble actual anomalies, and normal behaviour keeps evolving. The types of anomalies considered in this survey are point anomalies, contextual anomalies and collective anomalies.

2.1 Point Anomalies


A point anomaly is an individual data instance that is anomalous with respect to the rest of the data. This is one of the major research areas in anomaly detection. In a real-world credit card fraud detection system, the data set corresponds to transactions of an individual; if a transaction amount is unusually high or entirely different from the individual's usual spending, that point is a point anomaly. Figure 1 illustrates point anomalies in a simple 2-dimensional data set. The data has two normal regions, N1 and N2, since most observations lie in these two regions. Points that are sufficiently far away from the regions, e.g., points O1 and O2, and points in region O3, are anomalies [2][13].

Figure 1: A sample illustration to show an anomaly
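The following is a minimal sketch, in Python, of point-anomaly scoring on a 2-D data set like that of Figure 1. The synthetic normal regions, the centre estimates and the threshold are illustrative assumptions, not taken from the paper.

    # Sketch: score each point by its distance to the nearest normal-region
    # centre; points far from both regions (like O1, O2) are point anomalies.
    import numpy as np

    rng = np.random.default_rng(0)
    N1 = rng.normal(loc=[2.0, 2.0], scale=0.3, size=(100, 2))   # normal region N1
    N2 = rng.normal(loc=[6.0, 6.0], scale=0.3, size=(100, 2))   # normal region N2
    outliers = np.array([[0.0, 7.0], [8.0, 1.0]])               # points like O1, O2
    X = np.vstack([N1, N2, outliers])

    centres = np.array([N1.mean(axis=0), N2.mean(axis=0)])
    dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
    score = dists.min(axis=1)            # distance to nearest normal region

    threshold = 1.5                      # illustrative threshold
    print("point anomalies:", X[score > threshold])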

2.2 Contextual Anomalies


A contextual anomaly is a data instance that is anomalous in a specific context defined by some assumptions, conditions or behaviour; such anomalies are also referred to as conditional anomalies [2][13]. In a spatial data set, for instance, latitude and longitude can be treated as contextual attributes. When data of a time series nature is plotted, the position of an instance in the sequence is its context; for example, the rainfall at a location can be traced based on the behaviour of the attributes concerned, as shown in Figure 2. The same temperature reading may be normal in one context and anomalous in another, so a value is not judged an anomaly in isolation but with respect to its context.

Figure 2: Sample graph plotted to show the behavioral and contextual attributes
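Below is a minimal sketch, in Python, of contextual anomaly detection on a temperature series. The data, the choice of month as the contextual attribute, and the z-score threshold are illustrative assumptions.

    # Sketch: a value is judged only against other values in the same context
    # (here, the same month); 25 C is normal in July but anomalous in January.
    import numpy as np

    months = np.array([1, 1, 1, 1, 1, 1, 1, 7, 7, 7, 7])   # contextual attribute
    temps  = np.array([2.0, 1.5, 2.5, 3.0, 1.0, 2.0,
                       25.0,                                # 25 C in January
                       30.0, 31.0, 29.0, 30.5])             # behavioural attribute

    for m in np.unique(months):
        ctx = temps[months == m]
        mu, sigma = ctx.mean(), ctx.std()
        z = np.abs(ctx - mu) / (sigma + 1e-9)
        print(f"month {m}: anomalies at z > 2 ->", ctx[z > 2])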

16

International Journal of Computational Intelligence and Information Security, September 2011 Vol. 2, No. 9

2.3 Collective Anomalies


Collective anomalies are collections of related data instances that are anomalous as a group. If individual data instances, taken collectively, differ from the rest of the data, they are termed a collective (or hybrid) anomaly. As an example, consider the sequence of actions occurring on a computer shown in Figure 3. Within this sequence, the events corresponding to an attack (buffer-overflow, ssh, ftp) arise from copying files from a host to a remote machine via ftp; together they form a collective anomaly. When these events occur individually they are not treated as anomalies, but when they occur in sequence they constitute an anomaly, hence the name collective anomaly. http-web, buffer-overflow, http-web, http-web, smtp-mail, ftp, http-web, ssh, smtp-mail, http-web, telnet, buffer-overflow, ftp, http-web, ftp, smtp-mail, http-web, telnet.
Figure 3: Sample Sequence of actions in a computer leading to collective anomaly
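A minimal sketch, in Python, of collective anomaly detection on such event sequences: sliding windows observed in a normal training sequence form the model, and any window in a test sequence never seen in training is flagged. The sequences and the window length are illustrative assumptions.

    # Sketch: each event alone may be normal; the unseen combination is the anomaly.
    normal = ["http-web", "smtp-mail", "http-web", "ftp", "http-web",
              "smtp-mail", "ftp", "http-web", "smtp-mail", "http-web"]
    test = ["http-web", "buffer-overflow", "ssh", "ftp", "http-web", "smtp-mail"]

    W = 3
    def windows(seq):
        return [tuple(seq[i:i + W]) for i in range(len(seq) - W + 1)]

    known = set(windows(normal))        # model of normal event combinations
    for w in windows(test):
        if w not in known:
            print("collective anomaly:", w)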

2.4 Anomalies and its measures


From the above, it is clear that anomalies are usually caused by data coming from different sources, natural variation, and errors in data measurement and collection. Such anomalies must be detected, and there are several approaches based on different data measurement metrics and design assumptions: model based, proximity based and density based techniques. In a model based technique, a statistical model is created for the data and each object is evaluated by how well it fits the model. In a proximity based technique, a proximity measure for the data is chosen (such as distance, neighbourhood, density or cluster membership) and an object is declared anomalous or not with respect to it; for instance, if distance is the chosen measure, an object that lies far from most other points is declared an anomaly, and the method computes the distance between every pair of data points, so that different distance definitions yield different outliers. In a density based technique, the density around a point is taken into account: the k-density of every point is computed, and the k-relative density of a point P is the ratio between its density and the average k-density of its k nearest neighbours; the outlier score of P is derived from its k-relative density. A related density notion is used in the DBSCAN algorithm.

Anomaly detection is related to noise removal, noise accommodation and novelty detection. Noise removal is the removal of unwanted data (anomalies) from the data set. The distinction between novel patterns and anomalies is that novel patterns are typically incorporated into the normal model after being detected [2]. Anomaly detection rests on the assumption that misuse or intrusive behaviour deviates from normal system usage. In most cases this assumption is valid, as with an attacker who tries to authenticate as a legitimate user [8]. Based on the different behaviours of the attacker, an IDS can detect and report an anomaly during a session.
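A minimal sketch, in Python, of the k-relative density measure described above: the k-density of a point is taken as the inverse of its mean distance to its k nearest neighbours, and the k-relative density of P is the ratio of its density to the average density of those neighbours. The density estimate and the data are illustrative assumptions; low relative density marks outliers.

    import numpy as np

    def k_relative_density(X, k=3):
        D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
        np.fill_diagonal(D, np.inf)
        knn = np.argsort(D, axis=1)[:, :k]              # k nearest neighbours
        kdist = np.take_along_axis(D, knn, axis=1).mean(axis=1)
        density = 1.0 / kdist                           # k-density of every point
        return density / density[knn].mean(axis=1)      # k-relative density

    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(0, 0.5, size=(50, 2)), [[4.0, 4.0]]])  # one outlier
    rel = k_relative_density(X)
    print("outlier candidates:", np.where(rel < 0.5)[0])  # low relative density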

3. Popular Anomaly Detection Techniques


Anomaly detection is the best way to keep out viruses and worms, and it is often used in pre-processing to remove anomalous data from a data set. In this section we highlight the basic framework of an anomaly detection system and a few techniques that are widely used. Several anomaly detection techniques have been proposed in the literature; some of the most popular are the nearest neighbour (k-NN) method, density based methods and cluster based methods. Before implementing any system, it is better to identify its basic components, its inputs and the expected outcome. Figure 4 shows the main components, inputs, output and applicable domains of our anomaly detection system [2].



Figure 4: Components of an anomaly detection system

3.1 Aspects of Anomaly Detection Problem


Different domains are addressed with different approaches, but whatever method is applied, the ultimate aim is to detect anomalies. For the system to work efficiently, certain aspects need to be considered: the nature of the input data, the relationships among data, the anomaly type, the output and its evaluation [2].

Nature of the input data: it is convenient to specify a common form of data to be handled by anomaly detection techniques, namely record data, which can be univariate or multivariate. For instance, consider the record data of a session below; each attribute is characterized by its nature and type in a real-world environment, as in Figures 5 and 6.

Id  SourceIP       Session begin  DestIP          Dest port  No. of bytes  Attack
1   103.135.38.90  11.00.00       160.94.179.223  123        40            No
2   103.165.37.95  11.05.45       160.94.179.221  123        100           No
3   103.163.37.95  11.14.32       160.94.179.254  123        20            Yes

Figure 5: Sample record data of a session

The attributes of this record data are of mixed types: SourceIP, DestIP and Dest port are categorical, Session begin and No. of bytes are continuous, and Attack is binary.

Figure 6: Nature of the data as binary data, categorical data, continuous data and hybrid data [2].


Relationship among data: it is necessary to represent relationships among data instances, which may be sequential, temporal, spatial, spatio-temporal or graph based [2]. Sometimes the data is complex and such relationships need to be examined, as with the motion of a spacecraft in Figure 7.

Figure 7: Complex data, such as the movement of a car or plane together with its updates in traffic

Labels: these are assigned to each data instance and identify the anomaly as point, contextual or collective. To perform this assignment there are three basic approaches, supervised, semi-supervised and unsupervised learning, for describing the normal and anomalous classes; these are discussed further below. Once the anomaly type is known, it is easy to apply a method for learning and removing it. The usage of labels in the three approaches is summarized as: 1) supervised anomaly detection, where labels are available for both normal data and anomalies; 2) semi-supervised anomaly detection, where labels are available only for normal data; and 3) unsupervised anomaly detection, which needs no labels and is based on the assumption that anomalies are very rare compared to normal data [2][10].

Outcome and evaluation: this is the output to be achieved; the technique is evaluated after the components are assembled into a working model. To improve efficiency, three distinct phases are involved [2]: 1) abstract local information, i.e., build a profile; 2) evolve background information from the local abstractions; 3) establish anomalies whose characteristics differ from the normal abstraction.

Output of anomaly detection: typically, the outputs generated by an anomaly detection technique are of one of two types, scores or labels. Scoring techniques assign an anomaly score to each instance in the test data, reflecting the degree to which the instance is considered an anomaly; the output is then a ranked list of anomalies. Labelling techniques assign a label, normal or anomalous, to each instance. The next component of the system depicted in Figure 4 is the set of application domains, which include intrusion detection systems, network intrusion detection systems, fraud detection, fault detection and health systems [3].

We conclude this section with the general procedure in Figure 8 for an anomaly detection system, conveying that anomalies/outliers are the data points that are considerably different from the remainder of the data, and that there are considerably more normal observations than abnormal observations (outliers/anomalies) in the data. The general strategy for anomaly/outlier detection problems is given below.
1. For a given database D, find all the data points x ∈ D with anomaly scores greater than some threshold t.
2. Given a database D, find all the data points x ∈ D having the top n largest anomaly scores f(x).
3. Given a database D containing mostly normal (but unlabelled) data points, and a test point x, compute the anomaly score of x with respect to D.
Figure 8: General strategy of an anomaly detection problem
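A minimal sketch, in Python, of the general strategy of Figure 8, given some anomaly score function f over a database D. The distance-to-mean score used here is an illustrative stand-in for any of the surveyed techniques, and the data and threshold are assumptions.

    import numpy as np

    rng = np.random.default_rng(2)
    D = np.vstack([rng.normal(0, 1, size=(200, 2)), [[6.0, 6.0], [-7.0, 5.0]]])

    def f(x, D):
        return np.linalg.norm(x - D.mean(axis=0))   # anomaly score of x w.r.t. D

    scores = np.array([f(x, D) for x in D])

    t = 5.0                                          # strategy 1: threshold t
    print("above threshold:", np.where(scores > t)[0])

    n = 2                                            # strategy 2: top-n scores
    print("top-n:", np.argsort(scores)[-n:])

    x_test = np.array([10.0, 10.0])                  # strategy 3: score a test point
    print("score of test point:", f(x_test, D))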

Research approaches include data mining, machine learning, statistical learning, misuse techniques, etc., among which data mining is the technique illustrated in this survey.

Data mining techniques for anomaly detection: the labels associated with a data instance denote whether that instance is normal or anomalous. Based on the extent to which labels are available, data mining techniques for anomaly detection operate in one of three modes [2]:

Supervised mode: methods in this mode assume the training data has labels for both the normal class and the anomaly class. Any unseen data instance is compared against the model to determine which class it belongs to. Techniques have been proposed that inject artificial anomalies into a normal data set to obtain a labelled training set. Two major issues arise. First, anomalous instances are far fewer than normal instances in the training data. Second, obtaining accurate and representative labels for the anomaly class is a difficult and challenging task. This method is similar to building predictive models.

Semi-supervised mode: methods in this mode assume the training data has labels only for the normal class. There are no labels for the anomaly class, which makes this technique more widely applicable than supervised methods. Only a model for the class of normal behaviour is constructed, and this model is used to identify anomalies in the test data. In practice such methods are difficult to implement, because it is very hard to build a model that covers all possible anomalous behaviour that can occur in the data.

Unsupervised mode: methods in this mode do not require training data at all. They make the implicit assumption that normal instances are far more frequent than anomalous instances in the test data; if this assumption does not hold, such techniques suffer from a high false alarm rate. Many semi-supervised techniques can be adapted to operate in an unsupervised mode by using a sample of the unlabelled data set as training data; such adaptation assumes that the test data contains very few anomalies and that the model learnt during training is robust to these few anomalies. This mode is elaborated with network based anomaly detection in a later section.

4. Anomaly Detection Techniques


This section presents a brief study of the most popular techniques that have been used in intrusion detection systems. The general assumption made by these techniques is that normal data instances occur in dense neighbourhoods while anomalies occur far from their closest neighbours. The techniques fall into three categories, based on distance, density and clustering [13]: 1) techniques that use the distance of a data instance to its kth nearest neighbour as the anomaly score; 2) techniques that use the relative density of each data instance to compute its anomaly score; and 3) techniques that form clusters and check them for anomalies.

4.1 Nearest neighbour method


The nearest neighbour method is based on comparing the distribution of distances from a data point to its nearest neighbours in the given data set against that of a randomly distributed data set. k-nearest neighbour (k-NN) is a method for classifying objects based on the closest training examples in the feature space; in k-NN, computation is deferred until classification. Not surprisingly, this technique has been applied to detect land mines from satellite ground images and to detect shorted anomalies. The time complexity of the basic technique is O(N²), where N is the data size. The technique computes the anomaly score of an object from its neighbours by counting the number of nearest neighbours n that are not more than a distance d apart; an object is assigned to the majority class among its k nearest neighbours, where k is a small positive integer. A threshold is then applied to the anomaly score to determine whether a test instance is anomalous or normal.

Pros and cons of the method (Figure 9): such techniques adapt easily to different data types and primarily require defining an appropriate distance measure for the given data. Unsupervised variants sometimes mislabel instances when the data has normal instances without enough close neighbours, or anomalies with enough close neighbours, resulting in missed anomalies. For semi-supervised variants, if the normal instances in the test data do not have enough similar normal instances in the training data, the false positive rate is high. Additional complexity arises in the testing phase from computing the nearest neighbours. The performance of a nearest neighbour based technique relies heavily on a distance measure that can effectively distinguish between normal and anomalous instances, which is a very difficult task when the data is of a complex type. Measuring the nearest neighbour distance for an entire data set might not be feasible, so a technique involving sampling would be more useful. Earlier methods were restricted to two dimensions; higher dimensions are also needed.
Figure 9: Few pros and cons of the nearest neighbour method
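A minimal sketch, in Python, of the k-NN distance score discussed above: each instance is scored by its distance to its kth nearest neighbour using the basic O(N²) pairwise computation. The values of k, the threshold rule and the data are illustrative assumptions.

    import numpy as np

    def knn_score(X, k=5):
        D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
        np.fill_diagonal(D, np.inf)          # exclude each point from its own list
        return np.sort(D, axis=1)[:, k - 1]  # distance to the k-th neighbour

    rng = np.random.default_rng(3)
    X = np.vstack([rng.normal(0, 1, size=(100, 2)), [[8.0, 8.0]]])
    scores = knn_score(X)

    threshold = np.percentile(scores, 99)    # flag the top 1% as anomalous
    print("anomalies:", np.where(scores > threshold)[0])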



4. 2 Density based method


This method estimates the density of the neighbourhood of each data instance. If an instance lies in a low density neighbourhood it is declared anomalous; otherwise it is declared normal. For a given instance, the distance to its kth nearest neighbour is equivalent to the radius of a sphere, centred at the instance, that contains k other instances. In a 2-D data set the density of a data instance can be estimated as n/d², where d is the radius and n the number of instances inside it; the inverse of this density is the anomaly score of the instance. Several techniques fix the radius d and represent the anomaly score as 1/n [2]. The method can therefore be seen as estimating the inverse of the density of the instance in the data set, i.e., as the density based counterpart of the basic nearest neighbour measure. The performance of these techniques is poor if the data set has regions of differing densities. To address this issue of varying densities, several improvements have been suggested, such as the Local Outlier Factor (LOF) method [2]. In this method the LOF score is the ratio of the average local density of the k nearest neighbours of the data instance to the local density of the instance itself. A variation of LOF is the Connectivity-based Outlier Factor (COF); COF computes the anomaly score in the same way as LOF, but is also able to capture regions such as straight lines. A simpler version of LOF is Outlier Detection using In-degree Number (ODIN): for each data instance, the in-degree number equals the number of its k nearest neighbours that have the given instance in their own k nearest neighbour lists, and the inverse of this number is the anomaly score.
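A short sketch of the LOF method mentioned above, using scikit-learn's LocalOutlierFactor; the use of scikit-learn and the parameter values are assumptions for illustration, not part of the original paper.

    import numpy as np
    from sklearn.neighbors import LocalOutlierFactor

    rng = np.random.default_rng(4)
    X = np.vstack([rng.normal(0, 0.5, size=(60, 2)),     # dense region
                   rng.normal(5, 2.0, size=(60, 2)),     # sparser region
                   [[10.0, -5.0]]])                       # outlier

    lof = LocalOutlierFactor(n_neighbors=20)
    labels = lof.fit_predict(X)                  # -1 marks outliers, 1 inliers
    scores = -lof.negative_outlier_factor_       # higher score = more anomalous

    print("outliers:", np.where(labels == -1)[0])
    print("max LOF score:", scores.max().round(2))

Because LOF compares each point's density to that of its neighbours, it copes with the varying-density regions that defeat the basic density score.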

4. 3 Cluster based method


It is a known fact that a cluster is a group of similar data instances, and clustering is primarily an unsupervised technique [8]. Anomaly detection and clustering differ in their objective, but clustering can be used to compute anomaly scores. In a cluster based method, normal data instances fall into clusters in the data, while anomalies do not belong to any cluster; a data instance that does not appear in any cluster is declared anomalous [2][13]. Methods such as DBSCAN, ROCK and SNN clustering do not force every data instance to belong to a cluster. A second kind of clustering technique assumes that normal data instances lie close to their nearest cluster centroid while anomalies lie far away from it [7]. These techniques are unsupervised in nature and make no assumptions about the generative distribution of the data; they are purely data driven [7][6]. However, some caveats apply: approximating intensity by the number of points divided by the area may not be very accurate, which can skew results, and at the edges of the region of interest, unobserved events outside the region may interact with events inside it.
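A short sketch of cluster based anomaly detection with DBSCAN, which, as noted above, does not force every instance into a cluster: points labelled as noise are treated as anomalies. The use of scikit-learn and the eps/min_samples values are illustrative assumptions.

    import numpy as np
    from sklearn.cluster import DBSCAN

    rng = np.random.default_rng(5)
    X = np.vstack([rng.normal(0, 0.3, size=(80, 2)),
                   rng.normal(4, 0.3, size=(80, 2)),
                   [[2.0, 8.0], [-3.0, -3.0]]])          # isolated points

    labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
    print("anomalies (noise points):", np.where(labels == -1)[0])  # label -1 = noise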

5. Anomaly Detection in Network with Unsupervised Learning


Unsupervised learning is the setting in which no variable is designated as a target; the goal is instead to establish relationships among the variables, studying how to represent input patterns in a way that reflects the statistics of the overall input [7][12]. In a network environment, the network based approach relies on TCP dump data as input, which gives per-packet information. This data is pre-processed by grouping records related to one connection, and content based, time based and connection based features are extracted. The attack data was obtained by simulating a TCP-SYN flood and was pre-processed using the same procedure as the normal data [12].

Intrusions pose a serious problem that needs to be addressed, and newly emerging threats are the most difficult to handle. Anomaly detection techniques can make use of unsupervised methods to identify new threats without the need for labels [4]. In this section we review a network intrusion detection method, which is the main theme of our paper. There are generally three approaches to network intrusion detection: signature based detection, misuse detection and anomaly detection [5]. Signature based and misuse detection methods rely on labelled patterns and are unable to detect new types of threats; maintaining the database for these methods is also time consuming and expensive. Among these approaches, anomaly detection can use supervised or unsupervised techniques to detect anomalies in data patterns. Clustering was the first choice because the dataset is huge and multidimensional. This is the main objective of our survey towards IDS: a simple data clustering method, K-means, over the network data [13]. Data clustering is useful in various domains, such as image segmentation, document retrieval, data mining and object recognition, and can be divided into two categories: hierarchical clustering and partitional clustering. In hierarchical clustering the data is not partitioned in a single step; instead, a series of partitions takes place, ultimately leading to n clusters, each consisting of a single object. The basic steps in data clustering are shown in Figure 10.

The unsupervised anomaly detection approach overcomes the limitations of supervised learning by using data clustering algorithms, which make no assumptions about labels or classes for the data patterns. Using clustering, one can identify dense and sparse regions and hence the distribution of patterns across the data attributes. The patterns are grouped, and the anomalies are found in the small clusters. Two assumptions are made: normal patterns or connections are far more numerous than attacks, and the attacks differ from the normal patterns. Clustering does not rely on predefined classes or a labelled training set; it is therefore a form of learning by observation [10][9].

Step 1: Pattern representation refers to the number of classes, patterns and features of the data.
Step 2: Feature selection and extraction prune the data by reducing its dimension. Feature selection identifies the most effective subset of features in the original data; feature extraction applies a transformation to the original features to generate new ones. Normalizing the data prevents large-scale features from dominating, and a distance measure can be used to quantify the clusters (see the sketch after Figure 10).
Step 3: Clustering groups the patterns into clusters based on similarities.
Figure 10: Clustering outline
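A small sketch, in Python, of the normalization mentioned in Step 2: every feature is rescaled to [0, 1] so that large-scale features (e.g., byte counts) do not dominate the distance measure. The feature matrix is an illustrative example.

    import numpy as np

    X = np.array([[40.0,  0.02],     # e.g., bytes and duration on
                  [100.0, 0.05],     # very different scales
                  [20.0,  0.90]])

    X_min, X_max = X.min(axis=0), X.max(axis=0)
    X_norm = (X - X_min) / (X_max - X_min + 1e-12)   # min-max normalization
    print(X_norm)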

Once the cluster assignment is fixed, the mean distance of an example to the cluster centres is used as its score. Using the K-means clustering algorithm, different clusters were specified and generated for each output class [6]. K-means clustering is a well-known data mining algorithm that has been used to detect anomalous user behaviour as well as unusual behaviour in network traffic. Two problems are inherent to K-means clustering: the first is determining the initial partition, and the second is determining the optimal number of clusters [7]. Figure 11 below depicts the K-means algorithm. As the algorithm iterates through the training data, the composition of each cluster is updated: elements move from one cluster to another, ultimately causing the centroid values to change, reflecting the current cluster elements. Once there are no changes to any cluster, the training of the K-means algorithm is complete.

1. Choose and define the number of clusters K
2. Initialize the K centroids (e.g., with randomly chosen patterns)
3. Assign each pattern to the cluster with the nearest centroid
4. Recompute the mean of each cluster as its new centroid
5. Repeat steps 3 and 4 until the criterion is met that no pattern moves to another cluster
6. Repeat the procedure n times and choose the best clustering solution
Figure 11: A sample K Means algorithm

The idea of this algorithm is to train a K-means clusterer on normal datasets and so cluster the normal behaviour points. Initially, the number of clusters k is chosen and patterns are assigned randomly; the centres of the k clusters are computed and patterns are assigned to the clusters with the closest centres. For a test data instance, the probability of its belonging to the most probable cluster is computed; if this is below a threshold, the instance is flagged as anomalous. The mean distance of a cluster's members to its centre is used as the score [2][13].
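A minimal sketch, in Python, of the procedure described above: fit K-means on normal training data, then flag test points whose distance to the nearest centroid exceeds a threshold. The use of scikit-learn, the value of k and the percentile-based threshold are assumptions for illustration.

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(6)
    X_train = np.vstack([rng.normal(0, 0.5, size=(100, 2)),   # normal behaviour
                         rng.normal(5, 0.5, size=(100, 2))])

    k = 2
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_train)

    # Distances of training points to their nearest centroid set the threshold.
    train_dist = kmeans.transform(X_train).min(axis=1)
    threshold = np.percentile(train_dist, 99)

    X_test = np.array([[0.1, 0.2],    # normal
                       [5.2, 4.9],    # normal
                       [9.0, -4.0]])  # anomalous
    test_dist = kmeans.transform(X_test).min(axis=1)
    print("flagged as anomalous:", test_dist > threshold)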

6. Conclusions
In the past, intrusion detection systems were implemented as signature based systems and sometimes failed to detect attacks without signatures. This gave data mining an opportunity, with increased effort towards detection models for such systems. The core concept of such a system is anomaly detection, which refers to the identification of conforming and non-conforming patterns in the data. Anomaly detection is based on the general assumption that normal data instances occur in dense neighbourhoods while anomalies occur far from their closest neighbours, with respect to distance, density and clustering. To trace anomalies, the techniques use unsupervised methods that can identify new threats without labels. Clustering was the first choice among these methods, since one can identify dense and sparse regions from the distribution of the patterns. Unsupervised learning methods are referred to as data clustering methods in the data mining domain; they take a different approach, grouping unlabelled data into clusters based on similarities. K-means is a centroid based technique in which every cluster is represented by its centre of gravity, so that intra-cluster similarity is high. The technique is scalable and efficient in processing large data sets. The K-means algorithm, a hard partitional clustering algorithm, was chosen for its simplicity and speed. The labels of the patterns were used for evaluation, but never for the clustering procedure. Figure 12 concludes the study of existing methods of IDS.
Figure 12 summarizes the comparative analysis of the survey; its contents are reproduced below in structured form.

Techniques:
- Signature based IDS: relies on hand-crafted signatures.
- Misuse detection: models the normal operation of a system/network and detects deviations from it.
- Anomaly detection: identification of conforming and non-conforming patterns in a data set; identifies distinct network connections or associations (connectionless traffic), including enough detail to collect suspicious traffic. Inputs: network packets, configuration of sensors, attacker profile, system profile, security profile.

Modes of operation:
- Supervised (S): labels available for both the normal and anomaly classes. Input: training data set. Process: build a model and use it on the test data. Output: identified anomalies. Most widely applicable.
- Semi-supervised (SS): labels available only for the normal class. Widely applicable.
- Unsupervised (U): operates without labels. Less widely applicable.

Measures:
- KNN distance measure: data pruning with a threshold to compute the anomaly with respect to distance to the nearest neighbours. Process: uses the KNN measure to compute an anomaly. Output: anomalous and normal classes.
- Density: checks whether an instance lies within a low density neighbourhood; if so, the instance is declared anomalous, otherwise normal.
- Cluster: checks whether data instances are close to or far from the nearest cluster centroid and labels them normal or anomalous accordingly. Steps: pattern representation, quantifying the clusters, clustering.
- K-means: anomalies and attacks can be detected based on cluster sizes; large clusters correspond to normal data, and the remaining data points are outliers corresponding to attacks.

Representative systems (over the past years):
- IDES (Intrusion Detection Expert System), SRI International; anomaly detection is the main component.
- NIDES (Network Intrusion Detection Expert System), an enhancement of IDES.
- DIDS (Distributed Intrusion Detection System), University of California at Davis, Lawrence Livermore Laboratory, Haystack Laboratory and the U.S. Air Force.
- STAT/USTAT (State Transition Analysis Tool), for the UNIX operating system environment.
- Tripwire, introduced in November 1992 at the COAST laboratory, Purdue University.
- GrIDS (Graph-Based Intrusion Detection System), COAST laboratory.
- Thumbprinting, University of California, Davis.

Figure 12: Comparative analysis of the survey



References
[1] D. E. Denning, "An Intrusion Detection Model," IEEE Transactions on Software Engineering, SE-13:222-232, 1987.
[2] V. Chandola, A. Banerjee, and V. Kumar, "Anomaly Detection: A Survey," ACM Computing Surveys, September 2009.
[3] K. Julisch, "Data Mining for Intrusion Detection: A Critical Review," IBM Research, Zurich Research Laboratory.
[4] W. Lee, S. J. Stolfo, P. K. Chan, E. Eskin, W. Fan, M. Miller, S. Hershkop, and J. Zhang, "Real Time Data Mining-based Intrusion Detection," Computer Science Department, NC.
[5] P. Ning (North Carolina State University) and S. Jajodia (George Mason University), "Intrusion Detection Techniques."
[6] S. J. Stolfo, F. Wei, W. Lee, A. Prodromidis, and P. K. Chan, "KDD Cup: Knowledge Discovery and Data Mining Competition," 1999.
[7] H. Gunes Kayacik and N. Zincir-Heywood, "Analysis of Three Intrusion Detection System Benchmark Datasets Using Machine Learning Algorithms," Dalhousie University, Faculty of Computer Science, 6050 University Avenue, Halifax, Nova Scotia, B3H 1W5.
[8] G. H. Kayacik and A. N. Zincir-Heywood, "Generating Representative Traffic for Intrusion Detection System Benchmarking," Proceedings of the IEEE CNSR 2005, Halifax, Canada, May 2005.
[9] A. Lazarevic, A. Ozgur, L. Ertoz, J. Srivastava, and V. Kumar, "A Comparative Study of Anomaly Detection Schemes in Network Intrusion Detection," Proc. SIAM International Conference on Data Mining, May 2003.
[10] W. Lee and S. J. Stolfo, "Data Mining Approaches for Intrusion Detection."
[11] S. Kumar and E. H. Spafford, "An Application of Pattern Matching in Intrusion Detection," Technical Report CSD-TR-94-013, Purdue University, 1994.
[12] S. Mukkamala, G. Janoski, and A. Sung, "Intrusion Detection Using Neural Networks and Support Vector Machines," Proceedings of the IEEE International Joint Conference on Neural Networks, 2002, p. 1702.
[13] Technical Report, Department of Computer Sciences, University of Wisconsin at Madison, Madison, WI, December 2007.

