Professional Documents
Culture Documents
Abstract. Cyber security is a serious global concern. The potential of cyber terrorism has posed a threat to national security; meanwhile the increasing prevalence of malware and incidents of cyber attacks hinder the utilization of the Internet to its greatest benefit and incur significant economic losses to individuals, enterprises, and public organizations. This paper presents some recent advances in intrusion detection, feature selection, and malware detection. In intrusion detection, stealthy and low profile attacks that include only few carefully crafted packets over an extended period of time to delude firewalls and the intrusion detection system (IDS) have been difficult to detect. In protection against malware (trojans, worms, viruses, etc.), how to detect polymorphic and metamorphic versions of recognized malware using static scanners is a great challenge. We present in this paper an agent based IDS architecture that is capable of detecting probe attacks at the originating host and denial of service (DoS) attacks at the boundary controllers. We investigate and compare the performance of different classifiers implemented for intrusion detection purposes. Further, we study the performance of the classifiers in real-time detection of probes and DoS attacks, with respect to intrusion data collected on a real operating network that includes a variety of simulated attacks. Feature selection is as important for IDS as it is for many other modeling problems. We present several techniques for feature selection and compare their performance in the IDS application. It is demonstrated that, with appropriately chosen features, both probes and DoS attacks can be detected in real time or near real time at the originating host or at the boundary controllers. We also briefly present some encouraging recent results in detecting polymorphic and metamorphic malware with advanced static, signature-based scanning techniques.
1 Introduction
Intrusion detection is a problem of great importance to protecting information systems security, especially in view of the worldwide increasing incidents of cyber attacks. Since the ability of an IDS to identify a large variety of intrusions in real time with accuracy is of primary concern, we will in this paper consider performance measures of learning machine based IDSs in the critical aspects of classification accuracy, training time, testing times, and scalability. One of the main problems with IDSs is the overhead, which can become prohibitively high. To analyze system logs, the operating system must keep information regarding all the actions performed, which invariably results in huge
M. J. Maher (Ed.): ASIAN 2004, LNCS 3321, pp. 468482, 2004 Springer-Verlag Berlin Heidelberg 2004
469
amounts of data, requiring disk space and CPU resource. Next, the logs must be processed and converted into a manageable format and then compared with the set of recognized misuse and attack patterns to identify possible intrusions. Further, the stored patterns need be continually updated, which would normally involve human expertise. Detecting intrusions in real time, therefore, is a difficult task. Several artificial intelligence techniques have been utilized to automate the intrusion detection process to reduce human intervention, they include neural networks, fuzzy inference systems, evolutionary computation machine learning, etc. Several data mining techniques have been introduced to identify key features or parameters that define intrusions [1-3]. A summary of intrusion detection techniques is given in [4,5]. Some previous work applied neural networks as classifiers to detect low level probes; and a summary of different port-scan detection techniques is also available [6,7]. In this paper, we implement and evaluate the performance of an intelligent agent based IDS to detect, among other attacks, probes at the originating host and DoS attacks at the boundary controllers. Intelligent agents that encapsulate different AI paradigms including Support Vector Machines (SVM) [8], Multivariate Adaptive Regression Splines (MARS) [9] and Linear Genetic Programming (LGP) [10] are utilized. The data we use in our experiments is collected on a real network at New Mexico Tech that includes normal activity and several classes of probing attacks and DoS attacks [11] generated using scripts available on the Internet. We perform experiments to classify the network traffic in real-time into Normal, Probe and DoS. It is demonstrated that with appropriately chosen population size, program size, crossover rate and mutation rate, linear genetic programs outperform other artificial intelligent techniques in terms of detection accuracy. An introduction to computation techniques used for experiments is given in section 2. A brief introduction to our intelligent agents based architecture is given in section 3. In section 4 we present using different classifiers for IDSs and compare their performance. Section 5 presents three feature ranking algorithms. Real-time data collection and feature extraction are described in section 6, as well as performance evaluation of probes and DoS attacks detection. In section 7 we present a robust technique for signature-based malware (viruses, worms, trojans, etc.) detection, with emphasis on obfuscated (polymorphic) malware and mutated (metamorphic) malware. The conclusions are given in section 8.
470
knot. With a sufficiently large number of knots virtually any shape can be well approximated. While it is easy to draw a spline in 2-dimensions by keying on knot locations (approximating using linear, quadratic or cubic polynomial etc.), manipulating the mathematics in higher dimensions is best accomplished using basis functions. The MARS model is a regression model using basis functions as predictors in place of the original data. The basis function transform makes it possible to selectively blank out certain regions of a variable by making them zero, and allows MARS to focus on specific sub-regions of the data. It excels at finding optimal variable transformations and interactions, and the complex data structure that often hides in high-dimensional data [12]. Given the number of records in most data sets, it is infeasible to approximate the function y = f(x) by summarizing y in each distinct region of x. For some variables, two regions may not be enough to track the specifics of the function. If the relationship of y to some x's is different in 3 or 4 regions, for example, the number of regions requiring examination becomes larger than 34 billion with only 35 variables. Given that the number of regions cannot be specified a priori, specifying too few regions in advance can have serious implications for the final model. A solution is needed that accomplishes the following two criteria: judicious selection of which regions to look at and their boundaries judicious determination of how many intervals are needed for each variable
Given these two criteria, a successful method will essentially need to be adaptive to the characteristics of the data. Such a solution will probably ignore quite a few variables (affecting variable selection) and will take into account only a few variables at a time (also reducing the number of regions). Even if the method selects 30 variables for the model, it will not look at all 30 simultaneously. Such simplification is accomplished by a decision tree at a single node, only ancestor splits are being considered; thus, at a depth of six levels in the tree, only six variables are being used to define the node [12]. 2.2 Support Vector Machines (SVMs) The SVM approach transforms data into a feature space F that usually has a huge dimension. It is interesting to note that SVM generalization depends on the geometrical characteristics of the training data, not on the dimensions of the input space [13,14]. Training a support vector machine (SVM) leads to a quadratic optimization problem with bound constraints and one linear equality constraint. Vapnik shows how training a SVM for the pattern recognition problem leads to the following quadratic optimization problem [15]. Minimize: W ( ) =
l
i =1
i +
1 2
y i y j i j k ( xi , x j )
i =1 j =1
(1)
Subject to
y
i =1
i i
i : 0 i C
(2)
Where l is the number of training examples, is a vector of l variables and each component i corresponds to a training example (xi, yi). The solution of (1) is the vector * for which (1) is minimized and (2) is fulfilled.
471
LGP is a variant of the Genetic Programming (GP) technique that acts on linear genomes [10]. The linear genetic programming technique used for our experiments is based on machine code level manipulation and evaluation of programs. Its main characteristics in comparison to tree-based GP lies in that the evolvable units are not the expressions of a functional programming language (like LISP), but the programs of an imperative language (like C). In the Automatic Induction of Machine Code by Genetic Programming, individuals are manipulated directly as binary code in memory and executed directly without passing an interpreter during fitness calculation. The LGP tournament selection procedure puts the lowest selection pressure on the individuals by allowing only two individuals to participate in a tournament. A copy of the winner replaces the loser of each tournament. The crossover points only occur between instructions. Inside instructions the mutation operation randomly replaces the instruction identifier, a variable or the constant from valid ranges. In LGP the maximum size of the program is usually restricted to prevent programs without bounds. As LGP could be implemented at machine code level, it can possibly be fast enough to be able to detect intrusions in near real time.
Agent controller: Manages the agents functionality, life cycle, communication with other agents and the response mechanisms. Attack signatures module: Maintains all the attack signatures and updates agents in the event of a new attack signature. Data collection: Extracts features required by the detection algorithm to decide whether the activity is malicious or normal. Intrusion detection module: Performs intrusion detection and classification.
472