
International Journal of Computer Engineering and Technology (IJCET)
ISSN 0976-6367 (Print), ISSN 0976-6375 (Online)
Volume 4, Issue 2, March-April (2013), pp. 241-251
IAEME: www.iaeme.com/ijcet.asp
Journal Impact Factor (2013): 6.1302 (Calculated by GISI), www.jifactor.com


A COMPARATIVE ANALYSIS OF DATA MINING TOOLS FOR PERFORMANCE MAPPING OF WLAN DATA
Mr. Ajay M. Patel
Assistant Professor, Acharya Motibhai Patel Institute of Computer Studies, Ganpat University, Ganpat Vidyanagar-384012, India

Dr. A. R. Patel
Director, Department of Computer Application & Information Technology, H. North Gujarat University, Patan - 384265, India

Ms. Hiral R. Patel
Assistant Professor, Department of Computer Science, Ganpat University, Ganpat Vidyanagar-384012, India

ABSTRACT

Data mining is the non-trivial process of identifying valid, potentially useful, and understandable patterns in large volumes of data as part of knowledge discovery. Its main aim is to discover patterns and associations in preprocessed and transformed data. Data mining supports two types of analysis: prediction, which estimates unknown or future values of selected variables, and description, which finds human-interpretable patterns. Major application areas include business and finance, the stock market, telecommunications, health care, surveillance, fraud detection, scientific discovery and, more recently, networking. Data mining supports both supervised and unsupervised machine learning. This paper uses unsupervised learning: a wireless network log with 13 attributes and 1000 instances serves as the data set for anomaly detection. The research focuses on performance mapping of the unsupervised algorithms supported by different data mining tools. Each tool provides different clustering algorithms and different performance measures. The same data set is applied to every tool, and the paper presents a comparative analysis of the performance of the algorithms across these tools.

Index Terms: Accuracy, Anomaly Detection, Clustering, Data Mining, Error Rate, Unsupervised Learning.

1. INTRODUCTION

Data mining is a machine learning process for detecting unknown patterns in data, and it provides many useful analytical techniques. This research shows how data mining techniques can be used for anomaly detection in wireless networking. The most obvious advantage of wireless networking is mobility: wireless network users can connect to existing networks and are then allowed to roam freely. In next-generation wireless networks, one of the most serious challenges is how to maintain a continuous connection while a mobile user moves among cells, which is made possible by the handover procedure.

An intrusion prevention system (IPS) is software that has all the capabilities of an intrusion detection system (IDS) and can also attempt to stop possible incidents. An IPS combines an IDS with a firewall, a virus detection algorithm, a vulnerability assessment algorithm, etc. The aim of such a system is to manage both preventive and responsive actions against attacks on a computer network. [10] The wireless log history hides useful knowledge patterns that describe the typical behavior of anomalies in packet transmission. [5]

In network security research, intrusion detection is a critical concern. Misuse detection and anomaly detection are the two basic approaches to intrusion detection. An intrusion detection system accumulates and examines data to identify intrusions and misuse in computer systems and networks. [7] Data mining provides various techniques for finding these kinds of anomalous intrusion activities.

1.1 Data Mining

Data mining is a machine learning approach that provides different techniques for discovering knowledge and unknown patterns in raw data. Data mining is emerging as a key feature of many security initiatives.
Both the private and public sectors increasingly use data mining. Application domains such as banking, insurance, medicine, and retailing frequently use data mining to reduce costs, enhance research, and increase sales. Data mining applications were initially used as a means to detect fraud and waste, but have grown to also be used for purposes such as measuring and improving program performance. Data mining involves the use of sophisticated data analysis tools to discover previously unknown, valid patterns and relationships in large data sets. These tools can include statistical models, mathematical algorithms, and machine learning methods, that is, algorithms that improve their performance automatically through experience, such as neural networks or decision trees. Data mining takes a discovery approach, in which algorithms can be used to examine several multidimensional data relationships simultaneously, identifying those that are unique or frequently represented.

Many organizations provide data mining tools that analyze task-oriented user data and give analytical results to interpret; these tools help reduce fraud and wasted time, and assist in developing algorithms for research. It is also possible, and often preferable, to use or modify the algorithms according to the requirements. Recently, data mining has been increasingly cited as an important tool for various security efforts. Some observers suggest that data mining should be used as a means to identify terrorist or intrusive activities, such as money transfers and electronic communications, and to identify and track individual terrorists or intruders themselves, such as through travel and immigration records. [9]

1.2 Why Unsupervised Learning?

Data mining is the process of extracting knowledge from a database. Data mining models can be categorized according to the tasks they perform: techniques are either predictive (supervised) or descriptive (unsupervised). Of the main techniques, classification and prediction are supervised learning models, whereas clustering and association rules are descriptive models. Classification recognizes patterns that describe the group to which an item belongs. Prediction is the construction and use of a model to assess the class of an unlabeled object, or to assess the value or value range that a given object is likely to have.

A supervised learning model classifies data according to predefined class labels. Unsupervised learning groups data according to the behavior of the data itself. Unsupervised learning techniques treat all variables in the same way; there is no distinction between explanatory and dependent variables. However, despite the name "undirected data mining", there is still some target to achieve. This target might be as general as data reduction or more specific, like clustering. The difference between supervised and unsupervised learning is the same as that between discriminant analysis and cluster analysis. Supervised learning requires that the target variable is well defined and that a sufficient number of its values are given. For unsupervised learning, typically either the target variable is unknown or it has only been recorded for too small a number of cases.

1.3 Intrusion Detection in WLAN

A wireless IDPS monitors wireless network traffic and analyzes its wireless networking protocols to identify suspicious activity performed by users and detectable from the protocols themselves. Reference [10] provides a detailed discussion of wireless IDPS technologies.
First, it contains a brief overview of wireless networking as background material. It then covers the major components of wireless IDPSs and explains the architectures typically used for deploying them. It also examines the security capabilities of the technologies in depth, including the methodologies they use to identify and stop suspicious activity, and discusses their management capabilities, including recommendations for implementation and operation. [10]

Wireless intrusion detection systems can be divided into misuse-based and anomaly-based systems in the same way as IDSs for wired networks. Besides the classical misuse and anomalies detectable in any network, a wireless IDS must also detect wireless-specific misuse and anomalies. Machine learning is regarded as an effective tool for an IDS to detect abnormal activities in network traffic. In particular, neural networks, support vector machines (SVM) and decision trees are three significant and popular schemes borrowed from the machine learning community for intrusion detection in recent academic research. [7]

1.4 Anomaly Detection

An anomaly is any occurrence or entity that is unusual, abnormal or special. It can also indicate an inconsistency with, or divergence from, a preset rule or trend. For anomaly detection, normal behavior is modeled, and any event that violates this model is marked as suspicious. For example, a normally passive public web server can be suspected of a worm infection if it tries to open connections to a large number of addresses.

An anomaly-based intrusion detection system finds intrusions and misuse in a computer by monitoring system activity and classifying the activities as normal or anomalous. Such a system can detect any type of misuse that falls outside normal system operation, since the classification is based on rules or heuristics rather than on patterns or signatures. It seeks deviations from the learned model of normal behavior, analyzing ongoing traffic, activity, transactions or behavior for anomalies in the system or the network that may indicate an attack. An intrusion detection system (IDS) is a program that examines what happens or has happened during execution and tries to find indications that the computer has been misused. The development of anomaly detection techniques suitable for wireless networks is regarded as a vital research area. [7]

2. DATA MINING TECHNIQUES FOR ANOMALY DETECTION

In anomaly detection, any significant deviation from the expected behavior is reported as a possible attack. Anomalies are activities of the kind that intruders would perform, and anomaly detection is the process of finding the objects that are unlike the other, normal objects. Data mining provides techniques for finding such groups or classes, according to the requirements and purpose of the work. Classification assigns the collected data to classes according to predefined labels. Clustering, another data mining technique, groups the data according to the behavior of the data. Both techniques are therefore useful for finding groups or classes that can be distinguished from other, dissimilar groups, either by predefined labels or by the behavior of the data.

3. PROCESS OF UNSUPERVISED LEARNING (CLUSTERING)

Unsupervised learning is a method of grouping data according to the behavior of the data. It is also known as a descriptive method. Clustering is one of the unsupervised learning techniques. Clustering works on the data directly; no predefined labels are required. It generates as many groups as the user requests. Clustering techniques form the groups according to a distance criterion among the data, and different distance measures are available to compute the distance between instances. Different clustering tools use different distance measures to group the data. The accuracy of the results depends on the algorithm used to cluster the instances. This paper shows the use of different data mining tools: the clustering techniques are applied to the same wireless log to perform a comparative analysis of which tool gives more accurate results.

4. DATA MINING TOOLS USED FOR PERFORMANCE ANALYSIS

Various organizations provide data mining tools that implement data mining techniques. Some of the tools are freeware and open source, so anyone can easily use them. Data mining tools provide built-in algorithms for various data mining techniques. In this paper, several data mining tools are used: Weka, SPSS, Tanagra, and Microsoft SQL Server, which provides Business Intelligence Development Studio (BIDS) to support data mining analysis services. Three clusters are generated and defined as Normal activities, Suspicious activities and Anomalous activities. In each tool, a clustering algorithm is applied to the same wireless log file to find the anomalous group of activities. Different tools give different results, so the important thing is to interpret the results of the applied techniques. Instances that are close to each other are put into the same cluster, and closeness is measured by computing the distances between instances; clusters are generated on this basis. An unsupervised data mining model is well suited to this task, but different tools use different ways of computing the distances, so the choice of an ideal model depends on the accuracy and error rate of the algorithm each tool provides. The following subsections show the steps to perform data mining techniques using the different tools.

4.1 WEKA

Weka stands for Waikato Environment for Knowledge Analysis. Weka is an open source tool implemented in Java. It provides various data mining techniques and preprocessing facilities, and users can develop or modify the built-in algorithms. Weka works with different file formats such as .arff, .csv, C4.5 and .xrff. In this paper, Weka 3.7 is used to apply SimpleKMeans with 3 clusters to the wireless log, using Euclidean distance because it is sufficient to group similar instances.
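The clustering step described above can be illustrated with a minimal k-means sketch using Euclidean distance and 3 clusters. This is not Weka's SimpleKMeans implementation: the function names, the toy data and the deterministic farthest-first seeding are assumptions chosen so the example is reproducible, whereas Weka seeds its centroids differently.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def farthest_first(rows, k):
    """Assumed seeding: start at the first row, then repeatedly add the
    row that is farthest from all centroids chosen so far."""
    centroids = [list(rows[0])]
    while len(centroids) < k:
        nxt = max(rows, key=lambda r: min(euclidean(r, c) for c in centroids))
        centroids.append(list(nxt))
    return centroids


def k_means(rows, k=3, max_iter=100):
    """Plain k-means; returns (centroids, labels)."""
    centroids = farthest_first(rows, k)
    labels = [-1] * len(rows)
    for _ in range(max_iter):
        # Assignment step: each instance joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: euclidean(row, centroids[c]))
            for row in rows
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels


# Hypothetical two-attribute instances standing in for wireless log records.
toy_log = [
    (1.0, 1.1), (0.9, 1.0), (1.2, 0.8),
    (5.0, 5.2), (5.1, 4.9), (4.8, 5.0),
    (9.0, 0.5), (9.2, 0.4), (8.9, 0.6),
]
centroids, labels = k_means(toy_log, k=3)
for cluster in range(3):
    print(cluster, labels.count(cluster))
```

On real log data the attributes would normally be scaled first, since Euclidean distance is dominated by whichever attribute has the largest numeric range.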

Figure 1: Clustering using Weka 3.7.

4.2 SPSS

SPSS is a proprietary product from IBM designed for statistical analysis. It provides various statistical tests and also data mining techniques. SPSS works with .sav files and other data files such as Excel. In this paper, SPSS 16.0 is used to apply K-Means clustering from the Analyze -> Classify menu. This model also generates 3 clusters. Two methods are available: iterate and classify, and classify only. SPSS also performs ANOVA for statistical verification.
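As a rough illustration of the ANOVA check mentioned above, the sketch below computes a one-way ANOVA F statistic for a single attribute whose values are split by cluster membership. The function name and the sample values are hypothetical; the paper does not reproduce the SPSS table itself.

```python
def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of numeric groups."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Between-groups sum of squares, with k - 1 degrees of freedom.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within-groups sum of squares, with n - k degrees of freedom.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Hypothetical values of one attribute, grouped by assigned cluster.
attribute_by_cluster = [
    [1.0, 2.0, 3.0],
    [2.0, 3.0, 4.0],
    [6.0, 7.0, 8.0],
]
f_stat = one_way_anova_f(attribute_by_cluster)
print(f_stat)
```

A large F value indicates that the attribute separates the clusters well. Because the clusters were formed to maximize exactly these differences, such F values are best read descriptively rather than as a formal significance test.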

Figure 2: Clustering using SPSS 16.0



4.3 Tanagra

Tanagra is also a freely available data mining tool. It provides various statistical methods, non-parametric tests, supervised learning techniques, association and clustering. Tanagra works with .arff and other file formats that it specifies. Here Tanagra 1.4.43 is used. It is a component-based visual tool. It generates 3 clusters for the wireless log. Tanagra normalizes distances based on variance and chooses the seeds either randomly or by a standard method that it specifies.
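Variance-based normalization of the kind described above can be sketched as follows. This is an assumption about the general technique (scaling each attribute to zero mean and unit variance before distances are computed), not a reproduction of Tanagra's internals; in particular, the use of the population variance and the names below are illustrative choices.

```python
import math


def standardize_columns(rows):
    """Scale every attribute to zero mean and unit (population) variance."""
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(col) / n for col in columns]
    stds = [
        math.sqrt(sum((x - m) ** 2 for x in col) / n)
        for col, m in zip(columns, means)
    ]
    # A constant attribute carries no information, so map it to 0.0.
    return [
        [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, stds)]
        for row in rows
    ]


# Hypothetical attributes on very different scales.
raw = [(10.0, 1000.0), (20.0, 3000.0), (30.0, 5000.0)]
scaled = standardize_columns(raw)
for row in scaled:
    print(row)
```

After scaling, both attributes contribute equally to a Euclidean distance, so the attribute with the larger raw range no longer dominates the clustering.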

Figure 3: Clustering using Tanagra 1.4.43

4.4 BIDS of MS SQL Server 2008

Microsoft also provides a data mining tool as part of MS SQL Server 2008, through Business Intelligence Development Studio (BIDS). This tool provides several effective data mining algorithms that give scalable results. These algorithms are generally applied to data stored in SQL Server. In this paper the Microsoft Clustering algorithm is used to generate 3 clusters for the same wireless log. The tool uses Microsoft's own algorithm. For the data log, the user can specify the key, input and predictable attributes along with the number of clusters; the tool then computes the clustering from these settings and also offers suggestions, based on statistical testing, to improve the result.

Figure 4: Clustering using MS SQL Server 2008


5. RESULT INTERPRETATION

Nowadays various organizations provide tools that support different analytical techniques, but the most important thing is to interpret the results. In this paper, different tools are applied to the same wireless log but give different results. The three clusters are categorized as Normal activities, Suspicious activities and Anomalous activities.

5.1 Results using WEKA

Weka applies the SimpleKMeans algorithm to cluster the wireless log. It can perform the clustering on the loaded data set, or the user can supply a separate test set. Weka provides four distance functions for forming clusters of similar instances. For this log the Euclidean distance function is used, and 3 clusters are generated according to that distance. As the figure shows, 15% of the instances show anomalous activities, 44% are normal activities, and the rest (41%) are defined as suspicious activities.

Clustered Instances Result

Cluster    Clustered Instances
0          409 (41%)
1          440 (44%)
2          150 (15%)
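The percentages in the Weka output can be re-derived from the reported cluster sizes with a few lines of arithmetic. The counts are taken from the table above and sum to 999; the rule that maps cluster size to a category (smallest cluster read as anomalous, largest as normal) is an assumption that merely matches the interpretation given in the text.

```python
# Cluster sizes as reported in the Weka output.
cluster_sizes = {0: 409, 1: 440, 2: 150}

total = sum(cluster_sizes.values())
shares = {c: round(100 * n / total) for c, n in cluster_sizes.items()}

# Assumed labelling rule: rank clusters by size, smallest first.
ranked = sorted(cluster_sizes, key=cluster_sizes.get)
roles = dict(zip(ranked, ["Anomalous", "Suspicious", "Normal"]))

for c in sorted(cluster_sizes):
    print(f"cluster {c}: {cluster_sizes[c]} instances, {shares[c]}% -> {roles[c]}")
```

Cluster size alone does not establish that a group is anomalous; the labelling still has to be confirmed by inspecting the attribute values of each cluster.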

Figure 5: Results of Weka

5.2 Results using SPSS

SPSS performs the clustering under the settings described above, using the iterate-and-classify method. It assigns 25% of the instances to anomalous activities, 25% to suspicious activities and 50% to normal activities. SPSS is also used for statistical analysis of the data log: it shows the ANOVA table, which is used to check the normality and significance of the data for the given log. The results also include the matrix of distances between the clusters. This shows that the distance between cluster one and cluster two is very small compared with their distances to cluster three. This is interpreted to mean that the instances of cluster three are the most different from the others; that is, cluster three contains activities whose behavior differs from normal activity. For that reason cluster three is taken to hold the anomalous activities, which are intrusive, because intrusive events are events that disturb the normal behavior of the network.
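The reasoning above, in which the cluster whose centre lies farthest from the others is read as anomalous, can be sketched as a small distance-matrix computation. The cluster centres below are hypothetical placeholders, since the actual SPSS centres are not reproduced in the text.

```python
import math

# Hypothetical final cluster centres (two attributes each).
centres = {
    1: (0.20, 0.30),
    2: (0.35, 0.25),
    3: (2.10, 1.90),
}


def distance_matrix(centres):
    """Euclidean distance between every pair of cluster centres."""
    return {
        (a, b): math.dist(centres[a], centres[b])
        for a in centres
        for b in centres
    }


dist = distance_matrix(centres)

# The cluster whose centre has the largest total distance to the others.
most_isolated = max(centres, key=lambda a: sum(dist[(a, b)] for b in centres))
print(most_isolated)
```

With these placeholder centres, clusters 1 and 2 lie close together and cluster 3 is isolated, which mirrors the pattern the SPSS distance matrix is described as showing.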

Figure 6: Results of SPSS


5.3 Results using Tanagra

Clustering is used to generate homogeneous subgroups of instances. In Tanagra, the quality of the model is judged from the TSS (total sum of squares), WSS (within-cluster sum of squares) and BSS (between-cluster sum of squares). BSS is calculated from TSS and WSS, and the result ratio from BSS and TSS, as follows.

BSS = TSS - WSS = 39992.00 - 5665.077 = 34326.92

Result Ratio = BSS / TSS = 34326.92 / 39992.00 = approximately 0.858 (reported as 0.85)

The per-cluster results show that the numbers of instances in the 3 clusters do not differ much in proportion.
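The decomposition TSS = WSS + BSS can be checked on a small example. The sketch below computes the three sums of squares for hypothetical clustered data and then repeats the arithmetic on the TSS and WSS values reported for the wireless log; the function names and toy data are illustrative assumptions.

```python
def mean_vector(rows):
    """Attribute-wise mean of a list of rows."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def sum_of_squares(rows, centre):
    """Sum of squared Euclidean distances from each row to a centre."""
    return sum(sum((x - c) ** 2 for x, c in zip(row, centre)) for row in rows)


def variance_decomposition(clusters):
    """Return (TSS, WSS, BSS, BSS/TSS) for a list of clusters of rows."""
    all_rows = [row for cluster in clusters for row in cluster]
    tss = sum_of_squares(all_rows, mean_vector(all_rows))
    wss = sum(sum_of_squares(c, mean_vector(c)) for c in clusters)
    bss = tss - wss
    return tss, wss, bss, bss / tss


# Hypothetical clustered instances.
toy_clusters = [
    [(0.0, 0.0), (2.0, 0.0)],
    [(10.0, 0.0), (12.0, 0.0)],
]
tss, wss, bss, ratio = variance_decomposition(toy_clusters)
print(tss, wss, bss, ratio)

# Same arithmetic applied to the values reported for the wireless log.
reported_tss, reported_wss = 39992.00, 5665.077
reported_bss = reported_tss - reported_wss
reported_ratio = reported_bss / reported_tss
print(round(reported_bss, 2), round(reported_ratio, 3))
```

A ratio close to 1 means that most of the total variation lies between clusters rather than inside them, which is why this ratio is used as a quality measure for the partition.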

Figure 7: Results of Tanagra



5.4 Results using BIDS of MS SQL Server 2008

MS SQL Server 2008, produced by Microsoft, also provides facilities for data mining tasks, with effective mining algorithms. According to the results, it creates the clusters automatically from the behavior of the data. The results also include a lift chart and an accuracy chart, and the tool displays a statistical discrimination analysis. The tool gives the prediction model together with supporting results. The lift chart shows the overall accuracy of the model in terms of statistics, data analysis and model performance. For this log it shows a linear lift chart with statistical measurements. Of all the results, this tool gives the most accurate ones, because it also shows the statistics behind them, as shown below.

Figure 8: Results of BIDS


The results show the clustering statistics and the clusters formed according to the behavior of the data. Each cluster's density reflects the number of instances that fall in it. The tool also provides statistics on how far each instance is from its own cluster as well as from the others. Clustering in BIDS is more flexible because it supports EM and K-Means, each in scalable or non-scalable variants. The cluster diagram shows the characteristics of every cluster. The strength of the similarity between clusters is represented by the shading of the lines connecting them. Light shading denotes that the clusters are not very similar. In this cluster diagram, clusters eight, nine and ten are drawn with light shading, so their instances are not very similar to the others, and the instances belonging to those clusters show anomalous activities. Clusters five, six and seven are drawn with medium shading, so their instances are interpreted as suspicious. The remaining clusters are fully shaded, so they contain instances with normal behavior. The model gives a density of 16%, which is verified by calculating the ratio of the number of instances in each cluster to the total number of instances in the log. The tool therefore gives an ideal model for identifying every instance of the log statistically.

6. CONCLUSION

Recent research suggests data mining techniques for fraud detection and anomaly detection. Unsupervised learning is the most useful technique for this objective because it deals with the behavior of complex data. Cluster analysis will always produce a grouping based on several parameters, some of which the researcher can customize. This paper shows the use of different tools on the same wireless log and the interpretation of their results. Among these tools, MS SQL Server provides the best model. Some tools have data size limitations.
Some tools are best suited for pure statistical analysis. MS SQL Server has the limitation that it is not available under the GPL; however, among the tools tested it is preferable for dealing with lengthy, complex and dynamic behavioral data.

REFERENCES

1. Marc M. Van Hulle and Jesse Davis, "Data Mining", Laboratorium voor Neuro- en Psychofysiologie, Katholieke Universiteit Leuven, pp. 154.
2. Mrs. P. Nancy and Dr. R. Geetha Ramani, "A Comparison on Performance of Data Mining Algorithms in Classification of Social Network Data", International Journal of Computer Applications (0975-8887), Volume 32, No. 8, October 2011.
3. Glenn A. Growe, "Comparing Algorithms and Clustering Data: Components of the Data Mining Process", Thesis, Grand Valley State University, 1999.
4. Matthew S. Gast, "802.11 Wireless Networks: The Definitive Guide", O'Reilly, ISBN 0-596-00183-5.
5. Thuy Van T. Duong and Dinh Que Tran, "An Effective Approach for Mobility Prediction in Wireless Network based on Temporal Weighted Mobility Rule", International Journal of Computer Science and Telecommunications, Volume 3, Issue 2, February 2012, ISSN 2047-3338.
6. Mohamed Medhat Gaber, Shonali Krishnaswamy and Arkady Zaslavsky, "A Wireless Data Stream Mining Model", ICEIS.

7. M. Moorthy and S. Sathiyabama, "A Hybrid Data Mining based Intrusion Detection System for Wireless Local Area Networks", International Journal of Computer Applications (0975-8887), Volume 49, No. 10, July 2012.
8. Balaji Rengarajan and Gustavo de Veciana, "Data Mining and Coordination to Avoid Interference in Wireless Networks", supported by Intel Research Council and NSF Award CNS-0721532.
9. Jeffrey W. Seifert, "Data Mining: An Overview", CRS Report for Congress.
10. Karen Scarfone and Peter Mell, "Guide to Intrusion Detection and Prevention Systems (IDPS)", NIST Special Publication 800-94.
11. Theodoros Lappas and Konstantinos Pelechrinis, "Data Mining Techniques for (Network) Intrusion Detection Systems".
12. R. Manickam, D. Boominath and V. Bhuvaneswari, "An Analysis of Data Mining: Past, Present and Future", International Journal of Computer Engineering & Technology (IJCET), Volume 3, Issue 1, 2012, pp. 1-9, ISSN Print: 0976-6367, ISSN Online: 0976-6375.
13. Mr. M. Karthikeyan, Mr. M. Suriya Kumar and Dr. S. Karthikeyan, "A Literature Review on the Data Mining and Information Security", International Journal of Computer Engineering & Technology (IJCET), Volume 3, Issue 1, 2012, pp. 141-146, ISSN Print: 0976-6367, ISSN Online: 0976-6375.
14. R. Lakshman Naik, D. Ramesh and B. Manjula, "Instances Selection Using Advance Data Mining Techniques", International Journal of Computer Engineering & Technology (IJCET), Volume 3, Issue 2, 2012, pp. 47-53, ISSN Print: 0976-6367, ISSN Online: 0976-6375.

AUTHORS

A. Mr. Ajay M. Patel is an assistant professor in the faculty of computer application of Ganpat University, India. He has a strong interest in networking. He has also worked with data mining and has gained considerable expertise in applying data mining to wireless networks. His ongoing research focuses on intrusion detection in wireless LANs. He has published a number of journal and conference papers in his areas of research interest. He is currently working on pattern matching and prediction of wireless network traffic.

B. Dr. Ashok R. Patel is an eminent personality interested in finding ways to improve the teaching and learning process. He has extensive research experience in e-commerce and e-governance. He has guided more than 15 Ph.D. students, as well as postgraduate students, in diverse fields of computer application such as data mining, neural networks, computer networks and enterprise resource planning. He is the director of the department of computer science of H. North Gujarat University, India. He is also working as a director in AICTE, the apex body in India for technical education.

C. Ms. Hiral R. Patel is an assistant professor in the faculty of computer application of Ganpat University, India. She is starting to work on pattern matching and prediction of financial data and wireless network traffic.
