
Introduction

This document describes how I completed the homework and reports the results of my experiments. It has three sections: the Prepare Data section explains how the data was prepared and gives some statistics about it; the next section gives the results of the Naïve Bayes method; the third section gives the results of k-nearest neighbor with varying k.

Prepare Data:
I downloaded the data from the UCI ML repository: http://archive.ics.uci.edu/ml/machine-learning-databases/adult/. The downloaded data is stored in the folder data/original, but opening it with the Weka GUI produced error messages, so before using the data I cleaned it by removing spaces in both adult.data and adult.test. In adult.test, I also removed the first line (which is not a data sample) and corrected the class label on each line: the class at the end of each line is ">50K." or "<=50K.", so I removed the trailing dot. The cleaned data is stored in the data/cleaned folder.

Because the dataset provided by UCI is already split into training and testing sets, I did not use the partitions program. Instead, I wrote a program that reads the training and testing files and converts them into ARFF format (the Weka file format). The program takes two arguments: the first is the input file and the second is the output file. Before running it, you should prepare two input files with the .data and .names extensions.

The data describes adults with 14 attributes: age, workclass, fnlwgt, education, education-num, marital-status, occupation, relationship, race, sex, capital-gain, capital-loss, hours-per-week, native-country, classified into 2 classes (more details are given in the adult.names file). The training dataset has 32561 instances and the testing dataset has 16281 instances.
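The cleaning step described above (trimming spaces and stripping the trailing dot from the class label in adult.test) can be sketched in plain Java. This is a hypothetical reconstruction, not the author's actual program; the file paths are the ones named in the text.

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;
import java.util.stream.*;

// Sketch of the cleaning step: remove spaces around comma-separated
// fields and, for the test file, strip the trailing '.' from the
// class label (">50K." / "<=50K." becomes ">50K" / "<=50K").
public class CleanAdultData {

    // Normalize one CSV line: trim whitespace around each field and,
    // for test-file lines, drop the trailing dot on the class label.
    static String cleanLine(String line, boolean isTestFile) {
        String[] fields = line.split(",");
        for (int i = 0; i < fields.length; i++) {
            fields[i] = fields[i].trim();
        }
        String last = fields[fields.length - 1];
        if (isTestFile && last.endsWith(".")) {
            fields[fields.length - 1] = last.substring(0, last.length() - 1);
        }
        return String.join(",", fields);
    }

    public static void main(String[] args) throws IOException {
        Path in = Paths.get("data/original/adult.test");
        Path out = Paths.get("data/cleaned/adult.test");
        List<String> cleaned = Files.readAllLines(in).stream()
                .skip(1)                      // adult.test starts with a non-data line
                .filter(l -> !l.isEmpty())
                .map(l -> cleanLine(l, true))
                .collect(Collectors.toList());
        Files.createDirectories(out.getParent());
        Files.write(out, cleaned);
    }
}
```

The same `cleanLine` call with `isTestFile = false` covers adult.data, where the class labels have no trailing dot.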

Naïve Bayes
I wrote code to train a Naïve Bayes model on the training dataset and used the testing dataset to evaluate it. The time to build the model is below 1 second. Evaluating on the 16281 instances in the testing dataset, there are 13534 correctly classified instances (83.1276%) and 2747 incorrectly classified instances (16.8724%). You can see this information in the result/navie.log file.

K-Nearest Neighbor
In this task, I used Java code to call the KNN method via weka.classifiers.lazy.IBk. The parameters for this program are the training file, the testing file, and k, the number of nearest neighbors. Finally, I wrote a bat script to call the program with k in {1, 2, 3, 4, 5, 10, 15, 20}. The results are shown in the table below.

K         | 1        | 2        | 3        | 4        | 5        | 10       | 15       | 20
Time      | 63 ms    | 62 ms    | 62 ms    | 62 ms    | 62 ms    | 62 ms    | 62 ms    | 63 ms
Correct   | 79.2028% | 77.0653% | 81.6965% | 80.6707% | 82.4949% | 82.7406% | 83.4654% | 83.447%
Incorrect | 20.7972% | 22.9347% | 18.3035% | 19.3293% | 17.5051% | 17.2594% | 16.5346% | 16.553%
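The nearest-neighbor classification performed by IBk can be illustrated with a minimal plain-Java sketch: compute distances to all training points and take a majority vote among the k closest. This is a simplified illustration with invented toy data, not Weka's implementation.

```java
import java.util.*;

// Minimal k-nearest-neighbor classifier on numeric features with
// Euclidean distance and majority voting, illustrating the idea
// behind weka.classifiers.lazy.IBk.
public class SimpleKnn {
    final double[][] X;   // training feature vectors
    final String[] y;     // training class labels
    final int k;          // number of neighbors to vote

    SimpleKnn(double[][] X, String[] y, int k) {
        this.X = X; this.y = y; this.k = k;
    }

    // Euclidean distance between two feature vectors.
    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            s += d * d;
        }
        return Math.sqrt(s);
    }

    String classify(double[] q) {
        // Sort training indices by distance to the query point.
        Integer[] idx = new Integer[X.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        Arrays.sort(idx, Comparator.comparingDouble(i -> dist(X[i], q)));
        // Majority vote among the k closest neighbors.
        Map<String, Integer> votes = new HashMap<>();
        for (int i = 0; i < k && i < idx.length; i++) {
            votes.merge(y[idx[i]], 1, Integer::sum);
        }
        return Collections.max(votes.entrySet(),
                Map.Entry.comparingByValue()).getKey();
    }

    public static void main(String[] args) {
        // Toy data (invented): two numeric features per instance.
        double[][] X = {{25, 40}, {30, 45}, {55, 20}, {60, 15}};
        String[] y = {"<=50K", "<=50K", ">50K", ">50K"};
        SimpleKnn knn = new SimpleKnn(X, y, 3);
        System.out.println(knn.classify(new double[]{28, 42}));  // prints "<=50K"
    }
}
```

Unlike this sketch, IBk also handles nominal attributes and normalizes attribute ranges, which matters for a mixed dataset like Adult.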

We can see that as k, the number of nearest neighbors, increases, the accuracy of the model tends to increase, with some fluctuation at small k. Note that very small k tends to overfit the training data, while a very large k smooths the decision boundary too much and can underfit.
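For completeness, the Naïve Bayes model from the earlier section can also be sketched in plain Java as a categorical model with Laplace smoothing. This is a teaching sketch with invented toy data, not Weka's NaiveBayes class (which additionally models numeric attributes with Gaussian estimates).

```java
import java.util.*;

// Categorical Naïve Bayes with Laplace smoothing: pick the class
// maximizing log P(class) + sum_i log P(attribute_i = value | class).
public class SimpleNaiveBayes {
    final Map<String, Integer> classCounts = new HashMap<>();
    final Map<String, Map<String, Integer>> featCounts = new HashMap<>();
    final Set<String>[] featValues;  // distinct values seen per attribute
    int n = 0;                       // total training instances

    @SuppressWarnings("unchecked")
    SimpleNaiveBayes(int numFeatures) {
        featValues = new Set[numFeatures];
        for (int i = 0; i < numFeatures; i++) featValues[i] = new HashSet<>();
    }

    void train(String[] x, String label) {
        n++;
        classCounts.merge(label, 1, Integer::sum);
        for (int i = 0; i < x.length; i++) {
            featValues[i].add(x[i]);
            featCounts.computeIfAbsent(label, c -> new HashMap<>())
                      .merge(i + "=" + x[i], 1, Integer::sum);
        }
    }

    String classify(String[] x) {
        String best = null;
        double bestLog = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Integer> e : classCounts.entrySet()) {
            double logp = Math.log((double) e.getValue() / n);  // class prior
            for (int i = 0; i < x.length; i++) {
                int count = featCounts.get(e.getKey())
                                      .getOrDefault(i + "=" + x[i], 0);
                // Laplace smoothing avoids zero probabilities for
                // attribute values unseen in a class.
                logp += Math.log((count + 1.0)
                        / (e.getValue() + featValues[i].size()));
            }
            if (logp > bestLog) { bestLog = logp; best = e.getKey(); }
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy data (invented): two categorical features per instance.
        SimpleNaiveBayes nb = new SimpleNaiveBayes(2);
        nb.train(new String[]{"Private", "Male"}, ">50K");
        nb.train(new String[]{"Private", "Male"}, ">50K");
        nb.train(new String[]{"State-gov", "Female"}, "<=50K");
        System.out.println(nb.classify(new String[]{"Private", "Male"}));  // prints ">50K"
    }
}
```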
