
APPLICATION OF TEXT MINING FOR AUTOMATIC BOOK CLASSIFICATION WITH THE NAIVE BAYES METHOD

Mira Kania Sabariah, Sheni Wahyuni
Universitas Komputer Indonesia
mira_ljuan@yahoo.com

Abstract
The need for information continues to grow, and one example is the need for information about books in a library. Large numbers of books are not arranged in an orderly way, and searching for a book is still done manually. An application for processing book data is therefore needed to give users more information. The text mining application for automatic book classification described here processes raw data from text databases, namely the title and synopsis of each book. Preprocessing applies tokenizing, filtering, stemming, tagging, and analyzing to the words of the title and synopsis to obtain keywords. From these keywords, frequent item sets can be generated and the Naive Bayes method used to classify books by category. The experiment shows that the Naive Bayes method can be used to classify books by category: it looks for the category with the highest probability, given how well the keywords of the title and synopsis match the keywords in the training data. Using the keywords, a user can retrieve book title information through the search engine.

Keywords: Text Mining, Naive Bayes Method, Search Engine

1. Introduction

The need for information has grown considerably. One example is the need for information when searching for books in a library. A library usually holds a large number of books, presented in categories that are not unique, and book searches are still done manually. This causes many obstacles, one of which is the difficulty of finding a particular book. A technique that makes book searching easier is therefore needed.

Progress in information technology has produced text-processing machines, in this case machine text classifiers. Such a text categorization engine is an alternative to categorizing text manually. Text analysis in this sense is the process of discovering new, previously unknown information from a collection of natural language text that is not structured.

The technique used to address the problems above is text mining with the Naive Bayes method. It performs syntactic and semantic analysis of a text, here the title and synopsis of a book (part-of-speech (POS) tagging or word grouping, word ambiguity, and generation of a parse tree for each sentence), builds an analogy to a relational table to determine word attributes and word frequencies, performs stemming and stopword removal, selects relevant words, and deletes the words that appear least and most often.

Proceedings of Regional Conference on Knowledge Integration in ICT 2010


2. Text Mining

Text mining is one of the applications of data mining. It is also often referred to as Text Data Mining (TDM) or Knowledge Discovery in Textual Databases (KDT). According to Tan (2006), "Text mining is a process and knowledge patterns extracts that are interesting and nontrivial (important) from text documents." In essence, the text mining process works in the same way as the data mining process in general; the difference is that the data being mined is a text database. The stages of knowledge discovery for text mining are the same as those for data mining. These stages, shown in Figure 1, are:

a. Selection: This stage obtains text data that is relevant to the analysis task in the later stages. Raw data is selected and filtered into target data. In this research, the raw text data is obtained from the library database of the Center for Informatics Research at LIPI, and the target data is the synopsis of each book.
b. Preprocessing: This stage carries out the initial processing of the target data (titles). It consists of parsing (splitting the book title word by word), stemming (finding the root word), and stopword removal (deleting unimportant words) to obtain the important words, or keywords.
c. Transformation: This stage converts the preprocessed data, that is, the important words from the book title, into a form suitable for the text mining algorithm. The preprocessed data is transformed into a table of the form (title, keyword), where a book title is treated as a transaction and its important words as the items purchased.
d. Text Mining / Data Mining: This is the most demanding stage of the process. It applies a data mining algorithm, namely the Naive Bayes algorithm, where the target of the mining is the important words of each book title stored in the table. This stage produces a collection of keywords used to find interesting knowledge.
e. Evaluation: The results of the text mining process are evaluated so that the interesting knowledge becomes more understandable and useful to the user. Interesting knowledge here means a group of titles ordered from those whose important words are most closely related to the keywords entered by the user down to those least related.


Figure 1. The stages in the process of knowledge discovery

3. Methods for Naive Bayes Text Classification

One classification method that can be used is the Naive Bayes method, often referred to as the Naive Bayes Classifier (NBC). NBC uses probability theory as its basis. Text classification has two stages (Graham, 2003). The first stage is training on a set of example documents (training examples). The second stage is the classification of documents whose category is unknown. In the Naive Bayes classifier, each document title is represented by a tuple of attributes $(a_1, a_2, \ldots, a_n)$, where $a_1$ is the first word, $a_2$ the second word, and so on, and $V$ is the set of book categories. At classification time, the Bayes approach returns the category label with the highest probability ($V_{MAP}$) given the input attributes $(a_1, a_2, \ldots, a_n)$:

$$V_{MAP} = \arg\max_{v_j \in V} P(v_j \mid a_1, a_2, \ldots, a_n) \tag{1}$$

Bayes' theorem states:

$$P(B \mid A) = \frac{P(A \mid B)\, P(B)}{P(A)} \tag{2}$$

Using Bayes' theorem, equation (1) can be written:

$$V_{MAP} = \arg\max_{v_j \in V} \frac{P(a_1, a_2, \ldots, a_n \mid v_j)\, P(v_j)}{P(a_1, a_2, \ldots, a_n)} \tag{3}$$

The value of $P(a_1, a_2, \ldots, a_n)$ is constant for all $v_j$, so this equation can be written as follows:

$$V_{MAP} = \arg\max_{v_j \in V} P(a_1, a_2, \ldots, a_n \mid v_j)\, P(v_j) \tag{4}$$


Calculating $P(a_1, a_2, \ldots, a_n \mid v_j)$ is difficult because the number of terms can be very large: it equals the number of all combinations of word positions multiplied by the number of categories. The Naive Bayes Classifier simplifies this by assuming that, within each category, the words are independent of one another. In other words:

$$P(a_1, a_2, \ldots, a_n \mid v_j) = \prod_i P(a_i \mid v_j) \tag{5}$$

Substituting this equation into equation (4) yields:

$$V_{MAP} = \arg\max_{v_j \in V} P(v_j) \prod_i P(a_i \mid v_j) \tag{6}$$
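As a minimal sketch of the decision rule that multiplies a category prior by the per-word conditional probabilities, the Python below picks the category with the largest product. It assumes the priors and conditional probabilities are already known. The function name `v_map`, the category names, and all probability values are made up for illustration and are not taken from the paper's data.

```python
from math import prod


def v_map(words, priors, cond):
    """Return the category v_j that maximises P(v_j) * prod_i P(a_i | v_j)."""
    scores = {
        v: priors[v] * prod(cond[v][a] for a in words)
        for v in priors
    }
    return max(scores, key=scores.get), scores


# Hypothetical probabilities, for illustration only.
priors = {"programming": 0.5, "network": 0.5}
cond = {
    "programming": {"program": 0.4, "network": 0.1},
    "network": {"program": 0.1, "network": 0.5},
}

best, scores = v_map(["program", "program", "network"], priors, cond)
print(best, scores)
```

Each word position contributes one factor, so a word that occurs twice in the document is multiplied in twice.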

$P(v_j)$ and the probability of word $w_k$ for each category, $P(w_k \mid v_j)$, are calculated during training:

$$P(v_j) = \frac{|\text{category}_j|}{|\text{numberofcategory}|} \tag{7}$$

$$P(w_k \mid v_j) = \frac{n_k + 1}{n + |\text{Vocabulary}|} \tag{8}$$

where $|\text{category}_j|$ is the number of training documents in category $j$ and $|\text{numberofcategory}|$ is the number of documents used in training ($|\text{examples}|$). $n_k$ is the number of times the word $w_k$ occurs in category $v_j$, $n$ is the number of all words in category $v_j$, and $|\text{Vocabulary}|$ is the number of unique (distinct) words in all the training data.

As described earlier, classification has two stages: training on example book titles, and classification of documents whose category is unknown. The stages of classifying books with Naive Bayes are:

1. From the data entered, only the book title, synopsis, and language are processed with Naive Bayes.
2. Tokenizing: the book title and synopsis are split into individual words. Example:
   Title: Algoritma dan Pemrograman
   Synopsis: Buku ini membahas bagaimana mengimplementasikan algoritma dalam bahasa PASCAL dan C.
   Tokenizing: |Algoritma| |dan| |Pemrograman| |Buku| |ini| |membahas| |bagaimana| |mengimplementasikan| |algoritma| |dalam| |bahasa| |PASCAL| |dan| |C|
3. Filtering: the tokens are separated into stopwords (unnecessary words) and keywords (important words). Example:


   Tokenizing: |Algoritma| |dan| |Pemrograman| |Buku| |ini| |membahas| |bagaimana| |mengimplementasikan| |algoritma| |dalam| |bahasa| |PASCAL| |dan| |C|
   Stopword: dan (words stored in the stopword table, such as conjunctions, question words, and punctuation, are not processed)
   Keyword: Algoritma, Pemrograman, Buku, Algoritma, Bahasa, Pascal, C (important words passed on to the stemming stage)
4. Stemming: each keyword is reduced to its root word by removing affixes. Example:
   Keyword: Algoritma, Pemrograman, Buku, Algoritma, Bahasa, Pascal, C
   Stemming:
     Algoritma → Algoritma
     Pemrograman → Program (affixes removed)
     Buku → Buku
     Algoritma → Algoritma
     Bahasa → Bahasa
     Pascal → Pascal
     C → C
5. Tagging: the handling of words or keywords that are in English.
6. Analyzing: categorization takes place in two stages. The first stage builds the training (sample) data by manual processing, including tokenizing, filtering, stemming, tagging, and category selection. The second stage is categorization using the Naive Bayes method:

$$V_{MB} = \arg\max_{v_j \in V} P(v_j) \prod_{i \in \text{positions}} P(w_k \mid v_j)$$
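The tokenizing, filtering, and stemming steps listed above can be sketched in Python using the paper's own example title and synopsis. This is only an illustration under stated assumptions: the stopword set is inferred from which words the example drops, the `STEMS` lookup is a hypothetical stand-in for a real Indonesian stemmer, and lowercasing during stemming is an added convenience for matching, not something the paper specifies.

```python
import re

# Assumed stopword list: the words the paper's example leaves out of the
# keyword list. A real system would read these from a stopword table.
STOPWORDS = {"dan", "ini", "membahas", "bagaimana", "mengimplementasikan", "dalam"}

# Hypothetical stem lookup standing in for a real Indonesian stemmer.
STEMS = {"pemrograman": "program"}


def tokenize(text):
    """Split text into word tokens, dropping punctuation and whitespace."""
    return re.findall(r"[A-Za-z0-9/]+", text)


def filter_stopwords(tokens):
    """Keep only the tokens that are not in the stopword list."""
    return [t for t in tokens if t.lower() not in STOPWORDS]


def stem(tokens):
    """Lowercase each keyword and map it to its root word when one is known."""
    return [STEMS.get(t.lower(), t.lower()) for t in tokens]


title = "Algoritma dan Pemrograman"
synopsis = ("Buku ini membahas bagaimana mengimplementasikan algoritma "
            "dalam bahasa PASCAL dan C.")

tokens = tokenize(title + " " + synopsis)
keywords = filter_stopwords(tokens)
stems = stem(keywords)
print(stems)
```

The stemmed list keeps repeated words, because the word frequencies are what the Naive Bayes stage counts.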

3.1 Learn_Naive_Bayes_Text(Examples, V)

Initial state: Examples and V are ready to be processed. Examples is a collection of text documents whose target values have been determined, and V is the set of possible target values.
Final state: the learned probability terms P(wk | vj) and the probabilities P(vj). P(wk | vj) is the probability that the word wk occurs in a document of class vj.

// Collect all words and tokens that appear in Examples
Vocabulary ← the set of all distinct words and other tokens that appear in the documents of Examples

// Calculate P(vj) and P(wk | vj)
for each target value vj in V do
    categoryj ← the subset of Examples whose target value is vj
    P(vj) ← |categoryj| / |numberofcategory|
    n ← the number of words in keywordj
    for each word wk in Vocabulary do
        nk ← the number of times wk occurs in categoryj
        P(wk | vj) ← (nk + 1) / (n + |Vocabulary|)
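A minimal Python sketch of this training procedure follows. It assumes each training example is a pair of an already preprocessed keyword list and a category label, and it takes the prior as the share of training documents in each category. The function name, the data layout, and the toy training set are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter


def learn_naive_bayes_text(examples):
    """Estimate P(v) and P(w | v) from (keywords, category) pairs.

    Returns (priors, cond, vocabulary). cond[v][w] uses the add-one
    estimate (n_k + 1) / (n + |Vocabulary|).
    """
    vocabulary = {w for words, _ in examples for w in words}
    categories = {v for _, v in examples}
    priors, cond = {}, {}
    for v in categories:
        docs_v = [words for words, cat in examples if cat == v]
        priors[v] = len(docs_v) / len(examples)
        counts = Counter(w for words in docs_v for w in words)
        n = sum(counts.values())  # total number of words in category v
        cond[v] = {
            w: (counts[w] + 1) / (n + len(vocabulary)) for w in vocabulary
        }
    return priors, cond, vocabulary


# Hypothetical training set, for illustration only.
examples = [
    (["program", "program", "aplikasi"], "programming"),
    (["network", "tcp/ip"], "network"),
]
priors, cond, vocabulary = learn_naive_bayes_text(examples)
print(priors)
print(cond["programming"])
```

Because of the added 1 in the numerator, a vocabulary word that never occurs in a category still receives a small non-zero probability, so one unseen word does not force the whole product to zero.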

3.2 Classify_Naive_Bayes

Calculate the target value for a document described by its keywords. $w_k$ denotes the keyword found at position $i$ among the keywords generated from the training data, and positions is the set of all keyword positions whose keyword is found in Vocabulary.

$$V_{MB} = \arg\max_{v_j \in V} P(v_j) \prod_{i \in \text{positions}} P(w_k \mid v_j)$$
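The classification step can be sketched in Python as below. Two things here are assumptions rather than the paper's method: the function sums logarithms instead of multiplying raw probabilities (this gives the same arg max while avoiding numerical underflow on long documents), and the probability tables are hypothetical values supplied inline so the example is self-contained.

```python
from math import log


def classify_naive_bayes(keywords, priors, cond):
    """Return the category maximising log P(v) + sum of log P(w | v).

    Keywords that are not in the training vocabulary are skipped, which
    mirrors restricting the product to positions found in Vocabulary.
    """
    best, best_score = None, float("-inf")
    for v, prior in priors.items():
        score = log(prior)
        for w in keywords:
            if w in cond[v]:
                score += log(cond[v][w])
        if score > best_score:
            best, best_score = v, score
    return best


# Hypothetical probability tables, for illustration only.
priors = {"programming": 0.5, "network": 0.5}
cond = {
    "programming": {"program": 3 / 7, "aplikasi": 2 / 7,
                    "network": 1 / 7, "tcp/ip": 1 / 7},
    "network": {"program": 1 / 6, "aplikasi": 1 / 6,
                "network": 2 / 6, "tcp/ip": 2 / 6},
}

result = classify_naive_bayes(["algoritma", "program", "bahasa"], priors, cond)
other = classify_naive_bayes(["network", "tcp/ip"], priors, cond)
print(result, other)
```

In the first call only "program" is in the vocabulary, so it alone decides between the two categories.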

Table 1. Data Sample

No  Keyword (Number of Keyword)            Category Books
1   Komputer(2), Aplikasi(4), Program(4)   Komputer Program
2   network(2), TCP/IP(1)                  Komputer network

Test data: the third document, containing the words Algoritma (2), Program (1), Buku (1), Bahasa (1), Pascal (1), C (1), 7 words in total.

Table 2. Examples of Probabilistic Data

                          P(Wk|Vj)
V                  P(Vj)  Komputer  Aplikasi  Program  Network  TCP/IP
Komputer Program   1/2    1/15      1/15      5/15     1/15     1/15
Komputer Network   1/2    1/10      1/10      1/10     1/10     1/10

The probabilistic results are:

Category 1: 1/2 * 1/15 * 1/15 * 5/15 * 1/15 * 1/15 = 0.00000652
Category 2: 1/2 * 1/10 * 1/10 * 1/10 * 1/10 * 1/10 = 0.000005

To categorize the third test document, the category with the higher probability is chosen. The appropriate category for the third test document is therefore Komputer Pemrograman (category 1).

4. Conclusion

The text mining application for automatic book classification processes raw data from text databases, namely the title and synopsis of each book. Preprocessing applies tokenizing, filtering, stemming, tagging, and analyzing to the words of the title and synopsis to obtain keywords. From these keywords, frequent item sets can be generated and the Naive Bayes method used to classify books by category. The experiment shows that the Naive Bayes method can be used to classify books by category: it looks for the category with the highest probability, given how well the keywords of the title and synopsis match the keywords in the training data. Using the keywords, a user can retrieve book title information through the search engine.


5. Bibliography

[1] Graham, Paul (2003), "Better Bayesian filtering", http://www.paulgraham.com/better.html
[2] Han, Jiawei and Kamber, M. (2001), "Data Mining: Concepts and Techniques", Academic Press, USA.
[3] Tan, Pang-Ning, et al. (2006), Introduction to Data Mining, Boston, Pearson Education.
[4] (March 27, 2009), http://en.wikipedia.org/wiki/Stopword
[5] (March 27, 2009), http://www.perseus.tufts.edu/
[6] (June 4, 2009), http://www.tartarus.org/~martin/PorterStemmer
