
2013 Fifth International Conference on Intelligent Human-Machine Systems and Cybernetics

An Improved Mutual Information-based Feature Selection Algorithm for Text Classification


JIANG Xiao-Yu
Business School, Beijing Institute of Fashion Technology, Beijing, China. E-mail: sxyjiangxiaoyu@bift.edu.cn

JIN Shui
Business School, Beijing Institute of Fashion Technology, Beijing, China. E-mail: sxyjinshui@bift.edu.cn

Abstract- Feature selection plays an important role in text classification and contributes directly to classification accuracy. The mutual information (MI)-based feature selection method has several defects: it tends to select rare words and words from small-sample categories as features, and its value can be negative. To correct these defects, this paper proposes an improved feature evaluation function for automatic text classification that takes word frequency, concentration between classes, and dispersion within a class into overall consideration. Experimental results show that the improved algorithm remedies the original MI evaluation function's tendency to select rare words and significantly improves classification performance.
Keywords- text classification; feature selection; mutual information

I. INTRODUCTION

Automatic text classification is the process of assigning documents to one or several categories according to their content. With the rapid growth of online information, text classification has become a key technology for processing and organizing large numbers of documents, and it is already widely used in e-mail classification, information filtering, and so on. High dimensionality of the feature space is a major problem in text classification: the number of terms (i.e., features) in a collection of documents is generally large, and only a few of them are informative. Hence, we need to select representative features from the original feature space (i.e., feature selection) to reduce the dimensionality of the feature space and to improve the efficiency and precision of the classifier. Current feature selection methods are based on statistical theory and machine learning [1]. Some well-known methods are information gain [2-3], expected cross entropy [4], the weight of evidence for text [5], term frequency [4, 6], mutual information [5, 7], CHI, and so on.

The remainder of this paper is structured as follows: Section 2 describes several common feature selection algorithms; Section 3 describes the deficiencies of the mutual information feature selection algorithm and proposes an improved MI feature selection method for text classification; Section 4 describes the Naïve Bayes classifier; Section 5 presents our experiments and results; and Section 6 summarizes our conclusions.

II. FEATURE SELECTION METHODS

Feature selection for text classification is the task of reducing the dimensionality of the feature space by identifying informative features; its primary goals are improving classification effectiveness, computational efficiency, or both. The performance of a classifier is affected by the feature selection mechanism employed. Here we describe several commonly used feature selection algorithms: information gain, document frequency, and mutual information.

A. Information Gain

In the information gain feature selection method, the criterion for evaluating whether a feature is important is how much useful information the feature provides for classification: the more information a feature brings, the more important it is. The information content changes when the presence or absence of a feature in a document becomes known, and the difference between the two information contents is the information gain. Let {c_i}, i = 1, ..., k, denote the set of categories. The information gain of feature t is defined as [8]:

G(t) = -\sum_{i=1}^{k} p(c_i)\log_2 p(c_i) + p(t)\sum_{i=1}^{k} p(c_i \mid t)\log_2 p(c_i \mid t) + p(\bar{t})\sum_{i=1}^{k} p(c_i \mid \bar{t})\log_2 p(c_i \mid \bar{t})    (1)

where p(c_i) is the probability of category c_i, i.e., the number of documents in category c_i divided by the total number of documents; p(t) is the probability that feature t appears in the training corpus; p(\bar{t}) is the probability that feature t does not appear in the training set; p(c_i \mid t) denotes the probability that a document containing feature t belongs to category c_i; and p(c_i \mid \bar{t}) denotes the probability that a document not containing feature t belongs to category c_i.
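As an illustration, the sketch below evaluates Eq. (1) from simple document counts. It is a minimal example and not the authors' implementation; the corpus format (a list of (term set, label) pairs) and the small constant used to avoid log(0) are our own assumptions.

```python
import math
from collections import Counter

def information_gain(term, docs):
    """Eq. (1): information gain of `term` over a corpus given as [(set_of_terms, label), ...]."""
    n = len(docs)
    eps = 1e-12                                   # assumption: avoids log(0) for empty splits
    class_counts = Counter(label for _, label in docs)
    with_t = [label for terms, label in docs if term in terms]
    without_t = [label for terms, label in docs if term not in terms]

    def weighted_log(labels):
        # sum_i p(c_i | split) * log2 p(c_i | split) over one split of the corpus
        total = len(labels)
        if total == 0:
            return 0.0
        return sum((c / total) * math.log2(c / total + eps)
                   for c in Counter(labels).values())

    prior = sum((c / n) * math.log2(c / n + eps) for c in class_counts.values())
    return (-prior
            + (len(with_t) / n) * weighted_log(with_t)
            + (len(without_t) / n) * weighted_log(without_t))

# Tiny usage example with a hypothetical three-document corpus
docs = [({"ball", "team"}, "sport"), ({"law", "court"}, "law"), ({"ball", "goal"}, "sport")]
print(information_gain("ball", docs))
```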

B. Document Frequency

Document frequency is the number of documents in which a term occurs. The assumption behind document frequency feature selection is that low-frequency features in the training set carry little information and contribute little to classification performance. The advantage of document frequency is that it scales easily to large corpora, with a computational complexity approximately linear in the number of training documents, and it achieves good results in practical applications, so it is often used as a baseline against which other feature selection methods are evaluated. The disadvantage is that words whose frequency falls below the predefined threshold may still carry important information for classification and, if discarded, may reduce the accuracy of the classifier.
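A minimal sketch of document-frequency thresholding follows; the corpus format and the example threshold are assumptions for illustration, not settings from the paper.

```python
from collections import Counter

def df_select(docs, threshold=3):
    """Keep terms whose document frequency is at least `threshold`.

    `docs` is a list of sets of terms, one set per document (assumed format).
    """
    df = Counter()
    for terms in docs:
        df.update(set(terms))                 # count each term at most once per document
    return {t for t, count in df.items() if count >= threshold}
```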

C. Mutual Information

Mutual information is a general measure of the dependence between two random variables: it expresses the quantity of information obtained about X by observing Y. On a discrete domain, the mutual information between a feature t and a category c is defined as:
I(t,c) = \log_2 \frac{p(t \mid c)}{p(t)} = \log_2 \frac{p(t,c)}{p(t)\,p(c)} \approx \log_2 \frac{A \cdot N}{(A+C)(A+B)}    (2)

where A is the number of documents in which t and c co-occur, B is the number of documents in which t occurs without c, C is the number of documents in which c occurs without t, and N is the total number of documents in the training corpus. If t and c are independent, then I(t,c) = 0. To measure the importance of a feature in global feature selection, we combine the category-specific scores of a term in two alternative ways:
I_{avg}(t) = \sum_{i=1}^{m} P(c_i)\, I(t, c_i)    (3)

I_{max}(t) = \max_{i=1}^{m} \{ I(t, c_i) \}    (4)
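The following sketch computes Eqs. (2)-(4) from the document counts A, B, C and N. It is an illustrative implementation under our own assumptions about the input format (per-category lists of document term sets); small constants are added to avoid division by zero and log(0).

```python
import math

def mutual_information(A, B, C, N):
    """Eq. (2): I(t, c) estimated from document counts."""
    eps = 1e-12                                   # assumption: guards log(0) and zero division
    return math.log2((A * N + eps) / ((A + C) * (A + B) + eps))

def mi_scores(term, docs_by_class):
    """Per-class MI plus the average (Eq. 3) and maximum (Eq. 4) scores for `term`.

    `docs_by_class` maps each category to a list of term sets (assumed format).
    """
    N = sum(len(docs) for docs in docs_by_class.values())
    scores = {}
    for c, docs in docs_by_class.items():
        A = sum(1 for terms in docs if term in terms)                    # t and c co-occur
        C_count = len(docs) - A                                          # c without t
        B = sum(1 for c2, d2 in docs_by_class.items() if c2 != c
                for terms in d2 if term in terms)                        # t without c
        scores[c] = mutual_information(A, B, C_count, N)
    p_c = {c: len(docs) / N for c, docs in docs_by_class.items()}
    i_avg = sum(p_c[c] * scores[c] for c in scores)
    i_max = max(scores.values())
    return scores, i_avg, i_max
```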

III. IMPROVED MUTUAL INFORMATION ALGORITHM

A. Standards of Strong Features

Strong features (i.e., good features) are terms that can effectively distinguish one class from the others and affect the performance of text classification more strongly than other terms. In general, strong features should have the following three characteristics [8]:

1) High word frequency: the higher the frequency of a term within a certain class, the better it represents that category, and the more information it carries.

2) Strong concentration between classes: a term with high discriminative power should be concentrated in one certain class rather than evenly distributed over every class.

3) Large dispersion within a class: the more texts of a class a term occurs in, the more dispersed the term is within that class, the more information it carries, and the better it represents the category.

B. Defects of the MI Algorithm

1) Prone to rare words: According to Eq. (2), when p(t|c) is equal, the I(t,c) of rare words is greater than that of common words, so rare words are prone to being selected as strong features. This is obviously unreasonable and inconsistent with the standards of strong features. Furthermore, because the MI evaluation function does not take word frequency and other factors into account, we found in our experiments that a great many words, sometimes thousands, share exactly the same MI value; such words can only be discarded at random, which loses a large amount of information that is useful for classification.

2) Favor of categories with small samples: If most words in the training set are low-frequency words, the MI values of words in categories with small samples easily become greater than those in other categories, so most of the features selected with the MI evaluation function are biased toward words from the small categories, or may even all come from those categories. That is, feature selection based on the MI metric is biased toward the words contained in categories with small samples.

3) Conflict with information theory: According to Eq. (2), I(t,c) compares the probability of observing feature t and category c together (the joint probability) with the probabilities of observing t and c independently. If there is a genuine association between t and c, the joint probability p(t,c) is much larger than the chance value p(t)p(c), and consequently I(t,c) >> 0. If there is no significant relationship between t and c, then p(t,c) ≈ p(t)p(c), and thus I(t,c) ≈ 0. If t and c are in complementary distribution, p(t,c) is much smaller than p(t)p(c), forcing I(t,c) << 0. That is, the MI of t and c can be negative, which conflicts with the definition of MI in information theory, where it is always non-negative; hence this mutual information is not the one defined in information theory. In our experiments, I_avg(t) was negative for about 18% of the words. For example, assuming p(t) = 0.8, p(c) = 0.7 and p(t,c) = 0.5, then

I(t,c) = \log_2 \frac{p(t,c)}{p(t)\,p(c)} = \log_2 \frac{0.5}{0.8 \times 0.7} = \log_2 0.89 < 0    (6)

C. Improved Algorithm

Whether or not a term t appears in a class c plays a positive or negative role in text categorization. To compensate for the three drawbacks mentioned above, we propose a new mutual information evaluation function that combines the standards of strong features with the definition of mutual information in information theory, as shown in the following formula:

MI(t,c) = \left( p(t,c)\log_2 \frac{p(t,c)}{p(t)\,p(c)} + p(\bar{t},c)\log_2 \frac{p(\bar{t},c)}{p(\bar{t})\,p(c)} \right) \cdot P_f(t,c) \cdot D_f(t,c)    (7)

where t is a term, p(c) is the probability of class c, and P_f(t,c) is the probability of class c and term t occurring simultaneously, calculated as follows:


P_f(t,c) = p(c) \cdot P_f(t \mid c) = p(c) \cdot \frac{1 + \sum_{k=1}^{n} tf(t,c,d_k)}{|V| + \sum_{i=1}^{|V|} \sum_{k=1}^{n} tf(t_i,c,d_k)}    (8)

where tf(t,c,d_k) is the term frequency of term t in document d_k of class c, |V| is the total number of terms, and n is the total number of documents in class c. D_f(t,c) is the dispersion of term t in class c, calculated as follows:

D_f(t,c) = \frac{df(t,c)}{N}    (9)

where df(t,c) is the document frequency of term t in class c of the training set, and N is the total number of documents in class c of the training set.
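To make Eqs. (7)-(9) concrete, the sketch below scores a term for one class from raw term-frequency counts. It is a simplified illustration under our own assumptions about the input format (a list of per-document term-frequency dictionaries for each class), not the authors' code.

```python
import math

def improved_mi(term, class_docs, other_docs, vocab_size):
    """Score `term` for one class using Eqs. (7)-(9).

    class_docs / other_docs: lists of {term: frequency} dicts (assumed format).
    vocab_size: |V|, the total number of distinct terms in the training set.
    """
    eps = 1e-12                                   # assumption: guards log(0)
    n_c, n_o = len(class_docs), len(other_docs)
    n_total = n_c + n_o

    # Document-level probabilities for the MI part of Eq. (7)
    p_c = n_c / n_total
    p_t = sum(1 for d in class_docs + other_docs if term in d) / n_total
    p_tc = sum(1 for d in class_docs if term in d) / n_total          # p(t, c)
    p_ntc = p_c - p_tc                                                # p(not t, c)
    mi = (p_tc * math.log2((p_tc + eps) / (p_t * p_c + eps))
          + p_ntc * math.log2((p_ntc + eps) / ((1 - p_t) * p_c + eps)))

    # Eq. (8): smoothed within-class frequency weight P_f(t, c)
    tf_t = sum(d.get(term, 0) for d in class_docs)
    tf_all = sum(sum(d.values()) for d in class_docs)
    p_f = p_c * (1 + tf_t) / (vocab_size + tf_all)

    # Eq. (9): dispersion D_f(t, c) = df(t, c) / N within the class
    d_f = sum(1 for d in class_docs if term in d) / n_c

    return mi * p_f * d_f
```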

IV. NAÏVE BAYES CLASSIFIER

According to the bag-of-words representation, a document d_i can be represented by a feature vector with one feature variable for each word t_k in the given vocabulary V = {t_1, t_2, ..., t_n} of n distinct words. Let C = {c_1, c_2, ..., c_{|C|}} be the set of |C| predefined classes; each document belongs to one or several classes. Given a new document d, the probability that d belongs to class c_i is:

P(c_i \mid d) = \frac{P(c_i)\,P(d \mid c_i)}{P(d)}    (10)

If a new document is to be assigned to a single class only, the class c* with the highest posterior probability is allocated to the document d. Assuming a multinomial event model and class-conditional independence of words yields the well-known Naïve Bayes classifier, which computes the most probable class for d as:

c^{*} = \underset{i = 1,\ldots,|C|}{\arg\max}\, P(c_i \mid d) = \underset{i = 1,\ldots,|C|}{\arg\max}\, P(c_i) \prod_{k=1}^{n} P(t_k \mid c_i)^{N(t_k, d)}    (11)

where N(t_k, d) is the number of occurrences of word t_k in document d. The probabilities P(t_k \mid c_j) are usually estimated with a Laplacean prior:

P(t_k \mid c_j) = \frac{1 + \sum_{d_i \in c_j} N(t_k, d_i)}{|V| + \sum_{r=1}^{|V|} \sum_{d_i \in c_j} N(t_r, d_i)}    (12)
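A compact multinomial Naïve Bayes sketch following Eqs. (10)-(12) is shown below. It is a minimal illustration, not the exact classifier used in the experiments; the training-data format (lists of {term: count} dictionaries per class) and the unseen-term fallback are our own assumptions, and log probabilities are used for numerical stability.

```python
import math
from collections import defaultdict

class MultinomialNB:
    """Minimal multinomial Naïve Bayes with Laplace smoothing (Eqs. 10-12)."""

    def fit(self, docs_by_class):
        # docs_by_class: {class: [ {term: count}, ... ]}  (assumed format)
        self.vocab = {t for docs in docs_by_class.values() for d in docs for t in d}
        n_docs = sum(len(docs) for docs in docs_by_class.values())
        self.log_prior = {c: math.log(len(docs) / n_docs)
                          for c, docs in docs_by_class.items()}
        self.log_cond = {}
        for c, docs in docs_by_class.items():
            counts = defaultdict(int)
            for d in docs:
                for t, n in d.items():
                    counts[t] += n
            total = sum(counts.values())
            # Eq. (12): Laplace-smoothed P(t_k | c)
            self.log_cond[c] = {t: math.log((1 + counts[t]) / (len(self.vocab) + total))
                                for t in self.vocab}
        return self

    def predict(self, doc):
        # Eq. (11): argmax_c  log P(c) + sum_k N(t_k, d) * log P(t_k | c)
        def score(c):
            return self.log_prior[c] + sum(
                n * self.log_cond[c].get(t, math.log(1 / len(self.vocab)))  # unseen-term fallback (assumption)
                for t, n in doc.items())
        return max(self.log_cond, key=score)
```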

V. EXPERIMENTS AND DISCUSSION

A. Datasets

The experimental data used in this paper come from the Chinese natural language processing group in the Department of Computer Information and Technology at Fudan University, with a total of 20 categories. The training corpus train.rar contains 9,804 documents and answer.rar contains 9,833 documents. To keep the algorithm efficient, we use only a subset of the documents for our experiments, as shown in Table I.

TABLE I. EXPERIMENTAL DATA

Category      Quantity of training documents   Quantity of testing documents
Law           510                              200
Art           440                              200
Sport         500                              200
Politics      490                              200
Computer      500                              200
Economy       500                              200
Education     480                              200
Agriculture   470                              200

B. Experimental Metric

To evaluate the performance of the classifier, a confusion matrix is created for each category, as shown in Table II.

TABLE II. CONFUSION MATRIX

                                    Positive judgment   Negative judgment
Belongs to the category             A                   C
Does not belong to the category     B                   D

A is the number of documents the classifier assigned to the category that actually belong to it (correct predictions); B is the number of documents the classifier assigned to the category that should not have been; C is the number of documents that belong to the category but were not assigned to it by the classifier; and D is the number of documents that do not belong to the category and were not assigned to it. Precision and recall are often used to evaluate the performance of a classifier. Precision is the ratio of the number of documents labeled correctly by the classifier to the number of documents the classifier labeled as belonging to the category [10]:

Precision = \frac{A}{A + B}    (13)

Recall is the ratio of the number of documents labeled correctly by the classifier to the number of documents that actually belong to the category [10]:

Recall = \frac{A}{A + C}    (14)

When precision and recall are granted equal importance, a composite index called the F-value is defined as:

F_{\beta} = \frac{(1 + \beta^{2}) \cdot Precision \cdot Recall}{\beta^{2} \cdot Precision + Recall}    (15)

where β is a positive parameter: if β < 1, it emphasizes precision; otherwise, it emphasizes recall. To reach a compromise between recall and precision, we set β = 1 in this paper (namely, the F1-value).
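As a small illustration of Eqs. (13)-(15), the sketch below derives precision, recall, and the Fβ-value from the confusion-matrix counts A, B, and C. It is a straightforward transcription of the formulas; the guards against empty denominators and the example counts are our own assumptions.

```python
def precision_recall_f(A, B, C, beta=1.0):
    """Precision, recall and F_beta from confusion-matrix counts (Eqs. 13-15)."""
    precision = A / (A + B) if (A + B) else 0.0      # Eq. (13)
    recall = A / (A + C) if (A + C) else 0.0         # Eq. (14)
    denom = beta ** 2 * precision + recall
    f_beta = (1 + beta ** 2) * precision * recall / denom if denom else 0.0  # Eq. (15)
    return precision, recall, f_beta

# Example: 160 correct assignments, 40 false positives, 40 missed documents
print(precision_recall_f(160, 40, 40))   # -> (0.8, 0.8, 0.8)
```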
In order to take the precision and recall of every class into overall consideration and give a comprehensive evaluation of our system, we use the macro average F1-value (Macro-F) as the evaluation criterion:

Macro\text{-}F = \sum_{i=1}^{n} \frac{N_i}{N} F_{\beta=1}(c_i) = \sum_{i=1}^{n} \frac{N_i}{N} \cdot \frac{2 P_i R_i}{P_i + R_i}    (16)

where N_i is the number of documents of class c_i in the training set, N is the total number of documents in the training set, F_{\beta=1}(c_i) is the comprehensive evaluation value of class c_i, and P_i and R_i are the precision and recall of class c_i.

C. Preprocessing

Prior to feature selection, we carried out lexical analysis on the training and testing sets using the Chinese lexical analysis system ICTCLAS from the Institute of Computing Technology, CAS, and then identified unknown words based on the co-occurrence frequency of adjacent words. We removed the words in a standard stop-word list. Since multi-character words carry more information than single-character words, which is more beneficial to text classification, we selected only multi-character words as candidate features. To reduce the amount of computation for feature selection and classification, a candidate feature was kept only if its word frequency exceeded a predefined threshold; we set the threshold to 3 in our experiments.

D. Experimental Results

Four methods (document frequency, information gain, mutual information, and the improved MI we propose) are used for feature selection in our experiments, and a Naïve Bayes classifier is adopted to evaluate the strengths and weaknesses of the four feature selection methods. To test classification performance with different feature dimensionalities, the selected dimensionality ranges from 500 to 8000; the macro average F1-values are shown in Table III.

TABLE III. MACRO AVERAGE F1-VALUE (%)

Feature dimensionality   DF      IG      MI      Improved MI
500                      59.87   73.86   38.57   73.51
1000                     61.18   79.34   43.32   77.37
2000                     64.74   82.72   47.80   82.17
4000                     66.73   85.16   46.55   85.49
6000                     65.68   85.73   46.08   84.81
8000                     66.36   86.28   45.74   84.50

We can reach the following conclusions from Table III: (1) the best results are achieved by the IG method and the worst by the MI method, because the MI evaluation function fails to take word frequency into consideration and rare words are often selected as the best features; (2) when the feature dimensionality varies from 500 to 4000, the macro average F1-value changes significantly, but when the dimensionality exceeds 6000, the performance of the system tends to stabilize and may even decrease slightly; and (3) compared with the traditional MI method, the improved one leads to significantly better macro average metrics, indicating that an MI method that takes overall word frequency, concentration, and dispersion into consideration can improve classification performance.

VI. CONCLUSIONS

Based on an analysis of the traditional MI method, this paper proposes a new improved MI feature selection algorithm for text classification. The experimental results indicate that the improved method overcomes the tendency of the original MI evaluation function to select rare words as features and significantly improves classification performance. However, because several factors, such as overall word frequency, concentration, and dispersion, are taken into account, the computational workload is higher. Our next step is to focus on reducing the computational complexity of feature selection.

ACKNOWLEDGMENT

This paper is supported by the Beijing Key Lab of Digital Media & Human-Computer Interaction (No. KF2012-02) and the Science and Technology Project of the Beijing Municipal Commission of Education (No. KYJH02120207, No. KM201310012008).

REFERENCES

[1] S. Doan and S. Horiguchi, "An efficient feature selection using multi-criteria in text categorization for naïve Bayes classifier," WSEAS Transactions on Information Science and Applications, vol. 2, no. 2, pp. 98-103, 2005.
[2] Liu Haifeng, Wang Yuanyuan, Yao Zeqing, and Chen Qi, "A method of reducing features based on feature selection twice in text classification," Journal of the China Society for Scientific and Technical Information, vol. 28, pp. 23-27, February 2009.
[3] Chen Ping, Liu Xiao-xia, and Li Ya-jun, "Study on an improved mutual information for feature selection in Chinese text classification," Microelectronics & Computer, vol. 25, no. 6, pp. 194-196, July 2008.
[4] Yan Xu, Gareth Jones, and JinTao Li, "A study on mutual information-based feature selection for text categorization," Journal of Computational Information Systems, vol. 3, no. 3, pp. 1007-1012, 2007.
[5] Jiang Xiao-yu, "Automatic summarization algorithm based on keyword extraction," Computer Engineering, vol. 38, no. 3, pp. 183-186, February 2012.
[6] Su Jinshu, Zhang Bofeng, and Xu Xin, "Advances in machine learning based text categorization," Journal of Software, vol. 17, no. 9, pp. 1848-1859, 2006.
[7] Wang Yi, Bai Shi, and Wang Zhangou, "A fast kNN algorithm applied to web text categorization," Journal of the China Society for Scientific and Technical Information, vol. 26, no. 1, pp. 60-64, 2007.
[8] Yiming Yang and Jan O. Pedersen, "A comparative study on feature selection in text categorization," Proceedings of the Fourteenth International Conference on Machine Learning (ICML '97), 1997.
[9] Liang Yong-jiang, "Discussion of feature extraction for Chinese web page categorization," Master's thesis, Sun Yat-sen University, 2010.
[10] Zhou Yong, Li Youwen, and Xia Shixiong, "An improved kNN text classification algorithm based on clustering," Journal of Computers, vol. 4, no. 3, pp. 230-237, March 2009.
