
IPASJ International Journal of Electrical Engineering (IIJEE)

Web Site: http://www.ipasj.org/IIJEE/IIJEE.htm


Email: editoriijee@ipasj.org
ISSN 2321-600X

A Publisher for Research Motivation........

Volume 3, Issue 8, August 2015

Research on Sorting Algorithm for Web Contents

ZHU Zhenfang
School of Information Science and Electric Engineering, Shandong Jiaotong University, 250357, Jinan, China

ABSTRACT
Text classification is a basic problem in web information processing, and the text sorting algorithm is the theoretical
basis for designing and developing the classifier used in text classification. The most typical text sorting algorithms
at present include the decision tree, Bayes and KNN algorithms. This paper discusses the theoretical basis of these
typical classification algorithms and introduces their applications by analyzing the advantages and disadvantages of
each. Finally, the paper presents an implementation process for a webpage classifier based on the C4.5 algorithm.

Keywords: text sorting algorithm, decision tree, Bayes, KNN

1. FOREWORD
The fast development of IT has brought a significant increase in the quantity of information resources of all types,
among which text data account for a large proportion. This has raised a series of problems, the most essential of which
is how to sort, analyze and use such huge volumes of text data efficiently. With the rapid development of information
technologies such as computer networks and data storage, collecting, sorting and analyzing the tremendous amount of data
on the network has become much easier. Data mining techniques of all kinds, including data classification, play a
positive role in the more specialized technical fields.
Classifying text data efficiently is essential to analyzing and processing huge amounts of text information. For the
mountains of information on the network, the traditional solution is manual classification, which has two main
weaknesses: 1. It consumes tremendous human and material resources, as well as energy. 2. The consistency of the
classified data is low: even when the people responsible for the classification have very good linguistic competence,
the results vary from person to person.
Automatic text classification techniques are therefore attractive for their convenience and accuracy, and they have
become a hot topic for researchers in different fields. Text classification starts from a collection of texts with
category labels, learns a classification model from the common features of each labeled subset, and then assigns
untagged texts to the existing categories with a suitable text sorting algorithm.
In general, the process of text classification can be divided into four steps: pre-processing the text, extracting
features, building the classifier and evaluating the classification results. At present, research on text classification
algorithms concentrates on two aspects: feature extraction and classifier construction [1]. This paper focuses on text
sorting algorithms, mainly the decision tree, Bayes and KNN algorithms, all of which are technical foundations of web
text classification.

2. ANALYSIS ON TEXT CONTENT SORTING ALGORITHM


Classification is one of the essential tasks in data mining. It has been widely used in practice with excellent results.
In essence, classification maps the provided data to existing data categories through a classification model.
Classification analysis can be described as follows: find the distinguishing features of a group of sets; use these
features to tag and classify the original sets; express the features of this group of sets in a certain form; then, with
the help of the learned classification model, assign each new unseen data item to one of the previously categorized
sets [5]. Generally, this process is realized in two steps. First, a classifier is built by applying a sorting algorithm
to the training set. Second, at the test phase, the established classifier categorizes the unseen data, using the
attribute values of each data item to assign it to a category.
In essence, text classification is a mapping process. By training on a set of categorized texts, it obtains the mapping
rules between the known categories and unseen texts, which amounts to calculating the similarity between the known texts
and an unseen text. The established classifier can then tell which categories the new text belongs to.
In text classification, constructing the classifier according to a text sorting algorithm is an essential step. The text
sorting algorithm is the theoretical basis for realizing the classifier and is also the focus of current research. Among
the many text sorting algorithms available, the most representative are the KNN algorithm, the Bayes algorithm, the
decision tree algorithm and SVM (support vector machine). All of them are widely applied in various data mining areas,
and many optimized variants of them have also been put into practical use.

3. WEB CONTENT SORTING ALGORITHM


The feature properties of each web text can be used to predict its category. This section briefly introduces several
typical and highly efficient text sorting algorithms.
3.1 Decision Tree Algorithm
The decision tree algorithm is a typical sorting algorithm that is quite useful for building classifiers. A decision tree
is a tree-like graph with many nodes. Each internal node tests the value of an attribute, and the node branches further
according to the outcome of that test. A decision tree thus categorizes text data according to a series of rules: each
path from the root node to a leaf node represents one classification rule.
The typical decision tree algorithms are ID3 and C4.5. The two differ in the attribute selection method used to
construct the tree, classification speed, precision, the kinds of data they can process, and so on. ID3 builds the tree
top-down by repeatedly partitioning the samples. Its core idea is to select the test attribute at each node of the
decision tree by computing information gain, so that each test at a non-leaf node reduces the uncertainty of the
classification as much as possible. In short, ID3 selects attributes and classifies data through information gain [4].
C4.5 is used more often: it inherits the advantages of ID3 while remedying its weaknesses in practice, so it is more
widely applicable and more accurate. For instance, C4.5 selects the splitting attribute by info gain ratio, which removes
the bias of ID3's information gain criterion toward attributes with many distinct values.
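The difference between the two selection criteria can be illustrated with a minimal Python sketch. The toy data, the attribute names (page_id, is_html) and the function names are assumptions made for illustration only; they are not taken from the paper's dataset.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of values."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def split_by(values, labels):
    """Group the class labels by the attribute value of each sample."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    return list(groups.values())

def info_gain(values, labels):
    """ID3 criterion: entropy reduction obtained by splitting on the attribute."""
    total = len(labels)
    remainder = sum(len(g) / total * entropy(g) for g in split_by(values, labels))
    return entropy(labels) - remainder

def gain_ratio(values, labels):
    """C4.5 criterion: information gain divided by the split information."""
    split_info = entropy(values)  # entropy of the attribute's own value distribution
    return info_gain(values, labels) / split_info if split_info > 0 else 0.0

# Hypothetical toy data: class tags, a unique page ID and a two-valued attribute.
labels = ["D", "D", "L", "L", "D", "L"]
page_id = ["p1", "p2", "p3", "p4", "p5", "p6"]  # one distinct value per sample
is_html = ["Y", "Y", "N", "N", "Y", "Y"]

for name, column in [("page_id", page_id), ("is_html", is_html)]:
    print(name, round(info_gain(column, labels), 3), round(gain_ratio(column, labels), 3))
```

On this toy data, information gain ranks the uninformative page_id attribute first because it has one value per sample, while the gain ratio ranks is_html first, which is the bias correction described above.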
3.2 Bayes Algorithm
The Bayes algorithm is effective at combining the information in text data with causal dependencies, prior probabilities
and conditional probabilities. It is built on Bayes' theorem and the MAP (maximum a posteriori) hypothesis, and it has a
relatively simple structure and excellent performance.
Bayes classification is the general term for a class of classification algorithms based on Bayes' theorem, which use the
Bayes formula together with prior and conditional probabilities to make their calculations. The basic principle of the
Bayes algorithm is to predict the probability of future events from the probabilities of events that have already
occurred. Chart 1 shows the process of the Bayes algorithm:

Chart 1 Basic Process of Bayes Algorithm
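
As an illustration of how prior and conditional probabilities combine under the MAP hypothesis, the following is a minimal naive Bayes sketch in Python. The word-independence assumption, the add-one (Laplace) smoothing, the toy documents and all function names are assumptions of this sketch; the paper does not specify which Bayes variant it has in mind.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs):
    """docs: list of (list_of_words, class_label). Returns the counts needed for MAP."""
    class_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in docs:
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab

def classify(words, model):
    """Return the class with the highest posterior (MAP), computed in log space."""
    class_counts, word_counts, vocab = model
    n_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / n_docs)  # log prior probability of the class
        total = sum(word_counts[label].values())
        for w in words:
            # Conditional probability with add-one smoothing (an assumption of this sketch)
            score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training texts, already split into words.
docs = [
    (["match", "team", "goal"], "sports"),
    (["team", "coach", "win"], "sports"),
    (["stock", "market", "bank"], "finance"),
    (["bank", "rate", "market"], "finance"),
]
model = train_naive_bayes(docs)
print(classify(["team", "goal", "win"], model))
print(classify(["bank", "rate"], model))
```

Working with log probabilities avoids numerical underflow when a text contains many words; the ranking of the classes is unchanged.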


3.3 KNN Algorithm
The KNN algorithm, also known as the k-nearest neighbor algorithm, is one of the simpler classification algorithms. It
is a non-parametric text classification algorithm.
KNN classifies new data by finding the K items in the training set that are most similar to it. For a new text to be
classified, the algorithm finds the K texts in the training text set that are most similar to the new text and then
classifies the new text according to the classes of these K texts. In other words, if the majority of the K most similar
texts belong to one class (this can be determined by calculating weights), the new text is assigned to that class as
well. Note that KNN assumes all K selected training texts are correctly categorized, and that the final decision depends
only on the main class of the nearest one or several texts.
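
The procedure above can be sketched in a few lines of Python. The paper does not say how similarity or the voting weights are computed, so the cosine similarity on word counts, the similarity-weighted vote, the toy training texts and the function names below are all assumptions of this sketch.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def knn_classify(words, training, k=3):
    """training: list of (list_of_words, label). Similarity-weighted vote of the k nearest texts."""
    query = Counter(words)
    scored = sorted(((cosine(query, Counter(doc)), label) for doc, label in training),
                    key=lambda pair: pair[0], reverse=True)
    votes = Counter()
    for sim, label in scored[:k]:
        votes[label] += sim  # each neighbor votes with its similarity as weight
    return votes.most_common(1)[0][0]

# Hypothetical categorized training texts.
training = [
    (["match", "team", "goal"], "sports"),
    (["team", "coach", "win"], "sports"),
    (["stock", "market", "bank"], "finance"),
    (["bank", "rate", "market"], "finance"),
]
print(knn_classify(["team", "goal", "win"], training, k=3))
```

Because KNN has no training phase, every classification compares the new text with the whole training set, so the cost grows with the size of that set.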

4. WEBPAGE CLASSIFIER BASED ON C4.5 ALGORITHM


From the above analysis, we can see that the algorithms in wide use today each have their own advantages, disadvantages
and areas of application. This section concentrates on the analysis and implementation of C4.5, a decision tree
algorithm, for text classification.
4.1 Webpage Classification Problems and Feature Expressions
Many web applications, such as web information search and web information extraction, need to judge the type of a
webpage. We divide web pages into two types: Link Page and Detail Page. A Link Page contains plenty of links and is
therefore suited to crawling by a crawler. A Detail Page consists mainly of text and is suited to web text extraction.
However, because current Detail Pages are mainly for commercial use, they usually contain many links that are irrelevant
to the topic, so more complicated situations must be taken into account. Based on our observations, we stipulate the
following rules to describe the features of the two webpage types and to judge the type of a webpage.
1. Page Type (two-valued attribute): the value Y or N indicates whether the URL of the webpage ends with html, htm or
shtm. Generally, a Detail Page has the value Y.
2. Text Degree (continuous value): Text Degree = log2(Plain Texts / K), where Plain Texts is the amount of text
remaining after the link words are deleted from the entire document and K is an empirical constant: a webpage with
Plain Texts < K has a low Text Degree. The higher the Text Degree (the more plain text), the more likely the webpage
is a Detail Page.
3. Link Degree (continuous value): Link Degree = log2(Total Words / Linknums), where Total Words is the amount of text
in the whole document and Linknums is the number of links in the document. The lower the Link Degree (Total Words
becomes smaller relative to the number of links), the more likely the webpage is a Link Page.
4. Aggregation (continuous value): the degree to which the words of a webpage are aggregated. A Detail Page typically
features blocks of text, so the larger the Aggregation, the more likely the webpage is a Detail Page. This feature is
obtained by calculating the text blocks of the entire document.
5. Catalog (L/D): the class tag. L: Link Page, D: Detail Page. In the training set, the tags are assigned manually.
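The first three features can be computed directly from the formulas above, as in the following minimal Python sketch. The value K = 100, the guards against taking the logarithm of zero, the example URLs and the function names are assumptions of this sketch; the paper gives no value for K and does not describe how Aggregation is calculated, so that feature is omitted here.

```python
import math
from urllib.parse import urlparse

K = 100  # hypothetical value; the paper only says K is an empirical constant

def page_type(url):
    """'Y' if the URL path ends with html, htm or shtm, otherwise 'N'."""
    path = urlparse(url).path.lower()
    return "Y" if path.endswith(("html", "htm", "shtm")) else "N"

def text_degree(plain_words, k=K):
    """log2(plain_words / k); max(..., 1) avoids log2(0) and is our own guard."""
    return math.log2(max(plain_words, 1) / k)

def link_degree(total_words, link_count):
    """log2(total_words / link_count), with the same guard on both counts."""
    return math.log2(max(total_words, 1) / max(link_count, 1))

# Hypothetical pages: a text-heavy article and a link-heavy index page.
print(page_type("http://example.com/news/article.html"), text_degree(800), link_degree(1000, 10))
print(page_type("http://example.com/list/"), text_degree(50), link_degree(1000, 250))
```

With K = 100, a page with fewer than 100 plain-text words gets a negative Text Degree, which matches the statement that such pages have a low text degree.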
Based on the above features, we carry out feature expression and extraction for all the web pages in our dataset. The
dataset is a collection of web pages we gathered from 10 well-known search engines such as Sina and Sohu. It consists of
1625 web pages, of which 830 are Link Pages and 795 are Detail Pages. Table 1 shows a segment of the feature data
extracted from the dataset.
Table 1 Feature Dataset Segment


4.2 Realization of C4.5 Algorithm


Step 1: Create the root node N for the sample set S.
Step 2: If the samples in S all belong to the same class C, return N as a leaf node tagged with class C.
Step 3: If the attribute list is empty, return N as a leaf node tagged with the most frequent class in S.
Step 4: Calculate the info gain ratio of each attribute in the attribute list; the attribute with the highest info gain
ratio becomes the test attribute of N.
Step 5: For each new child node generated from node N, if the sample subset corresponding to the child node is empty,
make it a leaf node tagged with the most frequent class in S; otherwise, repeat the procedure from Step 1 at this child
node and continue splitting.
Step 6: Calculate the classification error at each node and prune the erroneous branches.
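
Steps 1 to 5 can be sketched as a short recursive procedure in Python. This is a simplified illustration under stated assumptions, not the classifier used in the paper: it handles only discrete attribute values (the continuous features of Section 4.1 would first need thresholds, which C4.5 normally chooses itself), it omits the pruning of Step 6, and the toy rows, attribute names and function names are hypothetical.

```python
import math
from collections import Counter

def entropy(items):
    """Shannon entropy (in bits) of a list of values."""
    total = len(items)
    return -sum(n / total * math.log2(n / total) for n in Counter(items).values())

def gain_ratio(rows, labels, attr):
    """Info gain ratio obtained by splitting the samples on attribute `attr`."""
    values = [row[attr] for row in rows]
    remainder = 0.0
    for v in set(values):
        subset = [y for x, y in zip(values, labels) if x == v]
        remainder += len(subset) / len(labels) * entropy(subset)
    split_info = entropy(values)
    return (entropy(labels) - remainder) / split_info if split_info > 0 else 0.0

def build_tree(rows, labels, attrs):
    """Return a class label (leaf) or a dict describing an internal node."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not attrs:                        # Steps 2 and 3
        return majority
    best = max(attrs, key=lambda a: gain_ratio(rows, labels, a))  # Step 4
    node = {"attr": best, "default": majority, "children": {}}
    remaining = [a for a in attrs if a != best]
    for v in sorted(set(row[best] for row in rows)):              # Step 5
        keep = [i for i, row in enumerate(rows) if row[best] == v]
        node["children"][v] = build_tree([rows[i] for i in keep],
                                         [labels[i] for i in keep], remaining)
    return node

def predict(tree, row):
    """Follow the tests from the root until a leaf (class label) is reached."""
    while isinstance(tree, dict):
        tree = tree["children"].get(row[tree["attr"]], tree["default"])
    return tree

# Hypothetical, already discretized samples (L: Link Page, D: Detail Page).
rows = [
    {"is_html": "Y", "text": "high", "links": "few"},
    {"is_html": "Y", "text": "high", "links": "many"},
    {"is_html": "N", "text": "low", "links": "many"},
    {"is_html": "N", "text": "low", "links": "few"},
    {"is_html": "Y", "text": "low", "links": "many"},
    {"is_html": "N", "text": "high", "links": "few"},
]
labels = ["D", "D", "L", "L", "L", "D"]
tree = build_tree(rows, labels, ["is_html", "text", "links"])
print(tree["attr"], predict(tree, {"is_html": "N", "text": "high", "links": "many"}))
```

Since the sketch only branches on values that occur in the samples, an empty subset never arises during training; the "default" entry plays the role of the empty-subset leaf in Step 5 when an unseen attribute value is met at prediction time.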
4.3 Apply C4.5 Algorithm and Test Result
After fixing the feature expressions of the webpage, we designed a webpage classifier based on the C4.5 decision tree
algorithm described in Section 4.2. First, the decision tree is constructed as shown in Chart 2:

Chart 2 Realization Process of Decision Tree


With the 5 rules of Section 4.1 as the basis for the nodes and the depth of the tree set to 4, we carried out
classification training and testing. The procedure is as follows: divide the entire dataset into 5 equal, independent
subsets and run 5 rounds of training and testing. In each round, use 4 subsets as the training set and the remaining
subset as the test set, rotating the test subset so that every subset serves as the test set exactly once. Finally, take
the average precision, recall and F-measure over the 5 rounds [11].
The test results are:
Precision=0.8722
Recall=0.9336
F-measure=0.9018
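
The evaluation protocol can be sketched in Python as follows. The interleaved way of forming the 5 subsets, the function names and the toy tags are assumptions of this sketch; the paper does not state how the subsets were drawn or which class was treated as the positive class.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]  # interleaved, no shuffling
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

def precision_recall_f(y_true, y_pred, positive):
    """Precision, recall and F-measure for the class treated as positive."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

# The 1625 pages of the dataset split into 5 subsets of 325 pages each.
folds = list(k_fold_indices(1625, k=5))
print([len(test) for _, test in folds])

# Hypothetical true and predicted tags for one small test set.
y_true = ["D", "D", "L", "L", "D"]
y_pred = ["D", "L", "L", "D", "D"]
print(precision_recall_f(y_true, y_pred, positive="D"))
```

As a consistency check, applying F = 2PR / (P + R) to the reported precision and recall gives about 0.9019, close to the reported F-measure of 0.9018; a small difference is expected because the paper averages each measure over the 5 rounds.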

5. CONCLUSION
The general principle of a text sorting algorithm is to use the features of the data in the training text set to find or
construct a vector model or hypothesis in space that determines the class of a given text. The goal is to make the
classification results produced by the sorting algorithm match the actual classes of the texts as closely as possible.
Text sorting algorithms play an essential role in automatic text classification systems. However, they also have many
shortcomings, caused by characteristics of text such as polysemy, multi-words and ambiguity. In future studies and
applications, more effort is needed to improve text sorting algorithms so that the classification results come closer to
the actual categories of the texts.

6. ACKNOWLEDGMENTS
This work was supported by the National Natural Science Foundation (61373148); the National Social Science Fund
(12BXW040); the Shandong Province Natural Science Foundation (ZR2012FM038, ZR2011FM030); the Shandong Province
Outstanding Young Scientist Award Fund (BS2013DX033); and the Science Foundation of Ministry of Education of China
(14YJC860042).

References
[1] Zhao Yan, Zhou Bin & Chen Ruhua, 2013, Study on Text Sorting Algorithms [J], Software Guide, 12 (10).
[2] Tao Wei, Ma Jiming & Zhang Suzhi, 2009, Analysis on Decision Tree Algorithm and Its Application [J], Computer
Knowledge and Technology, 5 (13).
[3] Mao Guojun, Wang Shi & Duan Lijuan, 2005, Principle and Algorithm of Data Mining [M], Beijing, Tsinghua
University Press.
[4] Ma Zhiyuan & Cao Baoxiang, 2013, Application of the Improved Decision Tree Algorithm in Invasion Detect [J],
Computer Technology and Development.
[5] Chen Hongyu, 2009, Study on the Bayes Algorithm in Data Mining [J], Disc Technology.
[6] Zhang Huazhong, 2013, Study on Bayes Algorithm [J], Digital Technology and Application.
[7] Wang Dafu, 2009, Study on the Email Filter System based on Bayes Algorithm [J], Computer & Information
Technology.
[8] Zhang Ning, Jia Ziyan & Shi Zhongzhi, 2005, Text Classification Based on KNN Algorithm [J], Computer
Engineering, 31 (8).
[9] Huang Wei, 2011, Application of KNN in Enterprise Information Search [J], Information Technology, 6.
[10] Cao Wei & Zhang Naizhou, 2010, Webpage Classification Algorithm based on C4.5 Decision Tree [J], Computer
System & Application, 19 (10).

AUTHOR
ZHU Zhenfang, PhD, lecturer, was born in 1980 in Linyi City, Shandong Province. He obtained his Ph.D. in management
engineering and industrial engineering at Shandong Normal University in 2012. His main research fields include network
information security, network information filtering and information processing. He is currently a lecturer at Shandong
Jiaotong University and has published more than 30 papers over the years.
