Department of CSE
M.I.E.T Engineering College, Tiruchirappalli-7, India
asirantony@gmail.com

Research Coordinator
K.L.N College of Information Technology, Madurai-15, India
app_s@yahoo.com

E. Jebamalar Leavline
Department of ECE
Anna University Chennai, BIT Campus, Tiruchirappalli-24, India
jebilee@gmail.com
I. INTRODUCTION
Dimensionality reduction for classification has attracted significant attention in both pattern recognition and machine learning. A high-dimensional data space may increase the computational cost and reduce the prediction accuracy of classifiers [1]-[3]. Classification is a supervised learning process that builds a classifier by learning from a training data set [6], and it is observed that when the number of features in the training data exceeds a particular range for a given sample space, the accuracy of the classifier decreases [1], [2]. Dimensionality reduction can be achieved in two ways: feature extraction and feature selection [5], [3].
In feature extraction [12], [13], [26], the original features in the measurement space are initially transformed into a new, lower-dimensional space, so that each new feature is constructed as a combination of the original features.
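As a rough illustration of the difference between the two routes (a minimal scikit-learn sketch, not the paper's experimental code; the dataset and the mutual-information scorer are arbitrary choices here), feature extraction produces new composite features while feature selection keeps a subset of the originals unchanged:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_iris(return_X_y=True)

# Feature extraction: PCA replaces the 4 original features with 2 new
# axes, each a linear combination of all the originals.
X_extracted = PCA(n_components=2).fit_transform(X)

# Feature selection: keep the 2 original features that score highest
# against the class; the surviving columns are unchanged.
X_selected = SelectKBest(mutual_info_classif, k=2).fit_transform(X, y)

print(X_extracted.shape, X_selected.shape)  # (150, 2) (150, 2)
```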
A. CFS
In CFS (Correlation-based Feature Selection), subsets of features are evaluated rather than individual features [38], [39], [43]. The kernel of this heuristic principle is that it evaluates the effectiveness of a feature subset based on the degree of inter-correlation among the features and on their ability to predict the class. The goodness of a feature subset is determined by the heuristic merit in equation (2): a good subset consists of features that have high correlation with the class and low inter-correlation with each other.
$$\mathrm{Merit}_S = \frac{k\,\overline{r}_{cf}}{\sqrt{k + k(k-1)\,\overline{r}_{ff}}} \qquad (2)$$

where $k$ is the number of features in the subset $S$, $\overline{r}_{cf}$ is the mean feature-class correlation, and $\overline{r}_{ff}$ is the mean feature-feature inter-correlation. The correlations are measured by the symmetrical uncertainty (SU):

$$SU = 2.0 \times \frac{H(X) + H(Y) - H(X,Y)}{H(X) + H(Y)} \qquad (3)$$

where $H(\cdot)$ denotes entropy.
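A minimal sketch of these two quantities, assuming discrete feature values (the function names are mine, and using SU as the correlation measure in the merit is an assumption, though a common one for CFS):

```python
import numpy as np
from collections import Counter

def entropy(values):
    """Shannon entropy H(X), in bits, of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * np.log2(c / n) for c in Counter(values).values())

def joint_entropy(x, y):
    """Joint entropy H(X, Y): the entropy of the paired values."""
    return entropy(list(zip(x, y)))

def symmetrical_uncertainty(x, y):
    """SU of equation (3); 0 for independent variables, 1 for identical."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0
    return 2.0 * (hx + hy - joint_entropy(x, y)) / (hx + hy)

def cfs_merit(features, target):
    """Merit of a feature subset per equation (2); `features` is a list of
    columns, one sequence of discrete values per feature in the subset."""
    k = len(features)
    r_cf = np.mean([symmetrical_uncertainty(f, target) for f in features])
    if k == 1:
        return float(r_cf)  # no pairwise inter-correlation term
    r_ff = np.mean([symmetrical_uncertainty(features[i], features[j])
                    for i in range(k) for j in range(i + 1, k)])
    return float(k * r_cf / np.sqrt(k + k * (k - 1) * r_ff))
```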
B. Consistency
In consistency-based subset evaluation, many approaches use class consistency as the evaluation metric for selecting a feature subset [40], [41]. These methods look for combinations of features whose values divide the data into subsets containing a strong single-class majority. Usually the search is biased in favour of small feature subsets with high class consistency. Our consistency-based subset evaluator uses the consistency metric of [41]:
$$\mathrm{Consistency}_S = 1 - \frac{\sum_{i=0}^{J} \left(|D_i| - |M_i|\right)}{N} \qquad (4)$$

where $S$ is the feature subset, $J$ is the number of distinct combinations of values taken by the features in $S$, $|D_i|$ is the number of instances carrying the $i$-th combination, $|M_i|$ is the number of those instances that belong to the majority class of the combination, and $N$ is the total number of instances.
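A minimal sketch of equation (4), assuming discrete feature values (function and variable names are mine):

```python
from collections import Counter, defaultdict

def consistency(feature_rows, labels):
    """Consistency of a feature subset per equation (4): group instances by
    their combination of feature values, count how many fall outside the
    majority class of their group, and subtract that rate from 1."""
    groups = defaultdict(list)
    for row, label in zip(feature_rows, labels):
        groups[tuple(row)].append(label)
    inconsistent = sum(len(g) - max(Counter(g).values())
                       for g in groups.values())
    return 1.0 - inconsistent / len(labels)

# Every distinct pattern maps to a single class, so the subset scores 1.0.
rows = [(1, 0), (1, 0), (0, 1), (0, 1), (1, 1)]
labels = ['y', 'y', 'n', 'n', 'n']
print(consistency(rows, labels))  # 1.0
```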
C. Filter
This approach is most suitable for reducing the dimensionality of the data, rather than for training a classifier directly. The subset selection combines the advantages of the Cfs subset evaluator and the Greedy Stepwise search algorithm, and it reduces the dimensionality of high-dimensional data as much as possible [44].
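As a rough sketch of the Greedy Stepwise search strategy (reusing the cfs_merit function from the CFS sketch above; this is not the WEKA implementation, and the "stop when no addition improves the merit" rule is an assumption):

```python
def greedy_stepwise(columns, target, merit=cfs_merit):
    """Greedy forward search: repeatedly add the feature whose inclusion
    yields the highest subset merit; stop when no addition improves it."""
    selected, best = [], 0.0
    remaining = list(range(len(columns)))
    while remaining:
        score, j = max((merit([columns[i] for i in selected + [j]], target), j)
                       for j in remaining)
        if score <= best:
            break
        best = score
        selected.append(j)
        remaining.remove(j)
    return selected  # indices of the retained features
```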
D. Chi-Squared
This feature ranker uses the chi-square ($\chi^2$) test [50]. It estimates the worth of a feature by computing the value of the chi-squared statistic with respect to the class. The initial hypothesis $H_0$ is the assumption that the feature and the class are unrelated, and it is tested by the chi-squared formula in equation (5):
$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \qquad (5)$$

where $O_{ij}$ and $E_{ij}$ are the observed and the expected (under $H_0$) numbers of instances having the $i$-th feature value and the $j$-th class, $r$ is the number of distinct feature values, and $c$ is the number of classes.
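A minimal sketch of equation (5) for one discrete feature (names are mine; scipy's chi2_contingency or scikit-learn's chi2 could serve the same purpose):

```python
import numpy as np

def chi_squared_score(feature, labels):
    """Chi-squared statistic of equation (5): build the observed
    contingency table O and the table E expected under independence (H0),
    then sum the normalized squared deviations."""
    f_vals, f_idx = np.unique(feature, return_inverse=True)
    c_vals, c_idx = np.unique(labels, return_inverse=True)
    observed = np.zeros((len(f_vals), len(c_vals)))
    np.add.at(observed, (f_idx, c_idx), 1)
    expected = (observed.sum(axis=1, keepdims=True)
                * observed.sum(axis=0) / observed.sum())
    return float(((observed - expected) ** 2 / expected).sum())

# A perfectly class-aligned feature on 4 instances scores 4.0.
print(chi_squared_score(['p', 'p', 'q', 'q'], ['a', 'a', 'b', 'b']))
```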
E. Information Gain
This is a ranker-based feature selection measure that uses information theory. Given that entropy is a criterion of impurity of a training set S, a measure is defined that reflects the additional information about Y provided by X, i.e., the amount by which the entropy of Y decreases [51]. The Information Gain (Info-Gain) measure is formulated in equation (6):

$$\text{Info-Gain} = H(Y) - H(Y|X) = H(X) - H(X|Y) \qquad (6)$$
The information gained about Y after observing X is
equal to the information gained about X after observing Y. A
weakness of the IG criterion is that it is biased in favor of
features with more values even when they are not more
informative.
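A minimal sketch of equation (6), reusing the entropy and joint_entropy helpers from the CFS sketch above together with the identity H(Y) - H(Y|X) = H(X) + H(Y) - H(X, Y); the bias toward many-valued features shows up in that a useless ID-like column ties with a genuinely predictive two-valued feature:

```python
def information_gain(x, y):
    """Info-Gain between X and Y per equation (6), computed via
    H(X) + H(Y) - H(X, Y)."""
    return entropy(x) + entropy(y) - joint_entropy(x, y)

y = ['a', 'a', 'b', 'b']
ids = [1, 2, 3, 4]             # unique per instance: carries no real signal
useful = ['p', 'p', 'q', 'q']  # two values, perfectly predictive
print(information_gain(ids, y), information_gain(useful, y))  # 1.0 1.0
```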
III. PROPOSED WORK
TABLE I. SUMMARY OF THE DATASETS

S.No  Dataset          Instances  Features  Classes
 1    Contact Lenses          24         5        3
 2    Diabetes               768         9        2
 3    Glass                  214        10        7
 4    Ionosphere             351        35        2
 5    Iris                   150         5        3
 6    Labor                   57        17        2
 7    Soybean                683        36       19
 8    Vote                   435        17        2
 9    Weather                 14         5        2
10    Car                   1728         7        4
TABLE II. NUMBER OF FEATURES SELECTED BY THE FEATURE SUBSET EVALUATORS AND FEATURE RANKERS

                       Feature Subset Evaluators    Feature Rankers
S.No  Dataset          Cfs  Consistency  Filtered   Chi-Squared  Information
                                                    Ranker       Gain Ranker
 1    Contact Lenses    –        –          –            –            –
 2    Diabetes          4        4          3            3            3
 3    Glass             8        7          5            7            8
 4    Ionosphere       14        7         14           25           25
 5    Iris              2        2          2            2            2
 6    Labor             7        4          4            3            3
 7    Soybean          22       13          7           10           10
 8    Vote              4       10          1            4            3
 9    Weather           2        2          1            1            1
10    Car               1        6          1            2            1
TABLE III. SUMMARY OF CLASSIFIERS ACCURACY WITH RESPECT TO THE FEATURE SUBSET EVALUATORS AND RANKERS

Accuracy (%) of NB on the reduced feature subsets:

S.No  Dataset            I      II     III     IV      V
 1    Contact Lenses   70.83  70.83  70.83  87.50  87.50
 2    Diabetes         77.47  77.47  76.43  76.43  76.43
 3    Glass            47.66  44.39  44.39  49.06  47.66
 4    Ionosphere       92.02  87.17  92.02  83.47  83.19
 5    Iris             96.00  96.00  96.00  96.00  96.00
 6    Labor            91.22  87.71  87.71  84.21  84.21
 7    Soybean          87.11  81.69  83.30  81.25  82.72
 8    Vote             96.09  92.41  95.63  92.87  94.71
 9    Weather          57.14  57.14  50.00  50.00  50.00
10    Car              70.02  85.53  70.02  76.85  76.85
      Average          78.55  78.03  76.63  77.76  77.92

Accuracy (%) of the second classifier on the reduced feature subsets:

S.No  Dataset            I      II     III     IV      V
 1    Contact Lenses   70.83  83.33  70.83  87.50  87.50
 2    Diabetes         74.86  74.86  74.60  74.60  73.04
 3    Glass            68.69  70.09  65.88  69.62  68.69
 4    Ionosphere       90.59  87.46  90.59  91.45  91.16
 5    Iris             96.00  96.00  96.00  96.00  96.00
 6    Labor            77.19  82.45  80.70  80.70  80.70
 7    Soybean          85.65  83.74  82.86  80.81  82.86
 8    Vote             96.09  96.32  95.63  95.63  95.63
 9    Weather          42.85  42.85  50.00  50.00  50.00
10    Car              70.02  92.36  70.02  76.56  76.56
      Average          77.27  80.94  77.71  80.28  80.21

Accuracy (%) of the third classifier on the reduced feature subsets:

S.No  Dataset            I      II     III     IV      V
 1    Contact Lenses   66.66  62.50  66.66  75.00  75.00
 2    Diabetes         68.35  68.35  70.18  70.18  68.48
 3    Glass            71.02  70.09  71.02  77.57  71.02
 4    Ionosphere       88.88  87.74  88.88  88.03  88.03
 5    Iris             96.66  96.66  96.66  96.66  96.66
 6    Labor            84.21  87.71  87.71  80.70  80.70
 7    Soybean          83.89  76.57  79.94  79.20  78.91
 8    Vote             94.02  93.33  91.72  93.79  93.10
 9    Weather          78.57  78.57  64.28  64.28  64.28
10    Car              66.84  77.25  66.84  73.49  73.49
      Average          79.91  79.87  78.38  79.89  78.96

I- Cfs, II- Consistency, III- Filter, IV- Chi-square Ranker, V- Information Gain Ranker
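A hedged sketch of how figures like those in Table III could be reproduced (a scikit-learn approximation, not the authors' setup, which appears to use WEKA-style evaluators; the dataset, the chi-squared scorer, k = 2, and the 10-fold protocol are all assumptions here):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline

X, y = load_iris(return_X_y=True)

# Reduce to k = 2 features with the chi-squared ranker (method IV), then
# estimate Naive Bayes accuracy with 10-fold cross-validation.
pipe = make_pipeline(SelectKBest(chi2, k=2), GaussianNB())
accuracy = cross_val_score(pipe, X, y, cv=10).mean()
print(f"NB accuracy on the chi-squared-reduced Iris data: {accuracy:.2%}")
```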
V. CONCLUSION