
A FAST CLUSTERING-BASED FEATURE SUBSET SELECTION ALGORITHM FOR HIGH DIMENSIONAL DATA

ABSTRACT:

Feature selection involves identifying a subset of the most useful features that produces results compatible with those obtained from the original, entire set of features. A feature selection algorithm may be evaluated from both the efficiency and the effectiveness points of view. While efficiency concerns the time required to find a subset of features, effectiveness relates to the quality of the subset of features. Based on these criteria, a fast clustering-based feature selection algorithm (FAST) is proposed and experimentally evaluated in this paper.

The FAST algorithm works in two steps. In the first step, features are divided into clusters by using graph-theoretic clustering methods. In the second step, the most representative feature that is strongly related to the target classes is selected from each cluster to form a subset of features. Features in different clusters are relatively independent; the clustering-based strategy of FAST has a high probability of producing a subset of useful and independent features. To ensure the efficiency of FAST, we adopt the efficient minimum-spanning tree (MST) clustering method. The efficiency and effectiveness of the FAST algorithm are evaluated through an empirical study.
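The two-step procedure can be sketched in code. The outline below is a minimal, hypothetical illustration rather than the authors' implementation: it assumes that a pairwise feature-correlation matrix (corr) and per-feature class-relevance scores (relevance) have already been computed, for example with symmetric uncertainty; it builds the MST with Prim's algorithm and uses a simple cutThreshold parameter (an assumption of this sketch) to partition the tree into clusters.

    import java.util.*;

    /** Minimal sketch of the two-step clustering strategy described above.
     *  Assumptions (not taken from the paper): a pairwise feature-correlation
     *  matrix `corr` and a per-feature class-relevance array `relevance`
     *  have already been computed (e.g., with symmetric uncertainty). */
    public class FastSketch {

        /** Build an MST over features (edge weight = 1 - correlation), cut edges
         *  below a correlation threshold to form clusters, then keep the feature
         *  with the highest class relevance from each cluster. */
        public static List<Integer> selectFeatures(double[][] corr, double[] relevance,
                                                   double cutThreshold) {
            int n = relevance.length;

            // Prim's algorithm: MST over the complete feature graph.
            boolean[] inTree = new boolean[n];
            double[] best = new double[n];
            int[] parent = new int[n];
            Arrays.fill(best, Double.POSITIVE_INFINITY);
            Arrays.fill(parent, -1);
            best[0] = 0.0;
            for (int it = 0; it < n; it++) {
                int u = -1;
                for (int v = 0; v < n; v++)
                    if (!inTree[v] && (u == -1 || best[v] < best[u])) u = v;
                inTree[u] = true;
                for (int v = 0; v < n; v++) {
                    double w = 1.0 - corr[u][v];          // weakly correlated => long edge
                    if (!inTree[v] && w < best[v]) { best[v] = w; parent[v] = u; }
                }
            }

            // Partition: drop MST edges that connect weakly correlated features,
            // then find connected components with a small union-find structure.
            int[] comp = new int[n];
            for (int i = 0; i < n; i++) comp[i] = i;
            for (int v = 1; v < n; v++)
                if (parent[v] >= 0 && corr[v][parent[v]] >= cutThreshold)
                    union(comp, v, parent[v]);

            // Representative selection: most class-relevant feature per cluster.
            Map<Integer, Integer> repOf = new HashMap<>();
            for (int f = 0; f < n; f++) {
                int c = find(comp, f);
                Integer cur = repOf.get(c);
                if (cur == null || relevance[f] > relevance[cur]) repOf.put(c, f);
            }
            return new ArrayList<>(repOf.values());
        }

        private static int find(int[] comp, int x) {
            while (comp[x] != x) { comp[x] = comp[comp[x]]; x = comp[x]; }
            return x;
        }
        private static void union(int[] comp, int a, int b) { comp[find(comp, a)] = find(comp, b); }
    }

Cutting the weak edges of the tree leaves clusters of mutually correlated features, from each of which only the most class-relevant feature is kept.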

Extensive experiments are carried out to compare FAST and several representative feature selection algorithms, namely FCBF, ReliefF, CFS, Consist, and FOCUS-SF, with respect to four types of well-known classifiers, namely the probability-based Naive Bayes, the tree-based C4.5, the instance-based IB1, and the rule-based RIPPER, before and after feature selection. The results, on 35 publicly available real-world high-dimensional image, microarray, and text data sets, demonstrate that FAST not only produces smaller subsets of features but also improves the performance of the four types of classifiers.

EXISTING SYSTEM:

The embedded methods incorporate feature selection as a part of the training process and are usually specific to given learning algorithms, and may therefore be more efficient than the other three categories. Traditional machine learning algorithms like decision trees or artificial neural networks are examples of embedded approaches. The wrapper methods use the predictive accuracy of a predetermined learning algorithm to determine the goodness of the selected subsets; the accuracy of the learning algorithms is usually high. However, the generality of the selected features is limited and the computational complexity is large. The filter methods are independent of learning algorithms and have good generality. Their computational complexity is low, but the accuracy of the learning algorithms is not guaranteed. The hybrid methods combine filter and wrapper methods, using a filter method to reduce the search space that will be considered by the subsequent wrapper. They mainly aim to achieve the best possible performance with a particular learning algorithm while keeping time complexity similar to that of the filter methods.
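As a rough illustration of the difference between these categories, the sketch below contrasts a filter, a wrapper, and a hybrid strategy. It is a hypothetical outline, not part of the described system: the score array stands for any learner-independent relevance measure, and accuracyOf is a placeholder for training and cross-validating the predetermined learner on a candidate subset.

    import java.util.*;
    import java.util.function.*;

    /** Illustrative contrast between filter, wrapper, and hybrid selection
     *  (hypothetical interfaces, not tied to any specific library). */
    public class SelectionStrategies {

        /** Filter: rank features by a learner-independent score (e.g. correlation
         *  with the class) and keep the top k. Cheap, but accuracy not guaranteed. */
        static List<Integer> filter(double[] score, int k) {
            List<Integer> idx = new ArrayList<>();
            for (int i = 0; i < score.length; i++) idx.add(i);
            idx.sort((a, b) -> Double.compare(score[b], score[a]));
            return new ArrayList<>(idx.subList(0, Math.min(k, idx.size())));
        }

        /** Wrapper: greedily grow a subset, keeping a feature only if the
         *  predetermined learner's (cross-validated) accuracy improves.
         *  `accuracyOf` is a placeholder for training and evaluating that learner. */
        static Set<Integer> wrapper(int nFeatures, Function<Set<Integer>, Double> accuracyOf) {
            Set<Integer> selected = new HashSet<>();
            double best = 0.0;                               // baseline accuracy
            for (int f = 0; f < nFeatures; f++) {
                selected.add(f);
                double acc = accuracyOf.apply(selected);     // one model fit per trial: the costly part
                if (acc > best) best = acc; else selected.remove(f);
            }
            return selected;
        }

        /** Hybrid: use the filter to shrink the search space, then run the
         *  wrapper only on the surviving features. */
        static Set<Integer> hybrid(double[] score, int k, Function<Set<Integer>, Double> accuracyOf) {
            Set<Integer> selected = new HashSet<>();
            double best = 0.0;
            for (int f : filter(score, k)) {
                selected.add(f);
                double acc = accuracyOf.apply(selected);
                if (acc > best) best = acc; else selected.remove(f);
            }
            return selected;
        }
    }

The wrapper loop invokes the learner once per candidate change, which is where its higher computational cost comes from; the hybrid variant pays that cost only for the k features that survive the filter.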

DISADVANTAGES:
1. In the wrapper methods, the generality of the selected features is limited and the computational complexity is large.
2. In the filter methods, the computational complexity is low, but the accuracy of the learning algorithms is not guaranteed.
3. The hybrid methods depend on a filter method to reduce the search space that will be considered by the subsequent wrapper.

PROPOSED SYSTEM: Feature subset selection can be viewed as the process of identifying and removing as many irrelevant and redundant features as possible. Irrelevant features do not contribute to predictive accuracy, and redundant features do not help in obtaining a better predictor because they provide mostly information that is already present in other features. Of the many feature subset selection algorithms, some can effectively eliminate irrelevant features but fail to handle redundant features, while some others can eliminate irrelevant features while also taking care of redundant features.
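Deciding that a feature is irrelevant (weakly correlated with the class) or redundant (strongly correlated with an already-retained feature) requires a correlation measure. The passage above does not fix one, so the helper below is only an assumed example using symmetric uncertainty, SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), computed for two discretized variables; SU near 0 indicates independence and SU near 1 indicates strong correlation.

    import java.util.*;

    /** Symmetric uncertainty SU(X,Y) = 2*I(X;Y) / (H(X)+H(Y)) for two discrete
     *  variables, a common measure for judging relevance (feature vs. class)
     *  and redundancy (feature vs. feature). Illustrative helper; assumes the
     *  variables have already been discretized. */
    public class SymmetricUncertainty {

        public static double su(int[] x, int[] y) {
            double hx = entropy(counts(x));
            double hy = entropy(counts(y));
            if (hx + hy == 0.0) return 0.0;              // both variables are constant
            double ig = hx + hy - jointEntropy(x, y);    // mutual information I(X;Y)
            return 2.0 * ig / (hx + hy);                 // in [0,1]; 0 = independent
        }

        private static Map<Integer, Integer> counts(int[] v) {
            Map<Integer, Integer> c = new HashMap<>();
            for (int val : v) c.merge(val, 1, Integer::sum);
            return c;
        }

        private static double entropy(Map<Integer, Integer> counts) {
            double n = 0, h = 0;
            for (int c : counts.values()) n += c;
            for (int c : counts.values()) { double p = c / n; h -= p * (Math.log(p) / Math.log(2)); }
            return h;
        }

        private static double jointEntropy(int[] x, int[] y) {
            Map<Long, Integer> c = new HashMap<>();
            for (int i = 0; i < x.length; i++)
                c.merge(((long) x[i] << 32) | (y[i] & 0xffffffffL), 1, Integer::sum);
            double n = x.length, h = 0;
            for (int cnt : c.values()) { double p = cnt / n; h -= p * (Math.log(p) / Math.log(2)); }
            return h;
        }
    }

With such a measure, relevance can be read off SU(feature, class) and redundancy off SU(feature, feature).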

Our proposed FAST algorithm falls into the second group. Traditionally, feature subset selection research has focused on searching for relevant features. A well-known example is Relief, which weighs each feature according to its ability to discriminate instances under different targets based on a distance-based criterion function. However, Relief is ineffective at removing redundant features, as two predictive but highly correlated features are likely both to be highly weighted. ReliefF extends Relief, enabling the method to work with noisy and incomplete data sets and to deal with multiclass problems, but it still cannot identify redundant features.
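The Relief weighting idea described above can be made concrete with a small sketch. The version below is a simplified, hypothetical variant (two classes, numeric features scaled to [0, 1], Manhattan distance), not the published algorithm: each sampled instance rewards features that separate it from its nearest neighbor of the other class (near miss) and penalizes features that differ from its nearest neighbor of the same class (near hit).

    import java.util.*;

    /** Simplified Relief sketch. Assumes two classes, each with at least two
     *  instances, and numeric features already scaled to [0,1]. */
    public class ReliefSketch {

        public static double[] weights(double[][] X, int[] y, int mSamples, long seed) {
            int n = X.length, d = X[0].length;
            double[] w = new double[d];
            Random rnd = new Random(seed);
            for (int s = 0; s < mSamples; s++) {
                int i = rnd.nextInt(n);
                int hit = nearest(X, y, i, true);    // nearest instance of the same class
                int miss = nearest(X, y, i, false);  // nearest instance of another class
                for (int f = 0; f < d; f++) {
                    // differing on the miss raises the weight, differing on the hit lowers it
                    w[f] += (Math.abs(X[i][f] - X[miss][f]) - Math.abs(X[i][f] - X[hit][f])) / mSamples;
                }
            }
            return w;   // note: two redundant, highly correlated features end up with similar high weights
        }

        private static int nearest(double[][] X, int[] y, int i, boolean sameClass) {
            int best = -1; double bestDist = Double.POSITIVE_INFINITY;
            for (int j = 0; j < X.length; j++) {
                if (j == i || (y[j] == y[i]) != sameClass) continue;
                double dist = 0;
                for (int f = 0; f < X[i].length; f++) dist += Math.abs(X[i][f] - X[j][f]);
                if (dist < bestDist) { bestDist = dist; best = j; }
            }
            return best;
        }
    }

Because two highly correlated predictive features both separate hits from misses, they both receive high weights, which is exactly why Relief cannot detect redundancy.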

ADVANTAGES: Good feature subsets contain features highly correlated with (predictive of) the class, yet uncorrelated with (not predictive of) each other. The proposed algorithm efficiently and effectively deals with both irrelevant and redundant features and obtains a good feature subset. Generally, all six algorithms achieve a significant reduction of dimensionality by selecting only a small portion of the original features. The null hypothesis of the Friedman test is that all the feature selection algorithms are equivalent in terms of runtime.
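The Friedman test referred to here ranks the algorithms on each data set and compares the rank sums. The sketch below computes the standard Friedman statistic from a measurements matrix (rows are data sets, columns are algorithms, entries could be runtimes); it is a generic illustration, not project code, and a chi-squared table or statistics library would still be needed to turn the statistic into a p-value for rejecting the null hypothesis.

    import java.util.*;

    /** Friedman test statistic over a matrix of measurements (rows = data sets,
     *  columns = algorithms). Under the null hypothesis that all algorithms are
     *  equivalent, the statistic is approximately chi-squared with k-1 degrees
     *  of freedom. Ties receive averaged ranks. */
    public class FriedmanTest {

        public static double statistic(double[][] measurements) {
            int nDatasets = measurements.length, k = measurements[0].length;
            double[] rankSum = new double[k];
            for (double[] row : measurements) {
                double[] ranks = ranksOf(row);              // 1 = smallest value (e.g. fastest)
                for (int j = 0; j < k; j++) rankSum[j] += ranks[j];
            }
            double sumSq = 0;
            for (int j = 0; j < k; j++) sumSq += rankSum[j] * rankSum[j];
            // chi^2_F = 12/(N k (k+1)) * sum_j R_j^2 - 3 N (k+1)
            return 12.0 / (nDatasets * k * (k + 1)) * sumSq - 3.0 * nDatasets * (k + 1);
        }

        private static double[] ranksOf(double[] row) {
            Integer[] order = new Integer[row.length];
            for (int j = 0; j < row.length; j++) order[j] = j;
            Arrays.sort(order, Comparator.comparingDouble(j -> row[j]));
            double[] ranks = new double[row.length];
            int i = 0;
            while (i < row.length) {                        // average ranks over tied values
                int j = i;
                while (j + 1 < row.length && row[order[j + 1]] == row[order[i]]) j++;
                double avg = (i + j) / 2.0 + 1.0;
                for (int t = i; t <= j; t++) ranks[order[t]] = avg;
                i = j + 1;
            }
            return ranks;
        }
    }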

HARDWARE & SOFTWARE REQUIREMENTS:

HARDWARE REQUIREMENT:

Processor              :  Pentium IV
Speed                  :  1.1 GHz
RAM                    :  256 MB (min)
Hard Disk              :  20 GB
Floppy Drive           :  1.44 MB
Key Board              :  Standard Windows Keyboard
Mouse                  :  Two or Three Button Mouse
Monitor                :  SVGA

SOFTWARE REQUIREMENTS:

Operating System       :  Windows XP
Front End              :  Java (JDK 1.7)
Scripts                :  JavaScript
Tools                  :  Netbeans
Database               :  SQL Server or MS-Access
Database Connectivity  :  JDBC

FLOW CHART:

Data set → Irrelevant feature removal → Minimum spanning tree construction → Tree partitioning and representative feature selection → Selected features

CONCLUSION:


In this paper, we have presented a novel clustering-based feature subset selection algorithm for high dimensional data. The algorithm involves 1) removing irrelevant features, 2) constructing a minimum spanning tree from the relevant ones, and 3) partitioning the MST and selecting representative features. In the proposed algorithm, a cluster consists of features, and each cluster is treated as a single feature; thus dimensionality is drastically reduced.

We have compared the performance of the proposed algorithm with those of the five well-known feature selection algorithms FCBF, ReliefF, CFS, Consist, and FOCUS-SF on publicly available image, microarray, and text data, from the four different aspects of the proportion of selected features, runtime, classification accuracy of a given classifier, and the Win/Draw/Loss record. Generally, the proposed algorithm obtained the best proportion of selected features, the best runtime, and the best classification accuracy for Naive Bayes and RIPPER, and the second best classification accuracy for IB1; the Win/Draw/Loss records confirmed these conclusions. We also found that FAST obtains the rank of 1 for microarray data, the rank of 2 for text data, and the rank of 3 for image data in terms of the classification accuracy of the four different types of classifiers, and that CFS is a good alternative. At the same time, FCBF is a good alternative for image and text data, and Consist and FOCUS-SF are alternatives for text data. For future work, we plan to explore different types of correlation measures and to study some formal properties of the feature space.

