
CLUSTERING ALGORITHM FOR TEXT DATA MINING

B. H. Chandrashekar 1, Dr. G. Shoba 2
1 Lecturer, Department of Master of Computer Applications, R.V. College of Engineering, Bangalore 560059
2 Professor, Department of Computer Science and Engineering, R.V. College of Engineering, Bangalore 560059
chandrashekarbh@gmail.com, shobatilak@rediffmail.com

ABSTRACT

Clustering is a technique in which data is grouped into sets of similar objects. Clustering document (text) datasets is often hampered by the nuisance of high dimensionality: most clustering algorithms lose some of their efficiency on high-dimensional datasets. This paper presents a method for clustering high-dimensional datasets. The key idea is to reduce a high-dimensional dataset to a lower-dimensional one using Principal Component Analysis, and then to cluster it with a k-means algorithm. The k-means algorithm is slightly modified to use the median value as the centroid for clustering, in order to obtain higher performance. The performance of our new approach to attribute selection is evaluated on several high-dimensional datasets. Since the number of dimensions used is low, the datasets can be displayed for efficient viewing and the results can be interpreted easily.

Keywords: attribute, k-means, clustering, data mining

1. Introduction

In the current scenario, browsing for exact information has become a very tedious job, as the number of electronic documents on the Internet has grown gargantuan and is still growing. Knowledge discovery in databases is then not an easy task: identifying valid, relevant and understandable patterns in data. Data mining is one step of the Knowledge Discovery in Databases process [1]. Data mining [2] is the principle of sorting through large amounts of data and picking out relevant information. Clustering is useful in a wide range of data-analysis fields, including data mining, document retrieval, image segmentation, and pattern classification. This paper focuses on clustering high-dimensional datasets, one of the most useful tasks in data mining. The goal of clustering is that the objects in a group should be related to one another and different from the objects in other groups. K-means is one of the simplest unsupervised learning algorithms that solves the well-known clustering problem [3]. However, clustering high-dimensional datasets has proven to be very difficult. A common approach is to reduce the irrelevant redundancy behind the input data [4]; Principal Component Analysis (PCA) is one method that fulfills this demand.

Basically, document classification can be defined as the content-based assignment of one or more predefined categories or topics to documents, i.e., given a collection of words, determine the best-fit category for that collection. The goal of all document classifiers is to assign documents to one or more content categories such as technology, entertainment, sports, politics, etc. Classification of any type of text document is possible, including traditional documents such as memos and reports, as well as e-mails, web pages, etc.

In Section 2, we discuss document preprocessing, which includes document parsing, stemming, stop-word removal and dimensionality reduction. Section 3 describes the problem with the existing k-means algorithm and the modified k-means algorithm. Section 4 describes data clustering and PCA-based data clustering. Section 5 describes the tests and results.
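The modified k-means of the abstract replaces the mean with the median when recomputing cluster centres. The paper gives no pseudocode, so the sketch below is a minimal illustration under assumptions of our own: a simple deterministic initialisation (the first k points) and convergence when the centroids stop moving.

```python
import numpy as np

def kmedians(X, k, n_iter=100):
    """K-means variant that uses the per-coordinate median (rather than
    the mean) as the cluster centroid, as the paper proposes."""
    # Deterministic initialisation with the first k points; a random or
    # k-means++-style seeding would normally be used (an assumption here).
    centroids = X[:k].astype(float).copy()
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid
        # under Euclidean distance.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: the median, not the mean, of each cluster's members.
        new_centroids = np.array([
            np.median(X[labels == j], axis=0) if np.any(labels == j)
            else centroids[j]                    # leave empty clusters alone
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):  # converged
            break
        centroids = new_centroids
    return labels, centroids
```

The median makes the centroid update robust to outliers within a cluster, which is the source of the performance gain the abstract claims.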

SECTION II
Document Preprocessing

Data preprocessing is a very important and essential phase in effective document classification. The first part of feature extraction is preprocessing the lexicon, and it involves removal of stop words, stemming and term weighting [5].

2.1 Stop Words

This is the first step in preprocessing, and it generates a list of terms that describes the document satisfactorily. The document is parsed through to find the list of all its words. The next step is to reduce the size of the list created by the parsing process, generally using stop-word removal and stemming. Stop-word removal accounts for 20% to 30% of the total word count, while the stemming process reduces the number of terms in the document. Both processes help improve the effectiveness and efficiency of text processing, as they reduce the size of the indexing file.

Stop words are removed from each document by comparing it against the stop-word list. This process reduces the number of words in the document significantly, since these stop words are insignificant as search keywords. Stop words can be a pre-specified list of words, or they can depend on the context of the corpus.

2.2 Stemming

The next process in phase one, after stop-word removal, is stemming. Stemming is a process of linguistic normalization in which the variant forms of a word are reduced to a common form. For example, the word "connect" has various forms such as connect, connection, connective, connected, etc. The stemming process reduces all these forms to the normalized word "connect". Porter's English stemming algorithm is used to stem the words of each document in our stemming process.

2.3 Document Representation

Documents are represented by a set of keywords/terms extracted from the documents themselves. The union of all these sets of terms is the set of terms that represents the entire collection, and it defines a space such that each distinct term represents one dimension in that space. Since each document is represented as a set of terms, this space is called the document space [6]. A term-document matrix can then be encoded as a collection of n documents and m terms.
An entry in the matrix corresponds to the weight of a term in a document; zero means the term has no significance in the document, or simply does not occur in it. The whole document collection can therefore be seen as an m x n feature matrix A (with m as the number of documents), where the element a_ij represents the frequency of occurrence of feature j in document i. This way of representing documents is called the term-frequency method. However, terms with a large frequency are not necessarily more important, nor do they have higher discrimination power. So we might want to weight the terms with respect to the local context, the document, or the corpus. The most popular term weighting is the inverse document frequency, where the term frequency is weighted against the number of documents in the corpus in which the term appears. An extension of this is designated term frequency-inverse document frequency (tf-idf). The formulation of tf-idf is as follows:

    w_ij = tf_ij * log(N / df_i)

where w_ij is the weight of term i in document j, tf_ij is the number of occurrences of term i in document j, N is the total number of documents in the corpus, and df_i is the number of documents containing term i.

The development and understanding of the impact of terms and weights on text-data-mining methodologies is another area where statisticians can contribute. The encoding scheme is best explained in the recent work by Berry [10]. The figure below shows the term-document frequencies for the book titles.

D1: Data mining techniques: for marketing, sales and customer relationship management
D2: Principles of data mining: adaptive computation & machine learning

D3: Data mining: practical machine learning tools & techniques with Java
D4: Mastering data mining: the art and science of customer relationship management
D5: Mastering data modeling: a user-driven approach
D6: Investigative data mining for security and criminal detection
D7: Science and criminal detection
D8: Crime and human nature: the definitive study of the causes of crime
D9: Statistics of crime and criminals: a handbook of primary data
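The tf-idf weighting defined above can be computed for this small corpus with plain Python. The per-title term lists below are read off the term-document matrix in the figure (stemmed, italicised words only); that tokenisation, and the use of the natural logarithm, are assumptions, since the paper does not fix either.

```python
import math
from collections import Counter

# Term occurrences for the nine book titles, following the
# term-document matrix in the figure.
docs = {
    "D1": ["customer", "data", "management", "mining", "relationship", "technique"],
    "D2": ["data", "learning", "machine", "mining"],
    "D3": ["data", "learning", "machine", "mining", "technique"],
    "D4": ["customer", "data", "management", "relationship", "science"],
    "D5": ["data", "mastering"],
    "D6": ["crime", "data", "detection", "mining"],
    "D7": ["crime", "detection", "science"],
    "D8": ["crime", "crime"],
    "D9": ["crime", "crime", "data"],
}

N = len(docs)                 # total number of documents in the corpus
df = Counter()                # df_i: number of documents containing term i
for terms in docs.values():
    df.update(set(terms))

def tfidf(term, doc_id):
    """w_ij = tf_ij * log(N / df_i), the weighting from the text."""
    tf = docs[doc_id].count(term)
    return tf * math.log(N / df[term])
```

For example, "crime" occurs twice in D8 and appears in four of the nine documents, so its weight there is 2 * log(9/4), while a term occurring in every document would receive weight zero.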
Term-Document Matrix

      Crime  Customer  Data  Detection  Learning  Machine  Management  Mastering  Mining  Relationship  Science  Technique
D1      0       1       1       0          0         0          1          0         1          1           0         1
D2      0       0       1       0          1         1          0          0         1          0           0         0
D3      0       0       1       0          1         1          0          0         1          0           0         1
D4      0       1       1       0          0         0          1          0         0          1           1         0
D5      0       0       1       0          0         0          0          1         0          0           0         0
D6      1       0       1       1          0         0          0          0         1          0           0         0
D7      1       0       0       1          0         0          0          0         0          0           1         0
D8      2       0       0       0          0         0          0          0         0          0           0         0
D9      2       0       1       0          0         0          0          0         0          0           0         0

The figure shows a small corpus of nine book titles; each title is a document. To save space, only the italicized words of each title are used in the document list. The entry in row i, column j of the term-document matrix gives the number of times the j-th term occurs in the i-th document.

2.4 Dimensionality Reduction

The space in which the documents reside typically has thousands of dimensions or more. Given the collection of documents along with the associated distance matrix, we would like to find a convenient lower-dimensional space to

perform subsequent analysis. This will certainly facilitate clustering or classification. Through dimensionality reduction, one can remove noise from the data and better apply statistical data-mining methods to discover subtle relationships that might exist between the documents.

2.4.1 Principal Component Analysis

One of the decisive characteristics of any knowledge-discovery practice is attribute selection. Attribute selection determines which attributes contribute something valuable to the understanding of the data, and should hence be retained, and which attributes can be discarded.

Principal component analysis is a popular method which uses the eigenvectors of either the covariance or the correlation matrix to reduce dimensionality [7]. PCA reduces the dimensionality of a dataset by retrieving those characteristics of the dataset that contribute most of its variance: it keeps the lower-order principal components and ignores the higher-order ones. Such lower-order components often contain the most important aspects of the data. The main objective of principal component analysis is to transform a number of correlated variables into a number of uncorrelated variables called principal components [8]. There are several algorithms for calculating principal components. A general approach is to solve the eigenvalue problem of the covariance matrix of the input data X,

    Σv = λv

where Σ is the covariance matrix of X, v an eigenvector, and λ its corresponding eigenvalue. The resulting eigenvectors can then be sorted in descending order according to their corresponding eigenvalues. The eigenvalues give an indication of the amount of information which the respective principal components represent; the first principal component is the leading one and thus the most informative.
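The covariance-eigenvector recipe above can be sketched with NumPy. This is a minimal sketch, not the paper's implementation; it centres the features, solves the eigenvalue problem of the covariance matrix, and projects onto the k leading eigenvectors.

```python
import numpy as np

def pca_reduce(A, k):
    """Project the m x n document-feature matrix A onto its k leading
    principal components, via the eigenvalue problem of the covariance
    matrix described in the text."""
    A_centred = A - A.mean(axis=0)            # centre each feature
    sigma = np.cov(A_centred, rowvar=False)   # covariance matrix Sigma
    eigvals, eigvecs = np.linalg.eigh(sigma)  # solves Sigma v = lambda v
    order = np.argsort(eigvals)[::-1]         # sort eigenvalues descending
    return A_centred @ eigvecs[:, order[:k]]  # keep k leading components
```

Because the eigenvalues are sorted in descending order, the first retained component carries the most variance, matching the observation above that the leading principal component is the most informative.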

SECTION III
Document Clustering
K-means is one of the simplest unsupervised learning algorithms that solves the well-known clustering problem. It identifies groups, or clusters, of related documents, as well as the relationships among them. This approach is commonly referred to as unsupervised since it eliminates the need for tagged training documents and does not require a pre-existing taxonomy or category structure. However, clustering algorithms are not always good at selecting categories that are intuitive to human users. One of the most popular heuristics for solving the k-means problem is based on a simple iterative scheme for finding a locally minimal solution [9].

The clustering stage consists of the following two modules:

Similarity measurement module: the vector-space model obtained from the preprocessing stage is processed according to the similarity measure specified by the user (i.e., Euclidean distance or cosine similarity) to obtain the degrees of similarity among the various documents represented in the vector-space model.

Clustering module: the documents with the maximum degree of similarity (obtained from the previous module) are clustered using the clustering specification given by the user (i.e., k-means) to obtain the final set of clusters.

Conclusion

The Web has become the largest knowledge repository. Extracting information, and building knowledge from the extracted information efficiently and effectively, is becoming increasingly important for various reasons. As the popularity of the Web continues to increase, there is a growing need to develop tools and techniques that will help improve its overall usefulness.

References

1. E. Bingham and H. Mannila, "Random projection in dimensionality reduction: applications to image and text data," in Knowledge Discovery and Data Mining, pages 245-250, 2001.
2. V. Estivill-Castro and J. Yang, "Fast and Robust General Purpose Clustering Algorithms," Data Mining and Knowledge Discovery, vol. 8, pages 127-150, 2004.
3. I. Cadez, S. Gaffney and P. Smyth, "A general probabilistic framework for clustering individuals," in Proceedings of the Sixth ACM SIGKDD International Conference, pages 140-149, Boston, United States, 2000.
4. A. Jain and R. Dubes, Algorithms for Clustering Data, Prentice Hall Inc., Upper Saddle River, NJ, 1988.
5. C. P. Wei and Y. X. Dong, "A Mining-Based Category Evolution Approach to Managing Online Document Categories," in Proceedings of the 34th Annual Hawaii International Conference on System Sciences, 2001.
6. J. L. Martínez-Fernández, A. García-Serrano, P. Martínez and J. Villena, "Automatic Keyword Extraction for News Finder," Computer Science Department, Universidad Carlos III de Madrid, Avda. Universidad 30, 28911 Leganés, Madrid, Spain.
7. I. Jolliffe, Principal Component Analysis, Springer-Verlag, 1986.
8. J. Vesanto, "SOM-based data visualization methods," Intelligent Data Analysis, vol. 3, pages 111-126, 1999; Vesanto and Alhoniemi, 2000.
9. A. Likas, N. Vlassis and J. J. Verbeek, "The global k-means clustering algorithm," Pattern Recognition, vol. 36, no. 2, pages 451-461, 2003.
10. M. W. Berry, Survey of Text Mining: Clustering, Classification, and Retrieval, Springer, 2003.
