
10 Challenging Problems in Data Mining Research
Prepared for ICDM 2005

1. Developing a Unifying Theory of Data Mining
o The current state of the art of data-mining research is too "ad hoc"
   o Techniques are designed for individual problems
   o There is no unifying theory
o Needs unifying research
   o Exploration vs. explanation
o Long-standing theoretical issues
   o How to avoid spurious correlations?
o Deep research
   o Knowledge discovery on hidden causes?
   o Similar to the discovery of Newton's laws?
o An example (from tutorial slides by Andrew Moore): VC dimension. If you've got a learning algorithm in one hand and a dataset in the other hand, to what extent can you decide whether the learning algorithm is in danger of overfitting or underfitting?
   o Formal analysis of the question of how overfitting can happen
   o Estimating how well an algorithm will perform on future data based solely on its training-set error and a property (the VC dimension) of the learning algorithm
   o The VC dimension thus gives an alternative to cross-validation, called Structural Risk Minimization (SRM), for choosing classifiers
   o CV, SRM, AIC and BIC

2. Scaling Up for High Dimensional Data and High Speed Streams
o Scaling up is needed
   o Ultra-high dimensional classification problems (millions or billions of features, e.g., bio data)
   o Ultra-high speed data streams
o Streams
   o Continuous, online process
   o E.g., how to monitor network packets for intruders?
   o Concept drift and environment drift?
   o RFID network and sensor network data

3. Sequential and Time Series Data
o How to efficiently and accurately cluster, classify and predict the trends?
o Time series data used for predictions are contaminated by noise
   o How to do accurate short-term and long-term predictions?
   o Signal processing techniques introduce lags in the filtered data, which reduces accuracy
   o Key in source selection, domain knowledge in rules, and optimization methods

4. Mining Complex Knowledge from Complex Data
o Mining graphs
o Data that are not i.i.d. (independent and identically distributed)
   o Many objects are not independent of each other, and are not of a single type
   o Mine the rich structure of relations among objects, e.g., interlinked Web pages, social networks, metabolic networks in the cell
o Integration of data mining and knowledge inference
   o The biggest gap: data mining systems are unable to relate the results of mining to the real-world decisions they affect; all they can do is hand the results back to the user
o More research on interestingness of knowledge

5. Data Mining in a Network Setting
o Community and social networks
   o Linked data between emails, Web pages, blogs, citations, sequences and people
   o Static and dynamic structural behavior
o Mining in and for computer networks
   o Detect anomalies (e.g., sudden traffic spikes due to a DoS (Denial of Service) attack)
   o Need to handle 10 Gigabit Ethernet links: (a) detect, (b) trace back, (c) drop packet

6. Distributed Data Mining and Mining Multi-agent Data
o Need to correlate the data seen at the various probes (such as in a sensor network)
o Adversary data mining: an adversary deliberately manipulates the data to sabotage the miners (e.g., make them produce false negatives)
o Game theory may be needed for help
[Figure: linked entities labeled Citation (Paper 2), Author (Paper 1), Title, Conference Name]
[Figure: game with Player 1 = miner; actions H and T; outcomes (1,-1)]

7. Data Mining for Biological and Environmental Problems
o New problems raise new questions
o Large-scale problems especially so
o Biological data mining, such as HIV vaccine design
   o DNA, chemical properties, 3D structures, and functional properties need to be fused
o Environmental data mining
o Mining for solving the energy crisis

8. Data-mining-Process Related Problems
o How to automate the mining process?
   o The composition of data mining operations
   o Data cleaning, with logging capabilities
   o Visualization and mining automation
o Need a methodology: help users avoid many data mining mistakes
   o What is a canonical set of data mining operations?

9. Security, Privacy and Data Integrity
o How to ensure the users' privacy while their data are being mined?
o How to do data mining for protection of security and privacy?
o Knowledge integrity assessment
   o Data are intentionally modified from their original version, in order to misinform the recipients or for privacy and security
   o Development of measures to evaluate the knowledge integrity of a collection of data, knowledge and patterns

10. Dealing with Non-static, Unbalanced and Cost-sensitive Data
o The UCI datasets are small and not highly unbalanced
o Real-world data are large (10^5 features), but fewer than 1% of the examples belong to the useful (positive) classes
o There is much information on costs and benefits, but no overall model of profit and loss
o Data may evolve with a bias introduced by sampling
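Problem 10 above points to data in which fewer than 1% of the examples are positive and in which costs and benefits matter. The following is a minimal sketch of why plain accuracy can mislead in that setting. All counts and costs are made-up illustrations, and the function names are hypothetical, not taken from the slides.

```python
def accuracy(tp, fp, tn, fn):
    """Fraction of all examples that were classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


def total_cost(fp, fn, cost_fp, cost_fn):
    """Total misclassification cost; correct decisions are assumed free."""
    return fp * cost_fp + fn * cost_fn


# Hypothetical data set: 10,000 examples, 1% of them positive.
N_POS, N_NEG = 100, 9_900
# Assumed costs: missing a positive is 50 times worse than a false alarm.
COST_FP, COST_FN = 1.0, 50.0

# Classifier A always predicts "negative".
always_negative = dict(tp=0, fp=0, tn=N_NEG, fn=N_POS)
# Classifier B finds 80 of the 100 positives but raises 500 false alarms.
detector = dict(tp=80, fp=500, tn=N_NEG - 500, fn=20)

for name, cm in [("always negative", always_negative), ("detector", detector)]:
    acc = accuracy(**cm)
    cost = total_cost(cm["fp"], cm["fn"], COST_FP, COST_FN)
    print(f"{name}: accuracy={acc:.3f}, cost={cost:.0f}")
```

Under these assumed numbers the trivial classifier has the higher accuracy but the higher total cost, which is the mismatch that a profit-and-loss view of evaluation is meant to expose.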

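Problem 3 above states that signal processing techniques introduce lags in the filtered data. A minimal sketch of that effect follows, using a trailing moving average as the illustrative filter (an assumption; the slides do not name a specific filter). The function name and the ramp signal are hypothetical.

```python
def trailing_moving_average(series, window):
    """Causal filter: mean of the most recent `window` samples at each step."""
    if window < 1:
        raise ValueError("window must be >= 1")
    out = []
    running = 0.0
    for i, x in enumerate(series):
        running += x
        if i >= window:
            running -= series[i - window]  # drop the sample that left the window
        if i >= window - 1:
            out.append(running / window)   # emit once the window is full
    return out


# A noise-free linear trend makes the lag easy to see.
ramp = [float(t) for t in range(20)]
window = 5
smoothed = trailing_moving_average(ramp, window)

# smoothed[k] is the filter output at time t = k + window - 1.
lags = [ramp[k + window - 1] - s for k, s in enumerate(smoothed)]
print(lags[:3])
```

On a linear trend, the filtered value trails the true value by (window - 1) / 2 samples, so a wider smoothing window removes more noise at the price of a later response.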
[Figure: Sampling, Feature Sel, Mining]

1.5 Data mining problems/issues

Data mining systems rely on databases to supply the raw data for input, and this raises problems because databases tend to be dynamic, incomplete, noisy, and large. Other problems arise as a result of the adequacy and relevance of the information stored.

1.5.1 Limited Information

A database is often designed for purposes different from data mining, and sometimes the properties or attributes that would simplify the learning task are not present, nor can they be requested from the real world. Inconclusive data causes problems because, if some attributes essential to knowledge about the application domain are not present in the data, it may be impossible to discover significant knowledge about a given domain. For example, one cannot diagnose malaria from a patient database if that database does not contain the patients' red blood cell count.

1.5.2 Noise and missing values

Databases are usually contaminated by errors, so it cannot be assumed that the data they contain is entirely correct. Attributes which rely on subjective or measurement judgements can give rise to errors such that some examples may even be misclassified. Errors in either the values of attributes or the class information are known as noise. Where possible, it is desirable to eliminate noise from the classification information, as this affects the overall accuracy of the generated rules.

Missing data can be treated by discovery systems in a number of ways, such as:
o simply disregard missing values
o omit the corresponding records
o infer missing values from known values
o treat missing data as a special value to be included additionally in the attribute domain
o average over the missing values using Bayesian techniques

Noisy data in the sense of being imprecise is characteristic of all data collection and typically fits a regular statistical distribution such as a Gaussian, while wrong values are data entry errors.
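Three of the missing-value treatments listed above can be sketched as follows. This is a minimal illustration on made-up records: the field names, the `"<missing>"` token, and the use of the mean as the inferred value are all assumptions, and the Bayesian averaging option is not shown.

```python
from statistics import mean

# Hypothetical records; None marks a missing value.
records = [
    {"age": 34, "region": "north"},
    {"age": None, "region": "south"},
    {"age": 52, "region": None},
    {"age": 41, "region": "north"},
]


def omit_incomplete(rows):
    """Omit every record that has at least one missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]


def impute_mean(rows, field):
    """Infer a missing numeric value from the known values (here: their mean)."""
    fill = mean(r[field] for r in rows if r[field] is not None)
    return [{**r, field: fill if r[field] is None else r[field]} for r in rows]


def mark_as_special(rows, field, token="<missing>"):
    """Treat 'missing' as an extra value in the attribute's domain."""
    return [{**r, field: token if r[field] is None else r[field]} for r in rows]


print(len(omit_incomplete(records)))
print(impute_mean(records, "age")[1])
print(mark_as_special(records, "region")[2])
```

Each function returns new records and leaves the input untouched, so the strategies can be compared side by side on the same data.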
Statistical methods can treat problems of noisy data and separate different types of noise.

1.5.3 Uncertainty

Uncertainty refers to the severity of the error and the degree of noise in the data. Data precision is an important consideration in a discovery system.

1.5.4 Size, updates, and irrelevant fields

Databases tend to be large and dynamic in that their contents are ever-changing as information is added, modified or removed. The problem with this from the data mining perspective is how to ensure that the rules are up-to-date and consistent with the most current information. Also, the learning system has to be time-sensitive, as some data values vary over time and the discovery system is affected by the "timeliness" of the data.

Another issue is the relevance or irrelevance of the fields in the database to the current focus of discovery; for example, post codes are fundamental to any studies trying to establish a geographical connection to an item of interest such as the sales of a product.

Workshop Description

Motivation

Early work in predictive data mining did not address the complex circumstances in which models are built and applied. It was assumed that a fixed amount of training data was available, and only simple objectives, namely predictive accuracy, were considered. Over time, it became clear that these assumptions were unrealistic and that the economic utility of acquiring training data, building a model, and applying the model had to be considered. The machine learning and data mining communities responded with research on active learning, which focused on methods for cost-effective acquisition of information for the training data, and research on cost-sensitive learning, which considered the costs and benefits associated with using the learned knowledge and how these costs and benefits should be factored into the data mining process.

All the different stages of the data mining process are affected by economic utility. In the data acquisition phase we have to consider the costs of obtaining training data, such as the cost of labelling additional examples or acquiring new feature values. In applying the data mining algorithm, we have to consider the running time of the algorithm and the costs and benefits associated with cleaning the data, transforming the data and constructing new features. Economic utility also impacts the assessment of the decisions made based on the learned knowledge. Simple assessment measures like predictive accuracy have given way to more complex economic measures, including measures of profitability. These considerations can in turn impact policies for model induction. The latter topic has received more attention in the context of cost-sensitive learning.

Goals

Almost all work that considers the impact of economic utility on data mining focuses exclusively on one of the stages in the data mining process. Thus, economic factors have been studied in isolation, without much attention to how they interact. This workshop will begin to remedy this deficiency by bringing together researchers who currently consider different economic aspects in data mining, and by promoting an examination of the impact of economic utility throughout the entire data mining process. This workshop will attempt to encourage the field to go beyond what has been accomplished individually in the areas of active learning and cost-sensitive learning (although both of these areas are within the scope of this workshop). In addition, existing research which has addressed the role of economic utility in data mining has focused on predictive data mining tasks. This workshop will begin to explore methods for incorporating economic utility considerations into both predictive and descriptive data mining tasks.

This workshop will be geared toward researchers with an interest in how economic factors affect data mining (e.g., researchers in cost-sensitive learning and evaluation and active learning) and practitioners who have real-world experience with how these factors influence data mining. Attendance is not limited to the paper authors, and we strongly encourage interested researchers from related areas to attend the workshop. This will be a full-day workshop and will include invited talks, paper presentations, short position statements and two panel discussions.

Workshop Topics

Types of economic factors in data mining
o What economic factors arise in the context of data mining and to what stage of the data mining process do they apply?
o What assessment metrics are used in response to these economic factors?
o Can the use of economic utility help address previously studied problems in data mining, such as the problems of learning rare classes and learning from skewed distributions?

Algorithms
o Utility-based approaches for information acquisition, data preprocessing, mining and knowledge application. This includes work in active learning/sampling and cost-sensitive learning.
o This workshop will also address how predictive and descriptive data mining tasks such as predictive modeling, clustering and link analysis can be adapted to incorporate economic utility.

Consideration of economic utility throughout the data mining process
o Work towards a comprehensive framework for incorporating economic utility to benefit the entire data mining process. This work includes utility-based data mining techniques which take into account the dependencies between different phases of the data mining process to maximize the utility of more than a single phase. For example, methods for acquiring training data which take into account the costs of errors in addition to the cost of training data; or methods for the extraction of predictive patterns which take into account the cost of test features necessary at prediction time.

Applications
o What existing data mining applications have taken economic utility into account?
o What methods do these applications use to take economic utility into consideration?
o How do economic utility and the methods for dealing with it vary according to the specific problem addressed (e.g., by industry)?
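The cost-sensitive learning mentioned in the workshop description can be illustrated with a standard minimum-expected-cost decision rule. This is only a sketch under stated assumptions: two classes, correct decisions cost nothing, and the cost values and function names are hypothetical rather than taken from the workshop text.

```python
def expected_costs(p_positive, cost_fp, cost_fn):
    """Expected cost of each action given P(example is positive)."""
    return {
        "predict_positive": (1.0 - p_positive) * cost_fp,  # wrong if truly negative
        "predict_negative": p_positive * cost_fn,          # wrong if truly positive
    }


def min_cost_decision(p_positive, cost_fp, cost_fn):
    """Choose the action with the lowest expected cost."""
    costs = expected_costs(p_positive, cost_fp, cost_fn)
    return min(costs, key=costs.get)


def cost_threshold(cost_fp, cost_fn):
    """Probability above which predicting positive is the cheaper action."""
    return cost_fp / (cost_fp + cost_fn)


# Assumed costs: a missed positive is nine times as costly as a false alarm.
COST_FP, COST_FN = 1.0, 9.0
print(cost_threshold(COST_FP, COST_FN))
print(min_cost_decision(0.20, COST_FP, COST_FN))
print(min_cost_decision(0.05, COST_FP, COST_FN))
```

With these assumed costs the decision threshold drops from the accuracy-oriented 0.5 to 0.1, so an example with only a 20% chance of being positive is still flagged.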
