Course 1: Course Information

DATA MINING
Course Information
o Course Description
o Course Goals
o Grading Policies
o Course Content
o Textbook / Materials
o Datasets sources
o Software
Course description:
Data Mining studies algorithms and computational paradigms that allow computers to find patterns and regularities in databases, perform prediction and forecasting, and generally improve their performance through interaction with data.
It is currently regarded as the key element of a more general process called Knowledge Discovery, which deals with extracting useful knowledge from raw data.
The knowledge discovery process includes:
o data selection,
o cleaning,
o coding,
o using different statistical and machine learning techniques, and
o visualization of the generated structures.
The course will cover all these issues and will illustrate the whole process by examples. [1]
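The stages of the knowledge discovery process listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the field names and the helper functions (`select`, `clean`, `code`, `model`, `visualize`) are hypothetical, and the "model" is a plain per-group mean standing in for the statistical and machine learning techniques covered later in the course.

```python
from statistics import mean

# Hypothetical raw records; None marks a missing value.
raw = [
    {"id": 1, "segment": "retail", "visits": 4, "spend": 40.0},
    {"id": 2, "segment": "retail", "visits": None, "spend": 25.0},
    {"id": 3, "segment": "business", "visits": 10, "spend": 130.0},
    {"id": 4, "segment": "business", "visits": 8, "spend": 110.0},
    {"id": 5, "segment": "retail", "visits": 6, "spend": 60.0},
]

def select(rows, fields):
    """Data selection: keep only the fields relevant to the question."""
    return [{f: r[f] for f in fields} for r in rows]

def clean(rows):
    """Cleaning: drop records that contain missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]

def code(rows, field):
    """Coding: replace category labels with integer codes."""
    labels = sorted({r[field] for r in rows})
    codes = {label: i for i, label in enumerate(labels)}
    return [{**r, field: codes[r[field]]} for r in rows], codes

def model(rows, group, target):
    """A deliberately simple stand-in model: mean of the target per group."""
    groups = {}
    for r in rows:
        groups.setdefault(r[group], []).append(r[target])
    return {g: mean(vals) for g, vals in groups.items()}

def visualize(summary, scale=10.0):
    """Visualization: one text bar per group."""
    return [f"{g}: {'#' * round(v / scale)} {v:.1f}"
            for g, v in sorted(summary.items())]

selected = select(raw, ["segment", "visits", "spend"])
cleaned = clean(selected)
coded, codes = code(cleaned, "segment")
summary = model(coded, "segment", "spend")
chart = visualize(summary)
print("\n".join(chart))
```

Each stage takes the output of the previous one, so a different cleaning rule or a different technique can be swapped in without touching the other stages.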
Data Mining is an analytic process designed to explore data (usually large amounts of data, typically business- or market-related, also known as "big data") in search of consistent patterns and/or systematic relationships between variables, and then to validate the findings by applying the detected patterns to new subsets of data.
The ultimate goal of data mining is prediction, and predictive data mining is the most common type of data mining and the one that has the most direct business applications. [2]
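Validating findings on new subsets of data can be illustrated with a holdout split. The sketch below is a minimal example, assuming a made-up data set and a least-squares line as the "detected pattern"; the function names `fit_line` and `mean_abs_error` are hypothetical and not taken from any course material.

```python
def fit_line(points):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mean_abs_error(points, slope, intercept):
    """Average absolute difference between observed and predicted y."""
    return sum(abs(y - (slope * x + intercept)) for x, y in points) / len(points)

# Hypothetical data: a linear trend plus a small alternating disturbance.
data = [(x, 2 * x + 1 + (0.5 if x % 2 == 0 else -0.5)) for x in range(12)]

# Hold out every third record; the pattern is detected on the rest only.
train = [p for i, p in enumerate(data) if i % 3 != 2]
holdout = [p for i, p in enumerate(data) if i % 3 == 2]

slope, intercept = fit_line(train)
train_error = mean_abs_error(train, slope, intercept)
holdout_error = mean_abs_error(holdout, slope, intercept)
print(f"slope={slope:.2f} intercept={intercept:.2f}")
print(f"train MAE={train_error:.2f} holdout MAE={holdout_error:.2f}")
```

If the error on the held-out records is close to the error on the training records, the pattern is likely to carry over to new data; a much larger holdout error suggests the model has fitted noise.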
Data mining offers great promise in helping organizations uncover patterns hidden in their data
that can be used to predict the behavior of customers, products and processes. However, data
mining tools need to be guided by users who understand the business, the data, and the
general nature of the analytical methods involved. Realistic expectations can yield rewarding
results across a wide range of applications, from improving revenues to reducing costs. Building
models is only one step in knowledge discovery. It's vital to properly collect and prepare the
data, and to check the models against the real world. The best model is often found after
building models of several different types, or by trying different technologies or algorithms.
[1] http://www.cs.ccsu.edu/~markov/ccsu_courses/580Syllabus.html
[2] http://documents.software.dell.com/statistics/textbook/data-mining-techniques#eda
Choosing the right data mining products means finding a tool with good basic capabilities, an
interface that matches the skill level of the people who'll be using it, and features relevant to your
specific business problems. After you've narrowed down the list of potential solutions, get a
hands-on trial of the likeliest ones. ...
... There are two keys to success in data mining. First is coming up with a precise formulation
of the problem you are trying to solve. ... The second key is using the right data. After choosing
from the data available to you, or perhaps buying external data, you may need to transform and
combine it in significant ways. ...
The goal of data mining is to produce new knowledge that the user can act upon. It does this
by building a model of the real world based on data collected from a variety of sources which may
include corporate transactions, customer histories and demographic information, process control
data, and relevant external databases such as credit bureau information or weather data. The result
of the model building is a description of patterns and relationships in the data that can be
confidently used for prediction. [3]
... analytic skills are in high demand in the nonprofit and governmental arenas. In fact,
analytics is even a mission-critical skill in military, intelligence and security operations. ... Here
are a few of the questions addressed by these methods:
o ... to respond to an offer?
o ... appeal to web radio listeners?
o ... the contribution of an individual player to a sports team? [4]
The rapid proliferation of the Internet and related technologies has created an unprecedented
opportunity for enterprises to collect massive amounts of data regarding customers and all aspects
of their business operations. Yet the reality is that most organizations today are (i) data rich
but information and knowledge poor, and (ii) not harnessing the full potential of their data,
which is perhaps the second most important asset after human capital. Internet based
applications such as social media, website usage tracking and online reviews as well as more
traditional technology applications like RFID, Supply Chain Management (SCM), Enterprise
Resource Planning (ERP) and Customer Relationship Management (CRM) provide access to vast
amounts of data regarding customers, suppliers, competitors as well as a firm's own activities and
business processes. Being able to unlock the insights and knowledge trapped in such raw data
constitutes a key lever for competitive advantage in hypercompetitive business
environments. This course is designed to showcase the virtually unlimited opportunities that
exist today to leapfrog the competition by leveraging the data that organizations routinely collect
every day, but which they hardly use strategically to make decisions at various points in the value
chain. [5]
[3] http://www.webdelprofesor.ula.ve/economia/gcolmen/programa/economia/intro_to_data_mining_(3rd_edition).pdf
[4] advanced business analytics syllabus
[5] https://www.mccombs.utexas.edu/~/media/Files/MSB/Sharepoint/Syllabi/IROM%20Syllabi/2011-12/Spring2012/MIS%20382N.9%20Data%20Mining%20for%20Business%20Intelligence%20(Barua).pdf.pdf
Course Information:
Course Goals / Aims / Outcomes
o To introduce students to the basic concepts and techniques of Data Mining
o Understanding of the value of data mining in solving real-world problems
o Understanding of foundational concepts underlying data mining
o Understanding of algorithms commonly used in data mining tools
o Ability to apply data mining tools to real-world problems
o Evaluate models/algorithms with respect to their accuracy
o Critique the results of a data mining exercise
o To develop skills of using recent data mining software for solving practical problems
o To gain experience of doing independent study and research
Course Information:
Grading Policies
Grading will be based on:
o FIVE assignments (A1, A2, A3, A4, A5), each marked 0-10
o Final project
o ONE quiz (final exam)
o SIX evaluations
Assignments, project and quiz will be graded on a 0-10 point scale.
STEP 1
$$N1 = \frac{\frac{MA_1 + MA_2 + MA_3 + MA_4 + MA_5}{5} + MP}{2}$$
IF five evaluations = ok THEN
    MP = MP
ELSE
    MP = MP - 1
END IF
STEP 2
IF N1 < 5 OR NE < 5 THEN
    N = 4
ELSE
    N = MAX(NA, NE)
END IF
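The two grading steps can be sketched in Python as follows. This is one reading of the rules, not an official calculator. It assumes that STEP 1 averages the mean of the five assignment marks with the project mark MP, and that the MP = MP - 1 penalty is applied before that average is taken. NA is not defined in the rules above, so it is passed in explicitly. The function names `step1` and `step2` are hypothetical.

```python
def step1(assignment_marks, project_mark, evaluations_ok):
    """STEP 1 as read here: average of the mean assignment mark and MP."""
    if len(assignment_marks) != 5:
        raise ValueError("expected marks for A1..A5")
    # Assumption: the MP = MP - 1 penalty is applied before averaging.
    mp = project_mark if evaluations_ok else project_mark - 1
    return (sum(assignment_marks) / 5 + mp) / 2

def step2(n1, ne, na):
    """STEP 2: grade 4 if N1 or NE is below 5, else the larger of NA and NE."""
    if n1 < 5 or ne < 5:
        return 4
    return max(na, ne)

# Hypothetical student: assignment marks 8, 9, 7, 10, 6 and project mark 9.
n1 = step1([8, 9, 7, 10, 6], project_mark=9, evaluations_ok=True)
final = step2(n1, ne=7, na=n1)  # assumption for this example only: NA equals N1
print(n1, final)
```

Keeping the two steps as separate functions mirrors the rules: the first produces N1, the second applies the pass threshold and picks the final grade N.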
It is expected that all students will conduct themselves in an honest manner and NEVER claim work which is not their own.
Violating this policy will result in a substantial grade penalty or a final grade of 4.
Course Information:
Week assignments (Romanian and/or English)
Each week assignment Ai (i=1,..,5) consists of three parts:
Part 1
Write a quiz with at least 9 questions and appropriate answers on the week's topic (and on earlier
topics related to it). Organize the questions into 3 groups: elementary, moderate
difficulty, high difficulty.
Part 2
Identify in the literature at least three bibliographical sources (articles, PhD dissertations,
research reports, etc.) that use the week's topic to solve a real-world problem. Write a
synthesis / overview of these bibliographical sources (maximum 1 page for 3 bibliographical
sources).
Part 3
Apply the week's topic (the technique described in the lecture) to the data set chosen for study (in
the first week of this course). Submit the results electronically.
Each week assignment Ai (i=1,..,5) has the following deadlines (assignments are sent to programarelogica@yahoo.ro):

Assignment   Week topic          Between                    Deadline
Dataset      -                   27.02.2017 - 5.03.2017     5.03.2017
A1           Statistics          6.03.2017 - 12.03.2017     12.03.2017, 24:00
A2           Regression          12.03.2017 - 19.03.2017    19.03.2017, 24:00
A3           Decision tree       20.03.2017 - 26.03.2017    26.03.2017, 24:00
A4           Association rules   27.03.2017 - 2.04.2017     2.04.2017, 24:00
A5           Clustering          3.04.2017 - 9.04.2017      9.04.2017, 24:00
Each week assignment mark MAi (i=1,..,5) will be computed according to the formula:
$$MA_i = \frac{\text{Part 1 mark} + \text{Part 2 mark} + \text{Part 3 mark}}{3}$$
Project (Romanian and/or English)
Students will complete a project on a topic of their choosing.
At a minimum, a project will involve:
1. Identifying questions to answer
2. Locating appropriate data and processing data from a real domain (mining real data)
3. Applying various data mining techniques to create a comprehensive and accurate
model of the data in order to address each question
4. Presenting results (article format)
Extraordinary contributions to the intellectual process, or the use of methods not presented in the
course, will also be recognized in the final grade of the project.
Project evaluation
To evaluate the projects you will receive a survey in electronic format. Each student must
evaluate at least 6 projects, two in every category, and must rank the projects as follows
(example):

Scale:       1 2 3 4 5 6 7 8 9 10
Categories:  low-level projects | projects that need more work | good projects
Minimum:     six groups, two per category, each with at least 1 project

If a student does not evaluate 6 projects as required, one point will be deducted from
the final grade of the project (MP = MP - 1).
Course Information:
Grading Policies
"During Research Training 1, students study the basic principles of scientific research and
statistics. They learn how to formulate a research question, create a research design, collect and
process data, make posters, and present their research to their fellow students.
In the second year, Research Training 2 students fan out across VUmc, and groups of them
participate in the research programmes of various departments. That may involve laboratory
research, research with imaging instruments, or qualitative research by means of interviews. The
student teams write reports, using a standard format. Two weeks later, three selected teams of
students present their research at a symposium. It is a short and intense period of study, in which
they need to generate results quickly. The third year is devoted to scientifically focussed
coursework, and in-depth studies of methodology and statistics. The students write a paper and
prepare for their research internship. As soon as students carry out their own research, they
develop an increased appreciation for methodology and statistics." [6]
VU University Medical Center has been carrying out extramural epidemiological research into
diabetes since 1989. In that year, we launched a cohort study at Hoorn, into the prevalence, causes,
and impact of the development of type 2 diabetes. The study, which was carried out non-selectively
and by means of random sampling, involved nearly 2,500 individuals aged from 50 to 75. The
study is repeated every five years. Over time we have collected information on lipid metabolism,
coagulation, and cardiovascular function, as well as on lifestyle, diet, family history, and cognitive
function. This data is particularly valuable for our research into the interrelationships
between the various aspects involved. ...
In collaboration with international partners, we are analysing well over one hundred
international cohort studies, in which the data of millions of people is combined. This allows us
to obtain a very accurate picture of the true relationship between specific factors involved in
diabetes. That is a very powerful tool! [7]
[6] https://www.vumc.com/branch/epidemiology-biostatistics/education/5346028/
[7] https://www.vumc.com/branch/epidemiology-biostatistics/research/5346010/
Course Information:
Course Content
1. Course Information
o Course Description
o Course Goals
o Grading Policies
o Course Content
o Textbook / Materials
o Datasets sources
2. Introduction to Data Mining
o What is data mining?
o Related technologies: Statistics, Machine Learning, Artificial Intelligence
o Data Mining Goals
o Data Mining Techniques
3. Stages of the Data Mining Process
o Data preprocessing
A. Data cleaning
B. Data transformation
C. Data reduction
D. Basic data transformation
E. Handling missing values
o Model building and validation
o Deployment
4. Statistical measures
o Task relevant data / data exploration
o Visualization techniques
A. Univariate data visualization
B. Bivariate data visualization
C. Multivariate data visualization
D. Visualizing groups
5. Linear regression
o The prediction task
o Algorithms for Subset Selection in Linear Regression
o Dropping irrelevant variables
o Instance-based methods (nearest neighbor)
6. Logistic regression
o Computation of estimates
o Newton-Raphson algorithm
7. Classification and regression trees
o Basic learning/mining tasks
o Recursive partitioning
o Classification rules from trees
8. Classification and regression trees
o ID3, C4.5, C5 algorithm
o CHAID algorithm
o CART algorithm
9. Association rules
o Motivation and terminology
o Basic idea: item sets
o Generating item sets and rules efficiently
o Correlation analysis
10. Performance evaluation
o Basic issues
o Training and testing
o Estimating classifier accuracy (holdout, cross-validation, leave-one-out)
o Combining multiple models (bagging, boosting, stacking)
11. Clustering
o Basic issues in clustering
o First conceptual clustering system: Cluster/2
12. Clustering
o Partitioning methods: k-means, expectation maximization (EM)
o Hierarchical methods: distance-based agglomerative and divisive clustering
13. Project presentations
o Project presentations
o Project evaluation
14. Project presentations
o Project presentations
o Project evaluation
Course Information:
Textbook / Materials
The following texts may be used for reference:
1. Ian H. Witten and Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques (Second Edition), Morgan Kaufmann, 2005, ISBN: 0-12-088407-0.
ftp://ftp.ingv.it/pub/manuela.sbarra/Data%20Mining%20Practical%20Machine%20Learning%20Tools%20and%20Techniques%20-%20WEKA.pdf
2. Michael J.A. Berry, Gordon Linoff, Data Mining Techniques: For Marketing, Sales, and Customer Relationship Management, 2nd ed., New York: Wiley, 2004, ISBN 0-471-47064-3.
http://197.14.51.10:81/pmb/GESTION2/MARKETING/Data%20Mining%20Techniques%20For%20Marketing%20Sales%20And%20Customer%20Relationship%20Management%202Ed.pdf
3. Edelstein, H. A. (1999). Introduction to Data Mining and Knowledge Discovery (3rd ed.). Potomac, MD: Two Crows Corp.
http://www.webdelprofesor.ula.ve/economia/gcolmen/programa/economia/intro_to_data_mining_(3rd_edition).pdf
4. StatSoft, Electronic Statistics Textbook,
http://www.statsoft.com/Textbook
5. Glenn Myatt and Wayne Johnson, Making Sense of Data II, John Wiley & Sons, 2009.
https://books.google.ro/books?id=lFBqIpM-vuQC&pg=PA273&lpg=PA273&dq=Making+Sense+of+Data+II:+A+Practical+Guide+to+Data+Visualization,+Advanced+Data+Mining+Methods,+and+Applications+pdf&source=bl&ots=VNBWybValK&sig=fjzOW1fwzUl1J6rVR_Dpr6Sqw6c&hl=ro&sa=X&ved=0ahUKEwil-4z2pLDSAhVMVhQKHV5QDUsQ6AEIYTAJ#v=onepage&q=Making%20Sense%20of%20Data%20II%3A%20A%20Practical%20Guide%20to%20Data%20Visualization%2C%20Advanced%20Data%20Mining%20Methods%2C%20and%20Applications%20pdf&f=false
6. Galit Shmueli, Nitin R. Patel, Peter C. Bruce, Data Mining for Business Intelligence: Concepts, Techniques, and Applications in Microsoft Office Excel with XLMiner, Wiley, 2nd edition (October 26, 2010), ISBN-10: 0470526823, ISBN-13: 978-0470526828.
7. Derek Hansen, Ben Shneiderman and Marc A. Smith, Analyzing Social Media Networks with NodeXL: Insights from a Connected World, Morgan Kaufmann, 1st edition (September 10, 2010), ISBN-10: 0123822297, ISBN-13: 978-0123822291.
Course Information:
Potential Datasets Sources
1. University of California, Irvine, Repository of Machine Learning Databases (UCI): https://archive.ics.uci.edu/ml/datasets.html
2. http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data
3. http://mldata.org/repository/data/ - a repository for your machine learning data
4. http://dnr.maryland.gov/streams/Pages/mbss.aspx
5. http://apps.who.int/bmi/index.jsp - WHO (World Health Organization)
6. http://gco.iarc.fr/databases.php - four cancer databases
7. http://eco.iarc.fr/ - European Cancer Observatory (ECO)
8. http://www-dep.iarc.fr/nordcan.htm - NORDCAN
9. EUROSTAT: http://ec.europa.eu/eurostat/web/health/health-status-determinants/data/database
10. ECHI (European Core Health Indicators) data tool: http://ec.europa.eu/health/major_chronic_diseases/indicators_en
11. INSSE: http://www.insse.ro/cms/files/eurostat/adse/microdata.htm
12. INSSE: http://www.insse.ro/cms/ro/content/statistica-oficiala-din-romania
DATAGOV ???
Course Information:
Software
RapidMiner
Weka
Python
SPSS
SPSS Modeler
XLMiner (an Excel add-in)
Statistica
SAS Enterprise Miner
Etc.