
DATA MINING

Course Information

o Course Description
o Course Goals
o Grading Policies
o Course Content
o Textbook / Materials
o Datasets sources
o Software

Course Information:
Course Description, Course Goals, Grading Policies, Course Content, Textbook / Materials,
Datasets sources, Software

Course description:

Data Mining studies algorithms and computational paradigms that allow computers to
find patterns and regularities in databases,
perform prediction and forecasting,
and generally improve their performance through interaction with data.

It is currently regarded as the key element of a more general process called Knowledge Discovery that deals with extracting useful knowledge from raw data.
The knowledge discovery process includes:
data selection,
cleaning,
coding,
using different statistical and machine learning techniques, and
visualization of the generated structures.

The course will cover all these issues and will illustrate the whole process by examples. [1]
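As a minimal sketch of the knowledge discovery steps listed above, the following Python fragment walks a tiny data set through selection, cleaning, coding, a simple statistical summary and a text-only visualization. The records, field names and the choice of a per-segment average are hypothetical and serve only as illustration; real projects would use the techniques covered in the lectures.

```python
from statistics import mean

# Hypothetical raw records; None marks a missing value.
raw = [
    {"id": 1, "age": 34, "segment": "retail", "spend": 120.0},
    {"id": 2, "age": None, "segment": "retail", "spend": 80.0},
    {"id": 3, "age": 51, "segment": "corporate", "spend": 310.0},
    {"id": 4, "age": 46, "segment": "corporate", "spend": 290.0},
]

# 1. Data selection: keep only the task-relevant attributes.
selected = [{k: r[k] for k in ("age", "segment", "spend")} for r in raw]

# 2. Cleaning: replace missing ages with the mean of the observed ages.
known_ages = [r["age"] for r in selected if r["age"] is not None]
mean_age = mean(known_ages)
cleaned = [
    {**r, "age": r["age"] if r["age"] is not None else mean_age}
    for r in selected
]

# 3. Coding: map the categorical attribute to integer codes.
codes = {name: i for i, name in enumerate(sorted({r["segment"] for r in cleaned}))}
coded = [{**r, "segment": codes[r["segment"]]} for r in cleaned]

# 4. A very simple statistical technique: mean spend per segment.
spend_by_segment = {
    code: mean(r["spend"] for r in coded if r["segment"] == code)
    for code in codes.values()
}

# 5. Visualization: a text bar chart of the generated structure.
for name, code in codes.items():
    print(f"{name:<10} {'#' * round(spend_by_segment[code] / 20)}")
```

Each step produces a new list rather than modifying the previous one, so intermediate results can be inspected when a later step gives surprising output.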

Data Mining is an analytic process designed to explore data (usually large amounts of data, typically business or market related, also known as "big data") in search of consistent patterns and/or systematic relationships between variables, and then to validate the findings by applying the detected patterns to new subsets of data. The ultimate goal of data mining is prediction, and predictive data mining is the most common type of data mining and one that has the most direct business applications. [2]
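The idea of validating a detected pattern on new subsets of data can be sketched as a simple holdout check. In the fragment below, the records, the single-threshold rule and the two-thirds split are all hypothetical assumptions made for illustration; they are not a method prescribed by the course.

```python
# Hypothetical labelled records: (monthly_spend, responded_to_offer)
records = [
    (12, False), (18, False), (25, False), (31, False), (40, True),
    (44, True), (52, True), (58, True), (23, False), (47, True),
    (38, True), (50, True),
]

# Hold out the last third of the records as "new" data.
split = len(records) * 2 // 3
train, holdout = records[:split], records[split:]


def accuracy(threshold, data):
    """Share of records where the rule 'spend >= threshold' matches the label."""
    hits = sum((spend >= threshold) == label for spend, label in data)
    return hits / len(data)


# Pattern detection: pick the threshold that best separates the training data.
candidates = sorted(spend for spend, _ in train)
best_threshold = max(candidates, key=lambda t: accuracy(t, train))

# Validation: apply the detected pattern to the held-out subset.
print("threshold:", best_threshold)
print("training accuracy:", accuracy(best_threshold, train))
print("holdout accuracy:", accuracy(best_threshold, holdout))
```

A rule that fits the training records perfectly can still miss cases in the held-out records, which is why the findings are checked on data that was not used to detect the pattern.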

Data mining offers great promise in helping organizations uncover patterns hidden in their data
that can be used to predict the behavior of customers, products and processes. However, data
mining tools need to be guided by users who understand the business, the data, and the
general nature of the analytical methods involved. Realistic expectations can yield rewarding
results across a wide range of applications, from improving revenues to reducing costs. Building
models is only one step in knowledge discovery. It's vital to properly collect and prepare the
data, and to check the models against the real world. The best model is often found after
building models of several different types, or by trying different technologies or algorithms.

[1] http://www.cs.ccsu.edu/~markov/ccsu_courses/580Syllabus.html
[2] http://documents.software.dell.com/statistics/textbook/data-mining-techniques#eda

Choosing the right data mining products means finding a tool with good basic capabilities, an interface that matches the skill level of the people who'll be using it, and features relevant to your specific business problems. After you've narrowed down the list of potential solutions, get a hands-on trial of the likeliest ones. ...

... There are two keys to success in data mining. First is coming up with a precise formulation of the problem you are trying to solve. ... The second key is using the right data. After choosing from the data available to you, or perhaps buying external data, you may need to transform and combine it in significant ways. ...

The goal of data mining is to produce new knowledge that the user can act upon. It does this by building a model of the real world based on data collected from a variety of sources which may include corporate transactions, customer histories and demographic information, process control data, and relevant external databases such as credit bureau information or weather data. The result of the model building is a description of patterns and relationships in the data that can be confidently used for prediction. [3]

... analytic skills are in high demand in the nonprofit and governmental arenas. In fact,
analytics is even a mission-critical skill in military, intelligence and security operations. ... Here
are a few of the questions addressed by these methods:

[...] to respond to an offer?

[...] appeal to web radio listeners?

[...] the contribution of an individual player to a sports team? [4]

The rapid proliferation of the Internet and related technologies has created an unprecedented opportunity for enterprises to collect massive amounts of data regarding customers and all aspects of their business operations. Yet the reality is that most organizations today are (i) data rich but information and knowledge poor, and (ii) not harnessing the full potential of their data, which is perhaps the second most important asset after human capital. Internet-based applications such as social media, website usage tracking and online reviews, as well as more traditional technology applications like RFID, Supply Chain Management (SCM), Enterprise Resource Planning (ERP) and Customer Relationship Management (CRM), provide access to vast amounts of data regarding customers, suppliers and competitors, as well as a firm's own activities and business processes. Being able to unlock the insights and knowledge trapped in such raw data constitutes a key lever for competitive advantage in hypercompetitive business environments. This course is designed to showcase the virtually unlimited opportunities that exist today to leapfrog the competition by leveraging the data that organizations routinely collect every day, but which they hardly use strategically to make decisions at various points in the value chain. [5]

[3] http://www.webdelprofesor.ula.ve/economia/gcolmen/programa/economia/intro_to_data_mining_(3rd_edition).pdf
[4] advanced business analytics syllabus

[5] https://www.mccombs.utexas.edu/~/media/Files/MSB/Sharepoint/Syllabi/IROM%20Syllabi/2011-12/Spring2012/MIS%20382N.9%20Data%20Mining%20for%20Business%20Intelligence%20(Barua).pdf.pdf


Course Goals / Aims / Outcomes

To introduce students to the basic concepts and techniques of Data Mining:
Understanding the value of data mining in solving real-world problems
Understanding the foundational concepts underlying data mining
Understanding the algorithms commonly used in data mining tools
Ability to apply data mining tools to real-world problems
Ability to evaluate models/algorithms with respect to their accuracy
Ability to critique the results of a data mining exercise
To develop skills in using recent data mining software to solve practical problems
To gain experience in independent study and research


Grading Policies

Grading will be based on

FIVE assignments (A1, A2, A3, A4, A5) MARKS 0-10 FOR EACH

Final project

ONE quiz (final exam)

SIX evaluations

Assignments, project and quiz will be graded on a 0-10 point scale.

STEP 1

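$$N_1 = \frac{1}{2}\left(\frac{MA_1 + MA_2 + MA_3 + MA_4 + MA_5}{5} + MP\right)$$

where MA_i is the mark for assignment Ai and MP is the project mark.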

IF six evaluations = ok THEN
    MP = MP
ELSE
    MP = MP - 1
END IF

STEP 2

IF N1 < 5 OR NE < 5 THEN
    N = 4
ELSE
    N = MAX(NA, NE)
END IF
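The two steps can be sketched in Python as follows. This is an illustration under stated assumptions, not the official grading procedure: the Step 1 mark N1 is read here as the mean of the assignment average and the project mark MP, NE is taken to be the quiz (final exam) mark, and the symbol NA in Step 2 is assumed to denote the same mark as N1. The function and parameter names are hypothetical.

```python
def final_grade(assignment_marks, project_mark, exam_mark, evaluations_done):
    """Sketch of the two-step grading rule; all names are illustrative."""
    if len(assignment_marks) != 5:
        raise ValueError("expected five assignment marks (A1..A5)")

    # Step 1: a missing peer evaluation costs one project point (MP = MP - 1).
    mp = project_mark if evaluations_done else project_mark - 1

    # ASSUMPTION: N1 is the mean of the assignment average and MP.
    n1 = (sum(assignment_marks) / 5 + mp) / 2

    # Step 2: either component below 5 fixes the final grade at 4.
    if n1 < 5 or exam_mark < 5:
        return 4

    # ASSUMPTION: "NA" in the rule denotes the same mark as N1.
    return max(n1, exam_mark)


print(final_grade([8, 9, 7, 10, 6], project_mark=9, exam_mark=7, evaluations_done=True))
print(final_grade([8, 9, 7, 10, 6], project_mark=9, exam_mark=4, evaluations_done=True))
```

Under these assumptions the first call combines an assignment average of 8 with a project mark of 9, while the second call shows that a quiz mark below 5 overrides everything else.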

It is expected that all students will conduct themselves in an honest manner and NEVER claim work which is not their own.
Violating this policy will result in a substantial grade penalty or a final grade of 4.

Weekly assignments (Romanian and/or English)

Each weekly assignment Ai (i=1,..,5) consists of three parts:

Part 1

Write a quiz with at least 9 questions and appropriate answers on the week's topic (and on earlier topics related to it). Questions will be organized into 3 groups: elementary, moderate difficulty, high difficulty.

Part 2

Identify in the literature at least three bibliographical sources (articles, PhD dissertations, research reports, etc.) that use the week's topic to solve a real-world problem. Write a synthesis / overview of these bibliographical sources (maximum 1 page for 3 bibliographical sources).

Part 3

Apply the week's topic (the technique described in the lecture) to the data set chosen for study (in the first week of this course). Submit the results electronically.

Each weekly assignment Ai (i=1,..,5) has the following deadlines:

Assignment   Topic               Week                       Deadline (send to programarelogica@yahoo.ro)
Dataset      -                   27.02.2017 - 5.03.2017     5.03.2017
A1           Statistics          6.03.2017 - 12.03.2017     12.03.2017, 24:00
A2           Regression          12.03.2017 - 19.03.2017    19.03.2017, 24:00
A3           Decision tree       20.03.2017 - 26.03.2017    26.03.2017, 24:00
A4           Association rules   27.03.2017 - 2.04.2017     2.04.2017, 24:00
A5           Clustering          3.04.2017 - 9.04.2017      9.04.2017, 24:00

Each week assignment mark MAi (i=1,..,5) will be computed according to the formula:

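$$MA_i = \frac{P_1 + P_2 + P_3}{3}$$

where P_1, P_2 and P_3 denote the marks for Part 1, Part 2 and Part 3 of the assignment.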

Project (Romanian and/or English)

Students will complete a project on a topic of their choosing.

At a minimum, a project will involve:
1. Identifying questions to answer
2. Locating appropriate data and processing data from a real domain (mining real data)
3. Applying various data mining techniques to create a comprehensive and accurate model of the data in order to address each question
4. Presenting results (article format)

Extraordinary contributions to the intellectual process, or the use of methods not presented in the course, will also be recognized in the final grade of the project.

Projects evaluation

For evaluating projects you will receive a survey in electronic format. Each student must evaluate at least 6 projects, two from every category, and must rank the projects as follows (example):

Categories: low level projects | projects that need more work | good projects
Scale:      1  2  3  4  5  6  7  8  9  10
Minimum 1 project in each of six columns

If a student does not evaluate 6 projects as required, one point will be deducted from the final grade of the project (MP = MP - 1).


Grading Policies

"During Research Training 1, students study the basic principles of scientific research and
statistics. They learn how to formulate a research question, create a research design, collect and
process data, make posters, and present their research to their fellow students.

In the second year, Research Training 2 students fan out across VUmc, and groups of them
participate in the research programmes of various departments. That may involve laboratory
research, research with imaging instruments, or qualitative research by means of interviews. The
student teams write reports, using a standard format. Two weeks later, three selected teams of
students present their research at a symposium. It is a short and intense period of study, in which
they need to generate results quickly. The third year is devoted to scientifically focussed
coursework, and in-depth studies of methodology and statistics. The students write a paper and
prepare for their research internship. As soon as students carry out their own research, they
develop an increased appreciation for methodology and statistics." [6]

VU University Medical Center has been carrying out extramural epidemiological research into
diabetes since 1989. In that year, we launched a cohort study at Hoorn, into the prevalence, causes,
and impact of the development of type 2 diabetes. The study, which was carried out non-selectively
and by means of random sampling, involved nearly 2,500 individuals aged from 50 to 75. The
study is repeated every five years. Over time we have collected information on lipid metabolism,
coagulation, and cardiovascular function, as well as on lifestyle, diet, family history, and cognitive
function. This data is particularly valuable for our research into the interrelationships
between the various aspects involved. ...

In collaboration with international partners, we are analysing well over one hundred
international cohort studies, in which the data of millions of people is combined. This allows us
to obtain a very accurate picture of the true relationship between specific factors involved in
diabetes. That is a very powerful tool! [7]

[6] https://www.vumc.com/branch/epidemiology-biostatistics/education/5346028/
[7] https://www.vumc.com/branch/epidemiology-biostatistics/research/5346010/


Course Content

1. Course Information
o Course Description
o Course Goals
o Grading Policies
o Course Content
o Textbook / Materials
o Datasets sources
2. Introduction to Data Mining
o What is data mining?
o Related technologies: Statistics, Machine Learning, Artificial Intelligence
o Data Mining Goals
o Data Mining Techniques
3. Stages of the Data Mining Process
o Data preprocessing
A. Data cleaning
B. Data transformation
C. Data reduction
D. Basic data transformation
E. Handling missing values
o Model building and validation
o Deployment
4. Statistical measures
o Task relevant data / data exploration
o Visualization techniques
A. Univariate data visualization
B. Bivariate data visualization
C. Multivariate data visualization
D. Visualizing groups
5. Linear regression
o The prediction task
o Algorithms for Subset Selection in Linear Regression
o Dropping irrelevant variables
o Instance-based methods (nearest neighbor)
6. Logistic regression
o Computation of estimates
o Newton-Raphson algorithm
7. Classification and regression trees
o Basic learning/mining tasks
o Recursive partitioning
o Classification rules from trees

8. Classification and regression trees
o ID3, C4.5, C5 algorithm
o CHAID algorithm
o CART algorithm
9. Association rules
o Motivation and terminology
o Basic idea: item sets
o Generating item sets and rules efficiently
o Correlation analysis
10. Performance evaluation
o Basic issues
o Training and testing
o Estimating classifier accuracy (holdout, cross-validation, leave-one-out)
o Combining multiple models (bagging, boosting, stacking)
11. Clustering
o Basic issues in clustering
o First conceptual clustering system: Cluster/2
12. Clustering
o Partitioning methods: k-means, expectation maximization (EM)
o Hierarchical methods: distance-based agglomerative and divisive clustering
13. Project presentations
o Projects presentations
o Projects evaluation
14. Project presentations
o Projects presentations
o Projects evaluation


Textbook / Materials

The following texts may be used for reference:

1. Ian H. Witten and Eibe Frank, Data Mining: Practical Machine Learning Tools and
Techniques (Second Edition), Morgan Kaufmann, 2005, ISBN: 0-12-088407-0.

ftp://ftp.ingv.it/pub/manuela.sbarra/Data%20Mining%20Practical%20Machine%20Learning%20Tools%20and%20Techniques%20-%20WEKA.pdf

2. Michael J.A. Berry, Gordon Linoff, Data Mining Techniques: For Marketing, Sales, and
Customer Relationship Management (2nd ed.), New York: Wiley, 2004, ISBN 0-471-47064-3.

http://197.14.51.10:81/pmb/GESTION2/MARKETING/Data%20Mining%20Techniques%20For%20Marketing%20Sales%20And%20Customer%20Relationship%20Management%202Ed.pdf

3. Edelstein, H. A. (1999). Introduction to Data Mining and Knowledge Discovery (3rd ed.).
Potomac, MD: Two Crows Corp.

http://www.webdelprofesor.ula.ve/economia/gcolmen/programa/economia/intro_to_data_mining_(3rd_edition).pdf

4. StatSoft, Electronic Statistics Textbook,

http://www.statsoft.com/Textbook

5. Making Sense of Data II by Glenn Myatt & Wayne Johnson, John Wiley & Sons, 2009.

https://books.google.ro/books?id=lFBqIpM-vuQC&pg=PA273&lpg=PA273&dq=Making+Sense+of+Data+II:+A+Practical+Guide+to+Data+Visualization,+Advanced+Data+Mining+Methods,+and+Applications+pdf&source=bl&ots=VNBWybValK&sig=fjzOW1fwzUl1J6rVR_Dpr6Sqw6c&hl=ro&sa=X&ved=0ahUKEwil-4z2pLDSAhVMVhQKHV5QDUsQ6AEIYTAJ#v=onepage&q=Making%20Sense%20of%20Data%20II%3A%20A%20Practical%20Guide%20to%20Data%20Visualization%2C%20Advanced%20Data%20Mining%20Methods%2C%20and%20Applications%20pdf&f=false

6. Data Mining for Business Intelligence: Concepts, Techniques, and Applications in
Microsoft Office Excel with XLMiner, by Galit Shmueli, Nitin R. Patel, Peter C. Bruce.
Publisher: Wiley; 2 edition (October 26, 2010). ISBN-10: 0470526823, ISBN-13: 978-0470526828.

7. Analyzing Social Media Networks with NodeXL: Insights from a Connected World, by
Derek Hansen, Ben Shneiderman and Marc A. Smith Publisher: Morgan Kaufmann; 1
edition (September 10, 2010) ISBN-10: 0123822297 ISBN-13: 978-0123822291


Potential Datasets Sources

1. University of California, Irvine, Repository of Machine Learning Databases (UCI)
https://archive.ics.uci.edu/ml/datasets.html
2. http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data
3. http://mldata.org/repository/data/ - a repository for your machine learning data
4. http://dnr.maryland.gov/streams/Pages/mbss.aspx
5. http://apps.who.int/bmi/index.jsp - WHO (World Health Organization)
6. http://gco.iarc.fr/databases.php - 4 cancer databases
7. http://eco.iarc.fr/ - European Cancer Observatory (ECO)
8. http://www-dep.iarc.fr/nordcan.htm - NORDCAN

EUROSTAT

9. http://ec.europa.eu/eurostat/web/health/health-status-determinants/data/database

ECHI (European Core Health Indicators) data tool

10. http://ec.europa.eu/health/major_chronic_diseases/indicators_en

INSSE

11. http://www.insse.ro/cms/files/eurostat/adse/microdata.htm
12. http://www.insse.ro/cms/ro/content/statistica-oficiala-din-romania

DATAGOV ???


Software

RapidMiner

Weka

Python

SPSS

SPSS Modeler

XLMiner (an Excel add-in)

Statistica

SAS Enterprise Miner

Etc.
