St. Francis Institute of Management and Research
Mount Poinsur, S.V.P. Road, Borivali (West), Mumbai - 400103.
Title:
SUBMITTED BY
St. Francis Institute of
Management and Research
Certificate Of Merit
Mr. Patel Pramod Rameshchandra
Roll No: 38, MMS-II
[Internal In-charge]
[External In-charge]
[College Stamp]
[Director]
Acknowledgment
I would like to express my sincere gratitude towards the MBA department of St. Francis Institute of Management and Research for encouraging me in the development of this project. I would like to thank our Director, Dr. Thomas Mathew, my internal project guide, Prof. Manoj Mathew, and our faculty coordinator, Prof. Vaishali Kulkarni, for all their help and co-operation.
I would also like to take this precious opportunity to thank Prof. Thomas Mathew, Prof. Sinimole, M. F. Kumbar, Sherli Biju, Mohini Ozarkar and Steve Halge, our librarians; my friends Mr. Subandu K. Maity, Mr. Durgesh Tanna, Miss Hiral Shah, Mr. Narinder Singh Kabo, Miss Radhika S. Appaswamy, Miss Payal P. Patel, Miss Bhagyalaxmi Subramaniam and Mrs. Soma L. Joshua; and my parents, for helping, guiding and supporting me through all problems.
Executive Summary
Purpose of the study
This work provides additional information for the database administrator who is engaged in improving the way data is mined from a data warehouse and in handling data mining effectively. This research is not intended to replace or duplicate existing work; rather, its outcome can help complement the work of the business analyst.
Objective of the project
Limitation of the project
This project does not focus on the whole database design; it focuses only on three tables, namely the Customers, Suppliers and Employees tables. In a real scenario, however, a database does not contain only three tables but many more.
Need for study
Data mining roots are traced back along three family lines. The
longest of these three lines is classical statistics. Without statistics,
there would be no data mining, as statistics are the foundation of
most technologies on which data mining is built. Classical statistics
embrace concepts such as regression analysis, standard distribution, standard deviation, variance, discriminant analysis, cluster analysis, and confidence intervals, all of which are used to
study data and data relationships. These are the very building
blocks with which more advanced statistical analyses are
underpinned. Certainly, within the heart of today's data mining
tools and techniques, classical statistical analysis plays a significant
role.
The third family line of data mining is machine learning, which is
more accurately described as the union of statistics and AI. While
AI was not a commercial success, its techniques were largely co-
opted by machine learning. Machine learning, able to take
advantage of the ever-improving price/performance ratios offered
by computers of the 80s and 90s, found more applications because
the entry price was lower than AI. Machine learning could be
considered an evolution of AI, because it blends AI heuristics with
advanced statistical analysis. Machine learning attempts to let
computer programs learn about the data they study,
such that programs make different decisions based on the qualities
of the studied data, using statistics for fundamental concepts, and
adding more advanced AI heuristics and algorithms to achieve its
goals.
Methodology
Analysis
This is the first and foundational stage of the project. At this stage, requirements elicitation is conducted, and potential problem areas in designing the database are identified. Technological, social, and educational elements are identified and examined, and alternatives are explored.
Introduction to Database
The Database
1. Database
2. File
3. Record
4. Data element
5. Character (byte)
6. Bit
The Data Element
Files
“A file is a set of records where the records have the same data
elements in the same format.”
Database Schemas
“The primary key data element in a file is the data element used to
uniquely describe and locate a desired record. The key can be a
combination of more than one data element.”
An Interfile Relationship
• One to one
• Many to one
• One to many
• Many to many
The Data Models
The database and its tables must be specified in a form that the computer can translate into the actual physical storage characteristics of the data. The Data Definition Language (DDL) is used for such a specification.
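As a minimal sketch of DDL in use (a hypothetical table, not the project's actual schema), the snippet below issues a CREATE TABLE statement through Python's built-in sqlite3 module; the CREATE TABLE text is the DDL specification that the engine translates into physical storage.

import sqlite3

# Hypothetical illustration: the CREATE TABLE statement is the DDL.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Customers (
        CustomerID INTEGER PRIMARY KEY,  -- primary key data element
        Name       TEXT NOT NULL,
        Age        INTEGER,
        City       TEXT
    )
""")
conn.commit()
# Confirm the table now exists in the catalog.
print(conn.execute("SELECT name FROM sqlite_master WHERE type='table'").fetchall())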
The Query Language
Introduction to Data Warehouse and Data
Mining
The Data Mining
“Looking for hidden patterns and trends in data that are not immediately apparent from summarizing the data.”
We are in an age often referred to as the “information age”. In this information age, because we believe that information leads to “power and success”, and thanks to sophisticated technologies such as computers, satellites, etc., organizations have been collecting tremendous amounts of information. Initially, with the advent of computers and means for mass digital storage, organizations started collecting and storing all sorts of data, counting on the power of computers to help sort through this amalgam of information. Unfortunately, these massive collections of data stored on disparate structures very rapidly became overwhelming. This initial chaos led to the creation of structured databases and Database Management Systems (DBMS).
What kind of information is data mining collecting?
• Medical and Personal Data: From government census to
personnel and customer files, very large collections of
information are continuously gathered about individuals and
groups. Governments, companies and organizations such as
hospitals, are stockpiling significant quantities of
personal data to help them manage human resources, better
understand a market, or simply assist clientele. Regardless of
the privacy issues this type of data often reveals, this
information is collected, used and even shared. When
correlated with other data this information can shed light on
customer behaviors and the like.
• Digital Media: The proliferation of cheap scanners, desktop
video cameras and digital cameras is one of the causes of the
explosion in digital media repositories. In addition, many
radio stations, television channels and film studios are
digitizing their audio and video collections to improve the
management of their multimedia assets.
• The World Wide Web Repositories: Since the inception of
the World Wide Web in 1993, documents of all sorts of
formats, content and description have been collected and
inter-connected with hyperlinks making it the largest
repository of data ever built. Despite its dynamic and unstructured nature, its heterogeneous characteristics, and its frequent redundancy and inconsistency, the World Wide
Web is the most important data collection regularly used for
reference because of the broad variety of topics covered and
the infinite contributions of resources and publishers. Many
believe that the World Wide Web will become the
compilation of human knowledge.
What are Data Mining and Knowledge Discovery?
The following figure 2 shows data mining as a step in an iterative
knowledge discovery process.
The iterative process consists of the following steps:
KDD is an iterative process. Once the discovered knowledge is
presented to the user, the evaluation measures can be enhanced, the
mining can be further refined, new data can be selected or further
transformed, or new data sources can be integrated, in order to get
different, more appropriate results.
What kind of Data can be mined?
Data mining is being put into use and studied for databases, including relational databases, object-relational databases and object-oriented databases; data warehouses; transactional databases; unstructured and semi-structured repositories such as the World Wide Web; advanced databases such as spatial databases, multimedia databases, time-series databases and textual databases; and even flat files. Here are some examples in more detail:
• Flat files: Flat files are actually the most common data
source for data mining algorithms, especially at the research
level. Flat files are simple data files in text or binary format
with a structure known by the data mining algorithm to be
applied. The data in these files can be transactions, time-
series data, scientific measurements, etc.
• Relational Databases: Briefly, a relational database
consists of a set of tables containing either values of entity
attributes, or values of attributes from entity relationships.
Tables have columns and rows, where columns represent
“Attributes” and rows represent “tuples”. A tuple in a
relational table corresponds to either an object or a
relationship between objects and is identified by a set of
attribute values representing a “unique key”.
The most commonly used query language for relational databases is SQL, which allows retrieval and manipulation of the data stored in the tables, as well as the calculation of aggregate functions such as average, sum, min, max and count. For instance, an SQL query to select the videos grouped by category (assuming a videos table with a category column) might read:
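-- assuming a videos table with a category column
SELECT category, COUNT(*) AS num_videos
FROM videos
GROUP BY category;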
• Data Warehouses: A data warehouse is a repository of data collected from multiple data sources (often heterogeneous) that is intended to be used as a whole under the same unified schema. A data warehouse gives the option to analyze data from different sources under the same roof.
Figure 4: A multi-dimensional data cube structure commonly used for data warehousing
Because of their structure, the pre-computed summarized data they contain, and the hierarchical attribute values of their dimensions, data cubes are well suited for fast interactive querying and analysis of data at different conceptual levels, known as On-Line Analytical Processing (OLAP). OLAP operations allow the navigation of data at different levels of abstraction, such as drill-down, roll-up, slice, dice, etc. Figure 5 illustrates the drill-down (on the time dimension) and roll-up (on the location dimension) operations.
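The roll-up operation can be sketched in a few lines. This is a minimal illustration with invented sales figures, not data from this project: cells of a (city, month) cube are aggregated up the location hierarchy to the country level.

# Minimal OLAP-style roll-up sketch on a tiny (city, month) -> sales cube.
cube = {
    ("Vancouver", "Jan"): 120, ("Vancouver", "Feb"): 95,
    ("Toronto",   "Jan"): 210, ("Toronto",   "Feb"): 180,
    ("Mumbai",    "Jan"): 300, ("Mumbai",    "Feb"): 310,
}
city_to_country = {"Vancouver": "Canada", "Toronto": "Canada", "Mumbai": "India"}

rolled_up = {}
for (city, month), sales in cube.items():
    key = (city_to_country[city], month)   # climb the location hierarchy
    rolled_up[key] = rolled_up.get(key, 0) + sales

print(rolled_up)   # sales now summarized at the country level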
• Transaction Databases: A transaction database is a set of
records representing transactions, each with a time stamp, an
identifier and a set of items. Associated with the transaction
files could also be descriptive data for the items.
For example, in the case of the video store, the rentals table shown in Figure 6 represents the transaction database. Each record
is a rental contract with a customer identifier, a date, and the list of
items rented (i.e. video tapes, games, VCR, etc.).
• Multimedia Databases: Multimedia databases include video, image, audio and text media. Multimedia
Databases can be stored on extended object-relational or
object-oriented databases, or simply on a file system.
Multimedia is characterized by its high dimensionality,
which makes data mining even more challenging. Data
mining from multimedia repositories may require computer
vision, computer graphics, image interpretation, and natural
language processing methodologies.
• Time-Series Databases: Time-series databases contain time-related data such as stock market data or logged activities.
These databases usually have a continuous flow of new data
coming in, which sometimes causes the need for a
challenging real time analysis. Data mining in such databases
commonly includes the study of trends and correlations
between evolutions of different variables, as well as the
prediction of trends and movements of the variables in time.
Figure 8 shows some examples of time-series data.
• World Wide Web: The World Wide Web is the most
heterogeneous and dynamic repository available. A very
large number of authors and publishers are continuously
contributing to its growth and metamorphosis, and a massive
number of users are accessing its resources daily. Data in the
World Wide Web is organized in inter-connected documents.
These documents can be text, audio, video, raw data, and
even applications. Conceptually, the World Wide Web is
comprised of three major components: The “Content of the
Web”, which encompasses documents available; The
“Structure of the Web”, which covers the hyperlinks and the
relationships between documents; and The “Usage of the
web”, describing how and when the resources are accessed.
A fourth dimension can be added relating to the dynamic nature or evolution of the documents. Data mining in the World
Wide Web, or web mining, tries to address all these issues
and is often divided into web content mining, web structure
mining and web usage mining.
What can be discovered?
The kinds of patterns that can be discovered depend upon the data
mining tasks employed. By and large, there are two types of data
mining tasks: descriptive data mining tasks that describe the
general properties of the existing data, and predictive data mining
tasks that attempt to do predictions based on inference on available
data.
• Discrimination: Data discrimination produces what are called discriminant rules and is basically the comparison of the general features of objects between two classes referred to as the target class and the contrasting class. For example, one may want to compare the general characteristics of the customers who rented more than 30 movies in the last year with those who rented fewer than 5. The
techniques used for data discrimination are very similar to
the techniques used for data characterization with the
exception that data discrimination results include
comparative measures.
• Association analysis: Association analysis is the discovery
of what are commonly called association rules. It studies the
frequency of items occurring together in transactional
databases, and based on a threshold called support, identifies
the frequent item sets. Another threshold, confidence, which
is the conditional probability that an item appears in a
transaction when another item appears, is used to pinpoint
association rules. Association analysis is commonly used for
market basket analysis.
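A minimal sketch of those two thresholds, using made-up transactions: support is the fraction of transactions containing an itemset, and confidence is the conditional probability that one itemset appears given another.

# Toy support/confidence computation over hypothetical market baskets.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset):
    # fraction of transactions that contain every item in the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    # P(consequent appears | antecedent appears)
    return support(antecedent | consequent) / support(antecedent)

print(support({"bread", "milk"}))        # 0.5
print(confidence({"bread"}, {"milk"}))   # 2/3: P(milk | bread)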
• Classification: Classification analysis is the organization of
data in given classes. Also known as supervised
classification, the classification uses given class labels to
order the objects in the data collection. Classification
approaches normally use a training set where all objects are
already associated with known class labels. The
classification algorithm learns from the training set and
builds a model. The model is used to classify new objects.
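A minimal sketch of this train-then-classify flow, assuming invented customer records described by (age, rentals last year) and labeled “frequent” or “occasional”, using the scikit-learn library:

# Supervised classification sketch: learn from a labeled training set,
# then classify a new, unseen object. Data and labels are invented.
from sklearn.tree import DecisionTreeClassifier

X_train = [[25, 40], [31, 35], [45, 4], [52, 2], [23, 28], [60, 5]]
y_train = ["frequent", "frequent", "occasional", "occasional",
           "frequent", "occasional"]

model = DecisionTreeClassifier().fit(X_train, y_train)   # build the model
print(model.predict([[36, 30]]))                         # classify a new object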
• Prediction: Prediction has attracted considerable attention
given the potential implications of successful forecasting in a
business context. There are two major types of predictions:
one can either try to predict some unavailable data values or
pending trends, or predict a class label for some data. The
latter is tied to classification. Once a classification model is
built based on a training set, the class label of an object can
be foreseen based on the attribute values of the object and
the attribute values of the classes. Prediction is, however, more often used to refer to the forecasting of missing numerical values, or of increase/decrease trends in time-related data. The
major idea is to use a large number of past values to consider
probable future values.
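For the numeric-forecasting sense of prediction, a minimal sketch (with invented monthly rental counts) that fits a trend on past values and extrapolates the next one:

# Predict a future numeric value from past values via a simple linear trend.
from sklearn.linear_model import LinearRegression

months = [[1], [2], [3], [4], [5], [6]]    # past time points
rentals = [110, 118, 125, 131, 140, 146]   # invented past values

trend = LinearRegression().fit(months, rentals)
print(trend.predict([[7]]))                # probable value for month 7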
• Evolution and deviation analysis: Evolution and deviation
analysis pertain to the study of time-related data that change over time. Evolution analysis models evolutionary trends in data, which allow for characterizing, comparing, classifying or clustering time-related data. Deviation analysis, on the
other hand, considers differences between measured values
and expected values, and attempts to find the cause of the
deviations from the anticipated values.
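A minimal sketch of the deviation idea, with an invented series: compare each measured value against an expected value (here a trailing mean) and report the points that stray beyond a threshold.

# Flag measurements that deviate strongly from the trailing average.
from statistics import mean

series = [100, 102, 99, 101, 100, 137, 103]   # invented time series
window = 4

for i in range(window, len(series)):
    expected = mean(series[i - window:i])
    deviation = series[i] - expected
    if abs(deviation) > 0.2 * expected:        # arbitrary 20% threshold
        print(f"t={i}: measured {series[i]}, expected ~{expected:.1f}")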
Is all that is Discovered Interesting and Useful?
In some cases the number of rules can reach the millions. One can
even think of a meta-mining phase to mine the oversized data
mining results. To reduce the number of patterns or rules
discovered that have a high probability of being uninteresting, one
has to put a measurement on the patterns. However, this raises the
problem of completeness. The user would want to discover all rules
or patterns, but only those that are interesting. The measurement of
how interesting a discovery is, often called interestingness, can be
based on quantifiable objective elements such as validity of the
patterns when tested on new data with some degree of certainty, or
on some subjective depictions such as understandability of the
patterns, novelty of the patterns, or usefulness.
How do we Categorize Data Mining Systems?
• Classification according to mining techniques used:
What are the Issues in Data Mining?
• User Interface Issues:
• Mining Methodology Issues:
More than the size of the data, the size of the search space is decisive for data mining techniques. The size of the search space often depends on the number of dimensions in the domain space, and it usually grows exponentially as the number of dimensions increases. This is known as the curse of dimensionality. This “curse” affects the performance of some data mining approaches so badly that it has become one of the most urgent issues to solve.
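The exponential growth can be seen with a line of arithmetic: if each dimension is discretized into k = 10 values, a d-dimensional domain space has 10^d cells.

# Search-space size for k values per dimension across d dimensions: k ** d.
k = 10
for d in (1, 2, 5, 10, 20):
    print(f"{d:2d} dimensions -> {k ** d:,} cells")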
• Performance Issues:
• Data Source Issues:
There are many issues related to the data sources; some are practical, such as the diversity of data types, while others are philosophical, like the data glut problem. We certainly have an excess of data, since organizations already have more data than they can handle and are still collecting data at an even higher rate. If the spread of Database Management Systems has helped increase the gathering of information, the advent of data mining is certainly encouraging more data harvesting. The current practice is to collect as much data as possible now and process it, or try to process it, later. The concern is whether organizations are collecting the right data in the appropriate amount, whether they know what they want to do with it, and whether they distinguish between what data is important and what data is insignificant. Regarding the practical issues related to data sources, there is the subject of heterogeneous databases and the focus on diverse complex data types.
Organizations are storing different types of data in a variety of
repositories. It is difficult to expect a data mining system to
effectively and efficiently achieve good mining results on all kinds
of data and sources. Different kinds of data and sources may
require distinct algorithms and methodologies. Currently, there is a
focus on relational databases and data warehouses, but other
approaches need to be pioneered for other specific complex data
types. A versatile data mining tool, for all sorts of data, may not be
realistic. Moreover, the proliferation of heterogeneous data sources,
at structural and semantic levels, poses important challenges not
only to the database community but also to the data mining
community.
Better Data Mining Models
Observations
Push the data mining tool to its maximum limits. The more data in use, the better the model created.
Variables
It is tempting to state that for each problem there is probably one best algorithm, so that all a data miner has to do is try a handful of really different algorithms to find out which one is best for the problem. But different data miners will use the same algorithm differently, according to their taste, experience, mood and preference. So find out which algorithm works best for the data miner and the business problem at hand.
• Zoom in on the business targets
When data miners want to use a data mining model to select the customers who are most likely to buy the business's outstanding product XYZ, it is reasonable to use past buyers of XYZ as the positive targets in the model. The data miner gets a model with an excellent lift and uses it for a mailing.
When the mailing campaign is over, the data miner has all the data needed to create a new, better model for product XYZ: this time the targets are the past buyers of XYZ who responded to the mailing. With this new model, the data miner will take into account not only customers' “natural” propensity to buy, but also their willingness to respond to a mailing.
If the databases contain far more observations than the data mining tool can handle, the only thing the data miner can do is use samples. Calculate the model, and use it. But the data miner can push it a bit further: use the model to score the entire customer base, and then zoom in on the customers with the best scores, say the top 10%. Use them to calculate a new, second model, which will use the far subtler differences in customer information to find the really promising ones.
• Make it simple
• Automate as much as possible
The data miner should not try out every possible algorithm in each data mining project. If problem A was best solved with algorithm X, then problem B, which is very similar to A, should probably be tackled with algorithm X as well. There is no need to waste time checking out other algorithms.
Introduction to Object-Oriented Database
For example, in the transportation network, all highways with the
same structural and behavioral properties can be classified as a
class highway. From the application point of view, classification helps in credit approval, product marketing, and medical diagnosis. Many techniques, such as decision trees, neural networks, nearest-neighbor methods and rough-set-based methods, enable the creation of classification models. Regardless of the potential of data mining to appreciably enhance data analysis, this technology will remain a niche technology unless an effort is made to integrate it with traditional database systems. Database systems offer a uniform framework for data mining by proficiently administering large datasets, integrating different data types and storing the discovered knowledge.
An Object-Oriented Database (OODB) is a database in which the
concepts of object-oriented languages are utilized. The principal
strength of Object-Oriented database (OODB) is its ability to
handle applications involving complex and interrelated
information. But in the current scenario, the existing Object-
Oriented Database Management System (OODBMS) technologies
are not efficient enough to compete in the market with their
relational counterparts. Apart from that, there are numerous
applications built on existing Relational Database Management
Systems (RDBMS). It is difficult, if not impossible, to move off those relational databases. Hence, database administrators intend to incorporate object-oriented concepts into the existing Relational Database Management Systems (RDBMSs), thereby exploiting the features of RDBMSs and Object-Oriented (OO) concepts. Undoubtedly, one of the significant characteristics of object-oriented programming is inheritance.
For instance, a class named Car exhibits an "is a" relationship with
a base class named Vehicle, since a car is a vehicle.
For example, a bicycle has a steering wheel and, in the same way, a wheel has spokes; these are “has-a” (composition) relationships. Inheritance, in contrast, can also be stated as generalization, because the “is-a” relationship represents a hierarchy between the classes of objects.
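A minimal sketch of the two relationships just described, with hypothetical classes: Car inherits from Vehicle (“is-a”), while Bicycle holds Wheel objects (“has-a”, i.e. composition).

# "is-a" (inheritance) versus "has-a" (composition), as described above.
class Vehicle:
    def describe(self):
        return "some vehicle"

class Car(Vehicle):            # a Car *is a* Vehicle
    def describe(self):
        return "a car, which is a vehicle"

class Wheel:
    def __init__(self, spokes):
        self.spokes = spokes   # a Wheel *has* spokes

class Bicycle:
    def __init__(self):
        self.wheels = [Wheel(32), Wheel(32)]   # a Bicycle *has* Wheels

print(Car().describe())
print(len(Bicycle().wheels), "wheels")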
“Polymorphism” is another important object-oriented programming concept. It is a general term meaning “many forms”. Polymorphism, in brief, can be defined as “one interface, many implementations”. It is the property of being able to assign a different meaning or usage to something in different contexts; in particular, it allows an entity such as a variable, a function, or an object to take more than one form. Polymorphism is different from method overloading or method overriding. In the literature, polymorphism is classified into three kinds: pure, static, and dynamic.
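A minimal sketch of “one interface, many implementations”: the same area() call behaves differently depending on the calling object (dynamic polymorphism). The class names are hypothetical.

# One interface (area), many implementations: behavior depends on the object.
class Shape:
    def area(self):
        raise NotImplementedError

class Circle(Shape):
    def __init__(self, r):
        self.r = r
    def area(self):
        return 3.14159 * self.r ** 2

class Square(Shape):
    def __init__(self, side):
        self.side = side
    def area(self):
        return self.side ** 2

for shape in (Circle(2), Square(3)):   # same call, different behavior
    print(type(shape).__name__, shape.area())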
It is becoming increasingly important to extend the domain of
study from relational database systems to object-oriented database
systems and probe the knowledge discovery mechanisms in object-
oriented databases, because object-oriented database systems have
emerged as a popular and influential setting in advanced database
applications. Because standards are still not defined for Object-Oriented Database Management Systems (OODBMSs) as they are for relational Database Management Systems (DBMSs), and because most organizations have their information systems based on relational DBMS technology, incorporating object-oriented programming concepts into existing Relational Database Management Systems (RDBMSs) is an ideal choice for designing a database that best suits advanced database applications.
Object-Oriented Database (OODB)
The most important Object-Oriented Concept employed in an
Object-Oriented Database (OODB) model includes the inheritance
mechanism and composite object modeling. In order to cope with
the increased complexity of the object-oriented model, one can divide class features as follows: simple attributes (attributes with scalar types); complex attributes (attributes with complex types); simple methods (methods accessing only local, simple class attributes); and complex methods (methods that return or refer to instances of other classes). The object-oriented approach uses two important abstraction principles for structuring designs: classification and generalization.
New Approach to the Design of an Object-Oriented Database
The proposed approach makes use of the Object-Oriented Programming (OOP) concepts of inheritance and polymorphism to design an Object-Oriented Database (OODB) and to perform classification in it, respectively. Normally, a database is a collection of tables, and it is bound to contain a number of tables with common fields. In my approach, I have grouped such common sets of fields together to form a single generalized table. The newly created table resembles the base class in an inheritance hierarchy; this ability to represent classes in a hierarchy is one of the eminent Object-Oriented Programming (OOP) concepts. Next, I have employed another important object-oriented characteristic, dynamic polymorphism, where different classes have methods of the same name and structure that perform different operations based on the calling object. Polymorphism is specifically employed to achieve classification in a simple and effective manner. The use of these object-oriented concepts for the design of the Object-Oriented Database (OODB) ensures that even complex queries can be answered more efficiently; in particular, the data mining task of classification can be achieved in an effective manner.
Let T denote the set of all tables in a database D, and let t ⊆ T, where t represents the set of tables that have some fields in common. I create a generalized table composed of all those common fields from the table set t. To portray the efficiency of the proposed approach, I consider a traditional database. A database for a large business organization will have many tables, but to best illustrate the Object-Oriented Programming (OOP) concepts employed in my approach, I have concentrated on three tables, namely Employees, Suppliers and Customers; they are shown in Tables 1 to 3.
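The grouping step can be sketched as a simple set intersection over the three table schemas. The field lists below are illustrative stand-ins for Tables 1 to 3 (only Name, Age, Gender, Title and HireDate are named in the text; the rest are hypothetical).

# Find the common fields of the three tables; these become the
# generalized (base) table in the proposed design.
employees = {"Name", "Age", "Gender", "Address", "City", "Phone",
             "Title", "HireDate"}
suppliers = {"Name", "Age", "Gender", "Address", "City", "Phone",
             "CompanyName", "ProductSupplied"}
customers = {"Name", "Age", "Gender", "Address", "City", "Phone",
             "AccountNumber"}

common = employees & suppliers & customers       # -> generalized table
print("Generalized table fields:", sorted(common))
print("Employee-specific fields:", sorted(employees - common))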
Table 1: Example of Employees Table
Table 2: Example of Customers Table
Table 3: Example of Suppliers Table
The above set of tables, namely Employees, Suppliers and Customers, can be represented equivalently as classes. The class structure may look as in Figure 9.
From the above class structure, it is understood that every table has a set of general or common fields (the highlighted ones) and table-specific fields. Considering the Employees table, it has general fields like Name, Age and Gender, and table-specific fields like Title, HireDate, etc. These general fields occur repeatedly in most tables. This causes redundancy and thereby increases space complexity. Moreover, if a query is given to retrieve a set of records for the whole organization satisfying a particular rule, there may be a need to search all the tables separately. This replication of general fields across the tables leads to a poor design, which hinders effective data classification. To perform better classification, I have designed an Object-Oriented Database (OODB) by incorporating the inheritance concept of Object-Oriented Programming (OOP).
Design of the Object-Oriented Database
Figure 10: Inheritance Hierarchy of Classes in the Proposed OODB Design
The generalized table ‘Person’ is considered the base class ‘Person’, and its fields are considered the attributes of that base class. The base class ‘Person’, which contains all the common attributes, is inherited by the other classes, namely Employees, Suppliers and Customers, which contain only the specialized attributes.
For example, if there is a need to get the contact numbers of all the people associated with the organization, one can define a method getContactNumbers() in the base class ‘Person’, and it can be shared by its subclasses. In addition, the generalized class ‘Person’ exhibits a composition relationship with two other classes, ‘Places’ and ‘PostalCodes’: the class ‘Person’ uses instance variables that are object references of the classes ‘Places’ and ‘PostalCodes’. The tables in the proposed OODB design are shown in Tables 4 to 9.
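A minimal sketch of this design, assuming illustrative attribute names: Person holds the common fields and the shared getContactNumbers() behavior; Employee adds only its specialized fields; and Person composes Place and PostalCode objects through instance variables.

# Sketch of the proposed OODB design: inheritance plus composition.
class Place:
    def __init__(self, city):
        self.city = city

class PostalCode:
    def __init__(self, code):
        self.code = code

class Person:                           # generalized base class
    def __init__(self, name, phone, place, postal_code):
        self.name = name
        self.phone = phone              # hypothetical common field
        self.place = place              # composition: object reference
        self.postal_code = postal_code  # composition: object reference

    def get_contact_number(self):       # shared by all subclasses
        return f"{self.name}: {self.phone}"

class Employee(Person):                 # specialized attributes only
    def __init__(self, name, phone, place, postal_code, title):
        super().__init__(name, phone, place, postal_code)
        self.title = title

e = Employee("A. Sharma", "98200-00000", Place("Mumbai"),
             PostalCode("400103"), "Analyst")
print(e.get_contact_number())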
Table 4: Example of Persons Table
Table 5: Example of Extended Employees Table
Table 6: Example of Extended Suppliers Table
Table 7: Example of Extended Customers Table
Table 8: Example of Extended Places Table
Table 9: Example of Extended PostalCodes Table
Owing to the incorporation of the inheritance concept in the proposed design, the database designer can extend the database effortlessly, adding new tables merely by inheriting the common fields from the generalized table.
Data Mining in the Designed Object-Oriented
Database
Implementation and Results
Tables 10 to 14 compare the normalized (proposed) design with the un-normalized (traditional) design, listing for each table its fields, records, total records and memory size. The memory saved in each comparison is:
Table 10: Saved Memory (KB): 63.47656
Table 11: Saved Memory (KB): 126.9531
Table 12: Saved Memory (KB): 190.4297
Table 13: Saved Memory (KB): 253.9063
Table 14: Saved Memory (KB): 317.3828
{Source: Computer Reseller News (CRN) Magazines}
The comparative analysis shows that the saved memory space increases as the number of records in each table increases.
For instance, in the case of a traditional database, if a method getContactNumbers() is defined to get the contact numbers, the method has to be defined in all the classes, and all those results must be combined to obtain the final result. But in the proposed approach all the classes are generalized, so the redefinition of methods for all the related classes is not needed. If there are ‘n’ classes, placing the common methods in the base class saves the memory space of n − 1 method definitions.
Building Profitable Customer Relationships with
Data Mining
This section of the project will describe the various aspects of
analytic CRM and show how it is used to manage the customer life
cycle more cost-effectively. The case histories of these fictional
companies are composites of real-life data mining applications.
Data Mining in Customer Relationship Management
In CRM, data mining is frequently used to assign a score to a
particular customer or prospect indicating the likelihood that the
individual will behave in the way the business wants. For
example, a score could measure the propensity to respond to a
particular offer or to switch to a competitor’s product. It is also
frequently used to identify a set of characteristics (called a profile)
that segments customers into groups with similar behaviors, such
as buying a particular product.
Defining CRM
• Acquiring customers
• Increasing the value of the customer
• Retaining good customers
Applying Data Mining to CRM
In order to build good models for the business CRM system, there are a number of steps the business must follow.
• Build a Marketing Database.
Steps two through four constitute the core of the data preparation; together, they take more time and effort than all the other steps combined. There may be repeated iterations of the data preparation and model building steps as the business analyst learns something from the model that suggests modifying the data. These data preparation steps may take anywhere from 50% to 90% of the time and effort of the entire data mining process!
• Explore the data.
• Prepare data for modeling.
This is the final data preparation step before building models and
the step where the most “art” comes in. There are four main parts to
this step:
The next step is to construct new predictors derived from the raw
data.
• Data mining model building.
• Incorporating Data Mining in the business CRM solution
For inbound transactions, such as a telephone order, an Internet
order, or a customer service call, the application must respond in
real time. Therefore the data mining model is embedded in the
application and actively recommends an action.
In either case, one of the key issues the business must deal with in applying a model to new data is the set of transformations used in building the model. Thus, if the input data (whether from a transaction or a database) contains age, income, and gender fields, but the model requires the age-to-income ratio and gender recoded into two binary variables, the input data must be transformed accordingly. The ease with which these transformations can be embedded becomes one of the most important productivity factors when marketers want to rapidly deploy many models.
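A minimal sketch of the two transformations just mentioned, applied to an incoming record before scoring; the field values are invented.

# Apply the model's training-time transformations to new input data.
def transform(record):
    # record: {"age": ..., "income": ..., "gender": "M" or "F"}
    return {
        "age_to_income": record["age"] / record["income"],
        "gender_m": 1 if record["gender"] == "M" else 0,
        "gender_f": 1 if record["gender"] == "F" else 0,
    }

incoming = {"age": 34, "income": 52000, "gender": "F"}   # invented record
print(transform(incoming))   # now in the form the model expects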
How to Data Mine for Future Trends
Steps to follow:
• Add other mitigating data as it is found. Write reports
supplementary to the business data mining graphs that
"explain" a trend and evaluate the chances of its continuance.
Having background data and reasonable mitigating factors at
hand will help the business analyst make better decisions about
the future.
The 10 Secrets to Using Data Mining to Succeed at CRM
• Start with the end in mind. Avoid the “ad hoc trap” of mining data without defined business objectives. Prior to modeling, define a project that supports the organization's strategic objectives. For example, the business objective might be to attract additional customers who are similar to the business's most valuable current customers. Or it might be to keep the most profitable customers longer.
• Set specific goals for the business data mining project
• Line up the right data
• Secure IT buy-in
• Select the right data mining solution
• Consider mining other types of data to increase the
return on the business's data mining investment
When the business analyst combines text, Web, or survey data with the structured data used in building models, the information available for prediction is enriched. Even if only one type of additional data is added, the business will see an improvement in the results generated.
Incorporating multiple types of data will provide even greater
improvements. To determine if the company might benefit from
incorporating additional types of data, begin by asking the
following questions:
• Expand the scope of data mining to achieve even greater
results
One way that the business analyst can increase the return on investment (ROI) generated by data mining is by expanding the number of projects undertaken. With the right data mining solution, one that helps automate routine tasks, the business analyst can do this without increasing staff.
• Consider all available deployment options
• Increase collaboration and efficiency through model
management
Look into data mining solutions that enable the business analyst to
centralize the management of data mining models and support the
automation of processes such as the updating of customer scores.
These solutions foster greater collaboration and enterprise
efficiency. Central model management also helps the organization
avoid wasted or duplicated effort while ensuring that the most
effective predictive models are applied to the business challenges.
Model management also provides a way to document model
creation, usage, and application.
Suggestions for Analytics, Business Intelligence, and Performance Management
• We will witness the emergence of packaged strategy-driven execution applications. As we discussed
in Driven to Perform: Risk-Aware Performance
Management From Strategy Through Execution (Nenshad
Bardoliwalla, Stephanie Buscemi, and Denise Broady,
New York, NY, Evolved Technologist Press, 2009), the
end state for next-generation business applications is not
merely to align the transactional execution processes
contained in applications like ERP, CRM, and SCM with
the strategic analytics of performance and risk
management of the organization, but for those strategic
analytics to literally drive execution. We called this
“Strategy-Driven Execution”, the complete fusion of
goals, initiatives, plans, forecasts, risks, controls,
performance monitoring, and optimization with
transactional processes. Visionary applications such as
those provided by Workday and SalesForce.com with
embedded real-time contextual reporting available
directly in the application (not as a bolt-on), and Oracle’s
entire Fusion suite layering Essbase and OBIEE
capabilities tightly into the applications’ logic, clearly
portend the increasing fusion of analytic and transactional
capability in the context of business processes and this
will only increase.
• The holy grail of the predictive, real-time enterprise will
start to deliver on its promises. While classic analytic
tools and applications have always done a good job of
helping users understand what has happened and then
analyze the root causes behind this performance, the
value of this information is often stale before it reaches its
intended audience. The holy grail of analytic technologies
has always been the promise of being able to predict
future outcomes by sensing and responding, with minimal
latency between event and decision point. This has
become manifested in the resurgence of interest in event-
driven architectures that leverage a technology known as
Complex Event Processing and predictive analytics. The
predictive capabilities appear to be on their way to breakout market acceptance, as evidenced by IBM's significant investment in setting up its Business Analytics and Optimization practice with 4,000 dedicated consultants, combined with the massive product portfolio of Cognos and the recently acquired SPSS assets. Similarly, Complex Event
Processing capabilities, a staple of extremely data-
intensive, algorithmically-sophisticated industries such as
financial services, have also become interesting to a
number of other industries that cannot deal with the
amount of real-time data being generated and need to be
able to capture value and decide instantaneously.
Combining these capabilities will lead to new classes of
applications for business management that were
unimaginable a decade ago.
• The industry will put reporting and slice-and-dice
capabilities in their appropriate places and return to its
decision-centric roots with a healthy dose of Web 2.0
style collaboration. It was clear to the pioneers of this
industry, beginning as early as H.P. Luhn’s brilliant
visionary piece A Business Intelligence System from
1958 that the goal of these technologies was to support
business decision-making activities, and we can trace the
roots of modern analytics, business intelligence, and
performance management to the decision-support notion
of decades earlier. But somewhere along the way,
business intelligence became synonymous with reporting
and slicing-and-dicing, which is a metaphor that suits
analysts, but not the average end-user. This has
contributed to the paltry BI adoption rates of
approximately 25% bandied about in the industry, despite
the fact that investment in BI and its priority for
companies has never been higher over the last five years.
Making report production cheaper to the point of being nearly free, something BI is poised to do, is still unlikely to improve this situation much. Instead, we will see a resurgence in collaborative decision-centric business
intelligence offerings that make decisions the central
focus of the offerings. From an operational perspective,
this is certainly in evidence with the proliferation of rules-
based approaches that can automate thousands of
operational decisions with little human intervention.
However, for more tactical and strategic decisions, mash-
ups will allow users to assemble all of the relevant data
for making a decision, social capabilities will allow users
to discuss this relevant data to generate “crowd sourced”
wisdom, and explicit decisions, along with automated
inferences, will be captured and correlated against
outcomes. This will allow decision-centric business
intelligence to make recommendations within process
contexts for what the appropriate next action should be,
along with confidence intervals for the expected outcome, as well as being able to tell the user what the risks of her decisions are and how they will impact both the company's and her own personal performance.
• Performance, risk, and compliance management will
continue to become unified in a process-based framework
and make the leap out of the CFO’s office. The
disciplines of performance, risk, and compliance
management have been considered separate for a long
time, but the walls are breaking down, as we documented
thoroughly in Driven to Perform. Performance
management begins with the goals that the organization is
trying to achieve, and as risk management has evolved
from its siloed roots into Enterprise Risk Management, it
has become clear that risks must be identified and
assessed in light of this same goal context. Similarly, in
the wake of Sarbanes-Oxley, as compliance has become
an extremely thorny and expensive issue for companies of
all sizes, modern approaches suggest that compliance is
ineffective when cast as a process of signing off on
thousands of individual item checklists, but rather should
be based on an organization’s risks. All three of these
disciplines need to become unified in a process-based
framework that allows for effective organizational
governance. And while financial performance, risk, and
compliance management are clearly the areas of most
significant investment for most companies, it is clear that
these concerns are now finally becoming enterprise-level
plays that are escaping the confines of the Office of the
CFO. We will continue to witness significant investment
in sales and marketing performance management, as
vendors like Right90 continue to gain traction in
improving the sales forecasting process and vendors like
Varicent receive hefty $35 million venture rounds this
year, no doubt thanks to experiencing over 100% year
over year growth in the burgeoning Sales Performance
Management category. My former Siebel colleague,
Bruce Cleveland, now a partner at Interwest, makes the
case for this market expansion of performance
management into the front-office rather convincingly and
has invested correspondingly.
• Cloud Business Intelligence tools will steal significant revenue from on-premise vendors but also fight for limited oxygen amongst themselves. From many accounts, this was the year that cloud BI offerings hit the mainstream due to their numerous advantages over on-premise offerings, and this certainly was in evidence with the significant uptick in investment and market visibility of cloud BI vendors. Although much was made of the folding of LucidEra, one of the original pioneers in the space, and while other vendors like BlinkLogic folded as well, vendors like Birst, PivotLink, Good Data, Indicee and others continue to announce wins at a fair clip, along with innovations at a fraction of the cost of their on-premise brethren. From a functionality perspective, these tools offer great usability, some collaboration features, strong visualization capabilities, and an ease-of-use not seen with their on-premise equivalents, whereby users are able to manage the system in a self-sufficient fashion devoid of the need for significant IT involvement. Cloud BI vendors have long argued that basic reporting and analysis is now a commodity, so there is little reason for any customer to invest in on-premise capabilities at the price/performance ratio that the vendors are offering. We should thus expect to see continued diminution of the on-premise vendors' BI revenue streams as the cloud BI value proposition goes mainstream, although it wouldn't be surprising to see acquisitions by the large vendors to stem the tide. However, with so many small players in the market offering largely similar capabilities, the cloud BI tools vendors may wind up starving themselves for oxygen as they put price pressure on each other to gain new customers. Only vendors whose offerings were designed from the beginning for cloud-scale architecture, and thus whose marginal cost per additional user approaches zero, will succeed in such a commodity pricing environment, although alternatively these vendors can pursue going upstream and try to compete in the enterprise, where the risks and rewards of competition are much higher. On the other hand, packaged cloud BI applications such as those offered by Host Analytics, Adaptive Planning, and new entrant Anaplan, while showing promising growth, have yet to mature to mainstream adoption, but are poised to do so in the coming years. As with all applications, addressing key integration and security concerns will remain crucial to driving adoption.
• The undeniable arrival of the era of big data will lead to
further proliferation in data management alternatives.
While analytic-centric OLAP databases, such as Oracle Express, Hyperion Essbase, and Microsoft Analysis Services, have been around for decades, they have never held the same dominant market share, from an applications consumption perspective, that the RDBMS
vendors have enjoyed over the last few decades. No
matter what the application type, the RDBMS seemed to
be the answer. However, we have witnessed an explosion
of exciting data management offerings in the last few
years that have reinvigorated the information
management sector of the industry. The largest web
players such as Google (BigTable), Yahoo (Hadoop),
Amazon (Dynamo), Facebook (Cassandra) have built
their own solutions to handle their own incredible data
volumes, with the open source Hadoop ecosystem and
commercial offerings like CloudEra leading the charge in
broad awareness. Additionally, a whole new industry of
DBMSs dedicated to analytic workloads has sprung up,
with flagship vendors like Netezza, Greenplum, Vertica,
Aster Data, and the like with significant innovations in in-
memory processing, exploiting parallelism, columnar
storage options, and more. We are already starting to see
hybrid approaches between the Hadoop players and the
ADBMS players, and even the largest vendors like Oracle
with their Exadata offering are excited enough to make
significant investments in this space. Additionally,
significant opportunities to push application processing
into the databases themselves are manifesting themselves.
There has never been such a plethora of choices available, as
new entrants to the market seem to crop up weekly.
Visionary applications of this technology in areas like
meteorological forecasting and genomic sequencing with
massive data volumes will become possible at hitherto
unimaginable price points.
• Advanced Visualization will continue to increase in depth
and relevance to broader audiences. Visionary vendors
like Tableau, QlikTech, and Spotfire (now Tibco) made
their mark by providing significantly differentiated
visualization capabilities compared with the trite bar and
pie charts of most Business Intelligence players'
reporting tools. The latest advances in state-of-the-art
User interface technologies such as Microsoft’s
SilverLight, Adobe Flex, and AJAX via frameworks like
Google’s Web Toolkit augur the era of a revolution in
state-of-the art visualization capabilities. With consumers
broadly aware of the power of capabilities like Google
Maps or the tactile manipulations possible on the iPhone,
these capabilities will find their way into enterprise
offerings at a rapid speed lest the gap between the
consumer and enterprise realms become too large and
lead to large scale adoption revolts as a younger
generation begins to enter the workforce having never
known the green screens of yore.
• Open Source offerings will continue to make in-roads
against on-premise offerings. Much as cloud BI offerings are doing, open source offerings in the larger Business Intelligence market are disrupting the incumbent, closed-source, on-premise vendors.
Vendors like Pentaho and JasperSoft are really starting to
hit their stride with growth percentages well above the
industry average, offering complete end-to-end Business Intelligence stacks at a fraction of the cost of their
competitors and thus seeing good bottom-up adoption
rates. This is no doubt a function of the brutal economic
times companies find themselves experiencing.
Individual parts of the stacks can also be assembled into
compelling offerings and receive valuable innovations
from both corporate entities as well as dedicated
committers: JFreeChart for charting, Actuate‘s BIRT for
reporting, Mondrian and Jedox‘s Palo for OLAP Servers,
DynamoBI‘s LucidDB for ADBMS, Revolution
Computing‘s R for statistical manipulation, Cloudera‘s
enterprise Hadoop for massive data, EsperTech for CEP,
Talend for Data Integration / Data Quality / MDM, and
the list goes on. These offerings have absolutely reached a level of maturity where they are capable of being deployed in the enterprise right alongside any other commercial closed-source vendor offering.
• Data Quality, Data Integration, and Data Virtualization
will merge with Master Data Management to form a
unified Information Management Platform for structured
and unstructured data. Data quality has been the bane of information systems for as long as such systems have existed, causing many an IT analyst to obsess over it, and
data quality issues contribute to significant losses in
system adoption, productivity, and time spent addressing
them. Increasingly, data quality and data integration will
be interlocked hand-in-hand to ensure the right, cleansed
data is moved to downstream sources by attacking the
problem at its root. Vendors including SAP Business
Objects, SAS, Informatica, and Talend are all providing
these capabilities to some degree today. Of course, with
the amount of relevant data sources exploding in the
enterprise and no way to integrate all the data sources into
a single physical location while maintaining agility,
vendors like Composite Software are providing data
virtualization capabilities, whereby canonical information models can be overlaid on top of information assets regardless of where the data are located, capable of
addressing the federation of batch, real-time and event
data sources. These disparate data sources will need to be
harmonized by strong Master Data Management
capabilities, whereby the definitions of key entities in the
enterprise like customers, suppliers, products, etc. can be
used to provide semantic unification over these
distributed data sources. Finally, structured, semi-
structured, and unstructured information will all be able
to be extracted, transformed, loaded, and queried from
this ubiquitous information management platform by leveraging text analytics capabilities that continue to grow in importance and combining them with data virtualization capabilities.
Excel will continue to provide the dominant paradigm for end-user Business Intelligence consumption. For Excel specifically, the
number one analytic tool by far with a home on hundreds of
millions of personal desktops, Microsoft has invested significantly
in ensuring its continued viability as we move past its second
decade of existence, and its adoption shows absolutely no sign of
abating any time soon. With Excel 2010's arrival, this includes
significantly enhanced charting capabilities, a server-based mode
first released in 2007 called Excel Services, being a first-class
citizen in SharePoint, and the biggest disruptor, the launch of
Power Pivot, an extremely fast, scalable, in-memory analytic
engine that can allow Excel analysis on millions of rows of data at
sub-second speeds. While many vendors have tried in vain to
displace Excel from the desktops of the business user for more than
two decades, none will be any closer to succeeding any time soon.
Microsoft will continue to make sure of that.
Successful Stories of Implementing Data Mining
in the Businesses
Getting people to fill out an application for the credit card is only
the first step. Then Big Bank and Credit Card Company (BB&CC)
must decide whether the applicant is a good risk and accept them as
a customer. Not surprisingly, poor credit risks are more likely to
accept the offer than are good credit risks. So while 6% of the
people on the mailing list respond with an application, only about
16% of those are suitable credit risks, for a net of about 1% of the
mailing list becoming customers.
Big Bank and Credit Card Company (BB&CC)'s experience of a 6% response rate means that within the one million names there are 60,000 people who will respond to the solicitation. Unless BB&CC changes the nature of the solicitation (using different mailing lists, reaching customers in different ways, or altering the terms of the offer), it is not going to get more than 60,000 responses. And of those 60,000 responses, only 10,000 will be good enough risks to become customers. The challenge BB&CC faces is getting to those 10,000 people most efficiently.
The cost of mailing the solicitations is about $1.00 per piece for a
total cost of $1,000,000. Over the next couple of years, these
customers will generate about $1,250,000 in profit for the bank (or
about $125 each) for a net return from the mailing of $250,000.
First, BB&CC did a test
mailing of 50,000 and carefully analyzed the results, building a
predictive model of who would respond (using a decision tree) and
a credit scoring model (using a neural net). It then combined these
two models to find the people who were both good credit risks and
most likely to respond to the offer.
pg. 141
The model was applied to the remaining 950,000 people in the mailing list, from which 700,000 people were selected for the
mailing. The result was that from the 750,000 pieces mailed overall
(including the test mailing), 9,000 acceptable applications for credit
cards were received. In other words, the response rate had risen
from 1% to 1.2%, a 20% increase. The targeted mailing reached only 9,000 of the 10,000 prospects; no model is perfect, and reaching the remaining 1,000 prospects would not have been profitable. Had Big
Bank and Credit Card Company (BB&CC) mailed the other
250,000 people on the mailing list, the cost of $250,000 would
have resulted in another $125,000 of gross profit for a net loss of
$125,000.
                Mail All 1,000,000   Targeted Mailing 750,000   Difference
Gross profit    $1,250,000           $1,125,000                 ($125,000)
Net profit      $250,000             $375,000                   $125,000
Cost of model   $0                   $40,000                    $40,000
Final profit    $250,000             $335,000                   $85,000
pg. 142
Notice that the net profit from the mailing increased by $125,000. Even after including the $40,000 cost of the data mining software, computer, and people resources used for this modeling effort, the net profit increased by $85,000. This translates to a return on investment for modeling of over 200%, which far exceeded BB&CC's Return on Investment (ROI) requirements for this project.
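The arithmetic behind these figures can be double-checked with a short C++ sketch; all constants below come straight from the case as stated above.

#include <iostream>

int main() {
    // Figures taken from the BB&CC case described in the text
    const double listSize      = 1000000.0; // names on the mailing list
    const double costPerPiece  = 1.0;       // $ to mail one solicitation
    const double profitPerCust = 125.0;     // $ profit per acquired customer
    const double modelCost     = 40000.0;   // software, computer, and people

    // Mail everyone: 10,000 customers (1% of the list)
    const double netAll = 10000 * profitPerCust - listSize * costPerPiece;

    // Targeted mailing: 750,000 pieces reach 9,000 of the 10,000 prospects
    const double netTargeted   = 9000 * profitPerCust - 750000 * costPerPiece;
    const double finalTargeted = netTargeted - modelCost;

    std::cout << "Net profit, mail all:     $" << netAll        << '\n';  // 250000
    std::cout << "Net profit, targeted:     $" << netTargeted   << '\n';  // 375000
    std::cout << "Final profit after model: $" << finalTargeted << '\n';  // 335000
    std::cout << "ROI on the modeling cost:  "
              << (finalTargeted - netAll) / modelCost * 100 << "%\n";     // 212.5
    return 0;
}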
pg. 143
Increasing the Value of Existing Customers:
Cross-Selling via Data Mining
When a customer calls to place an order, Guns and Roses (G&R) has an excellent chance of selling the caller something additional, that is, of cross-selling. But G&R had found that if the first suggestion fails and it tries to suggest a second item, the customer may get irritated and hang up without ordering anything. And there are some customers who resent any attempt at all to cross-sell them additional products.
pg. 144
Before trying data mining, Guns and Roses (G&R) had been reluctant to cross-sell at all. Without a model, the odds of making the right recommendation were one in three. And because making any recommendation is unacceptable to some customers, G&R wanted to be exceptionally sure that it never made a recommendation when it should not. In a trial campaign, G&R had less than a 1% sales rate, received a substantial number of complaints, and was reluctant to continue for such a small gain.
pg. 145
The second model predicted which offer would be most acceptable.
pg. 146
Increasing the Value of Existing Customers:
Personalization via Data Mining
When Big Sam’s Clothing Company first put up the site, there was
none of this personalization. It was just an on-line version of their
catalog, nicely and efficiently done but not taking advantage of the
sales opportunities presented by the Web.
pg. 147
First, Big Sam’s used clustering to discover which products
grouped together naturally. Some of the clusters were obvious,
such as shirts and pants. Others were surprising, such as books
about desert hiking and snakebite kits. Big Sam's website used these groupings to make recommendations whenever someone looked at a product.
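Real clustering algorithms are beyond the scope of this example, but the flavor of discovering which products "group together" can be shown with a simple co-occurrence count over past orders; this is a deliberately simplified, hypothetical sketch rather than the method Big Sam's actually used:

#include <iostream>
#include <iterator>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

int main() {
    // Hypothetical order histories (each set is the items in one order).
    std::vector<std::set<std::string>> orders = {
        {"desert hiking book", "snakebite kit", "canteen"},
        {"desert hiking book", "snakebite kit"},
        {"shirt", "pants"},
        {"shirt", "pants", "belt"},
    };

    // Count how often each pair of products appears in the same order.
    std::map<std::pair<std::string, std::string>, int> pairCounts;
    for (const auto& order : orders)
        for (auto a = order.begin(); a != order.end(); ++a)
            for (auto b = std::next(a); b != order.end(); ++b)
                ++pairCounts[{*a, *b}];

    // Pairs seen together repeatedly approximate a "natural grouping"
    // and could drive a recommendation on the product page.
    for (const auto& [pair, count] : pairCounts)
        if (count >= 2)
            std::cout << pair.first << " + " << pair.second
                      << " (bought together " << count << " times)\n";
    return 0;
}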
pg. 148
Retaining Good Customers Via Data Mining
The first thing Know Service needed to do was prepare the data for predicting which customers would leave. Know Service needed to select the variables from its customer database and perhaps transform them. The bulk of its users were dial-in clients (as opposed to clients who are always connected through a T1 or DSL line), so Know Service knew how long each user was connected to the Web. It also knew the volume of data transferred to and from a user's computer, the number of e-mail accounts a user had, the number of e-mail messages sent and received, and a customer's service and billing history. In addition, Know Service had demographic data that customers provided at sign-up.
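A minimal sketch of what one prepared record might look like after variable selection and transformation; every field name below is invented for illustration and is not taken from Know Service's actual database:

#include <iostream>

// Hypothetical per-customer record for the churn model, assembled from the
// kinds of variables the text says Know Service had available.
struct ChurnFeatures {
    double hoursConnectedPerWeek;   // dial-in connect time
    double mbTransferredPerWeek;    // data volume to and from the user
    int    emailAccounts;           // number of e-mail accounts
    int    messagesPerWeek;         // e-mail messages sent and received
    int    latePaymentsLastYear;    // from the billing history
    bool   churned;                 // target variable: did the customer leave?
};

// One derived (transformed) variable of the kind the text mentions:
// message volume normalized by the number of accounts.
double MessagesPerAccount(const ChurnFeatures& f) {
    return f.emailAccounts > 0
        ? static_cast<double>(f.messagesPerWeek) / f.emailAccounts
        : 0.0;
}

int main() {
    ChurnFeatures f{10.5, 250.0, 2, 40, 1, false};
    std::cout << "messages per account: " << MessagesPerAccount(f) << '\n';
    return 0;
}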
pg. 149
Next, Know Service needed to identify who its “good” customers were. This is not a data mining question but a business definition (such as profitability) followed by a calculation. Know Service built a model to profile its profitable customers and its unprofitable customers. It used this model not only for customer retention but also to identify customers who were not yet profitable but might become so in the future.
pg. 150
For example, some churners were exceeding even the largest amount of usage available for a fixed fee and were paying substantial incremental usage fees. Know Service tried offering these users a higher-fee service that included more bundled time. Some users were offered more free disk space for personal web pages. Know Service then built models that would predict which offer would be the most effective for a particular user.
To summarize, the churn project made use of all three models. One
model identified likely churners, the next model picked out the
profitable ones worth keeping, and the third model matched the
potential churners with the most appropriate offer. The net result
was a reduction in their churn rate from 8% to 7.5%, for a savings
in customer acquisition costs of $1,000,000 per month.
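A toy sketch of how the three models chain together; the stand-in "models" and thresholds below are invented for illustration, since the real models would be learned from data:

#include <iostream>
#include <vector>

// Toy stand-ins for the three models in the churn project.
struct Customer { double pChurn; double monthlyProfit; double usageHours; };

bool LikelyChurner(const Customer& c) { return c.pChurn > 0.5; }         // model 1
bool WorthKeeping(const Customer& c)  { return c.monthlyProfit > 0.0; }  // model 2
const char* BestOffer(const Customer& c) {                               // model 3
    return c.usageHours > 100 ? "bigger bundled-time plan" : "extra disk space";
}

int main() {
    std::vector<Customer> customers = {
        {0.8,  12.0, 150.0},   // likely churner, profitable, heavy user
        {0.9,  -3.0,  20.0},   // likely churner but unprofitable: let go
        {0.1,  20.0,  60.0},   // unlikely to churn: no action needed
    };
    for (const auto& c : customers)
        if (LikelyChurner(c) && WorthKeeping(c))
            std::cout << "Retention offer: " << BestOffer(c) << '\n';
    return 0;
}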
Know Service found that their investment in data mining paid off
by improving their customer relationships and dramatically
increasing their profitability.
pg. 151
Conclusion
pg. 152
Data mining has been gaining tremendous interest, and research on data mining has therefore mushroomed within the last few decades. A promising approach for managing complex information and user-defined data types is to incorporate Object-Orientation concepts into Relational Database Management Systems. In this research, I have presented an approach for the design of an Object-Oriented Database and for performing classification effectively in it. Object-Oriented Programming concepts such as inheritance and polymorphism have been utilized in the presented approach. Owing to this Object-Oriented Database (OODB) design, an efficient classification task has been achieved using simple SQL/ORACLE queries.
pg. 153
In spite of the often uncanny accuracy of the insights that data mining provides, it is not magic. It is a valuable business tool that organizations around the globe are successfully using to make critical business decisions about customer acquisition and retention, customer value management, marketing optimization, and other customer-related issues.
Similarly, the keys to effectively using data mining are not secret
or mysterious. With a solid understanding of the issues to be
addressed, appropriate resources and support, and the right
solution, the business analyst, too, can experience the business
benefits that other organizations are reaping from data mining.
pg. 154
APPENDIX – I
List of Figures
pg. 155
Figure 8: Examples of Time-Series Data (Source: Thompson
Investors Group)
pg. 156
APPENDIX – II
List of Tables
pg. 157
Table 9: Example of Extended PostalCodes Table
pg. 158
APPENDIX – III
List of Equations
pg. 159
APPENDIX – IV
SQL Queries
pg. 160
Create Table Employees
pg. 161
APPENDIX - V
pg. 162
CASE: Computer Aided Software Engineering
pg. 163
Repository:
Noise Data:
Noise data is meaningless data. The term has often been used as a
synonym for corrupt data. However, its meaning has expanded to
include any data that cannot be understood and interpreted
correctly by machines, such as unstructured text. Any data that has
been received, stored, or changed in such a manner that it cannot be
read or used by the program that originally created it can be
described as noisy.
pg. 164
Web Mining:
Mobile Computing:
Niche:
Semantics:
pg. 165
Generalization:
pg. 166
Specialization:
pg. 167
Differences in the two approaches may be characterized by their
starting point and overall goal. Generalization proceeds from the
recognition that a number of entity sets share some common
features (namely, entities are described by the same attributes and
participate in the same relationship sets).
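As a minimal code illustration of the two concepts, here is the classic Person/Employee pair in C++ (attributes invented): the base class is the generalized entity set, and the derived class is a specialization that adds its own attributes:

#include <string>

// Generalization: Person captures the features shared by several entity sets.
class Person {
public:
    std::string name;
    std::string address;
};

// Specialization: Employee refines Person with attributes of its own.
class Employee : public Person {
public:
    std::string employeeId;
    double salary = 0.0;
};

int main() {
    Employee e;
    e.name   = "A. Clerk";   // inherited from the generalized Person
    e.salary = 30000.0;      // specific to the Employee specialization
    return 0;
}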
pg. 168
Method Override: redefining in a derived class a method inherited from a base class, so that the derived version is invoked at run time through a base-class pointer or reference.
For Example:
class Base
{
public:
    virtual void DoSomething() { x = x + 5; }   // may be overridden by derived classes
private:
    int x = 0;   // initialized so the increment is well defined
};

class Derived : public Base
{
public:
    // Overrides the base version, then calls it explicitly as well.
    void DoSomething() override { y = y + 5; Base::DoSomething(); }
private:
    int y = 0;
};
pg. 169
Method Overload: defining several methods with the same name but different parameter lists in the same class; the compiler selects among them based on the arguments supplied.
For Example:
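The original example appears to be missing here; the following minimal sketch (with invented names, in the style of the override example above) shows overloading: several methods share a name but differ in parameter lists, and the compiler picks one at compile time.

#include <iostream>

class Printer
{
public:
    // Overloaded methods: same name, different parameter lists.
    void Print(int value)        { std::cout << "int: "    << value << '\n'; }
    void Print(double value)     { std::cout << "double: " << value << '\n'; }
    void Print(const char* text) { std::cout << "text: "   << text  << '\n'; }
};

int main()
{
    Printer p;
    p.Print(42);       // resolves to Print(int)
    p.Print(3.14);     // resolves to Print(double)
    p.Print("hello");  // resolves to Print(const char*)
    return 0;
}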
pg. 170
Polymorphic:
Interface:
Inheritance:
pg. 171
Composite Object Modeling:
pg. 172
Decision trees:
A decision tree (or tree diagram) is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. The first component is the top decision node, or root node, which specifies a test to be carried out. Each branch leads either to another decision node or to the bottom of the tree, called a leaf node. By navigating the decision tree, a business analyst can assign a value or class to a case by deciding which branch to take, starting at the root node and moving to each subsequent node until a leaf node is reached. Each node uses the data from the case to choose the appropriate branch. Decision tree models are commonly used in data mining to examine the data and induce a tree and its rules that will be used to make predictions. A number of different algorithms may be used for building decision trees, including CHAID (Chi-squared Automatic Interaction Detection), CART (Classification and Regression Trees), QUEST, and C5.0.
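A minimal sketch of navigating such a tree in C++; the tree, its tests, and its thresholds are invented for illustration (a real tree would be induced from data by an algorithm such as CART or C5.0):

#include <iostream>
#include <string>

// Each node either tests the case's attributes or holds a class label (leaf).
struct Node {
    std::string label;                    // class label if this is a leaf
    bool isLeaf;
    bool (*test)(double income, int age); // true -> left branch, false -> right
    Node* left;
    Node* right;
};

// Start at the root and follow branches until a leaf node is reached.
std::string Classify(const Node* node, double income, int age) {
    while (!node->isLeaf)
        node = node->test(income, age) ? node->left : node->right;
    return node->label;
}

bool IncomeTest(double income, int) { return income > 40000.0; }
bool AgeTest(double, int age)       { return age > 25; }

int main() {
    Node goodRisk{"good risk", true,  nullptr,    nullptr,   nullptr};
    Node badRisk {"bad risk",  true,  nullptr,    nullptr,   nullptr};
    Node ageNode {"",          false, AgeTest,    &goodRisk, &badRisk};
    Node root    {"",          false, IncomeTest, &ageNode,  &badRisk};

    std::cout << Classify(&root, 52000.0, 30) << '\n';  // prints "good risk"
    std::cout << Classify(&root, 18000.0, 40) << '\n';  // prints "bad risk"
    return 0;
}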
pg. 173
Reluctant: not eager; unwilling; disinclined
Solicitation:
Clustering:
pg. 174
Neural Networks:
pg. 175
Notes
pg. 176