
CHAPTER - 1

INTRODUCTION

Data mining (the analysis step of the "Knowledge Discovery in Databases" process, or KDD), an interdisciplinary subfield of computer science, is the computational process of discovering patterns in large data sets using methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use. Aside from the raw analysis step, it involves database and data management aspects, data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating. The term is a buzzword and is frequently misused to mean any form of large-scale data or information processing (collection, extraction, warehousing, analysis, and reporting), but it is also generalized to any kind of computer decision support system, including artificial intelligence, machine learning, and business intelligence. In the proper use of the word, the key term is discovery, commonly defined as "detecting something new". Even the popular book "Data Mining: Practical Machine Learning Tools and Techniques with Java" (which covers mostly machine learning material) was originally to be named just "Practical Machine Learning", and the phrase "data mining" was only added for marketing reasons.

The actual data mining task is the automatic or semi-automatic analysis of large quantities of data to extract previously unknown, interesting patterns such as groups of data records (cluster analysis), unusual records (anomaly detection), and dependencies (association rule mining). This usually involves database techniques such as spatial indexes. These patterns can then be seen as a kind of summary of the input data and may be used in further analysis or, for example, in machine learning and predictive analytics. For example, the data mining step might identify several groups in the data, which can then be used to obtain more accurate prediction results from a decision support system.

1.1 Background

The manual extraction of patterns and models from data has been practised for centuries. Early methods of identifying patterns in data include Bayes' theorem (1700s) and regression analysis (1800s). The proliferation, ubiquity and growing power of computer technology have greatly increased data collection, storage and processing capability. As data sets have grown in size and complexity, direct hands-on analysis has increasingly been augmented with indirect, automated data processing, aided by other discoveries in information technology and computer science such as neural networks, cluster analysis and genetic algorithms (1950s).

Further contributions include decision trees (1960s) and support vector machines (1990s). Data mining is the process of applying these methods with the intention of uncovering hidden patterns in large data sets. It bridges the gap from applied statistics, bioinformatics and artificial intelligence, which provide the mathematical background, to database management, by exploiting the way data are stored and indexed in databases so that the actual learning and discovery algorithms can be executed more efficiently, allowing such methods to be applied to ever larger data sets.

Problem description
 Finding relevant information in unstructured data is a challenge. The data is unknown in terms of structure and values. The lifecycle of each piece of data lies in a specific domain, and for each domain a domain expert is available to provide a priori knowledge. Domain experts can create structures in the data by hand; however, this is a time-consuming job and it is done for one dataset in one domain.
 An additional challenge is connecting data from different domains.
1.2 Collaborative environment
All the data is (geographically) spread over multiple domains: the environment consists of more than one domain, and the domains have the intention to collaborate. Users are physically located at different places, exchanging knowledge and sharing information by interacting. Nowadays, collaborative environments are characterized by a complex infrastructure, multiple organizational sub-domains, constrained information sharing, heterogeneity, frequent change, volatility and dynamic behaviour.

Fig: Collaborative environment

1.3 Domain- and privacy restrictions


Each domain can have its own privacy restrictions. A domain has its own standards in communication and interaction, data storage, data structures, and culture. A further aspect is privacy: domains have their own policies and restrictions, so not all private data can be accessed by outsiders.

1.4 Retrieving relevant information


The main problem is retrieving relevant information from multiple domain resources, as illustrated by an overview of multiple domains with cross-domain information sharing. Each domain consists of multiple manifests, and these manifests can change structure over time. Each domain expert tries to create a structure or pattern by hand using his or her a priori domain knowledge. Each domain administrator does this for his own domain in order to retrieve rich information from the manifests in a readable structure. When there is a need for collaboration, connecting and coupling two domains to create a shared conceptualization, the domain experts have to perform this job together: by communicating, reaching agreement and reducing the noise of interaction, physically and virtually different worlds are connected into a holistic environment.

Problem description
During problem analysis, system administrators combine information from various monitoring tools. Domain administrators try to relate their findings to those of administrators from different domains. This process involves manual inspection of log files. It is a time-consuming job, which sometimes involves the inference of missing data. The current study aims to find patterns in the observed data and to build models of the structure of information in the log files. By relating information from multiple sources (log files), an automated system should be able to infer missing information (1).

The problem in managing grid infrastructures is that local domain information has no connection to other domains. Manually combining relevant data from two or more domains is a time-consuming job, especially since grid infrastructures have dynamic characteristics and the domain privacy aspect has to be taken into account.

1.5 Approach
The following approach is used to address the research questions. First, state-of-the-art literature research is performed to get an overview of learning mechanisms and agent technology in distributed systems.
Experiment: corpus creation using genetic algorithms. The first experiment uses and creates an intelligent algorithm for Region of Interest (ROI) extraction. ROIs are possible parts of the solution domain (the retrievable structure/pattern, see figure 1-1). The algorithm has to be able to read manifests, to interpret them, and to have a learning function. The result should be a structured set containing regular expressions. This structured set is called a corpus. The regular expressions represent the value data of the ROIs. A ROI in a log file can be a sentence, verb, IP address, date, or a combination of these (see figure 1-3).
The algorithm uses a log file and domain knowledge as input. Domain knowledge contains descriptive attributes, e.g. {; ,Time = . * date= }. A descriptive attribute is a data pattern that the agent will search for in the log file; it is expressed as a regular expression. Given the manifest and the domain knowledge, the algorithm returns a corpus. The corpus represents an information pattern in the log file. Information retrieval is achieved by querying the corpus; these queries return the values of the ROIs, as sketched in the example below.
Fig: General ROI extraction algorithm
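As an illustration of this flow only (not the thesis prototype and not the genetic algorithm itself), the following minimal Python sketch builds a corpus from a toy manifest using hypothetical descriptive attributes and then queries it; all names, patterns and example lines are assumptions made for illustration.

import re

# Hypothetical domain knowledge: descriptive attributes given as regular expressions.
DOMAIN_KNOWLEDGE = {
    "date": r"\d{4}-\d{2}-\d{2}",
    "time": r"\d{2}:\d{2}:\d{2}",
    "ip":   r"\d{1,3}(?:\.\d{1,3}){3}",
}

def build_corpus(manifest_lines, domain_knowledge):
    """Keep only the descriptive attributes that actually occur in the manifest."""
    corpus = {}
    for name, pattern in domain_knowledge.items():
        regex = re.compile(pattern)
        if any(regex.search(line) for line in manifest_lines):
            corpus[name] = regex
    return corpus

def query(corpus, manifest_lines, attribute):
    """Querying the corpus returns the ROI values for one attribute."""
    regex = corpus[attribute]
    return [m.group(0) for line in manifest_lines for m in regex.finditer(line)]

manifest = [
    "2008-12-07 04:02:09 job started",
    "got connection from 128.142.173.156",
]
corpus = build_corpus(manifest, DOMAIN_KNOWLEDGE)
print(query(corpus, manifest, "ip"))    # ['128.142.173.156']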
1.6 Prototype: Multi-agent system
The prototype of the decentralized system was based on agent technology. Agents are described in more detail later in this study. In the current study, agents are used to collect and retrieve information from manifests. They are equipped with software components to collect data. A multi-agent system (MAS) contains multiple agents which can communicate with each other. With a MAS it is possible to deal with domain and privacy restrictions. Agents in a MAS can cooperate to achieve cooperative learning and distributed problem solving. A MAS can increase its self-efficiency, communicate across domains, and work autonomously.
The agents use Regions of Interest (ROIs) to analyze manifests. A Region of Interest is (part of) a pattern or (part of) a model from a manifest. The agents can communicate with each other via the network. The agents collaborate with other agents to form an alliance in order to exchange ROIs. By exchanging their knowledge, the agents can learn from one another and can achieve higher levels of data abstraction, as illustrated below.
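Purely as an illustration of this exchange (the actual prototype is described in later chapters), the sketch below shows two toy agents that hold locally discovered ROI patterns and merge what an alliance partner shares; the class, its methods and the example patterns are hypothetical.

class Agent:
    """Toy agent holding ROI patterns discovered in its own domain."""

    def __init__(self, name, rois):
        self.name = name
        self.rois = set(rois)              # locally discovered ROI patterns

    def share(self):
        """Expose the ROIs this agent is willing to exchange."""
        return set(self.rois)

    def learn_from(self, other):
        """Merge ROIs received from an alliance partner and report what was new."""
        received = other.share() - self.rois
        self.rois |= received
        return received

# Two agents from different domains form an alliance and exchange knowledge.
a = Agent("domain-A", {r"PID\W*(\d+)"})
b = Agent("domain-B", {r"TIME\W*([\w\s:]+)"})
print(a.learn_from(b))   # agent A now also knows the TIME pattern
print(b.learn_from(a))   # agent B now also knows the PID pattern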

Scope
The scope of the current study is shown in figure 1-4; every chapter zooms in on a part of the figure. The scope is retrieving data from manifest(s) using domain knowledge and machine learning techniques. Figure 1-4 describes the various levels of the current study and their relations. The top level describes the data that was used, in this case manifests containing log data. The second level involves pre-processing, that is, retrieving/creating domain knowledge about the manifests. This domain knowledge is used in the next level by the algorithm to find patterns in the manifest. Finally, at the level of the multi-agent system, the found patterns - or ROIs - are exchanged between agents.
Fig: Text mining executed by Multi-agents

What is Data Mining?


Data mining is a highly interdisciplinary subject and can be explained in many different ways. Arguably, it should have been named differently, for example "knowledge mining from databases" or "knowledge discovery from data". Many people treat data mining as a synonym for another popularly used term, knowledge discovery from data (KDD), while others view data mining as simply an essential step in the process of knowledge discovery.

1.7 Uses of Data mining


Games
Data mining is used in games such as tablebases with any beginning configuration, small-board dots-and-boxes, small-board hex and certain endgames in chess; with these, a new area for data mining has been opened.
Business
Data mining is the analysis of historical business data, stored as databases, to reveal hidden patterns and trends. Data mining is very important in customer relationship management applications. In data mining, data clustering is a dominant and important technique. It can also be used to automatically discover segments within a customer list. Data mining is furthermore of particular value to human resources departments in identifying the characteristics of their most successful employees.

Science and engineering


In recent years, data mining has been used in many areas of science and engineering, such as bioinformatics, medicine, health care, physics and environmental science. In the study of human genetics, sequence mining helps address the important goal of understanding the mapping between inter-individual variations in human DNA/RNA sequences and protein chains and the variability in disease susceptibility. In simple terms, it aims to find out how changes in an individual's DNA/RNA sequence affect the risk of developing common diseases such as cancer. One data mining method used to perform this task is known as multifactor dimensionality reduction.

Human rights
Data mining is also applied in the area of human rights. Data mining of government records, particularly those of the justice system, enables the discovery of systemic human rights violations.

Objectives
 To collect different samples from the different areas of the protein and forest fire data.
 To include identification and maintenance of techniques and to explore clustering and classification techniques for this problem.
 To perform comparative analysis between different areas (protein and forest fire data).
 To explore different techniques for protein and forest fire data.
 To use data mining algorithms.
 To explore the study of environmental data, decision trees and prediction of data.

CHAPTER - 2
LITERATURE SURVEY

2.1 Information Retrieval & Extraction


Information retrieval (IR) and extraction (IE) are concepts that derive from
information science. Information retrieval is searching for documents and for
information in documents. In daily life, humans have the desire to locate and obtain
information. A human tries to retrieve information from an information system by
posing a question or query. Nowadays there is an overload of information available,
while humans need only relevant information depending on their desires. Relevance in
this context means returning a set of documents that meets the information need.
Information extraction (IE) in computing science means obtaining structured
data from an unstructured format. Often the format of structured data is stored in a
dictionary or an ontology that defines the terms in a specific domain with their relation
to other terms. IE processes each document to extract (find) possible meaningful entities
and relationships, to create a corpus. The corpus is a structured format to obtain
structured data.
Information retrieval is an old concept. In a physical library books are stored on
a shelf in a specific order e.g. per topic and then alphabetic. When a person needs
information on a specific topic, he or she can run to the shelf and locate a book that fits
the most to his or her needs. With the advent of computers, this principle can also be
used by information systems. Well-known information-retrieval systems are search
engines on the web. For instance, Google tries to find a set of available documents on
the web, using a search phrase. It tries to find matches for the search phrase or parts of
it. The pre-processing work for the search engines is the information extraction process;
to create order in a chaos of information. Google crawls the web for information,
interprets it, and stores it in a specific structure so that it can be quickly accessed when
users fire search phrases.

Data mining consists of an iterative sequence of the following steps:


1) Data cleaning (to remove noise and inconsistent data)
2) Data integration (where multiple data sources may be combined)
3) Data selection (where data relevant to the analysis task are retrieved from the
database)
4) Data mining (an essential process where intelligent methods are applied in order
to extract data patterns)
5) Pattern evaluation (to identify the truly interesting patterns representing
knowledge based on some interestingness measures)
Fig: Data mining process
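To make these steps concrete, the following short Python sketch maps them onto common library calls; the file names, column names and the choice of k-means as the mining method are assumptions made only for illustration, not part of the thesis.

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# 1) Data cleaning and 2) data integration: load two sources, drop noisy rows, join them.
sensors = pd.read_csv("sensors.csv").dropna()
sites = pd.read_csv("sites.csv").dropna()
data = sensors.merge(sites, on="site_id")

# 3) Data selection: keep only the attributes relevant to the analysis task.
features = data[["temperature", "humidity", "wind_speed"]]

# 4) Data mining: apply an intelligent method (here, k-means clustering).
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(features)

# 5) Pattern evaluation: measure how coherent (interesting) the found clusters are.
print("silhouette:", silhouette_score(features, model.labels_))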

2.2 Text mining


Text mining uses the same analysis approach and techniques as data mining. However, data mining requires structured data, while text mining aims to discover patterns in unstructured data (3). For commercial use, text mining will be the follow-up of data mining. With the growing number of digitized documents and large text databases, text mining will become increasingly important. Text mining can be of huge benefit for finding relevant and desired text data in unstructured data sources. NACTEM performs research in text mining and applies the resulting methods to the MEDLINE data source, a huge database containing medical information. With text mining, the input is a text set which can be unstructured or semi-structured. For example, a text document can have a few structured parts like title, author, publication date, and category. The abstract and content might be unstructured components with a high potential information value. It is hard to retrieve information from those parts with conventional data mining.
Text mining uses unstructured documents as input data, in other words, documents that are hard to interpret in terms of meaning. There are few companies working on commercial applications for text mining. Because of the challenges involved in working with text and the differences between languages, it is difficult to create a general solution or application. The research area is currently "too young" to deal with all the aspects of text and natural language processing and with linking pieces of information to each other. However, the first results are promising and perform well, e.g. the work performed by TextKernel. (In the context of the current study the author visited TextKernel to discuss text mining techniques.) TextKernel is a company specialized in mining data and is working on text mining with promising results, e.g. parsing and finding structures in Curriculum Vitae (C.V.) documents. These C.V.s are collected and parsed into a general format for international staffing and recruitment agencies.

Intelligent algorithms
This paragraph begins with a basic explanation of intelligence and learning in software components. Intelligence in software is characterized as:
 adaptability to a new environment or to changes in the current environment;
 capacity for knowledge and the ability to acquire it;
 capacity for reason and abstract thought;
 ability to comprehend relationships;
 ability to evaluate and judge;
 capacity for original and productive thought
Learning is the process of obtaining new knowledge. It results in a better reaction to the same inputs at the next session of operation. It means improvement and is a step toward adaptation. Learning is an important characteristic of intelligent systems.
There are three important approaches to learning:

 Learning through examples. This means learning from training data. For instance, pairs (Xi, Yi), where Xi is a vector from the domain space D and Yi is a vector from the solution space S, i = 1, 2, ..., n, are used to train a system to obtain the goal function F : D → S. This type of learning is typical for neural networks.
 Learning by being told. This is a direct or indirect implementation of a set of
heuristic rules in a system. For example, the heuristic rules to monitor a car can
be directly represented as production rules. Or instructions given to a system in a
text form by an instructor (written text, speech, natural language) can be
transformed into internally represented machine rules.
 Learning by doing. This way of learning means that the system starts with nil or
little knowledge. During its functioning it accumulates valuable experience and
profits from it, and performs better over time. This method of learning is typical
for genetic algorithms

The first two approaches are top-down, because many possible solutions are available; these approaches can learn from experience or via an instructor (supervised). The third one is a bottom-up strategy: it begins with little knowledge and tries to find the best possible solution. The final or optimal solution is not known; only parts of the solution domain are given.
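A minimal sketch of the "learning by doing" approach is a simple genetic algorithm: the system starts from random candidates and improves them using only a fitness signal. The bit-string encoding, the target and all parameters below are toy assumptions and are not the corpus-creation algorithm of chapter 1.

import random

TARGET = [1] * 20                      # illustrative goal, unknown to the learner

def fitness(candidate):
    # how many bits agree with the target
    return sum(c == t for c, t in zip(candidate, TARGET))

def crossover(a, b):
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:]

def mutate(candidate, rate=0.05):
    return [1 - bit if random.random() < rate else bit for bit in candidate]

# start with almost no knowledge: a random population
population = [[random.randint(0, 1) for _ in range(20)] for _ in range(30)]
for generation in range(50):
    population.sort(key=fitness, reverse=True)
    parents = population[:10]                                  # selection
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(20)]
    population = parents + children                            # elitism + offspring
print("best fitness:", fitness(max(population, key=fitness)))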
In information technology, an agent is an autonomous software component: it operates without the direct intervention of humans or others and has control over its actions and internal state. It perceives its environment through sensors and acts upon its environment through effectors. Agents communicate with other agents. Through communication, agents can work together, which can result in cooperative problem solving, whereby agents have their own tasks and goals to fulfil in the environment. See figure 2-2 (multi-agent system canonical view), where the agents interact with each other: each agent perceives a part of the environment and can (partially) influence the environment with its actions.
A multi-agent system (MAS) consists of multiple coupled (intelligent) software agents, which interact with one another through communication and are capable of perceiving the environment. Multi-agent systems solve problems which are difficult or impossible for a single-agent system.

Fig: Multi agent system canonical view


2.3 Properties of environments
Accessible vs. inaccessible
An environment is accessible when the agent can obtain a complete state of the environment, i.e. the agent can detect all relevant states of the environment on which to base its decisions. If the environment is highly accessible, it is easier for an agent to make accurate decisions (accurate as in related to, or appropriate given, the state of the environment). Examples of largely inaccessible environments are the internet and the physical real world.

Deterministic vs. non-deterministic


A deterministic environment is one in which any action has a single guaranteed
effect - there is no uncertainty about the state that will result from performing an action.

Episodic vs. non-episodic


In an episodic environment the agent performs in several episodes, without any relevance or relations between the episodes. Actions executed by the agent in previous episodes, after acting on and perceiving the environment, have no relevance in the following (i.e. new) episodes. The agent does not need the ability to reason ahead, because the quality of performance does not depend on previous episodes. In a non-episodic environment, however, agents should "think" about possible next steps.

Static vs. dynamic


A static environment remains unchanged, even after actions performed by an agent. Dynamic environments change by themselves and are harder for agents to handle. If an environment is otherwise static, but agents can change characteristics of the environment, it is called semi-dynamic.

Discrete vs. continuous


In a discrete environment there is a fixed, finite number of actions and percepts, e.g. a chess game.
Review of Literature
In this literature (Agarwal et al., 1993) it is noted that proteins typically do not act in isolation in a cell but function within complex cellular pathways, interacting with other proteins either in pairs or as modules of larger complexes. While many protein complexes have been identified by large-scale experimental studies, the large number of false-positive interactions present in existing protein complex data makes it difficult to gain an accurate understanding of functional modules, which cover groups of proteins involved in common elementary biological functions. In this paper, the authors present a hyperclique pattern discovery approach for extracting functional modules (hyperclique patterns) from protein complexes.
(Kumar et al., 2004) discussed a SAS solution for pharmacovigilance. (Arthur and Vassilvitskii, 2007) presented the k-means++ method, a widely used clustering technique; the experiments in that research show that the method improves both the accuracy and the speed of k-means, often quite dramatically.
In this literature (Bashers et al., 2012) the authors presented data mining as the discovery of knowledge from large quantities of data. The research focuses on a genetic-algorithm-based approach for mining classification rules. The work is based on the coverage, comprehensibility and accuracy of the rules, and on simplifying the implementation of a genetic algorithm. The research discusses the design of the encoding, the fitness function of the genetic algorithm and the genetic operators. Experimental results show that the genetic algorithm proposed in this research is suitable for discovering rules with higher classification performance and for applying data mining methods to unknown data.
(Bauer and Kohavi, 1999) presented an extensive empirical comparison of bagging and boosting. The authors also reported that MDT gave better results than voting and stacking. Bagging works as a method of increasing accuracy; it is also called bootstrap aggregating and is based on random sampling with replacement. Bagging is a machine learning ensemble algorithm.
In (Bifet et al., 2009) the authors presented work on mining data streams, providing a data mining framework to which various techniques are applied.
(Dietterich, 2002) described three types of problems associated with ensembles of base learning algorithms: the statistical problem, the computational problem, and the representational problem.
(Dr. Bhatia and Deepika Khurana, 2013) presented the k-means clustering algorithm. The research presents a comparative study of a modified k-means clustering algorithm and the original k-means clustering algorithm. These algorithms are executed on different datasets, and results are obtained using both the original k-means and the modified k-means algorithms. The research used MATLAB R2009b. The results are evaluated on several performance measures, such as the number of misclassified points, accuracy, the Silhouette validity index, the number of iterations and the execution time.
(Hemalatha and Saranya, 2011) presented a survey of spatial data mining and its different tasks. The authors focused on the unique features that distinguish spatial data mining from traditional data mining, and described the applications, techniques, issues and challenges of spatial data mining. They presented spatial data analysis as a distinct task for the research area.
(Dzeroski and Zenko, 2004) worked on ensemble learning techniques. These techniques create a meta-classifier by combining several classifiers, typically by voting, with the ensemble built on the same data; ensemble learning improves the performance of the individual classifiers.
In this literature (Ester et al., 2001) the authors presented a database-oriented framework for spatial data mining. A small set of basic operations on graphs and paths was discussed as database primitives for spatial data mining, and techniques for implementing them on top of a commercial DBMS were presented. The research covers the main tasks of spatial data mining: spatial classification, spatial characterization, spatial clustering and spatial trend detection. For each of these tasks, the authors presented algorithms and prototypical applications, and they indicated interesting directions for future work. Since the system overhead imposed by the DBMS is rather large, concepts and techniques for improving the efficiency should be investigated; for example, techniques for processing sets of objects, which provide more information to the DBMS, can be used to improve the overall efficiency of the mining algorithms.

FCM is based on fuzzy logic. (Elena, 2013) presented a survey of fuzzy logic theory as applied in cluster analysis; in this work the author reviewed the Fuzzy C-Means clustering method in MATLAB.

(Ghosh and Dubey, 2013) presented two important clustering algorithms, K-Means and FCM (Fuzzy C-Means), and carried out a comparative study between them.
In (Jain and Gajbhiye, 2012) the authors presented the competitive neural network, the k-means clustering algorithm, the Fuzzy C-Means algorithm and one original method, Quantum Clustering. The main aim of this research was to compare the four clustering techniques on multivariate data. The authors introduced an easy-to-use tool that compares these clustering methods within the same framework.
In (Kalyankar and Alaspurkar, 2013) the authors presented a large number of data mining concepts and methods, such as classification, clustering, hyperclique patterns and many types of algorithms. The main aim of this research is to study weather data using data mining techniques and methods such as clustering.
In (Kavitha and Sarojamma, 2012) the authors presented how disease management programs are beginning to encompass providers across the healthcare continuum. In this article the authors describe a diabetic monitoring platform, supporting Diabetes (D) diagnosis assessment and offering disease-management functions, based on the CART method. The work explains a decision tree for diabetes diagnostics and shows how to use it as a basis for generating knowledge base rules.
In (Lavrac and Novak, 2013) the research first outlines relational data mining approaches and subgroup discovery. The research then describes recently developed approaches to semantic data mining which facilitate the use of domain ontologies as background knowledge in data analysis. The techniques and tools are illustrated on selected biological applications.
The authors in (Lai and Tsai, 2012) presented preliminary results on the validation and risk assessment of landslide events induced by heavy torrential rains in the Shimen reservoir watershed of Taiwan. In this research the authors used spatial analysis and data mining algorithms, focusing on eleven factors such as the Digital Elevation Model (DEM), slope, aspect, curvature, NDVI (Normalized Difference Vegetation Index), geology, soil, land use, fault, river and road.

2.4 Title: Monitoring of Diabetes with Data Mining via CART Method
Author: Kavitha K, Sarojamma R M

Disease Management Programs are beginning to encompass providers across the


healthcare continuum, including home health care. The premise behind disease
management is that coordinated, evidence-based interventions can be applied to the care
of patients with specific high-cost, high-volume chronic conditions, resulting in
improved clinical outcomes and lower overall costs. The paper presents an approach to
designing a platform to enhance effectiveness and efficiency of health monitoring using
DM for early detection of any worsening in a patient’s condition. In this article, we
briefly describe the diabetic monitoring platform we designed and realized which
supports Diabetes (D) diagnosis assessment offering functions of DM based on the
CART method. The work also gives a description of constructing a decision tree for
diabetes diagnostics and shows how to use it as a basis for generating knowledge base
rules. The system developed achieved an accuracy of 96.39% and a precision of 100.00% in detecting diabetes; these preliminary results were achieved on public databases of signals to improve their reproducibility. Finally, the article also outlines how an
intelligent diagnostic system works. Clinical trials involving local patients are still
running and will require longer experimentation.

Advantages
 It can generate regression trees, where each leaf predicts a real number and not
just a class.
 CART can identify the most significant variables and eliminate non-significant
ones by referring to the Gini index.

Disadvantages
 CART may have an unstable decision tree. When learning sample is modified
there may be increase or decrease of tree complexity, changes in splitting
variables and values.
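As a hedged illustration of the CART idea only (not the system described in the paper), the sketch below fits a small tree with scikit-learn, whose DecisionTreeClassifier implements CART-style trees with the Gini index as the default splitting criterion; the patient features and labels are entirely synthetic.

from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical patient records: [fasting glucose, body mass index]
X = [[85, 22], [92, 27], [150, 31], [165, 35], [110, 24], [180, 40]]
y = [0, 0, 1, 1, 0, 1]   # 1 = diabetes, 0 = no diabetes (made-up labels)

tree = DecisionTreeClassifier(criterion="gini", max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["glucose", "bmi"]))   # human-readable rules
print(tree.predict([[140, 30]]))                             # classify a new record

The printed rules can then be read off as knowledge base rules, which mirrors the use of the decision tree described in the abstract above.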
2.5 Title: Verification and Risk Assessment for Landslides in the Shimen Reservoir
Watershed of Taiwan Using Spatial Analysis and Data Mining
Author: J-S. Lai, F. Tsai

Spatial information technologies and data can be used effectively to investigate


and monitor natural disasters contiguously and to support policy- and decision-making
for hazard prevention, mitigation and reconstruction. However, in addition to the vastly
growing data volume, various spatial data usually come from different sources and with
different formats and characteristics. Therefore, it is necessary to find useful and
valuable information that may not be obvious in the original data sets from numerous
collections. This paper presents the preliminary results of a research in the validation
and risk assessment of landslide events induced by heavy torrential rains in the Shimen
reservoir watershed of Taiwan using spatial analysis and data mining algorithms. In this
study, eleven factors were considered, including elevation (Digital Elevation Model,
DEM), slope, aspect, curvature, NDVI (Normalized Difference Vegetation Index), fault,
geology, soil, land use, river and road. The experimental results indicate that overall
accuracy and kappa coefficient in verification can reach 98.1 % and 0.8829,
respectively. However, the DT model after training is too over-fitting to carry
prediction. To address this issue, a mechanism was developed to filter uncertain data by
standard deviation of data distribution. Experimental results demonstrated that after
filtering the uncertain data, the kappa coefficient in prediction substantially increased by 29.5%. The results indicate that spatial analysis and the data mining algorithm, combined with the mechanism developed in this study, can produce more reliable results for the verification and forecasting of landslides in the study site.

2.6 Title: Comparative Analysis of K-Means and Fuzzy C-Means Algorithms


Author: Soumi Ghosh, Sanjay Kumar Dubey

In the arena of software, data mining technology has been considered as useful
means for identifying patterns and trends of large volume of data. This approach is
basically used to extract the unknown pattern from the large set of data for business as
well as real time applications. It is a computational intelligence discipline which has
emerged as a valuable tool for data analysis, new knowledge discovery and autonomous
decision making. The raw, unlabeled data from a large volume of data can initially be classified in an unsupervised fashion by using cluster analysis, i.e. clustering: the assignment of a set of observations into clusters so that observations in the same cluster may in some sense be treated as similar.
efficiency of its domain application are generally determined through algorithms. There
are various algorithms which are used to solve this problem. In this research work two
important clustering algorithms namely centroid based K-Means and representative
object based FCM (Fuzzy C-Means) clustering algorithms are compared. These
algorithms are applied and performance is evaluated on the basis of the efficiency of
clustering output. The numbers of data points as well as the number of clusters are the
factors upon which the behaviour patterns of both the algorithms are analyzed. FCM
produces close results to K-Means clustering but it still requires more computation time
than K-Means clustering.

Advantages
 KM is relatively efficient in computational time complexity with its cost

Disadvantages
 KM may not be successful to find overlapping clusters
 It is not invariant to non-linear transformations of data.
 KM also fails to cluster noisy data and non-linear datasets.
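The following NumPy sketch contrasts the two update rules compared above: hard assignments for k-means versus soft memberships for FCM. The random data, the fuzzifier m = 2 and the fixed number of iterations are assumptions made for illustration, not the setup evaluated by the cited authors.

import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])
k, m, iters = 2, 2.0, 20              # clusters, fuzzifier, iterations

# k-means: hard assignment to the nearest centre, centre = mean of its points.
centers = X[rng.choice(len(X), k, replace=False)]
for _ in range(iters):
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])

# FCM: soft membership u, centre = membership-weighted mean of all points.
c = X[rng.choice(len(X), k, replace=False)]
for _ in range(iters):
    d = np.linalg.norm(X[:, None, :] - c[None, :, :], axis=2) + 1e-9
    u = 1.0 / (d ** (2 / (m - 1)) * np.sum(d ** (-2 / (m - 1)), axis=1, keepdims=True))
    w = u ** m
    c = (w.T @ X) / w.sum(axis=0)[:, None]

print("k-means centres:\n", centers)
print("FCM centres:\n", c)

Because FCM updates a full membership matrix on every iteration, its per-iteration cost is higher than that of k-means, which is consistent with the computation-time observation in the abstract above.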

2.7 Title: Data Mining Technique to Analyse the Metrological Data


Author: Meghali A. Kalyankar, S. J. Alaspurkar

Data mining is the process of discovering new patterns from large data sets; this technology is employed in inferring useful knowledge that can be put to use from a vast amount of data. Various data mining techniques such as classification, prediction, clustering and outlier analysis can be used for this purpose. Weather data is one kind of meteorological data that is rich in important knowledge. Meteorological data mining is
a form of data mining concerned with finding hidden patterns inside largely available
meteorological data, so that the information retrieved can be transformed into usable
knowledge. We know sometimes Climate affects the human society in all the possible
ways. Knowledge of weather data or climate data in a region is essential for business,
society, agriculture and energy applications. The main aim of this paper is to overview
on Data mining Process for weather data and to study on weather data using data
mining technique like clustering technique. By using this technique we can acquire
Weather data and can find the hidden patterns inside the large dataset so as to transfer
the retrieved information into usable knowledge for classification and prediction of
climate condition. We discussed how to use a data mining technique to analyze the
Metrological data like Weather data.

Advantages
 Data Logger can automatically collect data on a 24-hour basis
CHAPTER - 3
PRE-PROCESSING IN INFORMATION RETRIEVAL

3.1 Domain administrators


Knowledge can be gained manually by asking an expert or administrator in the specific domain. This person can provide the required information about the manifest: which knowledge is needed, and a priority order for each item.

In this specific case, gaining knowledge and retrieving it via domain experts has some disadvantages. For these experts it is a time-consuming job to provide knowledge to other people or systems. In the case of an expert providing knowledge to another person, the domain knowledge should be provided in a consumable format, so that a non-expert can work with the provided knowledge to retrieve rich information from a manifest. In practice the domain knowledge can change, which is an extra challenge for the domain experts. This study assumes that a domain expert provides knowledge in a machine-readable format as input for finding a corpus (structure) of the manifest. It is important to have readable human-machine communication. An ontology is very helpful in creating a compromise and an agreement in communication about the meaning of the given domain knowledge.

3.2 Extracting algorithms


Delegating these processes and responsibilities raises the need for a machine-processable form. A machine (an extracting algorithm) reads a set of manifests and tries to extract domain knowledge, which is then used to find a corpus out of the set of manifests. An advantage is that the domain experts have a lower workload in providing knowledge; their task can change from an executing task to a controlling one, checking the outcome of these algorithms and giving them feedback so that unseen data will be handled better in the future. This is only the case in supervised learning. Another approach can be that domain experts only adjust the outcome of a domain-knowledge-gaining algorithm.

Regular Expressions
The domain knowledge is in this particular case represented by regular expressions. Regular expressions are very powerful because they represent a search pattern in a manifest.
This format of domain knowledge is used as input for the genetic algorithms. In the GRID case, key-value pairs are used for analyzing log files; see the table of regular expressions below. Log files mostly have a hierarchical structure containing many key-value pairs. Using domain knowledge represented by regular expressions as input for corpus-creating algorithms is very powerful: the corpus consists of multiple regular expressions, and since regular expressions are search patterns, the corpus can be queried.

Key    Value (regular expression)                                                                Log file line example
Time   TIME\p{Punct}*\s*([\w+\s*:]*)                                                             TIME: Sun Dec 7 04:02:09 2008
PID    PID\p{Punct}*\s*(\w{1,})                                                                  PID: 13048
IP     ([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])   Got connection 128.142.173.156

Table: Regular expressions
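A small Python sketch of how these key/value patterns could be applied to the example log lines is given below; since Python's re module does not support the Java-style \p{Punct} class, it is approximated here with [^\w\s], so the patterns are an adaptation rather than the exact thesis patterns.

import re

DOMAIN_KNOWLEDGE = {
    "Time": re.compile(r"TIME[^\w\s]*\s*([\w+\s*:]*)"),
    "PID":  re.compile(r"PID[^\w\s]*\s*(\w{1,})"),
    "IP":   re.compile(r"([01]?\d\d?|2[0-4]\d|25[0-5])\."
                       r"([01]?\d\d?|2[0-4]\d|25[0-5])\."
                       r"([01]?\d\d?|2[0-4]\d|25[0-5])\."
                       r"([01]?\d\d?|2[0-4]\d|25[0-5])"),
}

def extract_rois(line):
    """Return the regions of interest found in a single log line."""
    rois = {}
    for key, pattern in DOMAIN_KNOWLEDGE.items():
        match = pattern.search(line)
        if match:
            # the IP pattern captures four octets, so re-join them with dots
            rois[key] = ".".join(match.groups()) if key == "IP" else match.group(1)
    return rois

for line in ["TIME: Sun Dec 7 04:02:09 2008",
             "PID: 13048",
             "Got connection 128.142.173.156"]:
    print(line, "->", extract_rois(line))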

3.3 Applications of Text Mining: Information Retrieval


The concept of information retrieval (IR) has been developed alongside database systems for many years. Information retrieval is the organization and retrieval of information from a large number of text-based documents. Information retrieval and database systems each handle different kinds of data; some database system problems are usually not present in information retrieval systems, such as concurrency control, recovery and transaction management, because IR works on unstructured or semi-structured data sets such as emails, HTML files and full-text documents. Text mining is used for finding new, previously unknown information in different written resources.
Structured data is data that resides in a fixed field within a record or file; such data is contained in relational databases and spreadsheets. Unstructured data usually refers to information that does not reside in a traditional row-column database and is the opposite of structured data. Semi-structured data is data that is neither raw data nor typed data in a conventional database system. Text mining is a new area of computer science research that tries to solve the issues that occur in the areas of data mining, machine learning, information extraction, natural language processing, information retrieval, knowledge management and classification. The figure below gives an overview of the text mining process.

Fig: Text Mining Process

Also, some common information retrieval problems are usually not encountered in conventional database systems, such as unstructured documents, approximate search based on keywords, and the concept of relevance. Due to the huge quantity of text information, information retrieval has found many applications. There exist many information retrieval systems, such as online library catalog systems, online document management systems, and the more recently developed Web search engines.

3.4 Information Extraction


The information extraction method identifies key words and relationships within
the text. It does this by looking for predefined sequences in the text, a process called
pattern matching. The software infers the relationships between all the identified places, people, and times to provide the user with meaningful information.
This technology is very useful when dealing with large volumes of text.
Traditional data mining assumes that the information being "mined" is already in the
form of a relational database. Unfortunately, for many applications, electronic
information is only available in the form of free natural language documents rather than
structured databases

Fig: Process of Text Extraction


Categorization
Categorization involves identifying the main themes of a document by inserting
the document into a predefined set of topics. When categorizing a document, a
computer program will often treat the document as a "bag of words." It does not try to
process the actual information as information extraction does. Rather, the categorization
only counts words that appear and, from the counts, identifies the main topics that the
document covers. Categorization often relies on a glossary for which topics are predefined, and relationships are identified by looking for broad terms, narrower terms, synonyms, and related terms.
3.5 Pre-processing Method
Pre-processing method plays a very important role in text mining techniques and
applications. It is the first step in the text mining process. This work discusses three key steps of pre-processing, namely stop word removal, stemming and TF/IDF.

Fig: Text Mining Pre-Processing Techniques


3.5.1 Extraction
This method is used to tokenize the file content into individual words.

3.5.2 Stop Words Elimination


Stop words are a division of natural language. The motivation for removing stop words from a text is that they make the text look heavier and less important for analysts. Removing stop words reduces the dimensionality of the term space. The most common words in text documents are articles, prepositions, pronouns, etc., which do not carry the meaning of the documents. These words are treated as stop words. Examples of stop words are: the, in, a, an, with, etc. Stop words are removed from documents because those words are not counted as keywords in text mining applications.

3.5.3 Stop word removal methods


Four types of stop word removal methods are described below; these methods are used to remove stop words from the files.
 The Classic Method: The classic method is based on removing stop words
obtained from pre-compiled lists.

 Methods based on Zipf's Law (Z-Methods): In addition to the classic stop list, we use three stop word creation methods inspired by Zipf's law: removing the most frequent words (TF-High), removing words that occur only once, i.e. singleton words (TF1), and removing words with a low inverse document frequency (IDF).

 The Mutual Information Method (MI)


o The mutual information method (MI) is a supervised method that works
by computing the mutual information between a given term and a
document class (e.g., positive, negative), providing a suggestion of how
much information the term can tell about a given class. Low mutual
information suggests that the term has a low discrimination power and
consequently it should be removed.

 Term Based Random Sampling (TBRS)


o This method was first proposed by Lo et al. (2005) to detect the stop words in web documents automatically. It works by iterating over separate, randomly selected chunks of data and then ranking the terms in each chunk based on their informativeness values, using the Kullback-Leibler divergence measure as shown in Equation (1):

d_x(t) = P_x(t) . log2( P_x(t) / P(t) )        (1)

where P_x(t) is the normalized term frequency of a term t within a chunk x, and P(t) is the normalized term frequency of t in the entire collection. The final stop list is then constructed by taking the least informative terms in all chunks, removing all possible duplications.
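A small sketch of the TBRS ranking in Equation (1) is shown below for a single, randomly chosen chunk; the toy corpus and chunk size are assumptions, and in practice the ranking would be aggregated over many chunks before building the stop list.

import math
import random
from collections import Counter

corpus = ("the system logs the time and the pid of the job "
          "the agent reads the log and the agent extracts the ip").split()

chunk_size = 10
start = random.randrange(0, len(corpus) - chunk_size)
chunk = corpus[start:start + chunk_size]

corpus_counts = Counter(corpus)
chunk_counts = Counter(chunk)

def d_x(term):
    """Equation (1): d_x(t) = P_x(t) * log2(P_x(t) / P(t))."""
    p_x = chunk_counts[term] / len(chunk)      # normalized frequency in the chunk
    p = corpus_counts[term] / len(corpus)      # normalized frequency in the collection
    return p_x * math.log2(p_x / p)

ranked = sorted(set(chunk), key=d_x)           # least informative terms first
print("stop word candidates:", ranked[:3])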

3.5.4 Stemming
This method is used to identify the root/stem of a word. For example, the words
connect, connected, connecting, connections all can be stemmed to the word "connect"
[6]. The purpose of this method is to remove various suffixes, to reduce the number of
words, to have accurately matching stems, to save time and memory space

Fig : Stemming Process
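A short stemming sketch using NLTK's Porter stemmer follows (assuming the nltk package is available); the word list mirrors the "connect" example above and is illustrative only.

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["connect", "connected", "connecting", "connections"]:
    print(word, "->", stemmer.stem(word))
# all four words reduce to the same stem, "connect"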

3.5.5 Tokenization
Tokenization is a critical activity in any information retrieval model; it segregates all the words, numbers and other characters from a given document, and these identified words, numbers and other characters are called tokens. Along with token generation, this process also evaluates the frequency of every token present in the input documents. The phases of the tokenization process are as follows. In pre-processing, the set of all documents is gathered and passed to the word extraction phase, in which all words are extracted. In the next phase all infrequent words are listed and removed, for example words with a frequency of less than two. The intermediate results are then passed to the stop word removal phase, which removes those English words that are useless in information retrieval; such words are known as stop words, for example "the, as, of, and, or, to", etc. This phase is essential in tokenization because it has clear advantages: it reduces the size of the indexing file and improves the overall efficiency and effectiveness.
The next phase in tokenization is stemming. The stemming phase is used to extract the sub-part, called the stem or root, of a given word. For example, the words continue, continuously and continued can all be rooted to the word continue. The main role of stemming is to remove various suffixes, resulting in a reduction of the number of words, to obtain exactly matching stems, and to minimize storage requirements and maximize the efficiency of the IR model.

Fig: stemming Process

On completion of the stemming process, the next step is to count the frequency of each word. Information retrieval works on the output of this tokenization process to produce the most relevant results for the given users. Consider the following example:
Input:
"This is an Information retrieval model and it is widely used in the data mining application areas. In our project there is team of 2 members. Our project name is IR."
Output:
Words = this<1> is<4> an<1> Information<1> retrieval<1> model<1> and<1> it<1> widely<1> used<1> in<2> the<1> data<1> mining<1> application<1> areas<1> our<2> project<2> there<1> team<1> of<1> members<1> name<1> IR<1>

Numbers = 2<1>
If this document is passed to the tokenization process, the output separates the words this, is, an, information, etc. If there is a number, it is also separated from the other words and numbers, and finally the tokens are returned together with their occurrence counts in the given document.
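The sketch below reproduces this tokenization and counting step in Python; the tokenizer (letters and digits only, with lower-cased counting) is a simplifying assumption.

import re
from collections import Counter

document = ("This is an Information retrieval model and it is widely used in "
            "the data mining application areas. In our project there is team "
            "of 2 members. Our project name is IR")

tokens = re.findall(r"[A-Za-z]+|\d+", document)            # word and number tokens
counts = Counter(token.lower() for token in tokens)        # frequency of each token

words = {t: c for t, c in counts.items() if not t.isdigit()}
numbers = {t: c for t, c in counts.items() if t.isdigit()}
print("Words =", words)      # e.g. 'is' occurs 4 times, 'in' occurs 2 times
print("Numbers =", numbers)  # {'2': 1}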

3.6 Proposed Algorithm and Example


In the proposed algorithm, tokenization is done based on a set of training vectors which are initially provided to the algorithm to train the system. The training documents are from different knowledge domains and are used to create vectors. The created vectors help the algorithm to process the input documents. The tokenization of documents is performed with respect to the vectors; the use of vectors before tokenization helps to make the whole tokenization process more precise and successful. The effect of the vectors on tokenization is also shown in the results section, where the number of tokens generated and the time consumed by the process differ significantly.
Input: (Di)
Output: (Tokens)
Begin
Step 1: Collect input documents (Di), where i = 1, 2, 3, ..., n;
Step 2: For each input Di: Extract_Word(EWi) = Di;
// apply the word extraction process to all documents i = 1, 2, 3, ..., n and extract the words //
Step 3: For each EWi:
        Stop_Word(SWi) = EWi;
        // apply the stop word elimination process to remove all stop words like is, am, to, as, etc. //
        Stemming(Si) = SWi;
        // create the stem of each word, e.g. "use" is the stem of user, using, usage, etc. //
Step 4: For each Si: Freq_Count(WCi) = Si;
// count the total number of occurrences of each stem Si //
Return (Si);
Step 5: Tokens (Si) will be passed to an IR system.
End
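A runnable Python sketch of Steps 1-5 is given below; the stop word list, the example documents and the use of NLTK's Porter stemmer are assumptions standing in for the full prototype, not the exact thesis implementation.

import re
from collections import Counter
from nltk.stem import PorterStemmer

STOP_WORDS = {"is", "am", "to", "as", "a", "an", "the", "for", "of", "and",
              "it", "in", "our", "there", "on", "by", "be", "can", "are"}
stemmer = PorterStemmer()

def tokenize(documents):
    """Steps 1-5: collect documents, extract words, remove stop words, stem, count."""
    tokens = Counter()
    for doc in documents:                                     # Step 1: documents Di
        words = re.findall(r"[A-Za-z]+", doc.lower())         # Step 2: Extract_Word
        kept = [w for w in words if w not in STOP_WORDS]      # Step 3: Stop_Word
        stems = [stemmer.stem(w) for w in kept]               # Step 3: Stemming
        tokens.update(stems)                                  # Step 4: Freq_Count
    return tokens                                             # Step 5: tokens for the IR system

docs = ["Military is a good option for a career builder for youngsters.",
        "Cricket is the most popular game in India."]
print(tokenize(docs).most_common(5))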
Example Phase 1:

Doc1 Military is a good option for a career builder for youngsters. Military is not
covering only defense it also includes IT sector and its various forms are
Army, Navy, and Air force. It satisfies the sacrifice need of youth for their
country.
Doc2 Cricket is the most popular game in India. In crocket a player uses a bat to hit
the ball and scoring runs. It is played between two teams; the team scoring
maximum runs will win the game.
Doc3 Science is the essentiality of education, what we are watching with our eyes
Happening non-happening all include science. Various scientists working on
different topics help us to understand the science in our lives. Science is
continuous evolutionary study, each day something new is determined.
Doc4 Engineering makes the development of any country, engineers are
manufacturing beneficial things day by day, as the professional engineers of
software develops programs which reduces man work, civil engineers gives
their knowledge to construction to form buildings, hospitals etc. Everything
can be controlled by computer systems nowadays.

Table: Documents to the tokenization process

Phase 2:
In this phase, all the words are extracted from these four documents as shown below:

Name: doc1
[Military, is, a, good, option, for, a, career, builder, for, youngsters, Military, is, not,
covering, only, defense, it, also, includes, IT, sector, and, its, various, forms, are, Army,,
Navy,, and, Air, force., It, satisfies, the, sacrifice, need, of, youth, for, their, country.]

Name: doc2
[Cricket, is, the, most, popular, game, in, India., In, crocket, a, player, uses, a,
bat, to, hit, the, ball, and, scoring, runs., It, is, played, between, two, teams;, the, team,
scoring, maximum, runs, will, win, the, game.]

Name: doc3
[Science, is, the, essentiality, of, education,, what, we, are, watching, with, our,
eyes, happening, non-happening, all, include, science., Various, scientists, working, on,
different, topics,help, us, to, understand, the, science, in, our, lives., Science, is,
continuous, evolutionary, study,, each, day, something, new, is, determined.]
Name: doc4
[Engineering, makes, the, development, of, any, country,, engineers, are,
manufacturing, beneficial, things, day, by, day,, as, the, professional, engineers, of,
software, develops, programs, which, reduces, man, work,, civil, engineers, gives, their,
knowledge, to, construction, to, form, buildings,, hospitals, etc., Everything, can, be,
controlled, by, computer, systems, nowadays.]

Phase 3 and Phase 4:


After extracting all the words, the next phases are to remove all stop words and to apply stemming, as shown below:

Name: doc1
[militari, good, option, for, career, builder, for, youngster, militari, not, cover,
onli, defens, it, also, includ, it, sector, it, variou, form, ar,armi, navi, air, forc, it, satisfi,
sacrific, need, youth, for, their, country]

Name: doc2
[Cricket, most, popular, game, in, India, in, crocket, player, us, bat, to, hit, ball,
score, run, it, plai, between, two, team, team, score, maximum, run, win, game]
Fig: Document Tokenization Graph

Fig: Overall-Time Graph

Tokenization with pre-processing leads to an effective and efficient approach: as shown in the results, the strategy with pre-processing processes 100 input documents and generates 200 distinct and accurate tokens in 156 ms, while processing the same set of documents with the other strategy takes 289 ms and generates more than 300 tokens.

Name: doc3
[scienc, essenti, educ, what, we, ar, watch,our, ey, happen, non, happen, all,
includ, scienc,variou, scientist, wor, on, differ, topic, help, to, understand, scienc, in,
our, live, scienc, continu, evolutionari, studi, each, dai, someth, new, determin]

Name: doc4
[engine, make, develop, ani, countri, engin, ar, manufactur, benefici, thing, dai,
by, dai, profession, engin, softwar, develop, program, which, reduc, man, work, civil,
engin, give, their, knowledg, to, construct, to, form, build, hospit, etc, everyth, can, be,
control, by, comput, system, nowadai]
CHAPTER - 4
BIODIVERSITY

There is general agreement that biodiversity consists of three main types, or levels, of diversity, as depicted in the figure below. The first type is genetic diversity, the second is species diversity, and the last is ecosystem diversity, or ecological diversity. These are the three levels at which biological variety has been identified.

Fig: Types of diversity: Genetic (inner), Species (middle), and Ecosystem (outer)

Genetic diversity refers to the genetic variation and heritable traits within
organisms. All species are related with other species through a genetic network, but the
variety of genetic properties and features makes creatures different in their morphologic
characteristics. Genetic diversity applies to all living organisms having inheritance of
genes, including the amount of DNA per cell and chromosome structures. Genetic
diversity is an important factor for the adaptation of populations to changing
environments and the resistance to certain types of diseases. For species, a higher
genetic variation implies less risk. It is also essential for species evolution.
Species diversity refers to the variety of living organisms within an ecosystem,
a habitat or a region. It is evaluated by considering two factors: species richness and
species evenness. The first corresponds to the number of different species present in a
community per unit area, and the second to the relative abundance of each species in the
geographical area. Both factors are evaluated according to the size of populations or
biomass of each species in the area. Recent studies have shown relationships between
diversity within species and diversity among species. Species diversity is the most
visible part of biodiversity.
Ecosystem diversity refers to the variety of landscape of ecosystems in each
region of the world. An ecosystem is a combination of communities - associations of
species - of living organisms with the physical environment in which they live (e.g., air,
water, mineral soil, topography, and climate). Ecosystems vary in size and in every
geographic region there is a complex mosaic of interconnected ecosystems. Ecosystems
are environments with a balanced state of natural elements (water, plants, animals,
fungi, microbes, molecules, climate, etc.). Ecosystem diversity embraces the variety of
habitats and environmental parameters that occur within a region. Ecologists consider
biodiversity according to the following three interdependent primary characteristics:
Ecosystems composition, i.e., the variety and richness of inhabiting species, ecosystems
structure, i.e., the physical and three dimensional patterns of life forms, and ecosystems
function, i.e., biogeochemical cycles and evolving environmental conditions. Even
though many application tools have been developed, evaluating biodiversity still faces
difficulties due to the complexity of precise evaluations of these parameters. Hence, only
1.7 to 2 million species have been measured and officially identified around the world,
out of an estimated total of 5 to 30 million.

4.1 Environmental Issues

In nature, biodiversity is the key to keeping a natural balance under changing
environmental conditions. It provides services such as a consumption service, which
supplies natural resources to humans (e.g., food, clothing, housing, medicines); an
industrial production service, which supplies the productivity of forests to be used either
directly or indirectly (e.g., extracting chemicals from plants in the forest); and other,
non-consumptive, services including the maintenance of sustainable ecosystems (e.g.,
soil maintenance, nitrogen supply to the soil, synthesis of plant matter, humidity control).
All life on the planet needs nutrients and oxygen, which are the main factors for
survival. In particular, species depend on biodiversity resources produced by ecosystem
services. The ecosystem services can regulate climate changes, dispose of wastes,
recycle nutrients, filter and purify water, purify air, buffer against flooding, and
maintain soil fertility. Changes in environmental factors and ecosystems can thus
endanger life forms as reported in several scientific studies.
The report of the Global Biodiversity Outlook of the Convention on Biological
Diversity (CBD) highlights the speech of Ban Ki-moon, United Nations Secretary-General,
stating that "the consequences of this collective failure, if it is not quickly corrected,
will be severe for us. Biodiversity underpins the functioning of the ecosystems on which
we depend for food and fresh water, health and recreation, and protection from natural
disasters. Its loss also affects us culturally and spiritually. This may be more difficult to
quantify, but is nonetheless integral to our well-being". The loss of biodiversity
becomes a serious issue for the twenty-first century. Its loss has direct and indirect
negative effects on many factors connecting the elements of biodiversity as well as
ecosystems.
Regarding environmental factors, the loss of biodiversity means that the natural balance
between environmental conditions and the different types of diversities cannot be
conserved, which will affect stability of ecosystems. This can lead to climate changes,
such as the global warming reported in several scientific studies, and consequently to
natural disasters (landslides, floods, typhoons, cyclones, hurricanes, tsunamis, etc.).
Tourism factors are impacted by environmental factors: if ecosystems are affected, by
natural disasters or pollution for instance, tourism assets such as aesthetic
natural landscapes and historical places can be damaged, or even destroyed. For
example, the effect of pollution on the structure and environment of the city of Venice,
Italy, is known to endanger its buildings. This can impact tourism as natural landscapes
and historical places are attraction sites for tourists, which consequently contribute to
develop economical and social activities.
Human factors are linked to nutrients, oxygen, and other essential needs which
are produced from biodiversity resources. If the number of biodiversity resources is
decreased, the volume of vital products, such as food, water, plants and animals, will
also decrease. This can lead to concerns about human consumption and survivability. For
example, an increase in production costs can make it difficult for human
populations to have access to vital resources, such as medicine, food, etc.

Fig: Environmental Issues

Society factors are affected by biodiversity loss, as most parts of society's
infrastructures and livelihoods depend on the basic system and structure of nature. One
factor is natural productivity, which depends on land structures for agriculture and
irrigation, wood materials for building habitats, and natural and energy materials for
other forms of consumption.
For example, an important part of the people living in rural societies depends mainly
on the productivity of agriculture and livestock for their livelihood, while people living in
metropolitan areas need more biodiversity productivity as the demands for food, energy,
materials and other resources are increased (transportation, construction, consumption
products, etc.). Scarceness of resources can thus cause an augmentation in production
costs leading to a reduction of the part of the population that has access to these
resources.
Economic factors are impacted by the direct benefits and added values of natural
resources (e.g., food, bio-fuels and renewable energies, animals and fibers, wood
materials, bio-medical treatments). These resources contribute to the economic
exchanges between countries around the world through internal and external commerce.
Biodiversity loss, and the scarcity of resources, can affect populations from an
economic viewpoint; for instance, a large share of the human population (more than 60
percent) relies on bio-medication for primary health care. It can also lead to higher
production costs, implying more competition, financial crises, and other economy-related
issues.

4.2 Methodology
This research deals with the methodological steps adopted in the research study.
The research procedures followed are described under the following headings:

A. Selection of the Protein database and the Forest fire database.
B. Selection of data mining tools.
C. Use of tools in the research study, such as MATLAB and WEKA.
D. Selection of data mining algorithms.

4.3 Resources and Technologies for Biodiversity

This section is devoted to the presentation of the major information providers,
the different resources available, and the technologies used to represent data and
knowledge related to biodiversity and the environment.
Resources

The most prominent information providers that propose data and knowledge
repositories used for biodiversity and environmental studies are shown in Figure 2. Each
one provides contents related to different domains and categories depicted by the edges
in the schema. These providers propose contents of different types (documents,
databases, meta-data, spatio-temporal, etc.), in different categories (data, knowledge
and documentations) and for different domains of application. Two main categories of
resources are considered in this classification diagram. The Data category corresponds
to resources depicting facts about species (animal and plants) and environmental
conditions in specific areas. The Frameworks category corresponds to resources
depicting both tacit and formal knowledge related to biodiversity and environment
analytical application domains. Several providers, such as ESABII (East and Southeast
Asia Biodiversity Information Initiative), IUCN (International Union for Conservation
of Nature), and OECD (Organisation for Economic Co-operation and Development),
give access to both data and frameworks resources.
In the Biodiversity Policy knowledge domain, information concerns issues of
principles, regulations and agreements on biodiversity. Among these resources, we can
cite BioNET, CBD, ESABII, IUCN and UNEP, which provide information to serve and
support follow-up among botanists, biologists and researchers in "globalization". For example,
the Agreement of the United Nations Decade on Biodiversity 2011-2020 aims to
support and implement the Strategic Plan for Biodiversity.
The Environment domain refers to research results and repositories in this
area, intended for scientists and people who want to know the status of the environment
on Earth, especially biologists working on this issue. The major actors in this category are
BHL, BioNET, CBD, ECNC, IUCN, KNEU and UNEP, organizations which
regularly publish reports and results of studies on domains related to biodiversity and
the environment, such as the Protected Areas Presented as Natural Solutions to Global
Environmental Challenges at RIO +20 published by the IUCN or the Global
Environment Facility (GEF) published by the CBD organization.
The Economics domain corresponds to information from scientists about the
status of economics development, based on effects and values of ecosystem services.
The foremost information providers in this category are CBD, IUCN, OECD and
TEEB. Among reports and studies published by these organizations, we can cite
Restoring World's Forests Proven to Boost Local Economies and Reduce Poverty
by IUCN and Green Growth and Sustainable Development by OECD for example.
In the Health/Society domain, repositories supply knowledge referring to natural
resources, ecosystem services and their effects. This information has
been produced, and its validity demonstrated, by researchers from world-wide
organizations such as BHL, CBD, IUCN, KNEU and UNEP. These organizations
provide summaries and proposals such as, for example, Human Health and
Biodiversity proposed by CBD, Towards the Blue Society proposed by IUCN, and
Action for Biodiversity: Towards a Society in Harmony with Nature published by
CBD.
CHAPTER - 5
ROUGH SETS IN BIOINFORMATICS

Since biological data is today usually very large and multi-parameter, and since we
wanted to provide legible and easily interpretable models for such data, we had to turn
to appropriate machine learning techniques such as rough sets. Rough set theory is a
well-established approach. The theory is based on logic and Boolean reasoning and in particular on the upper and
lower approximations of a set, the approximation space, and the modeling of sets. The
theory provides tools for solving classification challenges. The tools produce models the
quality of which is comparable with other widely used methods. The added value,
which makes rough sets a preferred tool for classification problems, is the structure of
the classification model, that is minimal subsets of attributes sufficient for classification
and their combinations. In certain cases we would also like an ordering of the attributes.
Rough sets can be applied to incomplete, inconsistent and noisy data. Rough set
models consist of sets of minimal rules, which can be easily interpreted by the user.
Following the standard tutorial presentation, rough sets usually operate on decision systems. A decision
system is a table where each row represents an object and each column represents an
attribute (also known as a feature), and the last attribute is called the decision attribute.
Formally, we have a pair A = (U, A ∪ {d}), d ∉ A, where U is the universe of objects, A
is a set of attributes called conditions and d is the decision attribute (also known as the
classification outcome). Both U and A are non-empty and finite. Each object from U has
a value (which can be a missing value) for each attribute from A and for the decision attribute
d.

∀a ∈ A:  a : U → Va    (1)

where Va is the set of possible values for attribute a, i.e. the value set of a, and

d : U → Vd    (2)

where Vd is the set of possible values for the decision attribute d, i.e. the value set of d.
Table 1. An example of a decision system.

      Sex   Age    Height   Weight   APOE expression   Alzheimer risk
X1    M     > 50   < 170    < 60     t                 High
X2    F     < 50   < 170    > 60     0                 Low
X3    M     < 50   > 170    > 60     t                 Low
X4    M     > 50   < 170    < 60     t                 Low
X5    F     > 50   > 170    < 60     0                 Low
X6    F     > 50   > 170    > 60     0                 High
X7    F     < 50   < 170    > 60     t                 High
X8    F     > 50   > 170    < 60     0                 Low

Many biological tasks lend themselves to machine learning-driven classification.
Aspects of this approach have been described elsewhere. Here, we focus on legible and
interpretable models, which received less attention there. Frequently, the classification
outcome is known and thus we can use the approach of supervised learning. Once such
outcomes are available, it is possible to create a decision system. Table 1 shows a
sample decision system. The system contains information about patients, such as Sex or
Age, and a decision attribute, Alzheimer risk.
A decision system can carry redundant information. There may be many indiscernible
objects, and there can be many attributes that are not necessary from the classification point of
view. Let us start by defining the equivalence relation. R ⊆ X × X is called an
equivalence relation if and only if it is reflexive, symmetric and transitive. The
equivalence class of an element y, denoted [y], is defined as the set [y] = {x ∈ X | yRx}.
As one may notice, in our example the objects x5 and x8 are indiscernible, while
objects x1 and x4 are also indiscernible but have different decision values. To illustrate
the ability of rough sets to handle uncertainty explicitly, we introduce, for the purpose of
this summary presentation, a new class HighOrLow for the indiscernible objects with
different outcomes, x1 and x4 (Table 2).

Table 2. An example of a decision system with unique outcomes.

      Sex   Age    Height   Weight   APOE expression   Alzheimer risk
X1    M     > 50   < 170    < 60     t                 HighOrLow
X2    F     < 50   < 170    > 60     0                 Low
X3    M     < 50   > 170    > 60     t                 Low
X4    M     > 50   < 170    < 60     t                 HighOrLow
X5    F     > 50   > 170    < 60     0                 Low
X6    F     > 50   > 170    > 60     0                 High
X7    F     < 50   < 170    > 60     t                 High
X8    F     > 50   > 170    < 60     0                 Low

Given a decision system A = (U, A ∪ {d}), we define the C-indiscernibility relation
IND_A(C) for any subset of attributes C ⊆ A:

IND_A(C) = {(x, x') ∈ U² | ∀a ∈ C: a(x) = a(x')}    (3)

We call x and x' indiscernible using attributes C if (x, x') ∈ IND_A(C). We denote the
equivalence classes of the above relation as [x]_C. Employing equivalence classes and
set-theoretic operations, we may attempt to construct sets such as Low Alzheimer Risk
or High Alzheimer Risk. Sometimes such sets will be exact; at other times they will be
approximate. We have examples of such classes in Table 1. As we noticed previously,
objects x1 and x4 cannot be unambiguously assigned to any of these classes. Other
objects in the universe can be assigned strictly to one class. For example, objects x5 and x8,
having the same set of attribute values, both belong to class Low.
Given a decision system A = (U, A ∪ {d}), a subset of attributes C ⊆ A and a set of
objects X ⊆ U, the C-lower and C-upper approximations are constructed, respectively, as
follows:

C-lower(X) = {x | [x]_C ⊆ X}    (4)

and

C-upper(X) = {x | [x]_C ∩ X ≠ ∅}    (5)

The boundary region is now defined to be:

BN_C(X) = C-upper(X) − C-lower(X)    (6)

This means that objects that cannot have a uniquely assigned class label belong to the
boundary region. The boundary region covers the new decision class HighOrLow
shown in Table 2. If the boundary region for a decision system is not empty, the set
defined by the system is rough. In other words, if BN_C(X) = ∅, X is crisp; otherwise it
is rough. An example of a rough set is illustrated in the literature.
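To make the definitions above concrete, the following sketch computes equivalence classes, the lower and upper approximations and the boundary region for the decision system of Table 1. The data encoding and helper names are ours, not part of any rough set library; it is a minimal illustration of formulas (3)-(6), not a full rough set toolkit.

# Sketch: C-indiscernibility classes, lower/upper approximations and
# boundary region for the decision system of Table 1.
# Attribute encodings ('t', '0', '>50', ...) follow the table as printed.

# universe: object -> (condition attributes, decision)
TABLE1 = {
    "x1": ({"Sex": "M", "Age": ">50", "Height": "<170", "Weight": "<60", "APOE": "t"}, "High"),
    "x2": ({"Sex": "F", "Age": "<50", "Height": "<170", "Weight": ">60", "APOE": "0"}, "Low"),
    "x3": ({"Sex": "M", "Age": "<50", "Height": ">170", "Weight": ">60", "APOE": "t"}, "Low"),
    "x4": ({"Sex": "M", "Age": ">50", "Height": "<170", "Weight": "<60", "APOE": "t"}, "Low"),
    "x5": ({"Sex": "F", "Age": ">50", "Height": ">170", "Weight": "<60", "APOE": "0"}, "Low"),
    "x6": ({"Sex": "F", "Age": ">50", "Height": ">170", "Weight": ">60", "APOE": "0"}, "High"),
    "x7": ({"Sex": "F", "Age": "<50", "Height": "<170", "Weight": ">60", "APOE": "t"}, "High"),
    "x8": ({"Sex": "F", "Age": ">50", "Height": ">170", "Weight": "<60", "APOE": "0"}, "Low"),
}

def equivalence_classes(system, attrs):
    """Group objects that agree on every attribute in attrs (IND relation)."""
    classes = {}
    for obj, (cond, _) in system.items():
        key = tuple(cond[a] for a in attrs)
        classes.setdefault(key, set()).add(obj)
    return list(classes.values())

def approximations(system, attrs, target):
    """Return (lower, upper) approximations of the object set `target`."""
    lower, upper = set(), set()
    for eq in equivalence_classes(system, attrs):
        if eq <= target:          # wholly contained -> lower approximation
            lower |= eq
        if eq & target:           # overlaps target  -> upper approximation
            upper |= eq
    return lower, upper

if __name__ == "__main__":
    attrs = ["Sex", "Age", "Height", "Weight", "APOE"]
    low_risk = {o for o, (_, d) in TABLE1.items() if d == "Low"}
    lower, upper = approximations(TABLE1, attrs, low_risk)
    print("lower:", sorted(lower))            # includes x5 and x8
    print("boundary:", sorted(upper - lower)) # x1 and x4 fall in the boundary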

5.1 Network Algorithm

Algorithm: Generate Data Set
Input: Training Data, Testing Data
Output: Decision Value
Method:
Step 1: Load the data set
Step 2: Classify the features (attributes) based on the class labels
Step 3: Estimate the candidate support value
While (instances != null) Do
    Step 4: Support Value = similarity between each instance in the attribute;
            find the Total Error value
    Step 5: If any instance < 0 then
                estimate Decision Value = Support Value / Total Error
            End If
    Repeat for all points until the instance set is empty
End While
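The pseudocode above leaves several quantities (candidate support, similarity, total error) only loosely specified. The sketch below is one possible reading, under our own assumptions: "similarity" is taken as the average pairwise cosine similarity between instances, and "total error" as the summed absolute deviation from the per-attribute mean. It is meant only to illustrate the flow of the algorithm, not to reproduce the exact implementation.

# One possible reading of the "Generate Data Set" algorithm above.
# Assumptions (ours): "similarity" is the average pairwise cosine similarity
# of the instances, and "total error" is the sum of absolute deviations
# from the attribute means; the decision value is their ratio.
import numpy as np

def decision_value(instances):
    """instances: 2-D array, one row per instance, one column per attribute."""
    X = np.asarray(instances, dtype=float)
    # Step 4: support value = mean pairwise cosine similarity between instances
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    sims = (X @ X.T) / (norms @ norms.T)
    n = len(X)
    support = (sims.sum() - n) / (n * (n - 1))   # exclude self-similarities
    # total error = summed absolute deviation from the per-attribute mean
    total_error = np.abs(X - X.mean(axis=0)).sum()
    # Step 5: decision value = support / total error
    return support / total_error

if __name__ == "__main__":
    demo = [[1.0, 2.0, 3.0], [1.1, 2.1, 2.9], [0.9, 1.8, 3.2]]
    print(decision_value(demo))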

5.2 Classification Tree Algorithm


Algorithm: Generate a classification tree from the training tuples of data partition D.
Input:
 Data partition D, which is a set of training tuples and their associated class
labels;
 Attribute list, the set of candidate attributes;
 Attribute selection method, a procedure to determine the splitting criterion that
“best” partitions the data tuples into individual classes. This criterion consists
of a splitting attribute and, possibly, either a split point or a splitting subset.
Output: A decision tree

5.3 Method:
1. Create a node N;
2. If the tuples in D are all of the same class C, then
3. Return N as a leaf node labeled with the class C;
4. If the attribute list is empty, then
5. Return N as a leaf node labeled with the majority class in D;
6. Apply the attribute selection method (D, attribute list) to find the “best”
splitting criterion;
7. Label node N with the splitting criterion;
8. If the splitting attribute is discrete-valued and multiway splits are allowed, then
9. Attribute list ← attribute list − splitting attribute;
10. For each outcome j of the splitting criterion
11. Let Dj be the set of data tuples in D satisfying outcome j;
12. If Dj is empty, then
13. Attach a leaf labeled with the majority class in D to node N;
14. Else attach the node returned by Generate decision tree (Dj, attribute list) to
node N;
15. End for
16. Return N
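A compact sketch of the tree-growing procedure above is given below. It is a simplification under our own assumptions: attributes are discrete-valued only, information gain is used as the attribute selection method, and split points/splitting subsets are not handled.

# Sketch of the tree-growing procedure above, restricted to discrete-valued
# attributes and using information gain as the attribute selection method.
import math
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def best_attribute(data, attributes):
    """Attribute selection method: pick the attribute with the highest info gain."""
    base = entropy([label for _, label in data])
    def gain(attr):
        splits = {}
        for row, label in data:
            splits.setdefault(row[attr], []).append(label)
        remainder = sum(len(s) / len(data) * entropy(s) for s in splits.values())
        return base - remainder
    return max(attributes, key=gain)

def generate_decision_tree(data, attributes):
    """data: list of (attribute-dict, class-label) tuples."""
    labels = [label for _, label in data]
    if len(set(labels)) == 1:                          # steps 2-3: pure node
        return labels[0]
    if not attributes:                                 # steps 4-5: majority class
        return Counter(labels).most_common(1)[0][0]
    attr = best_attribute(data, attributes)            # step 6: "best" criterion
    node = {attr: {}}                                  # step 7: label node N
    remaining = [a for a in attributes if a != attr]   # step 9
    for value in {row[attr] for row, _ in data}:       # steps 10-14
        subset = [(row, label) for row, label in data if row[attr] == value]
        node[attr][value] = generate_decision_tree(subset, remaining)
    return node

if __name__ == "__main__":
    D = [({"Outlook": "sunny", "Windy": "no"}, "play"),
         ({"Outlook": "rain",  "Windy": "yes"}, "stay"),
         ({"Outlook": "sunny", "Windy": "yes"}, "play"),
         ({"Outlook": "rain",  "Windy": "no"}, "stay")]
    print(generate_decision_tree(D, ["Outlook", "Windy"]))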
CHAPTER - 6
EXPERIMENTS ON MICROARRAY DATA PATIENT SURVIVAL
PREDICTION
In this chapter, a new computational process for patient survival prediction using
microarray gene expression data will be presented. Different from all previous works, in
the first step, we carefully form the training set samples by selecting only short-term
survivors who died within a short period and long-term survivors who were still alive
after a relatively long follow-up time. This idea is motivated by our belief that short-
term and long-term survivors are more informative and reliable (than those cases in
between) for building and understanding the relationship between genes and patient
survival. In the second step, ERCOF is used to identify genes most associated with
survival. In the third step, a linear kernel support vector machine (SVM) is trained on
the selected samples and genes to build a scoring model. The model assigns each
validation sample a risk score to predict patient survival.
6.1 Methods
We will describe in detail the new method for patient survival prediction, focusing on
selecting an informative subset of training samples and building an SVM-based scoring
function.
6.1.1 Selection of informative training samples
One of the main features of our new method is to select informative training samples.
Since our focus is on the relationship between gene expression and survival, the
survival time associated with each sample plays an important role here—two types of
extreme cases, patients who died in a short period (termed “short-term survivors”)
and those who were still alive after a long period (termed “long-term survivors”), should be
more valuable than those in the “middle” status. Thus, we use only a part of the samples in
training, and this is clearly different from other approaches that use all training samples.
Formally, for a sample T, if its follow-up time is F(T) and its status at the end
of the follow-up time is E(T), then

T is a short-term survivor, if F(T) < c1 and E(T) = 1;
T is a long-term survivor,  if F(T) > c2;
T is other, otherwise.                                        (6.1)

Here E(T) = 1 stands for “dead” or an unfavorable outcome, E(T) = 0 stands for “alive” or
a favorable outcome, and c1 and c2 are two thresholds of survival time for selecting short-term
and long-term survivors. Note that long-term survivors also include those patients who
died after the specified long period.
The two thresholds, c1 and c2, can vary from disease to disease and from data set to
data set. For example, in the survival study of early-stage lung adenocarcinomas that
will be presented later, we choose short-term survivors as those who died within one
follow-up year (i.e., c1 is 1 year) and long-term survivors as those who were still alive after
five follow-up years (i.e., c2 is 5 years). There are in total 31 extreme training samples (10
short-term survivors and 21 long-term survivors) among a total of 86 available primary
lung adenocarcinomas. These 21 long-term survivors include 2 patients whose status at
the end of the follow-up time was “dead”, but whose follow-up times were 79.5 months and 84.1
months, respectively. Our basic guidelines for the selection of c1 and c2 are that the
informative subset should (1) contain enough training samples for learning algorithms
to learn (typically > 15 samples in each class, with the total between one third and one half
of all available samples), but (2) not contain too many samples, to avoid including non-
extreme cases.
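The selection rule of Formula (6.1) can be expressed directly in code. The sketch below assumes the follow-up time F(T) is given in years and the status E(T) as 1 (dead) or 0 (alive), with c1 and c2 supplied by the user; the function and variable names are ours.

# Sketch of the informative training sample selection in Formula (6.1).
# F(T) is the follow-up time in years, E(T) the status (1 = dead, 0 = alive);
# c1 and c2 are the short-term / long-term thresholds chosen per data set.
def select_informative_samples(samples, c1, c2):
    """samples: list of (sample_id, follow_up_years, status) tuples."""
    short_term, long_term = [], []
    for sample_id, follow_up, status in samples:
        if follow_up < c1 and status == 1:
            short_term.append(sample_id)    # died within a short period
        elif follow_up > c2:
            long_term.append(sample_id)     # alive (or dead) after a long period
        # everything else is a non-extreme case and is left out of training
    return short_term, long_term

if __name__ == "__main__":
    demo = [("p1", 0.6, 1), ("p2", 3.2, 0), ("p3", 6.5, 0), ("p4", 7.1, 1)]
    print(select_informative_samples(demo, c1=1, c2=5))
    # (['p1'], ['p3', 'p4'])  -- p2 is a non-extreme case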
After choosing the informative training samples, we apply ERCOF to them to
identify the genes most associated with survival status. With the selected samples and
genes, in the next step we build a scoring function to estimate the survival risk for
every patient.
6.1.2 Construction of an SVM scoring function
The regression scoring function proposed for survival risk estimation is based on
support vector machines (SVM), described in Section 2.3.2 of Chapter 2. Recall the
final discriminant function f(T) for a test sample T given in Formula (2.3) of Chapter 2.
If the linear kernel function is used, f(T) becomes a linear combination of the
expression values of the identified genes. In this study, we map the class label of “short-
term survivors” to 1 and “long-term survivors” to -1. Note that f(T) > 0 if the sample T is
more likely to be a “short-term survivor”, and f(T) < 0 if the sample T is more likely to be
a “long-term survivor”.
To normalize f(T), we use a transformation function s(T) defined as:

s(T) = 1 / (1 + e^(-f(T)))

Thus, f(T) is normalized by s(T) into the range (0, 1). Note that the smaller the
s(T) value is, the better the survival of the patient corresponding to sample T will be. We
term s(T) the risk score of T.
If one only categorizes patients into high risk and low risk groups, the value 0.5 is
a natural cutoff for s(T): if s(T) > 0.5, then the patient corresponding to sample
T has higher risk; otherwise, the patient has lower risk. If more than two risk
groups are considered — such as high, intermediate, and low — then other cutoffs can
be set based on the risk scores of the training samples. E.g., if in the training set most short-
term survivors have a risk score greater than r1 and most long-term survivors have a
risk score smaller than r2, then r1 and r2 can serve as the cutoffs for the high risk and low
risk groups, respectively.
Generally, r1 > 0.5 and r2 < 0.5, and they can be derived from the risk scores assigned
to the training samples.
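A minimal sketch of this scoring step is shown below. It assumes scikit-learn's linear-kernel SVC as a stand-in for whatever SVM implementation was actually used, ERCOF gene selection already done, and the example three-group cutoffs 0.3/0.7 mentioned later in the text; the function names are ours.

# Sketch of the SVM-based risk score: a linear SVM is trained on the extreme
# samples (+1 = short-term survivor, -1 = long-term survivor) and f(T) is
# squashed into (0, 1) with s(T) = 1 / (1 + exp(-f(T))).
import numpy as np
from sklearn.svm import SVC

def train_risk_model(X_train, y_train):
    """X_train: expression matrix of the selected genes; y_train in {+1, -1}."""
    model = SVC(kernel="linear")
    model.fit(X_train, y_train)
    return model

def risk_score(model, X):
    f = model.decision_function(X)      # positive -> more like a short-term survivor
    return 1.0 / (1.0 + np.exp(-f))     # s(T) in (0, 1); larger means higher risk

def risk_group(score, r_low=0.3, r_high=0.7):
    """Example three-group cutoffs; r_low and r_high are derived from training scores."""
    if score > r_high:
        return "high"
    if score < r_low:
        return "low"
    return "intermediate"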
To evaluate the results, after assigning patients into different risk groups, we
draw Kaplan-Meier plots [8] to compare the survival characteristics between groups.
6.1.3 Kaplan-Meier analysis
Kaplan-Meier analysis estimates a population survival curve from a set of samples. A
survival curve illustrates the fraction (or percentage) surviving at each time. Since in a
realistic clinical trial it often takes several years to accrue the patients for the trial,
patients being followed for survival will have different starting times. The patients
will then have various lengths of follow-up time when the results are analysed at a given time.
Therefore, the survival curve cannot be estimated simply by calculating the
fraction surviving at each time. For example, in the following study of lung
adenocarcinomas, the patients' follow-up times vary from 1.5 months to 110.6
months.
A Kaplan-Meier analysis allows estimation of survival over time, even when
patients drop out or are studied for different lengths of time [1]. For example, a
patient who is alive with 3 years of follow-up time should contribute to the survival data for the first
three years of the curve, but not to the part of the curve after that. Thus, this patient
should be mathematically removed from the curve at the end of the 3-year follow-up time;
this is called “censoring” the patient. On a Kaplan-Meier survival curve, when a
patient is censored, the curve does not take a step down as it does when a patient dies;
instead, a tick mark is generally used to indicate where a patient is censored, and each
death after that point causes a slightly larger step down on the curve. An
alternative way to indicate a censored patient is to show the number of remaining cases
“at risk” at several time points. Patients who have been censored or have died before the time
point are not counted as “at risk”. In Figure 6.1, picture (A) is a complete sample of a
Kaplan-Meier survival curve with a tick mark representing a censored patient (captured
from http://www.cancerguide.org/scurve_km.html), while picture (B) illustrates how to
calculate the fraction of survival at a given time (captured from [1]).
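A minimal sketch of the Kaplan-Meier product-limit estimate, handling censored patients as described above, is given below. In practice the curves in this chapter are drawn with GraphPad Prism; the implementation here is only an illustration of the estimator, and the function name is ours.

# Minimal Kaplan-Meier product-limit estimator.
# times:  follow-up time of each patient
# events: 1 if the patient died at that time, 0 if the patient was censored
def kaplan_meier(times, events):
    """Return a list of (time, survival fraction) points of the KM curve."""
    at_risk = len(times)
    survival = 1.0
    curve = []
    # walk through the distinct follow-up times in increasing order
    for t in sorted(set(times)):
        deaths   = sum(1 for ti, e in zip(times, events) if ti == t and e == 1)
        censored = sum(1 for ti, e in zip(times, events) if ti == t and e == 0)
        if deaths and at_risk:
            survival *= 1.0 - deaths / at_risk   # step down only at death times
            curve.append((t, survival))
        at_risk -= deaths + censored             # censored patients leave the risk set
    return curve

if __name__ == "__main__":
    follow_up = [1.5, 3.0, 3.0, 5.0, 8.0, 12.0]
    died      = [1,   0,   1,   1,   0,   0]
    print(kaplan_meier(follow_up, died))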
To compare the survival characteristics between different risk groups in our
survival prediction study, we draw the Kaplan-Meier survival curves of the risk groups in
one picture and use the logrank test to compare the curves. The logrank test generates a p-
value for testing the null hypothesis that there is no difference between the survival curves of
the two groups. The meaning of the p-value is: “if the null hypothesis is true, what is the
probability of randomly selecting samples whose survival curves are as different as
those actually obtained?”. In this chapter, all the Kaplan-Meier
Figure 6.1: Samples of Kaplan-Meier survival curves. (A) is an example of a Kaplan-
Meier survival curve. This group of patients has a minimum follow-up of a little over a
year. (B) is an illustration on how to calculate the fraction of survival at a time
Figure 6.2: A process diagram of patient survival study, including three training steps as
well as testing and results evaluation.
survival curves are generated by GraphPad Prism (http://www.graphpad.com), and we
always indicate the two-tailed p-value. Figure 6.2 shows a diagram of patient survival
prediction using our proposed method.
6.2 Experiments and Results
We apply the procedure of survival study above to two gene expression data sets.
6.2.1 Lymphoma
Survival after chemotherapy for diffuse large-B-cell lymphoma (DLBCL) patients was
previously studied by Rosenwald et al [102] using gene expression profiling and a Cox
proportional hazards model. In that study, expression profiles of biopsy samples from
240 patients were used [102]. The data include a preliminary group consisting of 160
patients and a validation group of 80 patients, each sample being described by 7399
microarray features.

Survival curves showing clear distinction


As an initial step, we pre-process the data to remove those genes that are absent in
more than 10% of the experiments in the preliminary group. There remain 4937
features after 2462 genes have been removed.
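This missing-value filter can be written as a one-step pandas sketch. The DataFrame layout assumed here (genes as rows, experiments as columns, NaN marking absent calls) is our assumption, not necessarily the format of the original data files.

# Sketch of the pre-processing step: drop genes absent (missing) in more than
# 10% of the experiments of the preliminary group.
import pandas as pd

def filter_absent_genes(expr: pd.DataFrame, max_missing_fraction: float = 0.10) -> pd.DataFrame:
    """expr: one row per gene, one column per experiment, NaN = absent call."""
    missing_fraction = expr.isna().mean(axis=1)   # per-gene fraction of absent calls
    return expr.loc[missing_fraction <= max_missing_fraction]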
Then, we select short-term survivors and long-term survivors to construct an
informative subset of training samples. For this study, we set c1 = 1 year and c2 = 8 years in
Formula (6.1). Among the preliminary 160-patient group, 47 short-term survivors (who
died within one follow-up year) and 26 long-term survivors (who were still alive after eight
follow-up years) are thus chosen. So, a total of 73 samples are in this informative subset
of training samples (46% of the preliminary group).
In the second step, we apply ERCOF to these 73 samples and identify 78 genes
that are related to patient survival status at the 5% significance level (for the Wilcoxon rank sum
test) and a 0.99 Pearson correlation coefficient threshold. Some of our selected genes are
also listed in Table 2 of [102], where these genes were found to be significantly
associated with survival (p < 0.01). E.g., AA805575 (GenBank accession number) is in
the germinal-center B-cell signature, X00452 and M20430 are in the MHC class II signature, and
D87071 is in the lymph-node signature. The gene signatures were formed by a hierarchical
clustering algorithm in [102]. Besides, some top-ranked genes (with smaller entropy
value) identified by ERCOF are also in one of these gene signatures. E.g., BC012161,
AF061729 and U34683 are in the proliferation signature, BF129543 is in the germinal-center
B-cell signature, and K01144 and M16276 are in the MHC class II signature.
In the third step, an SVM model is trained on the 73 extreme training samples
with the 78 identified features. We find that the well-learned linear kernel SVM can
separate the 47 short-term survivors and 26 long-term survivors completely — the
lowest risk score assigned to the short-term survivors is above 0.7 and most of the long-
term survivors have risk scores lower than 0.3. Then, we calculate risk scores for all the
other samples, namely the remaining (non-extreme) 87 samples in the original
preliminary group and the 80 samples in the validation group. These 167 samples are
treated as our test set.

(A) Test cases (p < 0.0001)    (B) All cases (p < 0.0001)

Figure 6.3: Kaplan-Meier plots illustrate the estimation of overall survival among
DLBCL patients in different risk groups in the testing set containing 167 samples (Panel (A)) and
all 240 samples (Panel (B)). The risk groups are formed based on our SVM-based scoring
function. A tick mark on the plot indicates that one sample is censored at the
corresponding time. The 5-year overall survival for the high risk versus low risk groups of
patients is 32% versus 91% for the testing samples, and 20% versus 95% for all samples.

We categorized patients into four risk groups using the thresholds 0.3, 0.5 and 0.7,
where the threshold 0.5 is the mean value of 0.7 and 0.3. The Kaplan-Meier
curves of overall survival are drawn in Figure 6.3, where we can see clear differences in
the five-year survival rates of the high risk and low risk groups, in both the testing sample
set (Panel (A)) and all samples (Panel (B)). Although we cannot see distinct overall
survival between the two intermediate groups, the 5-year survival rates of these two
groups are clearly different from those of the high risk group and the low risk group.
This also suggests that three or two risk groups would be sufficient for these DLBCL
samples. So, in the rest of this study, we simply merge high and intermediate-high risk
patients into a single high risk category, and low and intermediate-low risk patients into
a single low risk category.
Having the risk score, when a new case comes we are able to assign it to
the corresponding risk group easily. This kind of prediction was not addressed in [102],
where the DLBCL patients were ranked by their gene-expression-based outcome-
predictor score but divided into several groups with an equal number of samples. For
example, the 80 samples in the validation group were stratified according to the quartiles of
the scores, with each quartile consisting of 20 patients. With that kind of
categorization, one cannot find an explicit measure to evaluate a new case.
Comparison with International Prognostic Index
Various clinical features — such as stage, performance status, lactate dehydrogenase
levels — which are known to be strongly related to patient survival, have been
combined to form the International Prognostic Index (IPI) [113]. The IPI has been
effectively adopted to separate aggressive lymphomas into several groups with
significantly different responses to therapy and survival. Since IPI is only built on the
consideration of clinical factors, it provides little insight into disease biology [60].
The risk score obtained from our method is based on gene expression in biopsy
specimens of the lymphoma, so it is an independent predictor from IPI. In fact, we find
that patients in the high IPI group — and similarly for the intermediate and the low IPI
groups — when partitioned by our risk score into high risk and low risk categories, have
significantly different outcomes. In Figure 6.4, Kaplan-Meier plots show significant
difference on overall survival for our high risk and low risk groups among the patients
with IPI low (and similarly for intermediate and high) risk index. In particular, among
21 IPI high risk patients in our testing set, 15 of them are assigned by our method to the
high risk category and 6 of them to the low risk category. When we check the survival
status of these patients, we find 14 of the 15 patients belonging to our high risk category
are indeed dead while only 2 of the 6 patients belonging to our low risk category are
dead. Similarly, for all 32 patients in the whole data set with high IPI, 23 of them (22
dead) are assigned by our method to the high risk category and 9 (5 dead) of them are
assigned to low risk category. This suggests that our method may be a more effective
predictor of DLBCL survival outcome than the IPI.
6.2.2 Lung adenocarcinoma
Adenocarcinoma is the major histological subtype of non-small cell lung cancer
(NSCLC). There is a need to better predict tumor progression and clinical outcome in
lung adenocarcinoma. The lung adenocarcinoma data set contains 86 primary lung
adenocarcinomas. These experiments include 67 stages I and 19 stage III tumors, each
of them is described by 7129 genes. The data set was first analyzed in [14] where a risk
index was derived based on the top 50 good genes
Figure 6.4: Kaplan-Meier Estimates of survival among high risk and low risk DLBCL
patients (according to our method) in each IPI defined group. Plots (A), (B) and (C) are
based on 167 testing samples while (D), (E) and (F) are for all 240 cases.

Figure 6.5: Kaplan-Meier plots illustrate the estimation of overall survival among high
risk and low risk lung adenocarcinoma patients in the testing set containing 55 samples
(Panel (A)) and all 86 samples (Panel (B)).
that were identified to be most related to survival by univariate Cox analysis. In that
study, tests were conducted by randomly splitting the 86 samples into equal-sized training
and testing sets and by “leave-one-out” cross validation.
First, we form our training set by setting c1 = 1 year and c2 = 5 years in Formula (6.1).
10 short-term survivors and 21 long-term survivors are thus chosen. Applying ERCOF
to these 31 training samples, we find 402 genes that are related to outcome. Our top-
ranked feature by the entropy measure, the ATRX gene, is a putative transcription
regulator. It is also reported by Borzoi et al in their recent paper [17] on NSCLC. Our
second-ranked gene, ENPP2, is part of stress pathways involved in oncogenesis. Yang
et al [138] also detected it in NSCLC.
Then we train a linear kernel SVM to obtain the weight of each identified gene
based on the training data. The trained SVM can separate these 31 samples very well,
assigning very high risk scores to short-term survivors (the lowest score is as high as 0.73)
and very low risk scores to long-term survivors (the highest score is as low as 0.25).
After training, we calculate the risk score for each of the remaining 55 samples,
which are used for test purposes. These samples are then classified into a high risk group
consisting of samples T with s(T) > 0.5, or a low risk group consisting of samples T with
s(T) ≤ 0.5. The Kaplan-Meier curves in Figure 6.5 show a clear difference in survival for
patients in our high and low risk groups, for both the testing cases and all cases. Since we
pick out all short-term and long-term survivors to form the training set, there is no
“death” event in the first 12 months and no sample censored after 60
months in the plot drawn only on the test cases (Panel (A)).
Figure 6.6: Kaplan-Meier plots illustrate the estimation of overall survival among high
risk and low risk lung adenocarcinoma patients conditional on tumor stage.
In order to understand the relationship between our prediction and tumor stage (I
or III), we also draw Kaplan-Meier curves to delineate the survival difference between our
high and low risk patients conditioned on tumor stage. From Figure 6.6, we can see that the
outcomes of patients with stage I lung adenocarcinoma in our high and low risk groups
differ from each other, for both test cases (Panel (A)) and all cases (Panel (B)).
Remarkably, of the 13 stage III cases in the testing set, we assigned 11 (5 dead, 6 alive) of
them to the high risk group, and the 2 of them assigned to the low risk group were both alive at
the end of the follow-up time. Among all 19 stage III cases, 17 (11 dead, 6 alive) were
assigned to the high risk group according to our risk score.
6.3 Discussions
In the step of training set construction, we select only two extreme cases—long-term
and short-term survivors. See Table 6.1 for size change trends from the original training
samples to the informative training samples on DLBCL and lung adenocarcinomas data
sets. The figures illustrate that we used a small part of samples as training.
On the other hand, if we do not select those extreme cases, and instead use all
available training samples, then what will be the results? To illustrate this, we select
genes and train SVM model on the 160 samples in the preliminary group of DLBCL
study. Although the training
Table 6.1: Number of samples in the original data and in the selected informative training
set. (*): 48 DLBCL samples whose corresponding patients were dead at the end of the
follow-up time are selected as informative; 47 of them are short-term survivors while 1
of them is a long-term survivor. (**): 12 lung adenocarcinoma samples whose
corresponding patients were dead are selected as informative; 10 of them are short-term
survivors while 2 of them are long-term survivors.

Figure 6.7: Kaplan-Meier plots illustrate no clear difference in overall survival
between the high risk and low risk DLBCL patients formed by the 80 validation samples
based on the risk scores assigned by our regression model built on all 160 training
samples. (A) Using all genes. (B) Using genes selected by ERCOF.
accuracy is good, the Kaplan-Meier plots do not show a significant survival
difference between the high and low risk groups formed by the 80 validation samples
based on the risk scores assigned by the trained SVM model. In detail, using all
4937 genes, the p-value of the survival curves is 0.21 ((A) in Figure 6.7); using 40 genes
selected by ERCOF, the p-value is 0.38 ((B) in Figure 6.7). Therefore, we claim that our
proposed idea of selecting informative training samples is an effective method. As
pointed out in Section 6.1.1, we have some basic guidelines to determine the thresholds
c1 and c2 defined in Formula (6.1). Bearing these minimum constraints in mind, we
tried several c1 and c2 values in our study. In Table 6.2, the p-values (of the logrank test)
associated with the Kaplan-Meier survival curves of the validation samples under different
selections of c1 and c2 in the DLBCL study are listed. All results are based on
ERCOF-selected genes. We can see that (1) for a range of c1 and c2 (i.e., c1 less than 3
years and c2 greater than 8 years),
Table 6.2: Results for different thresholds c1 (years) and c2 (years) in the DLBCL study.
All results are based on ERCOF-selected genes and on the validation samples only.

we can achieve better predictions by selecting extreme samples, and (2) a better p-value
(0.0049) is obtained at c1 = 2 years and c2 = 8 years than the one we reported in Section 6.2.1
for c1 = 1 year and c2 = 8 years. However, when we trace back to the risk scores of the
training samples, one of the long-term survivors selected under c1 = 2 years and c2 = 8
years has a risk score as high as 0.73. In addition, the number of selected short-term
survivors is 2.4 times the number of long-term survivors under this choice. In any
case, the selection of c1 and c2 can be further refined by running cross-validation on the
training samples.
In the step of gene identification, built on statistical knowledge, our three-phase
filtering process discards many unrelated genes and keeps only a small number of
informative representatives. According to our experience with gene expression data
analysis, the entropy measure can generally filter out about 90-95% of the total number of
genes [64]. This point is verified again in this study on survival prediction: the
entropy measure retains only 132 genes in the DLBCL study (there are around 5000 genes
after removing missing values) and 884 genes in the lung adenocarcinoma study (the original
data contain 7129 genes). After further filtering by the Wilcoxon rank sum test and the Pearson
correlation coefficient test, the final set of selected genes is smaller and less
correlated. Table 6.3 shows how the number of features changes from
the entropy selection, to the Wilcoxon test, and to the correlation coefficient selection. It can be
seen that the feature reduction is mostly due to the entropy selection.
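A simplified sketch of the three filtering phases is given below. It is our schematic reading only: Phase I is reduced to an information-gain check on a single best split value rather than the full Fayyad discretization, Phase II uses SciPy's Wilcoxon rank sum test at the 5% level, and Phase III drops one gene of every pair whose Pearson correlation exceeds 0.99, following the thresholds quoted above.

# Schematic sketch of the three ERCOF filtering phases.
import numpy as np
from scipy.stats import ranksums

def has_informative_split(values, labels, min_gain=0.1):
    """Phase I (simplified): keep a gene if some expression cut-off yields an
    information gain of at least min_gain (the thesis uses Fayyad's
    discretization with an MDL stopping rule instead of a fixed threshold)."""
    values, labels = np.asarray(values, dtype=float), np.asarray(labels)
    def entropy(y):
        _, counts = np.unique(y, return_counts=True)
        p = counts / counts.sum()
        return float(-(p * np.log2(p)).sum())
    base = entropy(labels)
    order = np.argsort(values)
    v, y = values[order], labels[order]
    for i in range(1, len(v)):
        if v[i] == v[i - 1]:
            continue
        split = (i * entropy(y[:i]) + (len(y) - i) * entropy(y[i:])) / len(y)
        if base - split >= min_gain:
            return True
    return False

def ercof_like_selection(X, y, alpha=0.05, corr_threshold=0.99):
    """X: samples x genes expression matrix; y: binary class labels."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    classes = np.unique(y)
    # Phase I: entropy-style screening
    genes = [g for g in range(X.shape[1]) if has_informative_split(X[:, g], y)]
    # Phase II: Wilcoxon rank sum test between the two classes
    genes = [g for g in genes
             if ranksums(X[y == classes[0], g], X[y == classes[1], g]).pvalue < alpha]
    # Phase III: drop one gene from every highly correlated pair
    kept = []
    for g in genes:
        if all(abs(np.corrcoef(X[:, g], X[:, k])[0, 1]) < corr_threshold for k in kept):
            kept.append(g)
    return kept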
For comparison, in the DLBCL study, we also conduct experiments using all 4937
genes, the 132 genes output from Phase I of ERCOF, and the 84 genes output from
Phase II of ERCOF. The results show that in each of these cases, the overall survival
difference between the high and low risk groups formed by our risk scores on the
testing samples can be seen as well. In Figure 6.8, we draw the corresponding Kaplan-
Meier survival curves.
Table 6.3: Number of genes left after feature filtering in each phase of ERCOF.
The percentage in brackets indicates the proportion of the remaining genes with respect to the
original feature space. (*): the number is after removing genes that were absent in more
than 10% of the experiments.

Although the model built on the 3-phase ERCOF uses the smallest number of
genes, it achieves the best p-value. Again, the good results also demonstrate the
effectiveness of selecting informative samples. In addition, in the study of lung
adenocarcinoma, using all genes (i.e., without gene selection) cannot predict outcome at
all (p > 0.1).
In the step of prediction, a simple linear kernel SVM is trained on the selected
samples and genes to build a regression model. The model then assigns each validation
sample a risk score to predict patient outcome. Based on the training results, we can
derive explicit thresholds (e.g., 0.5, 0.3 and 0.7) of our risk score to categorize patients
into different risk groups. Thus, when a new case comes, we are able to assign it to the
corresponding risk group easily according to its risk score. This prediction ability is
important in patient survival study.
For both studies on DLBCL and lung adenocarcinoma, we associate our results
with some clinical features. For example, in the DLBCL study, our high and low risk
groups also demonstrate significantly different outcomes in the analysis of patients with
low or intermediate risk according to their International Prognostic Index (IPI) scores
constructed on some clinical features. E.g., for patients having high IPI, we assign most
of them into our high risk category and some into our low risk category, and our
assignment is better correlated to survival outcome of these patients. Some of the genes
identified to have strong association with survival by ERCOF also fall within four
biologic groups defined on the basis of gene expression signatures. In the lung
adenocarcinoma study, most of the samples are from stage I tumors. Among these
samples, although our high and low risk groups differ significantly from each other, we
put quite a few of them into high risk group. This finding “indicates the important
relationship between gene expression profiles and patient survival, independent of
disease stage”, which is one of the conclusions drawn in [14].
Figure 6.8: Kaplan-Meier plots illustrate the estimation of overall survival among high
risk and low risk patients in the validation group of DLBCL study. (A) Using all 4937
genes. (B) Using 132 genes output from the Phase I of ERCOF. (C) Using 84 genes
output from the Phase II of ERCOF. (D) Using 78 genes output from the Phase III of
ERCOF
CHAPTER - 7
RECOGNITION OF FUNCTIONAL SITES IN
BIOLOGICAL SEQUENCES
Not all biomedical data contain explicit signals or features like those in the classification
problems raised by gene expression profiling. For example, DNA sequences and protein
sequences represent the spectrum of biomedical data that possess no explicit features.
Generally, a genomic sequence is just a string consisting of the letters “A”, “C”, “G”,
and “T” in a “random order”. Yet a genomic sequence possesses biologically
meaningful functional sites, which play important roles in the process of protein
synthesis from DNA sequences. Figure 7.1 shows a picture of this process (captured
from the “bioinformatics class notes” of Dr. Nina Rosario L. Rojas at
http://aegis.ateneo.net/nrojas/). This process can be divided into two stages:
transcription and translation.

Transcription.
In this stage, the information in DNA is passed on to RNA. This takes place
when one strand of the DNA double helix is used as a template by the RNA polymerase
to create a messenger RNA (mRNA). Then this mRNA moves from the nucleus to the
cytoplasm. In fact, in the cell nucleus, the DNA with all the exons and introns of the
gene is first transcribed into a complementary RNA copy named “nuclear RNA”
(nRNA). This is indicated as “primary transcription” in the picture of Figure 7.1.
Secondly, non-coding sequences of base pairs (introns) are eliminated from the coding
sequences (exons) by RNA splicing. The resulting mRNA is the edited sequence of
nRNA after splicing. The coding mRNA sequence can be described in terms of units of
three nucleotides called codons.
Figure 7.1: Process of protein synthesis.
Translation. In this stage, the information that has been passed to RNA from DNA is
used to make proteins. At the initiation phase of translation, the ribosome binds to the
mRNA when it reaches an AUG (adenine, uracil, and guanine) sequence on the RNA
strand in a suitable context. The ribosome is made of protein and ribosomal RNA
(rRNA). The start codon AUG is called the translation initiation site (TIS) and is only
recognized by the initiator tRNA (transfer RNA). After binding to the mRNA, the
ribosome proceeds to the elongation phase of protein synthesis by sequentially binding
to the appropriate codon in the mRNA to form base pairs with the anticodon of another
tRNA molecule. Hence, with the ribosome moving from codon to codon along the
mRNA, amino acids are added one by one and translated into a polypeptide sequence. At the
end, the newly formed strand of amino acids (the complete polypeptide) is released from
the ribosome when a release factor binds to the stop codon. This is the termination
phase of translation.
The functional sites in DNA sequences include the transcription start site (TSS), the
translation initiation site (TIS), coding regions, splice sites, polyadenylation (cleavage)
sites and so on, which are associated with the primary structure of genes. Recognition of
these biological functional sites in a genomic sequence is an important bioinformatics
application [72].
In order to apply traditional machine learning techniques to the above functional
site recognition problem, we propose a 3-step work flow as follows. In the first step,
candidate features are generated using k-gram nucleotide acid or amino acid patterns,
and the sequence data are then transformed with respect to the newly generated feature
space. In the second step, a small number of good features is selected by a certain
algorithm. In the third step, a classification model is built to recognize the functional
site.
7.1 Method Description
The first and most important step of our method is to generate a new feature
space under which the original sequences can be transformed into a format to which
general machine learning tools can easily be applied.
7.1.1 Feature generation
We generate the new feature space using k-gram (k = 1, 2, 3, ...) nucleotide or amino acid
patterns. A k-gram is simply a pattern of k consecutive letters, which can be amino acid
symbols or nucleic acid symbols [143, 72]. We use each k-gram nucleotide or amino acid
pattern as a new feature. For example, the nucleotide acid pattern “TCG” is a 3-gram
pattern, while the amino acid pattern “AR” is a 2-gram pattern constituted by an alanine
followed by an arginine. Our aim is to recognize a functional site in a sequence by
analyzing the k-gram patterns around it. Generally, up-stream and down-stream k-gram
patterns of a candidate functional site (for example, every ATG is a candidate
translation initiation site) are treated as different features. Therefore, if we use
nucleotide patterns, for each k there are 2 × 4^k possible k-gram features;
if we use amino acid patterns, since there are 20 standard amino acids plus 1 stop codon
symbol, there are 2 × 21^k possible k-gram features for each k. If the position of each k-
gram pattern in the sequence fragment is also considered, then the number of features
increases dramatically. We call these features position-specific k-gram patterns.
Besides, k-grams can also be restricted to in-frame ones.
The frequency of a k-gram pattern is used as the value of the corresponding feature. For
example:
1. UP-X (DOWN-X), which counts the number of times the letter X appears in
the up-stream (down-stream) part of a functional site in its nucleotide acid or amino acid
sequence.
2. UP-XY (DOWN-XY), which counts the number of times the two letters XY
appear as a substring in the up-stream (down-stream) part of a functional site in its
nucleotide acid or amino acid sequence,
where X and Y range over the 4 nucleotide acid letters or the standard 20 amino acid
letters and the special stop codon symbol.
In the framework of the new feature space, the initial nucleotide sequences need
to be transformed. The transformation is constructed as follows. Given a DNA
nucleotide sequence, a sequence window is set aside for each candidate functional site,
with the site in the center, a certain number of bases up-stream (the up-stream window size)
and a certain number of bases down-stream (the down-stream window size). If a candidate
functional site does not have enough up-stream or down-stream context, we pad the
missing context with the appropriate number of don't-care (“?”) symbols.
If features are made from amino acid patterns, we code every nucleotide triplet,
at both the up-stream and the down-stream of the centered candidate functional
site in a sequence window, into an amino acid using the standard codon table. A triplet
that corresponds to a stop codon is translated into a special “stop” symbol. Thus, every
nucleotide sequence window is coded into another sequence consisting of amino acid
symbols and the “stop” symbol. Then the nucleotide or amino acid sequences are converted
into frequency data under the description of our new features. Later, the
classification model is applied to this frequency data, rather than to the
original cDNA sequence data or the intermediate amino acid sequence data.
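The k-gram frequency transformation described above can be sketched as follows. The code builds the UP-/DOWN- frequency vector for the two parts of one candidate-site window; position-specific and in-frame variants are omitted, and the function names are ours.

# Sketch of the k-gram frequency features for one candidate-site window.
# The window is split at the candidate site into an up-stream and a
# down-stream part, and each k-gram count becomes one feature (UP-XY..., DOWN-XY...).
from collections import Counter
from itertools import product

def kgram_features(upstream, downstream, alphabet, ks=(1, 2)):
    """Return a dict mapping feature names (UP-X, DOWN-XY, ...) to counts."""
    features = {}
    for k in ks:
        for gram in ("".join(p) for p in product(alphabet, repeat=k)):
            features["UP-" + gram] = 0
            features["DOWN-" + gram] = 0
    for prefix, seq in (("UP-", upstream), ("DOWN-", downstream)):
        for k in ks:
            counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
            for gram, c in counts.items():
                key = prefix + gram
                if key in features:          # ignore grams containing pad symbols
                    features[key] += c
    return features

if __name__ == "__main__":
    # toy nucleotide example with a 6-base context on each side of an ATG
    print(kgram_features("GCCACC", "GCGTTT", alphabet="ACGT"))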
7.1.2 Feature selection and integration
In most cases, the number of candidate features in the feature space is relatively
large. It is reasonable to expect that some of the generated features will be irrelevant to
our prediction problem while others are indeed good signals for identifying the functional
site. Thus, in the second step, feature selection is applied to the feature space to find
those signals most likely to help in distinguishing the true functional site from a large
number of candidates. Besides, feature selection also greatly speeds up the classification
and prediction process, especially when the number of samples is large. Among the
many feature selection techniques presented in Chapter 3, we employ the simple
entropy measure (Section 3.2.4) in the following two applications. As used in gene
expression data analysis (under the name “all-entropy”), we choose all the features whose
value range can be partitioned into intervals by Fayyad’s discretization algorithm [36]
(Section 3.2.4 of Chapter 3).
To achieve the ultimate goal of predicting the true functional site, our next step
is to integrate the selected features by a classification algorithm. At this step, in the
following two applications, we will focus on the results achieved by support vector
machines (SVM) (with a linear or quadratic polynomial kernel function) and our
ensemble method CS4. Detailed techniques of SVM and CS4 can be found in Section
2.3.2 and Section 2.3.4 of Chapter 2, respectively.
In the following two sections, we will make use of our proposed work flow to
predict translation initiation sites and polyadenylation signals.
7.2 Translation Initiation Site Prediction
7.2.1 Background
The translation initiation site (TIS) prediction problem is about how to correctly identify
the TIS in mRNA, cDNA, or other types of genomic sequences. At the translation stage of
the protein synthesis process, in eukaryotic mRNA, the context of the start codon (normally
“AUG”) and the sequences around it are crucial for recruitment of the small ribosome
subunit. Thus, the characterization of the features around the TIS will be helpful for a better
understanding of translation regulation and accurate prediction of coding regions
in genomic and mRNA/cDNA sequences. This is an important step in genomic analysis
to determine protein-coding regions from nucleotide sequences.
Since 1987, the recognition of TIS has been extensively studied using
biological approaches, data mining techniques, and statistical models [56, 57, 58, 89,
59, 103, 83, 145, 90, 48, 142]. Pedersen and Nielsen [89] directly fed DNA
sequences into an artificial neural network (ANN) to train the system to recognize
true TIS. They achieved a result of 78% sensitivity on start ATGs (i.e., true TISs) and
87% specificity on non-start ATGs (i.e., false TISs) on a vertebrate data set, giving an
overall accuracy of 85%. Zien et al [145] studied the same vertebrate data set, but
replaced the ANN with support vector machines (SVM) using different kinds of kernel
functions. They believe that carefully designed kernel functions are useful for achieving
higher TIS prediction accuracy. One of their kernel functions is called the “locality-
improved” kernel, which emphasizes correlations between any two sequence positions
that are close together, and a span of 3 nucleotides up- and down-stream is empirically
determined as optimal. More recently, Hatzigeorgiou [48] built a multi-step ANN system
named “DIANA-TIS” to study the recognition problem. This ANN system combines a
consensus ANN and a coding ANN with the ribosome scanning model. They obtained
an overall accuracy of 94% on a data set containing full-length human cDNA
sequences. All of these methods use nucleotide sequence data directly; they do not
generate any new and explicit features for the differentiation between true and false
TISs.
There are some related works that use statistical features. The program ATGpr
[103] uses a linear discriminant function that combines some statistical features derived
from the sequence. Each of those features is proposed to distinguish true TIS from false
TIS. In a more recent work [83], an improved version of ATGpr called ATGpr sim was
developed, which uses both statistical information and similarities with other known
proteins to obtain higher accuracy of fullness prediction for fragment sequences of
cDNA clones. In our previous study [72], the same vertebrate data set was analyzed by
generating features using nucleotide acid patterns.
7.2.2 Data
We collected three data sets for this study.
The first data set (data set I) is provided by Dr. Pedersen. It consists of
vertebrate sequences extracted from GenBank (release 95). The sequences are further
processed by removing possible introns and joining the remaining exon parts to obtain
the corresponding mRNA sequences [89]. From these sequences, only those with an
annotated TIS, and with at least 10 up-stream nucleotides as well as 150 down-stream
nucleotides are considered in our studies. The sequences are then filtered to remove
homologous genes from different organisms, sequences added multiple times to the
database, and those belonging to same gene families. Since the data are processed DNA,
the TIS site is ATG — that is, a place in the sequence where “A”, “T”, and “G” occur in
consecutive positions in that order. We are aware that some TIS sites may be non-ATG;
however, this is reported to be rare in eukaryotes [59] and is not considered in this
study.
An example entry from this data set is given in Figure 7.2. There are 4 ATGs in
this
Figure 7.2: An example annotated sequence from data set I. The 4 occurrences of ATG
are underlined. The second ATG is the TIS. The other 3 ATGs are non-TIS. The 99
nucleotides up-stream of the TIS are marked by an over line. The 99 nucleotides down-
stream of the TIS are marked by a double over line. The “.”, “i”, and “E” are
annotations indicating whether the corresponding nucleotide is up-stream (.), TIS (i), or
down-stream (E).
example. The second ATG is the TIS. The other 3 ATGs are non-TIS (false TIS). ATGs
to the left of the TIS are termed up-stream ATGs. So the first ATG in the figure is an
up-stream ATG. ATGs to the right of the TIS are termed down-stream ATGs. So the
third and fourth ATGs in the figure are down-stream ATGs. The entire data set contains
3312 sequences. In these sequences, there are a total of 13375 ATGs, of which
3312 ATGs (24.76%) are true TISs, while 10063 (75.24%) are false. Of the false TISs,
2077 (15.5%) are up-stream ATGs.
The second data set (data set II) is provided by Dr. Hatzigeorgiou. The data
collection was first made on the protein database Swissprot. All the human proteins
whose N-terminal sites are sequenced at the amino acid level were collected and
manually checked. Then the full-length mRNAs for these proteins, whose TIS had been
indirectly experimentally verified, were retrieved. The data set consists of 480 human
cDNA sequences in standard FASTA format. In these sequences, there are as many as
13581 false TISs, accounting for 96.59% of the total number of ATGs. However, only 241
(1.8%) of them are up-stream ATGs.
Besides these two data sets that have been analyzed by others, we also formed
our own genomic data set (data set III) by extracting a number of well-characterized and
annotated human genes of Chromosome X and Chromosome 21 from Human Genome
Build30 [70]. Note that we eliminated those genes whose annotations were generated by
prediction tools. The resulting set consists of 565 sequences from Chromosome X and
180 sequences from Chromosome 21. These 745 sequences containing true TIS are
used as positive data in our experiment. Meanwhile, in order to get negative data, we
extracted a set of sequences around all ATGs in these two chromosomes but excluded
annotated ones.
7.2.3 Feature generation and sequence transformation
As every 3 nucleotides code for an amino acid, in this study we use k-gram (k = 1, 2)
amino acid patterns as candidate features. Thus, there are 924 (= (21 + 21²) × 2) possible
amino acid patterns, i.e. new features.
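To make the size of this candidate feature space concrete, the following minimal Python sketch (illustrative only, not the program used in this study; the names ALPHABET and candidate_features are ours) enumerates the k-gram amino acid patterns, counted separately for the up-stream and down-stream parts of a window.

from itertools import product

ALPHABET = "ACDEFGHIKLMNPQRSTVWY*"   # 20 amino acid letters plus '*' for a stop codon

def candidate_features():
    # all 1-gram and 2-gram patterns over the 21-letter alphabet
    grams = ["".join(p) for k in (1, 2) for p in product(ALPHABET, repeat=k)]
    # each pattern is counted once up-stream and once down-stream of the candidate ATG
    return ["UP-" + g for g in grams] + ["DOWN-" + g for g in grams]

assert len(candidate_features()) == (21 + 21 ** 2) * 2   # = 924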
In the sequence transformation, we set both up-stream window size and down-
stream window size to 99 bases: given a cDNA or mRNA nucleotide sequence
containing ATGs, a window is set for each ATG, with the ATG in the center and 99
bases up-stream and 99 bases down-stream of it (excluding the ATG itself). As such,
for data set I, we get 3312 sequence windows containing true TIS and 10063 containing
false TIS; for data set II, 480 sequence windows containing true TIS and 13581
containing false TIS. All the windows have the same size, i.e. 201 nucleotides each.
For ease of discussion, given a sequence window, we refer to each position in the
sequence window relative to the target ATG of that window. The “A” in the target ATG
is numbered as +1 and consecutive down-stream positions—that is, to the right—from
the target ATG are numbered from +4 onwards. The first up-stream position — that is,
to the left — adjacent to the target ATG is –1 and decreases for consecutive positions
towards the 5’ end — that is, the left end of the sequence window [72]. These sequence
windows containing nucleotide letters are further transformed to amino acid sequences
by coding every nucleotide triplet into an amino acid or a stop codon. Finally, the
amino acid sequences are converted into frequency data under the description
of the feature space.
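As an illustration of this transformation, the following is a minimal Python sketch (under our own function names, not the program actually used in the study) that extracts the 99-base flanking windows around every ATG, translates each side codon by codon, and counts the 1-gram and 2-gram amino acid patterns; it assumes Biopython's Seq.translate for the codon-to-amino-acid coding.

from collections import Counter
from Bio.Seq import Seq

def windows(seq, flank=99):
    # yield (ATG position, 99 bases up-stream, 99 bases down-stream) for every ATG
    # that has the full flanking context required in this study
    for i in range(flank, len(seq) - flank - 2):
        if seq[i:i + 3] == "ATG":
            yield i, seq[i - flank:i], seq[i + 3:i + 3 + flank]

def to_feature_counts(up_nt, down_nt):
    # translate each 99-base side into 33 amino acid letters ('*' marks a stop codon)
    up_aa, down_aa = str(Seq(up_nt).translate()), str(Seq(down_nt).translate())
    counts = Counter()
    for prefix, aa in (("UP", up_aa), ("DOWN", down_aa)):
        for k in (1, 2):
            for j in range(len(aa) - k + 1):
                counts[prefix + "-" + aa[j:j + k]] += 1
    return counts   # frequencies of the 924 candidate patterns (absent patterns count as 0)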
Apart from the k-gram amino acid patterns, we also derive three new features
from known bio-knowledge: two are based on the well-known Kozak consensus
matrix and one on the ribosome scanning model. In the original work on the identification
of the TIS in cDNA sequences, Kozak developed the first weight matrix from an
extended collection of data. The consensus motif from this matrix is
GCC[AG]CCATGG, where (1) a G residue tends to follow a true TIS, which indicates
that a “G” appears in position +4 of the original sequence window; (2) a purine (A or G)
tends to be found 3 nucleotides up-stream of a true TIS, which indicates that an “A” or a
“G” appears in position -3 of the original sequence window. Also, according to the
ribosome scanning model [27, 57, 4], an mRNA sequence is scanned from left (5’) to
right (3’), and the scanning stops as soon as an ATG is recognized as the TIS. The rest of
the ATGs in the mRNA sequence to the right of this ATG are then treated as non-TIS. To
incorporate this knowledge into our feature space, we add three Boolean features
“DOWN4-G”, “UP3-AorG” and “UP-ATG”. Here, UP-ATG indicates whether an in-frame
up-stream ATG exists. In a nucleotide sequence window extracted for each candidate
TIS, we call those 3-grams in positions …, -9, -6, and -3 the in-frame up-stream 3-gram
patterns, and those 3-grams in positions +4, +7, +10, … the in-frame down-stream 3-gram
patterns. Finally, there are 927 features in the new feature space.
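The three knowledge-based Boolean features can be read directly off the nucleotide window; a minimal sketch (with illustrative function and argument names) follows.

def boolean_features(up_nt, down_nt):
    # up_nt: 99 bases up-stream (position -1 is up_nt[-1]); down_nt: 99 bases down-stream
    down4_g = down_nt[0] == "G"                  # a "G" at position +4
    up3_a_or_g = up_nt[-3] in "AG"               # a purine (A or G) at position -3
    # UP-ATG: an in-frame up-stream ATG, i.e. an ATG starting at position -3, -6, -9, ...
    up_atg = any(up_nt[i:i + 3] == "ATG" for i in range(len(up_nt) - 3, -1, -3))
    return {"DOWN4-G": down4_g, "UP3-AorG": up3_a_or_g, "UP-ATG": up_atg}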
After this process of feature generation and data transformation, we get 3312
true TIS samples and 10063 false TIS samples from data set I, 480 true TIS samples and
13581 false TIS samples from data set II. Each sample is a vector of 924 integers and
three Boolean values. Figure 7.3 presents a diagram of the data transformation with
respect to our new feature space.
7.2.4 Experiments and results
To verify the effectiveness of our method from different aspects, we designed a
series of experiments on the three data sets:
a. Conducting computational cross validations in data set I and data set II
separately.
b. Selecting features and building a classification model using data set I. Applying
the well-trained model to data set II to obtain a blind testing accuracy.
c. Incorporating the idea of ribosome scanning into the classification model.
d. Applying the model built in experiment-b to genomic sequences.
Validation in different data sets
To strictly compare with the results presented in [142, 72], we conduct the same
3-fold cross validation. Table 7.1 shows our results on the data set I and data set II using
the features selected by the entropy-based algorithm. With the simple linear kernel
function, SVM achieves an accuracy of 92.04% at 81.13% sensitivity and 95.63%
specificity on data set I. This is better than the accuracy of 89.4% at 74.0% sensitivity
and 94.4% specificity, which is the previous best result reported on the same data set
[142]. On data set II, SVM achieves an accuracy of 98.42% at 63.33% sensitivity and
99.66% specificity. Note that we could not find previously reported results on this data
set under a similar cross validation setting.
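For readers who wish to reproduce this kind of experiment, the following is a minimal sketch of a 3-fold cross validation with a linear-kernel SVM and the sensitivity/specificity measures used throughout this chapter. It assumes scikit-learn and NumPy, a feature matrix X built from the 927-dimensional feature space, and labels y coded 1 for true TIS and 0 for false TIS; it is not the exact implementation that produced the numbers above.

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

def cross_validate(X, y, folds=3):
    sens, spec, acc = [], [], []
    for train, test in StratifiedKFold(n_splits=folds, shuffle=True, random_state=0).split(X, y):
        pred = SVC(kernel="linear").fit(X[train], y[train]).predict(X[test])
        tp = np.sum((pred == 1) & (y[test] == 1))   # true TISs predicted as TIS
        tn = np.sum((pred == 0) & (y[test] == 0))   # false TISs predicted as non-TIS
        fp = np.sum((pred == 1) & (y[test] == 0))
        fn = np.sum((pred == 0) & (y[test] == 1))
        sens.append(tp / (tp + fn))                 # accuracy on true TISs
        spec.append(tn / (tn + fp))                 # accuracy on false TISs
        acc.append((tp + tn) / len(pred))
    return np.mean(sens), np.mean(spec), np.mean(acc)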
Figure 7.3: A diagram for the data transformation aiming at the description of the
new feature space.
Validation across two data sets
The good cross validation results achieved within the individual data set
encourage us to extend our study to span the two data sets. In this experiment, we use
the whole data set I as training data to select features and build the classification model,
and then we evaluate the well-trained model on data set II to get a test accuracy.
Table 7.1: The results by 3-fold cross validation on the two data sets (experiment-a).
SVM(linear/quad) means the classification model is built with a linear/quadratic
polynomial kernel function.

Table 7.2: Classification accuracy when using data set I as training and data set II as
testing (experiment-b). The row of II** is the testing accuracy on data set II before
the similar sequences are removed.

To reduce the similarity between the training and testing data, a BLAST search
between data sets I and II is performed. Two sequences are considered similar if they
produce a BLAST hit with an identity >75%. We find 292 similar sequences and remove
them from data set II. As a result, after the similar sequences are removed, data set II
contains 188 true TISs, while the total number of candidate ATGs is 5111 [70].
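A minimal sketch of this filtering step is given below; it assumes that blastn has been run with tabular output (-outfmt 6, whose first three columns are query id, subject id and percent identity), and the file path and function name are illustrative only.

def similar_ids(blast_tabular_path, identity_cutoff=75.0):
    # collect data set II sequence ids that hit data set I with identity above the cutoff
    hits = set()
    with open(blast_tabular_path) as fh:
        for line in fh:
            qseqid, sseqid, pident = line.rstrip("\n").split("\t")[:3]
            if float(pident) > identity_cutoff:
                hits.add(qseqid)
    return hits   # sequences to remove from the test set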
We train an SVM model on data set I and obtain a training accuracy of 92.77% at
80.68% sensitivity and 96.75% specificity. Using this model, we get a test accuracy of
89.42% at 96.28% sensitivity and 89.15% specificity on data set II. We note that the
testing accuracy on the original data set II (without the removal of the similar
sequences) is quite similar. See Table 7.2 for a summary of these results.
Remarkably, this validation spanning the two data sets achieves a much
better sensitivity on data set II than that obtained in the 3-fold cross-validation on this
data set. A reason may be that only 3.41% of the candidate ATGs in data set II are true TISs,
which leads to an extremely unbalanced number of samples between the two classes.
However, this bias is rectified significantly by the model built on data set I, where the
proportion of true TIS versus false TIS samples is more balanced.
Incorporation of scanning model
Hatzigeorgiou [48] reported a high accuracy on data set II by an integrated
method which combines a consensus ANN with a coding ANN together with a
ribosome scanning model. The model scans from the 5’ end of a cDNA
sequence and predicts the TIS at the first ATG in a good context [27, 57, 4]. The rest of the
ATGs in the cDNA sequence to the right of this ATG are then automatically classified
as non-TIS. Thus, one and only one ATG is predicted as TIS per cDNA sequence.
We also incorporate this scanning model into our experiment. This time, in a
sequence, we test ATGs in turn from left to right, until one of them is classified as TIS.
A prediction on a sequence is correct if and only if the TIS itself is predicted as a TIS.
Since the scanning model indicates that the first ATG that is in an optimal nucleotide
context should be the TIS, a higher prediction accuracy is expected if only up-stream ATGs
and true TISs are used in training. Thus, we ignore all down-stream ATGs in data set I
and obtain a new training set containing only true TISs and their up-stream ATGs. Then
feature selection and classification model learning are based on this new training data.
Table 7.3 shows our results with scanning model being used.
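A minimal sketch of prediction under the scanning model is shown below; it reuses the illustrative windows helper sketched earlier, and classify stands for the trained classifier applied to one candidate's feature vector (the names are ours, not part of the original system).

def predict_tis_with_scanning(sequence, classify, flank=99):
    # test ATGs in turn from the 5' end; the first one classified as TIS is reported
    for pos, up_nt, down_nt in windows(sequence, flank):
        if classify(up_nt, down_nt):
            return pos          # all ATGs further down-stream are treated as non-TIS
    return None                 # no TIS predicted in this sequence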
Under this scanning model idea, Hatzigeorgiou reported that 94% of the TISs were
correctly predicted on data set II [48]. However, as mentioned in that paper [48], the data
set was split into training and testing parts in a particular way, so the results reported
there are not directly comparable with our results.
Testing on genomic sequences
In order to further evaluate the feasibility and robustness of our method, we apply our
model built in experiment-b to our own prepared data (data set III), which contains gene
sequences from Chromosome X and Chromosome 21. Using the simple linear kernel
function, SVM gives 397 correct predictions out of a total of 565 true TISs found in
Chromosome X, and 132 correct predictions out of a total of 180 true TISs in
Chromosome 21. The sensitivities are 70.27% and 73.33%, respectively. To obtain the
specificity of our models, we randomly select the same number of sequences containing
non-start ATGs (false TIS) from our own extracted negative data set. SVM correctly
predicts 626 of these 745 non-start ATGs, obtaining a specificity of 84.02%. In the same
test, CS4 achieves 52.48% sensitivity and 89.80% specificity. One point that needs to be
addressed here is that, in this validation, we remove the feature built on the ribosome
scanning model, since that model does not hold for genomic data. To illustrate the
tradeoff between prediction sensitivity and specificity, Figure 7.4 gives the ROC curves
of SVM and CS4, showing the changes of prediction accuracy on true and false TISs.

Table 7.3: Classification accuracy under the scanning model when using data set I (3312
sequences) as training and data set II (188 sequences) as testing (experiment-c). The
row of II** is the testing accuracy on data set II before the similar sequences are
removed (480 sequences). NoCorrectlyPredicted is the number of sequences whose TIS
is correctly predicted.

Figure 7.4: ROC curves of SVM and CS4 on TIS prediction in the genomic data of
Chromosome X and Chromosome 21 (experiment-d). The SVM model is built with the
linear kernel function. The areas under the ROC curves: SVM 0.837, CS4 0.772.
7.2.5 Discussion
Significant Features
“What are the key features to predict TIS?” To answer this question, let us look at an
interesting discovery about the features selected in the 3-fold cross validation on data
set I in our experiment-a. Table 7.4 shows the ranking positions of the 10 top-ranked
features based on their entropy value for each fold. Observe that they are the same
features though their ordering is slightly different from one fold to another. This
suggests that these features, or more precisely these amino acid patterns, are indeed patterns around
true or false TISs. Furthermore, “UP-ATG” can be explained by the ribosome scanning
model [27, 4] — seeing such an up-stream ATG makes the candidate ATG less likely to
be the TIS. “DOWN-STOP” counts the in-frame stop codons down-stream of the target
ATG; it is consistent with the fact that the biological process of translating in-frame
codons into amino acids stops upon encountering an in-frame stop codon: seeing such a
down-stream stop codon makes the candidate protein improbably short. “UP3-AorG”
corresponds to the well-known Kozak consensus sequence [56]. Most of the other
features were also identified in our previous study [142], in which the feature space is
built directly on nucleotides. Remarkably, these amino acid patterns, except “DOWN-
L”, all contain “G” residue. Note also that “UP-M” is one of the top features in each
fold, but we exclude it as it is redundant given that UP-ATG is true if and only if UP-M
>0. The significance of these features is further verified when we find that both
sensitivity and specificity drop down greatly if these features are all excluded from the
classification model. However, we do not observe obvious decrease when we remove
any one of them from the model. This may suggest that, in the real biological process of
translation, there are factors other than the Kozak consensus that may regulate the
recognition of the TIS.
In addition to the result when only selected features are used, we also obtain
cross-validation results on the whole feature space (i.e. without feature selection). We
find that using the whole feature space does not give us better results in any of our
experiments. For example, SVM with the linear kernel function achieves an accuracy of
90.94% at 79.86% sensitivity and 94.58% specificity when running 3-fold cross
validation on data set I. This result is not as good as that on the selected features.
Table 7.4: Ranking of the top 10 features based on their entropy value as relevant in
each of the 3 folds of data set I. Feature “UP-ATG” indicates whether an in-frame up-
stream ATG exists (Boolean type). Feature “UP3-AorG” tests whether a purine (A or G)
appears 3 nucleotides up-stream of the candidate ATG (Boolean type). Feature
“UP(DOWN)-X” counts the occurrence that an in-frame (relative to the candidate
ATG) triplet coding for the amino acid letter X appears in the up-stream (down-stream)
part of a candidate ATG. Feature “DOWN-STOP” is the occurrence of in-frame stop
codons down-stream of a candidate ATG.
Classification algorithms
For the classification methods, overall, SVM performs slightly better than our
CS4 method in terms of prediction accuracy. However, CS4 achieves very good
sensitivity when running 3-fold cross validation on data set II, where the number of true
TISs is much smaller than the number of false TISs. On the prediction of TIS in genomic
sequences, the performance of CS4 is close to that of SVM. This can be illustrated by
the ROC curves drawn in Figure 7.4: the areas under the curves are 0.837 for SVM
and 0.772 for CS4, respectively. Besides, decision trees can output comprehensible rules
that disclose the essence of learning and prediction. Some interesting and biologically
sensible discovered rules with large coverage are listed below (a minimal sketch of them
as a rule-based classifier follows the list):
1. If UP-ATG=‘Y’ and DOWN-STOP>0, then prediction is false TIS.
2. If UP3-AorG=‘N’ and DOWN-STOP>0, then prediction is false TIS.
3. If UP-ATG=‘N’ and DOWN-STOP=0 and UP3-AorG=‘Y’, then prediction is true
TIS.
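A minimal sketch of these three rules as a stand-alone rule-based classifier is given below ('Y' is mapped to True and 'N' to False; candidates covered by none of the rules are left undecided). The dictionary keys follow the feature names used above; DOWN-STOP is the count of in-frame stop codons down-stream of the candidate ATG.

def rule_based_prediction(f):
    if f["UP-ATG"] and f["DOWN-STOP"] > 0:
        return "false TIS"                                        # rule 1
    if not f["UP3-AorG"] and f["DOWN-STOP"] > 0:
        return "false TIS"                                        # rule 2
    if not f["UP-ATG"] and f["DOWN-STOP"] == 0 and f["UP3-AorG"]:
        return "true TIS"                                         # rule 3
    return None                                                   # not covered by these rules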
On the other hand, in our series of experiments, SVMs built on quadratic
polynomial kernels do not show much advantage over those built on the simple linear
kernel function. Note that quadratic kernels need much more time in the training process.
Comparison with model built on nucleotide acid patterns
In [142], data set I was studied using k-gram nucleotide acid patterns and several
classification methods including SVMs, Naive Bayes, Neural Network and decision
tree. In that study, feature selection was also conducted, but by CFS (Correlation-based
Feature Selection) which is introduced in Section 3.2.6 of Chapter 3. The best accuracy
achieved on the 3-fold cross validation was 89.4% at 74.0% sensitivity and 94.4%
specificity when some 3-gram nucleotide acid patterns were used. This result is not as
good as that presented in this section — 92.45% accuracy at 80.19% sensitivity and
96.48% specificity. However, the good features selected by these two experiments are
highly consistent. Besides those 3 features built on bio-knowledge, CFS picked out
down-stream TAA (stop codon), TAG (stop codon), TGA (stop codon), CTG (amino
acid L), GAC (D), GAG (E) and GCC (A). If we code these 3-gram nucleotide patterns
into 1-gram amino acid patterns, we will find they are all among the best features listed
in Table 7.4. On the other hand, although there are no 2-gram amino acid patterns
among the 10 best features in Table 7.4, some of them are indeed included in the set of
selected features that have been used to achieve better results in this study. Note that our
previous study [142] also showed that using 4-gram and 5-gram nucleotide acid patterns
could not help improve the prediction performance.
Comparison with ATGpr
As mentioned earlier, ATGpr [103, 83] is a TIS prediction program that makes
use of a linear discriminant function, several statistical measures derived from the
sequence and the ribosome scanning model. It can be accessed via
http://www.hri.co.jp/atgpr/. When searching TIS in a given sequence, the system will
output several (5 by default) ATGs in the order of decreasing confidence. Let us take
the ATG with the highest confidence as the predicted TIS. Then, for the 3312 sequences
in our data set I, ATGpr correctly predicts the true TIS in 2941 (88.80%) of them. This
accuracy is 6.64% lower than what we achieved. For our data set II, the true TIS in 442
(92.0%) of the 480 sequences is properly recognized, which is about 2.38% lower than the
accuracy obtained by us. Our results quoted here are based on the SVM model using the linear kernel
function.
When we feed the genomic data used in our experiment-d to ATGpr, the
program gives correct TIS predictions on 128 (71.11%) of 180 Chromosome 21 gene
sequences and 417 (73.81%) of the 565 Chromosome X gene sequences, giving an overall
sensitivity of 73.15%. On the other hand, ATGpr achieves 70.47% prediction accuracy
on the same number of negative sequences that were used in our experiment-d. From
the ROC curves shown in Figure 7.4, we find that our prediction specificities are around
80% (SVM) and 73% (CS4) when the sensitivity is 73.15%, which is about 9.5% and 2.5%
higher than the specificity of ATGpr. This indicates that our program may also outperform
ATGpr when dealing with genomic sequence data.
7.3 Polyadenylation Signal Predictions
7.3.1 Background
The general polyadenylation machinery of mammalian cells has been well
studied for decades. The polyadenylation (poly (A)) reaction of mammalian pre-
mRNAs proceeds in two phases: the cleavage of pre-mRNA and the addition of poly
(A) tail to the newly formed 3’ end. The cleavage reaction requires the cleavage/poly
(A) specificity factor (CPSF), the cleavage stimulation factor (CStF), the cleavage
factors I and II (CF I and CF II), and poly (A) polymerase (PAP) in most cases. CPSF,
PAP and poly(A) binding protein 2 are involved in poly(A) addition [144]. The assembly of the
cleavage/poly(A) complex, which contains most or all of the processing factors and the
substrate RNA, occurs cooperatively. CPSF consists of four subunits and binds to the
highly conserved AAUAAA hexamer up-stream of the cleavage site. CStF, which is
necessary for cleavage but not for the addition of poly(A) tail, interacts with the U/GU
rich element located down-stream of the AAUAAA hexamer. Two additional factors, the
cleavage factor I and II (CF I and CF II) act only in the cleavage step. CF I has been
purified to homogeneity and shown to be an RNA-binding factor. CF II has been only
partially purified so far, and its function is not known.
After the formation of the cleavage/polyadenylation complex, the selection of
poly(A) site is primarily determined by the distance between a hexameric poly(A)
signal (PAS) of sequence AAUAAA (or a one-base variant) and the down-stream
element (denoted as DSE). The spacing requirements for the PAS and DSE reflect the
spatial requirements for a stable interaction between CPSF and CStF. The DSE is
poorly conserved, and two main types have been described, a U-rich or a GU-rich
element, which is located 20 to 40 bases down-stream of the cleavage site (for reviews,
please refer to [28, 144, 141]). DSE is present in a large proportion of genes and can
affect the efficiency of cleavage [75, 141]. Although in a few cases, an up-stream
element (denoted as USE) is required for the poly(A) signal to be fully activated [5, 18,
79], the position and sequence of the USE are undefined. In summary, the organization
of mammalian poly(A) sites may have an unexpected flexibility, and their activity depends
not only on the hexameric signal but also on the up-stream and down-stream elements.
Figure 7.5 is a schematic representation of the PAS in the human mRNA 3’ end
processing site [144].

Figure 7.5: Schematic representation of PAS in human mRNA 3’ end processing site.
Distances are as described in [28].
There are several software programs that have been developed to detect PASes
in human DNA and mRNA sequences by analyzing the characteristics of up-stream and
down-stream sequence elements around PASes. In one of early studies, Tabasco and
Zhang [119] developed a program named Polyadq, which finds PASes using a pair of
quadratic discriminant functions. Besides, they also created a database of known active
poly(A) sites and trained their program on 280 mRNA sequences and 136 DNA
sequences. In their tests of finding PASes, they claimed a correlation coefficient of
0.413 on whole genes and 0.512 in the last two exons of genes. Polyadq is available at
http://argon.cshl.org/tabaska/polyadq_form.html. Recently, Legendre and Gautheret
[61] used bioinformatics analysis of EST and genomic sequences to characterize biases
in the regions encompassing 600 nucleotides around the cleavage site. The computer
program they developed is called Erpin, which uses 2-gram position-specific nucleotide
acid patterns to analyze the 300 bases up-stream and down-stream of a candidate
PAS. Trained on 2327 terminal sequences, Erpin was reported to achieve a
prediction specificity of 75.5% to 90.4% at a sensitivity of 56% on several sets of
validation data. The program can be found at http://tagc.univ-mrs.fr/pub/erpin/.
In this study, we will apply our method to characterize the features in the
regions encompassing 200 nucleotides around the PAS, i.e. with the PAS in the center and
both the up-stream and down-stream window sizes set to 100 bases.
7.3.2 Data
In a series of experiments, a large number of sequences are used to train and test
our classification model. They are from two sources.
(1) Training and testing sequences used by the program Erpin [61]. The training set
contains 2327 terminal sequences including 1632 “unique” and 695 “strong” poly(A)
sites. The testing set consists of 982 positive sequences containing annotated PASes
from EMBL and four sets of same sized negative sequences: 982 CDS sequences, 982
intronic sequences from the first intron, 982 randomized UTR sequences with the same
1st-order Markov model as human 3’ UTRs, and 982 randomized UTR sequences with the
same mononucleotide composition as human 3’ UTRs. The 2327 training sequences can be
downloaded from http://tagc.univ-mrs.fr/pub/erpin/ and have been trimmed in
accordance with our window segments, i.e. every sequence contains 206 bases with a
PAS in the center. We obtained the testing data sets from Dr. Gautheret via personal
communication.
(2) Human RefSeq mRNA data set: we obtained 312 human mRNA sequences
from RefSeq [94] release 1. Each of these sequences contains a “polyA-signal” feature
tag carrying an “evidence=experimental” label. We use these sequences to build a model
for PAS prediction in mRNA sequences. Besides, we also extracted a set of human
mRNA sequences from RefSeq containing a “polyA-site” feature tag carrying an
“evidence=experimental” label. In this set, we removed the sequences that have been
included in the training set used in building our model. We use these sequences for
testing purposes, assuming that the annotated PAS positions are correct. Our negative
data set was generated by scanning for occurrences of AATAAA in coding regions;
AATAAA sites near the end of a sequence were purposely excluded.
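A minimal sketch of the window extraction used for both positive and negative PAS candidates (100 bases on each side of the 6-base hexamer, 206 bases in total) is given below; the function name and the simple string search are ours, not the extraction pipeline actually used.

def pas_windows(seq, hexamer="AATAAA", flank=100):
    # yield 206-base windows: 100 bases up-stream + the hexamer + 100 bases down-stream
    start = seq.find(hexamer, flank)
    while start != -1:
        end = start + len(hexamer) + flank
        if end <= len(seq):
            yield seq[start - flank:end]
        start = seq.find(hexamer, start + 1)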
7.3.3 Experiments and Results
First, we use simple 1-gram, 2-gram and 3-gram nucleotide acid patterns to
construct the feature space [69]. Thus, there are 168 (= (4 + 4² + 4³) × 2) features.
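A minimal sketch of this nucleotide k-gram feature space (illustrative names; counting is done separately on the up-stream and down-stream halves of the window):

from itertools import product

NT_GRAMS = ["".join(p) for k in (1, 2, 3) for p in product("ACGT", repeat=k)]   # 84 patterns

def pas_feature_counts(up_nt, down_nt):
    counts = {}
    for prefix, s in (("UP", up_nt), ("DOWN", down_nt)):
        for g in NT_GRAMS:
            counts[prefix + "-" + g] = sum(
                s[i:i + len(g)] == g for i in range(len(s) - len(g) + 1))
    return counts   # (4 + 4**2 + 4**3) * 2 = 168 feature values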
Table 7.5: Validation results by different programs on a set of 982 annotated UTR
sequences from the EMBL database [61]. TP is the number of true positives. FN is the
number of false negatives.
Preliminary results
In the first experiment, we use the 2327 sequences introduced in [61] (data source (1))
as our true PAS training data. To obtain negative sequences, a same-sized set of false PAS data
is randomly selected from our own extracted negative data set (data source (2)). Using
the entropy-based feature selection algorithm and a linear-kernel SVM classifier, the
sensitivity and specificity of 10-fold cross-validation on training data are 89.3% and
80.5%, respectively. In order to compare with other programs, we test our model on the
same validation sets whose testing results on programs Erpin and Polyadq were reported
in [61]. As described in data source (1), these validation sets include true PAS
sequences that came from 982 annotated UTRs and four same-sized control sets known not
to contain PASes: coding sequences (CDS), introns and randomized UTRs (simply
shuffled UTRs and 1st-order Markov model UTRs). For a direct comparison, we also
adjust the prediction sensitivity on the 982 true PASes to around 56.0% so that
evaluation can be made on the predictions for those four control sets.
Table 7.5 shows the validation results on true PASes and Table 7.6 illustrates the results
on the four control sets. Figure 7.6 is the ROC curve for this series of tests. All the numbers
regarding the performance of the programs Erpin and Polyadq in Table 7.5 and Table 7.6
are copied or derived from [61]. The results in Table 7.6 demonstrate that our model
gives better performance than Erpin and Polyadq on false PAS prediction for
CDS, intron and simple-shuffling sequences, and almost the same prediction accuracy on
sequences with 1st-order Markov randomization.
In this experiment, we select 113 features via the entropy measure. These features
are then integrated with SVM to form the classification and prediction model. Table 7.7
lists the top 10 of these features, ranked by their entropy values (the smaller the entropy
value, the more important the feature). Some of these top features can be
interpreted in terms of reported motifs; for example, both the USE and the DSE are
clearly visible as G/U-rich segments, since UP-TGT, UP-T, DOWN-TGT, DOWN-
T, UP-TG and UP-TT are among the top features.
Table 7.6: Validation results by different programs on different sequences not
containing PASes: coding sequences (CDS), introns, and two types of randomized UTR
sequences (simple shuffling and 1st order Markov simulation) [61]. TN is the number of
true negatives. FP is the number of false positives. CC is correlation coefficient,
CC = (TP × TN - FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN)).
Calculations of Precision and CC use TP
and FN from Table 7.5.
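For reference, a minimal sketch of computing the correlation coefficient written in the caption above from the confusion-matrix counts:

from math import sqrt

def correlation_coefficient(tp, tn, fp, fn):
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0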
Model for predicting PAS in mRNA sequences
When we apply our model to 312 true PASes that were extracted from mRNA
sequences by ourselves (data source (2)), the results obtained are not good: only
around 20% of them can be predicted correctly. Besides, the program Erpin performs
even worse on these PASes, with a prediction accuracy of only 13%. These poor results
may indicate that the good features used in the model for PAS prediction in DNA
sequences are not effective for mRNA.

Table 7.7: The top 10 features selected by the entropy-based feature selection method for
PAS classification and prediction in human DNA sequences.

Therefore, we
decide to build another model for mRNA sequences without poly(A) tails. This model is
also expected to provide a new way for predicting the mRNA cleavage site/poly(A)
addition site.
Since the new model is aimed at predicting PASes from mRNA
sequences, we only consider the up-stream elements around a candidate PAS.
Therefore, there are only 84 features (instead of 168 features). To train the model, we
use 312 experimentally verified true PASes and the same number of false PASes randomly
selected from our prepared negative data set. The validation set comprises
767 annotated PASes and the same number of false PASes, also from our negative data set
but different from those used in training (data source (2)). This time, we achieve
reasonably good results. Sensitivity and specificity for 10-fold cross-validation on
training data are 79.5% and 81.8%, respectively. The validation result is 79.0% sensitivity
at 83.6% specificity. Besides, we observe that the top-ranked features are different from
those listed in Table 7.7 (detailed features not shown).
Since every 3 nucleotides code for an amino acid when mRNA sequences are
translated into proteins, it is legitimate to investigate whether an alternative approach
that generates features based on amino acids can produce more effective
PAS prediction for mRNA sequence data. In fact, this idea is also encouraged by the
good results we achieved in the TIS prediction described in the previous section.
Similarly to what we did in TIS prediction, we transform the up-stream
nucleotides of the sequence window set for each candidate PAS into an amino acid
sequence segment by coding every nucleotide triplet as an amino acid or a stop codon.
The feature space is constructed using 1-gram and 2-gram amino acid patterns. Since
only up-stream elements around a candidate PAS are considered, there are 462
(= 21 + 21²) possible amino acid patterns. In addition to these patterns, we also encode
existing knowledge via an additional feature, the number of T residues in the up-
stream region, denoted “UP-T-Number”. Thus, there are 463 candidate features in total.
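A minimal sketch of building this up-stream-only feature vector is shown below (illustrative names; it assumes Biopython's translate and, since the 100-base up-stream segment is not a multiple of three, trims the leftmost extra base so that triplets end at the PAS, a framing choice the text does not specify).

from Bio.Seq import Seq

def mrna_pas_features(up_nt):
    feats = {"UP-T-Number": up_nt.count("T")}   # count of T residues up-stream
    trimmed = up_nt[len(up_nt) % 3:]            # keep a whole number of triplets
    up_aa = str(Seq(trimmed).translate())       # amino acid letters plus '*' for stop codons
    for k in (1, 2):                            # 1-gram and 2-gram amino acid patterns
        for j in range(len(up_aa) - k + 1):
            key = "UP-" + up_aa[j:j + k]
            feats[key] = feats.get(key, 0) + 1
    return feats   # up to 462 pattern counts plus UP-T-Number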
In the new feature space, we conduct feature selection and train SVM on the 312
true PASes and the same number of false PASes. The 10-fold cross-validation results on the
training data are 81.7% sensitivity with 94.1% specificity. When we apply the trained model to
our validation set containing 767 true PASes and 767 false PASes, we achieve 94.4% sensitivity
with 92.0% specificity (the correlation coefficient is as high as 0.865). Figure 7.7 is the ROC curve
of this validation. In this experiment, there are only 13 selected features, and UP-T-Number is
the best feature. This indicates that the up-stream sequence of a PAS in an mRNA sequence may
also contain T-rich segments. However, when we apply this model, built for mRNA sequences
using amino acid patterns, to predict PASes in DNA sequences, we cannot get results as good as
those achieved in the previous experiment. This indicates that the down-stream elements are
indeed important for PAS prediction in DNA sequences.
CHAPTER - 8
CONCLUSION

Future work
We live in a world where vast amounts of data are collected daily. Analyzing such data is
an important need, and data mining plays an important role in meeting it by providing
tools to discover knowledge from data. Nowadays data mining is used in almost every
area. This thesis presented a survey of data mining for biological and environmental
problems, observed the many kinds of concepts and techniques applied to these
problems, and examined how they handle complicated and difficult types of data. It
highlighted biological sequence problems, such as protein and genomic sequences, and
other biological applications such as cancer prediction. On the environmental side,
earthquakes, landslides, spatial data and environmental tools were also discussed,
together with the data mining algorithms, tools and concepts used in these problems,
such as MATLAB, WEKA, SWISS-PROT, clustering and biclustering.
REFERENCE
1. Agrawal R., Imielinski T. and Swami A. (1993), “Mining association rules between
sets of items in large databases”, in ACM SIGMOD, 1993.
2. Arthur David and Vassilvitskii Sergei (2007), “k-means++: The Advantages of
Careful Seeding”, SODA ’07: Proceedings of the eighteenth annual ACM-
SIAM,1027-1035.
3. Basheer M. Al-Maqaleh and Hamid Shahbazkia (2012), “A Genetic Algorithm for
Discovering Classification Rules in Data Mining”, International Journal of
Computer Applications (0975 – 8887), 41(18): 40-44.
4. Bifet A., Holmes G., Pfahringer B., Kirkby R., Gavalda R.(2009), “New
ensemblemethods for evolving data streams”, In KDD, ACM, 139-148.
5. Dietterich T. (2002), “Ensemble learning, in The Handbook of Brain Theory and
Neural Networks”, 2nd ed., M. Arbib, Ed., Cambridge MA: MIT Press.
6. Dr. Bhatia, M.P.S. and Khurana, Deepika (2013), “Experimental study of Data
clustering using k-Means and modified algorithms”, International Journal of
Data Mining & Knowledge Management Process (IJDKP), 3(3): 17-30.
7. Dr. Hemalatha M. and Saranya Naga N. (2011), “A Recent Survey on
Knowledge Discovery in Spatial Data Mining” IJCSI International Journal of
Computer Science, 8 (3):473-479.
8. Dzeroski S. and Zenko B. (2004), “Is combining classifiers with stacking better
than selecting the best one? Machine Learning”, 255–273.
9. Ester Martin, Kriegel Peter Hans, Sander Jörg (2001), “Algorithms and
Applications for Spatial Data Mining Published in Geographic Data Mining and
Knowledge Discovery, Research Monographs in GIS”, Taylor and Francis, 1-
32.
10. Elena Makhalova, (2013), “Fuzzy C means Clustering in MATLAB”, The 7th
International Days of Statistics and Economics, Prague, 19(21): 905-914.
11. Ester Martin, Kriegel Peter Hans, Sander Jörg (2001), “Algorithms and
Applications for Spatial Data Mining Published in Geographic Data Mining and
Knowledge Discovery, Research Monographs in GIS”, Taylor and Francis, 1-
32.
12. Ghosh, Soumi and Dubey, Sanjay, Kumar (2013), “Comparative Analysis of
KMeans and Fuzzy C-Means Algorithms”, (IJACSA) International Journal of
Advanced Computer Science and Applications, 4(4):35-39.
13. Jain Shreya and Gajbhiye Samta (2012), “A Comparative Performance Analysis
of Clustering Algorithms”, International Journal of Advanced Research in
Computer Science and Software Engineering, 2(5):441-445.
14. Kalyankar A. Meghali and Prof. Alaspurkar S. J. (2013), “Data Mining
Technique to Analyse the Metrological Data”, International Journal of
Advanced Research in Computer Science and Software Engineering, 3(2):114-
118.
15. Kavitha K. and Sarojamma R M (2012), “Monitoring of Diabetes with Data
Mining via CART Method”, International Journal of Emerging Technology and
Advanced Engineering, 2 (11):157-162.
16. Lavrac Nada and Novak Kralj Petra (2013), “Relational and Semantic Data
Mining for Biomedical Research”, Informatica (Slovenia), 37(1): 35-39.
17. Lai J-S and Tsai F.(2012),“Verification And Risk Assessment For Landslides In
The Shimen Reservoir Watershed Of Taiwan Using Spatial Analysis And Data
Mining”, International Archives of the Photogrammetry, Remote Sensing and
Spatial Information Sciences, XXXIX(B2): 67-70.
18. Li Yifeng, Ngom, Alioune (2013), “The Non-Negative Matrix Factorization
Toolbox for Biological Data Mining”, Source Code for Biology & Medicine, 8
(1):1-18.
19. Mrs. Sujatha B. and Akila C., (2012), “Prediction of Breast Cancer Using Data
Mining”, International Journal of Computer Science and Management Research,
1 (3): 384-389.