
Algorithmic Approach for Removing the Redundancy in Diabetic Gene Categories Based on Semantic Similarity and Gene Expression Data
Atul Kumar & D. Jeya Sundara Sharmila

Interdisciplinary Sciences: Computational Life Sciences
ISSN 1913-2751
Interdiscip Sci Comput Life Sci
DOI 10.1007/s12539-015-0113-z


ORIGINAL RESEARCH ARTICLE

Atul Kumar (1) · D. Jeya Sundara Sharmila (2)

Received: 14 October 2014 / Revised: 27 November 2014 / Accepted: 21 January 2015


© International Association of Scientists in the Interdisciplinary Areas and Springer-Verlag Berlin Heidelberg 2015

Abstract Despite considerable advances in gene expression microarray technology, the main
hindrance in analyzing microarray data is the limited number of samples relative to the number
of genes measured, which is a major impediment to revealing actual gene functionality and
valuable information from the data. Analyzing gene expression data can indicate the factors
which are differentially expressed in the diseased tissue. Most of these genes play no part in
causing the disease of interest; thus, identification of disease-causing genes can reveal not
just the cause of the disease, but also its pathogenic mechanism. Many gene selection methods
can remove irrelevant genes, but most of them are inadequate in removing redundant genes from
microarray data, and this redundancy increases the computational cost and decreases the
classification accuracy. Combining the gene expression data with gene ontology information can
help determine the redundancy, which can then be removed using the algorithm presented in this
work. The gene list obtained after these sequential steps of the algorithm can be analyzed
further to obtain the most deterministic genes responsible for type 2 diabetes.

Corresponding author: Atul Kumar, atul.0298@gmail.com

1 Department of Bioinformatics, Karunya University, Coimbatore, Tamil Nadu, India

2 Department of Nanosciences and Technology, Tamil Nadu Agriculture University, Coimbatore, Tamil Nadu, India

Keywords Microarray technology · Gene expression · Diabetes · Greedy algorithm · Gene ontology · Semantic similarity · Pearson correlation · GEO database

1 Introduction
Rapid advancement in gene expression microarray technology has enabled simultaneous measurement of the
expression levels for tens of thousands of genes in a single
experiment [1]. Analyzing gene expression data can show
the factors which are differentially expressed in the diseased tissue [2]. The main hindrance in analyzing
microarray data is its limited number of samples as compared to number of genes. Most of these genes have no role
to play in causing the disease of interest; thus, identification of disease-causing genes can reveal not just the cause
of the disease, but also its pathogenic mechanism [3].
Available gene selection methods have the capacity to
remove irrelevant genes, but most of them are inadequate
in removing redundancy in genes from microarray data,
which increases the computational cost and decreases the
classification accuracy. Due to the presence of noise and
low number of samples in microarray data, the actual gene
functionality and valuable information cannot be easily
revealed from the data. Gene expression data in combination with gene ontology information can be helpful in
determining the redundancy which can then be removed
using the algorithm mentioned in the work. Pearson correlation and semantic similarity measure
have been combined to find the similarity between the two genes [3]. Due to the low sample
number, Pearson correlation alone cannot be considered for finding the similarity, and due to
incomplete information in gene ontology, the semantic similarity measure is insufficient to
determine the similarity between the genes, so the average of the scores of both Pearson
correlation and semantic similarity measure is used to determine the similarity between the two
genes.

$$R(g_i, g_j) = \frac{R_{exp}(g_i, g_j) + R_{sem}(g_i, g_j)}{2}$$

$R_{sem}(g_i, g_j)$ and $R_{exp}(g_i, g_j)$ represent the semantic similarity and the
expression similarity of genes $g_i$ and $g_j$, respectively.
Semantic similarity measures can be used to calculate the similarity of two concepts organized
in an ontology. The ontology structure defines the function parents(c) that, given a concept c,
returns the set of more generic concepts directly linked to c [4]. Based on this, Resnik, Jiang
and Conrath, and Lin proposed three different ways of calculating the semantic similarity.
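The measures named above can be illustrated with a minimal sketch. It assumes the commonly cited information-content definitions (Resnik: IC of the most informative common ancestor; Lin: twice that value divided by the summed IC of the two terms; Jiang and Conrath: a distance converted here to a similarity as 1/(1 + distance); simRel: Lin's value weighted by one minus the probability of the common ancestor). The toy ontology, term names and probabilities are hypothetical and are not taken from the datasets in this study.

```python
import math

# Hypothetical toy ontology: term -> set of direct parents, i.e. parents(c).
PARENTS = {
    "root": set(),
    "binding": {"root"},
    "catalytic": {"root"},
    "dna_binding": {"binding"},
    "rna_binding": {"binding"},
}

# Hypothetical annotation probabilities p(c); a parent is at least as probable as its children.
PROB = {"root": 1.0, "binding": 0.5, "catalytic": 0.5, "dna_binding": 0.2, "rna_binding": 0.1}


def ancestors(term):
    """Return the term itself plus every more generic concept reachable via parents(c)."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def ic(term):
    """Information content: IC(c) = -log p(c)."""
    return -math.log(PROB[term])


def mica(a, b):
    """Most informative common ancestor of two terms."""
    return max(ancestors(a) & ancestors(b), key=ic)


def sim_resnik(a, b):
    return ic(mica(a, b))


def sim_lin(a, b):
    denom = ic(a) + ic(b)
    return 2 * ic(mica(a, b)) / denom if denom else 1.0


def sim_jiang_conrath(a, b):
    # Jiang and Conrath define a distance; 1 / (1 + distance) is one common conversion.
    distance = ic(a) + ic(b) - 2 * ic(mica(a, b))
    return 1.0 / (1.0 + distance)


def sim_rel(a, b):
    # simRel weights Lin's measure by (1 - p(MICA)), so generic common ancestors count less.
    return sim_lin(a, b) * (1.0 - PROB[mica(a, b)])
```

In this toy example the two binding terms share the ancestor "binding", whereas "dna_binding" and "catalytic" share only the root and therefore score zero under the Resnik, Lin and simRel measures.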
The Pearson correlation coefficient is used to find the expression similarity of two genes,
where $g_i^{av}$ denotes the average expression value of gene $g_i$ and $g_{ik}$ represents the
value of the kth sample of gene $g_i$. In the current work, based on the previously mentioned
concept, an algorithm has been designed which helps in reducing the redundancy in the
microarray data samples.
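For reference, the standard form of the Pearson coefficient in the notation used here would be

$$R_{exp}(g_i, g_j) = \frac{\sum_k (g_{ik} - g_i^{av})(g_{jk} - g_j^{av})}{\sqrt{\sum_k (g_{ik} - g_i^{av})^2 \, \sum_k (g_{jk} - g_j^{av})^2}}$$

This is the textbook definition and is given as an assumption about the form used. The sketch below computes it and averages it with a semantic score. The expression values and the semantic score are hypothetical, and the handling of a flat expression profile is a choice made for this sketch only.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    x_av = sum(x) / n
    y_av = sum(y) / n
    cov = sum((a - x_av) * (b - y_av) for a, b in zip(x, y))
    var_x = sum((a - x_av) ** 2 for a in x)
    var_y = sum((b - y_av) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a flat profile is treated as uncorrelated
    return cov / math.sqrt(var_x * var_y)


def combined_similarity(expr_i, expr_j, r_sem):
    """Average of expression and semantic similarity: (R_exp + R_sem) / 2."""
    return (pearson(expr_i, expr_j) + r_sem) / 2


# Hypothetical expression values over four samples and a hypothetical semantic score.
g1 = [1.0, 2.0, 3.0, 4.0]
g2 = [2.0, 4.0, 6.0, 8.0]
score = combined_similarity(g1, g2, r_sem=0.7)
```

Because the Pearson coefficient ranges over [-1, 1] while the semantic measures are non-negative, a strongly anti-correlated pair lowers the averaged score.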

2 Materials and Methods


Seventy-one samples from different tissues of Homo sapiens (diabetic and normal) were collected
from the GEO database [5] and the Diabetes Genome Anatomy Project (DGAP). Of these, 37 samples
are from normal subjects and 34 are from diabetic subjects (Table 1) [6]. Using the gene
ontology information of all the genes in the given datasets, semantic similarity was calculated
for all the combinations of the GO terms present in a particular dataset based on the three
methods given by Lin, Resnik, and Jiang and Conrath, and on a combination of Resnik's and Lin's
similarity measures (simRel) (Schlicker and Albrecht 2007). Expression similarity for all the
combinations of the GO terms present in a particular dataset was calculated through the Pearson
correlation coefficient. Semantic similarity values (simRel) and Pearson values were averaged,
and based on the average value, a greedy algorithm was followed to obtain the genes which have
a value less than the threshold value of 0.8. The threshold of 0.8 was chosen after several
experimental trials, which showed that a threshold value > 0.8 left a number of similar genes
in the output file, whereas a value < 0.8 resulted in the loss of many of the important genes.

3 Results and Discussion


3.1 Removing the Gene Duplicity by the Elimination of the Genes Having the Same GO ID
Datasets were submitted to the SOURCE server
(http://smd.princeton.edu/cgi-bin/source/sourceBatchSearch) to obtain the gene ontology
information for all the genes present in the datasets. The server returned an output file with
genes and their corresponding GO IDs and categorized the genes based on three hierarchies that
define functional attributes of gene products: molecular function (MF), biological process (BP)
and cellular component. In the dataset "Effect of insulin infusion on human skeletal muscle",
out of 22,172 genes, 16,167 (73 %) genes were reported to be involved in molecular function,
857 (4 %) genes in biological process and 633 (3 %) genes in cellular component. Among these
22,172 genes, there were 4515 (20 %) genes for which no information was present in the gene
ontology database (Fig. 1). Except for "Human pancreatic islets from normal and type 2 diabetic
subjects (B)", all the datasets taken for study showed an almost similar distribution of genes
across the hierarchies; this is because the gene chip array chosen for that dataset was
HG-U133 B [8], unlike the others where it was HG-U133 A (Table 2). The use of a different gene
chip array caused a major change in the distribution of genes across the hierarchies for the
gene set "Human pancreatic islets from normal and type 2 diabetic subjects (B)" (Fig. 2). Out
of 22,550 genes, 8664 (38 %) genes were reported to be involved in molecular
Table 1 Dataset samples taken for studies [6]

Accession | Data | No. of samples (Normal) | No. of samples (Diabetic) | No. of genes | Country
GSE7146 | Effect of insulin infusion on human skeletal muscle [7] | | | 22,215 | Sweden
DGAP | Human pancreatic islets from normal and type 2 diabetic subjects (A) [8] | | | 22,191 | Caucasian and Asian
DGAP | Human pancreatic islets from normal and type 2 diabetic subjects (B) [8] | | | 22,550 |
DGAP | Human skeletal muscle–type 2 diabetes [9] | 17 | 18 | 22,177 | Sweden


Fig. 1 Distribution of genes in different hierarchy for all datasets except "Human pancreatic islets from normal and type 2 diabetic subjects (B)"

Fig. 2 Distribution of genes in different hierarchy for "Human pancreatic islets from normal and type 2 diabetic subjects (B)"

function, 834 (4 %) genes in biological process and 1016 (5 %) genes in cellular component,
whereas there were 12,036 (53 %) genes for which no information was present in the gene
ontology database.

The results obtained after executing the algorithm showed a drastic decrease in gene number by
removing the redundant genes from the datasets. Table 3 summarizes the result of the first step
of redundancy reduction.

The result file obtained through the SOURCE server showed high redundancy in the GO IDs. Thus,
a plugin from Ablebits, a commercial software package (free trial version)
(http://www.ablebits.com/), was used to generate a status column marking as "Duplicate" the GO
IDs that were repeated. Based on the Fisher score [6], all the genes in each duplicate gene set
except the top-scored gene were removed using the algorithm given below.
Input: GO file with duplicate status column
Output: Nonredundant GO terms file
Initialize:
  Read file
  Push each line into an array
  Count till end of the file
Repeat until i > count
  Split each line and put in another array
  If status column is Duplicate
    Eliminate the line
End
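The elimination step can be sketched in Python as follows. This is an illustrative variant, not the authors' script: instead of relying on a precomputed "Duplicate" status column, it groups rows by GO ID directly and keeps the row with the highest Fisher score. The column names (`gene`, `go_id`, `fisher_score`), the gene names and all values are hypothetical placeholders.

```python
import csv
import io


def drop_duplicate_go_ids(rows):
    """Keep, for each GO ID, only the row with the highest Fisher score."""
    best = {}
    for row in rows:
        go_id = row["go_id"]
        if go_id not in best or float(row["fisher_score"]) > float(best[go_id]["fisher_score"]):
            best[go_id] = row
    return list(best.values())


# Hypothetical input: gene symbol, GO ID and Fisher score (all values invented).
sample = io.StringIO(
    "gene,go_id,fisher_score\n"
    "GENE_A,GO:0000001,0.91\n"
    "GENE_B,GO:0000001,0.40\n"
    "GENE_C,GO:0000002,0.15\n"
)
nonredundant = drop_duplicate_go_ids(csv.DictReader(sample))
for row in nonredundant:
    print(row["gene"], row["go_id"])
```

Here GENE_B is dropped because it shares a GO ID with the higher-scoring GENE_A.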

3.2 Semantic Similarity Between Genes in Datasets


Out of all the gene sets obtained from the above results, the genes categorized under molecular
function were taken to identify the semantic similarity among them using FunSimMat [10]
(http://funsimmat.bioinf.mpi-inf.mpg.de/). Molecular function represents the ability or job
performed by a gene product, whereas biological process and cellular component represent,
respectively, recognized series of events or molecular functions, and locations at the level of
subcellular structures and macromolecular complexes; biological process genes and cellular
component genes were therefore not considered for finding semantic similarity. The semantic
similarity was determined using the Resnik, Jiang and Conrath, and Lin methods and a
combination of the Resnik and Lin methods [4]. All the possible combinations of genes in the
different datasets were generated, and a semantic value for each combination was generated
using the above-mentioned methods. The number of different combinations generated
Table 2 Distribution of genes in different hierarchy for each dataset under study

Accession | Data | Molecular function | Biological process | Cellular component | No gene information
GSE7146 | Effect of insulin infusion on human skeletal muscle | 16,167 | 857 | 633 | 4515
DGAP | Human pancreatic islets from normal and type 2 diabetic subjects (A) | 16,176 | 860 | 633 | 4522
DGAP | Human pancreatic islets from normal and type 2 diabetic subjects (B) | 8664 | 834 | 1016 | 12,036
DGAP | Human skeletal muscle–type 2 diabetes | 16,165 | 859 | 631 | 4522

Table 3 Number of genes in different hierarchy after first step of redundancy reduction

Accession | Data | Molecular function | Biological process | Cellular component
GSE7146 | Effect of insulin infusion on human skeletal muscle | 1297 | 257 | 38
DGAP | Human pancreatic islets from normal and type 2 diabetic subjects (A) | 1297 | 258 | 38
DGAP | Human pancreatic islets from normal and type 2 diabetic subjects (B) | 878 | 216 | 45
DGAP | Human skeletal muscle–type 2 diabetes | 1296 | 256 | 38

Table 4 Number of different combinations of genes in datasets

Accession | Data | Total number of combinations
GSE7146 | Effect of insulin infusion on human skeletal muscle | 840,456
DGAP | Human pancreatic islets from normal and type 2 diabetic subjects (A) | 840,456
DGAP | Human pancreatic islets from normal and type 2 diabetic subjects (B) | 385,881
DGAP | Human skeletal muscle–type 2 diabetes | 840,456

in each dataset is shown in Table 4. Out of these, the semantic values of the combined Resnik
and Lin measure (simRel) were used for calculation, as this method takes relevance information
into account and provides a high relevance in generic terms for comparing the exact function of
different gene products [11].
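As a quick check on how such combination counts arise, the sketch below enumerates unordered gene pairs and counts them with the binomial coefficient. It assumes that a "combination" means an unordered pair of distinct genes; under that assumption, the 1297 molecular-function genes listed in Table 3 give exactly the 840,456 pairs listed in Table 4. The gene identifiers in the example are hypothetical.

```python
import math
from itertools import combinations

# Hypothetical gene identifiers standing in for a nonredundant molecular-function set.
genes = ["g1", "g2", "g3", "g4"]
pairs = list(combinations(genes, 2))  # unordered pairs, no self-pairs

# The number of unordered pairs of n genes is n * (n - 1) / 2.
n_small = math.comb(len(genes), 2)

# 1297 molecular-function genes (Table 3) under the unordered-pair assumption.
n_pairs = math.comb(1297, 2)
```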
3.3 Pearson Correlation Coefficient for Expression Similarity Between Genes in Dataset
To find the expression similarity between two genes, the Pearson correlation coefficient was
used. The same set of gene combinations as that returned by the server for semantic similarity
was required, so that the average of the semantic and expression similarity values could be
calculated. To obtain this set, the following algorithm was executed, which generated the gene
combinations matching those of the semantic similarity file.
Input 1: Semantic similarity file
Input 2: Nonredundant GO terms file
Output: GO terms with same set of combinations
Initialize:
  Read input 1 and input 2
  Push each line into an array
  Count till end of the file
Repeat until i > count
  Split each line and put in another array
  If GO IDs in columns of both the arrays are same
    Print the line
End
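One way to express this matching step in Python is sketched below. It is an interpretation rather than the authors' implementation: pairs are stored under an order-independent key, and a semantic-similarity entry is kept only when both of its identifiers occur in the nonredundant set. The function names, identifiers and scores are hypothetical.

```python
def pair_key(a, b):
    """Order-independent key so (a, b) and (b, a) refer to the same combination."""
    return tuple(sorted((a, b)))


def align_pairs(semantic_scores, nonredundant_ids):
    """Keep semantic-similarity entries whose two IDs both occur in the nonredundant set."""
    keep = set(nonredundant_ids)
    return {
        pair_key(a, b): score
        for (a, b), score in semantic_scores.items()
        if a in keep and b in keep
    }


# Hypothetical identifiers and scores, for illustration only.
semantic = {("id1", "id2"): 0.9, ("id3", "id1"): 0.2, ("id2", "id4"): 0.5}
aligned = align_pairs(semantic, ["id1", "id2", "id3"])
```

The resulting keys can then be used to look up the expression profiles for exactly the same pairs, so that the two scores are averaged pair by pair.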


3.4 The Greedy Algorithm


The expression value and the semantic value (simRel, the combined Resnik and Lin value) for all
the different combinations in the gene set were averaged, and the score obtained was used for
removing the genes for which the average value was more than 0.8 [3]. Thus, using the threshold
value of 0.8, only those genes were obtained which were highly dissimilar and mainly contribute
to causing the disease. A greedy algorithm approach was used to obtain the unique set of genes.
Input: File with average score
Output: File with unique genes
Initialize:
  Read input
  Push each line into an array
  Count till end of the file
  Assign cutoff 0.8
Repeat until i > count
  Split each line and put in another array
  Splice each gene with different combination in separate arrays
  If any of the scores in combination is greater than cutoff
    Reject the gene
  Else
    Accept
End
The output file obtained through it contained all the unique genes, which are dissimilar to
each other with a similarity score of less than 0.8. The number of unique genes in each of the
datasets is summarized in Table 5.
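A minimal Python sketch of one possible greedy reading of this step is given below. It is an assumption, not the authors' code: genes are visited in the order supplied (for example by descending Fisher score, which is itself an assumption), and a gene is accepted only if its averaged score with every gene accepted so far does not exceed the cutoff. The pseudocode as printed could also be read as rejecting every gene that has any score above the cutoff, which would discard both members of a similar pair. Gene names and scores are hypothetical.

```python
def greedy_unique_genes(genes, scores, cutoff=0.8):
    """Accept genes in the given order unless they are too similar to one already accepted.

    scores maps frozenset({gene_a, gene_b}) to the averaged similarity;
    pairs without a score are treated as dissimilar (an assumption of this sketch).
    """
    accepted = []
    for gene in genes:
        too_similar = any(
            scores.get(frozenset((gene, kept)), 0.0) > cutoff for kept in accepted
        )
        if not too_similar:
            accepted.append(gene)
    return accepted


# Hypothetical genes and averaged scores.
genes = ["gA", "gB", "gC", "gD"]
scores = {
    frozenset(("gA", "gB")): 0.92,
    frozenset(("gA", "gC")): 0.30,
    frozenset(("gB", "gC")): 0.85,
    frozenset(("gC", "gD")): 0.79,
}
unique = greedy_unique_genes(genes, scores)
```

In this example gB is rejected because its score with the already accepted gA exceeds 0.8, while gC and gD are kept. The result depends on the visiting order, which is typical of greedy selection.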

Table 5 Number of unique genes in each dataset

Accession | Data | Number of unique genes
GSE7146 | Effect of insulin infusion on human skeletal muscle | 1223
DGAP | Human pancreatic islets from normal and type 2 diabetic subjects (A) | 1210
DGAP | Human pancreatic islets from normal and type 2 diabetic subjects (B) | 803
DGAP | Human skeletal muscle–type 2 diabetes | 1238

Table 6 Unique genes based on the Fisher score [6] for "Effect of insulin infusion on human skeletal muscle"

Top 10 genes based on Fisher score | Unique or repeated
G0S2: G0/G1switch 2 | Unique
SLC22A6: solute carrier family 22 (organic anion transporter), member 6 | Unique
CDC6: CDC6 cell division cycle 6 homolog (S. cerevisiae) | Repeated
SCNN1G: sodium channel, nonvoltage-gated 1, gamma | Unique
LOC441601 /// LOC652471: septin 7 pseudogene /// similar to septin 7 | Repeated
DNAJC1: DnaJ (Hsp40) homolog, subfamily C, member 1 | Unique
KIAA0692: KIAA0692 | Repeated
TLE1: transducin-like enhancer of split 1 (E (sp1) homolog, Drosophila) | Unique
UBXD8: UBX domain containing 8 | Repeated
ANKRD15: ankyrin repeat domain 15 | Repeated

Table 7 Unique genes based on the Fisher score [6] for "Human pancreatic islets from normal and type 2 diabetic subjects (A)"

Top 10 genes based on Fisher score | Unique or repeated
TMEM111: transmembrane protein 111 | Repeated
CYP7A1: cytochrome P450, family 7, subfamily A, polypeptide 1 | Unique
Hs.247983.0 | Repeated
NSUN5B: NOL1/NOP2/Sun domain family, member 5B | Repeated
HPRT1: hypoxanthine phosphoribosyltransferase 1 (Lesch-Nyhan syndrome) | Unique
C11orf61: chromosome 11 open reading frame 61 | Repeated
MRNA; cDNA DKFZp686F1844 (from clone DKFZp686F1844) | Unique
CYP3A4: cytochrome P450, family 3, subfamily A, polypeptide 4 | Unique
CTBP1: C-terminal binding protein 1 | Unique
FMO5: flavin containing monooxygenase 5 | Unique

The unique genes found through this approach were compared with the top 10 genes obtained
through Fisher discriminant analysis [6] to check whether the genes which obtained a high
ranking in Fisher score are unique or repeated. In the dataset "Effect of insulin infusion on
human skeletal muscle", only 5 of the top 10 genes were found to be unique. In the same way,
out of the top 10 genes, 6 genes for "Human pancreatic islets from normal and type 2 diabetic
subjects (A)", 4 genes for "Human pancreatic islets from normal and type 2 diabetic subjects
(B)" and 5 genes for "Human skeletal muscle–type 2 diabetes" were found to be unique (Tables 6,
7, 8 and 9). Most of the unique genes identified in the datasets and reported to have a high
Fisher score are involved in some pathway which is reported to be involved directly or
indirectly in causing type 2 diabetes [6].

4 Conclusion
The major problem with microarray data is the high redundancy in the genes, which prevents
useful and valuable information from being obtained from the data. The genes in the microarray
data either have the same GO ID or may have common ancestors, which makes them similar in
molecular function. The redundant genes have been removed based on the same GO ID at the first
step and then based on the average score of Pearson and semantic similarity. The reduction in
the number of genes at each step is summarized in Table 10.

The genes obtained after the final redundancy reduction step showed that almost 50 % of the
genes which got a high score in Fisher discriminant analysis [6] are repeated, and they are
eliminated, thus leaving only the genes which are unique. Further, the most discriminatory
genes, which may be a major target for the disease, can be identified by subjecting the unique
genes to a classifier such as a support vector machine or any other machine learning approach
which can classify the discriminatory genes and eliminate the others.

Table 8 Unique genes based on the Fisher score [6] for "Human pancreatic islets from normal and type 2 diabetic subjects (B)"

Top 10 genes based on Fisher score | Unique or repeated
THRAP6: thyroid hormone receptor-associated protein 6 | Unique
LPHN3: latrophilin 3 | Repeated
VPS36: vacuolar protein sorting 36 (yeast) | Unique
CAPS: calcyphosine | Unique
Transcribed locus | Unique
Transcribed locus, moderately similar to XP_517655.1 PREDICTED: similar to KIAA0825 protein [Pan troglodytes] | Repeated
MAP2K5: mitogen-activated protein kinase kinase 5 | Repeated
ZNF559: zinc finger protein 559 | Repeated
ZNF638: zinc finger protein 638 | Repeated
ZNF605: zinc finger protein 605 | Repeated

Table 9 Unique genes based on the Fisher score [6] for "Human skeletal muscle–type 2 diabetes"

Top 10 genes based on Fisher score | Unique or repeated
ZAK: sterile alpha motif and leucine zipper containing kinase AZK | Repeated
ANKHD1 /// MASK-BP3: ankyrin repeat and KH domain containing 1 /// MASK-4EBP3 alternate reading frame gene | Repeated
ProSAPiP1: ProSAPiP1 protein | Unique
PCDHB3: protocadherin beta 3 | Unique
ZNF688: zinc finger protein 688 | Repeated
CADPS2: Ca2+-dependent activator protein for secretion 2 | Unique
COX7A1: cytochrome c oxidase subunit VIIa polypeptide 1 (muscle) | Repeated
ZNF267: zinc finger protein 267 | Repeated
TMEM106C: transmembrane protein 106C | Unique
PLK1 /// RPL37A: polo-like kinase 1 (Drosophila) /// ribosomal protein L37a | Unique

Table 10 Reduction in redundancy of genes at each stage

Data | No. of genes obtained from GEO database | No. of genes in nonredundant GO set | No. of genes obtained after final step
Effect of insulin infusion on human skeletal muscle | 22,215 | 6107 | 1223
Human pancreatic islets from normal and type 2 diabetic subjects (A) | 22,191 | 6115 | 1210
Human pancreatic islets from normal and type 2 diabetic subjects (B) | 22,550 | 13,175 | 803
Human skeletal muscle–type 2 diabetes | 22,177 | 6113 | 1238


References
1. Schena M, Shalon D, Davis RW, Brown PO (1995) Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270:467–470
2. Zhang A (2006) Advanced Analysis of Gene Expression Microarray Data. World Scientific Publishing Co., Danvers
3. Mohammadi A, Saraee MH, Salehi M (2011) Identification of disease-causing genes using microarray data mining and Gene Ontology. BMC Med Genom 4:12–19
4. Couto FM, Silva MJ, Coutinho PM (2007) Measuring semantic similarity between Gene Ontology terms. Data Knowl Eng 61:137–152
5. Edgar R, Domrachev M, Lash AE (2002) Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res 30(1):207–210
6. Kumar A, Sharmila DJS, Kant R (2014) Selection of discriminatory gene set for Type II diabetes using fisher linear discriminant. Int J Adv Comput Mathe Sci 5(2):36–42
7. Parikh H, Carlsson E, Chutkow WA, Johansson LE, Storgaard H, Poulsen P, Saxena R, Ladd C, Schulze PC, Mazzini MJ, Jensen CB, Krook A, Bjornholm M, Tornqvist H, Zierath JR, Ridderstrale M, Altshuler D, Lee RT, Vaag A, Groop LC, Mootha VK (2007) TXNIP regulates peripheral glucose metabolism in humans. PLoS Med 4(5):868–879
8. Gunton JE, Kulkarni RN, Yim S, Okada T, Hawthorne WJ, Tseng YH, Roberson RS, Ricordi C, O'Connell PJ, Gonzalez FJ, Kahn CR (2005) Loss of ARNT/HIF1beta mediates altered gene expression and pancreatic-islet dysfunction in human type 2 diabetes. Cell 122(3):337–349
9. Mootha VK, Lindgren CM, Eriksson KF, Subramanian A, Sihag S, Lehar J, Puigserver P, Carlsson E, Ridderstrale M, Laurila E, Houstis N, Daly MJ, Patterson N, Mesirov JP, Golub TR, Tamayo P, Spiegelman B, Lander ES, Hirschhorn JN, Altshuler D, Groop LC (2003) PGC-1α-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nat Genet 34(3):267–273
10. Schlicker A, Albrecht M (2008) FunSimMat: a comprehensive functional similarity database. Nucleic Acids Res 36:434–439
11. Schlicker A, Domingues FS, Rahnenfuhrer J, Lengauer T (2006) A new measure for functional similarity of gene products based on Gene Ontology. BMC Bioinform 7:302–317
