Algorithmic Approach for Removing the Redundancy in Diabetic Gene Categories Based on Semantic Similarity and Gene Expression Data
Atul Kumar & D. Jeya Sundara Sharmila
Interdisciplinary Sciences: Computational Life Sciences
ISSN 1913-2751
Interdiscip Sci Comput Life Sci
DOI 10.1007/s12539-015-0113-z
ORIGINAL RESEARCH ARTICLE
Atul Kumar¹ · D. Jeya Sundara Sharmila²
Received: 14 October 2014 / Revised: 27 November 2014 / Accepted: 21 January 2015

© International Association of Scientists in the Interdisciplinary Areas and Springer-Verlag Berlin Heidelberg 2015
Abstract Despite considerable advances in gene expression microarray technology, the main hindrance in analyzing microarray data is the small number of samples relative to the number of genes (factors), which makes it difficult to reveal actual gene functionality and other valuable information from the data. Analyzing gene expression data can indicate which genes are differentially expressed in diseased tissue. Because most of these genes play no part in causing the disease of interest, identifying the disease-causing genes can reveal not just the cause of the disease but also its pathogenic mechanism. Many gene selection methods can remove irrelevant genes, but most of them are inadequate at removing redundant genes from microarray data, and this redundancy increases the computational cost and decreases the classification accuracy. Combining gene expression data with gene ontology information helps to determine the redundancy, which can then be removed using the algorithm described in this work. The gene list obtained after the sequential steps of the algorithm can be analyzed further to obtain the most deterministic genes responsible for type 2 diabetes.
✉ Atul Kumar
atul.0298@gmail.com
1 Department of Bioinformatics, Karunya University, Coimbatore, Tamil Nadu, India
2 Department of Nanosciences and Technology, Tamil Nadu Agriculture University, Coimbatore, Tamil Nadu, India
Keywords Microarray technology · Gene expression · Diabetes · Greedy algorithm · Gene ontology · Semantic similarity · Pearson correlation · GEO database
1 Introduction
Rapid advances in gene expression microarray technology have enabled the simultaneous measurement of expression levels for tens of thousands of genes in a single experiment [1]. Analyzing gene expression data can show which genes are differentially expressed in diseased tissue [2]. The main hindrance in analyzing microarray data is the small number of samples compared with the number of genes. Most of these genes play no role in causing the disease of interest; thus, identifying the disease-causing genes can reveal not just the cause of the disease but also its pathogenic mechanism [3]. Available gene selection methods can remove irrelevant genes, but most of them are inadequate at removing redundant genes from microarray data, and this redundancy increases the computational cost and decreases the classification accuracy. Because microarray data are noisy and contain few samples, the actual gene functionality and other valuable information cannot easily be revealed from the data alone. Gene expression data combined with gene ontology information can help to determine the redundancy, which can then be removed using the algorithm described in this work. Pearson correlation and a semantic similarity measure have been combined to find the similarity between two genes [3]. Because of the low sample number, Pearson correlation alone cannot be relied on to find the similarity, and because gene ontology information is incomplete, the semantic similarity measure alone is also insufficient. The average of the Pearson correlation and semantic similarity scores is therefore used as the similarity between two genes.
$$R(g_i, g_j) = \frac{R_{exp}(g_i, g_j) + R_{sem}(g_i, g_j)}{2}$$

where $R_{sem}(g_i, g_j)$ and $R_{exp}(g_i, g_j)$ represent the semantic similarity and the expression similarity of genes $g_i$ and $g_j$, respectively.
Semantic similarity measures can be used to calculate the similarity of two concepts organized in an ontology. The ontology structure defines the function parents(c) that, given a concept c, returns the set of more generic concepts directly linked to c [4]. Building on this structure, Resnik, Jiang and Conrath, and Lin proposed three different ways of calculating semantic similarity.
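As an illustration of these measures, the following minimal sketch evaluates the standard information-content definitions of the Resnik, Lin and Jiang and Conrath measures, together with the simRel combination, on a toy ontology. The term names, the parent links and the annotation probabilities are hypothetical and are not taken from the gene ontology or from the datasets studied here.

```python
import math

# Hypothetical toy ontology: each term maps to its direct parents.
PARENTS = {
    "root": set(),
    "binding": {"root"},
    "catalytic": {"root"},
    "ion_binding": {"binding"},
    "protein_binding": {"binding"},
}

# Hypothetical annotation probabilities p(c): fraction of gene products
# annotated to the term or to any of its descendants (the root has p = 1).
PROB = {
    "root": 1.0,
    "binding": 0.5,
    "catalytic": 0.4,
    "ion_binding": 0.2,
    "protein_binding": 0.25,
}


def ancestors(term):
    """Return the term itself plus every term reachable through its parents."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def ic(term):
    """Information content: IC(c) = -log p(c)."""
    return -math.log(PROB[term])


def common_ancestors(c1, c2):
    return ancestors(c1) & ancestors(c2)


def sim_resnik(c1, c2):
    """Resnik: IC of the most informative common ancestor."""
    return max(ic(c) for c in common_ancestors(c1, c2))


def sim_lin(c1, c2):
    """Lin: shared IC normalised by the IC of both terms."""
    denom = ic(c1) + ic(c2)
    if denom == 0:  # both terms are the root, so no information is shared
        return 0.0
    return 2 * sim_resnik(c1, c2) / denom


def dist_jiang_conrath(c1, c2):
    """Jiang and Conrath: a distance, so 0 means identical terms."""
    return ic(c1) + ic(c2) - 2 * sim_resnik(c1, c2)


def sim_rel(c1, c2):
    """simRel: Lin-style ratio weighted by the relevance 1 - p(c)."""
    denom = ic(c1) + ic(c2)
    if denom == 0:
        return 0.0
    return max(2 * ic(c) / denom * (1 - PROB[c])
               for c in common_ancestors(c1, c2))


pair = ("ion_binding", "protein_binding")
for measure in (sim_resnik, sim_lin, dist_jiang_conrath, sim_rel):
    print(measure.__name__, round(measure(*pair), 4))
```

In this toy case the most informative common ancestor is "binding", and the relevance factor 1 - p(c) halves the Lin score, which shows how simRel lowers the weight of generic terms.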
The Pearson correlation coefficient is used to find the expression similarity of two genes:

$$R_{exp}(g_i, g_j) = \frac{\sum_{k}(g_{ik} - g_i^{av})(g_{jk} - g_j^{av})}{\sqrt{\sum_{k}(g_{ik} - g_i^{av})^2 \sum_{k}(g_{jk} - g_j^{av})^2}}$$

where $g_i^{av}$ denotes the average expression value of gene $g_i$ and $g_{ik}$ represents the value of the kth sample for gene $g_i$. In the current work, an algorithm based on these concepts has been designed to reduce the redundancy in microarray data samples.
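The expression similarity and the averaged score can be sketched in a few lines of Python. This is a minimal illustration using the textbook Pearson formula; the function names and the expression values are hypothetical, and the semantic score of 0.5 is an arbitrary placeholder.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant profile has no defined correlation")
    return cov / math.sqrt(var_x * var_y)


def combined_similarity(r_exp, r_sem):
    """Average of the expression and semantic similarity scores."""
    return (r_exp + r_sem) / 2


# Hypothetical expression values of two genes across five samples.
gene_i = [1.0, 2.0, 3.0, 4.0, 5.0]
gene_j = [2.0, 4.0, 6.0, 8.0, 10.0]

r_exp = pearson(gene_i, gene_j)
r_sem = 0.5  # placeholder semantic similarity for the same gene pair
print(r_exp, combined_similarity(r_exp, r_sem))
```

With only a handful of samples the Pearson value is easily dominated by noise, which is the motivation given above for averaging it with a semantic score.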
2 Materials and Methods

Seventy-one samples from different tissues of Homo sapiens (diabetic and normal) were collected from the GEO database [5] and the Diabetes Genome Anatomy Project (DGAP). Of these, 37 samples are from normal subjects and 34 are from diabetic subjects (Table 1) [6]. Using the gene ontology information of all the genes in the given datasets, semantic similarity was calculated for all combinations of the GO terms present in a particular dataset using the three methods given by Lin, Resnik, and Jiang and Conrath, and a combination of Resnik's and Lin's similarity measures (simRel) (Schlicker and Albrecht 2007). Expression similarity for the same combinations was calculated with the Pearson correlation coefficient. The semantic similarity values (simRel) and the Pearson values were averaged, and a greedy algorithm based on this average was used to obtain the genes whose value is below the threshold of 0.8. The threshold of 0.8 was chosen after several experimental trials: a threshold value >0.8 left a number of similar genes in the output file, whereas a value <0.8 resulted in the loss of many important genes.
3 Results and Discussion

3.1 Removing the Gene Duplicity by the Elimination of the Genes Having the Same GO ID
Datasets were submitted to the SOURCE server (http://smd.princeton.edu/cgi-bin/source/sourceBatchSearch) to obtain the gene ontology information for all the genes present in the datasets. The server returned an output file with genes and their corresponding GO IDs and categorized the genes according to the three hierarchies that define functional attributes of gene products: molecular function (MF), biological process (BP) and cellular component (CC). In the dataset "Effect of insulin infusion on human skeletal muscle", out of 22,172 genes, 16,167 (73 %) were reported to be involved in molecular function, 857 (4 %) in biological process and 633 (3 %) in cellular component. For the remaining 4515 (20 %) genes, no information was present in the gene ontology database (Fig. 1). Except for "Human pancreatic islets from normal and type 2 diabetic subjects (B)", all the datasets studied showed an almost identical distribution of genes across the hierarchies (Table 2). Dataset (B) differs because the gene chip array used for it was HG-U133 B [8], whereas HG-U133 A was used for the others, and this difference in array caused a major change in the distribution of genes across the hierarchies (Fig. 2). Out of 22,550 genes in dataset (B), 8664 (38 %) were reported to be involved in molecular function, 834 (4 %) in biological process and 1016 (5 %) in cellular component, whereas for 12,036 (53 %) genes no information was present in the gene ontology database.

Table 1 Dataset samples taken for studies [6]

| Accession | Data | No. of samples (normal) | No. of samples (diabetic) | No. of genes | Country |
|---|---|---|---|---|---|
| GSE7146 | Effect of insulin infusion on human skeletal muscle [7] | | | 22,215 | Sweden |
| DGAP | Human pancreatic islets from normal and type 2 diabetic subjects (A) [8] | | | 22,191 | Caucasian and Asian |
| DGAP | Human pancreatic islets from normal and type 2 diabetic subjects (B) [8] | | | 22,550 | |
| DGAP | Human skeletal muscle - type 2 diabetes [9] | 17 | 18 | 22,177 | Sweden |

Fig. 1 Distribution of genes in different hierarchies for all datasets except "Human pancreatic islets from normal and type 2 diabetic subjects (B)"

Fig. 2 Distribution of genes in different hierarchies for "Human pancreatic islets from normal and type 2 diabetic subjects (B)"
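The percentages quoted for "Effect of insulin infusion on human skeletal muscle" can be checked with a short calculation. The sketch below uses only the four counts given in the text for that dataset; the dictionary keys are illustrative labels.

```python
# Counts quoted in the text for
# "Effect of insulin infusion on human skeletal muscle".
counts = {
    "molecular function": 16167,
    "biological process": 857,
    "cellular component": 633,
    "no GO information": 4515,
}

total = sum(counts.values())

# Share of each category, rounded to the nearest whole percent.
shares = {name: round(100 * n / total) for name, n in counts.items()}

print(total)
for name, pct in shares.items():
    print(f"{name}: {pct} %")
```

The four counts add up to the 22,172 genes stated in the text, and the rounded shares reproduce the quoted 73 %, 4 %, 3 % and 20 %.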
Executing the algorithm produced a drastic decrease in gene number by removing the redundant genes from the datasets. Table 3 summarizes the result of the first step of redundancy reduction.
The result file obtained from the SOURCE server showed high redundancy in the GO IDs. Ablebits (http://www.ablebits.com/), a commercial software plugin (free trial version), was therefore used to generate a status column that marks repeated GO IDs as duplicates. Within each set of duplicate genes, all genes except the one with the top Fisher score [6] were removed using the algorithm given below.
Input: GO file with duplicate status column
Output: Nonredundant GO terms file
Initialize:
  Read file
  Push each line into an array
  Count lines till end of the file
Repeat until i > count
  Split each line and put the fields in another array
  If status column is Duplicate
    Eliminate the line
  End
End
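A minimal Python sketch of this first reduction step is given below. It does not reproduce the spreadsheet status column; instead it assumes a hypothetical record layout of (gene, GO ID, Fisher score) and keeps, for each GO ID, only the gene with the highest score. The gene names and scores are illustrative and are not taken from the datasets.

```python
def drop_duplicate_go_ids(rows):
    """Keep, for each GO ID, only the gene with the highest Fisher score.

    rows: iterable of (gene, go_id, fisher_score) tuples (assumed layout).
    """
    best = {}
    for gene, go_id, score in rows:
        # Replace the stored record only when a better-scored gene appears.
        if go_id not in best or score > best[go_id][2]:
            best[go_id] = (gene, go_id, score)
    return list(best.values())


# Hypothetical records; the scores are illustrative only.
rows = [
    ("geneA", "GO:0005515", 0.91),
    ("geneB", "GO:0005515", 0.40),
    ("geneC", "GO:0003677", 0.75),
]

nonredundant = drop_duplicate_go_ids(rows)
for record in nonredundant:
    print(record)
```

Using a dictionary keyed by GO ID makes the pass linear in the number of records, so no separate duplicate-marking step is needed in this sketch.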
3.2 Semantic Similarity Between Genes in Datasets

From the gene sets obtained above, the genes categorized under molecular function were taken to identify the semantic similarity among them using FunSimMat [10] (http://funsimmat.bioinf.mpi-inf.mpg.de/). Molecular function represents the ability or job performed by a gene product, whereas biological process and cellular component represent, respectively, a recognized series of events or molecular functions and locations at the level of subcellular structures and macromolecular complexes. Genes annotated only with biological process or cellular component terms were therefore not considered for finding semantic similarity. The semantic similarity was determined using the Resnik, Jiang and Conrath, and Lin methods and a combination of the Resnik and Lin methods [4]. All possible combinations of genes in the different datasets were generated, and a semantic value for each combination was computed using the above-mentioned methods. The number of combinations generated in each dataset is shown in Table 4. Of these values, the Resnik-Lin semantic values (simRel) were used for the calculation, because this method takes relevance information into account and so lowers the weight of generic terms when comparing the exact function of different gene products [11].

Table 2 Distribution of genes in different hierarchy for each dataset under study

| Accession | Data | Molecular function | Biological process | Cellular component | No gene information |
|---|---|---|---|---|---|
| GSE7146 | Effect of insulin infusion on human skeletal muscle | 16,167 | 857 | 633 | 4515 |
| DGAP | Human pancreatic islets from normal and type 2 diabetic subjects (A) | 16,176 | 860 | 633 | 4522 |
| DGAP | Human pancreatic islets from normal and type 2 diabetic subjects (B) | 8664 | 834 | 1016 | 12,036 |
| DGAP | Human skeletal muscle - type 2 diabetes | 16,165 | 859 | 631 | 4522 |

Table 3 Number of genes in different hierarchy after first step of redundancy reduction

| Accession | Data | Molecular function | Biological process | Cellular component |
|---|---|---|---|---|
| GSE7146 | Effect of insulin infusion on human skeletal muscle | 1297 | 257 | 38 |
| DGAP | Human pancreatic islets from normal and type 2 diabetic subjects (A) | 1297 | 258 | 38 |
| DGAP | Human pancreatic islets from normal and type 2 diabetic subjects (B) | 878 | 216 | 45 |
| DGAP | Human skeletal muscle - type 2 diabetes | 1296 | 256 | 38 |

Table 4 Number of different combinations of genes in datasets

| Accession | Data | Total number of combinations |
|---|---|---|
| GSE7146 | Effect of insulin infusion on human skeletal muscle | 840,456 |
| DGAP | Human pancreatic islets from normal and type 2 diabetic subjects (A) | 840,456 |
| DGAP | Human pancreatic islets from normal and type 2 diabetic subjects (B) | 385,881 |
| DGAP | Human skeletal muscle - type 2 diabetes | 840,456 |
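The size of the comparison task follows from counting unordered gene pairs, n(n - 1)/2 for n genes. As a hedged check, assuming that each combination is an unordered pair of distinct genes, the 1297 molecular function genes listed in Table 3 give the 840,456 combinations listed in Table 4. The gene labels in the small enumeration below are hypothetical.

```python
from itertools import combinations
from math import comb


def count_pairs(n_genes):
    """Number of unordered pairs among n_genes genes: n(n - 1)/2."""
    return comb(n_genes, 2)


print(count_pairs(1297))

# Enumerating the pairs themselves for a small hypothetical gene list.
genes = ["g1", "g2", "g3", "g4"]
pairs = list(combinations(genes, 2))
print(len(pairs), pairs[:3])
```

Because the number of pairs grows quadratically with the number of genes, removing duplicate GO IDs before this step greatly reduces the number of similarity values that have to be requested and stored.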
3.3 Pearson Correlation Coefficient for Expression Similarity Between Genes in Dataset
The Pearson correlation coefficient was used to find the expression similarity between two genes. The same set of gene combinations as returned by the server for semantic similarity was required, so that the average of the semantic and expression similarity values could be calculated. To obtain this set, the following algorithm was executed; it generates the gene combinations that match those of the semantic similarity file.
Input 1: Semantic similarity file
Input 2: Nonredundant GO terms file
Output: GO terms with same set of combinations
Initialize:
  Read input 1 and input 2
If GO IDs in columns of both the arrays are same
  Print the line
End
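The matching step can be sketched as a set lookup. The sketch assumes that each record is a pair of identifiers and that a pair is unordered, so (a, b) matches (b, a); whether the original files list pairs in a fixed order is not stated. The identifiers below are placeholders.

```python
def align_pairs(semantic_pairs, candidate_pairs):
    """Return the candidate pairs that also occur in the semantic output.

    Pairs are treated as unordered, so (a, b) matches (b, a).
    """
    wanted = {frozenset(pair) for pair in semantic_pairs}
    return [pair for pair in candidate_pairs if frozenset(pair) in wanted]


# Placeholder identifiers standing in for the GO IDs of the two files.
semantic_pairs = [
    ("GO:0000001", "GO:0000002"),
    ("GO:0000002", "GO:0000003"),
]
candidate_pairs = [
    ("GO:0000002", "GO:0000001"),
    ("GO:0000001", "GO:0000003"),
    ("GO:0000002", "GO:0000003"),
]

aligned = align_pairs(semantic_pairs, candidate_pairs)
print(aligned)
```

Only the pairs kept by this step need a Pearson value, which keeps the expression and semantic scores aligned for averaging.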
3.4 The Greedy Algorithm

The expression values and the semantic values (Resnik-Lin values) for all the combinations in each gene set were averaged, and the resulting score was used to remove the genes whose average value was more than 0.8 [3]. Using the threshold of 0.8 thus retained only those genes that were highly dissimilar and mainly contribute to causing the disease. A greedy algorithm approach was used to obtain the unique set of genes.
Input: File with average score
Output: File with unique genes
Initialize:
  Read input
  Assign cutoff 0.8
  Split each gene with its different combinations into separate arrays
If any of the scores in a gene's combinations is greater than cutoff
  Reject the gene
Else
  Accept the gene
End
The resulting output file contained all the unique genes, which are dissimilar to each other with a similarity score of less than 0.8. The number of unique genes in each dataset is summarized in Table 5.
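One possible reading of the greedy step is sketched below. The pseudocode does not state the order in which genes are visited or how a similar pair is resolved, so this sketch makes two assumptions: genes are visited in a given priority order (for example, by Fisher score), and a gene is rejected only if it is too similar to a gene that has already been accepted. The gene labels and scores are hypothetical.

```python
def greedy_select(genes, similarity, cutoff=0.8):
    """Greedily keep genes whose similarity to every kept gene is <= cutoff.

    genes: list in priority order (assumption, e.g. ranked by Fisher score).
    similarity: dict mapping frozenset({g1, g2}) to the averaged score.
    Pairs missing from the dict are treated as similarity 0.0 (assumption).
    """
    kept = []
    for gene in genes:
        scores = (similarity.get(frozenset((gene, k)), 0.0) for k in kept)
        if all(score <= cutoff for score in scores):
            kept.append(gene)
    return kept


# Hypothetical averaged scores for four genes.
similarity = {
    frozenset(("g1", "g2")): 0.92,
    frozenset(("g1", "g3")): 0.35,
    frozenset(("g2", "g3")): 0.81,
    frozenset(("g3", "g4")): 0.10,
}

unique_genes = greedy_select(["g1", "g2", "g3", "g4"], similarity)
print(unique_genes)
```

Under these assumptions g2 is dropped because its score with the already accepted g1 exceeds the cutoff, while g3 is kept even though it is similar to g2, because g2 is no longer in the accepted set.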

Table 5 Number of unique genes in each dataset

| Accession | Data | Number of unique genes |
|---|---|---|
| GSE7146 | Effect of insulin infusion on human skeletal muscle | 1223 |
| DGAP | Human pancreatic islets from normal and type 2 diabetic subjects (A) | 1210 |
| DGAP | Human pancreatic islets from normal and type 2 diabetic subjects (B) | 803 |
| DGAP | Human skeletal muscle - type 2 diabetes | 1238 |

Table 6 Unique genes based on the Fisher score [6] for "Effect of insulin infusion on human skeletal muscle"

| Top 10 genes based on Fisher score | Unique or repeated |
|---|---|
| G0S2: G0/G1switch 2 | Unique |
| SLC22A6: solute carrier family 22 (organic anion transporter), member 6 | Unique |
| CDC6: CDC6 cell division cycle 6 homolog (S. cerevisiae) | Repeated |
| SCNN1G: sodium channel, nonvoltage-gated 1, gamma | Unique |
| LOC441601 /// LOC652471: septin 7 pseudogene /// similar to septin 7 | Repeated |
| DNAJC1: DnaJ (Hsp40) homolog, subfamily C, member 1 | Unique |
| KIAA0692: KIAA0692 | Repeated |
| TLE1: transducin-like enhancer of split 1 (E (sp1) homolog, Drosophila) | Unique |
| UBXD8: UBX domain containing 8 | Repeated |
| ANKRD15: ankyrin repeat domain 15 | Repeated |

Table 7 Unique genes based on the Fisher score [6] for "Human pancreatic islets from normal and type 2 diabetic subjects (A)"

| Top 10 genes based on Fisher score | Unique or repeated |
|---|---|
| TMEM111: transmembrane protein 111 | Repeated |
| CYP7A1: cytochrome P450, family 7, subfamily A, polypeptide 1 | Unique |
| Hs.247983.0 | Repeated |
| NSUN5B: NOL1/NOP2/Sun domain family, member 5B | Repeated |
| HPRT1: hypoxanthine phosphoribosyltransferase 1 (Lesch-Nyhan syndrome) | Unique |
| C11orf61: chromosome 11 open reading frame 61 | Repeated |
| MRNA; cDNA DKFZp686F1844 (from clone DKFZp686F1844) | Unique |
| CYP3A4: cytochrome P450, family 3, subfamily A, polypeptide 4 | Unique |
| CTBP1: C-terminal binding protein 1 | Unique |
| FMO5: flavin containing monooxygenase 5 | Unique |
The unique genes found through this approach were compared with the top 10 genes obtained through Fisher discriminant analysis [6] to check whether the genes that ranked highly by Fisher score are unique or repeated. In the dataset "Effect of insulin infusion on human skeletal muscle", only 5 of the top 10 genes were found to be unique. Similarly, 6 of the top 10 genes were unique for "Human pancreatic islets from normal and type 2 diabetic subjects (A)", 4 for "Human pancreatic islets from normal and type 2 diabetic subjects (B)" and 5 for "Human skeletal muscle - type 2 diabetes" (Tables 6, 7, 8 and 9). Most of the unique genes identified in the datasets and reported to have a high Fisher score are involved in pathways that are reported to be involved directly or indirectly in causing type 2 diabetes [6].
Table 8 Unique genes based on the Fisher score [6] for "Human pancreatic islets from normal and type 2 diabetic subjects (B)"

| Top 10 genes based on Fisher score | Unique or repeated |
|---|---|
| THRAP6: thyroid hormone receptor-associated protein 6 | Unique |
| LPHN3: latrophilin 3 | Repeated |
| VPS36: vacuolar protein sorting 36 (yeast) | Unique |
| CAPS: calcyphosine | Unique |
| Transcribed locus | Unique |
| Transcribed locus, moderately similar to XP_517655.1 PREDICTED: similar to KIAA0825 protein [Pan troglodytes] | Repeated |
| MAP2K5: mitogen-activated protein kinase kinase 5 | Repeated |
| ZNF559: zinc finger protein 559 | Repeated |
| | Repeated |
| | Repeated |

Table 9 Unique genes based on the Fisher score [6] for "Human skeletal muscle - type 2 diabetes"

| Top 10 genes based on Fisher score | Unique or repeated |
|---|---|
| ZAK: sterile alpha motif and leucine zipper containing kinase AZK | Repeated |
| ANKHD1 /// MASK-BP3: ankyrin repeat and KH domain containing 1 /// MASK-4EBP3 alternate reading frame gene | Repeated |
| ProSAPiP1: ProSAPiP1 protein | Unique |
| PCDHB3: protocadherin beta 3 | Unique |
| | Repeated |
| CADPS2: Ca2+-dependent activator protein for secretion 2 | Unique |
| COX7A1: cytochrome c oxidase subunit VIIa polypeptide 1 (muscle) | Repeated |
| | Repeated |
| TMEM106C: transmembrane protein 106C | Unique |
| PLK1 /// RPL37A: polo-like kinase 1 (Drosophila) /// ribosomal protein L37a | Unique |

4 Conclusion
The major problem with microarray data is the high redundancy among the genes, which prevents useful and valuable information from being extracted from the data. Genes in microarray data may have the same GO ID or share common ancestors, which makes them similar in molecular function. The redundant genes were removed first on the basis of identical GO IDs and then on the basis of the average of the Pearson and semantic similarity scores. The reduction in the number of genes at each step is summarized in Table 10.

Table 10 Reduction in redundancy of genes at each stage

| Data | No. of genes obtained from GEO database | No. of genes in nonredundant GO set | No. of genes obtained after final step |
|---|---|---|---|
| Effect of insulin infusion on human skeletal muscle | 22,215 | 6107 | 1223 |
| Human pancreatic islets from normal and type 2 diabetic subjects (A) | 22,191 | 6115 | 1210 |
| Human pancreatic islets from normal and type 2 diabetic subjects (B) | 22,550 | 13,175 | 803 |
| Human skeletal muscle - type 2 diabetes | 22,177 | 6113 | 1238 |

The genes obtained after the final redundancy reduction step showed that almost 50 % of the genes that scored highly in Fisher discriminant analysis [6] are repeated; these were eliminated, leaving only the unique genes. The most discriminatory genes, which may be major targets for the disease, can then be identified by subjecting the unique genes to a classifier such as a support vector machine, or to any other machine learning approach that can single out the discriminatory genes and eliminate the others.

References
1. Schena M, Shalon D, Davis RW, Brown PO (1995) Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270:467–470
2. Zhang A (2006) Advanced Analysis of Gene Expression Microarray Data. World Scientific Publishing Co., Danvers
3. Mohammadi A, Saraee MH, Salehi M (2011) Identification of disease-causing genes using microarray data mining and Gene Ontology. BMC Med Genom 4:12–19
4. Couto FM, Silva MJ, Coutinho PM (2007) Measuring semantic similarity between Gene Ontology terms. Data Knowl Eng 61:137–152
5. Edgar R, Domrachev M, Lash AE (2002) Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res 30(1):207–210
6. Kumar A, Sharmila DJS, Kant R (2014) Selection of discriminatory gene set for Type II diabetes using fisher linear discriminant. Int J Adv Comput Mathe Sci 5(2):36–42
7. Parikh H, Carlsson E, Chutkow WA, Johansson LE, Storgaard H, Poulsen P, Saxena R, Ladd C, Schulze PC, Mazzini MJ, Jensen CB, Krook A, Bjornholm M, Tornqvist H, Zierath JR, Ridderstrale M, Altshuler D, Lee RT, Vaag A, Groop LC, Mootha VK (2007) TXNIP regulates peripheral glucose metabolism in humans. PLoS Med 4(5):868–879
8. Gunton JE, Kulkarni RN, Yim S, Okada T, Hawthorne WJ, Tseng YH, Roberson RS, Ricordi C, O'Connell PJ, Gonzalez FJ, Kahn CR (2005) Loss of ARNT/HIF1beta mediates altered gene expression and pancreatic-islet dysfunction in human type 2 diabetes. Cell 122(3):337–349
9. Mootha VK, Lindgren CM, Eriksson KF, Subramanian A, Sihag S, Lehar J, Puigserver P, Carlsson E, Ridderstrale M, Laurila E, Houstis N, Daly MJ, Patterson N, Mesirov JP, Golub TR, Tamayo P, Spiegelman B, Lander ES, Hirschhorn JN, Altshuler D, Groop LC (2003) PGC-1α responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nat Genet 34(3):267–273
10. Schlicker A, Albrecht M (2008) FunSimMat: a comprehensive functional similarity database. Nucleic Acids Res 36:434–439
11. Schlicker A, Domingues FS, Rahnenfuhrer J, Lengauer T (2006) A new measure for functional similarity of gene products based on Gene Ontology. BMC Bioinform 7:302–317