Interdisciplinary Sciences:
Computational Life Sciences
ISSN 1913-2751
Interdiscip Sci Comput Life Sci
DOI 10.1007/s12539-015-0113-z
1 Introduction
Rapid advances in gene expression microarray technology have enabled the simultaneous measurement of
expression levels for tens of thousands of genes in a single experiment [1]. Analyzing gene expression
data can reveal which genes are differentially expressed in diseased tissue [2]. The main hindrance in
analyzing microarray data is the small number of samples compared with the number of genes. Most of
these genes play no role in causing the disease of interest; identifying the disease-causing genes can
therefore reveal not just the cause of the disease but also its pathogenic mechanism [3].
Available gene selection methods can remove irrelevant genes, but most of them are inadequate at
removing redundant genes from microarray data, and this redundancy increases the computational cost
and decreases the classification accuracy. Because microarray data are noisy and contain few samples,
the actual gene functionality and other valuable information cannot easily be recovered from the data.
Gene expression data combined with gene ontology information can help to determine the redundancy,
which can then be removed using the algorithm described in this work. Pearson correlation and a
semantic similarity measure have been combined to find the similarity between two genes [3]. Because
of the low sample number, Pearson correlation alone cannot be relied on to find the similarity, and
because the information in gene ontology is incomplete, the semantic similarity measure is
insufficient to determine the ...
... similarity values (simRel) and Pearson values were averaged, and based on the average value, a
greedy algorithm was followed to obtain the genes which have a value less than the threshold value of
0.8. The threshold of 0.8 was chosen after several experimental trials, which showed that a threshold
value >0.8 left a number of similar genes in the output file, whereas a value <0.8 resulted in the
loss of many important genes.
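As an illustration of the averaging and thresholding described above, the following minimal Python sketch may help. It is not the authors' implementation. The order in which genes are visited, the use of the raw (signed) Pearson value, the lookup table `sem_sim` standing in for precomputed simRel scores, and the toy gene names and values are all assumptions made for the example.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def combined_score(g1, g2, expr, sem_sim):
    """Average of Pearson correlation and a precomputed semantic similarity.

    sem_sim is a hypothetical lookup keyed by unordered gene pairs; pairs
    without an entry are treated as having similarity 0 (an assumption).
    """
    key = frozenset((g1, g2))
    return (pearson(expr[g1], expr[g2]) + sem_sim.get(key, 0.0)) / 2

def greedy_select(genes, expr, sem_sim, threshold=0.8):
    """Keep a gene only if its score with every kept gene is below threshold."""
    kept = []
    for g in genes:
        if all(combined_score(g, k, expr, sem_sim) < threshold for k in kept):
            kept.append(g)
    return kept

# Toy data (illustrative values only)
expr = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],  # perfectly correlated with geneA
    "geneC": [4.0, 1.0, 3.0, 2.0],
}
sem_sim = {
    frozenset(("geneA", "geneB")): 0.9,
    frozenset(("geneA", "geneC")): 0.2,
    frozenset(("geneB", "geneC")): 0.2,
}
selected = greedy_select(["geneA", "geneB", "geneC"], expr, sem_sim)
print(selected)
```

In this toy run geneB is dropped because its averaged score with geneA exceeds 0.8, while geneC is retained. A greedy pass of this kind depends on the visiting order, so a different ordering of the genes can yield a different retained set.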
Accession | Data | No. of samples (Normal) | No. of samples (Diabetic) | No. of genes | Country
GSE7146 | | | | 22,215 | Sweden
DGAP | Human pancreatic islets from normal and type 2 diabetic subjects (A) [8] | | | 22,191 | Caucasian and Asian
DGAP | Human pancreatic islets from normal and type 2 diabetic subjects (B) [8] | | | 22,550 |
DGAP | | 17 | 18 | 22,177 | Sweden
Table 2 Distribution of genes across the different hierarchies for each dataset under study
Accession | Data | Molecular function | Biological process | Cellular component | No gene information
GSE7146 | | 16,167 | 857 | 633 | 4515
DGAP | | 16,176 | 860 | 633 | 4522
DGAP | | 8664 | 834 | 1016 | 12,036
DGAP | | 16,165 | 859 | 631 | 4522
Data | Molecular function | Biological process | Cellular component
GSE7146 | 1297 | 257 | 38
DGAP | 1297 | 258 | 38
DGAP | 878 | 216 | 45
DGAP | 1296 | 256 | 38
Data | Total number of combinations
GSE7146 | 840,456
DGAP | 840,456
DGAP | 385,881
DGAP | 840,456
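One reading of the totals above, offered as an assumption rather than as the authors' stated method, is that each combination is an unordered pair of items, so that n items give n(n-1)/2 pairs. Under that reading, 840,456 is the pair count for n = 1297 and 385,881 is the pair count for n = 879. The short Python check below uses a hypothetical helper name, `pair_count`.

```python
from math import comb

def pair_count(n_items):
    """Number of unordered pairs among n_items items: n(n-1)/2."""
    return comb(n_items, 2)

print(pair_count(1297))  # 840456
print(pair_count(879))   # 385881
```

The quadratic growth of this count is what makes an exhaustive pairwise similarity computation costly as the gene set grows.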
Accession | Data | Number of unique genes
GSE7146 | | 1223
DGAP | Human pancreatic islets from normal and type 2 diabetic subjects (A) | 1210
DGAP | Human pancreatic islets from normal and type 2 diabetic subjects (B) | 803
DGAP | | 1238
[Table: genes marked as unique or repeated. Recoverable entries: G0S2: G0/G1 switch 2; KIAA0692: KIAA0692; Hs.247983.0]
4 Conclusion
The major problem with microarray data is the high redundancy among the genes, which makes it
difficult to extract useful and valuable information from the data. The genes in the microarray data
either have the same GO ID or may have common ancestors, which makes them similar in molecular
function. The redundant genes have been removed first on the basis of identical GO IDs and then on
the basis of the average of the Pearson and semantic similarity scores.
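The first step mentioned in the conclusion, removal of genes that share a GO ID, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: each gene is given a single GO ID, the first gene seen for a GO ID is the one kept, genes without GO information are set aside, and the mapping `gene_to_go` with its gene names and identifiers is a placeholder.

```python
def drop_same_go_id(genes, gene_to_go):
    """Keep the first gene encountered for each GO ID (assumed tie-break rule)."""
    seen_go_ids = set()
    kept = []
    for gene in genes:
        go_id = gene_to_go.get(gene)
        if go_id is None:
            continue  # no GO information: set aside in this sketch (assumption)
        if go_id not in seen_go_ids:
            seen_go_ids.add(go_id)
            kept.append(gene)
    return kept

# Placeholder annotations for illustration only
gene_to_go = {
    "geneA": "GO:0000001",
    "geneB": "GO:0000001",  # same GO ID as geneA, so treated as redundant
    "geneC": "GO:0000002",
    "geneD": None,          # no GO information
}
nonredundant = drop_same_go_id(["geneA", "geneB", "geneC", "geneD"], gene_to_go)
print(nonredundant)
```

The genes that survive this step would then be passed to the similarity-based filtering that uses the averaged Pearson and semantic similarity scores.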
[Table: genes marked as unique or repeated. Recoverable entries: LPHN3: latrophilin 3; CAPS: calcyphosine; Transcribed locus; ZAK: sterile alpha motif and leucine zipper containing kinase AZK; ANKHD1 /// MASK-BP3: ankyrin repeat and KH domain containing 1 /// MASK-4EBP3 alternate reading frame gene; PLK1 /// RPL37A: polo-like kinase 1 (Drosophila) /// ribosomal protein L37a]
Table 10 Reduction in redundancy of genes at each stage
Data | No. of genes | No. of genes in nonredundant GO set | No. of genes obtained after final step
GSE7146 | 22,215 | 6107 | 1223
DGAP | 22,191 | 6115 | 1210
DGAP | 22,550 | 13,175 | 803
DGAP | 22,177 | 6113 | 1238
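To make the size of the reduction in Table 10 easier to compare across datasets, the counts can be turned into percentages. The Python sketch below simply redoes that arithmetic on the reported figures. The helper name `percent_removed` is hypothetical, and the rows are listed in the order in which they appear in the table.

```python
# (initial no. of genes, no. in nonredundant GO set, no. after final step)
rows = [
    (22215, 6107, 1223),
    (22191, 6115, 1210),
    (22550, 13175, 803),
    (22177, 6113, 1238),
]

def percent_removed(before, after):
    """Percentage of genes removed when going from `before` to `after`."""
    return 100.0 * (before - after) / before

for initial, after_go, final in rows:
    print(
        f"{initial} genes: "
        f"{percent_removed(initial, after_go):.1f}% removed by the GO ID step, "
        f"{percent_removed(initial, final):.1f}% removed overall"
    )
```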
References
1. Schena M, Shalon D, Davis RW, Brown PO (1995) Quantitative
monitoring of gene expression patterns with a complementary
DNA microarray. Science 270:467-470
2. Zhang A (2006) Advanced Analysis of Gene Expression
Microarray Data. World Scientific Publishing Co., Danvers
3. Mohammadi A, Saraee MH, Salehi M (2011) Identification of
disease-causing genes using microarray data mining and Gene
Ontology. BMC Med Genom 4:12-19
4. Couto FM, Silva MJ, Coutinho PM (2007) Measuring semantic
similarity between Gene Ontology terms. Data Knowl Eng
61:137-152
5. Edgar R, Domrachev M, Lash AE (2002) Gene Expression
Omnibus: NCBI gene expression and hybridization array data
repository. Nucleic Acids Res 30(1):207-210
6. Kumar A, Sharmila DJS, Kant R (2014) Selection of discriminatory gene set for Type II diabetes using Fisher linear discriminant. Int J Adv Comput Mathe Sci 5(2):36-42
7. Parikh H, Carlsson E, Chutkow WA, Johansson LE, Storgaard H,
Poulsen P, Saxena R, Ladd C, Schulze PC, Mazzini MJ, Jensen
CB, Krook A, Bjornholm M, Tornqvist H, Zierath JR,
8.
9.
10.
11.