KEYWORDS
Degree distribution, Pearson system, real data, systems biology.
INTRODUCTION
Proceedings of The Third International Conference on Data Mining, Internet Computing, and Big Data, Konya, Turkey 2016
the links [12]. If they are separated with respect to their components, the major component of the system gives its name to the system, such as metabolic networks, which are composed of metabolites, or protein-protein interaction networks, which consist of a large number of proteins [9, 16]. On the other hand, if they are classified regarding their links, they are named directed and undirected networks. Finally, if they are grouped based on the distribution of the links, they are called homogeneous and non-homogeneous networks.
The homogeneous networks, also known as the Erdos-Renyi networks, are the networks in which each node has approximately the same number of connections. Thus, the distribution of these links or connections is described by the Poisson distribution [3, 21, 15, 12].
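The Poisson behaviour of the degree in a homogeneous network can be illustrated with a small simulation. The sketch below is not taken from the paper: the network size `n`, the link probability `p`, the seed and the helper names `er_degrees` and `poisson_pmf` are all hypothetical choices made for illustration. It samples an undirected Erdos-Renyi graph and prints the empirical degree frequencies next to the Poisson probabilities with mean p(n - 1).

```python
import math
import random
from collections import Counter


def er_degrees(n, p, rng):
    """Sample an undirected Erdos-Renyi graph G(n, p) and return the node degrees."""
    degree = [0] * n
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:  # each pair of nodes is linked independently
                degree[i] += 1
                degree[j] += 1
    return degree


def poisson_pmf(k, lam):
    """Poisson probability of observing exactly k links."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


rng = random.Random(0)   # fixed seed, illustrative only
n, p = 400, 0.01         # hypothetical network size and link probability
degrees = er_degrees(n, p, rng)
lam = p * (n - 1)        # expected degree of a node
counts = Counter(degrees)

for k in range(11):
    empirical = counts.get(k, 0) / n
    print(f"k={k:2d}  empirical={empirical:.3f}  poisson={poisson_pmf(k, lam):.3f}")
```

For small p and large n the binomial degree distribution of such a graph is close to the Poisson one, which is why the two columns should roughly agree.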
[Equations (1)-(3): illegible in the source; the only recoverable fragment is the left-hand side E(T^n) = ... of the moment expression.]
in which $\Gamma$ is the gamma function and T denotes the shape parameter. Hereby, the density
f (x) k a for 0 a b
where b is the truncation point.
2.2
The Pearson system [17] is a parametric family of distributions that uses four-parameter probability density functions to describe the density of random variables. The underlying four parameters, which are denoted by $a$, $b_0$, $b_1$ and $b_2$ in the Pearson family, are directly related to the central moments ($\mu_1$, $\mu_2$, $\mu_3$, $\mu_4$) as below [20]. The explicit forms of these parameters are presented in the following expressions.
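$$b_0 = \frac{\mu_2 (4\beta_2 - 3\beta_1)}{A}, \tag{4}$$
$$b_1 = a = \frac{\mu_2^{1/2} \sqrt{\beta_1}\,(\beta_2 + 3)}{A}, \tag{5}$$
$$b_2 = \frac{2\beta_2 - 3\beta_1 - 6}{A}, \tag{6}$$
$$\beta_1 = \frac{\mu_3^2}{\mu_2^3}, \tag{7}$$
$$\beta_2 = \frac{\mu_4}{\mu_2^2}, \tag{8}$$
where $A = 10\beta_2 - 12\beta_1 - 18$.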
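The relation between the central moments and the Pearson parameters can be sketched in a few lines. The block below assumes the standard textbook form of the Pearson system, with beta1 = mu3^2 / mu2^3, beta2 = mu4 / mu2^2 and A = 10 beta2 - 12 beta1 - 18. The type selection through the classical criterion kappa is also an assumption of this sketch, not a statement of the procedure the authors followed. The function names are hypothetical.

```python
import math


def pearson_coefficients(mu2, mu3, mu4):
    """Return a, b0, b1, b2 from the 2nd-4th central moments (standard form)."""
    beta1 = mu3 ** 2 / mu2 ** 3
    beta2 = mu4 / mu2 ** 2
    A = 10 * beta2 - 12 * beta1 - 18   # raises ZeroDivisionError below if A == 0
    b0 = mu2 * (4 * beta2 - 3 * beta1) / A
    # sqrt(beta1) takes the sign of the third central moment
    b1 = math.copysign(math.sqrt(mu2 * beta1), mu3) * (beta2 + 3) / A
    b2 = (2 * beta2 - 3 * beta1 - 6) / A
    return {"a": b1, "b0": b0, "b1": b1, "b2": b2}


def pearson_type(beta1, beta2):
    """Classify the main Pearson types with the criterion kappa (assumed here)."""
    c1 = 4 * beta2 - 3 * beta1
    c2 = 2 * beta2 - 3 * beta1 - 6
    if beta1 == 0 or c1 * c2 == 0:
        return "boundary case"          # symmetric or transition types
    kappa = beta1 * (beta2 + 3) ** 2 / (4 * c1 * c2)
    if kappa < 0:
        return "I"
    if kappa < 1:
        return "IV"
    if kappa > 1:
        return "VI"
    return "boundary case"              # kappa == 1 corresponds to Type V


# Illustrative inputs only
print(pearson_coefficients(1.0, 0.0, 3.0))   # moments of a standard normal
print(pearson_type(2.50, 4.97))
print(pearson_type(1.03, 4.78))
```

For the standard normal moments the coefficients reduce to b0 = 1 and b1 = b2 = 0, which recovers the normal density from the Pearson differential equation.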
HIV disease interactions with different numbers of genes: dataset 7 is composed of 1469 genes, and datasets 8, 9 and 10 have 1152, 722 and 306 genes, respectively.
In the analyses for the Pearson family of distributions, we initially generate a frequency table for the out-degree of each gene. Then, we compute the first four moments and fit the distribution so that the a and b values indicate the shape parameters for the left and right sides, respectively. The remaining two parameters are the location and the scale. Here, the former is the minimum of the range and is denoted by l, and the latter is the difference between the maximum and the minimum of the range and is denoted by s in the tabulated results for each dataset.
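The first steps of this procedure can be sketched as follows. This is a minimal illustration under stated assumptions: the out-degrees are given as a plain list of integers, the moments are population (biased) moments, and the function name `degree_summary` and the sample values are hypothetical.

```python
from collections import Counter


def degree_summary(out_degrees):
    """Frequency table, first four moments, location l and scale s of out-degrees."""
    n = len(out_degrees)
    freq = Counter(out_degrees)                      # frequency table
    mean = sum(k * c for k, c in freq.items()) / n

    def central(r):
        # r-th central moment computed from the frequency table
        return sum(c * (k - mean) ** r for k, c in freq.items()) / n

    mu2, mu3, mu4 = central(2), central(3), central(4)
    return {
        "frequency": dict(sorted(freq.items())),
        "mean": mean,
        "variance": mu2,
        "skewness": mu3 / mu2 ** 1.5,
        "kurtosis": mu4 / mu2 ** 2,
        "l": min(out_degrees),                       # location: minimum
        "s": max(out_degrees) - min(out_degrees),    # scale: range
    }


summary = degree_summary([1, 1, 2, 4])   # toy data, not from the paper
print(summary)
```

The sketch requires at least two distinct out-degree values, since a zero variance makes the skewness and kurtosis undefined.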
Finally, we perform goodness-of-fit tests to check whether any suggested density for the out-degree fits these datasets. Here, we construct the following hypotheses.
H0 : The data come from the specified (stretched exponential, Pareto and geometric) distribution.
H1 : The null hypothesis is not true.
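As an illustration of such a test, the sketch below computes a chi-square goodness-of-fit statistic for the geometric case. It is a simplified example and not the authors' implementation. It assumes a geometric distribution on 1, 2, ..., a maximum-likelihood estimate of its parameter, and pooling of all degrees at or above a hypothetical cut-off `max_bin` into one tail bin. The p-value uses a closed-form recurrence for the chi-square tail with integer degrees of freedom, so only the standard library is needed.

```python
import math
from collections import Counter


def chi2_sf(x, df):
    """Upper tail probability of a chi-square variable with integer df >= 1."""
    z = x / 2.0
    if df % 2 == 0:
        q, a = math.exp(-z), 1.0
    else:
        q, a = math.erfc(math.sqrt(z)), 0.5
    # Recurrence Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1)
    while a < df / 2.0 - 1e-9:
        q += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return q


def geometric_gof(degrees, max_bin):
    """Chi-square test of H0: degrees follow a geometric law on 1, 2, ...

    Assumes every degree is >= 1, the mean degree exceeds 1 and max_bin >= 3.
    """
    n = len(degrees)
    p_hat = n / sum(degrees)                       # MLE of the success probability
    counts = Counter(min(k, max_bin) for k in degrees)
    stat = 0.0
    for k in range(1, max_bin + 1):
        if k < max_bin:
            prob = p_hat * (1 - p_hat) ** (k - 1)
        else:
            prob = (1 - p_hat) ** (max_bin - 1)    # pooled tail P(X >= max_bin)
        expected = n * prob
        stat += (counts.get(k, 0) - expected) ** 2 / expected
    df = max_bin - 2                               # bins - 1 - one estimated parameter
    return stat, df, chi2_sf(stat, df)


# Toy samples, not from the paper
print(geometric_gof([1] * 8 + [2] * 4 + [3] * 2 + [4, 5], max_bin=4))
print(geometric_gof([5] * 200, max_bin=8))
```

A small p-value leads to the rejection of H0 for the tested density. The same scheme applies to the other candidate densities once their cell probabilities are substituted.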
In Tables 1-2, we list the estimated parameters, i.e., the means, variances, skewness and kurtosis values, for the selected real datasets and then specify a suitable density in the Pearson family. The results show that the majority (8 out of 10) of the datasets fall into the Pearson Type I distribution, which represents the beta family. On the other hand, two datasets follow the Pearson Type VI distribution. The beta family is a family of continuous probability distributions parameterized by two positive shape parameters (a and b), a location (l) and a scale (s). In the beta family, the location parameter controls the position and the scale parameter indicates the spread of the distribution on the x-axis. The network of the inflammatory bowel disease and the network of the glycogen storage disease fall in the Pearson Type VI family of distributions. This type of distribution occupies the region between the gamma and the Pearson Type V families. Special cases of this family are the beta distribution of the second kind and the Fisher F distribution [14]. From our results based on the four-moment estimates, we find that these two networks follow the beta distribution of the second kind.
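The role of the four parameters can be made concrete with a short sketch of the two densities. The parametrization below (shape a, shape b, location l, scale s, with the variable standardized as z = (x - l) / s) is the common textbook one and is an assumption here; the paper's own parametrization may differ. The function names are hypothetical.

```python
import math


def _log_beta_norm(a, b):
    """log of 1 / B(a, b), the normalizing constant shared by both densities."""
    return math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)


def beta_pdf(x, a, b, l=0.0, s=1.0):
    """Beta density (Pearson Type I) on the interval (l, l + s)."""
    z = (x - l) / s
    if not 0.0 < z < 1.0:
        return 0.0
    log_pdf = _log_beta_norm(a, b) + (a - 1) * math.log(z) + (b - 1) * math.log(1 - z)
    return math.exp(log_pdf) / s


def beta_prime_pdf(x, a, b, l=0.0, s=1.0):
    """Beta density of the second kind (a Pearson Type VI case) on (l, infinity)."""
    z = (x - l) / s
    if z <= 0.0:
        return 0.0
    log_pdf = _log_beta_norm(a, b) + (a - 1) * math.log(z) - (a + b) * math.log(1 + z)
    return math.exp(log_pdf) / s


def beta_mean(a, b, l=0.0, s=1.0):
    return l + s * a / (a + b)


def beta_prime_mean(a, b, l=0.0, s=1.0):
    return l + s * a / (b - 1)   # finite only for b > 1


print(beta_pdf(5.0, 2.0, 2.0, l=0.0, s=10.0))
print(beta_prime_pdf(1.0, 1.0, 1.0))
print(beta_mean(2.0, 3.0, l=1.0, s=10.0))
```

In both cases the location shifts the support and the scale stretches it, while a and b control the shape near the left and right ends of the support.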
In the literature, it has been suggested that the degree of the departing connectivity (out-degree) in biological networks follows the generalized Pareto, the geometric or the Weibull distribution (an alternative to the stretched exponential). We test the original 10 datasets under each of these distributions via the chi-square goodness-of-fit and the Kolmogorov-Smirnov tests. From the findings, we detect that most of the original datasets follow at least one of the tested alternative distributions, except for the inflammatory bowel disease, Lafora and HIV (1469 genes) networks.
Table 1. Estimated parameters and Pearson Type for all networks with their dimensions d, i.e., total number of genes.

Network and dimension          Mean     Variance  Skewness  Kurtosis  Parameters                              Type
Paget disease, d = 724         16.58    300.57    2.50      4.97      a: 1.07, b: 39.07, l: -0.83, s: 630.03  I
Mendes disease, d = 223        11.62    134.18    5.60      7.58      a: 0.60, b: 4.21, l: 1.03, s: 84.26     I
Inflammatory bowel, d = 1008   2.05     11.30     1.03      4.78      a: 5.72, b: 44.24, l: 0.51, s: 19.37    VI
Glycogen storage, d = 987      21.02    464.33    12.26     23.79     a: 1.14, b: 12.80, l: 0.02, s: 217.87   VI
Muscle, d = 643                104.82   996.51    1.57      4.84      a: 0.81, b: 7.08, l: 4.23, s: 981.16    I
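Table 2. Estimated parameters and Pearson Type for the remaining networks (the network names and dimensions of the rows are missing from the source).

Mean     Variance  Skewness  Kurtosis  Parameters                              Type
48.47    1867.86   1.34      4.31      a: 0.57, b: 2.71, l: 7.27, s: 235.41    I
83.28    590.19    0.90      2.91      a: 0.66, b: 1.87, l: -2.67, s: 328.56   I
11.51    140.80    3.84      7.37      a: 0.78, b: 11.86, l: 0.28, s: 182.26   I
7.56     67.83     7.60      13.04     a: 0.88, b: 13.68, l: 0.70, s: 136.46   I
3.53     709.38    2.95      5.90      a: 0.00, b: 1.09, l: 1.96, s: 947.72    I

CONCLUSION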
performed and the estimation has been conducted by Bayesian methods [19]. Hence, this study has also shown that the Pearson Type I family can be a plausible prior distribution for the degree distribution in these calculations, rather than the more commonly assumed choices such as the power-law and Weibull densities.
As future work, we plan to check this density in simulated data and long-run Monte Carlo analyses. Furthermore, we plan to extend the study to metabolic networks in order to detect whether we can identify a common type of degree distribution for both protein-protein interaction networks and metabolic networks.
REFERENCES
[1] M.A. Al-Fawzan, Methods for estimating the parameters of the Weibull distribution, King Abdulaziz City for Science and Technology, Riyadh, Saudi Arabia, 2000.
[2] A. Andreev, A. Kanto, and P. Malo, Simple approach for distribution selection in the Pearson system, Helsinki School of Economics, Working Paper W-388, Helsinki, Finland, 2005.
[3] A.L. Barabasi and Z.N. Oltvai, Network biology:
[4] S. Bergmann, J. Ihmels, and N. Barkai, Similarities and differences in genome-wide expression data of six organisms, PLoS Biology, vol. 2, 1, doi: 10.1371/journal.pbio.0020009, 2004.
[5] S.M. Burroughs and S.F. Tebbens, Upper-truncated power law distributions, Fractals, vol. 9, pp. 209-222, 2001.
[6] S.M. Burroughs and S.F. Tebbens, Upper-truncated power laws in natural systems, Journal of Pure and Applied Geophysics, vol. 158, pp. 741-757, 2001.
[7] Y. Dodge, The Oxford dictionary of statistical terms, Oxford Press, 2003.
[8] M. Evans, N. Hastings, and B. Peacock, Statistical distributions, John Wiley and Sons, 2000.
[9] N. Guelzim, S. Bottani, P. Bourgine, and F. Kepes, Topological and causal structure of the yeast transcriptional regulatory network, Nature Genetics, vol. 31, pp. 60-63, 2002.
[10] N.L. Johnson, S. Kotz, and N. Balakrishnan, Continuous univariate distributions, Wiley-Interscience, New York, 1994.
[11] N.L. Johnson, A.W. Kemp, and S. Kotz, Univari-
[12] B.H. Junker and N. Schreiber, Analysis of biological networks, John Wiley and Sons, 2008.
[13] R. Khanin and E. Wit, How scale-free are biological networks, Journal of Computational Biology, vol. 13, pp. 810-818, 2006.
[14] B. Lahcene, On Pearson families of distributions and its applications, African Journal of Mathematics and Computer Science Research, vol. 6, 5, pp. 108-117, 2013.
[15] S. Mossa, M. Barthelme, H.E. Stanley, and L.A. Amaral, Truncation of power-law behavior in scale-free, Physical Review Letters, doi: 10.1103/PhysRevLett.88.138701, 2002.
[16] V. Van Noort, B. Snel, and M.A. Huynen, The yeast coexpression network has a small-world, scale-free architecture and can be explained by a simple model, EMBO, vol. 5, 3, pp. 280-284, 2004.
[17] K. Pearson, Contributions to the mathematical theory of evolution, II: skew variation in homogeneous material, Philosophical Transactions of the Royal Society, vol. 186, pp. 343-414, 1895.
[18] M.N.B. Santos, E.N. Bodunov, and B. Valeur, Mathematical functions for the analysis of luminescence decays with underlying distributions 1: