
Applied Soft Computing 11 (2011) 827–836


An incremental adaptive neural network model for online noisy data regression
and its application to compartment fire studies
Eric Wai Ming Lee
Department of Building and Construction, City University of Hong Kong, Tat Chee Avenue, Kowloon Tong, Hong Kong (SAR), PR China. E-mail address: ericlee@cityu.edu.hk

Article history: Received 2 June 2009; Received in revised form 11 January 2010; Accepted 17 January 2010; Available online 25 January 2010.

Keywords: Artificial neural network; Compartment fire; Kernel regression

Abstract

This paper presents a probabilistic-entropy-based neural network (PENN) model for tackling online data regression problems. The network learns online with an incremental growth network structure and performs regression in a noisy environment. The training samples presented to the model are clustered into hyperellipsoidal Gaussian kernels in the joint space of the input and output domains by using the principles of Bayesian classification and minimization of entropy. The joint probability distribution is established by applying the Parzen density estimator to the kernels. The prediction is carried out by evaluating the expected conditional mean of the output space with the given input vector. The PENN model is demonstrated to be able to remove symmetrically distributed noise embedded in the training samples. The performance of the model was evaluated by three benchmarking problems with noisy data (i.e., Ozone, Friedman#1, and Santa Fe Series E). The results show that the PENN model is able to outperform, statistically, other artificial neural network models. The PENN model is also applied to solve a fire safety engineering problem. It has been adopted to predict the height of the thermal interface, which is one of the indicators of the fire safety level of a fire compartment. The data samples are collected from a real experiment and are noisy in nature. The results show the superior performance of the PENN model working in a noisy environment, and the results are found to be acceptable according to industrial requirements.

© 2010 Elsevier B.V. All rights reserved.

1. Introduction

Data regression is one of the major applications of artificial neural network (ANN) models. The behavior of a nonlinear system is modeled by an ANN model via the process of network training. Traditional ANN models (e.g., multilayer perceptron (MLP) [1], radial basis function (RBF) [2], general regression neural network (GRNN) [3]) are, in general, offline learning models which require extensive computational time and resources for every incoming new sample. Realizing this limitation, the traditional ANN models have been upgraded to online learning [4]. For the MLP network, an online learning version that adopts the error function as the activation function of the neuron was developed in [5]. An online RBF model was proposed in [6]. The model, however, was not developed to operate in a noisy environment. In our previous work [7], the GRNN model was also modified, and an online learning version was developed by implementing Fuzzy ART [8] as the pre-processor for online clustering of the samples into a fewer number of kernels.

The Nadaraya-Watson estimator [9,10], shown in Eq. (1), is one of the most widely adopted templates for kernel-based regression model development, where K_h(x) = K(x/h) is the kernel function with a smoothing parameter h, and X_i and Y_i are the center and the label of the ith kernel, respectively.

m(x) = Σ_{i=1}^{n} Y_i K_h(x − X_i) / Σ_{i=1}^{n} K_h(x − X_i). (1)

To date, many kernel-based regression models [11–15] have been developed that are based on the Nadaraya-Watson estimator. The kernels are spherical in shape and identical. As shown in Eq. (1), the original Nadaraya-Watson estimator uses a global hyperspherical kernel. The spread of the global kernel (i.e., h) is optimized during the process of model training. In comparison with the original Nadaraya-Watson kernel, the correlated hyperellipsoidal kernel demonstrates a higher malleability for better description of the nonlinearity of the underlying function, and requires fewer kernels to describe the nonlinear underlying function. As such, the correlated hyperellipsoidal kernel is adopted in this work.

The GRNN model [3] is a powerful regression model adopting the Nadaraya-Watson estimator. The structure of the GRNN model is self-defined because every sample that is presented to the model is recruited as a kernel of the model. The major shortcoming of the GRNN model is the huge model structure required in the case of a large-sized training batch.
This limitation, however, can be overcome by the introduction of a clustering preprocessor. Different clustering processors [16–18] have been proposed to cluster the
training samples into a fewer number of kernels. When a new sam-
ple is presented for network training, it is clustered into one of
the existing kernels by a similarity checking process. The informa-
tion of the selected kernel (i.e., kernel center, width, and label) is
updated to include the information of the new sample. It should
be noted that the clustering process is performed only in the input
space. This means that only the similarities between the input vec-
tor of the presented sample and the input vectors of the kernels are
considered. The output of the sample is not involved in the clus-
tering process. Also, the kernel output is only used to update the
kernel label after the clustering process. In this case, the label of a
kernel becomes the average of the outputs of the samples that are
clustered into the kernel.
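To make the kernel-regression baseline concrete, the following minimal sketch (Python with NumPy; not part of the original paper, with illustrative function names and an arbitrary bandwidth) evaluates the Nadaraya-Watson estimator of Eq. (1) with a spherical Gaussian kernel, which is essentially how the GRNN model [3] behaves when every sample is recruited as a kernel.

```python
import numpy as np

def nadaraya_watson(x, centers, labels, h):
    """Nadaraya-Watson estimate m(x) of Eq. (1) with a spherical Gaussian kernel.

    centers : (n, M) array of kernel centers X_i
    labels  : (n,)   array of kernel labels  Y_i
    h       : global smoothing parameter (kernel spread)
    """
    # Gaussian kernel K_h(x - X_i) evaluated for every center.
    sq_dist = np.sum((centers - x) ** 2, axis=1)
    weights = np.exp(-0.5 * sq_dist / h ** 2)
    # Weighted average of the kernel labels, Eq. (1).
    return np.dot(weights, labels) / np.sum(weights)

# Illustrative usage on a noisy sine curve with every sample taken as a kernel.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 2.0 * np.pi, size=(100, 1))
Y = np.sin(X[:, 0]) + rng.normal(0.0, 0.2, size=100)
print(nadaraya_watson(np.array([1.0]), X, Y, h=0.3))
```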
This paper presents the development of a PENN (probabilistic-
entropy-based neural network) model that features online learning
and an incremental growth network structure, and works in a noisy
environment. In the development of the PENN model, however, we propose that clustering be performed in the joint input and output space.
The organization of this paper is as follows. Section 2 presents the architecture and the mechanism of the PENN model in both the network training and prediction phases. The noise elimination feature of the model is also described. As demonstrated in Section 3, the performance of the PENN model is assessed by using four benchmarking problems. Section 4 details the application of the PENN model to predict the height of the thermal interface inside a compartment fire. Note that the training samples are extracted from the actual results of a full-scale compartment fire experiment. Concluding remarks are then presented in Section 5.

2. The PENN model

The PENN model is a kernel-based regression model with scalar outputs. The prediction is carried out by evaluating the expected conditional mean of a given input vector on a probability distribution in the joint input and output space that is approximated by the kernel approach. The number of kernels of the PENN model grows incrementally during the course of network training by online clustering. When a new sample is presented to the PENN model, the proposed probabilistic-entropy-based clustering process will update the responding kernel only by the information of the presented sample. The correlated hyperellipsoidal Gaussian kernel (i.e., no restriction is imposed on the covariance terms), as described in Eq. (2), is adopted in the PENN model. It offers a higher malleability of the kernels, and hence fewer kernels are required to approximate the nonlinear underlying function as compared to the orthogonal Gaussian kernel.

φ(s) = [1 / ((2π)^{(M+1)/2} |Σ|^{1/2})] exp(−(1/2)(s − μ) Σ^{−1} (s − μ)^T), (2)

where s ∈ [0, 1]^{M+1} = {x = {s_i}_{i=1}^{M}, y = s_{M+1}} is the training sample, of which x = {s_i}_{i=1}^{M} and y = s_{M+1} are, respectively, the input vector and the scalar output of the sample; μ ∈ ℝ^{M+1} and Σ ∈ ℝ^{(M+1)×(M+1)} are the position vector and the covariance matrix of kernel φ.

2.1. Model training

The PENN model consists of two modules: an unsupervised clustering module and a conditional expectation evaluation module. The architecture of the model is shown in Fig. 1.

Fig. 1. The architecture of the PENN model.

2.1.1. Normalization

A training sample represented by a row vector ŝ = {ŝ_i}_{i=1}^{M+1}, prior to its presentation to the PENN model, is first normalized to s ∈ [0, 1]^{M+1} = {s_i}_{i=1}^{M+1} by Eq. (3) and stored in the input field, where lower(ŝ_i) and upper(ŝ_i) are, respectively, the lower and upper limits of the domain in dimension i, which are either provided by the user according to his/her knowledge of the system or determined from the training samples. The normalization conforms to the recommendation in [19] that the domains of different dimensions be roughly the same.

s = { s_i = (ŝ_i − lower(ŝ_i)) / (upper(ŝ_i) − lower(ŝ_i)) }_{i=1}^{M+1}. (3)
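As a minimal sketch (not from the paper; function and variable names are assumptions), the per-dimension min-max normalization of Eq. (3) and its inverse, which is used later for de-normalizing the prediction in Eq. (22), can be written as follows.

```python
import numpy as np

def normalize(s_hat, lower, upper):
    """Eq. (3): map each dimension of a raw sample into [0, 1]."""
    return (s_hat - lower) / (upper - lower)

def denormalize(s, lower, upper):
    """Eq. (22): inverse mapping from [0, 1] back to the original domain."""
    return lower + s * (upper - lower)

# The limits may be supplied by the user or taken from the training batch.
raw = np.array([[3.0, 250.0], [5.0, 400.0], [4.0, 300.0]])
lower, upper = raw.min(axis=0), raw.max(axis=0)
print(normalize(raw, lower, upper))
```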

2.1.2. Clustering

In general, like the variants [20–22] of Adaptive Resonance Theory (ART) [23–25], this clustering approach is developed on the basis of the sequential leader clustering algorithm [26]. Such variants of ART have also been widely adopted in different engineering applications [27–29]. The algorithm selects the first input pattern as the prototype of the first kernel. The next pattern is compared to the first kernel against a threshold of a similarity measure, which in the PENN model takes the form of the two-tier clustering approach described below. If this criterion is not satisfied, a new cluster is established and the input pattern is assigned as the new cluster kernel. This algorithm enables the number of clusters to grow with time to accommodate new input patterns sequentially. For the PENN model, the idea of clustering is that each normalized sample is clustered into one of the existing kernels Φ = {φ_k}_{k=1}^{K}. The proposed unsupervised clustering algorithm consists of a two-tier process, namely, Nomination and Confirmation.

(1) Nomination. In the first tier, when a normalized sample is presented to the PENN model in the training mode, all the kernels respond to the presentation of the sample by the posterior probabilities {p(φ_k|s)}_{k=1}^{K} in accordance with Bayes' theorem, as described in Eq. (4).

p(φ_k|s) = p(s|φ_k) p(φ_k) / p(s). (4)

The conditional probability, p(s|φ_k), is obtained using Eq. (5) by giving the kernel parameters, where μ_k = {μ_kj}_{j=1}^{M+1} and Σ_k, respectively, are the mean position row vector and the covariance matrix of the Gaussian kernel φ_k.

p(s|φ_k) = [1 / ((2π)^{(M+1)/2} |Σ_k|^{1/2})] exp(−(1/2)(s − μ_k) Σ_k^{−1} (s − μ_k)^T) (5)

The value of p(φ_k) is evaluated by Eq. (6). It is the ratio of the total number of samples that are clustered into kernel φ_k (i.e., n_k) to the total number of samples Σ_{i=1}^{K} n_i.

p(φ_k) = n_k / Σ_{i=1}^{K} n_i. (6)

The value of p(s) is the same for all kernels, as it is the normalization factor obtained from the conditional probabilities: p(s) = Σ_{k=1}^{K} p(s|φ_k) p(φ_k).

The kernels are arranged in descending order by their values of {p(φ_k|s)}_{k=1}^{K}. The kernel that has the highest conditional probability as described in Eq. (5) is selected and nominated for confirmation by the second tier of the clustering process.

J = arg max_j {p(φ_j|s)}. (7)

(2) Confirmation. It is believed that when a sample is clustered into a kernel, the information of the kernel, after the inclusion of this sample, should be clearer and the uncertainty of the kernel should be reduced because the sample provides extra information to the kernel. Studies [30–33] have demonstrated the effectiveness of adopting entropy-constrained approaches for classification and clustering. This has inspired the introduction of information entropy into the clustering algorithm in this second-tier checking. The uncertainty of the nominated kernel yielded from Eq. (7) should be reduced by the inclusion of the new sample. The uncertainty of the kernel can be represented by the entropy described by information theory. Because the kernels are Gaussian in shape, the differential entropy of the Gaussian kernel, as exemplified by Ahmed and Gokhale [34] and shown in Eq. (8), where |Σ_J| is the determinant of the positive definite covariance matrix of the Gaussian kernel φ_J in ℝ^{M+1} space, is adopted. The second tier of the clustering process follows the argument that the entropy of the kernel after the inclusion of the new sample (i.e., H_J^new) should be less than that before the inclusion of the new sample (i.e., H_J^old), as shown in Eq. (9). It can be observed that the determinant of the covariance matrix of the kernel (i.e., the spread of the kernel) will be reduced monotonically in the course of clustering and that the kernels inside the domain will more easily be distinguished.

H_J = (1/2) ln{(2πe)^{M+1} |Σ_J|}, (8)

H_J^new < H_J^old. (9)

If kernel J satisfies the conditions in Eqs. (7) and (9), then sample s is clustered into kernel J, i.e., kernel J is updated with the information of the sample.

2.1.3. Kernel updating

The proposed kernel updating scheme is designed to facilitate the adaptive change of the kernels. When updating kernel J for the inclusion of sample s, the parameters of the kernel are adjusted by Eq. (10).

n_J ← n_J + 1
μ_J ← [μ_J (n_J − 1) + s] / n_J
Σ_J ← [Σ_J (n_J − 1) + (s − μ_J)^T (s − μ_J)] / n_J        (10)

If kernel J that is nominated by Eq. (7) cannot satisfy the criterion described in Eq. (9), the value of p(φ_J|s) is set to zero, and kernel J remains inactive until the presentation of the next training sample. In this case, the next kernel with the highest probability is nominated by Eq. (7). This search continues until a kernel satisfies both Eqs. (7) and (9). This kernel updating scheme facilitates the noise removal feature of the PENN model, as described below.

The centers (i.e., position vectors) and spreads (i.e., covariance matrices) are the parameters that define the Gaussian kernels. These parameters are updated autonomously in the course of network training. Assume μ̃_j = ⟨x̃_j⟩ is the jth component of the position vector of a kernel, where x̃_j is the set of jth components of the noise-corrupted data. It is assumed that the noise-corrupted data consists of two components, x̃_j = x_j + ε_j, where x_j is the clean data upon which ε_j, the set of symmetrically distributed noise, acts. The value of μ̃_j is the expectation of the noise-corrupted samples, as shown in Eq. (11). If the noise ε_j that is embedded in the samples is symmetrically distributed (i.e., zero mean), the value of ⟨ε_j⟩ approximates zero in the long run. Then, the value of μ̃_j developed from the noise-corrupted samples is equal to that developed from the clean data x_j.

μ̃_j = ⟨x̃_j⟩ = ⟨x_j + ε_j⟩ = ⟨x_j⟩ + ⟨ε_j⟩ = ⟨x_j⟩. (11)

The second parameter of the Gaussian kernel is the covariance matrix. Every element of the covariance matrix is represented by σ̃_ij², which is the covariance between dimensions i and j of the kernel. It is defined as σ̃_ij² = cov(x_i + ε_i, x_j + ε_j), which is further expanded to Eq. (12).

σ̃_ij² = cov(x_i, x_j) + cov(x_i, ε_j) + cov(x_j, ε_i) + cov(ε_i, ε_j). (12)

Assuming that the noise contents at different dimensions of the input vector are random and uncorrelated with each other, the values of cov(x_i, ε_j), cov(x_j, ε_i), and cov(ε_i, ε_j) can be approximated as zero. Eq. (12) then becomes Eq. (13), which is the covariance of the clean data.

σ̃_ij² |_{i ≠ j} = cov(x_i, x_j). (13)

2.1.4. Kernel creation

If none of the kernels satisfies the criteria, a new hyperspherical kernel is created to code this sample by Eq. (14). The sample is taken as the kernel centre, μ_{K+1}, and an initial uniform spread σ is assigned to the created kernel.

n_{K+1} = 1,  μ_{K+1} = s,  Σ_{K+1} = σ I. (14)
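The two-tier clustering and the update/creation rules of Eqs. (4)–(14) can be summarized in the following sketch (Python/NumPy). This is one illustrative reading of the training step under stated assumptions, not the author's code; names such as `Kernel` and `penn_train_step` are hypothetical, and numerical safeguards are implementation choices.

```python
import numpy as np

class Kernel:
    def __init__(self, s, sigma0):
        d = s.size
        self.n = 1                      # number of clustered samples, Eq. (14)
        self.mu = s.copy()              # kernel centre
        self.cov = sigma0 * np.eye(d)   # initial uniform spread

    def pdf(self, s):
        """Correlated Gaussian kernel density, Eqs. (2) and (5)."""
        d = s.size
        diff = s - self.mu
        inv = np.linalg.inv(self.cov)
        norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(self.cov))
        return np.exp(-0.5 * diff @ inv @ diff) / norm

    def entropy(self):
        """Differential entropy of a Gaussian kernel, Eq. (8)."""
        d = self.mu.size
        return 0.5 * np.log((2 * np.pi * np.e) ** d * np.linalg.det(self.cov))

    def updated(self, s):
        """Candidate kernel parameters after including sample s, Eq. (10)."""
        k = Kernel(s, 0.0)
        k.n = self.n + 1
        k.mu = (self.mu * self.n + s) / k.n
        k.cov = (self.cov * self.n + np.outer(s - k.mu, s - k.mu)) / k.n
        return k

def penn_train_step(kernels, s, sigma0):
    """One online training step: nomination, confirmation, update or creation."""
    if kernels:
        total = sum(k.n for k in kernels)
        # Posterior p(kernel | s) up to the common factor p(s), Eqs. (4)-(6).
        posteriors = [k.pdf(s) * (k.n / total) for k in kernels]
        for j in np.argsort(posteriors)[::-1]:              # nomination, Eq. (7)
            candidate = kernels[j].updated(s)
            if candidate.entropy() < kernels[j].entropy():  # confirmation, Eq. (9)
                kernels[j] = candidate
                return kernels
    kernels.append(Kernel(s, sigma0))                        # kernel creation, Eq. (14)
    return kernels
```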

2.2. Model prediction

In a regression task, the predicted result can be approximated by the expected conditional mean of a given input vector x, which can be evaluated from the joint continuous probability density function p(x, y) by Eq. (15), where x and y, respectively, are the input vector and the scalar output of the underlying function f.

f(x) = ∫_{−∞}^{+∞} y p(x, y) dy / ∫_{−∞}^{+∞} p(x, y) dy. (15)

Parzen [35] proposed a nonparametric estimation of the probability density function p(x, y) from the information of the available samples by the kernel approach. The probability density function can be approximated by Eq. (16), where the kth kernel, φ_k(s), is defined in Eq. (2).

p(x, y) = (1/K) Σ_{k=1}^{K} φ_k(x, y). (16)

By applying Eq. (16) to Eq. (15), the expected conditional mean can be estimated by Eq. (17),

f(x) = Σ_{k=1}^{K} ∫_{−∞}^{+∞} y φ_k(x, y) dy / Σ_{k=1}^{K} ∫_{−∞}^{+∞} φ_k(x, y) dy, (17)

where x = {s_j}_{j=1}^{M} and y = s_{M+1}. The evaluation of the conditional expectation in the ℝ² domain is illustrated graphically in Fig. 2. The centroid of the cross-sectional area, which is created by cutting the joint probability function at the given input vector x_o, represents the conditional expectation.

Fig. 2. The regression in the ℝ² domain by the kernel-based approach to evaluate the conditional expectation. The noise-corrupted samples are clustered into different correlated Gaussian kernels by the unsupervised clustering algorithm. The mean position vector μ_i and the covariance matrix Σ_i of each kernel i are determined from the clustered samples of the kernel. The probability density distribution of the joint domain is approximated by using the Parzen density estimator. The prediction is carried out by evaluating the conditional expected mean for the given input value.

The integral in the denominator of Eq. (17) can be evaluated by Eq. (18).

∫_{−∞}^{+∞} φ_i(x, y) dy = φ_i(x). (18)

The integral in the numerator of Eq. (17) can be evaluated by the probabilistic approach, as follows. Let ȳ_i(x) = arg max_y {φ_i(x, y)} be the mean of the distribution φ_i(x, y) at location x. Then, the numerator of Eq. (17) can be evaluated as follows.

∫_{−∞}^{+∞} y φ_i(x, y) dy = ∫_{−∞}^{+∞} [y − ȳ_i(x) + ȳ_i(x)] φ_i(x, y) dy = ∫_{−∞}^{+∞} [y − ȳ_i(x)] φ_i(x, y) dy + ∫_{−∞}^{+∞} ȳ_i(x) φ_i(x, y) dy = 0 + ȳ_i(x) ∫_{−∞}^{+∞} φ_i(x, y) dy, hence
∫_{−∞}^{+∞} y φ_i(x, y) dy = ȳ_i(x) φ_i(x). (19)

By putting Eqs. (18) and (19) back into Eq. (17), the expected conditional mean can be evaluated by Eq. (20).

ȳ(x) = Σ_{k=1}^{K} ȳ_k(x) φ_k(x) / Σ_{k=1}^{K} φ_k(x). (20)

The value of φ_k(x) in Eq. (20) can be obtained by putting the value of x into kernel φ_k. The value of ȳ_k(x) can be evaluated by setting the derivative of φ_k(s) in Eq. (20) with respect to s_{M+1} to zero. The result is shown in Eq. (21),

ȳ_k(x) = μ_{k,M+1} − [Σ_{i=1}^{M} (s_i − μ_{ki}) a^k_{M+1,i}] / a^k_{M+1,M+1}, (21)

where {a^k_{M+1,i}}_{i=1}^{M+1} are the entries of the (M+1)th row of the inverse of the covariance matrix Σ_k.

The expected conditional mean can then be evaluated by Eqs. (20) and (21). The predicted output of the model is obtained by de-normalizing the expected conditional mean by Eq. (22), which is the inverse process of Eq. (3).

ŝ_{M+1} = lower(ŝ_{M+1}) + s_{M+1}[upper(ŝ_{M+1}) − lower(ŝ_{M+1})]. (22)

Fig. 3 summarizes the mechanisms of training and prediction of the PENN model.
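Continuing the earlier sketch (same hypothetical `Kernel` class; not the author's code), the prediction of Eqs. (20)–(22) can be written as follows. Here φ_k(x) of Eq. (18) is read as the marginal Gaussian density of the input dimensions, and ȳ_k(x) is obtained from the last row of the inverse joint covariance matrix as in Eq. (21).

```python
import numpy as np

def penn_predict(kernels, x, lower_out, upper_out):
    """Expected conditional mean of Eqs. (20)-(21), de-normalized by Eq. (22).

    x is the normalized input vector (length M); the output is dimension M+1.
    """
    M = x.size
    num, den = 0.0, 0.0
    for k in kernels:
        mu_x, mu_y = k.mu[:M], k.mu[M]
        diff = x - mu_x
        # Marginal Gaussian density of the input block, phi_k(x) in Eq. (20).
        cov_xx = k.cov[:M, :M]
        phi_x = np.exp(-0.5 * diff @ np.linalg.inv(cov_xx) @ diff) / (
            (2 * np.pi) ** (M / 2) * np.sqrt(np.linalg.det(cov_xx)))
        # Conditional mean of the output dimension, Eq. (21), using the last
        # row a[M, :] of the inverse joint covariance matrix.
        a = np.linalg.inv(k.cov)
        y_bar = mu_y - (diff @ a[M, :M]) / a[M, M]
        num += y_bar * phi_x
        den += phi_x
    y_norm = num / den
    return lower_out + y_norm * (upper_out - lower_out)   # Eq. (22)
```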
3. Experimental studies

A series of empirical studies is presented in this section to assess the effectiveness of the proposed PENN model in tackling noisy regression tasks. The performance and dynamics of the PENN model are first examined by using the example of a noise-corrupted sine curve. The main aim is to demonstrate the reconstruction of the sine curve from a sample that is randomly taken from a clean sine curve with Gaussian noise introduced. Then, for performance comparison among the PENN model and other approaches, three benchmark problems are investigated. The first and second problems, i.e., Ozone and Friedman#1, comprise real and synthetic data (with Gaussian noise introduced), respectively. The third problem is a real, astrophysical dataset, i.e., Santa Fe Series E, which is noisy, discontinuous, and nonlinear in nature. These three benchmarking problems are selected to evaluate the performance of the PENN model in a noisy environment.

3.1. Noise-corrupted sine curve

A sine curve y = sin(x) was created within the domain x ∈ [0, 2π]. A total of 100 samples were randomly taken from the created sine curve, and Gaussian noise N(0, 0.2) was introduced to the output of the samples. These 100 noise-corrupted samples were used as the training samples of the PENN model, which was applied to reconstruct the curve. In the normalization process, the outputs of the samples were normalized to [0, 1] by Eq. (3), where the lower and upper limits of the domain were determined respectively by the maximum and minimum values of the samples. The goodness of fit between the reconstructed and actual sine curves was measured by the mean squared error (MSE) of the values predicted by the PENN model and the clean output values of the 100 samples.
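The training data for this example can be generated with a few lines (illustrative only; the paper does not state a random seed, and N(0, 0.2) is read here as a zero-mean Gaussian with standard deviation 0.2).

```python
import numpy as np

rng = np.random.default_rng(42)                 # seed chosen arbitrarily
x = rng.uniform(0.0, 2.0 * np.pi, 100)          # 100 random inputs on [0, 2*pi]
y_clean = np.sin(x)                             # underlying clean sine curve
y_noisy = y_clean + rng.normal(0.0, 0.2, 100)   # symmetric Gaussian noise

# Joint samples s = (x, y) presented to the PENN model for online training.
samples = np.column_stack([x, y_noisy])
```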

Fig. 3. Mechanisms of the PENN model training and prediction.

Different values of σ were tried, and the results are shown in Fig. 4, which shows the probability density distributions established by the Parzen density estimator [35] based on the Gaussian kernels. Fig. 4(a) shows the result of σ = 0.1. Six kernels were created. As can be seen, the estimated probability density function is rather flat. This implies an under-fitting scenario, as the PENN model treated the nonlinearity of the system as the noise content. Fig. 4(b) shows the results predicted by the PENN model with σ = 0.01. A good agreement is shown between the clean and reconstructed sine curves. A total of 14 kernels were created in this case. The probability density function in this case is less uncertain than that of the previous case (Fig. 4(a)). The initial kernel radius was further reduced, to 0.001, and the predicted results are shown in Fig. 4(c). As can be seen, an over-fitting scenario occurred, as the PENN model treated the noise content of the samples as the nonlinearity of the system.

Note that the probability density distributions become spiky when a small initial kernel radius is adopted. This effect of the initial kernel radius on the number of kernels that are created is depicted in Fig. 5, which demonstrates that an increase in the initial kernel radius results in a reduction in the total number of kernels created. This is because a small kernel radius induces a spiky kernel into which a new sample cannot easily be clustered, because the entropy of the kernel, after the inclusion of the new sample, is likely to increase, so that it does not satisfy the confirmation criterion of the kernel as stated in Eq. (9). Hence, a new kernel is more likely to be created to code this new sample. Conversely, a new sample is more easily covered by a kernel with a large spread (i.e., a large initial kernel radius). The inclusion of a single new sample will not greatly reduce the spread of the kernel. Thus, the new sample is more easily clustered into one of the existing kernels. Hence, the total number of kernels created is reduced.

Fig. 4 indicates that a large initial kernel radius under-fits the noisy samples while a small initial kernel radius over-fits the noisy samples. It can be observed that the prediction error is governed by the number of kernels created. The prediction error in Fig. 4(a) and (c) is larger than that in Fig. 4(b). It is expected that an optimal number of kernels may exist for which the prediction error is minimal. This phenomenon is shown in Fig. 6. It demonstrates that an optimal number of kernels does exist and is around 20. Based on this observation, the estimation of the initial kernel radius in this study is carried out by a trial search in the domain (0, 1) of the kernel radius.

We also compared the performance of the probabilistic-entropy-based clustering process of the PENN model with the traditional Linear Vector Quantization Clustering (LVQC) algorithm [36]. For a fair comparison, the LVQC used the same number of cluster centers as created by the PENN model (i.e., 14 centers in the case with σ = 0.01). The LVQC algorithm as described in [36] is summarized as follows.

Step 1 Randomly select 14 samples {m_i}_{i=1}^{14} from the total 100 available samples as the cluster centers.
Step 2 Present one sample x(t) to the LVQC and determine the center m_l(t) = arg min_{1≤i≤14} ||x(t) − m_i(t)|| to which the sample should be clustered, where t is the epoch number.
Step 3 Update the center m_l by m_l(t + 1) = m_l(t) + α[x(t) − m_l(t)], where α = 1/t.
Step 4 Repeat Steps 2–3 until all samples have been presented.
Step 5 Repeat Steps 2–4 until the preset number of epochs is reached. In this test, the total number of epochs is taken to be 100.

By replacing the proposed probabilistic-entropy-based clustering process with the traditional LVQC, the prediction result shown in Fig. 7 is obtained. Compared with Fig. 4(b), the kernels are spiky and some of the kernel centers deviate from the original sine curve. The MSE of the prediction is 0.0132, which is larger than the MSE of the PENN prediction with σ = 0.01.
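A compact sketch of Steps 1–5 (Python/NumPy; illustrative, with an arbitrary seed) is given below. As in the description above, the learning rate α = 1/t is tied to the epoch counter t.

```python
import numpy as np

def lvq_clustering(samples, n_centers=14, n_epochs=100, seed=0):
    """Sequential LVQ clustering as summarized in Steps 1-5."""
    rng = np.random.default_rng(seed)
    # Step 1: pick the initial centers at random from the available samples.
    centers = samples[rng.choice(len(samples), n_centers, replace=False)].copy()
    for t in range(1, n_epochs + 1):           # Step 5: repeat for preset epochs
        alpha = 1.0 / t                        # Step 3: learning rate
        for x in samples:                      # Step 4: present every sample
            # Step 2: nearest center under the Euclidean norm.
            l = np.argmin(np.linalg.norm(centers - x, axis=1))
            # Step 3: move the winning center towards the sample.
            centers[l] += alpha * (x - centers[l])
    return centers
```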

Fig. 4. (a–c) The sine curve is reconstructed with different values of σ. The dashed lines are the clean sine signals and the dots are the 100 training samples randomly drawn from the clean sine curve with their outputs corrupted by Gaussian noise N(0, 0.2). The solid lines are the curves reconstructed by the PENN model based on the information of the noise-corrupted samples. The probability density distributions established by the Parzen density estimator are presented in the figures by the contour lines.

Fig. 5. The number of kernels created is inversely proportional to the initial kernel radius (σ). A small initial kernel radius creates more kernels while a large initial kernel radius creates comparatively fewer kernels.

Fig. 6. A total of 5000 trials were carried out with a randomly assigned initial kernel radius. The prediction errors of the 5000 trials and the corresponding numbers of kernels created are plotted. The prediction errors are minimum when the number of kernels is about twenty (20), which is considered to be the optimal number of kernels in the problem.

Fig. 7. By replacing the proposed probabilistic-entropy-based clustering algorithm with the traditional Learning-Vector-Quantization Clustering (LVQC) algorithm, the kernels become spiky. Also, some of the kernel centers are quite far from the sine curve. The dashed line is the clean sine signal and the solid line is the curve reconstructed by the LVQC and the expected conditional mean approach.

3.2. Ozone

This dataset was obtained from the University of California at Berkeley (ftp://ftp.stat.berkeley.edu/pub/users/breiman). It has 330 samples with eight inputs and one output. The input samples comprise meteorological information such as humidity and temperature. The target output is the maximum daily ozone at a location in the Los Angeles basin. In accordance with [37], 250 samples were randomly selected from the dataset for network training. The remaining 80 samples were used for network testing. In the normalization process, the outputs of the samples were normalized to [0, 1] by Eq. (3), where the lower and upper limits of the domain were determined respectively by the maximum and minimum values of the samples. The initial radius of the kernel was estimated to be 0.2.

In total, 100 experimental runs were performed. The average mean squared error (MSE) and its standard deviation obtained from the test set were calculated. The results of the PENN model are compared with those from the Neural-BAG (NBAG), Bench, and Simple models, as shown in Table 1. Note that the Bench model [38] uses bagging to produce an ensemble of neural network sub-models that are trained by different datasets created by re-sampling of the original dataset by the bootstrap technique. It takes the average of the predicted outputs of the neural network sub-models as the final predicted output. The Simple model is similar to the Bench model, but is equipped with a fast-stop training algorithm [37]. The NBAG model is similar to the Simple model, but uses an algorithm to control the diversity among the neural network sub-models to increase the generalization performance of the overall model. As summarized in Table 1, the MSE of the PENN model is 17.78 with a standard deviation of 2.88. These results are better than those of the other models reported in [37]. Note also that only six kernels were created from the training samples.

Table 1
MSEs of different prediction models on Ozone.

Model     MSE
NBAG      18.37 (3.59)
Bench     18.58 (3.40)
Simple    19.14 (3.21)
PENN      17.78 (2.88)

Standard deviations of the MSEs are bracketed.

3.3. Friedman#1

This is a synthetic benchmark dataset that was proposed in [39]. Each sample consists of five inputs and one output. The formula for data generation is

t = 10 sin(π x₁ x₂) + 20(x₃ − 0.5)² + 10 x₄ + 5 x₅ + ε,

where ε is a Gaussian random noise N(0, 1) and each of x₁, ..., x₅ is uniformly distributed over the domain [0, 1]. Similar to [37], 1400 samples were created, of which 400 samples were randomly chosen for network training. The remaining 1000 samples were used for network testing. In the normalization process, the outputs of the samples were normalized to [0, 1] by Eq. (3), where the lower and upper limits of the domain were determined respectively by the maximum and minimum values of the samples. The initial radius of the kernel was determined by trials to be 0.07, which achieved the minimum value of the MSE in the trials. Table 2 summarizes the results that are predicted by the PENN model and other models that are listed in [37].

For the PENN model, the MSE obtained by averaging the results from 100 trials is 4.796 with a standard deviation of 0.480. In total, 24 kernels were created by the PENN model by setting the initial radius to 0.07. The MSE of the PENN model is higher than that of the NBAG model, but lower than those of the other models.

Table 2
MSEs of different prediction models on Friedman#1.

Model     MSE
NBAG      4.502 (0.268)
Bench     5.372 (0.646)
Simple    4.948 (0.589)
PENN      4.796 (0.480)

Standard deviations of the MSEs are bracketed.
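For reference, the Friedman#1 data used above can be synthesized directly from the stated formula (a sketch; the split sizes follow the text, and the seed is arbitrary).

```python
import numpy as np

def friedman1(n_samples, seed=0):
    """Generate Friedman#1 samples: five uniform inputs and a noisy target."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 1.0, size=(n_samples, 5))
    t = (10 * np.sin(np.pi * x[:, 0] * x[:, 1])
         + 20 * (x[:, 2] - 0.5) ** 2
         + 10 * x[:, 3] + 5 * x[:, 4]
         + rng.normal(0.0, 1.0, n_samples))       # Gaussian noise N(0, 1)
    return x, t

# 1400 samples in total: 400 for training, the remaining 1000 for testing.
x_all, t_all = friedman1(1400)
x_train, t_train = x_all[:400], t_all[:400]
x_test, t_test = x_all[400:], t_all[400:]
```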
3.4. Santa Fe Series E

This dataset is obtained from Series E of the Santa Fe Time Series Competition [40]. It is a univariate time series of astrophysical data (variation in the light intensity of a star), and can be downloaded from http://www-psych.stanford.edu/∼andreas/Time-Series/. The data series is noisy, discontinuous, and nonlinear in nature. In accordance with [41], a total of 2048 samples were used, each with five inputs and one output, i.e., x_t = f(x_{t−1}, x_{t−2}, x_{t−3}, x_{t−4}, x_{t−5}), where x_t is the intensity of the star at time t. The data presentation order was the same as the original. The first 90% of the dataset was extracted for network training and validation. The last 10% was extracted for testing. In the normalization process, the outputs of the samples were normalized to [0, 1] by Eq. (3), where the lower and upper limits of the domain were determined respectively by the maximum and minimum values of the samples. The initial kernel radius was estimated to be 0.8 by trials. This initial kernel radius was kept unchanged throughout this study. In total, 100 experimental runs were carried out. Fig. 8 shows a comparison between the time series of the test set (thin line) and the series predicted by the PENN model (bold line). The average number of kernels created was only four, which suggests that the nonlinearity of this time series may not be as high as that of the above benchmarking problems.

The average MSE is shown in Table 3. The results reported in [41], i.e., those of the pattern modeling and recognition system (PMRS), exponential smoothing (ES), and neural network (NN), are included for comparison. Note that PMRS is designed for noisy time series prediction by employing one-step forecasting, while ES is a regression method with an exponential smoothing parameter.
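A sketch of how the univariate series can be arranged into the five-lag input/output samples described above is given below (an assumption of a plain sliding window over the raw series; the file name is hypothetical).

```python
import numpy as np

def make_lag_samples(series, n_lags=5):
    """Build samples (x_{t-1}, ..., x_{t-n_lags}) -> x_t from a 1-D series."""
    inputs, targets = [], []
    for t in range(n_lags, len(series)):
        inputs.append(series[t - n_lags:t][::-1])   # most recent lag first
        targets.append(series[t])
    return np.asarray(inputs), np.asarray(targets)

# First 90% of the samples for training/validation, last 10% for testing.
series = np.loadtxt("santa_fe_series_e.txt")        # hypothetical local copy
X, y = make_lag_samples(series, n_lags=5)
split = int(0.9 * len(X))
X_train, y_train = X[:split], y[:split]
X_test, y_test = X[split:], y[split:]
```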

Fig. 8. Comparison of the last 10% of the original untreated actual time series of the Santa Fe Series E and the time series predicted by the PENN model. The thin line represents the original time series. The thick line represents the result predicted by the PENN model.

Table 3
MSEs of different prediction models on Santa Fe Series E.

Model     MSE
PMRS      0.015
ES        0.033
NN        0.078
PENN      0.003

The NN model is a feed-forward multilayer perceptron with one hidden layer, and the number of hidden nodes was determined by using the procedure in [42] to achieve the minimum generalization error or maximum generalization performance. It can be seen that the MSE of the PENN is lower than those of the other models.

4. Application to compartment fires

ANN techniques have been widely applied to fire detection [43–47]. Studies have confirmed the applicability of ANN techniques to fire detection systems. The superior performance of these techniques compared with those from traditional models has also been addressed. However, the application of ANN techniques to determine the consequences of building fires is still very limited [48–50], and this area of work constitutes the main study of this paper.

During the development of a compartment fire, hot smoke is created from the fire source. It rises to the ceiling of the compartment owing to the buoyancy force of the hot smoke. When the hot smoke reaches the ceiling, it adheres to the ceiling and spreads to the boundaries of the ceiling. When it reaches the boundaries, the hot smoke accumulates and descends. The hot smoke emerges from the compartment when it reaches the door soffit. At the same time, cold air is entrained into the compartment from the door opening and enters into the fire plume (i.e., the column of hot smoke rising from the fire bed to the ceiling) to support the combustion. The mechanism is shown in Fig. 9(a). It can be seen that the compartment is divided into two layers. The upper layer contains the hot smoke and the lower layer contains the cold air. The thermal interface is the boundary between the hot and cold gas layers in a compartment fire. The height of the thermal interface (HTI) is one of the major parameters in the determination of the untenable condition of a fire scenario. When the temperature of the hot smoke is too high, substantial heat radiation is emitted from the hot smoke layer, and this endangers the evacuees underneath. The HTI depends predominantly on the mass of air that is entrained into the fire plume. However, the analytical determination of the air mass flow rate is
complicated because it is highly nonlinear in nature. Currently, the
computational fluid dynamics (CFD) technique is widely employed
to simulate the fire and smoke spread, but a major shortcoming
of this technique is the extensive computational resources and the
lengthy computational time incurred.
In this study, the PENN model, as an alternative to the CFD tech-
nique, was applied to predict the location of the thermal interface
in a single compartment fire. The data provided by the 55 exper-
iments carried out by Steckler et al. [51] were recruited as the
training samples. In Steckler's experiments, the size of the fire compartment was kept unchanged, but the fire bed was moved to different locations inside the compartment in different test cases. Details of the experimental setup may be found in [51].
Table 4 summarizes the controlled parameters and measured
results of the experiments. All the controlled parameters were
taken as the input variables while the HTI (bolded in the measured
results of Table 4) was taken as the scalar output. Because the HTI is
presented in the format of the mean ± error, only the mean values
were taken as the target values for network training. In other words,
the error range was hidden from the network training phase.

Table 4
Summary of the controlled parameters and the measured results of the experiments
of Steckler et al.

Controlled parameters
Opening configuration
Door sill above floor (m)
Door width (m)
Fire strength (kW)
Fire location
Distance from the centerline of the opening to the
center of the fire bed (parallel to the opening) (m)
Distance from the vertical centerline of the opening to
the center of the fire load (perpendicular to the opening)
(m)
Fire bed center above floor (m)
Ambient temperature (°C)

Measured results
Air mass flow rate (kg/s)
Neutral plane location (%)
Height of thermal interface (m)
Average temperature of the upper gas layer (°C)
Average temperature of the lower air layer (°C)
Maximum mixing rate (kg/s)

Fig. 9. (a) The dynamics of a compartment fire. The interface between the hot gases and the ambient air at Z_i above the floor is defined as the thermal interface. (b) The dimensions of the fire compartment are 2.8 m (W) × 2.8 m (L) × 2.18 m (H). A methane burner with a porous diffuser is placed on the floor of the compartment. Thermocouples and velocity probes are provided at the doorway to measure the properties of the ambient air and hot gases flowing across the door opening. The fire bed, with a constant heat release rate of 62.9 kW, is moved to different locations on the floor of the compartment in different cases of the experiment. Details of the experimental setup may be found in [51].

Leave-one-out cross validation was applied to evaluate the performance of the PENN prediction. For each trial, out of the total number of 55 samples, 1 sample was randomly drawn and taken out. The other 54 samples were presented to the network for training. After several trials, the best value of the initial kernel radius was estimated to be 0.02. The average number of kernels created was only eight. In total, 50,000 runs were carried out with different orders of sample presentation, and 50,000 error values were obtained after the simulation. Fig. 10 plots the summary of the 50,000 results that were predicted by the PENN model against the target values given in the experimental data.

It can be observed that the mean of the predicted results (i.e., the solid line in Fig. 10) is close to the experimental results. The coefficient of correlation is 0.953. It indicates a reasonable agreement between the predicted and experimental results. Also, the coefficient of correlation is higher than that predicted by the GRNNFA model in [50] (i.e., 0.929). The prediction results are further investigated by overlapping the error envelope of the experimental data (i.e., the dotted lines in Fig. 10) onto the 95% confidence interval of the predicted results (i.e., the dashed lines in Fig. 10). Note that the 95% confidence interval was evaluated by assuming that the predicted results of the same target HTI value were normally distributed. As the error envelope was unseen in the training phase and was not involved in model learning, these predicted results are considered acceptable because most of the 95% confidence intervals of the predicted results were within the error envelope of the experimental data, except at HTI = 1.49 m. Numerically, 48,654 predicted results out of the total 50,000 results fall within the error envelope of the experimental data, and the percentage of correct predictions is 97.3%, which is higher than the 94.5% correct prediction by the GRNNFA model in [50].

Fig. 10. The predicted outputs of the PENN model are plotted against the experimental results to demonstrate the performance of the prediction. The solid line is the mean of the total of 50,000 predicted results. The dashed lines represent the 95% confidence interval of the predicted results. The dotted lines are the error envelope of the experimental results.
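The leave-one-out procedure described above can be sketched as follows (illustrative only; `penn_train_step` and `penn_predict` refer to the hypothetical helpers sketched in Section 2, the data are assumed to be normalized to [0, 1], and the repetition over random presentation orders mirrors the 50,000 runs reported).

```python
import numpy as np

def leave_one_out_errors(samples, sigma0, n_runs, seed=0):
    """Leave-one-out evaluation with randomized sample presentation orders.

    samples: (N, M+1) array of normalized joint samples (inputs plus output).
    penn_train_step / penn_predict: hypothetical helpers from the Section 2 sketches.
    """
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(n_runs):
        i = rng.integers(len(samples))          # sample held out for testing
        train = np.delete(samples, i, axis=0)
        rng.shuffle(train)                      # random presentation order
        kernels = []
        for s in train:                         # online training, Section 2.1
            kernels = penn_train_step(kernels, s, sigma0)
        x, y_true = samples[i, :-1], samples[i, -1]
        y_pred = penn_predict(kernels, x, lower_out=0.0, upper_out=1.0)
        errors.append(y_pred - y_true)
    return np.asarray(errors)
```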
5. Conclusions

In this paper, the PENN model, which is a kernel-based regression model with correlated hyperellipsoidal Gaussian kernels, is proposed. It is an ANN model with the properties of an incremental growth structure and noise reduction. The clustering scheme of the PENN model adopts Bayes' theorem and the minimization of entropy to form a two-tier bidirectional algorithm that either clusters a new sample into one of the existing kernels or creates a new kernel to code the sample. The kernel updating scheme is able to eliminate symmetrically distributed noise embedded in the training samples, as demonstrated in the study of the reconstruction of the noise-corrupted sine curve.

The effectiveness of the PENN model in noisy data regression problems has been evaluated by using the Ozone, Friedman#1, and Santa Fe Series E benchmark problems. The results ascertain that the PENN model is comparable, if not superior, to the other neural network and regression models for noisy data regression problems. One obvious advantage of the PENN model, in comparison with those models, is the computational requirement. The PENN model requires fewer kernels to describe the nonlinearity of a system because hyperellipsoidal Gaussian kernels are adopted. Consequently, the network requires less computational time for network training and prediction. As a practical case study, the PENN model has been applied to the prediction of the HTI inside a fire compartment by using real experimental data. Statistical analysis of the results positively demonstrates the applicability of the PENN model to compartment fire studies.

Although the performance of the PENN model is encouraging, a number of aspects need further investigation. The next stage of model development will focus on the determination of the initial radii of the kernels (i.e., the value of σ). In the above studies, the best values of the radii in the different benchmarking problems were determined empirically. It was observed that a small radius incurs more kernels while a large radius results in fewer kernels. It was also observed from the noise-corrupted sine curve example that the initial kernel radius is a parameter that distinguishes between the system nonlinearity and the noise embedded in the samples. Without prior knowledge of the system, one is unable to distinguish between them. On the other hand, studies [52,53] have indicated that the initial radius of a kernel may be determined by its neighboring kernels. Thus, it is worthwhile to investigate how to determine the radius of the first created kernel by following the approaches in [52,53].

Acknowledgement

The work that is described in this paper was fully supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China [project no. CityU/115506].

References

[1] F. Rosenblatt, Principles of Neurodynamics, Spartan Books, New York, 1962.
[2] D.S. Broomhead, D. Lowe, Multi-variable functional interpolation and adaptive networks, Complex Systems 2 (1988) 321–355.
[3] D.F. Specht, A general regression neural network, IEEE Transactions on Neural Networks 2 (6) (1991) 568–576.
[4] D. Saad, On-line Learning in Neural Networks, Cambridge University Press, 1998.
[5] D. Saad, S.A. Solla, Exact solution for on-line learning in multilayer neural networks, Physical Review Letters 74 (21) (1995) 4337–4340.
[6] J.A. Freeman, D. Saad, Dynamics of on-line learning in radial basis function networks, Physical Review E 56 (1) (1997) 907–918.
[7] R.K.K. Yuen, E.W.M. Lee, C.P. Lim, G.W.Y. Cheng, Fusion of GRNN and FA for online noisy data regression, Neural Processing Letters 19 (3) (2004) 227–241.
[8] G.A. Carpenter, S. Grossberg, D.B. Rosen, Fuzzy ART: fast stable learning and categorization of analog patterns by an adaptive resonance system, Neural Networks 4 (1991) 759–771.
[9] E.A. Nadaraya, On estimating regression, Theory of Probability and Its Applications 9 (1964) 141–142.
[10] G.S. Watson, Smooth regression analysis, Sankhya: The Indian Journal of Statistics – Series A 26 (1964) 359–372.
[11] Z. Cai, Weighted Nadaraya-Watson regression estimation, Statistics & Probability Letters 51 (2001) 307–318.

[12] S.K. Padhy, S.P. Panigrahi, P.K. Patra, S.K. Nayak, Non-linear channel equalization using adaptive MPNN, Applied Soft Computing 9 (2009) 1016–1022.
[13] L. Devroye, The Hilbert kernel regression estimate, Journal of Multivariate Analysis 65 (1998) 209–227.
[14] P. Meinicke, S. Klanke, R. Memisevic, H. Ritter, Principal surfaces from unsupervised kernel regression, IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (9) (2005) 1379–1391.
[15] E.W.M. Lee, C.P. Lim, R.K.K. Yuen, S.M. Lo, A hybrid neural network model for noisy data regression, IEEE Transactions on Systems, Man and Cybernetics, Part B 34 (2) (2004) 951–960.
[16] J.B. Bezdek, A convergence theorem for the fuzzy ISODATA clustering algorithms, IEEE Transactions on Pattern Analysis and Machine Intelligence PAMI-2 2 (1980) 1–8.
[17] T. Kohonen, The self-organizing map, Proceedings of the IEEE 78 (9) (1990) 1464–1480.
[18] G.A. Carpenter, S. Grossberg, B.R. David, Fuzzy ART: fast stable learning and categorization of analog patterns by an adaptive resonance system, Neural Networks 4 (1991) 759–771.
[19] J.R. William, Gaussian ARTMAP: a neural network for fast incremental learning of noisy multidimensional maps, Neural Networks 9 (5) (1996) 881–887.
[20] S.C. Tan, M.V.C. Rao, C.P. Lim, Fuzzy ARTMAP dynamic decay adjustment: an improved fuzzy ARTMAP model with a conflict resolving facility, Applied Soft Computing 8 (2008) 543–554.
[21] G.A. Carpenter, S. Grossberg, N. Markuzone, J.H. Reynolds, D.B. Rosen, Fuzzy ARTMAP: a neural network structure for incremental supervised learning of analog multidimensional maps, IEEE Transactions on Neural Networks 3 (5) (1992) 698–713.
[22] J.R. Williamson, Gaussian ARTMAP: a neural network for fast incremental learning of noisy multidimensional maps, Neural Networks 9 (5) (1996) 881–897.
[23] G.A. Carpenter, S. Grossberg, A massively parallel architecture for a self-organising neural pattern recognition machine, Computer Vision, Graphics and Image Processing 37 (1987) 54–115.
[24] G.A. Carpenter, S. Grossberg, ART2: stable self-organisation of pattern recognition codes for analogue input patterns, Applied Optics 26 (1987) 4919–4930.
[25] G.A. Carpenter, S. Grossberg, ART3 hierarchical search: chemical transmitters in self-organising pattern recognition architectures, Neural Networks 3 (1990) 129–152.
[26] J.A. Hartigan, Clustering Algorithm, John Wiley and Sons, New York, 1975.
[27] M.L.M. Lopes, C.R. Minussi, A.D.P. Lotufo, Electric load forecasting using fuzzy ART&ARTMAP neural network, Applied Soft Computing 5 (2005) 235–244.
[28] A. Quteishat, C.P. Lim, A modified fuzzy min-max neural network with rule extraction and its application to fault detection and classification, Applied Soft Computing 8 (2008) 985–995.
[29] H.S. Soliman, M. Omari, A neural network approach to image data compression, Applied Soft Computing 6 (2006) 258–271.
[30] C. Holmes, D. Denison, Minimum-entropy data partitioning using reversible jump Markov Chain Monte Carlo, IEEE Transactions on Pattern Analysis and Machine Intelligence 23 (8) (2001) 909–914.
[31] N.B. Karayiannis, An axiomatic approach to soft learning vector quantization and clustering, IEEE Transactions on Neural Networks 10 (5) (1999) 1153–1165.
[32] E. Gokcay, J.C. Principe, Information theoretic clustering, IEEE Transactions on Pattern Analysis and Machine Intelligence 24 (2) (2002) 158–171.
[33] G. Karystinos, On overfitting, generalization, and randomly expanded training sets, IEEE Transactions on Neural Networks 11 (5) (2000) 1050–1057.
[34] N.A. Ahmed, D.V. Gokhale, Entropy expression and their estimators for multivariate distributions, IEEE Transactions on Information Theory 35 (3) (1989) 688–692.
[35] E. Parzen, On estimation of a probability density function and mode, Annual Mathematical Statistics 33 (1962) 155–167.
[36] R. Inokuchi, S. Miyamoto, LVQ clustering and SOM using a kernel function, Proceedings of IEEE International Conference on Fuzzy Systems 3 (2004) 25–29.
[37] J. Carney, P. Cunningham, Tuning diversity in bagged ensembles, International Journal of Neural Systems 10 (4) (2000) 267–279.
[38] L. Breiman, Bagging Predictors, Technical Report No. 421, Department of Statistics, University of California at Berkeley, California, 1994.
[39] J. Freidman, Multivariate adaptive regression splines (with discussion), Annals of Statistics 19 (1991) 1–141.
[40] A.S. Weigend, N.A. Gersehnfield, Time Series Prediction: Forecasting the Future and Understanding the Past, Addison-Wesley, Reading, MA, 1994.
[41] S. Singh, Noise impact on time-series forecasting using an intelligent pattern matching technique, Pattern Recognition 32 (1999) 1389–1398.
[42] S.M. Weiss, C.A. Kulikowski, Computer Systems That Learn, Morgan Kaufmann, San Mateo, CA, 1991.
[43] Y. Okayama, A primitive study of a fire detection method controlled by artificial neural net, Fire Safety Journal 17 (6) (1991) 535–553.
[44] H. Ishii, T. Ono, Y. Yamauchi, S. Ohtani, Fire detection system by multi-layered neural network with delay circuit, in: Fire Safety Science – Proceedings of the Fourth International Symposium, Ottawa, Ontario, Canada, July 13–17, 1994, pp. 761–772.
[45] J.A. Milke, T.J. McAvoy, Analysis of signature patterns for discriminating fire detection with multiple sensors, Fire Technology 31 (2) (1995) 120–136.
[46] G. Pfister, Multisensor/multicriteria fire detection: a new trend rapidly becomes state of art, Fire Technology 33 (2) (1997) 115–139.
[47] Y. Chen, S. Sathyamoorthy, M.A. Serio, New fire detection system using FT-IR spectroscopy and artificial neural network, in: NISTIR 6242, NIST Annual Conference on Fire Research, Gaithersburg, MD, 1982.
[48] E.W.M. Lee, P.C. Lau, K.K.Y. Yuen, Application of artificial neural network to building compartment design for fire safety, in: Proceedings of the 7th International Conference on Intelligent Data Engineering and Automated Learning (IDEAL 2006), Burgos, Spain, September, 2006, pp. 265–274.
[49] E.W.M. Lee, Y.Y. Lee, C.P. Lim, C.Y. Tang, Application of noisy data classification technique to determine the occurrence of flashover in compartment fires, Advanced Engineering Informatics 20 (2) (2006) 213–222.
[50] E.W.M. Lee, R.K.K. Yuen, S.M. Lo, K.C. Lam, G.H. Yeoh, A novel artificial neural network fire model for prediction of thermal interface location in single compartment fire, Fire Safety Journal 39 (1) (2004) 67–87.
[51] K.D. Steckler, J.D. Quintiere, W.J. Rinkinen, Flow Induced by Fire in a Compartment, NBSIR 82-2520, National Bureau of Standards, Washington, DC, 1982.
[52] M.A. Kraaijveld, A Parzen classifier with an improved robustness against deviations between training and test data, Pattern Recognition Letters 17 (1996) 679–689.
[53] C.P. Lim, R.F. Harrison, An incremental adaptive network for on-line supervised learning and probability estimation, Neural Networks 10 (5) (1997) 925–939.
