
Neural Information Processing - Letters and Reviews, Vol. 4, No. 3, September 2004

LETTER

Practical Implementation of Back-Propagation Networks in a Low-Cost PC Cluster
Cheng-Chang Jeng
Center of Computer and Communication, National Taitung University
No.684, Sec. 1, Zhonghua Rd., Taitung City, Taitung County 950, Taiwan
E-mail: cjeng@cc.nttu.edu.tw

I-Ching Yang
Department of Nature Science Education, National Taitung University
No.684, Sec. 1, Zhonghua Rd., Taitung City, Taitung County 950, Taiwan
E-mail: icyang@cc.nttu.edu.tw

(Submitted on June 15, 2004)

Abstract − The computational nature of feed-forward networks is parallel and distributed. In recent years, low-cost PCs have been grouped into flexible and affordable computer clusters for distributed computation. For these reasons, we implement a parallelized back-propagation network (BPN) algorithm in an experimental Windows-based computer cluster of three locally Ethernet/Fast Ethernet networked PIII 500 MHz PCs, and we try to exploit the computational power of this distributed environment. The experimental results show that the load of network traffic may cancel out the benefits gained from a PC cluster. We therefore suggest an algorithm that reduces the communication load among the PCs while still keeping the data consistent. However, unlike the completion time of BPNs with the standard gradient descent algorithm executed on a single computer, which is linear, the completion time of BPNs with the standard gradient descent algorithm run in a PC cluster appears to be nonlinear. This phenomenon is discussed in the conclusion and is suggested as a topic for future research.

Keywords − Computer Cluster, Artificial Neural Network, Gradient Descent Algorithm, Message Passing Interface, Distributed Computing.

1. Introduction
The bottleneck of parallel implementations of artificial neural networks (ANNs) lies largely in how effectively the communication among the neurons of an ANN is implemented. Most research on parallel ANN implementations assumes environments with internal hardware bus architectures or high-performance networks, but optimal hardware platforms with internal high-speed buses or high-performance networks are not always available. In university computer labs, Windows-based personal computers (PCs) interconnected via Ethernet/Fast Ethernet are the most available and affordable machines, and they have the potential for parallel or distributed computing. Therefore, we discuss only the implementation of parallel computing in an environment of Ethernet/Fast Ethernet networked Windows-based PCs, in order to reflect the most common setting of computer clusters.
Although the architecture of ANNs inherits the nature of parallel or distributed computation, not every ANN architecture is suitable for implementation on Ethernet/Fast Ethernet networked PCs. Alspector and Lippe (1996) indicate that feed-forward networks with a parallel weight perturbative gradient descent algorithm perform better than back-propagation networks (BPNs) with the standard gradient descent algorithm. Other researchers who adopt alternative algorithms instead of the standard gradient descent algorithm also find that BPNs gain little from a parallel computing environment (Baba, 1989; Seiffert, 2001).

The latency of a BPN stems from the standard gradient descent algorithm. During a learning cycle of a BPN, the errors between an output vector and a target vector have to be "propagated back" through the entire network, which causes many communications among neurons. This restriction limits the suitability of implementing a BPN in a distributed computing environment of Ethernet/Fast Ethernet interconnected PCs, unless the PCs are connected by an optical fiber network, such as the environment described in Malluhi, Bayoumi, and Rao's research (1994), to reduce the delay caused by transmitting data over the network.

2. Windows-Based PC Cluster
When PCs interconnected by Ethernet/Fast Ethernet are used for distributed computing, the networked environment is called a PC cluster (Sterling, Salmon, Becker & Savarese, 1999). To implement parallel programming in a PC cluster, the major issue is the distribution of information among the PCs. It is therefore necessary to choose a programming interface that helps the programmer distribute messages among the PCs. For a Windows-based PC cluster, several programming interfaces or software packages are available. The Message Passing Interface (MPI), the standard for message-passing programming libraries, provides portable function calls to C, C++, and FORTRAN programmers. Since it is designed to be open source and to be used with homogeneous computer clusters, Windows-based MPI packages can be downloaded for free from the Internet. MPICH, a portable implementation of MPI designed to work on multiple platforms, is available from the Mathematics and Computer Science Division (MCS) (2003). MP-MPICH is a modification and extension of MCS's work; this alternative to MPICH also includes an enhanced version called NT-MPICH that improves the portability and performance of message passing in the Windows NT/2000 environment (Bemmerl, 2003). NT-MPICH is easy to install and offers good message-passing performance, so we chose NT-MPICH, developed by Bemmerl, to carry out our experiments.
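As a point of reference, the following minimal C sketch shows the basic structure of an MPI program of the kind that runs on each node of such a cluster; it is a generic illustration and not the original experimental code.

/* Minimal illustrative MPI program: each node reports its rank.
   Generic MPI sketch, not taken from the original implementation. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);                /* start the MPI runtime on this node   */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this node's id: 0 .. size-1          */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* number of processes (PCs) in the run */

    printf("Node %d of %d is ready\n", rank, size);

    MPI_Finalize();                        /* shut down the MPI runtime            */
    return 0;
}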
In this study, a simple PC cluster running Windows 2000, with only three PIII 500 MHz PCs interconnected via Fast Ethernet, is established. NT-MPICH is installed on each node (PC). Since the environment is protected by a simple firewall, outside network traffic is limited so that message passing among the nodes is not disturbed. As shown in Figure 1, there are three nodes in the computer cluster. We assume that the computational load on each node is equal before the experiments are carried out. Note that each node has only one processor (CPU), i.e., only one processor per node is used.

Figure 1. The Experimental Environment of a PC Cluster. (Three PCs, each running NT-MPICH, are connected through a firewall and a router to the Internet.)

3. Implementation of BPN in a Windows-based PC Cluster


One of the key features of a BPN is that the neurons' outputs are fed forward to the next layer. In a BPN, the output of the jth neuron at the nth layer, A^n_j, is

A^n_j = f(net^n_j)                                                               (1)

and

net^n_j = \sum_i W_{ij} A^{n-1}_i - \theta_j                                     (2)

is the active function, where \theta_j is the threshold of the jth neuron and W_{ij} are the network weights. The function f is a transform function, which maps the active function to the output of the neuron. From equation (2) it is obvious that each neuron in a hidden layer can be computed in parallel, and the same holds for each neuron in the output layer. For example, as shown in Figure 2, neuron h1 needs the outputs of the neurons in the input layer but not of the other neurons in the same hidden layer, and neuron o2 needs the outputs of the hidden layer but not of the other neurons in the output layer.
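As an illustration, the following C sketch computes equations (1) and (2) for a single neuron. The sigmoid transfer function is an assumption, consistent with the derivative terms of the form Y(1 − Y) in equations (4) and (5) below; the function and array names are illustrative only.

/* Illustrative sketch of equations (1) and (2) for one neuron j.
   W_j holds the weights W_ij from the previous layer, A_prev the previous-layer
   outputs A_i^{n-1}, and theta_j the threshold; a sigmoid is assumed for f. */
#include <math.h>

double neuron_output(const double *W_j, const double *A_prev,
                     int n_prev, double theta_j)
{
    double net = -theta_j;              /* net_j = sum_i W_ij * A_i^{n-1} - theta_j */
    for (int i = 0; i < n_prev; ++i)
        net += W_j[i] * A_prev[i];
    return 1.0 / (1.0 + exp(-net));     /* A_j = f(net_j), sigmoid transfer */
}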
Therefore, the parallelization of a feed-forward network is quite straightforward. As shown in Figure 2, suppose that we have only two PCs and a BPN with 4 input neurons, 2 hidden neurons, and 3 output neurons. We may assign processes to PCs using the following equation,

A_i = \lfloor h / p \rfloor, \; i = 0, \ldots, p-1, \quad \text{and} \quad A_i = A_i + 1 \;\text{if}\; i < (h \bmod p),      (3)

where A_i stands for the number of hidden-layer neurons assigned to PC i, h is the number of hidden neurons, and p is the number of PCs. In equation (3), h has to be replaced with o, the number of output-layer neurons, in order to calculate the number of output-layer neurons assigned to PC i.
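A minimal C sketch of equation (3) follows; the function name is illustrative.

/* Illustrative sketch of equation (3): the number of hidden (or output)
   neurons assigned to PC i, given layer size h and p PCs, i = 0..p-1. */
int assigned_neurons(int h, int p, int i)
{
    int a = h / p;          /* floor(h / p)                                   */
    if (i < h % p)          /* the first (h mod p) PCs each get one extra one */
        a += 1;
    return a;
}

For the network of Figure 2 with 2 hidden neurons, 3 output neurons, and p = 2 PCs, this assigns one hidden neuron to each PC, two output neurons to the first PC, and one output neuron to the second.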

Figure 2. A sample 3-layer BPN

Although the weights can be computed in parallel, the outputs of the neurons still need to be broadcast to the PCs. In our implementation, each PC stores the structure of the whole BPN. When equation (2) is carried out, the previous-layer outputs A^{n-1}_i have to be available on a PC; this is why the outputs of the neurons need to be broadcast to all PCs. However, another problem arises after the last layer produces its output vector. In the standard gradient descent algorithm, the error between the output vector of the output layer and the corresponding target vector is propagated back to the hidden layers, in order to modify the network weights and thresholds so that the network adapts to the problem domain. The following equations illustrate the back-propagation procedure. Let

\delta_o = Y_o (1 - Y_o)(T_o - Y_o)                                              (4)

and

\delta_h = H_h (1 - H_h) \sum_o W_{ho} \delta_o ,                                (5)

where Y_o is the output of the output layer and H_h is the output of the hidden layer. Let

\Delta W_{ho} = \eta \delta_o H_h, \quad \Delta\theta_o = -\eta \delta_o, \quad W_{ho} = W_{ho} + \Delta W_{ho}, \quad \theta_o = \theta_o + \Delta\theta_o,      (6)

and

\Delta W_{ih} = \eta \delta_h X_i, \quad \Delta\theta_h = -\eta \delta_h, \quad W_{ih} = W_{ih} + \Delta W_{ih}, \quad \theta_h = \theta_h + \Delta\theta_h,      (7)

where \eta is the learning rate of the BPN.
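For reference, a serial, single-PC sketch of the updates in equations (4)-(7) is given below in C; all names are illustrative. In the cluster version, the sum in equation (5) is the part that requires communication, which is addressed next.

/* Serial sketch of equations (4)-(7) for one training pattern.
   X: inputs, H: hidden outputs, Y: output-layer outputs, T: targets.
   Who[h][o] and Wih[i][h] are weight matrices, theta_* are thresholds,
   eta is the learning rate. All names are illustrative. */
void backprop_update(const double *X, const double *H,
                     const double *Y, const double *T,
                     double **Who, double *theta_o,
                     double **Wih, double *theta_h,
                     double *delta_o, double *delta_h,
                     int n_in, int n_hid, int n_out, double eta)
{
    /* Equation (4): output-layer deltas */
    for (int o = 0; o < n_out; ++o)
        delta_o[o] = Y[o] * (1.0 - Y[o]) * (T[o] - Y[o]);

    /* Equation (5): hidden-layer deltas; the sum over o is what the
       cluster version has to assemble across PCs (equations (8), (9)) */
    for (int h = 0; h < n_hid; ++h) {
        double s = 0.0;
        for (int o = 0; o < n_out; ++o)
            s += Who[h][o] * delta_o[o];
        delta_h[h] = H[h] * (1.0 - H[h]) * s;
    }

    /* Equation (6): hidden-to-output weights and output thresholds */
    for (int o = 0; o < n_out; ++o) {
        for (int h = 0; h < n_hid; ++h)
            Who[h][o] += eta * delta_o[o] * H[h];
        theta_o[o] += -eta * delta_o[o];
    }

    /* Equation (7): input-to-hidden weights and hidden thresholds */
    for (int h = 0; h < n_hid; ++h) {
        for (int i = 0; i < n_in; ++i)
            Wih[i][h] += eta * delta_h[h] * X[i];
        theta_h[h] += -eta * delta_h[h];
    }
}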


The computation of \sum_o W_{ho} \delta_o in equation (5) requires values of W_{ho} that a PC may not hold locally. For example, as shown in Figure 2, neuron h1 is assigned to PC1, and the weights between the input and hidden neurons are stored in PC1's local memory, but the weights W_{ho} between the hidden and output neurons are not all in PC1's local memory. Unless the weights W_{ho} are broadcast to all PCs, equation (5) cannot be executed in parallel. Broadcasting W_{ho}, however, would cause a large amount of network communication and ruin the benefits of cluster computing. To reduce the network traffic caused by broadcasting all W_{ho}, we suggest the following algorithm to enhance the execution performance of BPNs in a PC cluster that uses MPI-based message passing:

S_{PC_i}(h_j) = \sum_q \left[ W_{h_j o^q_{PC_i}} \cdot \delta_{o^q_{PC_i}} \right]                  (8)

where
i = 0, ..., m-1, and m is the number of PCs;
j = 0, ..., n-1, and n is the number of neurons in the hidden layer;
o^q_{PC_i} stands for the qth output neuron assigned to PC_i;
\delta_{o^q_{PC_i}} stands for the \delta_o of the qth output neuron assigned to PC_i.

And finally, all PCs will obtain the values of \sum_o W_{ho} \delta_o by applying the MPI command

MPI_Allreduce( S_{PC_i}(h_j), MPI_SUM ).                                                            (9)

Note that equation (9) is an abbreviation in order to point out the most important parameters in our case.
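The following C sketch illustrates one plausible way to realize equations (8) and (9) with the full MPI_Allreduce signature; the buffer layout and variable names are assumptions for illustration, not the authors' original code.

/* Illustrative realization of equations (8) and (9): each PC forms its
   partial sums S_PCi(h_j) over the output neurons it owns, then an
   element-wise MPI_SUM reduction gives every PC the full sum_o Who * delta_o. */
#include <mpi.h>

void allreduce_hidden_sums(double **Who_local,      /* Who for the output neurons owned here */
                           const double *delta_loc, /* delta_o of those output neurons       */
                           int n_local_out,         /* output neurons assigned to this PC    */
                           int n_hid,               /* hidden neurons in the network         */
                           double *S_partial,       /* scratch buffer, length n_hid          */
                           double *S_full)          /* result, identical on every PC         */
{
    /* Equation (8): local partial sum for each hidden neuron h_j */
    for (int j = 0; j < n_hid; ++j) {
        S_partial[j] = 0.0;
        for (int q = 0; q < n_local_out; ++q)
            S_partial[j] += Who_local[j][q] * delta_loc[q];
    }

    /* Equation (9): sum the partial vectors element-wise across all PCs */
    MPI_Allreduce(S_partial, S_full, n_hid, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
}

After the reduction, every PC holds the complete sums needed in equation (5) without the weights W_{ho} themselves ever being broadcast, which is the source of the traffic savings.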

4. Results
The experimental environment has been described in the previous section. The first experiment, Exp1, is carried out by a single computer. The second experiment, Exp2, adopts the standard gradient descent algorithm and runs in the PC cluster, but the weights W_{ho} and the outputs of the neurons are broadcast to the cluster in order to keep the data consistent. The last experiment, Exp3, runs in the same environment; the neurons' outputs are broadcast to all PCs, and the modified algorithm described in equations (8) and (9) is applied to keep the weights W_{ho} consistent across the PC cluster.

Figure 3. Results of the 3 experiments (completion time in seconds versus network topology).


In all three experiments, three variables, the number of samples, the number of input neurons, and the number of iterations, are set to 4, 3, and 1000, respectively. Two further variables, the number of hidden neurons and the number of output neurons, are kept equal to each other. For example, in Figure 3 the value 75 on the horizontal axis, Topology, indicates that a BPN with 3 input neurons, 75 hidden neurons, and 75 output neurons is used in the experiments.
The experiments show that neural networks with the standard gradient descent algorithm cannot benefit from the PC cluster. BPNs with the standard gradient descent algorithm in the PC cluster even run slower than BPNs on a single PC, and the completion-time curve of Exp2 is unexpected, while the time curve of the first experiment appears linear. The third experiment, Exp3, shows that BPNs with the modified gradient descent algorithm run faster than the BPNs in Exp1 once the topology reaches about 150 hidden and 150 output neurons. If the complexity of a BPN's topology is less than 3 input neurons, 150 hidden neurons, and 150 output neurons, the execution performance of BPNs with the modified gradient descent algorithm still cannot be improved by a PC cluster.

5. Conclusions
This study shows the practicality of implementing BPNs in an Ethernet/Fast Ethernet networked PC cluster, which is probably the most common environment in a university computer lab. The experiments confirm that the standard gradient descent algorithm cannot benefit from a PC cluster because the network traffic among the PCs ruins the performance, whereas the modified version of the standard gradient descent algorithm proposed in this study performs better than the standard one. Although a PC cluster may reduce the completion time of a system, the benefits of a PC cluster depend heavily on how the algorithms and the message passing interfaces are designed. Without reducing the load of network traffic, it is difficult to benefit from a PC cluster.
Unlike the completion time of the BPN run on a single PC, which is linear, BPNs with the standard gradient descent algorithm seem to have nonlinear completion time. Another unexpected phenomenon is also observed: the completion-time curve of Exp3 slopes steeply down when the network topology consists of about 130 hidden neurons and 130 output neurons. We assume that these phenomena are caused by how the firewall handles network traffic or how the MPI function calls are implemented; they should be investigated further in future research.

References
[1] J. Alspector and D. Lippe, "A study of parallel weight perturbative gradient descent," in Proc. of Advances in Neural Information Processing Systems (NIPS '96), pp. 803-810, Cambridge, MA: The MIT Press, 1996.
[2] N. Baba, "A new approach for finding the global minimum of error functions of neural networks," Neural Networks, vol. 2, pp. 367-373, 1989.
[3] U. Seiffert, "Multiple-Layer Perceptron training using genetic algorithms," in Proc. of the 9th European Symposium on Artificial Neural Networks (ESANN '01), Evere: D-Facto, pp. 159-164, 2001.
[4] T. L. Sterling, J. Salmon, D. J. Becker, and D. F. Savarese, How to Build a Beowulf: A Guide to the Implementation and Application of PC Clusters, Cambridge, MA: The MIT Press, 1999.
[5] Q. M. Malluhi, M. A. Bayoumi, and T. N. Rao, "An efficient mapping of multilayer perceptron with backpropagation ANNs on hypercubes," in Proc. of the Symposium on Parallel and Distributed Systems (SPDP '93), pp. 368-375, Los Alamitos: IEEE Computer Society Press, 1994.
[6] Mathematics and Computer Science Division, MPICH: A Portable Implementation of MPI. Available from: HTTP://www-unix.mcs.anl.gov/mpi/mpich/, 2003.
[7] T. Bemmerl, MP-MPICH: Multi-Platform MPICH. Available from: HTTP://www.lfbs.rwth-aachen.de/mp-mpich/, 2003.
