
Machine Vision and Applications (2009) 20:379–394

DOI 10.1007/s00138-008-0133-3
ORIGINAL PAPER
A hybrid system for embedded machine vision using FPGAs
and neural networks
Miguel S. Prieto · Alastair R. Allen
Received: 26 February 2007 / Accepted: 25 February 2008 / Published online: 27 March 2008
© Springer-Verlag 2008
Abstract This paper presents a hybrid model for embedded
machine vision combining programmable hardware for the
image processing tasks and a digital hardware implementation
of an artificial neural network for the pattern recognition
and classification tasks. A number of possible architectural
implementations are compared. A prototype development
system of the hybrid model has been created, and hardware
details and software tools are discussed. The applicability of
the hybrid design is demonstrated with the development of a
vision application: real-time detection and recognition
of road signs.
Keywords Embedded machine vision · FPGA · ANN · SOM
1 Introduction
The types of vision application which will form the basis for
mass implementation over the next few years are those requiring
the embedding of intelligent pattern recognition systems
into industrial processes, instrumentation and portable
systems. These systems will make extreme demands on the
hardware, requiring the integration of traditional image processing
(IP) with artificial intelligence (AI) techniques, and
must be implemented in designs with a small footprint and low
power consumption. In this kind of design, there is often
the need for filtering and pre-processing of the images,
preparing them for high-level analysis which frequently
involves the recognition or classification of certain features
which may be present in the image.
M. S. Prieto · A. R. Allen (✉)
School of Engineering and Physical Sciences,
University of Aberdeen,
Aberdeen AB24 3UE, Scotland, UK
e-mail: a.allen@abdn.ac.uk
This paper investigates possible architectures of a hybrid
system combining programmable hardware (for IP tasks)
with a neural array (for classification tasks). Additionally,
this paper presents a hybrid prototype system (HPS)
that facilitates the development and test of vision applica-
tions on the hybrid system. Designing a prototype of the
hybrid system has two main purposes. The first purpose is
to reduce the cost in time and effort of implementing an
embedded vision application. Following the methodology
described and using the suggested set of tools, the implementation
of an application is greatly simplified. The second
purpose of creating the HPS is that the specific requirements
of an application can be better studied through testing its
performance on the HPS. Once the tests have been analysed,
the choice of architecture to be used in the final implementation
of the hybrid system, for that particular application,
becomes clear. In this way, a new vision application would
first be implemented and tested on the HPS, and then a suitable
architecture would be chosen according to the results
obtained during the tests. This process is illustrated during the
development of the demonstrator application included in this
paper.
The next section presents an overview of previous work in
the fields related to this project. Section 3 studies the advantages
and disadvantages of several architectures of the hybrid
system, and Sect. 4 presents a prototype system for the development
of applications based on any of the hybrid system
architectures presented in Sect. 3. As a demonstrator of the
hybrid system, a road sign recognition (RSR) application has
been implemented, and this is outlined in Sect. 5. Finally,
Sect. 6 draws some conclusions and proposes some future
research directions of this work.
2 Background
The first part of this section reviews the work that has been
carried out towards the development of hardware systems
for the pre-processing of images. Section 2.2 introduces various
techniques of post-processing analysis of the images,
and concentrates on the application of artificial neural networks
(ANN) to IP, and the advances in the development
of hardware systems for ANN. To conclude this section,
Sect. 2.3 discusses some of the issues related to combining
these two technologies and gives a short introduction to the
hybrid model chosen as the main theme of this paper.
2.1 Pre-processing
Most machine vision systems use a combination of IP tech-
niques to perform a complete inspection. Amongst the more
common IP techniques are: thresholding, segmentation, measurement,
filtering, texture and colour analysis, and edge
detection. Some of these techniques require complex and
time-consuming calculations. The large amount of compu-
tation required by some techniques can easily exceed the
capacity of a conventional microprocessor, when the system
is expected to process images fed by a camera at real-time
speed.
In the past, the most common solution to this problem was
the use of application-specific integrated circuits (ASICs),
hardware designed specifically for each application. However,
this solution is not very practical, especially for low
volume applications, since ASICs are expensive to produce
in small quantities, require a great amount of experience in
microelectronics, and their development is a very time-con-
suming task.
Another common approach for real-time image process-
ing has been the use of parallel computing [59]. In IP, due to
the nature of its operations, it is often possible to partition the
image across several processors, and to process each parti-
tion in parallel. This mechanism requires additional commu-
nication between the processors to share the data, which can
be a problem for high-latency networks. As a result, many
multiprocessor systems have difficulties with real-time video
processing due to communication overheads. Additionally,
parallel computing is costly, and the lack of stability and
software support for the parallel machines can also be a dis-
advantage [7].
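The partitioning idea described above can be sketched in a few lines of Python. This is only an illustrative sketch and is not taken from any of the cited systems: the function names `box3x3` and `box3x3_strips`, the choice of a 3 × 3 mean filter, and the horizontal-strip layout are all assumptions. Each strip stands in for one processor, and the extra "halo" row borrowed from each neighbouring strip represents the data that would have to be communicated between processors.

```python
def box3x3(img):
    """3x3 mean filter on interior pixels; border pixels are copied unchanged."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            out[r][c] = sum(img[r + dr][c + dc]
                            for dr in (-1, 0, 1) for dc in (-1, 0, 1)) // 9
    return out


def box3x3_strips(img, n_strips):
    """Filter the image strip by strip (hypothetical stand-in for n processors).

    Each strip is extended by one halo row above and below, because a 3x3
    window at a strip edge needs pixels owned by the neighbouring strip.
    """
    h = len(img)
    size = -(-h // n_strips)  # ceiling division: rows per strip
    out = []
    for start in range(0, h, size):
        stop = min(start + size, h)
        lo, hi = max(0, start - 1), min(h, stop + 1)  # add halo rows
        filtered = box3x3(img[lo:hi])
        # Keep only the rows this strip owns; discard the halo rows.
        out.extend(filtered[start - lo:start - lo + (stop - start)])
    return out


image = [[(r * 7 + c * 13) % 256 for c in range(5)] for r in range(6)]
same = box3x3_strips(image, 3) == box3x3(image)
```

In this sketch the strips are processed in a loop, so no time is saved; the point is only that the partitioned result matches the unpartitioned one provided the halo rows are exchanged, which is the communication cost the text refers to.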
Fortunately, there have recently been great advances in
the development of programmable hardware, which substan-
tially reduces the time spent in the design phase of an appli-
cation.
Field programmable gate arrays. A field programmable gate
array (FPGA) consists of an array of logic and routing resources
that can be reprogrammed after it is manufactured. FPGAs
are generally slower than purpose-built hardware, and draw
more power. On the other hand, the advantages that they
offer are of great importance: a shorter time-to-market, lower
development costs, and the ability to reprogram the
device [12, 57].
The greatest problem of FPGAs used to be their very lim-
ited size. Complex operations, such as some IP algorithms,
required impractical designs implemented across multiple
devices, with the communication problems that this implied.
Even though by the early 1990s there were already some
FPGA-based IP systems [1, 27], it was not until the end of the
1990s, that the significant improvements in FPGA capacities
(and FPGA technology in general) made this technology a
serious candidate for hardware, and more applications started
to appear [17, 26]. Since then, FPGA technology has only
improved, with a consequent increase in the number of FPGA
implementations of IP. Various researchers have developed
FPGA-based systems for IP applications. Salcic et al. [53]
and Bouridane et al. [7] produced real-time co-processors
based on FPGA. Bouridane et al. also created a library of
common IP algorithms that can be called from the PC hosting
the co-processor. This original work evolved into a different
kind of library in [5], which is a collection of reusable hard-
ware skeletons of common IP operations, which can be com-
bined through high-level algorithmic descriptions to obtain
hardware solutions. Another processing system for machine
vision applications was developed by Dunn et al. [18]. An
important part of their work is the development of a high-level
language that could be compiled into hardware by automated
techniques. More details on these techniques follow in the
next section. Similar to these works, there is a growing num-
ber of other kinds of architectures based on FPGAs designed
for real-time IP [4, 34, 42].
Recent technological advances have provided new functionalities
for the FPGAs of later generations, such as
dynamic reconfiguration. Dynamic reconfiguration is the
ability to reconfigure a part or the whole of the FPGA during
execution time. Tanougast et al. [60] have introduced
methods to optimise the use of the FPGA, exploiting this
new functionality, and created a dynamically reconfigurable
embedded real-time system. Nevertheless, dynamic reconfiguration
is a very new area of research that still requires
some improvement to facilitate its use.
Despite the great advantages of the new functionalities of
FPGAs, such as dynamic reconfiguration, possibly the most
important area of research around FPGA technologies at the
moment is centred on the programming of the FPGAs. More
specifically, the development of high-level languages that can
be compiled directly into hardware by automatic techniques
can neutralize the main disadvantage of FPGAs, namely that their
programming model has until now been at gate
level. This issue is examined in more detail in the following
section.
Hardware compilation. The development and testing of
algorithms at gate level is a very time-consuming task and
requires a considerable expertise in microelectronics. Hard-
ware compilation [14, 57] is a recent approach to this problem
in which algorithms are first written in a high-level language,
and then compiled into hardware by automatic techniques.
Some high-level languages used in hardware compilation are:
the CCGL designed specifically for Xilinx FPGAs by Dunn
et al. [18], the SA-C developed by Draper et al. [16], and
Handel-C [10], which has been extended considerably more
than other languages. The familiarity of writing in a high-
level language, combined with the simple compilation and
simulation environment, makes hardware compilation very
suitable for the adaptation of IP algorithms into hardware
implementations [44, 57, 58].
Handel-C is a programming language that includes most
of the capabilities of conventional C. In addition, Handel-
C also includes some parallel constructs, which enable the
exploitation of the inherent parallelism of algorithms. In this
way, the implementation of an algorithm in hardware follows
three steps: translation of the C/C++ program into Handel-C,
parallelisation of the code, and adaptation of the code to the
hardware platform that it is going to be running on (by fixing
variable widths, high-level interfacing of devices, and so
on). After these steps, all that is needed is to use the Handel-C
hardware compiler to automatically generate the gate-level
hardware.
2.2 Post-processing
The computational complexity of recent machine vision
applications has demanded new AI techniques to be added
to the traditional methods of IP. Two common approaches
to analysing the data are from the point of view of statistics
or ANN. In general, there are many parallels between the
two fields, and there is a direct correspondence between many
of the techniques. An excellent summary of the similarities,
differences, and the ideal situations in which different tech-
niques should be used, can be found in the Neural Network
FAQ by Sarle [54].
This section is concerned mainly with the use of ANNs
in the context of IP, and the development of hardware implementations
of ANN. One of the reasons why ANNs are appealing
to this project is that they are biologically inspired
systems. Since one of the global aims of this project is an
attempt to produce a suitable system for the simulation of
cognitive models of the human mind, it seems appropriate to
use methods that are biologically inspired.
Concerning the use of ANNs in IP applications, there has
been a large amount of work to evaluate the performance of
different networks for each kind of IP task. A very compre-
hensive review of these studies can be found in [19]. In this
study, Egmont-Petersen et al. point out that the most frequently
applied architectures are: feed-forward ANNs, self-organizing
maps (SOM) and Hopfield ANNs. Of these
three, feed-forward ANNs seem to be less useful for some
very important IP tasks, and Hopfield ANNs are sometimes
difficult to apply to a particular problem and convergence to
a global optimum cannot be guaranteed [19].
The major problem of using ANNs in real-time IP is that
most of them are very costly computationally and therefore
unsuitable for real-time processing, unless implemented in
hardware. For this reason, considerable effort has gone into
producing suitable architectures for ANN [43]. Neural architectures
are commonly classified into the following main
categories [28, 39]: accelerator boards, multiprocessor sys-
tems, and digital, analogue and hybrid (digital and analogue)
neurochips. Heemskerk [28] mentions that the successive
order of these categories corresponds to increasing speed-up
but decreasing maturity of the techniques used. In general,
hybrid neurochips [6, 23] have only been used in laboratories,
while accelerator boards have been commercially available
for some time. At the moment, digital techniques seem to
offer the best possibilities for implementing flexible, general-purpose
neurocomputers.
There are many reviews of neural hardware in the litera-
ture. Some very complete reviews are [28, 31, 39]. In this last
work, Ienne et al. argue that the quest for generality has influenced
the design of many new processing elements, so that
new architectures tend to be less attached to a particular
algorithm or ANN, and aim for a more general design.
Other reviews can be found in [13, 55, 56]. An interesting
comment by Dias et al. in their recent work [13] is that many
neurochips are being removed from the market, and very
few new ones appearing. They suggest that this might be
because neurochips are expensive to build and there is still
little known about the real commercial prospects for working
implementations, but that the appearance of new hardware
solutions in the coming years may change the present state
of the ANN hardware market.
With these considerations in mind, in order to produce
a system with real hardware, the ANN chosen for the system
developed in this project is the SOM, for various reasons.
There seems to be a significant amount of research on appli-
cations of the SOM to IP and with apparently very positive
results [3, 8, 9, 19, 38]. Another reason is that there are avail-
able hardware implementations of the SOM (more details in
the next sections). The following section presents the basic
theory of the SOM and introduces the VindAX processor, a
digital implementation of this kind of network.
Self-organising maps. In 1982, Teuvo Kohonen presented a
new model of unsupervised competitive ANN called Self-
Organising Map (SOM) [24, 35]. The model consists of an
array of neurons, each of which is connected to all the inputs
and to some neighbourhood of the surrounding neurons.
This architecture resembles the way biological nets are
organised. In the brain, the number of connections within
each group is much greater than the connections to outside
of the group. Moreover, the physical proximity of two biological
neurons reflects some kind of similarity between the
impulses that activate them. In order to implement this fea-
ture, while classical competitive learning only updates the
weights of the winning neuron, the Kohonen learning algo-
rithm also extends the competition over spatial neighbour-
hoods. This extension is achieved by updating the weights
of the neurons within the proximity of the winner neuron,
thus allowing the formation of clusters of nodes within the
array. As in the biological system, the neurons grouped in
a cluster share some sort of similarity between the features
of the inputs that activate them. In this way, it is possible to
perceive the underlying structure of multidimensional data
by projecting it over the array of neurons and observing the
array auto-organise in clusters.
The SOM algorithm. In the SOM algorithm, the multidimensional
Euclidean input space $\mathbb{R}^n$ is mapped into a two-dimensional
output space $\mathbb{R}^2$. The reference vector of each
neuron in the network is $m_i = [\mu_{i1}, \mu_{i2}, \ldots, \mu_{in}] \in \mathbb{R}^n$,
where $\mu_{ij}$ are scalar weights, $i$ is the neuron index and $j$ the
weight index. An input vector $x = [\xi_1, \xi_2, \ldots, \xi_n] \in \mathbb{R}^n$
is presented to all neurons in the network, and the neuron
with the closest matching (i.e. greatest similarity) vector, $c$,
becomes the active neuron, i.e.
$$c = \arg\min_i \{\|x - m_i\|\}, \qquad (1)$$
which means the same as
$$\|x - m_c\| = \min_i \{\|x - m_i\|\},$$
where $\|x - m_i\|$ is the distance between the input vector $x$
and the reference vector $m_i$, using a similarity metric such
as the Euclidean distance.
During the training of the ANN, after the active neuron
has been identified, the reference vectors of the neurons in
the neighbourhood of the active neuron need to be updated
to bring them closer to the input vector $x$. The amount of
change is determined by the distance of the neuron from the
active neuron. The updating rule is:
$$m_i(t+1) = m_i(t) + \alpha(t)\,[x(t) - m_i(t)] \quad \text{if } i \in N_c(t)$$
and
$$m_i(t+1) = m_i(t) \quad \text{if } i \notin N_c(t),$$
where $N_c(t)$ is the current neighbourhood, and $t$ is discrete
time (i.e. $t = 0, 1, 2, \ldots$).
The training is considered successful if the neurons form
clusters of similar reference vectors. Each cluster of neurons
would correspond to a different class of the data. This clus-
tering is a product of the design of the network and occurs
without any supervision. Once the network is successfully
trained it can be used to classify new data. The new data
would be codified into a new input vector, which would then
be passed to the network. The neuron that becomes active by
Eq. 1 determines the class of the input data.
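The selection of the active neuron and the update of its neighbourhood can be illustrated with a minimal Python sketch. It is not the implementation used in this work: the function names, the square neighbourhood of fixed radius on a rows × cols grid, and the single constant gain `alpha` applied uniformly inside the neighbourhood are simplifying assumptions made only for illustration.

```python
import math


def distance(x, m):
    """Euclidean distance between input vector x and reference vector m."""
    return math.sqrt(sum((xj - mj) ** 2 for xj, mj in zip(x, m)))


def active_neuron(x, weights):
    """Eq. 1: index c of the neuron whose reference vector is closest to x."""
    return min(range(len(weights)), key=lambda i: distance(x, weights[i]))


def neighbourhood(c, cols, rows, radius):
    """Assumed N_c: neurons within `radius` grid steps of neuron c (square region)."""
    cr, cc = divmod(c, cols)  # row and column of the active neuron
    return {r * cols + k
            for r in range(max(0, cr - radius), min(rows, cr + radius + 1))
            for k in range(max(0, cc - radius), min(cols, cc + radius + 1))}


def train_step(x, weights, cols, rows, alpha, radius):
    """Move reference vectors inside N_c towards x; all others stay unchanged."""
    c = active_neuron(x, weights)
    for i in neighbourhood(c, cols, rows, radius):
        weights[i] = [mj + alpha * (xj - mj) for xj, mj in zip(x, weights[i])]
    return c


# A tiny 2 x 2 map with two-element reference vectors (made-up values).
som = [[0.0, 0.0], [10.0, 10.0], [20.0, 20.0], [30.0, 30.0]]
winner = train_step([9.0, 9.0], som, cols=2, rows=2, alpha=0.5, radius=0)
```

After training, classification of a new vector only requires `active_neuron`; the update step is not needed.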
Hardware implementations of the SOM. Over the years,
there have been many hardware implementations of the SOM.
In most cases, the SOM algorithm is implemented on a neu-
ral chip designed to support different kinds of networks. An
example of this is the work by Ienne et al. [32] where they
used a chip with a Single Instruction stream Multiple Data
stream (SIMD) architecture called MANTRA I [61] to implement
the SOM algorithm. The authors suggested a few modifications
to the algorithm to improve its parallelisation and
its implementation in hardware.
More recent examples of SOM implementation on neural
hardware can be found in [6, 21] and [48]. The last work by
Porrmann et al. presents a dynamically reconfigurable hardware
accelerator based on FPGA technology for the simulation
of SOM. The system, equipped with 5 FPGA modules,
achieved a respectable maximum performance of more than
50 GCPS (giga connections per second) during recall and 5
GCUPS (giga connection updates per second) during learn-
ing. Other publications from the same research group giving
more details of the system and results can be found in [49]
and [52].
Another fully digital hardware implementation of the
SOM algorithm is the Modular Map [40, 41]. It is composed
of a neural array and a module controller that provides the
interface between the host and the array which is a SIMD
array of processors configured to provide a highly parallel
processing system. Each neuron of the array is implemented
separately as a simple RISC processor, and they interact with
each other creating a network of 16 × 16 neurons with the
topology equivalent to that of a SOM. The design adopts a
modular strategy, which permits the neural array to work as
a fully functional self-contained network or as a part of a
bigger network by interconnecting modules. The commer-
cialised version of the hardware implementation of the Mod-
ular Map design is known as the VindAX processor, and is
developed by AXEON Ltd [2]. The manufacturing company
provides a PCI development board, which contains one Vin-
dAX processor and a software package to run on a PC for
the development of applications.
Through the VindAX Development Board, the VindAX processor
can be accessed as one 16 × 16 network or as various
partitionings of the neural array. The partitionings can either
be in the dimension of the neural map (e.g. 4 networks of
4 × 4 neurons with reference vectors of 16 elements) or in
the dimension of the reference vectors (e.g. 2 networks of
16 × 16 neurons with reference vectors of 8 elements). A
Register Transfer Level synthesisable VHDL description of
the VindAX processor is available for Intellectual Property
ware applications [29]. In the case of embedded IP applica-
tions, the Modular Map soft core could be included as part
of the hybrid system designed in this project, and optimised
for the use of each particular application.
In the VindAX processor implementation of the SOM
algorithm, each vector element $\xi_j$ covers the range (0, 255) and $0 < n \le 16$
in the multidimensional Euclidean input space $\mathbb{R}^n$, which
means that the vectors have up to 16 elements with values up to
255. Regarding Eq. 1, a variety of distance metrics can be
used as a measure of similarity. Since the equation is imple-
mented in hardware, the Manhattan distance metric has been
found to be a valid alternative to the more widely used Euclid-
ean distance [40, 51]. The Manhattan distance is less expen-
sive in terms of computational resources.
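The difference in cost between the two metrics can be made concrete with a short Python sketch. The function names and example vectors are illustrative assumptions, not part of the VindAX design. The Manhattan distance needs only subtractions, absolute values and additions, whereas the Euclidean distance also needs a multiplication per element; the squared Euclidean distance is used here because it selects the same winner as the Euclidean distance without a square root.

```python
def manhattan(x, m):
    """Sum of absolute differences (no multiplications required)."""
    return sum(abs(a - b) for a, b in zip(x, m))


def euclidean_squared(x, m):
    """Squared Euclidean distance; same argmin as the Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(x, m))


def winner(x, weights, metric):
    """Index of the reference vector closest to x under the given metric."""
    return min(range(len(weights)), key=lambda i: metric(x, weights[i]))


# Worst case for 16-element vectors with 8-bit elements (values 0..255).
worst_manhattan = manhattan([0] * 16, [255] * 16)          # 16 * 255
worst_euclid_sq = euclidean_squared([0] * 16, [255] * 16)  # 16 * 255**2
bits_manhattan = worst_manhattan.bit_length()  # accumulator width needed
bits_euclid_sq = worst_euclid_sq.bit_length()

# The two metrics are not guaranteed to select the same neuron.
refs = [[3, 3], [0, 5]]
by_manhattan = winner([0, 0], refs, manhattan)
by_euclidean = winner([0, 0], refs, euclidean_squared)
```

For this worst case the Manhattan accumulator fits in 12 bits, while the squared Euclidean accumulator needs 20 bits. The last two lines show that the metrics can occasionally disagree on the winner, which is why the Manhattan distance is described as an alternative rather than an exact equivalent.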
The hardware implementation of the SOM used in the
present work is the VindAX. Some of the advantages that are
directly relevant are: the availability as an Intellectual Prop-
erty core, the ability to partition the neural array into various
sub-maps, and the user-friendly development system.
2.3 Combination of the pre-processing and post-processing
Because of the requirements for increasingly complex vision
processing, there is a clear motivation towards creating systems
that allow the merging of the two aspects of image analysis
presented in the previous sections: the pre-processing and
the post-processing. The pre-processing of the images com-
prises all IP tasks which deal with the image at a very low-
level, whereas the post-processing tasks include the
high-level analysis of information with AI techniques, such
as ANN.
There appears to have been relatively little research inves-
ted in the development of systems combining traditional IP
techniques with ANN. One company that seems active in this
line of research is General Vision [25]. Their main product
is an image recognition engine named CogniSight. CogniSight
includes a Xilinx Virtex FPGA of 50K gates together
with a parallel neural network based on the Zero Instruction
Set Chip (ZISC) [20] from IBM. This system has been
designed to analyse the colour, shape and texture of visual
objects, learn these signatures with the ANN, and then recog-
nize identical or similar objects to produce a response. One of
the main problems of the CogniSight engine is that the kind
of pre-processing is limited to a pre-dened set of operations,
which are not very complex in nature, due to the limited size
of the FPGA. The usage of the system has been greatly simplified
so as to reduce the technical knowledge required to
design the pre-processing of the images or the training of the
ANN. Even though this may be useful for a limited number
of applications, the design is in general too rigid for a broader
use in machine vision applications.
Fig. 1 FPGA used in both stages of the hybrid system: pre-processing
and classification
This paper proposes a hybrid system consisting of a num-
ber of image processing (IP) stages followed by a neural
classifier. The pre-processing stages of the system would be
implemented on programmable hardware, in the form of an
FPGA, and the post-processing tasks of classification and
pattern recognition would be performed by a digital imple-
mentation of the SOM, namely the VindAX processor. More
details regarding this hybrid system can be found in Sect. 3.
In order to facilitate the application development process
for the hybrid system being designed, and following the lines
of Benkrid et al. in [5], a library of IP algorithms has been
developed. The main idea is that this library would provide a
set of common IP algorithms that would avoid a great amount
of work in the development of applications on the hybrid
system. In contrast with the work by Benkrid, the algorithms
contained in the library have been created by using hardware
compilation techniques in order to facilitate their develop-
ment and to increase the re-usability of their code.
3 Possible architectures of the hybrid system
This section discusses three possible architectures of a hybrid
system consisting of a number of IP stages followed by a neural
classifier. The IP stages are performed by an FPGA, and
the neural classifier corresponds to an implementation of the
SOM algorithm either in the FPGA or in dedicated hardware
(the VindAX processor). The advantages and disadvantages
of each architecture are analysed in detail.
3.1 FPGA standalone
In the simplest case, the FPGA would perform both stages
of the hybrid system, i.e. the pre-processing and the classification
(see Fig. 1).
In this model, the FPGA would contain all the necessary
algorithms for the pre-processing of the images, as well as
a full implementation of a SOM. It is clear that because of
its simplicity this design is ideal for embedded applications.
The problem with this method is that having the full imple-
mentation of a SOM on the chip would require the use of
a very large FPGA, or otherwise very little space for the IP
algorithms would be left available. The Intellectual Prop-
erty core of the VindAX processor could be used for the full
implementation of a SOM in this architecture.
Advantages of this architecture:
• The interconnection and communication between the
  pre-processing and the SOM is well established, does
  not need to be analysed in particular for each new application,
  and is likely to be fast.
• Having the solution as just an FPGA makes it ideal for
  embedded applications, i.e. a single chip means lower
  power consumption and easier integration within embedded
  systems.
• Including the full implementation of the SOM allows
  the system to have on-line learning capabilities, making
  it capable of adapting to new situations during the
  operation of the device.
Disadvantages of this architecture:
• A full implementation of the SOM requires many resources,
  limiting the maximum size of the neural
  array and the complexity of the pre-processing of the
  images.
• In applications with very intensive use of the neural array
  the speed of the system at processing vectors might not
  be fast enough for the speed requirements of a real-time
  application.
3.2 FPGA standalone with off-line learning
One of the problems with an implementation on a stand-
alone FPGA, as mentioned in Sect. 3.1, is that a full SOM
implementation may use too many resources on the chip,
limiting the complexity of the pre-processing of the images.
Figure 2 shows a solution to the problem, in which the working
mode of the FPGA is differentiated from the learning
mode.
More specifically, in applications where the learning of
new vectors does not need to be done in real-time, which
is the case for many applications, the implementation of the
SOM in working mode would only include those resources
needed to find the active neuron. This process requires very
simple processing (see Sect. 2.2), which means that very high
classification speeds can be reached using this model, while
using fewer resources than the full
implementation.
Fig. 2 The two modes of the FPGA used for applications in which the
learning of new vectors by the SOM is performed off-line
When the application requires the learning of new vectors,
the FPGA could be set to learning mode, and a full implementation
of the SOM would be included in the chip. In the case
that there were not enough resources for the pre-processing
and the full SOM in the same chip, then the pre-processing
could be performed rst, storing the obtained vectors for the
training of the network, followed by the reprogramming of
the FPGA with a full SOM implementation (without the pre-
processing of the images) which could then be trained with
the vectors stored in the previous step.
This model solves the problem of resources to a certain
degree. The classification stage of the SOM would still
require a large area of the FPGA, unless only a subset of
neurons was implemented, instead of the full neural array.
In this case, the neural array would have to be partitioned,
and the new vector would have to be classified with each
partition in order. This method would reduce the speed of
the classification by a factor proportional to the number of
partitions. In cases where the classification speed is critical,
real-time performance will not be achievable
with this design.
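The partitioned classification described above can be sketched in Python as follows. This is a software illustration under stated assumptions, not the FPGA design itself: the function names are hypothetical, the Manhattan distance is assumed as the metric, and the sequential loop over partitions stands in for reusing the same small block of hardware neurons once per partition.

```python
def manhattan(x, m):
    """Sum of absolute differences between x and reference vector m."""
    return sum(abs(a - b) for a, b in zip(x, m))


def classify_partition(x, partition, offset):
    """Best (distance, global neuron index) within one partition."""
    best = min(range(len(partition)), key=lambda i: manhattan(x, partition[i]))
    return manhattan(x, partition[best]), offset + best


def classify_partitioned(x, weights, n_partitions):
    """Classify x by visiting each partition in order and keeping the best match.

    Only one partition is evaluated at a time, so the number of passes (and
    hence the classification time) grows with n_partitions.
    """
    size = -(-len(weights) // n_partitions)  # ceiling division
    best = None
    for p in range(n_partitions):
        part = weights[p * size:(p + 1) * size]
        if not part:
            continue
        candidate = classify_partition(x, part, p * size)
        if best is None or candidate < best:
            best = candidate
    return best[1]


# Eight one-element reference vectors (made-up values).
refs = [[10], [3], [7], [1], [8], [2], [9], [4]]
whole = classify_partitioned([2], refs, 1)
in_four = classify_partitioned([2], refs, 4)
```

The result is the same whether the array is evaluated whole or in four passes; what changes is the number of passes needed per input vector.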
Advantages of this architecture:
• The interconnection and communication between the
  pre-processing and the SOM is well established and fast.
• Solution integrated in a single FPGA, ideal for embedded
  applications.
• By limiting the on-line capabilities of the chip to only
  classification, a large amount of resources are freed for
  the implementation of more complex pre-processing
  algorithms compared to the design in Sect. 3.1.
Disadvantages of this architecture:
• Due to the time taken switching between the Working
  Mode and the Learning Mode, on-line learning is greatly
  restricted, and in many cases it might prove necessary to
  come off-line during the time that it takes for the learning
  of new vectors.
• The implementation of only the classification step of
  the SOM does free a large amount of resources for the
  pre-processing, although this might not be enough for
  applications with a heavy load of pre-processing.
• In applications with very intensive use of the neural
  array, the speed of the system at processing vectors might
  still not be fast enough for the speed requirements of a
  real-time application.
Fig. 3 FPGA and VindAX processor working in parallel
3.3 FPGA and dedicated ANN processor
The third solution proposed in this project combines the
advantages of using FPGAs for the pre-processing of the
images, with the computational power of dedicated hardware
(the VindAX processor) for the classification. By using
dedicated hardware, the classification stage can meet real-time
requirements even when the network is continuously
adapting its reference vectors (on-line learning). Further-
more, by dedicating the FPGA solely to the pre-processing
of the images, a large amount of resources are freed which
can in turn be used to implement more complex, and faster,
IP algorithms.
Figure 3 shows the design of the hybrid system combining
the two chips, the FPGA and the VindAX processor, to work
in parallel.
The simplest mode of interaction between the two chips
is that of a pipeline in which there is only one stage of pre-processing
in the FPGA followed by one stage of classification
in the VindAX processor. A more complex interaction
between the two chips can be found in the Road Sign Recog-
nition demonstrator application, in Sect. 5. In this last exam-
ple, there is a continuous dialogue between the two chips
analysing the data at different levels, moving down a classification
tree.
The design shown in Fig. 3 offers the flexibility of creating
any sort of interaction between the two chips. Another
example of how the system is useful in embedded applications
could be to address one of the problems of the architecture
presented in Sect. 3.2. One of the disadvantages of that
architecture was that the system had to be stopped in order
to adapt it to new vectors and therefore on-line learning was
not possible. This problem could be solved by connecting the
system as in Fig. 2, with the difference that the learning mode
of the FPGA would be substituted by the VindAX processor
working in parallel with the FPGA. The design is illustrated
in Fig. 4.
Fig. 4 An FPGA containing a SOM classifier and the VindAX processor
learning in parallel
During the normal operation of the system presented in
Fig. 4, the FPGA would perform both the pre-processing and
the classification of the input images. At certain intervals,
whenever the network needed to be adapted (on-line learning),
the same vectors that were being classified by the SOM
classifier inside the FPGA would also be passed (in parallel) to
the VindAX processor. The VindAX processor would then
use these vectors to re-train the ANN and adapt the values
of the models stored in the individual neurons of the neural
array.
At the end of each adaptation interval, the new reference
vectors of the VindAX processor neural array would then
be transferred to the FPGA in order to update the values
of its SOM classifier. This adaptation could be useful for
increasing the accuracy of the classification. During the
periods when the network did not require on-line learning,
the VindAX processor could be deactivated in order
to decrease the power consumption of the device.
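The scheme of Fig. 4 can be sketched in software as two copies of the same neural array: a frozen copy that classifies (the role of the FPGA) and a learning copy that adapts (the role of the VindAX processor). The following is only an illustrative sketch. The class and function names, the one-dimensional neuron array, the Manhattan distance, and the learning rate and neighbourhood radius are all assumptions made for the example, not details taken from the hardware.

```python
def manhattan(a, b):
    """City-block distance between two vectors (an assumed metric)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def best_matching_unit(weights, vector):
    """Index of the neuron whose reference vector is closest to the input."""
    return min(range(len(weights)), key=lambda i: manhattan(weights[i], vector))


def adapt(weights, vector, rate=0.25, radius=1):
    """One SOM learning step on a 1-D neuron array (a simplification)."""
    bmu = best_matching_unit(weights, vector)
    first, last = max(0, bmu - radius), min(len(weights), bmu + radius + 1)
    for i in range(first, last):
        weights[i] = [w + rate * (v - w) for w, v in zip(weights[i], vector)]


class ParallelLearningSystem:
    """Frozen classifier copy (FPGA role) plus learning copy (VindAX role)."""

    def __init__(self, reference_vectors):
        self.classifier = [list(w) for w in reference_vectors]
        self.learner = [list(w) for w in reference_vectors]
        self.adapting = False

    def process(self, vector):
        # Classification always uses the frozen copy, so it never stalls.
        label = best_matching_unit(self.classifier, vector)
        if self.adapting:
            adapt(self.learner, vector)  # the same vector, passed "in parallel"
        return label

    def end_adaptation_interval(self):
        # Transfer the new reference vectors back to the classifier copy.
        self.classifier = [list(w) for w in self.learner]
        self.adapting = False
```

In this sketch the classifier copy only changes when `end_adaptation_interval` is called, which mirrors the transfer of reference vectors at the end of each adaptation interval.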
A very important factor to consider in
this kind of system is the choice of communication channel
between the two chips. If the speed of the channel is
too low, the channel is likely to become the bottleneck of
the design and even restrict its ability to meet the speed
requirements of real-time applications. If the two chips are
contained within the same board, this problem does not
arise since the latency between the two chips would be
insignificant. In many cases, however, a different approach
which is probably easier to implement could also be con-
sidered. In these cases, the communication between the
two chips could be performed by any of the following
channels:
1. PCI: A 64-bit PCI bus with a 133 MHz clock enables data
transfer speeds of up to 1 Gbyte/s (a 32-bit PCI bus
with a 33 MHz clock gives 132 Mbyte/s).
2. Ethernet 100: In theory, the communication speed of this
bus is 100 Mbit/s. However, this might not be achieved if
the data were passed through a computer with an operating
system that did not offer real-time capabilities. In
that case, as the data were passed up through the protocol
layers handling the communication between the two chips,
the operating system might not grant total CPU usage to the
operation, impeding the fluency of the transaction.
3. Parallel port: In Enhanced Parallel Port (EPP) mode,
data transfer takes place as a single software instruction,
and the rest of the transfer is handled by hardware.
This allows an EPP port to function as a 16- or 32-bit
data transfer interface using 8-bit I/O hardware, in effect
enabling EPP peripherals to achieve the same speed and
efficiency as their ISA bus counterparts: 8 Mbit/s.
4. Serial port: The typical data transfer speed of the serial
port is 115,200 bit/s, although other speeds can also be
used.
In any case, a detailed study of the communication chan-
nel to be used is necessary in order to prevent the channel
from becoming the bottleneck of the design.
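A first rough check of whether a channel could become the bottleneck is to compare the raw transfer time of the data with the time available per frame. The sketch below does this for a 16-byte feature vector (the size quoted for the VindAX processor in Sect. 5.1) at 25 frames per second. It is deliberately simplistic: the function names and the number of vectors per frame are assumptions for illustration, the PCI figure is computed from bus width times clock rate, and latency, protocol overhead and operating system delays are all ignored.

```python
# Nominal raw bit rates in bit/s; real throughput will be lower.
CHANNEL_BIT_RATES = {
    "PCI (32-bit, 33 MHz)": 32 * 33e6,   # bus width x clock
    "Ethernet 100": 100e6,
    "Parallel port (EPP)": 8e6,
    "Serial port": 115_200,
}


def transfer_time_ms(payload_bytes, bit_rate):
    """Raw time in milliseconds to move a payload at a given bit rate."""
    return 1000.0 * payload_bytes * 8 / bit_rate


def frame_budget_fraction(payload_bytes, frames_per_second, vectors_per_frame):
    """Fraction of each frame period spent on raw data transfer, per channel."""
    frame_ms = 1000.0 / frames_per_second
    return {
        name: vectors_per_frame * transfer_time_ms(payload_bytes, rate) / frame_ms
        for name, rate in CHANNEL_BIT_RATES.items()
    }


if __name__ == "__main__":
    # Hypothetical load: ten 16-byte vectors per frame at 25 frames per second.
    for name, fraction in frame_budget_fraction(16, 25, 10).items():
        print(f"{name:22s} {100 * fraction:8.4f} % of the frame period")
```

A channel that already consumes a large fraction of the frame period in this idealised estimate is a clear candidate for closer study.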
Advantages of this architecture:
• The FPGA can be fully dedicated to the pre-processing
  of the images, allowing the implementation of much
  more complex IP algorithms than the designs presented
  in previous sections.
• The use of dedicated hardware for the classification and
  learning of new vectors ensures a performance compatible
  with real-time applications, as well as providing the
  ability to perform on-line learning.
Disadvantages of this architecture:
• Detailed analysis of the communication channel between
  the two chips is required in order to prevent it from
  becoming the bottleneck of the design.
• The implementation of the design becomes more complex,
  since it involves the management of two chips and
  the communication between them.
4 The hybrid prototype system
The purpose of the work described in this section is to pro-
vide a useful tool for the development of embedded machine
vision applications based on the hybrid model presented in
the previous section.
4.1 Introduction
Section 3 has introduced three possible designs for the
development of the hybrid system, each with its own advantages
and disadvantages. The differences between the implementations
of these three designs can be quite significant. For this
reason, an initial study of which architecture is the most
suitable for the application is necessary. The problem,
though, is that this decision is often far from trivial. The
difficulty lies in the fact that, especially in
the early stages of the development of the application, there
is great uncertainty about vital points such as the speed at
which the ANN will be required to work, what resources will
be needed for the implementation of the pre-processing of the
images, and what kinds of interaction between the two chips
will be involved.
The choice of a specific architecture for the system can
be a determining factor as to whether or not its performance
will be able to meet the specifications imposed by the
application. This is, however, not the only determining factor,
since many other implementation decisions have a great effect
on the performance. For example, there are on the one hand
critical decisions regarding the pre-processing: the algorithms
to use, the method of optimisation used on them (i.e. optimise
for speed or for silicon area), the method of vector extraction
for the ANN, etc. On the other hand, there are also
critical decisions regarding the neural classifier: the optimal
size of the network, the size of the reference vectors, the
number of maps used in the classification, and so on.
These decisions cannot be taken a priori; they should
be investigated as the application is being developed. The
problem with this method is that, since the application is going
to be based on hardware, any changes in the architecture or
in any of the decisions mentioned above are likely to be quite
time-consuming, and will probably affect all other areas
of the application as well.
For example, if the decision has been taken to base the
system on an architecture of the type FPGA standalone,
presented in Sect. 3.1, an increase in the size of the ANN is
likely to have an impact on the complexity and speed of the
algorithms that can be used in the pre-processing of the images.
The circularity of the problem is then evident, since this could
in turn affect the way in which the vectors are extracted from
the images, and therefore have some impact on the data to be
passed on to the ANN.
From the previous example, it is clear that, instead of
deciding the architecture of the system from the start, it is
preferable to first develop the application and then to
perform the necessary adjustments to the system so that it can
be implemented in one of the three architectures available.
For that reason, it would be useful to have tools to develop
the application in a way that allowed each part of the hybrid
system to be developed individually, while still making it
possible to abstract from the different parts of the
system and see its operation as a whole.
In conclusion, the main problem in the development of a
hybrid system is that the tools for the development of
applications based on such a system are limited. In general, it is
difficult to design each part of the system individually
(i.e. the IP or the ANN) and still be able to see the operation
of the system as a whole. At the same time, a methodology
for the design, development and testing of applications based on
the hybrid system architecture is also required.
The following sections discuss some available tools for
the development and test of applications based on a hybrid
system, and combine these tools into a holistic prototype sys-
tem able to support embedded machine vision applications
during their design phase, before they are ported to any of
the architectures introduced in Sect. 3.
4.2 Tools for the development of applications based on the
hybrid system
This section presents the two development systems consid-
ered for the development of applications based on the hybrid
system.
4.2.1 The RC200 development system
The RC200 development system is composed of the RC200
board and the Celoxica DK Design Suite, from Celoxica
Ltd. [11]. DK is a development environment for software-
compiled system design, which is a methodology for design-
ing electronic systems for programmable hardware. In other
words, DK facilitates the development of applications for
FPGAs.
In DK, the programs are first written in a programming
language called Handel-C [10], which has a similar structure
to C with the addition of certain commands oriented towards
a hardware implementation of the algorithms. The gate-level
description logic of the hardware can be obtained from the
Handel-C code using automated techniques called Hardware
Compilation [57].
The main advantages of software-compiled system design
relevant to this project are:
1. High-level system design using a C-based language
which can be directly compiled into optimised FPGA
logic.
2. Accurate simulator/debugger of the source code at
hardware level.
3. Area and delay profiling of source code.
Using a C-based language (Handel-C) makes it possible to
develop and test parts of the application first in software
(standard C) before adapting them to Handel-C. Once the code has
been transferred to Handel-C, it is easy to use the internal
simulator/debugger of DK to verify the functionality of the code
and to transfer it to the actual hardware. Finally, the area and
timing analysis of the code can be used to identify parts of the
algorithms which could be potential bottlenecks and to learn
how they could be optimised. The area and timing analysis
can also be useful to determine whether the pre-processing
of the images will be able to share the resources in the FPGA
with an implementation of the SOM in the same chip,
corresponding to the architectures described in Sects. 3.1 and
3.2.
The RC200 board is a standalone FPGA-based prototyping
board, which provides an ideal tool for the testing of
Handel-C code implemented with DK. The RC200 board
is particularly useful for the development of machine vision
applications, since it can be used as a standalone system
containing the main resources needed for this kind of
application, namely: a high-performance direct connection to the
video decoder and video output, a powerful Xilinx FPGA
capable of supporting complex IP operations, large memory
banks to store the images, and a diverse array of methods to
communicate with external devices.
More specifically, the main components of the RC200
board are:
• FPGA: Xilinx Virtex II XC2V1000 (1M system gates),
  720 Kbits SelectRAM Blocks and 160 Kbits distributed
  RAM.
• Memory: Two independent banks of 2Mb SRAM each,
  and a SmartMedia Socket.
• Video decoder: 24-bit RGB or YCrCb signal.
• Video output: DAC with a 24-bit colour map for a VGA
  monitor or projector.
• Communication: 10/100 Ethernet, Parallel Port, RS232
  port.
• Connectors: keyboard, mouse, touch-screen.
One of the advantages of using the RC200 board is that
there exists a Platform Abstraction Layer (PAL) for DK,
which makes driving the components of the board
much easier. The PAL provides the tools for programming
the various components of the board at a high
level of abstraction, hiding most of the hardware interfacing
details from the user. The RC200 also comes with a fully
developed simulator of the board which includes all of its
components. In this way, the behaviour of the board as a whole,
and not only of the FPGA, can be fully tested during the design
and development of the application without requiring the
actual hardware.
4.2.2 The VindAX development system
The VindAX development system (VDS) comprises a combination
of hardware and software for the development of
solutions to problems involving non-linear systems. At the
core of the VDS lies the VindAX processor, which is a fully
digital implementation of the SOM algorithm following the
Modular Map Design (see Sect. 2.2). The VindAX processor
is mounted on the VindAX Evolution 2 Board, which connects
to a PC through the PCI port.
The VDS provides all the necessary tools to interact with
the VindAX processor. These tools come in the form of steps,
where each step performs a specific function. These functions
range from the formatting of the input vectors to the displaying
of the reference vectors in 3D graphs. Even though the
standard steps are flexible enough to be sufficient for most
applications, the VDS also allows the incorporation of
user-created custom functions. Perhaps the most important
feature of the VDS is the ease with which various parameters
of the system (such as the size of the input vectors and of the
neural array) can be modified and the learning capabilities of
the new network specification tested. In conclusion, the VDS
provides all the necessary tools for the creation of complex
systems based on SOMs.
In relation to hybrid systems, the VDS provides many
useful tools for the creation of applications based on such
systems. For example, it is possible to take the algorithms
created for the pre-processing of the images and compile
them into a user-created custom step, which could then be
used in the VDS. In this way, the whole hybrid system could
first be simulated in software, before it was adapted
for hardware. To be precise, the step corresponding
to the VindAX processor would not be a simulation, since the
VDS uses the actual VindAX processor mounted on
a PCI card in the PC.
In summary, the main benefits of using the VDS relevant to
this project are:
1. Rapid application development.
2. Cost-effectiveness.
3. Easy integration of user-created code as custom function
steps.
4. Wide range of tools for investigating different
specifications of the neural array.
5. Shorter training times and increased reliability due to
training performed on actual hardware.
For these reasons, the VDS is an ideal tool for the
development and testing of the SOM stage of any application based
on the hybrid system developed in this project.
4.3 The prototype system
In the previous sections we have described the RC200
Development System and the VindAX Development System. A
prototype of a hybrid system combining these two development
systems has been developed: the hybrid prototype system
(HPS). The objective of the HPS is to provide the user
with all the required tools for the development of embedded
machine vision applications based on the hybrid system
presented in Sect. 3. This section presents how the two
development systems are combined to form the HPS, and how the
HPS can be used to develop embedded applications.
First of all, it is important to note that the HPS has been
designed to facilitate the development and testing of embedded
applications, although the applications running on it may
not necessarily achieve real-time performance. The HPS has
been designed, however, to provide the main framework on
which embedded solutions with real-time performance could
be based.
The idea is to provide the necessary tools for the design of
the application in such a way that possible shortcomings can be
detected at a very early stage and corrected, to make sure
that the final solution can deliver the required
performance. The HPS designed to this end is shown in Fig. 5.
In the HPS, the RC200 would process the images received
from a camera, and find the objects or features that required
further analysis. The RC200 would then extract vectors
of characteristics from these objects, and transfer them to the
VindAX processor through the PC. The VindAX processor
would return the class to which these objects belong
to the RC200, which would finally display the results on a
monitor connected to the board.
In the approach presented in Fig. 5, the VDS and the
RC200 development system have been connected to a PC.
This PC would be running the software of both systems: the
DK and the VDS software. Through these software packages,
the user would be able to design and implement the
embedded application in progressive stages. The tests would
first be performed in software, and finally in hardware. The
final hardware tests would be performed on the individual
hardware systems communicating through the PC.
This method would facilitate the monitoring of the traffic
between the two systems, allowing the user to easily identify
the source of any problems that appeared.
The complete life cycle in the development of an applica-
tion on the HPS would be:
1. Initial design of the application. Partitioning of the tasks
to be implemented on the FPGA and specification of the
kind of analysis to be performed by the SOM.
2. Implementation of the pre-processing algorithms of the
application using a standard tool such as Matlab. Extraction
of an initial set of vectors from a database of example
cases of the application.
3. Training of the neural array with the VDS using the vectors
obtained from the database of examples. Adjustment
of the parameters of the network in an effort to improve
the classification accuracy.
4. Initial testing of the system and analysis of the results.
If the result is satisfactory, proceed to step 5; otherwise
modify the algorithms used in the pre-processing and the
methods employed in the vector extraction, and go back
to step 2.
5. Implementation and testing of the IP algorithms in C.
6. Adaptation of the algorithms to Handel-C and compilation
into hardware using the DK software. Loading of
the netlist on the FPGA of the RC200 board and testing
of the algorithms.
7. Using the area and delay profiling of the code, improvement
of the key areas of the algorithm causing the greatest
delay, if the application is more concerned
with execution speed, or reduction of the silicon area
used by the code, if the concern is more centred on the
use of FPGA resources.
8. Interconnection of the whole system through the PC and
global testing of the application using the HPS.
9. Analysis of the overall performance of the system and
of the silicon area used by the pre-processing. Choice of
architecture for the final embedded hybrid system from
the designs proposed in Sect. 3.
10. Migration of the solution into the architecture of choice
and final testing of the embedded hardware.
Fig. 5 The hybrid prototype system being used in a road sign
recognition application
By the time step 5 in the life cycle of the application is
reached, the algorithms to be used in the pre-processing of
the images have already been decided. In order to get a
simulation of the system working with the actual FPGA, it is
necessary to code the algorithms in C, test them, adapt them
to Handel-C, optimise them for hardware, and test them once
more. This process can become rather tedious and it requires
a great deal of knowledge of programming techniques at
hardware level.
In order to facilitate this process of preparing the
algorithms for hardware, a library of IP algorithms has been
created. The library is built on two criteria: it contains IP
algorithms which (1) are commonly used, and/or (2) exhibit
widely useful design patterns.
(1) Some IP algorithms are common to many machine
vision applications, and so, it would be very helpful to have
them already implemented and prepared for hardware. For
this reason, a library has been compiled with these standard
algorithms implemented in Handel-C, and optimised for their
use in an FPGA.
Regarding step 5 mentioned above, the job of the
developer of a new application is now reduced to determining
which of the standard algorithms from the library can be
used for the application, and implementing the other, more
specific algorithms not present in the library. Furthermore,
not only does the library save the time of having to implement
and prepare these algorithms for hardware, but it also
provides a fundamental core of standard techniques to process
the image.
(2) As mentioned above, very specialised knowledge is
required for the programming of algorithms for hardware.
This task, however, can be greatly aided by pre-existing
methods and procedures for dealing with similar kinds of
problems. For example, if a special kind of image filtering is
required by the application, the developer can explore the
solutions given in the IP library for filtering algorithms
(such as median or Gaussian filtering), and base the new
algorithm on the code provided for those solutions. By
providing useful examples, the programming task of steps 5 and
6 can be greatly reduced, and the work of the developer is
redirected from technical and time-consuming tasks towards
more creative aspects of the process.
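As an illustration of the kind of software reference that would be written and tested before adapting an algorithm to Handel-C, the following is a minimal sketch of a 3 × 3 median filter. It is not code from the IP library described here: the function name, the representation of the image as a list of rows of grey levels, and the choice to copy border pixels unchanged are assumptions made for the example.

```python
def median3x3(image):
    """Return a median-filtered copy of a 2-D list of grey levels.

    Border pixels are copied unchanged (an assumed boundary policy).
    """
    height = len(image)
    width = len(image[0]) if height else 0
    filtered = [row[:] for row in image]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            # Gather the 3 x 3 neighbourhood and take its middle value.
            window = sorted(
                image[y + dy][x + dx]
                for dy in (-1, 0, 1)
                for dx in (-1, 0, 1)
            )
            filtered[y][x] = window[4]
    return filtered
```

A hardware version would typically replace the general sort with a fixed network of comparators working on a sliding window, which is the sort of design pattern that a library of worked examples can make easier to reuse.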
As mentioned previously, there are various possible ways
of connecting the RC200 to a PC (i.e. Ethernet, parallel or
serial port). In principle, one would normally choose the
fastest; however, there may also be other, practical
considerations. The method used for many of the experiments with
the HPS is the serial port, even though it is not
the fastest available option.
The reason why this method is employed is its simplicity of
use: on the RC200 side, the PAL provides a series of functions
for the transmission of bytes through the serial port,
and on the PC side there also exist various libraries for serial
communication. It is also important to bear in mind that many
machine vision applications involve a considerable amount
of data reduction in the preprocessing to extract a feature
vector. Thus only a small amount of data is actually sent to the
ANN classifier. Therefore, in many applications, the speed
of the connection between FPGA and ANN is not a critical
factor.
Due to the speed limitations of the HPS, it is necessary to
bear in mind that the overall performance of a solution being
tested on the system will be significantly improved after its
migration into the final embedded hardware. More specifically,
the three points which limit the performance of
any system being developed on the HPS are:
• The use of the development systems slows down the
  communication to and from the chips, while a direct
  approach with both chips on the same board would be
  significantly faster.
• Since DK and the VDS run on Windows, the complete
  dedication of the processor time to the system cannot be
  assured, due to the background processes always present.
• The serial cable, although easy to program, is the slowest
  of the available methods for connecting the RC200 to the PC.
These points need to be taken into careful consideration when
the architecture for the embedded solution is chosen.
5 The road sign recognition application
As a proof of concept, a real-time machine vision application
has been developed and tested using the HPS. Road sign
recognition is a part of driver support systems. Its main aim
is to increase traffic safety by calling the driver's attention
to the presence of key traffic signs such as stop signs, yield
signs, speed limits, etc. Additionally, a vision-based system
able to detect and classify traffic signs from road images in
real-time would also be useful as a support tool for guidance
and navigation of intelligent vehicles.
The problem of RS detection and classification might
seem simple and well-defined, since the colour, shape,
dimensions and placement of RSs are tightly regulated. In reality,
the problem is very complex for several reasons:
• Since the images are acquired from a moving vehicle,
  they suffer from vibrations, blurred scenes, and varying
  illumination of the captured scenes.
• The signs can be found damaged, partly occluded or
  clustered, or in other situations which make their
  detection more difficult.
• There are variations in the width of the sign borders and
  in the actual pictograms on the signs, in spite of the
  regulations.
This section presents a new method using self-organising
maps for the detection and recognition of road signs. The
RSR application is implemented and tested using the HPS
presented in Sect. 4, and is used as a demonstrator of the
hybrid system.
The detection of traffic signs from outdoor images is the
most complex step in an RSR system. There are many ways in
which the characteristics of road signs (RS) (e.g. well-established
shapes and colours of the signs) could be exploited.
However, the necessity to analyse the images in real-time is a
limiting factor as to how much information available within
the image should be extracted and analysed. Most approaches
to the problem of RS detection use either colour information
or shape information. A review of various approaches can be
found in [33] and [37].
In most applications, once a region of interest (ROI) has
been detected in the image, i.e. an area of the image that
might contain a traffic sign, this ROI is passed to the
recognition module in charge of identifying the sign. Most
RSs contain a pictogram, a string of characters, or both.
The recognition of RSs is therefore very often implemented
with ANNs [15, 22, 36], because of their pattern recognition
abilities. Nevertheless, other approaches such as template
matching [30, 47] or the Laplace kernel classifier [45, 46] have
also been successfully used in the recognition step.
The approach chosen for the RSR application developed in
this paper uses SOMs in both the detection and the recognition.
In each task, a vector characterizing the ROI is extracted.
The classification of the vector by the SOM determines
whether it corresponds to a potential RS or not, at detection
level, and what kind of RS it is, at classification level. More
details of the implementation of this application can be found
in [50].
5.1 The HPS in the RSR application
The RSR application has been chosen as a demonstrator of
the HPS presented in Sect. 4. RSR is an ideal problem for
testing the use of the HPS in the development of embedded
machine vision applications. The real-time recognition of
RSs requires a complex interaction between the FPGA and
the ANN while working under strict speed constraints.
In exploring the design space for this application, it is
necessary to take into account the requirements of both the
preprocessing and the learning/classification. In this case, by
step 7 of the life cycle, it is clear that the preprocessing will
take up a considerable FPGA area. The object detection
and extraction, and the construction of the feature vector,
form the processing bottleneck, and therefore require
as much parallel hardware as possible. This rules out
implementing both the preprocessor and the SOM classifier in the
FPGA (Sects. 3.1 and 3.2). Moreover, the application also
requires dynamic adaptation and on-line learning, and the
reconfiguration time of the FPGA (O(10^2) ms) would
render the method of Sect. 3.2 infeasible.
The other consideration is the nature of the connection
between the FPGA and the ANN. In the RSR application,
the preprocessing produces a feature vector for each ROI in
the image [50]. The VindAX processor requires a feature
vector of 16 bytes, so a typical source image would result in
only O(10^2) bits to be sent. There is even less data (O(10)
bits) in the reverse direction (from ANN to FPGA). The
communication time would be dominated by the latency of the
RS232 or Ethernet link (O(1) and O(10^-1) ms respectively), and
in either case would be insignificant compared with the FPGA
preprocessing time. As mentioned in Sect. 4.3, an RS232
connection was chosen for pragmatic reasons. Figure 5 shows the
HPS supporting the development of the RSR application, in
the final test of the system corresponding to step 8 of the life
cycle of an embedded application for the HPS proposed in
Sect. 4.3.
In Fig. 5, the image to be analysed corresponds to a frame
of a road scene captured by a camera connected to the RC200.
From this image, the RC200 analyses the different ROIs that
could potentially be an RS and generates vectors of
characteristics describing them. Each vector is sent to the PC, and is
then transferred to the VDS connected to a PCI port of the PC.
The VindAX processor analyses each vector and returns the
class that it belongs to. The PC transfers the information
back to the RC200, and if the analysis indicates that
the object is a potential RS, then the RC200 extracts more
information and sends it to the VindAX processor again, in
order to determine exactly which kind of RS it is.
Given the large number of RSs in some classes, the training
of an ANN with all the RSs of a class at the same time is
nearly impossible. A typical solution is to group the RSs into
subclasses, according to the similarity between their
pictograms, and to assign a neural map to each of the subclasses.
In this way, the process of transferring vectors to the VindAX
processor and receiving their class back is repeated a few times,
as the RS is further analysed. Once the exact RS has been
identified, the original frame captured by the camera is
displayed on the Video Output, with the detected RS
marked appropriately.
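The repeated exchange described above amounts to walking down a classification tree, with a feature extraction on the FPGA side and a classification on the ANN side at every level. The sketch below illustrates that control flow only. The tree contents, node names and function names are hypothetical, and the extraction and classification steps are passed in as plain functions standing in for the RC200 and the VindAX processor.

```python
# Hypothetical tree: node name -> {label: next node, or None for a final answer}.
EXAMPLE_TREE = {
    "detection": {"potential sign": "main class", "background": None},
    "main class": {
        "stop": None,
        "give way": None,
        "warning": "warning subclass",
        "prohibition": None,
    },
    "warning subclass": {"cross-roads": None, "traffic-light ahead": None},
}


def recognise(roi, extract_vector, classify, tree, root="detection"):
    """Walk the tree, returning the list of (node, label) decisions taken."""
    node, path = root, []
    while node in tree:
        vector = extract_vector(roi, node)   # FPGA side: level-specific vector
        label = classify(node, vector)       # ANN side: map assigned to this node
        path.append((node, label))
        node = tree[node].get(label)         # None ends the dialogue
    return path
```

The last entry of the returned path holds the final decision, and the length of the path is the number of round trips between the two chips for that ROI.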
The Partitioning of the SOM. One of the features of the
VindAX processor is its capacity to partition the neural net into
various independent maps working in parallel. This feature
can be used to great advantage in the RSR application.
Dividing the VindAX processor into four 8 × 8 networks
gives it the capacity to analyse each vector with four
sub-networks at the same time. Each of the sub-networks could
be trained with four or five different RSs, giving the
processor the capacity to examine up to 20 individual RSs for
classification at the same time.
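The effect of the partitioning can be sketched in software as four independent 8 × 8 maps that are each queried with the same 16-byte vector. This is an illustrative sketch under several assumptions: the maps are queried one after another rather than in parallel, the distance is taken to be the Manhattan distance, each neuron is assumed to carry a label assigned after training, and the overall answer is assumed to be the closest neuron across all four maps. None of the names below come from the VindAX tools.

```python
MAP_COUNT = 4
NEURONS_PER_MAP = 8 * 8
VECTOR_BYTES = 16


def manhattan(a, b):
    """City-block distance between two vectors (an assumed metric)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def best_match(som_map, vector):
    """(distance, neuron index) of the closest reference vector in one map."""
    return min((manhattan(ref, vector), idx) for idx, ref in enumerate(som_map))


def classify_partitioned(maps, labels, vector):
    """Query every sub-network and keep the overall winner."""
    winners = [best_match(m, vector) + (k,) for k, m in enumerate(maps)]
    distance, neuron, map_index = min(winners)
    return labels[map_index][neuron], map_index, distance
```

With four or five signs per sub-network, the label table `labels[map_index][neuron]` would cover up to 20 signs in a single pass over the four maps.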
Dynamic adaptation. Another very important feature of the
hybrid system is its ability to adapt the neuron values of
the SOM while processing vectors for classification. In
order to best exploit this feature, the RSR system has been
implemented in such a way that it is capable of adapting its
SOM maps without having to stop its normal execution.
By using the dynamic adaptation of the neural array at
specific times, the system is able to adapt itself to small
changes in the RSs being analysed. The objective here
is to keep the system learning from experience and evolving
over time so as to always obtain the most accurate
classification possible, even when changes in the appearance
of RSs take place. Such changes could be observed,
for example, when the border between two countries is
crossed. For this reason, this very example is recreated
in the experiments, where the system is trained with images
from Spain and the Czech Republic and tested with images from
the United Kingdom.
5.2 Results
This section presents the results of the experiments. These
experiments study the performance of the RSR application
running on the HPS. The two most important points being
tested by the experiments are the accuracy of the detection and
classification of RSs, and the processing speed of the system.
5.2.1 Detection and classification accuracy
With appropriate lighting, the detection algorithm has proved
highly reliable in detecting RSs of the four main classes (stop,
give way, warning and prohibition signs). In general, the RSs
were perfectly detected and identified under variations of
scale (as small as 25 pixels wide), rotation (up to 20°) and
occlusion (up to 15-20% of vertical and 5-10% of horizontal
occlusion, depending on the area being occluded) of the
signs.
It has to be said that all those results were obtained after
the network was adapted to work with the British standards
of RSs. As was mentioned in the previous section, the RSs
used to train the SOM and those used in the experiments are
slightly different. Using the initial training of the maps, all
RSs except cross-roads and traffic-light ahead were perfectly
recognised by the system. In the case of these two, the SOM
seemed to confuse them when the RSs were shown with the
slightest rotation of the pictograms.
To help the system to differentiate the signs, the system
was asked to adapt the neural array values over about 20–40
frames for each sign. Through this procedure, the system
perfectly learned the two classes and proceeded to correctly
classify them, as well as the other signs, under the same kind
of rotations tested on them.
5.2.2 Speed of the system
Initially, the main concern of the experiment was that the HPS
might not be able to achieve a high processing speed.
After all, the HPS gathers a great amount of statistical data,
communicates through a serial port, and part of the system
runs under Windows, which is not a real-time operating
system. In spite of all these limiting factors, the application
ran at a surprising speed of 19–21 frames per second (fps),
which can be considered as almost real-time (assumed to be
approximately 25 fps).
Furthermore, some experiments revealed that simply
disabling the logging and display of the data gathered by
the interface program running on the PC, without any further
modification of the system, was enough to boost the
application speed to the desired 24–25 fps.
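To put these frame rates in terms of the time available per frame, the short Python sketch below converts the figures quoted above into frame periods. It is plain arithmetic on the reported rates; the function name is illustrative only.

```python
def frame_period_ms(fps):
    """Time available to process one frame, in milliseconds."""
    return 1000.0 / fps


# Frame rates reported for the HPS with logging enabled (19-21 fps)
# and with logging disabled (24-25 fps).
for fps in (19, 21, 24, 25):
    print(f"{fps} fps -> {frame_period_ms(fps):.1f} ms per frame")
```

At 25 fps the whole pipeline has 40 ms per frame, against roughly 48 to 53 ms at 19–21 fps.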
Considering the aforementioned speed limitations of the
HPS, an implementation of the RSR application on a suitable
architecture should be able to perform a much more complex
pre-processing of the images as well as incorporating the
classification of more RSs, and all this without compromising
the performance of the application.
6 Conclusions
In the past few years, machine vision applications have
become more demanding on the hardware. These kinds of
applications often require the integration of traditional IP
with AI techniques, implemented in designs with a
small footprint and working at high speeds. The main aim of
this paper was to present a system able to meet the requirements
of embedded vision applications, and to provide the
necessary tools for the development and testing of such
applications on this system.
The system introduced in this paper combines programmable
hardware for the IP tasks and a digital implementation of
an artificial neural network (ANN) for the pattern recognition
and classification tasks. The different architectures that this
hybrid design might adopt have been compared in Sect. 3, where
the advantages and disadvantages of each architecture have been
analysed in detail.
A prototype of the hybrid system, called the HPS, has been
created to aid in the development and testing of new embedded
vision applications, before the final hardware of these
applications is produced. The general methodology for the
development of applications for the HPS has been outlined,
and the capacities and limitations of this new prototype have
been discussed.
The demonstrator application developed in this project,
described in Sect. 5, is an RSR application. The
aim of the system was the real-time detection and classification
of RSs in road scene images provided by a camera. This
application was chosen because of its high-speed requirements,
as well as because of the complex interaction involved
between the FPGA and the neural array.
The results obtained by the RSR application have shown
great accuracy in the detection and recognition of RSs with
changes in position, scale, rotation and partial occlusion.
Further, the system has demonstrated the ability to recognise
non-standardised RSs and signs from different countries. The
system was able to produce results in real-time, in spite of
the limitations of its laboratory implementation.
Finally, the hybrid prototype system has demonstrated its
worth in the development of the RSR application. A number
of preprocessing algorithms were tried in order to optimise
the feature extraction. The ability to use components
from the IP library in various combinations facilitated this
experimentation, and permitted a rapid estimation of FPGA
area usage. In this application, it quickly became apparent
that the preprocessing was the potential bottleneck, and the
IP algorithms had to be implemented using the parallelism of
Handel-C and the FPGA. This in turn meant that the FPGA
being used (XC2V1000) would not have space for a SOM
implementation in addition to the IP. The resultant necessity
of using the ANN processor for learning and classification
meant that the HPS could be used to explore various SOM
partitionings (permitted by the VindAX system) to optimise
the recognition process. Another feature of the HPS tested
by the RSR application was the ability to adapt the
neural array parameters during execution time. The results
were more than satisfactory, illustrating the capacity of the
system to learn from new examples without having to be
re-trained in the laboratory. The flexible logging facilities
of the HPS were used to good effect: full logging of feature
vectors and other run-time information was used during
development, and then reduced or switched off when it was
desired to more closely emulate the final system. The HPS in
all these ways allowed the design space of the RSR system
to be explored intelligently and optimised before committing
to hardware. In conclusion, this demonstrator application has
shown the great advantage of the HPS in aiding the design
and implementation of embedded vision applications for the
hybrid system.
Future research directions. A future direction of this dem-
onstrator could be to extend the functionality of the RSR
application into a full commercial application. The appli-
cation of this particular example would not be restricted to
intelligent vehicles, but it could also be of great use for robot
navigation.
An area where this work could be continued is clearly the
development of other embedded vision applications using the
HPS. Areas that could make immediate use of the HPS include,
for example, navigation, security and surveillance
applications (object tracking, luggage inspection) and medical
applications (analysis of brain scans, medical X-rays, electron
microscope images). This list is by no means complete,
and it is worth remarking that these are not only research
areas, but also areas where useful commercial applications
could be developed.
The HPS has been designed to aid in the implementation
and testing of an application for the hybrid system, and not
to be used as a final embedded solution. The development
of a hardware implementation of the hybrid system that
could be used as the final product would clearly be another
area of research following this work. From the architectures
discussed in Sect. 3, the most useful designs that could be
developed would be the FPGA standalone (see Sect. 3.1) and
the FPGA and VindAX processor (see Sect. 3.3). More
specifically, the first design would incorporate both modules of
the hybrid system in one standalone FPGA, and the second
design would incorporate one chip for the FPGA and another
with the VindAX processor connected together on one board.
Acknowledgments The authors gratefully acknowledge the support
of AXEON Limited for this work.
References
1. Arnold, J., Buell, D.A., Davis, E.G.: Splash-2. In: ACM Sympo-
siumonParallel Algorithms andArchitectures (ACM92), pp. 316
324 (1992)
2. AXEON Ltd: URL http://www.axeon.com
3. Batista, L.B., Gomes, H.M., Herbster, R.F.: Application of growing
hierarchical self-organizing map in handwritten digit recognition.
In: Proceedings of the Brazilian Symposium on Computer Graph-
ics and Image Processing (SIBGRAPI03) (2003)
4. Batlle, J., Mart, J., Ridao, P., Amat, J.: A new FPGA/DSP-based
parallel architecture for real-time image processing. RealTime
Imaging 8, 345356 (2002)
5. Benkrid, K., Crookes, D., Smith, J., Benkrid, A.: High level
programming for FPGA based image and video processing using
hardware skeletons. In: Proceedings of the IEEESymposiumField-
Programmable Custom Computing Machines (FCCM01) (2001)
6. Bode, M., Freyd, O., Fischer, J., Niedernostheide, E.J., Schulze,
H.J.: Hybrid hardware for a highly parallel search in the context of
learning classiers. Artif. Intell. 130, 7584 (2001)
7. Bouridane, A., Crookes, D., Donachy, P., Alotaibi, K., Benkrid,
K.: A high level FPGA-based abstract machine for image process-
ing. J. Syst. Archit. 45, 809824 (1999)
8. Brown, D., Craw, I., Lewthwaite, J.: ASOMbased approach to skin
detection with application in real time systems. In: Proceedings of
the British Machine Vision Conference (BMVC01) (2001)
9. Campbell, N.W., Thomas, B.T., Troscianko, T.: Automatic seg-
mentation and classication of outdoor images using neural net-
works. Neural Syst. 8(1), 137144 (1997)
10. Celoxica: Handel-C Language reference manual, Celoxica (2004)
11. Celoxica Ltd: URL http://www.celoxica.com
12. Crookes, D.: Architectures for high performance image process-
ing: the future. J. Syst. Archit. 45, 739748 (1999)
13. Dias, F.M., Antunes, A., Mota, A.M.: Articial neural networks: a
review of commercial hardware. Eng. App. Artif. Intell. 17, 945
952 (2004)
14. Diniz, P., Hall, M., Park, J., So, B., Ziegler, H.: Automatic map-
ping of Cto FPGAs with the DEFACTOcompilation and synthesis
system. Microprocess. Microsyst. 29, 5162 (2005)
15. Douville, P.: Real-time classication of trafc signs. Realtime
Imaging 6, 185193 (2000)
16. Draper, B., Najjar, W., Bohm, W., Hammers, J., Rinker, B., Ross,
C., Chawathe, M., Bins, J.: Compiling and optimizing image pro-
cessing algorithms for FPGAs. In: Proceedings of the IEEE Com-
puter Architectures for Machine Perception (CAMP00) (2000)
17. Drayer, T.H., IV, W.E.K., Tront, J.G., Conners, R.W.: A modu-
lar and reprogrammable real-time processing hardware. In: Pro-
ceedings of the IEEE FPGA-Based Custom Computing Machines
(FCCM95) (1995)
18. Dunn, P.A., Kearney, P.D., Jensen, M.J., Davey, P.I.: A modular
congurable logic processor system. CSIRO Manufacturing Sci-
ence and Technology (2002)
19. Egmont-Petersen, M., Ridder, D.de , Handels, H.: Image process-
ing with neural networks - a review. Pattern Recognit. 35(10),
22792301 (2002)
20. Eide, A., Lindblad, T., Lindsey, C.S., Minerskjold, M., Sekhni-
aidze, G., Szkely, G.: An implementation of the Zero Instruction
Set Computer (ZISC036) on a PC/ISA-bus card. In: Proceedings of
the Workshop on Neural Networks (WNN/FNN94), pp. 319330
(1994)
21. Eppler, W., Fischer, T., Gemmeke, H., Chilingarian, A., Vardanyan,
A.: Neural chip SAND in online data processing of extensive air
showers. Comput. Phys. Commun. 126, 6366 (2000)
22. Estable, S., Schick, J., Stein, F., Janssen, R., Ott, R., Ritter, W.:
Real-time trafc sign recognition system. In: Proceedings of Intel-
ligent Vehicles94 Symposium (1994)
23. Fiesler, E., Duong, T., Trunov, A.: Design of neural network-based
microchip for color segmentation. In: Proceedings of SPIE, vol.
4055 (2000)
24. Fu, L.: Neural Networks in Computer Intelligence. McGraw-Hill,
Inc, New york (1994)
25. General Vision: URL http://www.general-vision.com
26. Greenbaum, J., Baxter, M.: Increased FPGAcapacity enables scal-
able, exible CCMs: An example from image processing. In: Pro-
ceedings of the IEEE FPGA-Based Custom Computing Machines
(FCCM97) (1997)
27. Hamid, G.: An FPGA-based coprocessor for image processing.
IEE Colloquium Integrated Imaging Sensors and Processing, pp.
6/16/4 (1994)
28. Heemskerk J.N.H. (1995) Overview of neural hardware. Neuro-
computers for brain-style processing: Design, implementation and
application. Ph.D. thesis, Unit of Experimental and Theoretical
Psychology, Leiden University, Leiden (1995)
29. Hendry, D.C., Duncan, A.A., Lightowler, N.: IP core implemen-
tation of a self-organizing neural network. IEEE Trans. Neural
Netw. 14(5), 10851096 (2003)
1 3
394 M. S. Prieto, A. R. Allen
30. Hsu, S.H., Huang, C.L.: Road sign detection and recognition using
matching pursuit method. Image Vis. Comput. 19, 119129 (2001)
31. Ienne, P., Kuhn, G.: Digital systems for neural networks. SPIEOpt.
Eng. Crit. Rev. Ser. CR57, 314345 (1995)
32. Ienne, P., Viredaz, M.A.: Implementation of Kohonens self-
organizing maps on MANTRA I. In: Proceedings of the Interna-
tional Conference on Microelectronics for Neural Networks and
Fuzzy Systems, pp. 273279 (1994)
33. Johansson, B.: Road sign recognition from a moving vehicle. cite-
seer.ist.psu.edu/570294.html
34. Kessal, L., Abel, N., Demigny, D.: Real-time image process-
ing with dynamically recongurable architecture. RealTime Imag-
ing 9, 297313 (2003)
35. Kohonen, T.: Analysis of a simple self-organizing process. Biol.
Cybern. 44(2), 135140 (1982)
36. Krumbiegel, D., Kraiss, K.F., Schrieber, S.: A connectionist traf-
c sign recognition system for onboard driver information. In: 5th
IFAC/IFIP/IFORS/IEASymposiumon Anlaysis, Design and Eval-
uation of ManMachine Systems, pp. 201206 (1992)
37. Lalonde, M., Li, Y.: Road sign recognition, survey of the state of
the art. Tech. rep., Centre de recherche informatique de Montreal
CRIM/IIT (1995)
38. Le, D.X., Thoma, G.R., Wechsler, H.: Document image analysis
using integrated image and neural processing. In: Proceedings of
the International Conference on Document Analysis and Recogni-
tion (ICDAR95) (1995)
39. Liao, Y.: Neural networks in hardware: A survey. Tech. rep. bit.
csc.lsu.edu/~jianhua/shiv2.pdf (2001)
40. Lightowler, N.: Modular maps: an implementation strategy for the
self-organising map. Ph.D. thesis, University of Aberdeen, Aber-
deen (1997)
41. Lightowler, N., Allen, A.R., Grant, H., Hendry, D.C., Spracklen,
C.T.: The modular map. IJCNN (1999)
42. McBader, S., Lee, P.: An FPGA implementation of a exible, par-
allel image processing architecture suitable for embedded vision
systems. In: Proceedings of IEEE International Parallel and Dis-
tributed Processing Symposium (IPDPS03) (2003)
43. Moerland, P.D., Fiesler, E.: Hardware-friendly learning algorithms
for neural networks: an overview. In: Proceedings of the Interna-
tional Conference on Microelectronics for Neural Networks and
Fuzzy Systems (MicroNeuro96) (1996)
44. Muthukumar, V., Rao, D.V.: Image processing algorithms on re-
congurable architecture using HandelC. In: Proceedings of the
IEEE Euromicro Systems on Digital System Design (DSD04),
pp. 218226 (2004)
45. Paclik, P.: The automatical classication of road signs. Masters
thesis, Faculty of transportation science, Czech Technical Univer-
sity, Prague (1998)
46. Paclik, P., Novovicova, J., Pudil, P., Somol, P.: Road sign classica-
tion using laplace kernel classier. Pattern Recogn. Lett. 21, 1165
1173 (2000)
47. Piccioli, G., Micheli, E.D., Parodi, P., Campani, M.: Robust
method for road sign detection and recognition. Image Vis.
Compu. 14, 209223 (1996)
48. Porrmann, M., Franzmeier, M., Kalte, H., Witkowski, U., Ruckert,
U.: A recongurable SOM hardware accelerator. In: Proceed-
ings of the European Symposium on Articial Neural Networks
ESANN02, pp. 337342 (2002)
49. Porrmann, M., Witkowski, U., Kalte, H., Ruckert, U.: Implementa-
tion of articial neural networks. In: Proceedings of the Euromicro
Workshop on Parallel, Distributed and Network-based Processing
(EUROMICRO-PSP02), pp. 337342 (2002)
50. Prieto, M.S., Allen, A.R.: Using self-organizing maps in the detec-
tion and recognition of road signs. Image Vis. Comput. (2005,
submitted)
51. Rueping, S., Goser, K., Rueckert, U.: A chip for self-organiz-
ing feature maps. In: Proceedings of the International Confer-
ence on Microelectronics for Neural Networks and Fuzzy Systems,
pp. 2633 (1994)
52. Ruping, S., Porrmann, M., Ruckert, U.: SOM accelerator sys-
tem. Neurocomputing 21, 3150 (1998)
53. Salcic, Z., Sivaswamy, J.: Imeco: A recongurable FPGA-based
image enhancement co-processor framework. RealTime Imag-
ing 5, 385395 (1999)
54. Sarle, W.S.: Neural Network FAQ. URL ftp://ftp.sas.com/pub/
neural/FAQ.html
55. Schoenauer, T., Jahnke, A., Roth, U., Klar, H.: Digital neurohard-
ware: principles and perspectives. In: Proceedings of the Neuronal
Networks in Applications (NN98), pp. 101106 (1998)
56. Seiffert, U.: Articial neural networks on massively parallel com-
puter hardware. In: Proceedings of the European Symposium on
Articial Neural Networks (ESANN02), pp. 319330 (2002)
57. Sheen, T.M.: Tools for portable parallel image processing. Ph.D.
thesis, University of Aberdeen (1999)
58. Sheen, T.M., Allen, A.R., Lawrence, A.E., Page, I.: Hardware
compilation technology for embedded image processing. In: High
Performance Architectures for Real-Time Image Processing: IEE
Colloquium Digest 1998/197, pp. 9/19/6 (1998)
59. Siegel, H.J., Armstrong, J.B., Watson, D.W.: Mapping computer
vision-related tasks onto recongurable parallel-processing sys-
tems. IEEE Comput. 25(2), 5463 (1992)
60. Tanougast, C., Berviller, Y., Brunet, P., Weber, S., Rabah, H.: Tem-
poral partitioning methodology optimizing FPGA resources for
dynamically recongurable embedded real-time system. Micro-
process. Microsyst. 27, 115130 (2003)
61. Viredaz, M.A.: MANTRA I: an SIMD processor array for neural
computation. In: Proceedings of the Euro-ARCH93 Conference,
pp. 99110 (1993)
1 3

You might also like