
TABLE OF CONTENTS

1. Introduction
   1.1. Motivation .......... 1
   1.2. Objective .......... 3
   1.3. Contributions of the dissertation .......... 5
   1.4. Organization of the dissertation .......... 6
2. Background information
   2.1. On hardware implementation of neural networks .......... 7
   2.2. On top-down design methodology .......... 10
3. The learning algorithms
   3.1. Deterministic learning algorithm .......... 11
   3.2. Original stochastic learning algorithm .......... 15
   3.3. Modifications made for VLSI implementation .......... 25
      3.3.1. Digital sigmoid .......... 27
      3.3.2. Modifications of the original weight adjustment mechanism .......... 32
      3.3.3. Data representation .......... 33
4. Top-down design methodology .......... 37
5. Top-down design with Alopex
   5.1. Choice of data format .......... 59
   5.2. First step: C language implementation .......... 65
   5.3. Second step: HDL functional description .......... 65
   5.4. Third step: preliminary module partitioning .......... 66
   5.5. Module description
      5.5.1. Weight adjustment .......... 71
      5.5.2. Output calculation units .......... 78
      5.5.3. Control unit .......... 83
      5.5.4. Noise serial, clock generator and power-on-reset modules .......... 90
      5.5.5. Operations and machine cycles used by the control unit and the neural array during one training iteration .......... 91
   5.6. Synthesis step .......... 95
   5.7. Placement and routing .......... 99
6. Conclusions and future work
   6.1. Summary .......... 100
   6.2. Conclusions .......... 102
   6.3. Directions for future work .......... 106

7. Appendix A - Software listings
   A.1. Perceptron HDL behavioral/structural descriptions & results .......... 107


   A.2. Sigmoid HDL behavioral description .......... 122
   A.3. Alopex C language implementation and results .......... 125
   A.4. Alopex HDL behavioral/structural description .......... 145
8. Appendix B
   B.1. Design, synthesis and analysis tools .......... 160
   B.2. Tutorials .......... 164
9. Bibliography .......... 169
10. Author's vita .......... 172


LIST OF TABLES

Table 5.1. Actual weight adjustments for 8 iterations .......... 77
Table 5.2. Error calculations in a typical iteration .......... 89
Table 5.3. Machine cycles for one training iteration .......... 91
LIST OF FIGURES

Figure 3.1: The Perceptron .......... 12
Figure 3.2: Hierarchical module partitioning .......... 13
Figure 3.3: Error/weight change probabilities .......... 17
Figure 3.4: Flowchart for Alopex C language implementation .......... 19
Figure 3.5: Digital sigmoid implementation .......... 25
Figure 3.6: a) Digital sigmoid transfer function for several values of λ; b) timing waveforms of the behavioral HDL description; c) synthesized schematic; and d) chip layout .......... 30
Figure 3.7: Integer/fractional multiplication comparison .......... 36
Figure 4.1: Top-down design steps .......... 40
Figure 4.2: Sample run for results of 13 iterations of perceptron algorithm .......... 43
Figure 4.3: Top-level schematic .......... 48
Figure 4.4: Training module partitioning .......... 50
Figure 4.5: Verilog XL timing waveforms .......... 52
Figure 4.6: Training and testing patterns .......... 53
Figure 4.7: Gate level schematic and reports generated by Synergy synthesis tools .......... 54
Figure 4.8: Verilog XL timing waveforms of the gate level simulation .......... 57
Figure 4.9: Perceptron (training module) final chip layout .......... 58
Figure 5.1: Verilog code for signed fractional multiplier .......... 61
Figure 5.2: Data representation for Alopex implementation .......... 63
Figure 5.3: F(net) for present implementation .......... 64
Figure 5.4: HDL structural description of network architecture .......... 67
Figure 5.5: System block diagram .......... 70
Figure 5.6: Timing waveforms of a weight unit updating its value .......... 74
Figure 5.7: Timing waveforms of output calculation units .......... 81
Figure 5.8: Portion of sample run showing how neurons calculate f(net) = Σ wi * xi .......... 82
Figure 5.9: Timing waveforms for control unit .......... 86
Figure 5.10: Portion of a sample run showing actual calculated values for partial and total errors .......... 89
Figure 5.11: Timing waveforms for Alopex .......... 93
Figure 5.12: Gate level schematic of output calculation unit (neuron) .......... 96
Figure 5.13: Gate count, total area and maximum delay reports for a) weight unit, b) output calculation unit .......... 97
Figure 5.14: Final layout for output calculation unit .......... 99
Figure B.1: Data flow in Cadence Design Framework II Environment .......... 161



Chapter 1 INTRODUCTION

1.1 Motivation
During the last two decades, several artificial neural network architectures have been proposed to address the issue of computational intelligence (Werbos, 1974; Kohonen, 1988; Hopfield, 1985; Grossberg, 1987; Fukushima, 1984; Minsky & Papert, 1988). A fundamental characteristic of artificial neural networks is their learning capability. They implement this by performing simple calculations, such as product sums and nonlinear functions, using local operators on a large number of interconnected processing units, in an attempt to emulate biological neurons (Freeman, 1992). Learning is realized as various adaptation rules that define the way synaptic weights are to be modified. After learning, neural networks automatically reproduce the relation implicitly contained in new test data.

The large number of processing elements and the large number of interconnections among them make neural network simulation on traditional hardware dramatically slow (Lehman, 1993). Most of these applications are merely software implementations in which both the learning and the testing are simulated on a sequential, single-processor machine. Other applications rely on training the networks (i.e., adjusting the weights over several iterations) using a high-level programming language such as Pascal, C or C++. After training is complete, the weights are downloaded to the hardware for testing (Sackinger, 1992). This actually implements an accelerator chip, which is not learning but merely performing fast calculations after having been trained off-chip. In some cases, training may take several hours or even days, since the CPU must decode each instruction and update each weight individually following a learning algorithm. The fundamental drawback is that the inherent parallelism of a neural net architecture is lost, entirely or in part, when simulated on a traditional sequential machine, a vector computer, a workstation or even transputer arrays (Ramacher, 1991).

By implementing a neural network architecture in hardware, with on-chip learning capabilities, or by simulating its implementation using the design and analysis tools now available, an appreciable reduction in computing time is possible. Thus, the parallel nature of the interconnected


neurons can be fully exploited (Sanchez, 1993; Linares, 1993). These simulations are essential for providing a reasonable assurance that a proposed computational paradigm will function as intended in hardware. Simulating networks can help assure manufacturability if the components are modeled to exhibit their characteristics, usually switching speed, at both extremes of their tolerances (Reid, 1991).

Several researchers (Ramacher, 1991; He, 1993; Melton, 1992; Mumford, 1992; Lehman, 1993; Macq, 1993; Harrer, 1992; Linares, 1993; et al.) have successfully attempted hardware implementations of learning algorithms. Generally, they approach the design process from a bottom-up perspective and do not present a design methodology that others may follow and adapt to other applications.

In addition to a hardware implementation of neural network algorithms to truly appreciate the parallel nature of these powerful architectures, today's high demand for shorter turnaround times and increasingly complex electronic designs requires a new design methodology. Top-down design with Hardware Description Languages (HDLs), including the use of automatic synthesis, chip layout tools and backannotation at various stages for checking correctness, is the way to design today. This methodology is likely to displace bottom-up, traditional, schematic-based design techniques. For many years, logic schematics served as the basis for circuit designs. However, in today's complex systems, such a gate-level, bottom-up design methodology would produce schematics with so many interconnections that the functionality of the system would be lost in a web of wires (Sternheim, 1993). Thus, integrated circuit and digital logic designers are now using a different approach to circuit design: a top-down design methodology using HDLs that keeps the architecture and functionality of the complete system at the highest level of abstraction, hiding the details of implementation of lower level modules until the system performs as close as possible to the specifications.


The concept of top-down hardware design correlates to the concept of structured programming in software design. As explained in (Comer, 1983), Niklaus Wirth, the developer of the Pascal programming language, provides the following definition of structured programming: "Structured programming is the formulation of programs as hierarchical, nested structures of statements and objects of computation" (Wirth, 1974). This implies the partitioning of a problem into simpler, easier-to-handle parts or modules. This refining process continues until each module is a task that can be implemented by program statements, minimizing the interaction between modules and thus minimizing the propagation of changes or errors to other parts of the system. All these modules working together accomplish the overall system function defined in the specifications.

These concepts can be extended to top-down hardware design. The complete system is characterized as a "box" with inputs, outputs and a set of specifications. Only these and the overall function of the system are known at the beginning of the design process. This may be a neural network architecture for which only the functional behavior is well understood, since it may have already been extensively tested in software simulations using a high-level language such as C. Over several iterations, the designer divides this system into functional modules as independent of each other as possible. Further iterations continue partitioning these modules into submodules, creating a hierarchical architecture until the lowest level modules contain simple hardware elements found in well-defined libraries of components. As each iteration is performed, the refined modules are tested within the complete system for timing and functional verification. In this way, the designer proceeds further into the design process with the assurance that the final product will perform to specifications.

1.2 Objective
Several researchers have discussed the top-down design approach for evolving the next generation of computers (Comer, 1983; Franca, 1994; Wolf, 1992; Sandige, 1992; Chen, 1993). All these papers address traditional computer architecture examples such as CPU design, microprocessor systems, ALUs, ROM,


RAM, decoders, etc. They assume that their audience has a substantial background or training in computer organization, assembly language programming and digital logic design. However, today researchers from a variety of fields such as physics, psychology, computational neuroscience, mathematics, etc., are proposing novel architectures and computational paradigms such as neurocomputers, genetic algorithms, fuzzy controllers, etc. This thesis is targeted to such an audience in order to introduce them to this hierarchical design automation approach.

For the researcher who is experimenting with a learning algorithm, the refinement process of module partitioning may stop before a purely structural description (i.e., a description with its modules consisting only of submodule instantiations, library components and state machines) is achieved, since in this case module partitioning may not be as clear-cut as in the case of a well-understood digital system architecture such as a microprocessor. The objective in this case is to produce a parallel system architecture that is synthesizable and implementable in silicon, i.e., one for which a gate level architecture can be automatically produced by the software synthesis tools and automatically placed and routed by the layout tools. By backannotating the information produced at each level, i.e., at the synthesis and layout levels, the design can be verified with parameters that are closer to the actual ones to be encountered when the chip is finally fabricated. In this way, the learning algorithm is embedded in hardware and the parallel nature of artificial neural networks can be fully studied.

The objective of this dissertation is to develop a top-down design methodology for synthesizing neural networks using a standard CAD toolset. The hypothesis is demonstrated by using the most versatile neural network, namely the single-layer perceptron. Then the methodology is extended to an advanced neural net architecture, Alopex.


1.3 Contributions of the dissertation


It is the first time ever that a top-down design methodology has been proposed for the design of neural network architectures. Top-down design concepts have been discussed in the context of well-known digital computer architectures such as CPUs and I/O interface devices. These applications do not require such an exhaustive iterative technique of backannotation between different levels throughout the design process as the one required to implement computational algorithms such as a neural network architecture. After each design step was completed, knowledge gained by performing extensive simulations was used to go back up one or several steps in the design process, modify the algorithm, and continue down. Several iterations at each design step were required, since each step advanced the implementation further into the hardware level, uncovering new limitations and secondary effects which had been transparent at a previous, higher level of abstraction.

Neural network hardware implementations have been successfully attempted by several researchers. They usually approach the design using a bottom-up technique which produces a specific architecture that cannot be improved or optimized to address all the issues that come up during the design process, and which does not provide guidance for other researchers to continue or optimize their designs. This dissertation, however, applies a top-down design methodology to the hardware implementation of neural network architectures. With such a methodology, a designer may undertake a different design, learning algorithm or application, using the modules developed in this dissertation and optimizing, improving or easily adapting them to suit other implementations. The approach proposed in this dissertation allows for the extensive modifications to the computational algorithm that are necessary for hardware implementability. The methodology and examples shown in this dissertation are general enough for any machine learning algorithm implementable as a neural network architecture. This is shown with an example of a most versatile neural network architecture that can be used as a building block for other architectures, since the modules are cascadable. The methodology is then applied, and results shown, for a novel architecture.

1.4 Organization of the dissertation


This dissertation is organized as follows: Chapter 2 provides an overview of previous work done in the areas of hardware implementation of neural networks and design methodologies. Chapter 3 describes both learning algorithms (i.e., the deterministic perceptron and the stochastic Alopex) and the modifications made to make them synthesizable, i.e., VLSI implementable. Chapter 4 describes the steps taken to map the perceptron learning algorithm to hardware while presenting results. Chapter 5 extends the methodology to the implementation of a novel learning architecture, the ALgorithm for Pattern EXtraction (Alopex). Chapter 6 concludes this dissertation with some thoughts recommending additional work to continue the research. Software listings are included in Appendix A. Appendix B includes an overview of the software tools used, and a step-by-step tutorial on how to use these tools in an actual university laboratory environment. The bibliography and the vita of the author are also included.


Chapter 2 BACKGROUND INFORMATION

2.1 On hardware implementation of neural networks


Hardware implementation of neural networks has been investigated by several researchers. Several of them implement the Backpropagation (BP) algorithm. However, according to Grossberg (Grossberg, 1987), this popular learning algorithm is not biologically plausible and is inherently difficult to implement in hardware, since there is no evidence that biological synapses can reverse direction to propagate the calculated errors to previous layers, nor that neurons can compute derivatives (Hinton, 1989). In BP, the calculation of the weights depends on differentials with respect to other parameters, which in turn depend on differentials with respect to previously calculated parameters. That is, the BP algorithm recursively modifies the synaptic weights between neuronal layers. The algorithm first modifies the synapses between the output field and the penultimate field of hidden neurons. It then uses this information to modify the synapses between the next two levels back of hidden neurons, and so on, all the way back to the synapses between the first hidden field and the input field (Kosko, 1992). To realize this in hardware requires an impractical number of interconnections among neurons to propagate the information forward and backward in order to calculate the error.

Other learning algorithms implemented in hardware that are found in the literature include the Kohonen feature map (He, 1993; Melton, 1992; Mumford, 1992) and Hopfield-type networks (Lehman, 1993), both of which are architectures that have limited applicability to solving a variety of real-world problems.

The proposed stochastic learning algorithm to be implemented in hardware was initially investigated by Harth in 1976 (Harth, 1976) in relation to the problem of ascertaining the shapes of visual receptive


fields. He originally proposed a model of visual perception as a stochastic process in which sensory messages received at a given level are modified to maximize responses of central pattern analyzers. Later computer simulations were carried out using feedback generated by a single scalar response, and very simple neuronal circuits in the visual pathway were shown to carry out this algorithm (Harth, 1987). Harth and Pandya showed that the process represented a new method of approaching a classical mathematical problem, that of optimizing a scalar function f(x1, x2, ..., xn) of n parameters xi, i = 1, 2, ..., n, where n is a large number (Harth & Pandya, 1988). Herman et al. have discussed this algorithm in the context of pattern classification, in particular using piecewise linear classifications (Herman, 1990). Rosenfield refers to it in a survey of picture processing algorithms (Rosenfield, 1987).

The algorithm works as a cross correlation between the synaptic weights and the global errors calculated in the previous two iterations, at any given layer. These simple calculations can be done concurrently by all the neuronal processing elements. In addition, due to its stochastic nature, the proposed algorithm does not suffer from another problem inherent in the BP implementation: that it converges to a local error minimum, if it converges at all (Kosko, 1992).

Most of the work done in hardware implementation of neural networks is in the analog field (Macq, 1993; Harrer, 1992; Linares, 1993; Sackinger, 1992; Sanchez, 1993). These researchers believe that an analog implementation best resembles the biological learning experience, that the final chip/die size is much smaller than a digital implementation would produce, and that it is faster. However, the relation between accuracy and chip area is a major design problem, since the more precisely a designer wishes to control the matching of the analog components, the larger the chip area that is required. Another drawback of an analog implementation is the difficulty of storing the weights, which is usually done as charge stored in capacitors. The charges leak and thus require a refreshing mechanism, adding to the overall chip area and affecting accuracy. The chip area is also affected by the need for low power consumption. This requires low currents, which in turn require large resistors in the circuits that implement the synaptic weights. The


resistors, or switched capacitors, that implement the weights do not allow for high precision, thus limiting the complexity of the patterns that can be reliably processed with an analog net (Ramacher, 1991).

The technology available today allows for very large component integration. Thus, for a final die size of reasonable dimensions, very powerful computations can be realized using a digital implementation. Storage of weight information and user-defined parameters can be easily accomplished in digital architectures, and speed of operation can be comparable to an analog implementation. In a digital implementation, the application and modeling of the net characteristics can be made independent of circuit design. The word length, and thus the computation accuracy, can be determined for each module before the implementation by software simulation. The use of floating point numbers can provide high accuracy and eliminates the problem of limit cycles. According to (Ramacher, 1991): "If the information must be processed with high precision (not less than 8 bits) and the learning is to be supported on-chip, digital circuitry is the right candidate for implementing a neural net. Conversely, for applications which do not need hardware support for learning and for less severe requirements in computation precision, analog design seems to dominate."

The stochastic nature of one of the proposed learning algorithms to be implemented for this dissertation, Alopex, and the simplicity of its weight adjustment calculations require a rather high level of accuracy in parameter precision. In addition, the learning algorithm is to be implemented on-chip. Thus, a digital implementation was selected.

Other authors have implemented hybrid architectures by designing parts of the neural network architecture using analog techniques and other parts using digital techniques, in an attempt to find the ideal architecture that will incorporate the best features of both worlds: ease of information storage, ease of interfacing, and high precision of a digital implementation with the speed and compactness of an analog implementation (De Yong, 1992; Sackinger, 1992).


2.2 On top-down design methodology


Industry has developed top-down design methodologies for its product development process flow with the goal of achieving a substantial reduction in design cycle time. These corporations publish internal design methodology guides, from specifications through the end of the entire process for creating an integrated circuit, that their design engineers must follow. Usually, these documents are proprietary to the companies that develop them and are not available to outside researchers. Several publications in the field of engineering education have addressed this topic as well. However, their treatment of the top-down design methodology for circuit design is limited to the teaching of courses in a university environment, which generally uses non-commercial software tools developed by other universities, such as the Magic VLSI CAD package developed at the University of California at Berkeley (Williams, 1991; Wolfe, 1992; Berkes, 1991) or elsewhere (Reid, 1991), or tools that are not readily available or used in the industry today (Aylor, 1986; Soma, 1988; Rucinsky, 1988; Sait, 1992). The application examples are also limited to a few well-known digital architectures, such as microprocessors, and do not give a general, step-by-step methodology that others can follow to produce the desired product. Other authors describe a simplistic top-down design methodology for logic equation implementation with a PLD, considering the topmost level to be the writing of a Boolean equation, the next level a list of signals, followed by the design documentation, the functional logic diagrams, the realizable logic diagrams and finally, at the bottom level, the detailed logic diagrams (Sandige, 1992). Gander (1994) presents a top-down design methodology for designing a multi-chip hardware system, with software tools including schematic capture, circuit simulation and PC board layout capabilities. This dissertation shall define a more exhaustive top-down design technique than the ones that can be found in the literature, without being as specific as proprietary documentation would be. This will be accomplished by using today's popular industry standards, such as the Verilog XL, Synergy and Epoch software tools, with the goal of producing a single VLSI chip that implements the desired functions.



Chapter 3 The learning algorithms

3.1 Deterministic learning algorithm


The perceptron learning algorithm will be used in the next chapter to illustrate the design methodology approach to mapping a learning algorithm to hardware. By following the design steps with a simple example, the methodology thus studied can be applied to the design of a network that implements a more complex learning algorithm. The simple but elegant perceptron learning algorithm is used to design a training module in such a fashion that each submodule adjusts one weight only. This allows for a completely cascadable design that can be used as a building block to form larger networks such as multilayer perceptrons.

Frank Rosenblatt developed a large class of artificial neural networks called perceptrons in 1958 (Rosenblatt, 1962). The typical perceptron consisted of a layer of input neurons (the retina) connected by paths with weights to a second layer called associator neurons (see Figure 3.1). The weights on the connection paths were adjusted by following a learning rule called the perceptron training rule, which uses a very powerful iterative weight adjustment. For each training input, the net calculates the response of the output unit (the calculated output) by performing the sum of the products of the weights times the inputs. The net then determines whether an error occurred for the individual pattern (by comparing the calculated output with the target, or desired, value). In this design, for example, if the desired (target) output is 1 and the calculated output is negative or zero, then the weights are adjusted by adding the input value, i.e., increasing the weights as given by

wi(new) = wi(old) + xi   (3.1)


where xi is the corresponding input for the given pattern. On the other hand, if the desired output is 1 and the calculated output is positive, the weights are not changed, i.e., they are correct. Similarly, if the desired output is -1 and the calculated output is positive, the weights are adjusted by subtracting the input value as shown below:

wi(new) = wi(old) - xi   (3.2)

On the other hand, if the desired output is -1 and the calculated output is negative, then the weights are correct and are not adjusted. Figure 3.2 shows the hierarchical module partitioning of the neural architecture, automatically generated by the software tools when the Verilog files were imported from the Unix environment into the Design Framework II environment.

Figure 3.1: The Perceptron (adapted from Khanna, 1990)


Figure 3.2: Hierarchical module partitioning


The goal of the trained neural net is to classify each input pattern as belonging, or not belonging, to a particular class. Belonging is signified by the output unit giving a positive-valued response; not belonging is indicated by a negative-valued response. The net is trained to perform this classification by the iterative technique described below (for a detailed discussion see Fausett, 1994 or Pandya and Macy, 1995):

Step 0: Initialize weights (in this design, weights are initialized to 0).
Step 1: While weights change, and iterations do not exceed the maximum allowed, do Steps 2-6.
Step 2: For each training pattern, do Steps 3-5.
Step 3: Read the pattern into the input units, xi, and the desired output, t.
Step 4: Compute the response of the output unit, y_in = Σ xi * wi, and calculate the activation function, y = sigmoid(y_in).
Step 5: Update the weights if an error occurred for this pattern:
    If t = -1, then
        if y > 0 then wi(new) = wi(old) - xi,
        else wi(new) = wi(old);
    else if t = 1, then
        if y < 0 then wi(new) = wi(old) + xi,
        else wi(new) = wi(old).
Step 6: Test stopping conditions: if no weights changed in Step 2, convergence = true (i.e., the network is trained), stop; if the number of iterations exceeds the maximum allowed, convergence = false (i.e., the network did not converge), stop; else, continue.


3.2 Original stochastic learning algorithm


The learning algorithm proposed is a cross-correlation-based, parallel, stochastic processing algorithm, i.e., the parameter changes are correlated with the changes in the parameters and in the global error calculated in previous iterations. It is inspired by earlier work done by Pandya and Harth (Pandya & Venugopal, 1994). The weight update calculations do not involve derivatives, as is the case in most other algorithms, making the algorithm simple enough for digital VLSI implementation in its original form. However, a few approximations are still necessary in order to make it suitable for hardware implementation. This learning algorithm is flexible with respect to network topology, choice of error function and neuronal types; it tends to avoid local minima due to its stochastic nature; it is independent of the derivatives which characterize gradient descent; it is able to generalize well, especially over noisy data; and it is applicable to highly parallel implementations (Pandya & Macy, 1995).

The proposed algorithm uses a stochastic procedure to find the global optimum of a function. It is suitable for VLSI implementation since the neuronal calculations are simple and there is no transfer of information between processing elements. The procedure is iterative, with the variables that determine the cost function (the weights) updated simultaneously by small increments during each iteration. Following the update, the new value of the cost function (the error) is calculated. The change in the weights depends stochastically upon the change in the value of the cost function and the change in the weights over the past two iterations. Two parameters guide the process: the step size and the effective temperature. The rule for updating the weights at the nth iteration is given by:

wij(n) = wij(n-1) + DELTA   (3.3)

where DELTA is a small positive or negative step. The choice of sign is made by generating a random number and comparing it with the probability, p, given


by:

pij(n) = 1 / (1 + exp(ETA * Delta(n) / T))   (3.4)

where ETA is the learning rate and T is a parameter that changes the slope of the sigmoid,

Delta(n) = [wij(n-1) - wij(n-2)] * [E(n-1) - E(n-2)]   (3.5)

and E(n) is the sum, or the average, of the partial errors for all the individual training patterns in a given epoch. For a detailed treatment of different error measures, such as the squared error, the square root error and the information theoretic error measure, refer to (Pandya & Venugopal, 1994). T is the temperature, which is averaged in order to anneal the learning procedure, i.e., the adjustments are larger at the beginning of the training cycle than as training advances. If the random number generated is smaller than the calculated probability, the step is added to the weight; if it is larger, the step is subtracted. As shown in Figure 3.3, the algorithm takes a biased random walk in the direction of decreasing error, E. The step size is constant, and the temperature, T, determines the effective randomness of the walk (Kondo, 1992).
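The update loop of equations 3.3-3.5 condenses into a few lines of C. The sketch below is illustrative only; the uniform random source from the standard library and the two-deep history arrays for the weight and the error are assumptions made here, not the implementation listed in Appendix A.

#include <stdlib.h>
#include <math.h>

/* One Alopex update for a single weight (eqs. 3.3-3.5).
   w[0] = w(n-1), w[1] = w(n-2); E[0] = E(n-1), E[1] = E(n-2). */
double alopex_step(const double w[2], const double E[2],
                   double DELTA, double ETA, double T)
{
    double Delta = (w[0] - w[1]) * (E[0] - E[1]);  /* eq. 3.5 */
    double p = 1.0 / (1.0 + exp(ETA * Delta / T)); /* eq. 3.4 */
    double r = (double)rand() / RAND_MAX;          /* uniform in [0,1] */

    /* random number below p: add the step; otherwise subtract it */
    return (r < p) ? w[0] + DELTA : w[0] - DELTA;  /* eq. 3.3 */
}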



Figure 3.3: Error/weight change probabilities (from Pandya, 1995)

Several publications have studied the application of this algorithm to solve optimization problems, image recognition, the XOR problem and parity problems. Encoder problems of different sizes have been used to investigate the scaling properties of Alopex. These authors show that convergence times are comparable to those taken by backpropagation (Unnikrishnan, 1994). The learning ability of Alopex is shown in work done by (Thrun, 1991) and (Peterson, 1989), who solve the MONK's and the mirror symmetry problems used in benchmarking. In all these experiments, the same learning module was used for diverse architectures and problems, thus proving the flexibility of the algorithm in solving a wide array of problems. For this thesis, Alopex was implemented in the C language (see program listing in Appendix A) to study the behavior of the algorithm at a higher level, and to study the algorithmic properties and network architecture to be implemented in case of proposed changes. Two versions of the architecture were studied: a fully connected network with a


single layer of hidden neurons, and another version with two hidden layers of neurons. Both versions allow the user to determine, through the input training files, the number of input neurons, the number of neurons in the hidden layer(s) and the number of neurons in the output layer. Both versions were applied to solve three types of problems: a linearly separable problem, the exclusive-or problem, and an image recognition problem. Two error measures were used: logarithmic and squared error. For a detailed discussion of the effect of different error measures on the speed of convergence, refer to (Pandya & Venugopal, 1994). Figure 3.4 shows a flowchart and code segment for the C language implementation of the Alopex training algorithm with one hidden layer of neurons that solves the mentioned problems. The program listing in Appendix A implements the two-hidden-layer version:


Figure 3.4: Flowchart for Alopex C language implementation
Figure 3.4 (continued): b) training procedure
Figure 3.4 (continued): c) testing procedure
Definition of global variables;

int readinput(char *trfile);                 /* read input file */
void initweights(void);                      /* initialize weights */
void preferences(long *iter_max, double *ETA, double *error_max,
                 double *DELTA, double *T, double *To,
                 char *trfile, char *ttfile, int *action);   /* user parameters */
void train(double ETA, long iter_max, double error_max,
           double DELTA, double T, double To);               /* training algorithm */
void saveweights(char *trfile);              /* store weights of trained network in a file */
int restoreweight(char *trfile);             /* restore weights of trained network if needed for testing */
void decode(char *ttfile, char *trfile, double To);          /* implement testing (i.e., recall) phase */
void change_ext(char infilename[], char outfilename[], char extension[]);
                                             /* save weights in a file with the same name as the
                                                training patterns file but a different extension */

main()
{
    Definition of local variables;
    /* start infinite loop */
    get input from user: max. # of iterations allowed; learning rate (ETA);
        temperature (T); threshold (To); step (DELTA); max. allowable error;
        desired action: 1 = test, 2 = train, 0 = quit;
        name of (training/testing) file;
    if (action == test) {
        decode();        /* test file */
    }
    if (action == train) {
        readinput();     /* read input training file */
        initweights();   /* initialize weights randomly */
        train();         /* implement training algorithm */
        saveweights();   /* save weights for testing */
    }
    else
        exit(1);
}   /* end infinite loop */

train:
while ((iter_curr <= iter_max) && (error_max < error_current))
{
    for each iteration do:
        anneal T: take the average of the last 10 iterations if (iter_curr % 10) == 0;
        for each input pattern do:
            for each hidden neuron do: calculate the outputs of the hidden neurons:
                { SUM = Σ weights(hid_to_inp) * training_dat;
                  SUM = (SUM - threshold) / To;   (limit the values)
                  hid_out = 1 / (1 + exp(-SUM)); }
            for each output neuron do: calculate the output of the output neurons:
                { SUM = Σ weights(hid_to_out) * hid_out;
                  SUM = (SUM - threshold) / To;   (limit the values)
                  out_out = 1 / (1 + exp(-SUM)); }
            calculate the error for the output neurons:
                { if (target_output == 0) out_error = log10(1 / (1 - out_out));
                  else out_error = log10(1 / out_out); }
            if (record # == max_num_rec)          /* after all patterns done */
            {
                calculate the error for this iteration:
                    sum_error = Σ out_error; save in error_curr; set numrec = 0;
                calculate the new weights for the hidden neurons:
                for each hidden neuron (in each hidden layer) do:
                    { Delta = (w(n) - w(n-1)) * (error_curr(n) - error_curr(n-1));
                      (limit the value of Delta)
                      P = 1 / (1 + exp(learning_rate * Delta / T));
                      if (P > rand) new_weight = old_weight + DELTA;
                      else new_weight = old_weight - DELTA; }
                calculate the new weights for the output neurons:
                for each output neuron do:
                    { Delta = (w(n) - w(n-1)) * (error_curr(n) - error_curr(n-1));
                      (limit the value of Delta)
                      P = 1 / (1 + exp(learning_rate * Delta / T));
                      if (P > rand) new_weight = old_weight + DELTA;
                      else new_weight = old_weight - DELTA; }
            }
}   /* end while */

decode:   /* test */
{
    read test_file;
    restoreweights;
    for all records do:
        { calculate output for hidden neurons;
          calculate output for output neurons; }
}

Support routines: readinput, initweights, preferences, saveweights, restoreweights, change_ext.

3.3 Modifications made for VLSI implementation


After the algorithm was thoroughly tested using a high-level language, an HDL version of it was attempted, using all the constructs and data types available in Verilog HDL. This exposed the first round of challenges in hardware implementability, such as the absence of a library of mathematical functions, e.g., the exp function, needed to implement the sigmoidal and probability equations of the algorithm. In C, values are real, i.e., continuous, while in an HDL, due to limitations imposed by hardware bits, all values are discretized. This may degrade the computational ability of the algorithm, so adjustments are required. Another challenge relates to information storage, since Verilog allows for two-dimensional arrays only, implementing an array as a memory element with one dimension being the number of bits in the word and the other the number of locations. Typically, the weight elements in a C language implementation of a neural network involve three-dimensional arrays, i.e., the layer number and two indices identifying the neurons being connected. Verilog would not allow for such an implementation.

The next step was to implement the learning algorithm using the subset of the HDL that the current synthesis tools accept and thus produce a gate level schematic automatically. Several problems were encountered, since the synthesis tool, Synergy, allows for the register data type only. However, the learning algorithm is heavily dependent on real numbers, exponentiations, random number generation and divisions. This issue brought about several modifications to the original algorithm:


1) the implementation of a digital approximation of the sigmoidal transfer function;

2) the implementation of a new weight adjustment mechanism excluding the exponentiation and division operations, and including saturation arithmetic (a C sketch of such a limiting operation follows this discussion). For example, after one of the iterative backannotation steps, the need for a limiting operation to saturate the values of the weights became apparent in order to obtain correct results. This is not the case with the original Alopex algorithm;

3) a new interpretation of the weights of the bit positions in a register, to represent integer and fractional values, since neuronal outputs are between 0 and 1 while weights have both integer and fractional parts. The use of floating point (real) numbers was discarded to reduce the complexity of the hardware implementation. Neurons are simple processing elements performing simple operations; to include complex mathematical functions and floating point arithmetic in each neuron would be impractical and too expensive in terms of silicon area;

4) the implementation of a scaling down mechanism, with rounding and limiting, to solve the problems of I/O port consistency and small limit cycles, i.e., parasitic oscillations (Ramacher, 1991).

These changes were studied during the HDL behavioral phase of the implementation. They were made in an iterative manner, using backannotation: the results of simulations led to the trial of different design options, resulting in different implementations addressing the problems mentioned above. For example, during one of the Alopex simulations of a linearly separable problem, it was noted that the network converged for several sets of input patterns but did not converge for one specific set. After studying its behavior, it was noted that the network was classifying the patterns by drawing a line at a different location than the desired one. All of the patterns were classified correctly except one that was grouped with the patterns belonging to an incorrect class. The problem was solved by increasing the amount of



noise in the original weight adjustment mechanism. The addition of more noise effectively pulled the network out of its local minimum and allowed it to converge rapidly. These iterative steps allowed several modifications of the algorithms and design trade-offs to be studied during the simulation phase, without requiring an actual hardware implementation. These simulations, using an HDL and following a modeling style suitable for later producing a synthesizable design, proved invaluable in that they brought the design concept to the hardware level without requiring the time and expense of an actual hardware implementation, including schematic drawings, wire wrapping and testing of the prototype, every time a change was made. Simulating a hardware implementation using a high-level language, such as C, is also not sufficient to guarantee that a hardware implementation derived from it would function as required, since many issues are transparent to the programmer and are taken care of by the high-level constructs, mathematical functions, input/output routines, etc.
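As an illustration of the saturation arithmetic mentioned in item 2 above, the limiting operation amounts to clamping every update to the representable range. The following C sketch assumes an 8-bit signed range purely for illustration; the actual bound follows from the register width chosen in the design.

/* Saturating addition: instead of wrapping around on overflow,
   the result is pinned at the limits of the 8-bit signed range. */
int saturating_add(int a, int b)
{
    int sum = a + b;
    if (sum > 127)  return 127;    /* clamp at the positive limit */
    if (sum < -128) return -128;   /* clamp at the negative limit */
    return sum;
}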

3.3.1. Digital Sigmoid:


Typically, implementing a sigmoidal activation function such as the one in equation 3.6 using a lookup table in each neuron would require a very large memory if high precision is required. In addition, each value of T changes the shape of the sigmoid, thus requiring numerous lookup tables, i.e., one for each value of T. One solution is to hardwire it as a polynomial that implements an approximation to the sigmoid

f(net) = 1 / (1 + exp(-net / T))   (3.6)

by the following relationships:

y = 1                         if x >= X_M        (3.7)
y = 1 - λ(X_M - x)^2          for X_M > x >= 0   (3.8)
y = λ(X_M + x)^2              for -X_M < x < 0   (3.9)
y = 0                         if x <= -X_M       (3.10)

Figure 3.5 shows these relationships by plotting y vs. x using 12-bit precision for y and the squares function stated above. A very close approximation to the sigmoid nonlinearities may be generated using such relationships. Note that the λ value allows for the change in the slope of the sigmoid, similar to the T parameter in equation 3.6.
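Rendered in C, the piecewise-quadratic approximation of equations 3.7-3.10 is a handful of comparisons and one multiply. The reading of equation 3.9 as λ(X_M + x)^2, and the choice λ = 0.5/X_M^2 that makes the curve pass through y(0) = 0.5, are reconstructions assumed here for illustration.

/* Piecewise-quadratic sigmoid approximation (eqs. 3.7-3.10).
   X_M is the saturation point; lambda sets the slope, playing the
   role that T plays in eq. 3.6. */
double digital_sigmoid(double x, double X_M, double lambda)
{
    if (x >= X_M)                                    /* eq. 3.7 */
        return 1.0;
    if (x >= 0.0)                                    /* eq. 3.8 */
        return 1.0 - lambda * (X_M - x) * (X_M - x);
    if (x > -X_M)                                    /* eq. 3.9 */
        return lambda * (X_M + x) * (X_M + x);
    return 0.0;                                      /* eq. 3.10 */
}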

Figure 3.5: Digital sigmoid implementation (adapted from Ramacher, 1991)

The Verilog HDL description for the sigmoid is included in Appendix A. Figure 3.6 shows several sigmoid shapes resulting from varying the parameter λ. The implementation shown is for an 8-bit input to the sigmoid HDL module, corresponding to x = net = Σ wi * xi, and a 16-bit output from the sigmoid module, corresponding to y = f(net). To determine the accuracy of the sigmoid shape used in this design, the range of values of net was plotted along the horizontal axis from decimal 0 (hex 00) to decimal 127 (hex 7f), corresponding to positive values of x in



equations 3.7-3.10. In this case, the sigmoid (y values) increases from 0.5 to 1. For values of net between decimal 128 (hex 80) and decimal 255 (hex ff), corresponding to negative values of x, the sigmoid increases from 0 to 0.5. This is consistent with the graph of Figure 3.5 above.

The Verilog code was synthesized and simulated using Verilog XL. Timing waveforms of the simulated behavioral description are shown in Figure 3.6-b. An initial estimate of the delay can be made by looking at the timing waveforms. For a clock period of 50 ns, the time between a new value of x read by the module and the corresponding output value generated is 260 ns. This is equivalent to about five clock periods, which is consistent with the behavioral HDL description, i.e., for one of the branches the program takes (case x = 03, for example) there are five assignment statements synchronized with the positive edge of the clock. In this case, the last else-if branch of the named block (sigmoid) is taken, requiring five clock periods to execute.

Several iterations of the behavioral HDL description were necessary until a satisfactory design was achieved. A first HDL description produced a synthesizable circuit, i.e., it passed the synthesizability check done by Synergy. Thus, it was synthesized, giving a gate level schematic that included a functional block for the multiplication operations. A first estimate of the silicon area and gate count was given by Synergy as 50892.61 units (a unit corresponds to one square micron if 1-micron design rules are used, four square microns if a 2-micron design rule technology is used, etc., during the IC layout step), for a total of 142 cells (i.e., gates, multiplexers, etc.). When the netlist generated by Synergy for the gate level schematic was used for placement and routing by Epoch, it failed because of the functional multiplication module. Thus, the behavioral description was refined and an 8-bit multiplier was designed by hand. The new HDL description was simulated using Verilog XL, then modified and tested several times until a satisfactory behavior was achieved (the Verilog behavioral description is included in Appendix A). These timing


waveforms are included in Figure 3.6-c. The timing estimate is now about 210 ns, using a 20 ns clock period, i.e., about 10.5 clock periods between a new input value and a new output value generated by the sigmoid module. The synthesized schematic now takes a silicon area equivalent to 188,060.22 units, for a total of 504 cells. The circuit was then placed and routed, producing the layout shown in Figure 3.6-d. The area is now given as 0.201 square millimeters by the placement and routing


tools. (3-V logic, a feature size of 1/2 micron, 3-micron-wide metal and 1-micron-wide poly were used as the ruleset option in Epoch.) The design was backannotated for timing and functional verification using Verilog XL.

Figure 3.6: a) Digital sigmoid transfer function for several values of λ; b) timing waveforms of the behavioral HDL description; c) synthesized schematic; and d) chip layout



3.3.2. Modification of the original weight adjustment mechanism:


The weight adjustment mechanism for the Alopex algorithm given in equations 3.3-3.5 requires the implementation of the probability function, p, as a means of adding stochasticity to the learning process. This randomness allows the network to converge even when it is trapped in a local minimum of the error, by shaking it, i.e., adding some noise that may increase the error but that may also allow the network to jump out of this false convergence point. As mentioned above, the hardware implementation of the exp function is not synthesizable from an HDL description by the present tools unless it is approximated by a series expansion, requiring several terms with factorial and division operations. The C language implementation of the algorithm successfully implements this probability function by calculating the value of p in equation 3.4 and comparing the result with a random number between -1 and +1 generated by the compiler. Depending on whether the random number is greater or smaller than the calculated value of p, a small step is added to or subtracted from the present weight value, resulting in a new weight value, implementing equation 3.3. However, for the hardware implementation to be feasible and result in a neuron with minimum silicon area, the weight adjustment was modified to follow the original rule proposed by Harth (Harth, 1976). This original weight adjustment mechanism is better suited for a digital hardware implementation since its circuit consists only of adders and multipliers. It implements the following equations:

wi(n) = wi(n-1) - x + noise              (3.11)

where

x = ΔE * ΔWi                             (3.12)
ΔE = Error(n-1) - Error(n-2)             (3.13)
ΔWi = wi(n-1) - wi(n-2)                  (3.14)

The feedback is given by x, i.e., the network learns by adjusting its weights using a measure of the global performance of the system in its calculations. This measure of global performance is the


same as in equation 3.5, effectively incorporating information on the past behavior of the network into the rule that modifies the present weight values. The noise in equation 3.11 is what allows the network to randomly find its way to convergence, as mentioned above. The original algorithm proposed by Harth uses Gaussian noise, with its characteristic uneven distribution of values centered around a point. The noise used in the present hardware implementation is white in nature, i.e., it is evenly distributed between -1 and +1. A random number generator was designed and described behaviorally using Verilog. It simulates the RBG1210 Random Bit Generator, which produces truly random bits. The chip is commercially available; thus, the present hardware implementation of the learning algorithm contains a functional block that simulates its function. This block is not synthesizable with the tools available today, so it is considered to be off-chip. The RBG1210 is based on naturally occurring Johnson noise, which requires no initial seed value. Each new value is independent of all previous values; it is not pseudo-random, and there is no repetition, giving an infinite cycle size. The Johnson noise is generated by a resistor connected to a noisy amplifier which amplifies the thermal noise due to the random motion of free electrons in the resistor. This white noise signal is then converted to a stream of binary levels by an A/D converter (Newbridge, 1992).
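As an illustration, the following is a minimal behavioral sketch of the weight update of equations 3.11 - 3.14, written with plain signed registers rather than the fixed-point formats of section 3.3.3; the module and signal names are illustrative and do not correspond to the actual weight-unit module described in chapter 5.

// A minimal behavioral sketch of equations 3.11 - 3.14; all names are
// illustrative, and the fixed-point scaling of the real design is omitted.
module weight_update_sketch(clk, error, noise_step, w);
  input                    clk;
  input  signed [15:0]     error;       // Error(n-1), broadcast error measure
  input  signed [15:0]     noise_step;  // small random step from the noise bit
  output reg signed [15:0] w;           // present weight, w(n-1)

  reg signed [15:0] w_prev;             // w(n-2)
  reg signed [15:0] error_prev;         // Error(n-2)

  wire signed [15:0] delta_E = error - error_prev;   // equation 3.13
  wire signed [15:0] delta_w = w - w_prev;           // equation 3.14
  wire signed [31:0] x       = delta_E * delta_w;    // equation 3.12 (feedback)

  initial begin w = 0; w_prev = 0; error_prev = 0; end

  always @(posedge clk) begin
    w_prev     <= w;
    error_prev <= error;
    w          <= w - x[15:0] + noise_step;          // equation 3.11
  end
endmodule

In the actual implementation, the product of ΔE and Δw would be computed with the signed fractional multiplier described later, and the noise step derived from the serial random bit stream discussed above.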

3.3.3. Data representation:


The next important issue that needs to be addressed is data representation, since the present technology and software tools for synthesizing an HDL description only allow for integer manipulation while the learning algorithms involve real numbers. The data format selected allows for representation of fractional numbers in addition to integers. One of the reasons for selecting this type of format is that fractional numbers can be more easily rounded, i.e., the least significant bits of a product can be rounded into the most significant bits of the product naturally, and the most significant bits of the product have the same format as the inputs. The addition and subtraction of fractional numbers require no additional modifications to the hardware for manipulating integers, i.e., addition and subtraction are


the same whether integer or fractional data is used; thus no additional circuitry is required when using either data representation. For fractional number multiplication, an additional scaling or shifting is needed. A two's-complement format was selected to handle negative numbers. For the contents of a hardware register to be considered as fractional bits instead of integer bits, the most significant bit is the sign bit, and the rest of the bits represent positive powers of two for the integer portion and negative powers of two for the fractional portion. The imaginary fixed binary point is located after the integer bits, as shown. This results in the following format:

sign i i i . f f f f, with the following bit weights: (sign) 2² 2¹ 2⁰ . 2⁻¹ 2⁻² 2⁻³ 2⁻⁴

This format was used for the 8-bit weights, allowing them to take on values between +7.9375₁₀ (7f₁₆) and -7.9375₁₀ (81₁₆). For the 8-bit inputs, the format selected was the following: sign . f f f f f f f, with the following bit weights: (sign) . 2⁻¹ 2⁻² 2⁻³ 2⁻⁴ 2⁻⁵ 2⁻⁶ 2⁻⁷

This allows the inputs and outputs of the neurons to take on values between +0.9921875₁₀ (7f₁₆) and -0.9921875₁₀ (81₁₆). If 16-bit representations are used, for example, then the inputs and outputs would have values between ±0.9999695₁₀ for the format shown below: sign . f f f f f f f f f f f f f f f

with bit weights (sign) . 2⁻¹ ... 2⁻¹⁵

It is easy to see that a simple truncation of the lower 8 bits of a 16-bit number will produce an 8-bit number that is very close in value to the original 16-bit number, since the most significant bits are preserved. This is not the case if the bit positions are for integers instead of fractional data. For example, a 16-bit result of multiplying two 8-bit numbers may result in 7fff₁₆ = 0.9999695₁₀ using the above


format. Truncating this 16-bit number to an 8-bit number, so it can be output to the next neuronal layer to be used as an 8-bit input, results in 7f₁₆ = 0.9921875₁₀ using the above format (a 0.7% error). If integer representation is used for the neuronal inputs and outputs, then 7fff₁₆ = 32767₁₀. Merely truncating the lower (or upper) 8 bits will result in an unacceptably large error, since 7f₁₆ = 127₁₀. Obviously, a different mechanism, more costly in timing and silicon area, would be needed to maintain calculation accuracy. When numbers are multiplied and added, such as when the weights are multiplied by the inputs and their results are accumulated in a given register, this register has additional extension bits to allow for word growth. For example, when multiplying an 8-bit weight times an 8-bit input, a 16-bit register is needed to store the result. If several of these multiplications are to be accumulated in a register, several extension bits may be needed to avoid overflow. The accumulator registers chosen for the Alopex implementation are 19-bit registers, i.e., they allow for 3 bits of extension since 16 bits are needed to store the result of multiplying two 8-bit numbers. This is important when rounding and truncation are required, such as, for example, when a 19-bit number, the output of the neuron, has to be reduced to 8 bits to be used as the input of the next level of neurons, since the data path is 8 bits wide. If any of these extension bits is on, indicating that the summation exceeds the maximum number representation, then limiting arithmetic is used. That is, the value of the register is clipped at the maximum or minimum allowed (the saturation value of the neurons), consistent with the size of the register to be stored and the sign of the accumulator register (the most significant bit of the extension register). Please refer to chapter 5 for specific implementation issues and examples.

A signed multiplication circuit was also implemented to handle signed operands and the scaling needed when handling fractional data. An example of two numbers being multiplied together is given in Figure 3.7 below. The difference between signed integer and signed fractional data multiplication is that in hardware multipliers that handle integer data the extra sign bit is used as a duplicate sign bit (there is an extra sign bit because two sign bits exist before the multiplication and only one is needed in the result). Hardware multipliers use this extra sign bit as a sign extension bit. However, in fractional


multiplication, this extra sign bit is appended after the least significant bit as a zero bit. In the hardware implementation of the learning algorithms, this is accomplished by a shift left of the result of the multiplication with a zero fill in the least significant bit position. Thus, the same hardware multiplier is used as for integer multiplication, but the final result is shifted left to produce a correct fractional value.

(Diagram: a signed multiplicand times a signed multiplier yields a most significant product and a least significant product; the integer case keeps the extra bit as a sign extension, while the fractional case appends a zero after the least significant bit.)

Figure 3.7: Integer/fractional multiplication comparison


Chapter 4 TOP-DOWN DESIGN METHODOLOGY


Currently, a top-down design methodology is preferred over a bottom-up design methodology. The latter resembles the steps involved in breadboarding the hardware system. It gives a poor functional view of the entire system, is time consuming, and does not allow the engineering team to work concurrently, because individual pieces of the design have to be completed first (Thomas, 1991). On the other hand, top-down design begins with an HDL functional model of the top level system. Software design, simulation, synthesis, analysis and layout tools allow the designer to functionally describe, simulate and test a complete design architecture at the highest level of abstraction, i.e., specifying only a set of inputs, outputs, and its functionality. Similar to the concept of structured programming in software, the hardware system is partitioned into modules, as independent of each other as possible, after the overall system behavior is satisfactory. Each module is then optimized with respect to speed, behavior and area, tested, and plugged back into the overall system architecture for behavioral verification. As each module is modified it can be incorporated back into the system architecture for verification at the system level. This design methodology allows for system performance verification early in the design process, saving many hours of work in achieving the desired performance. It also allows the engineering team to work concurrently since, once agreed on the desired system specifications and after a preliminary module partitioning is made, each team member can take on the design of one of the modules. In this way, the complete design is developed concurrently.

Verilog is a Hardware Description Language for both behavioral and structural modeling that is becoming a standard among commercial users. Descriptions using Verilog, in some cases, result in code that is much more compact than VHDL, the other prominent hardware description language currently in use (Sternheim, 1993).


Verilog can be used to model a digital hardware system at many levels of abstraction, ranging from the algorithmic level (similar to a high level programming language implementation such as C or Pascal) to the gate level or to the switch (transistor) level. A model specifies the external view of the device and one or more internal views. The internal view of the device specifies its functionality or structure, while the external view specifies the interface, or connectivity, of the device through which it communicates with the other modules within the system (Thomas, 1991). The HDL can describe the functionality, connectivity and timing of a hardware circuit. Using HDLs makes the design of complex systems more manageable since the designer only needs to change the HDL description, instead of reconfiguring and re-wiring a hardware prototype, if changes or upgrades are called for. The designer can try different design options quickly and easily, so the time and cost to fix a design problem are significantly reduced.

As in a software environment, in which there is a section of code that contains the higher level control statements, one of these partitions will be the control unit module. The control module is not decomposed further until all other modules are successively decomposed into simpler, more independent structures and no further refinement is possible, i.e., until all submodules are ideally implemented with other module instantiations and basic library components (Coiner, 1983). This may not be possible in all cases, especially if the system under study is a learning algorithm, for which module partitioning may remain at a higher functional level of abstraction due to the highly abstract nature of the algorithm itself or due to the researcher's inexperience with computer architectures. Once the control and timing signals are defined, and the system performs satisfactorily, the control unit is optimized by implementing it as, for example, a state machine.

It is at this stage of the design process that the concepts of top-down structured programming and top-down hardware design become different. In software, the simulations are generally run using a single CPU, thus executing in sequence the statements included within each module, and each module waits while another is executed. In a hardware simulation environment, such as with the Verilog HDL simulator we used, all the modules appear to operate in parallel, controlled by the control unit which generates the necessary timing and control signals for each module. This simulator is event driven, i.e., only those elements that might cause a change in the circuit state (about 2 to 10 percent of circuit components at any given time) are evaluated and simulated, as opposed to other


simulators that are time driven and evaluate each element at each point in time, producing a new circuit state at each point in time. Event driven simulation has the concept of a time wheel embedded, i.e., time advances only when every event scheduled at that time is executed (simulating a concurrent execution environment) and the time wheel can only advance forward (simulating the real-life passage of time). This allows for a more accurate simulation of a hardware implementation.

This approach is in contrast to the traditional bottom-up design methodology, in which each system module is designed, implemented as a prototype, tested, modified, etc. After it performs satisfactorily, it is included as a building block upon which a larger module is built. For example, a one-bit full adder is made up of half adder units. Several one-bit full adders are put together to build a larger adder, and so on. Not until the complete ALU is finished can the designer verify its overall performance. With a top-down design methodology, a block diagram of an ALU is described in a high level HDL, simulated, tested, and then partitioned into modules, such as an adder, a multiplier, a shifter, etc. Each individual module is refined and optimized, and plugged back into the system until the desired behavior and performance are achieved. Several iterations may be needed at each design step, with the designer going back and forth between the product specifications and the design. Once a satisfactory performance is


achieved, the design is automatically synthesized. The gate level schematic thus produced is verified, comparing its behavior to the functional description. Finally, the design is automatically placed and routed by the tools. After the design is backannotated and verified, it is sent for fabrication. The fabricated chip is then tested. The availability of these sophisticated software tools has made it possible to complete the design process in less time; the cost of building and testing prototypes, verifying and rebuilding them, etc., has been all but eliminated, resulting in a faster, cheaper design process, shortening the design turnaround time and allowing increasingly complex architectures to be automatically implemented in a VLSI circuit.

Figure 4.1 shows the flow chart of a complete top-down design process, from the design concept to the chip fabrication stage.

Figure 4.1: Top-down design steps


The following is a suggested step-by-step method for mapping a learning algorithm to hardware, exemplified by the implementation of a single layer perceptron. The architecture is rather simple, but it serves the purpose of illustrating the design steps. These steps are just a guide for the beginner or the inexperienced designer to easily design a hardware implementation of a general purpose, fully connected neural network architecture with the goal of investigating the performance and behavior of a learning algorithm. This work does not attempt to replace the expertise of an experienced digital circuit designer; it merely attempts to guide the novice designer through the steps in the design process, with the hope that the truly parallel nature of a neural architecture with on-chip learning capabilities can be studied. First step: C language implementation: This step is optional but highly recommended since, through a high level language implementation, the designer can understand every detail of the algorithm and initiate the process of identifying the block of the program that contains the control statements. This portion will become the control unit of the hardware implementation in the fourth step.

Second step: HDL description: The complete HDL behavioral description is achieved by following the programming logic developed using the C language and adapting it to satisfy the constraints imposed by Verilog or any other HDL that is chosen. Since the final objective is to produce a gate level description, this step may require the designer to limit the choices of language constructs and data types to those that the synthesis tools understand. This step may require some minor, or major, adjustments to the original algorithm, especially for algorithms that use floating point numbers and hyperbolic transfer functions. For these cases, chapter 3 described some modifications made to a learning algorithm to make it synthesizable and VLSI-implementable. In cases where this is not possible, the algorithm may be tested at this level of abstraction, with non-synthesizable blocks simulated

at the behavioral level and blocks that may be synthesized simulated at the gate level. Mixed-mode simulation is a powerful analysis tool available in cases such as this. Still, the event-driven simulator is capable of simulating concurrent processing of parallel blocks of code, allowing the researcher to experiment with the learning algorithm even though a final chip layout could not be automatically produced by the synthesis and layout tools available today. The complete software listing developed in Verilog for the perceptron algorithm is in Appendix A. Please

refer to portions of it during the following discussion. Figure 4.2 shows a sample run of the behavioral description, with the steps taken during a typical iteration. A set of two patterns is tested, giving the correct classification. The format of the data used for the perceptron implementation is the following:

weights = w[3:0] = s i . f f, with bit weights (sign) 2⁰ . 2⁻¹ 2⁻²
inputs = x[3:0] = s . f f f, with bit weights (sign) . 2⁻¹ 2⁻² 2⁻³

Thus, the values of the hexadecimal numbers shown in the sample run of Figure 4.2 are the following:

input x = 2₁₆ = +0.25₁₀          weight w = 2₁₆ = +0.5₁₀
input x = f₁₆ = -0.125₁₀         weight w = f₁₆ = -0.25₁₀
input x = e₁₆ = -0.25₁₀          weight w = e₁₆ = -0.5₁₀
input x = 1₁₆ = +0.125₁₀

The accumulator register where the sum of products (Σ xᵢ * wᵢ) is to be stored is a 12-bit register. This allows for accumulation without overflowing. It has the following format:

s s s s i i . f f f f f 0


(Console transcript of the behavioral simulation: thirteen training iterations, each printing the training patterns, the desired output, the products w*x inside the multiplier, the computed w1*x1 + w2*x2, and the updated weights w_1 and w_2; followed by the testing of the trained network, the class assigned to each test pattern, and the Verilog-XL end-of-simulation statistics.)

Figure 4.2: Sample run showing the results of 13 iterations of the perceptron algorithm


The starting values of the weights are zero. After 13 epochs, the values at convergence are w1 = 2₁₆ = 0.5₁₀ and w2 = f₁₆ = -0.25₁₀, using the data format explained above. Convergence is determined when the weights have not changed from the previous iteration. In the sample run shown above, four patterns were used for training; thus, when the weights remain unchanged for a complete set of training patterns, convergence is achieved and the system has finished the training session. After training, testing results are shown where pattern (1, 0)₁₆ = (0.125, 0)₁₀ is classified into class 1 (output = 1), and pattern (f, 2)₁₆ = (-0.125, +0.25)₁₀ is classified into class 2 (output = f).

Third Step: System clock and power-on-reset modules: The system clock module may contain timing information that may not be synthesizable; thus, it is suggested that a separate module for a single-phase, or a two-phase, clock be created at this point. Timing and delay information may be ignored by the synthesis tools since this information is technology dependent. In addition, parameter initialization statements, i.e., statements that are executed only once, at the beginning of the simulation, are not synthesizable. It is always advisable to write HDL descriptions that are portable, i.e., technology independent, so the design can be easily modified or re-synthesized for a different technology without having to make significant changes in the code. The module, m(clk), generates a 20 MHz single-phase system clock signal, clk, that synchronizes all the other system modules. The system clock frequency was modified several times during simulations to study design performance trade-offs. A clock frequency of 20 MHz was selected so as to obtain results that could be compared to other published work (Melton et al., 1992). However, as the results of the synthesis steps will show, a faster clock frequency could have been used. This and other design option optimizations are currently being studied. The power-on-reset module generates a signal called reset that resets all the other system modules at the start of the simulation. After a delay of half a clock cycle, it is deasserted, allowing


the network to function. To have similar timing simulations for both the behavioral and the gate level (synthesized) descriptions, the reset line should only be asserted once, at the beginning of the simulation.
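For illustration, minimal sketches of these two non-synthesizable support modules are shown below; the module names match those used in the chapter 5 structural listing (Figure 5.4), but the bodies here are illustrative rather than the actual descriptions.

`timescale 1ns / 1ns
// A minimal sketch of the system clock and power-on-reset modules; the
// 50 ns period corresponds to the 20 MHz clock discussed above.
module m(clk);                    // 20 MHz single-phase system clock
  output reg clk;
  initial clk = 1'b0;
  always #25 clk = ~clk;          // 25 ns half-period gives a 50 ns period
endmodule

module power_on_reset(reset);
  output reg reset;
  initial begin
    reset = 1'b1;                 // asserted once, at the start of simulation
    #25 reset = 1'b0;             // deasserted after half a clock cycle
  end
endmodule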

Fourth Step: Preliminary module partitioning: A first level of partitioning separates the control statements, which will make up the control unit of the system, from the rest of the behavioral description, which will perform the training. One module contains the statements that implement the weight adjustment mechanism. Another module contains the control statements, which include the decision making portion of the learning algorithm that determines when to start and stop training, i.e., when the network has learned the input/output associations for a given problem it terminates the training mode. At this stage, the control unit will change the network to testing mode and generate all the necessary control signals for the coordination of the activities of the other system modules. In the present implementations, the control unit module contains a ROM to store the training and testing patterns. For the perceptron learning machine, the control unit determines that the network is trained once the weights do not change for one complete set of training patterns. Thus, the control module requires information from the training modules to determine when convergence has occurred, increasing the complexity of the system's connectivity. This is not the case for the application described in the next chapter, the Alopex learning algorithm, which greatly simplifies the network architecture since the weight information is local to each weight unit. To determine convergence, a measure of the error is implemented within the control unit itself and is broadcast to all the weight units concurrently. The weight units then update their values using this measure of global performance and information available in their local memories, as described in chapter 5. In applications in which the algorithm requires an external critic, it may be treated as an off-chip I/O signal. The modules described below address a neuronal unit of two inputs and two weights. This unit can be extended to more weights


by increasing the size of the weight vector in the control unit and the size of the input vector (xᵢ) in the training unit.
Control unit module: The inputs and outputs are the following: Inputs: two 4-bit weights, w1[3:0], w2[3:0]; one 1-bit handshake signal, done; clock signal, clk; reset signal, reset. Outputs: two 4-bit training patterns (the xᵢ's), traindat_mtx_1[3:0], traindat_mtx_2[3:0]; one 4-bit training pattern (the desired output, tⱼ's), out[3:0]; one 1-bit enable signal, enable_train.

Training module: The inputs and outputs are the following: Inputs: two 4-bit input patterns, x1[3:0], x2[3:0]; one 4-bit desired output, out[3:0]; one 1-bit enable signal, enable_train; clock signal, clk; reset signal, reset. Outputs: two 4-bit weights, w1[3:0], w2[3:0]; one 1-bit handshake signal, done.

Again, in this case, the network architecture has been limited to two input neurons (i.e., two weights), but it can be cascaded for bigger applications. The pseudocode for the perceptron learning algorithm is the following (a sketch of the corresponding training-unit skeleton follows the pseudocode):

Control Unit:
    while convergence = false
        for each epoch do
            for each pattern do
                output to the training unit the training pattern stored in ROM
                assert control signal (enable_train)
                wait for the training unit to adjust the weights (i.e., wait for the done signal)

Training Unit:
    if (enable_train) do:
        read in training pattern
        calculate the net weighted input, i.e., Σ xᵢ * wᵢ
        adjust weights if a pattern is incorrectly classified
        output new weights to the control unit
        assert done signal for one clock period

Control Unit (once the done signal is asserted):
    read in new weights
    increment iteration number by one
    if weights did not change (i.e., correct classification), increment count by one
    save new weights
    continue with the next pattern
    when the weights have not changed for all of the training patterns at least once,
    i.e., count = maxrec = 4, training is finished; else continue with the next epoch

Control Unit: test the network with a new set of patterns not used for training.
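A minimal Verilog skeleton of the training-unit side of this handshake might look as follows; the port names match the I/O lists above, but the body is elided and this is a sketch, not the actual module from Appendix A.

// Sketch of the training-unit handshake: done is pulsed for one clock
// period after the weights have been adjusted; the adjustment itself is
// elided here.
module train_sketch(clk, reset, enable_train, x1, x2, out, w1, w2, done);
  input clk, reset, enable_train;
  input [3:0] x1, x2, out;
  output reg [3:0] w1, w2;
  output reg done;

  always @(posedge clk) begin
    if (reset) begin
      w1 <= 4'h0; w2 <= 4'h0; done <= 1'b0;   // weights start at zero
    end else begin
      done <= 1'b0;                           // done is high for one clock only
      if (enable_train) begin
        // ... compute x1*w1 + x2*w2 and apply the perceptron rule ...
        done <= 1'b1;                         // handshake back to control unit
      end
    end
  end
endmodule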


The multiplication task was implemented as a separate function, mult, since the Verilog symbol for multiplication (*) performed unsigned multiplication only and did not generate a gate level implementation that could later be placed and routed. The signed multiplication routine implements this function by stripping the sign bit off the multiplier and the multiplicand. It then takes the two's complement of a negative number (sign bit = 1) to obtain its absolute value. Then, it performs unsigned multiplication of the positive numbers by a series of shifts and adds, illustrated in the sketch below. It then calculates the sign bit of the result by taking the exclusive-or of the sign bits of the two numbers being multiplied, and adjusts the final result, i.e., takes the two's complement again, if the result is declared to be negative, i.e., its sign bit is 1.
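For illustration, the unsigned shift-and-add step could take the following shape for the 4-bit operands used here; the function name and packaging are assumptions, not the actual mult routine, and the function would sit inside the training module.

// Sketch of unsigned shift-and-add multiplication for 4-bit operands whose
// sign bits have already been stripped; produces an 8-bit product.
function [7:0] umult4;
  input [3:0] a;    // multiplicand (absolute value)
  input [3:0] b;    // multiplier (absolute value)
  integer i;
  begin
    umult4 = 8'h00;
    for (i = 0; i < 4; i = i + 1)
      if (b[i])
        umult4 = umult4 + (a << i);   // add the shifted multiplicand
  end
endfunction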

Fifth Step: top level structural description: The module net_top() is the top level module that describes the connectivity of the system modules, i.e., it instantiates all the modules described above and contains information about each module's input and output correspondence, or connectivity. Figure 4.3 shows the complete system with this preliminary system partitioning. It was automatically generated by the software tools when the Verilog file, with the described control unit, training, system clock and power-on-reset modules, was imported into the Design


Framework II environment as both functional and structural descriptions. This top level system module describes the complete system. It will be used to test each sub-module, as it is refined, further partitioned and optimized, within the overall system. As each step is taken, from the top level down in the hierarchy of modules, the complete system functionality can be verified so as to proceed further into the design process with the certainty that the system architecture at the higher level still functions as desired.


Sixth Step: module partitioning: Figure 4.4 shows the hierarchical module partitioning of the neural architecture. The control_unit module was implemented as a single module. It generates the control signal enable_train, waits for the signal done from the training module indicating that the weight adjustment has finished, reads in the new weight values, determines whether it has finished learning, and contains a memory array with the training and testing patterns. The control unit module has not been partitioned further. Mixed behavioral and gate level simulations were used to verify the design. After training is complete, the control unit tests the network with a new set of patterns. The training module was further partitioned into the following sub-modules: 1) output calculation modules, called out_unit, each of which includes a signed multiplication module, mult, and implements the product of an input and its corresponding weight, i.e., wᵢ * xᵢ, when signaled by the control unit asserting enable_train. If not enabled, the output of this module is z (high impedance). When the module finishes the calculation, it asserts the signal finish, which is ANDed with the finish signals of the other out_unit modules. The output of this AND gate enables the add module, which performs the sum of the products. These out_unit modules can be replicated to produce any number of product terms. The add module adds two of these product terms at a time; thus, as many add modules as needed can be cascaded, implementing the complete sum of products, Σ wᵢ * xᵢ. 2) adder unit modules, called add, as mentioned above, that implement the summation of the product terms received from the out_units, net = Σ wᵢ * xᵢ. 3) sigmoid implementation modules, called sigmoid, that implement f(net). For the current implementation, the neuron transfer function is a linear function, i.e., f(net) = net = Σ wᵢ * xᵢ. 4) weight adjustment modules, called train, which implement the actual learning algorithm, in this example the perceptron learning rule. Each of these modules adjusts a single weight, so they can be replicated as needed for a larger network. 5) the multiplier module, called mult, which implements the signed multiplication of two numbers. These modules were written in Verilog and imported into the Design Framework II environment as


functional and structural views, producing the block schematic shown in Figure 4.4 automatically.

Figure 4.4: Training module partitioning


Seventh Step: Verilog XL simulation: As each module is refined, it is functionally simulated and tested using Verilog XL within the top level module, net_top(), to make sure the system performs as required. Stimulus vectors are provided to the Verilog simulator via a test fixture file. A template is automatically created when setting up the simulation environment, listing the inputs to the module to be simulated. The designer must then complete the test fixture with statements that, for example, will generate a clock signal with a desired period, or will loop around a set of input values to test the behavior of the given module, or of the entire network, under different input conditions; a sketch follows below. Figure 4.5 shows the Verilog XL timing waveforms for a training iteration, and for the testing of two patterns. Figure 4.6 shows a set of training and testing patterns used to train and test the network. Four patterns were used for training, and then the trained network was tested using four previously unseen patterns. The test patterns are indicated by a dotted circle around them. In one application, using a clock period of 50 nanoseconds, learning took place in 10.55 microseconds, during 13 epochs. In another trial with different patterns, the network trained in 9 epochs, or 7.3 microseconds. Note in the timing waveforms of Figure 4.5 the signal labeled temp[11:0], which shows the classification of the patterns: if the contents of temp[11:0] are positive (002₁₆), the pattern belongs to class 1; if the result of testing the second pattern is negative (ffc₁₆), the test pattern is reported as not belonging to class 1.
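A hand-completed test fixture of the kind described here might look like the following sketch; the instantiation of the design under test is left as a placeholder, since the generated template depends on the module's ports.

`timescale 1ns / 1ns
// Sketch of a test fixture: clock generation, reset release, and a run
// long enough to cover the 10.55 us training session mentioned above.
module testfixture;
  reg clk, reset;

  // instantiation of the design under test would go here, e.g.:
  // net_top top( ... );

  always #25 clk = ~clk;          // 50 ns clock period

  initial begin
    clk   = 1'b0;
    reset = 1'b1;
    #25 reset = 1'b0;             // release reset after half a cycle
    #11000 $finish;               // stop after training has converged
  end
endmodule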


Figure 4.5: Verilog XL timing waveforms


Figure 4.6: Training and testing patterns (O belongs to class 2; X belongs to class 1)

Eighth Step: Synthesis:

Logic synthesis is used to automatically produce a gate level schematic of the design in a desired target technology. After a preliminary synthesizability check, the design is synthesized and optimized by the software tools. Synthesis converts the Verilog HDL description into an intermediate Boolean logic format. The optimizer then takes this intermediate representation and optimizes it based on constraints set up by the user, such as clock period and cost (area or timing) requirements, or particulars of the target library technology, such as maximum loading or fanout parameters for the gates in the specified library. After a successful synthesis step, a netlist (or connectivity list file), a schematic, and several reports about the total gate count, total area, path delays, etc., of the resultant circuit are generated. The schematic must be installed in the design library. A Standard Delay Format (SDF) constraints file containing information about timing constraints will be generated and later used to drive the place and route tools and produce the final integrated circuit. If the synthesis step fails, the designer must explore other alternatives by modifying the top level partitioning or improving the Verilog functional description, trying different styles


with different sets of constraints. Figure 4.7 shows the results of synthesizing the functional/structural descriptions of the single layer perceptron. The reports produced by the synthesis tool give the following:

(Synergy cost and timing reports for the out_unit circuit: cell counts and areas by cell type; maximum clock frequency 471.870 MHz; longest path delay 2.12 ns.)
Figure 4.7: gate level schematic and reports generated by Synergy synthesis tools.


(Synergy cost and timing reports for the add circuit: cell counts and areas by cell type; maximum clock frequency 38.806 MHz; longest path delay 25.77 ns.)
Figure 4.7 (continued): gate level schematic and reports generated by Synergy synthesis tools.


(Synergy cost and timing reports for the train circuit: cell counts and areas by cell type, total area 18221.55 units; maximum clock frequency 26.153 MHz.)

Figure 4.7 (continued): gate level schematic and reports generated by Synergy synthesis tools.


Ninth Step: Verilog XL simulation of the synthesized schematic: Using the same test fixtures used for the functional simulation done before the synthesis step, the gate level schematic is simulated and the waveforms compared. Simulating the gate level schematic is important because the actual library cells are used. Mixed-level simulation is also possible, allowing the designer to simulate modules for which a gate level implementation exists, either as a product of the synthesis tools or manually entered directly at the gate level, mixed with blocks of Verilog functional descriptions. This permits the simulation of isolated modules at the gate level within the context of the complete system, assuring correct functionality at the system level. After the design has been functionally verified at the gate level, and a standard delay file has been generated by the software tools, timing verification is done by backannotating the delay information into the timing analysis. The critical delay path will then contain information about estimated interconnect delays, giving a more accurate critical path analysis. Figure 4.8 shows the timing waveforms produced by the gate level simulation.

Figure 4.8: Verilog XL timing waveforms of the gate level simulation


Tenth Step: integrated circuit layout: Using Epoch¹, the design is physically implemented. Please refer to Appendix B for tutorials describing the step-by-step process for using the automatic placement and routing tools. The Epoch compilation includes automatic placement, routing, buffer sizing and power estimation. Figure 4.9 shows the final training module layout automatically produced by this tool. The synthesized training module of Figure 4.3, including the manually designed multiplier, was placed and routed (i.e., the out_unit, train and add modules were not placed and routed individually at this stage). Post-layout verification is run at this stage so the behavior of the circuit can be simulated including the effect of routing parasitics.


¹ A product of Cascade Design Automation.


(Epoch timing and area report for the training module: maximum clock frequency 37.333 MHz; longest path delay 26.63 ns; cell areas totaling 185114.29 units.)

Figure 4.9: Perceptron (training module) final chip layout


Chapter 5 TOP-DOWN DESIGN with ALOPEX


In this chapter, the top-down methodology described will be applied to a more advanced learning algorithm, capable of addressing real world applications. The algorithm was described in chapter 3, with suggested modifications for hardware implementation. The stepwise methodology developed in chapter 4, which resulted from the successful development effort for the deterministic perceptron learning algorithm, will be followed and applied to the Alopex stochastic learning algorithm.

5.1. Choice of data format


As explained in chapter 3, a fractional data format was selected for implementing the neuronal inputs and
outputs, and a mixed representation for the weights, i.e., the weights may take on values larger than 1 so they have to include an integer portion, as shown below: 1) for the 8-bit weights, the format used is s i i i . f f f f. The exponents of two are: (sign) 2² 2¹ 2⁰ . 2⁻¹ 2⁻² 2⁻³ 2⁻⁴. This data representation allows for a range of weights from -7.9375₁₀ (81₁₆) to +7.9375₁₀ (7f₁₆). 2) for the 8-bit neuronal inputs and outputs, the format used is s . f f f f f f f, since they are between -1 and +1, i.e., they are squashed by the sigmoid between these values. Each bit represents the following powers of two: (sign) . 2⁻¹ 2⁻² 2⁻³ 2⁻⁴ 2⁻⁵ 2⁻⁶ 2⁻⁷, giving a range of values between -0.9921875₁₀ (81₁₆) and +0.9921875₁₀ (7f₁₆). Negative numbers are represented using two's complement format.

The above described data formats for the weights and the inputs naturally lead to the following data representation formats for the accumulator registers storing the results of their multiplication:


3) The implementation of the multiplication of the weights and the inputs performs signed multiplication of two 8-bit numbers. However, these numbers are of different formats. Thus, the resulting product will have the format shown below:

wᵢ (weights) = s i i i . f f f f
xᵢ (inputs) = s . f f f f f f f

wᵢ * xᵢ = s i i i . f f f f f f f f f f f 0

Since the data has a mixed format, i.e., it has integer and fractional bits, the result of the multiplication is corrected by shifting it left (with a zero fill in the least significant bit). As an example, a calculation done by one of the neurons (a sample run is included in section 5.3) follows:

input#1 = 60₁₆ = 0.1100000₂ = (sign) 2⁻¹ + 2⁻² = +0.75₁₀

weight#1 = ff₁₆ = 1111.1111₂ = (-) 000.0001₂ (two's complement) = -2⁻⁴ = -0.0625₁₀

The result of the multiplication is a 16-bit number with the following format:

result = ff40₁₆ = 1111.111101000000₂ = s i i i . f f f f f f f f f f f f = (-) 000.000011000000₂ (two's complement) = -(2⁻⁵ + 2⁻⁶) = -0.046875₁₀

which is the correct result of multiplying 0.75 * (-0.0625) = -0.046875. The HDL description implementing this multiplication is shown in Figure 5.1:

function[15:0] mult;
input[7:0] x;
input[7:0] w;
reg sign, signw, signx;
reg[7:0] W;
reg[7:0] X;
reg[15:0] temp;
begin
    temp[15:0] = 16'h0;
    W[7:0] = w[7:0];
    signw = W[7];
    X[7:0] = x[7:0];
    signx = X[7];
    sign = signw ^ signx;                    // sign of the result
    if (signw == 1'b1)
        W[7:0] = ~W[7:0] + 1'b1;             // negative weight: absolute value
    if (signx == 1'b1)
        X[7:0] = ~X[7:0] + 1'b1;             // negative input: absolute value
    // unsigned multiply, then shift left with a zero fill in the LSB to
    // correct the result for the fractional data representation
    temp[15:0] = (W[7:0] * X[7:0]) << 1;
    if (sign == 1)
        temp[15:0] = ~temp[15:0] + 1'b1;     // negative result: two's complement
    mult[15:0] = temp[15:0];
end
endfunction

Figure 5.1: Verilog code for signed fractional multiplier


The shift left statement after the multiplication is performed effectively adjusts the integer result for consistency with fractional data representation.
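As a quick simulation check, the worked example above can be reproduced by exercising the mult function; the harness below abridges the Figure 5.1 code (the debug statements are dropped) and the module name is illustrative.

// Check of the worked example: mult(60, ff) should yield ff40.
module mult_check;
  reg [15:0] product;

  // signed fractional multiply, abridged from Figure 5.1
  function [15:0] mult;
    input [7:0] x, w;
    reg sign; reg [7:0] X, W;
    begin
      sign = x[7] ^ w[7];
      X = x[7] ? (~x + 1'b1) : x;            // absolute values
      W = w[7] ? (~w + 1'b1) : w;
      mult = (W * X) << 1;                   // shift left: fractional fix-up
      if (sign) mult = ~mult + 1'b1;         // restore the sign
    end
  endfunction

  initial begin
    product = mult(8'h60, 8'hff);            // 0.75 * (-0.0625)
    $display("product = %h", product);       // prints ff40 = -0.046875
  end
endmodule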

4) When the complete summation is performed by the neuron, the accumulator register contains 4 extension bits to allow for word growth, since adding several product terms may result in an overflow condition. This is shown in Figure 5.2.
bit#    7 6 5 4 . 3 2 1 0
wᵢ  =   s i i i . f f f f                                 (8 bits)
inᵢ =   s . f f f f f f f                                 (8 bits)

bit#    15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
Netᵢ =  s  i  i  i  .  f  f f f f f f f f f f f           (16 bits)

(the same formats apply to each weight-input pair, giving Net1, Net2, etc.)

bit#    19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
Net  =  s  s  s  s  s  i  i  i  .  f  f f f f f f f f f f f f  (20 bits)

where Net is the accumulated sum of the 16-bit partial products Netᵢ, each sign extended to 20 bits before being summed.
Figure 5.2: data representation for the Alopex implementation

The four additional extension bits effectively move the sign bit position to bit #19, the MSB (i.e., before the partial product terms (Netᵢ) are summed up, they are sign extended to 20 bits), and allow the sum to grow up to values of around 128. The additional bits allow for the accumulation of partial sums in the case that several product terms are being summed up. The final result, after taking f(net), is then rounded and truncated back to 8 bits as required by the data path. Excessive errors would be introduced in the calculations if the product terms were truncated before the final summation.

In addition, whenever a result cannot be stored in the destination register, e.g., an output that becomes larger than +0.9921875 (or hex 7f), then it is saturated to the maximum value a given register can hold. For example,

in₁ = 20₁₆ = 0.0100000₂ = (sign) 2⁻² = +0.25₁₀
weight1 = 0b₁₆ = 0000.1011₂ = (sign) 2⁻¹ + 2⁻³ + 2⁻⁴ = +0.6875₁₀
in₂ = 70₁₆ = 0.1110000₂ = (sign) 2⁻¹ + 2⁻² + 2⁻³ = +0.875₁₀
weight2 = 10₁₆ = 0001.0000₂ = (sign) 2⁰ = +1.0₁₀


Net₁ = (0.25) * (0.6875) + (0.875) * (1) = 1.046875₁₀, i.e.,

net = (in₁ = 20₁₆ = 0.25₁₀) * (weight1 = 0b₁₆ = 0.6875₁₀) + (in₂ = 70₁₆ = 0.875₁₀) * (weight2 = 10₁₆ = 1.0₁₀)
net = 0.171875₁₀ + 0.875₁₀ = 1.046875₁₀,

which is greater than 0.9921875₁₀, the largest number that can be represented in this 8-bit fractional data format. The HDL code must check for these overflow conditions and make the necessary corrections. This will be illustrated in sections 5.2 and 5.3, and a minimal sketch follows below.
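A minimal sketch of such an overflow check with limiting arithmetic, assuming the 20-bit accumulator format of Figure 5.2 and the 8-bit s.fffffff output, is given below; simple truncation of the low fraction bits is shown, without the rounding step, and the function name is illustrative.

// Sketch of limiting (saturation) arithmetic: if the integer bits of the
// 20-bit accumulator disagree with its sign, the value is out of the 8-bit
// output range and is clipped to the neuron saturation values.
function [7:0] saturate;
  input [19:0] acc;   // s s s s s i i i . f f f f f f f f f f f f
  begin
    if (acc[18:12] == {7{acc[19]}})
      // in range: keep the sign and the seven most significant fraction bits
      saturate = {acc[19], acc[11:5]};
    else
      // out of range: clip to +0.9921875 (7f) or -0.9921875 (81)
      saturate = acc[19] ? 8'h81 : 8'h7f;
  end
endfunction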

The current version of the neuron applies the transfer function, f(net), shown in Figure 5.3, to the summed-up outputs:

Figure 5.3: f(net) for present implementation


5.2. First Step: C language implementation


The complete software listing is included in Appendix A. The program was run for several problems, including the exclusive-or, several linearly separable pattern recognition problems, and encoder problems. The network converged for several error measures, with results consistent with those found in (Pandya, 1994). Chapter 3 contains the pseudocode for one implementation of the architecture that has a single layer of hidden neurons. The program in Appendix A implements a two-hidden-layer architecture. For the hardware implementation, a network with two inputs, one hidden layer of two neurons, and a single output was designed. This architecture is sufficient to solve the exclusive-or and linearly separable problems. The main concern was to scale down the architecture to solve several issues related specifically to a hardware implementation and still obtain good functional performance. The modules designed may be cascaded to build larger networks.

5.3 Second Step: HDL functional description


The complete system was written in Verilog HDL to study its behavior in a different environment. All the high level constructs of the HDL were used, without regard for obtaining code that was synthesizable. The objective was to have an intermediate step between the high level C language implementation and the final HDL description that would be synthesized. This system description using the complete HDL set included the use of real numbers, random number generator system calls, reading input patterns from a Unix file, etc., all of which are not synthesizable by the software tools available.


5.4. Third Step: preliminary module partitioning


To implement this step, a subset of Verilog is used that will be recognized by the synthesis tools. Thus, when partitioning the system into modules, each one is re-written using a modeling style that will later be


used to produce a gate level schematic. No system calls such as $random can be used, so a random number generator was designed that simulates the NM810 (which is made up of 8 RBG1210 random bit generators), as explained in chapter 3. This will not be synthesized but, since the Verilog XL environment allows for mixed level simulation, the non-synthesizable blocks can still be present throughout the design steps to assure a final design that is functionally correct. The current implementation will have an off-chip random number generator such as the NM810. Neurons will read the necessary random numbers from its outputs when needed. The following modules were designed: 1) a top-level structural module, net_top, which instantiates the lower level modules and describes how they are interconnected. This top level module is the one from which the simulation, synthesis and layout tools are called. As each module is optimized it is tested within this system module so the functionality at the system level can be verified. Mixed level simulation allows for some modules to be simulated using their functional Verilog descriptions, mixed with modules for which a gate level description exists and needs to be tested, either because it was automatically generated by the synthesis tools or because it was manually entered gate by gate. The structural description is shown in Figure 5.4:

66

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

module neural_net_top(in);

// March 14, 1996; March 29; April 1
// Laura Ruiz; error11.v; linearly separable
// with overflow check and negative weights and inputs
// weight3 = siii.ffff; inputs = s.fffffff
// without initializing weights;
// multiplier designed by hand

input in;
wire clk, enable_train, reset, delta_E_ready,
     noise_serial_1, noise_serial_2, noise_serial_3, noise_serial_4,
     noise_serial_5, noise_serial_6, noise_serial_7, noise_serial_8,
     not_noise_serial_5, not_noise_serial_7,
     done_w1, done_w2, done_w3, done_w4, done_w9, done_w10,
     en_1, en_2, en_5, ready_n1, ready_n2, ready_n5,
     enable_1, enable_2, enable_5, finished_train, enable_test,
     ready_n1_and_n2, finished_train_and_enable_test;
wire[7:0] w_new_1, w_new_2, w_new_3, w_new_4, w_new_9, w_new_10;
wire[7:0] in_1, in_2, out_n1, out_n2, out_n5;
wire[15:0] delta_E;

// power on reset
power_on_reset power_on(reset);

// system clock
m clock_generator(clk);

// NM810 serial random number generator
noise_serial bit_serial_1(clk, noise_serial_1);
noise_serial bit_serial_2(clk, noise_serial_2);
noise_serial bit_serial_3(clk, noise_serial_3);
noise_serial bit_serial_4(clk, noise_serial_4);
noise_serial bit_serial_5(clk, noise_serial_5);
noise_serial bit_serial_6(clk, noise_serial_6);
noise_serial bit_serial_7(clk, noise_serial_7);
noise_serial bit_serial_8(clk, noise_serial_8);

// bank of weight units
weight3 weight_unit_1(enable_train, clk, reset, delta_E_ready, delta_E, not_noise_serial_5, w_new_1, done_w1);
weight3 weight_unit_2(enable_train, clk, reset, delta_E_ready, delta_E, noise_serial_8, w_new_2, done_w2);
weight3 weight_unit_3(enable_train, clk, reset, delta_E_ready, delta_E, noise_serial_3, w_new_3, done_w3);
weight3 weight_unit_4(enable_train, clk, reset, delta_E_ready, delta_E, not_noise_serial_7, w_new_4, done_w4);
weight3 weight_unit_9(enable_train, clk, reset, delta_E_ready, delta_E, noise_serial_4, w_new_9, done_w9);
weight3 weight_unit_10(enable_train, clk, reset, delta_E_ready, delta_E, noise_serial_1, w_new_10, done_w10);

// bank of output calculation units
neuro2 n_1(in_1, in_2, w_new_1, w_new_3, out_n1, en_1, ready_n1, reset, clk);
neuro2 n_2(in_1, in_2, w_new_2, w_new_4, out_n2, en_2, ready_n2, reset, clk);

Figure 5.4: HDL structural description of the network architecture


neuro2 n_5(out_n1, out_n2, w_new_9, w_new_10, out_n5, en_5, ready_n5, reset, clk);

// control unit
control_unit sequencer(clk, done_w4, reset, out_n5, ready_n5, enable_train, enable_test,
                       delta_E, delta_E_ready, in_1, in_2, finished_train);

// glue logic
not not_1(not_noise_serial_5, noise_serial_5);
not not_2(not_noise_serial_7, noise_serial_7);
and and_1(enable_1, done_w1, done_w3, enable_test);
and and_2(enable_2, done_w4, done_w2, enable_test);
and and_4(ready_n1_and_n2, ready_n2, ready_n1);
and and_5(finished_train_and_enable_test, finished_train, enable_test);
and and_8(enable_5, enable_test, ready_n1_and_n2, done_w9, done_w10);
or or_1(en_1, finished_train_and_enable_test, enable_1);
or or_2(en_2, finished_train_and_enable_test, enable_2);
or or_5(en_5, finished_train_and_enable_test, enable_5);
endmodule

Figure S.4 (continued): HDL structural description of network architecure

68

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

This top level structural module instantiates six different modules, plus several gates used as glue logic for signal synchronization:

1) random bit generators (generate the noise for the learning algorithm implementation).

2) weight update units (implement the weight update mechanism following the Alopex algorithm). These units are enabled during training. At each training iteration they all update the weights concurrently; each unit updates a particular weight using a single broadcast measure of global performance.

3) output calculation units (implement the summation, net = Σ wi * xi, and the transfer function, f(net)).

4) control unit (implements all the control signals needed for training and testing of the network; it contains ROM with the training and testing patterns).

5) system clock (implements a single phase system clock generator with a period of 50 nanoseconds). This block is not synthesizable, so it will be deleted before the synthesis step; the clock signal may be generated in a stimulus file for simulation and testing.

6) power-on-reset (asserts a reset signal at the beginning of the simulation); also non-synthesizable, it will be deleted before synthesis and replaced by a signal in a stimulus file (testfixture).

Figure 5.5 shows the block diagram of the system architecture, automatically generated by the software tools when the Verilog files with the structural/behavioral descriptions were imported into the Design Framework II environment as functional and structural views.


Figure 5.5: system block diagram


5.5. Module description

5.5.1. Weight adjustment:


The module that implements the learning portion of the Alopex algorithm adjusts a single weight; thus, one of these modules is needed per weight. One of the applications has six weight units, for a network with two inputs, one hidden layer of two neurons, and one output layer. Another application, with two hidden layers, was studied with ten weight units. The HDL behavioral description of the weight module can be found in Appendix A. The module has the following inputs and outputs:

1) One-bit inputs: the clock signal (clk), the reset signal (reset), a signal from the control unit (Enable) that is asserted when the weight units are to update their weight values, a serial stream of random bits (Noise) from the random number generator (NM810), and a handshake signal from the control unit that is asserted when a new measure of global performance is broadcast from the control unit to all the weight units (delta_E_ready). All of the weight units operate concurrently, adjusting their weights simultaneously. They update their locally stored weights with information received via a 16-bit data path broadcast from the control unit (delta_E[15:0]) and other local parameters. Thus, no interprocessor communication is needed between the weight units, and the control unit does not need any information from the weight units to determine when training is done. This simplifies the connectivity of the system.

2) A 16-bit input, delta_E[15:0], a measure of global performance calculated by the control unit and broadcast to all the weight units simultaneously. This parameter is used by each weight unit to calculate the new weight values.

3) A one-bit output, the handshake signal done, which signals the neurons that calculate the product terms when the weight unit has finished the weight adjustment procedure.

4) An 8-bit output, W_new[7:0], with the newly calculated weight, connected to the neuron that will use it to calculate the product terms (weight * input).

There is a bank of temporary, scratch-pad registers to store intermediate results, and input and output buffers to hold the weight values and to store the received information.
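For reference, the port list below is a sketch implied by this description and by the weight3 instantiations in Figure 5.4; the actual behavioral body, omitted here, is the one listed in Appendix A.

module weight3(
    input         enable_train,   // asserted by the control unit during training
    input         clk,
    input         reset,
    input         delta_E_ready,  // handshake: broadcast bus holds valid data
    input  [15:0] delta_E,        // broadcast measure of global performance
    input         noise,          // serial random bit stream (from the NM810)
    output [7:0]  W_new,          // newly calculated weight, format s iii.ffff
    output        done            // handshake to the output calculation units
);
    // ... weight adjustment behavior as described in Appendix A ...
endmodule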


The weight units are initialized on reset with small random weights. The data format of the weight values is s iii.ffff. The weight units sit in a continuous always loop, monitoring the reset, clk and Enable signals. At the beginning of the training block, each weight unit reads 16 bits from the serial random bit generator into a register called Noise. It is important that each weight unit read in a different random number; thus, the units are not connected in parallel reading the same bit stream, nor do they read an 8-bit or 16-bit random number generated by one or two NM810 chips. That would be faster, but it would not be an acceptable or effective way to add noise to the weight update mechanism, a requirement of stochastic algorithms such as Alopex. After reading in the noise, the weight unit reads in the measure of global performance broadcast from the control unit, the 16-bit delta_E[15:0]; determines its sign (1 if < 0; 0 if > 0) and stores it in a register, sign_deltaE, for later processing; and takes the absolute value of the 16-bit input delta_E. It then calculates the feedback, x = ΔE * Δw. Register x is 21 bits, since it is the result of multiplying the contents of register delta_E (ΔE), which is 16 bits, by 5 bits of register delta_W (Δw). The data formats are as follows:

delta_E[15:0] = s.fffffffffffffff
delta_W[4:0]  = 0.ffff

The sign bit of x is calculated (sign_x = sign_deltaE XOR sign_delta_W), and the two's complement is taken if the sign bit is 1 (indicating a negative number); thus the format of x is:

x[20:0] = s.ffffffffffffffffffff

Both delta_E[15:0] and delta_W[4:0] are fractional numbers, positive when multiplied, their original signs having been stored in registers sign_deltaE and sign_delta_W. This guarantees that x will also be all fractional. The assumption that delta_W[7:0] contains no significant digits in its integer portion is valid since the weights change by very small steps; thus their difference, delta_W = W_new - W_old, will never have any ones in its integer bit positions. Recall that the format for W_new and W_old (the weight format) is s iii.ffff; thus the format for delta_W[7:0] is also s iii.ffff, but only bits [4:0] are used to calculate x.


After calculating x, the weights are adjusted following the rule shown below:

W_new = W_old - x + Noise

The code then checks for saturation, in case the new weight value (W_new) exceeds the range that can be represented in the format s iii.ffff (hexadecimal 7f and 81). Then a new delta_W is calculated for the next iteration, the new weight value is stored as W_old, and finally the weight unit waits for the signal delta_E_ready from the control unit to start a new weight adjustment iteration. Figure 5.6 shows the weight adjustment mechanism. All weight units operate in parallel, updating their values concurrently with information available in their local memories (W_old) and a single measure of global performance of the network (delta_E) broadcast by the control unit to all weight units. The connectivity of the Alopex network is simpler than that of the perceptron implementation, since in the latter the control unit needs information on the weight values to determine when training finishes. Results of the behavioral simulation of Alopex show that a weight unit takes 2.1 microseconds to update its value.
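The fragment below is a minimal behavioral sketch of the update rule just described; register names are illustrative, and the loading of delta_E_abs, delta_W_abs, the stored signs and the noise register, as well as the saturation check and handshaking of the actual Appendix A module, are assumed to happen elsewhere in the training cycle.

module weight_update_sketch(input clk, input delta_E_ready);
    reg [7:0]  W_old, W_new;        // format s iii.ffff
    reg [7:0]  noise_reg;           // noise term, reduced to weight format
    reg [15:0] delta_E_abs;         // |delta_E|, all fractional
    reg [4:0]  delta_W_abs;         // |delta_W|, fractional bits only
    reg        sign_delta_E, sign_delta_W;
    reg [20:0] x;                   // feedback = delta_E * delta_W
    wire sign_x = sign_delta_E ^ sign_delta_W;

    always @(posedge clk)
        if (delta_E_ready) begin
            x = delta_E_abs * delta_W_abs;       // 16 x 5 -> 21 bits
            // W_new = W_old - x + noise, keeping only the top bits of x:
            if (sign_x)                          // x negative: subtraction adds
                W_new = W_old + {3'b000, x[20:16]} + noise_reg;
            else
                W_new = W_old - {3'b000, x[20:16]} + noise_reg;
            W_old = W_new;                       // saturation check omitted
        end
endmodule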


Figure 5.6: timing waveforms of a weight unit updating its value

As will be explained in the section on the control unit module, the format for the error calculated by the control unit at the end of the nth iteration is as follows:

error(n)[19:0] = 0 iiii.fffffffffffffff
delta_error(n) = error(n) - error(n-1) = s iiii.fffffffffffffff

which is clipped, using saturation arithmetic, to the maximum/minimum values that can be represented in an all-fractional 16-bit number (the error does not vary greatly between iterations, so the difference, delta_error, will be small, usually less than one). This measure of global performance is broadcast to all the weight units at the end of each iteration. The following is a portion of a sample run showing the weight adjustment mechanism of the weight units. These fractional data values appear in the sample run:

error_current[19:0] = 0b891 (hex) = 0 0001.011 1000 1001 0001 (binary) = 1.4419251 (decimal)
error_n1[19:0]      = 0af2a (hex) = 0 0001.010 1111 0010 1010 (binary) = 1.3684692
delta_error[19:0]   = 00967 (hex) = 0 0000.000 1001 0110 0111 (binary) = 0.0734558

The truncated/saturated value broadcast to the weight units is:

delta_E[15:0] = 0967 (hex) = 0.000 1001 0110 0111 (binary) = 0.0734558

The error has increased, so sign_delta_E is 0 (a positive number). Each weight unit has its delta_weight value and sign_delta_W stored in its local memory. The feedback, x, is calculated by each weight unit as seen in the sample run. As an example,

delta_E * delta_W = 0967 (hex) * 01 (hex) = 0.0734558 * 0.0625 = 0.004591
x[20:0] = 0012ce (hex) = 0.0000 0001 0010 1100 1110 (binary) = 0.004591

These values correspond to the weight that in the previous iteration changed by +01 (recall that the format for the weights is s iii.ffff, so 01 (hex) = 0000.0001 (binary) = 0.0625 (decimal)). The printout shows a value for x that has been shifted four places to the left (i.e., multiplied by 16) to move the ones into more significant positions, since x is a 21-bit fractional number that will be added to an 8-bit number (the weight) having a sign bit, 3 integer bits and 4 fractional bits. Not shifting the resultant value of x would have no impact whatsoever on the weight adjustment, since truncating it before the addition would only add 0s to the weight (recall that x is a very small value, the result of multiplying two small fractional numbers, delta_error and delta_weight). The calculation of the new weight for this particular example is:

W_new[7:0] = W_old[7:0] - x[20:16] + Noise;

W_new, W_old and x are all sign extended before this operation is performed. Saturation arithmetic is applied if the result is greater than 7.9375, the largest number that the weights can take on given the data format s iii.ffff. The following table shows actual weight and error values during 8 iterations:

Table 5.1: actual weight adjustments for 8 iterations

 iter.    w[1]     w[2]     w[3]     w[4]     w[5]     w[6]     Error
 #1       f0       0a       fa       0f       0a       06       70ffe
 #2       f1(+)    0a(=)    fc(+)    0f(=)    0b(+)    06(=)    0af2a(-)
 #3       f3(+)    0b(+)    ff(+)    0f(=)    0c(+)    07(+)    0b891(+)
 #4       f1(-)    0b(=)    fc(-)    0f(=)    0c(=)    06(-)    0b187(-)
 #5       f1(=)    0b(=)    fa(-)    10(+)    0c(=)    06(=)    0aa8b(-)
 #6       f2(+)    0b(=)    fa(=)    12(+)    0d(+)    07(+)
 #7       f3(+)    0c(+)    fb(+)    12(=)    0e(+)    08(+)    0ab95(+)
 #8       f3(=)    0c(=)    fc(+)    12(=)    0e(=)    08(=)    0ac5a(+)

(+) weight/error increased; (=) weight/error did not change; (-) weight/error decreased

As can be seen from the table above, the weight adjustment mechanism works as desired, given a measure of stochasticity that makes some weights change values randomly. However, most (more than 50%) of the weights follow the desired algorithm: if the error decreased, the network is going in the right direction and the weights should continue to move in the same direction; if the error increased, the weights should be adjusted in the opposite direction, with some random moves to avoid the "freezing" of a network trapped in a possible local minimum.


5.5.2 Output calculation units:


The current application has two input neurons, one hidden layer of two neurons, and one output neuron, for a total of three output calculation modules, each capable of implementing product terms, summations and linear transfer functions. Another implementation studied had 5 output calculation units, implementing a neural network with two hidden layers of neurons, one input layer and one output layer. Still another implementation, being investigated at this time, implements the reading of the inputs and weights as a sequential process, allowing any given number of product terms to be added up. All the versions are fully asynchronous in that each level of processing generates a handshake signal when its computations are finished, signaling the next level that data has been processed and is valid/ready to be used. This particular implementation of the output calculation units starts by reading in the inputs and the weights when a signal en is asserted. This signal is controlled by the "done" output signals of the weight units connected to the particular output calculation unit and by the "ready" output signals of output calculation units in previous layers. When the connected weight units assert their done signals, the control unit asserts the enable_train or enable_test signal, and the neurons in the previous layer assert their ready signals, the output calculation units are enabled. After they finish calculating the neuronal output, they assert a signal named ready that enables the next layer of neurons, or tells the control unit that the neurons have finished a complete forward pass through the layers and a final output is


available for the control unit to calculate a new measure of performance (error). The complete HDL description of this module is in Appendix A; also refer to the structural HDL description of the top level network, which shows the interconnections among modules. The neuron reads in the 8-bit inputs and the 8-bit weights, calculates the 16-bit product terms and accumulates them in a 20-bit register that allows for growth up to approximately 127.9997559. After the summation is performed, the 20-bit result must be rounded and truncated to 8 bits, since these outputs become the inputs to the next layer of neurons. Saturation values are also used in case the value to be output exceeds 7f or 81 (±0.9921875). Truncation and rounding are performed as follows:

1- The 16-bit registers to be accumulated to form Σ wi * xi are sign extended to 20 bits before addition, to preserve the sign and to allow for word growth of up to 127.9997559. These 16-bit numbers are the result of multiplying two 8-bit numbers, the weight and the input, with the formats s iii.ffff and s.fffffff; thus the 20-bit result has the format shown below:

s iii iiii.ffff ffff ffff

2- Test for positive saturation: if the sign bit (bit 19) = 0 (a positive number), saturation occurs if the following 7 bits (bits 18 through 12) are not all equal to 0. In this case the actual 8-bit output, given the format of the input/output data, will be 7f (hex) (+0.9921875). Else, round and truncate (see item 4 below).

3- Test for negative saturation: if the sign bit (bit 19) of the sum of product terms is 1, indicating a negative number, saturation occurs if the following 7 bits (bits 18 through 12) are not all equal to 1. In this case, the output is saturated to the most negative value, 81 (hex) (-0.9921875). Else, round and truncate (see item 4 below).

4- Rounding and truncating a 20-bit number (format: s iiiiiii.ffffffffffff) to an 8-bit number (format: s.fffffff) keeps the sign bit and the top 7 fraction bits (bits [11:5]) and uses the low fraction bits to decide the direction of rounding:

1) If bits [4:1] < 8 (hex), round down (truncate). For example, ff2c2 (hex) = 1111 1111.0010 1100 0010 is truncated to 8 bits as 1.0010110 (96 hex); the original number ff2c2 (hex) = -0.8276367 (decimal) has been rounded to 96 (hex) = -0.828125 (decimal) (recall that negative numbers are in two's complement format).

2) If bits [4:1] > 8 (hex), round up. For example, 00d5c (hex) = 0000 0000.1101 0101 1100 is truncated to 0.1101010 and incremented to 0.1101011; thus the original number 00d5c (hex) = 0.8349609 (decimal) has been rounded to 6b (hex) = 0.8359375 (decimal), a difference of -0.0009766. With no rounding, the output would be 6a (hex) = 0.828125 (decimal), introducing a larger error.

3) If bits [4:1] = 8 (hex), round up or down depending on bit [5]: if bit [5] = 0, round down; if bit [5] = 1, round up. This prevents the rounding from always going the same way, so results will not be biased in that direction (adapted from the DSP56001 user's manual).

Once the summation is rounded and truncated, it is stored in the output buffers and the signal ready is asserted for two clock cycles. This signal is ANDed with the ready signals of other neurons in the same layer, or connected to the control unit if the neuron is in the output layer. Refer to the structural description, neural_net_top(), in Appendix A for connectivity details.
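As a compact restatement of steps 2 through 4, the Verilog function below (to be declared inside a module) sketches the saturate/round/truncate operation on the 20-bit accumulator; it is an illustration of the rule described above, not the Appendix A code.

function [7:0] round_sat;          // 20-bit s iiiiiii.ffffffffffff -> 8-bit s.fffffff
    input [19:0] acc;
    reg   [7:0]  kept;             // sign bit plus top 7 fraction bits
    begin
        kept = {acc[19], acc[11:5]};
        if (acc[19] == 1'b0 && acc[18:12] != 7'b0000000)
            round_sat = 8'h7f;                        // positive saturation
        else if (acc[19] == 1'b1 && acc[18:12] != 7'b1111111)
            round_sat = 8'h81;                        // negative saturation
        else if (acc[4:1] > 4'b1000)
            round_sat = kept + 8'h01;                 // round up
        else if (acc[4:1] < 4'b1000)
            round_sat = kept;                         // round down (truncate)
        else
            round_sat = kept + {7'b0000000, acc[5]};  // tie: follow bit [5]
    end
endfunction

Applied to the two worked examples above, round_sat(20'hff2c2) yields 8'h96 and round_sat(20'h00d5c) yields 8'h6b.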

Figure 5.7 shows the output calculation unit timing waveforms obtained when the complete functional system description was simulated in the Verilog XL environment. Note the correct output calculation. The transfer function implemented is f(net) = net. The behavioral simulations show that these units calculate a new output value in 2.45 microseconds.


The following is a portion of a sample run showing how the neurons calculate the value of net = Σ wi * xi and of f(net).


Figure 5.8: portion of a sample run showing how neurons calculate f(net) = Σ wi * xi


5.5.3 Control Unit:

The control unit module generates all the control signals that coordinate the functions of the other modules; it determines when learning has occurred by evaluating a measure of global performance, and it tests the network with a set of training and testing patterns stored in ROM within the control unit itself. The following is a description of the input/output signals.

Input signals:

1) clk (clock) and reset, input signals from the system clock and reset modules.

2) The handshake signal done_w, connected to a done_w signal from any one of the weight units. Since all the weight units update their values concurrently, any one signaling that its weight has been updated is sufficient to let the control unit know that a new training iteration has started.

3) out_out[7:0], the 8-bit data bus from the last neuron, the output neuron. If a larger system is to be designed, with multiple output neurons, these will have to be read serially by the control unit, and the ready signals of the output neurons will have to be ANDed together to let the control unit know when all of the output neurons have finished calculating their output values, f(net). This will require a minor change in the functional HDL description of the control unit, i.e., a loop that reads in the necessary output values from the output neurons; the module would then have to be re-synthesized, placed and routed. An implementation being studied now has this format, allowing any number of output neurons to be read sequentially by the control unit.

4) The handshake signal done, connected to the last neuron (the output neuron in the present implementation), which signals the control unit when the last neuron has finished calculating its f(net) and the value is available on the data lines.

Output signals:

1) The handshake signal enable_train: asserted at the beginning of the training phase, i.e., after the initialization of the system on power-on reset. It is connected to the enable_train inputs of all the weight units, which monitor it to know when to start the weight update operations.

2) The handshake signal enable_test: connected to the output calculation units through additional ANDing with other signals (see the previous section and the structural HDL description of the top level module), enabling these units to start their product term and summation calculations.

3) delta_E[15:0]: the 16-bit measure of global performance (in this implementation, the change of error with respect to the previous iteration) that is broadcast to all the weight units.

4) The handshake signal delta_E_ready, which, when asserted, indicates that the broadcast bus has valid data. The weight units monitor this signal before reading the new value of delta_E from the broadcast bus. This protocol allows for fully asynchronous operation, such that if the control unit is modified to implement a different measure of global performance, the weight units will not need to be modified at all, since they operate in full handshake mode with the control unit and the output calculation units.

5) traindat_mtx_i[7:0]: output data bus that holds the contents (input values of the training and testing patterns) of the internal ROM location being read by the output calculation units, i.e., the xi's in the net calculation that the neurons perform (Σ wi * xi).

6) The handshake signal finished_train: asserted during the testing phase; it enables the output calculation units during testing so they calculate the neuronal outputs with the saved weight values of the trained network and the testing pattern inputs stored in ROM in the control unit.

The HDL description of the control unit is included in Appendix A. This module starts each training iteration by enabling the weight units to do a weight update, i.e., it asserts the signal enable_train

for one clock cycle. It then outputs the x values (training pattern inputs) for the neurons (the output calculation units) to read in, for each of the training patterns stored in ROM. The control unit waits for the signal from the weight units, to make sure that all the weight units have finished updating, before enabling the neurons by asserting the enable_test signal. While the neurons calculate f(net) = f(Σ wi * xi), the control unit is in a wait state, monitoring the ready signal of the last neuron in the output layer. When the done signal (connected to the ready signal of the last neuron in the output layer) is asserted, signaling that all the neurons have finished a forward pass through the network calculating their f(net)s, the control unit reads in the output values (f(net)) of the neuron(s) of the output layer, disables all the intermediate layer neurons and calculates the error between this calculated output and the desired output, stored in its ROM, for the current pattern. The current implementation calculates the error for pattern i as:

error_i = (desired output_i - calculated output_i)²

and accumulates the errors over all of the patterns. Next, the control unit outputs the next pattern, enables the neurons and waits for the next forward pass through the network, monitoring the ready signal of the output layer neuron(s). It then calculates the error for this pattern. This is repeated for each pattern, accumulating the individual errors in a register (sumerror).

After all the patterns have been passed once through the network, and the accumulated error is stored in the sumerror register, the control unit calculates the average of the errors over all of the patterns, compares it with the error calculated in the previous iteration, broadcasts this delta_E value to all the weight units and asserts the handshake signal delta_E_ready, which the weight units monitor to know when the data is valid. The broadcast value is limited using saturation arithmetic. The newly calculated error is stored to be used in the next iteration for a new delta_E calculation. The control unit then continues with the next iteration. It finishes the training phase when the calculated error is less than a predetermined value stored as a parameter, or when the number of iterations exceeds a maximum value, also stored as a parameter.

After the training phase is over, the control unit changes to the testing phase, outputting the testing patterns, enabling the neurons and disabling the weight units so they do not keep updating the weights. The weight units hold their last calculated weight values in their output buffers so the neurons can calculate their new f(net)s with the weight values of the trained network. Figure 5.9 shows the timing waveforms of the control unit during a typical training iteration. Note that the time axis shows it takes 8.8 microseconds to complete a training iteration using a clock frequency of 20 MHz. This does not include the time delay introduced by the manually designed multiplier; with the multiplier, the time for a complete iteration was 25 microseconds, as shown in Table 5.3.

Figure 5.9: timing waveforms for control unit




The following is a portion of a sample run showing how the error is calculated and accumulated, and how the measure of global performance, delta_E, is calculated and broadcast to all the waiting weight calculation units. The registers used and their formats are shown below (desired output = target output):

error for pattern i = (desired output_i - calculated output_i)²
desired output[7:0] = s.fffffff
calculated output[7:0] = s.fffffff

Before subtracting, both are sign extended to 10 bits:

target - temp_out [9:0] = sss.fffffff

Multiplying two 10-bit numbers with the above format gives the following format:

squared error [19:0] = 0 iiii.fffffffffffffff

The additional integer bits allow the squared error to be greater than 1 and still be correctly represented without overflow. The sign bit is always 0 since the square of a number is always positive. The accumulation of the individual pattern squared errors is done in a 20-bit register, sumerror:

sumerror [19:0] = 0 iiii.fffffffffffffff

Error_current is the average over the patterns, having the same format as sumerror[19:0]:

delta_E_1[19:0] = error_current[19:0] - error_previous_iteration[19:0]

Saturation arithmetic and truncation are applied to delta_E_1[19:0] to produce an all-fractional value for delta_E[15:0], the measure of global performance broadcast to the weight units. Since the errors do not change significantly between consecutive iterations, an all-fractional value for delta_E[15:0] is most likely to occur naturally, and the saturation/truncation operation will not affect the final result significantly.
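The module below is a hypothetical sketch of this error bookkeeping, assuming four training patterns; the saturation constants are chosen by analogy with the 7f/81 convention used elsewhere in the text, and the real control unit (Appendix A) adds the pattern ROM, handshaking and iteration control.

module error_calc_sketch(
    input                clk,
    input                pattern_done,    // one pattern's output is ready
    input                iteration_done,  // all patterns seen this iteration
    input  signed [7:0]  target,          // desired output, s.fffffff
    input  signed [7:0]  out_out,         // calculated output, s.fffffff
    output reg    [15:0] delta_E          // broadcast value, s.fffffffffffffff
);
    reg signed [9:0]  diff;               // sign-extended difference
    reg signed [19:0] sq_error;           // 0 iiii.fffffffffffffff
    reg [19:0] sumerror  = 20'd0;
    reg [19:0] error_prev = 20'd0;
    reg [19:0] error_curr, d;

    always @(posedge clk) begin
        if (pattern_done) begin
            diff     = target - out_out;  // automatic sign extension
            sq_error = diff * diff;       // always non-negative
            sumerror = sumerror + sq_error;
        end
        if (iteration_done) begin
            error_curr = sumerror >> 2;   // average of 4 patterns (a shift, since '/' is not synthesizable)
            d          = error_curr - error_prev;
            // clip to an all-fractional 16-bit value before broadcast:
            if (d[19] == 1'b0 && d[18:15] != 4'b0000)
                delta_E = 16'h7fff;       // positive saturation
            else if (d[19] == 1'b1 && d[18:15] != 4'b1111)
                delta_E = 16'h8001;       // negative saturation
            else
                delta_E = d[15:0];
            error_prev = error_curr;
            sumerror   = 20'd0;
        end
    end
endmodule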


Figure 5.10: portion of a sample run showing the calculation of the individual pattern errors, the accumulated error, and the broadcast value delta_E

5.5.4 Noise_serial, clock_generator and power-on-reset modules:


The HDL descriptions of these modules are included in Appendix A; they are self explanatory and their operation does not need clarification with timing waveforms. Noise_serial implements a single random bit generator, continuously outputting a random bit through a serial port. Eight of these units in parallel make up a random number generator, simulating the behavior of the NM810 chip (see Newbridge, 1992) that will be included on the final board. The clock generator generates the single phase system clock, with a 50 nanosecond period, and the power-on-reset module asserts an active low reset signal at the beginning of the simulation, de-asserted after 1000 nanoseconds, simulating a system power-on reset. The weight units are initialized during reset by reading in a 16-bit random number from the random number generator, so the reset signal has to be held asserted long enough to allow this initialization routine to complete. This was successfully simulated for the behavioral description. However, when the weight unit was passed through the synthesis tools for a synthesizability check, it failed. The description was changed, so the current implementation initializes the weights with a fixed, small number (positive or negative) for all the weight units. The initial values of the weights do not affect the final outcome, since the network still converges. The difference in initialization values did affect the speed of convergence, however: in one case (for a given set of training patterns, using small positive values for the initial weights) the network converged faster (in 12 iterations) than in another case (the same set of training patterns but small negative values for the weights), in which the network converged after 56 iterations. Other registers are also initialized during reset.
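A minimal behavioral model of one such random bit generator might look as follows; this is a sketch, not the Appendix A listing, and the use of $random makes it simulation-only, matching the text's note that the physical noise source is the off-chip NM810.

module noise_serial(input clk, output reg noise_bit);
    // behavioral only: non-synthesizable pseudo-random bit stream
    always @(posedge clk)
        noise_bit <= $random;   // keep the least significant random bit
endmodule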


5.5.5. Operations and machine cycles used by the control unit and the neural array during one training iteration:


The total number of cycles for one training iteration solving a linearly separable problem was 176; with a 50 nanosecond clock period, this gives about 8800 nanoseconds (8.8 microseconds) per iteration, as can be seen in the timing waveforms in Figure 5.9 above. This estimate does not include the delay associated with the hardware multiplier. After the synthesis step was attempted, a hardware multiplier was manually designed to replace the unsynthesizable "*" operator. Another set of simulations was run including the effect of the multiplier on the speed of operation and on the area of the final gate level implementation. Table 5.3 shows the total number of cycles for a typical iteration including the delay associated with the hardware multipliers in the weight and output calculation units (a total of 25 microseconds per training iteration). The waveforms in Figure 5.11 show the actual times.
Step  Cycles  Control unit                                 Each weight unit                            Neurons (first layer / output layer)
 1      1     sumerror<=0; pattern#<=0                     wait for enable                             wait for enable / wait for enable
 2      1     enable_train<=1                              if (enable), at next clock: done<=0         wait / wait
 3      1     enable_train<=0                              read 1 noise bit into reg Noise             wait / wait
 4      1     for each pattern: 1) output training input   read another noise bit                      wait / wait
 5      1     2) increment vector count                    read another noise bit                      wait / wait
 6      1     3) output training input                     read another noise bit                      wait / wait
 7      1     4) increment vector count                    read another noise bit                      wait / wait
 8     38     wait(done_weight)                            read remaining noise bits and delta_error;  wait / wait
                                                           weight update algorithm (38 cycles), all
                                                           units concurrently; done_weight<=1
 9      1     enable_test<=1                               wait for delta_E_ready (new delta_error     wait for done_weight, enable_test and the
                                                           calculated by the control unit)             ready outputs of the previous layer
10     49     wait for done (all neurons calculate and     wait                                        enable=1? output calculation algorithm
              pass their outputs to the next layer with                                                (m=49); out<=value; ready<=pulse / wait
              the new weights): wait for
              ready_output_layer_neuron
11     49     wait for ready_output_layer_neuron           wait                                        wait / enable=1? output calculation
              (total wait = m * #_of_layers)                                                           algorithm (m=49); out<=value; ready<=pulse
12      1     enable_test<=0 (disable neurons)             wait                                        wait / wait
13      1     calculate error for this pattern             wait                                        wait / wait
14    110     repeat steps 4-12 for the remaining          wait                                        wait / wait
              patterns (step 8 takes 0 cycles for the
              2nd and successive patterns, since the
              weights are updated once per iteration)
15      6     calculate new delta_E after all patterns     wait                                        wait / wait
              are done for this iteration
16      1     delta_E_ready<=1                             done_weight<=0                              wait / wait
17      1     store new error; delta_E_ready<=0            wait                                        wait / wait
18      1     iter_curr<=iter_curr+1 (continue at step 2)  wait                                        wait / wait

Total of 492 clock cycles * 50 nsec = 24,600 nsec

Table 5.3: machine cycles for one training iteration


Figure 5.11: timing waveforms for Alopex


5.6. Synthesis step


After the complete system is functionally verified using the Verilog XL integration environment, it is passed through the synthesis tools, Synergy. For a detailed step-by-step process, please refer to Appendix B, "How to use the software tools." Figures 5.12 and 5.13 show the results of synthesizing the described HDL modules. The output calculation unit was successfully synthesized. The reports generated by the Synergy synthesis tools indicate a total area of 218,446.11 units (578 cells), a longest path delay of 41.49 nanoseconds, and a maximum clock frequency of 24.1 MHz. An earlier version of the weight unit was successfully synthesized. However, the current algorithmic version, even though it passed the synthesizability check, did not produce a gate level schematic. Area and timing reports were produced, however, showing a total area of 235,940 units (612 cells), a longest path delay of 34.48 nanoseconds and a maximum clock frequency of approximately 29 MHz. No errors were reported by the synthesis tools. Presumably, the netlist file generated for the weight unit was too large to be handled by the current system resources. This problem is currently being investigated. One option being studied is the further refinement of the weight module into smaller sub-modules, which would be synthesized separately. The control unit synthesis step generated a synthesized schematic. However, when the gate level schematic was inspected, two functional blocks had been included to simulate the memory module that stores the training and testing patterns; thus, a layout was not generated by the place and route tools. The control unit will be implemented in the future as a microcontroller, to allow for user programming of several parameters and options such as specific training/testing vectors, the slope of the sigmoid, and different error measures.


(Synthesis cost and timing report for circuit weight3: 612 cells, total area 235,904.63 units; maximum clock frequency 29.004 MHz; longest path delay 34.48 ns.)

Figure 5.13: gate count, total area and maximum delay reports for a) weight unit, b) output calculation unit


(Synthesis cost and timing report for circuit neuro2: total area 218,446.11 units; maximum clock frequency 24.099 MHz; longest path delay 41.49 ns.)

Figure 5.13 (continued): gate count, total area and maximum delay reports for a) weight unit, b) output calculation unit


5.7. Placement and routing


The netlist produced from the synthesized neuron schematic was exported to Cascade's Epoch place and route environment. Please refer to Appendix B for a step-by-step description of how to use these tools. Figure 5.14 shows the final chip layout of the output calculation unit. The inputs to the chip are for power, the system clock, and power-on-reset. An estimated final area of 0.265 mm² for each of the weight units (estimated from synthesis reports, using a 0.5 micron rule for place and route) and of 0.245 mm² for the output calculation units (from place and route reports) resulted from the synthesis, place and route steps. The control unit is considered to be off-chip at this point, and is currently being implemented as a microcontroller. A mixed level simulation was run for the complete system to verify its functionality.

Figure 5.14: final layout for output calculation unit


Chapter 6 CONCLUSIONS AND FUTURE WORK


6.1 Summary

A complete top-down design process was described and applied to mapping a deterministic (single layer perceptron) learning algorithm to hardware. Using the developed methodology, a stochastic learning algorithm was mapped to hardware. The modifications made to make the original learning algorithms VLSI implementable were described. In summary, the following steps were implemented and are recommended for a top-down design methodology to map a learning algorithm to hardware:

(Optional) Simulate the learning algorithm in a high level language such as C or C++ to gain a thorough understanding of its behavior.

Define a black box for the top level system and write the complete behavioral description in HDL.

Define a module for the system clock, reset, etc., outside the main system module (since reset and clock operations may not be synthesizable)

Do a preliminary module partitioning by separating the section of the HDL description with the control statements (this becomes the control unit module) from the rest of the program (this becomes a second module).


Write the top level structural description indicating the connectivity among the various modules defined until now, i.e., the control unit module, the second main module, the reset module and the system clock module.

Leave the control unit module alone for now and work on the second module, trying to identify tasks and functions that are called by this module. Define a new submodule for each of these sections of code.

Continue refining this module to find other submodules, going further into the behavioral description until possibly ending with a structural description of only interconnected, low level, independent modules. Try to design the modules so they can be cascaded to produce a network of any size. Ideally, these modules should contain only state machines (implemented with "case" statements) and instantiations of lower level modules. In practice, a final implementation may still contain some behavioral constructs, as the refinement process may not achieve the desired purely structural description. If the synthesis and layout tools produce the desired gate level schematic and chip layout, and the simulation of their behavior using Verilog XL is satisfactory, then a complete structural description may not be necessary (this iterative module partitioning step is illustrated in Figure 4.4; a skeleton of the resulting partitioning is sketched after this list).

As each module is further refined, test the hierarchy of modules within the top level system, making sure the overall system still functions as desired.

Work on the control unit; implement it as a state machine (i.e., with "case" statements, if possible, and lower level module instantiations) or as a microcontroller. Mixed level simulations can be used to verify the complete system functionality if this step is not completed.

Once the functional/structural views are simulated using Verilog XL and their behavior is satisfactory, the complete system (or individual modules, one at a time) is passed through the Synergy synthesis tools. Make sure only synthesizable modules are sent to the synthesis tools; a preliminary synthesizability check with the Synergy tools can confirm this. If portions of the Verilog description for which a gate level implementation is desired are not synthesizable, revise the code and try to follow the modeling style for synthesis described in the on-line manuals, [Stem, 1993] and [QuickRef]. The complete gate level description produced is verified using Verilog XL. If the design does not behave like the behavioral/structural descriptions, or the area and timing constraints are not satisfied, re-write the Verilog descriptions of the modules. Most of the time, the way the modules are coded in Verilog makes the difference between a synthesizable block and one that is not. Re-synthesize the design and verify it with Verilog XL. The synthesized design is then placed and routed. The design is backannotated and verified against the gate level simulation using Verilog XL. Several iterations of the previous steps may be necessary until the final layout performs like the gate and/or behavioral level descriptions.
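As a minimal illustration of the partitioning these steps arrive at, the skeleton below shows the shape of a top level module under this methodology; the module and signal names here are placeholders, not the dissertation's actual design (for that, see Figure 5.4), and the four instantiated modules are assumed to be defined elsewhere.

// Top level: structure only, with clock and reset kept in separate,
// non-synthesizable modules so they can be removed before synthesis.
module system_top;
    wire clk, reset, start, done;

    clock_generator cg(clk);                      // simulation-only
    power_on_reset  por(reset);                   // simulation-only
    control_unit    cu(clk, reset, done, start);  // the "case"-statement state machine
    datapath        dp(clk, reset, start, done);  // iteratively refined into submodules
endmodule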

6.2 Conclusions

One of the present implementations has three neurons with six weights, arranged as one input layer, one hidden layer and one output layer, and can solve linearly separable problems using the Alopex learning algorithm. Assuming a clock speed of 20 MHz, it takes 25 microseconds per training iteration, with an average of 56 iterations for solving linearly separable problems (i.e., a total time of 1.4 milliseconds). The current implementation can easily be expanded to larger systems. The HDL descriptions can be modified to accept different sizes of input/output ports and weights, and a completely new design could be generated within a few days. The output calculation unit description could easily be modified to accept more than the present implementation's two inputs and two weights, by increasing the number of input ports or by sequentially accepting several sets of inputs and weights through the same input ports. This would require an additional control signal to coordinate the timing of data transfers. The present design is fully asynchronous, with each module handshaking with the interconnected module(s) with which it communicates, so the added delay in the reading of inputs and weights by the output calculation units would not require much modification of the original design: a short loop to read in as many inputs and weights as a parameter indicates, and another loop to add the product terms sequentially, would suffice.
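The module below sketches what such a sequential read-and-accumulate loop might look like; N_TERMS, the port names and the one-pair-per-cycle handshake are assumptions for illustration, not the Appendix A description.

module seq_mac_sketch(
    input                    clk, start,
    input  signed [7:0]      in_value, w_value,  // one input/weight pair per cycle
    output reg signed [19:0] acc,                // 20-bit accumulator, as in neuro2
    output reg               ready
);
    parameter N_TERMS = 4;       // number of product terms, set by a parameter
    integer count;

    always @(posedge clk) begin
        if (start) begin
            acc   <= 20'd0;
            count <= 0;
            ready <= 1'b0;
        end else if (count < N_TERMS) begin
            acc   <= acc + in_value * w_value;   // 8x8 -> 16 bits, sign-extended into 20
            count <= count + 1;
            ready <= (count == N_TERMS - 1);     // pulse after the last term
        end
    end
endmodule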


By modifying the Verilog HDL description, the effects of all these design options on functionality, timing and silicon area can be studied. After satisfactory functionality is achieved, the automatic synthesis, place and route tools generate a new gate level schematic and floorplan. Several iterations of these steps may again be necessary since, when going from the top down, new undesirable effects may be uncovered at each level by the changes introduced a level up in the design hierarchy, requiring the designer to go back up and modify the HDL description to solve the new challenge.

Powerful software tools allow complete designs to be accomplished by a single designer in a reasonable period of time. The methodology developed for the deterministic single layer perceptron learning algorithm was applied to a novel stochastic learning algorithm, Alopex. The results obtained were comparable to neural network hardware designs by other researchers. For example, in (Melton, 1992) the authors implement the Kohonen learning algorithm, with a maximum speed of operation of 15 MHz and with a single neuron occupying an area of over 1 mm²; their network takes 5.1 microseconds per training iteration. The single layer perceptron implemented for this dissertation can operate at a speed of 37.553 MHz, has an area of 0.208 mm² for a complete training module (two weights), and takes 3.2 microseconds per training iteration. The Alopex implementation resulted in a single neuron area of 0.245 mm², capable of operating at about 30 MHz and taking 25 microseconds per training iteration.

Pandya & Venugopal report a CPU time per iteration* for software simulations of Alopex of 0.228 ms on a VAXstation 2000, with a speed of operation comparable to that of the hardware implementation done for this dissertation. The time per iteration for the hardware implementation is 25 microseconds and includes the processing of four patterns, as opposed to one pattern for the VAX simulation (0.228 ms per pattern versus 25/4 ≈ 6.25 microseconds per pattern). This represents a speedup of almost forty times for the hardware implementation over the software simulation.

The results obtained from the simulations done for this dissertation could be improved by optimizing the operations done by the individual units at each clock cycle, thus reducing the delay associated with each unit's operation. Today, a 166 MHz clock frequency is widely available, which would allow a fivefold reduction in the timing estimates, i.e., 0.64 microseconds per iteration for the perceptron and 5 microseconds per iteration for Alopex. Similarly, our educational tools allowed for a feature size of 0.5 micron at best, while 0.2 micron is standard today, which would reduce the chip area five times. With the best technology available today, 0.07 micron, the chip area can be reduced one-hundred fold, i.e., 2,080 square micrometers for the perceptron training module and 2,450 square micrometers for Alopex neurons. The objective of this dissertation was to accomplish a functionally correct architecture, without major concern for obtaining the fastest and smallest design, while developing a methodology that can be applied to other design problems.

* One iteration for Alopex was considered by Pandya & Venugopal as one weight update, one pattern processed, and one error calculation. For this dissertation, one iteration for Alopex includes one weight update, all patterns processed, and all individual errors calculated (for every pattern), plus an overall error measure of global network performance.

The iterative technique with backannotation at each design step produced many changes in the original behavioral description. The use of the software tools allowed all these modifications and verifications to be


done in a reasonable period of time and by a single designer. VLSI implementations are now within the reach of many, since there is no need to manually design every component, build prototypes, test, modify, rebuild and retest. Use of these software tools, coupled with a top-down design methodology, allows for a more effective, shorter turnaround time, with all the steps involved being handled by a single designer. Estimates of design feasibility, manufacturability, timing and area requirements can be made early in the design process. Each design may then be optimized for improved performance. The objective was to produce a hardware implementable design, with estimates for speed and area, that can be used to study the learning power of a parallel architecture in real time, and not with software simulations using a high level language such as C. Higher levels of synthesis may be available in the near future, so many limitations of the current implementations, such as modules that now require a manual design entry or a modification to the original algorithms that compromises performance, may no longer be an issue. More complex designs may be automatically generated by these tools. Presently, the Synergy synthesis tools available do not allow memory arrays (RAM or ROM), multiplication operations or division operations to be synthesized. Even though a gate level schematic was generated for the modules that included ROM, a functional block was generated in place of the given sections of code (i.e., zero silicon area in the cost reports and complete execution in one clock cycle), thus not allowing an automatic place and route operation. These functional blocks have to be manually designed, adding to the area and delay and giving rise to a complete new set of synchronization problems. The hardware signed multiplier was manually designed and incorporated in each design. The effect of this addition was an increase in execution time proportional to the size of the operands being multiplied, and an increase in the final chip area. Additional synchronization problems were created, requiring in some cases the modification of other modules to add handshake signals that make them wait for the added delay introduced by the multiplier before using the results of the multiplication.
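A minimal sketch of a signed shift-add multiplier with a done handshake is given below to illustrate the kind of unit and synchronization involved. The 4-bit operand width, the port names and the control details are assumptions made for this sketch, not a description of the actual hand-designed unit:

// Illustrative sketch only; not the actual hand-designed multiplier.
module signed_mult(clk, reset, start, a, b, product, done);
input clk, reset, start;
input [3:0] a, b;                 // two's complement operands
output [7:0] product;
output done;
reg [7:0] product;
reg done, busy, sign;
reg [3:0] ma, mb;                 // operand magnitudes
reg [7:0] acc;
reg [2:0] i;

always @(posedge clk)
begin
  if (reset)
  begin
    done <= 0;
    busy <= 0;
    product <= 0;
  end
  else if (start && !busy)
  begin
    // take magnitudes and remember the sign of the result
    ma <= a[3] ? (~a + 1) : a;
    mb <= b[3] ? (~b + 1) : b;
    sign <= a[3] ^ b[3];
    acc <= 0;
    i <= 0;
    busy <= 1;
    done <= 0;
  end
  else if (busy)
  begin
    if (i < 4)
    begin
      // shift-add: add the shifted multiplicand when the current bit is set
      if (mb[0])
        acc <= acc + ({4'b0, ma} << i);
      mb <= mb >> 1;
      i <= i + 1;
    end
    else
    begin
      // restore the sign and raise done for one cycle; a consumer
      // module waits on done before using the product
      product <= sign ? (~acc + 1) : acc;
      done <= 1;
      busy <= 0;
    end
  end
  else
    done <= 0;
end
endmodule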


6.3 Directions for future work


Directions for future work include the addition of the digital sigmoid described in chapter 3, since the current implementation uses a linear transfer function with saturation. Other suggested additional work is the inclusion of different annealing schedules and different error measures, to evaluate the influence on speed and area of adding more complex routines to the HDL descriptions. The current implementation could be optimized for improved performance, i.e., for better area and timing results: by refining the Verilog HDL code to group into a single clock cycle more operations that can be done concurrently by the modules, by manually designing fast adders and multipliers, by further partitioning the existing modules while studying the effect this has on delays and silicon area, by increasing the system clock speed to the maximum possible for each module (as given by the maximum delay paths in the synthesis reports), and by pipelining operations. However, the objective of this dissertation was accomplished, since the concern was mainly to develop a methodology to map a learning algorithm to a VLSI implementable architecture and obtain correct functionality. No particular constraints were placed on area or timing, since a hardware implementation would surely produce a faster system than a software simulation using a sequential machine and a high level language such as C. In addition, future research may include cascading modules to build larger networks and testing them on more complex applications, modifying the Alopex algorithm further to avoid the inclusion of a multiplier in each weight unit, and allowing for the accumulation of more product terms in the neurons.
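As an illustration of the pipelining suggestion only, the following two-stage multiply-accumulate sketch accepts a new input/weight pair every clock cycle; the module name, the widths and the behavioral "*" (again standing in for a hand-designed multiplier) are assumptions made for this sketch:

// Illustrative sketch only; not part of the present implementation.
module mac_pipe(clk, reset, in_data, w_data, acc);
input clk, reset;
input [7:0] in_data, w_data;
output [19:0] acc;
reg [19:0] acc;
reg [15:0] prod;                  // pipeline register between the stages

always @(posedge clk)
begin
  if (reset)
  begin
    prod <= 0;
    acc <= 0;
  end
  else
  begin
    // stage 1: multiply the current input/weight pair
    prod <= in_data * w_data;
    // stage 2: accumulate the previous cycle's product, so the
    // multiplier and the adder work on different pairs concurrently
    acc <= acc + prod;
  end
end
endmodule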



Chapter 7
Appendix A
A.1. Perceptron HDL behavioral/structural descriptions & results



// Perceptron
// 12/8/95 - modified 2/15/96; 2/24/96; 3/28/96
// 4/1/96
// Laura Ruiz
// multiplier hand designed in training module
// control unit still has "*" since it won't be synthesized
//
module control_unit(enable_train, out, traindat_mtx_1, traindat_mtx_2,
                    w_1, w_2, clk, reset, done);
input clk, reset, done;
input [3:0] w_1, w_2;
output enable_train;
output [3:0] traindat_mtx_1, traindat_mtx_2, out;

reg [3:0] traindat_mtx_1, traindat_mtx_2, out;
reg [3:0] w1, w2, w1_old, w2_old, temp3;
reg [11:0] temp;
reg [11:0] temp1, temp2;
reg [7:0] iter_curr;
reg [3:0] numrec, count, k;
reg enable_train;
parameter iter_max = 8'h50, maxrec = 4;
// RAM for training vectors:
reg [3:0] traindat_mtx[0:7], testdat_mtx[0:3], targetdat_mtx[0:3];

always @ (posedge elk) begin:control_unit if (reset) disable control_unit; //begin training else begin 6(posedge elk) // enable_train <= l'bl; count <= 4'hO; while ((count<maxrec) && (iter_curr<iter_max)) begin @ (posedge elk) k <=4'h0; count <= 4'h0; 6 (posedge elk) Sdisplay("count=%h\n",count); for (numrec=0; numrec<=maxrec-l; numrec=numrec+l) begin @(posedge elk) $display("numrec=%d\n", numrec); traindat_mtx_l [3:0] <= traindat_mtx[k]; . @(posedge elk) k <= k+1; @(posedge elk) traindat_mtx_2[3:0] <= traindat_mtx[k]; @(posedge elk) out[3:0] <= targetdat_mtx[numrec]; k <= k+1; @(posedge elk) enable_train <= l'bl; $display ("traindat_mtx_l=%h traindat_mtx_2=%h out (desired) =%h\n", traindat_mtx_l, traindat_mtx_2, out); wait(done) //wait for signal from "training" module @ (posedge elk) enable_train <= 1'bO; wl[3:0] <= w_l[3:0]; w 2 [ 3 : 0 ] <= w_2[3:0]; iter_curr <= iter_curr+l; $ (posedge elk) $display ("wl= %h w2= %h epoch=%d wl_old= %h w2_old = %h\n", wl, w2, iter_curr, wl_old,w2_old); 0(posedge elk) if((wl[3:0]==wl_old[3:0])&&(w2[3:0]==w2_old[3: 0])) count=count+l; 6(posedge elk) wl_old[3:0] <= wl[3:0]; w2_old[3:0] <= w2[3:0]; $display("count=%h wl_old = %h w2_old= %h\n",count, wl_old, w2_old); end end 6(posedge elk) k <= 0; end // begin testing now & display results $display ("\n testing\n\n"); @(posedge elk) tempi[11:0] <=mult(wl,testdat_mtx[0]); 0(posedge elk) tenp2 [11:0] <=mult (w2, testdat_mtx [1].); 0(posedge elk) temp[ll:0] <=templ[11:0] + temp2[ll:0]; $di splay (H * * * * ^ ^ ^^ j 0(posedge elk) if(temp[11]==1) //truncate output for 4-bit data path begin temp3[3:0] = 4'h9; $display("class=C2, output -0.875 \n"); end else begin


temp3[3:0] = 4'h7; $display("elass=Cl, output= +0.875 \n"); end @(posedge elk) $display("w_l= %h testdat_mtx[0] = %h \n w_2 = %h testdat_mtx[1]=%h class=%h \n" wl, testdatjmtx[0], w2, testdat_mtx[1], temp3); $display(M*********************************************** * * \ n ) * 0(posedge elk) tempi[11:0] <* mult(wl, testdat_mtx[2]); 0 (posedge elk) temp2[ll:0] <= mult(w2, testdat_mtx[3]); 0 (posedge elk) temp[ll:0] <= templ[ll:0] + temp2[11:0]; 0(posedge elk) if(temp[11]==1) begin temp3[3:0] = 4'h9; $display("elass=C2, output 0.875\n"); end else begin temp3[3:0]= 4'h7; $display("elass=Cl, output= +0.875\nB); end 0(posedge elk) $display("w_l= %h testdat_mtx[2]= %h\n w_2 = %h testdat_mtx[3] = %h class = %h \n wl, testdat_mtx[2], w2, testdat_mtx[3], temp3); $display (11 ************************* * **************\nw) $inish; ** * end always 0(reset) if(!reset) begin disable control__unit; t raindat_mtx[0]=4'h2; traindatjmtx[1]=4'hi; traindatjmtx[2]=4'hi; t raindat_mtx[3]=4' hf; traindat_mtx[ 4]=4'hO; traindat_mtx[5]=4'h2; traindat_mtx[6]=4' he; traindat_mtx[7]=4'hi; targetdat_mtx[ 0]=4'h7; targetdat_mtx[1]=4' h7; targetdat_mtx[2]=4'h9; targetdat_mtx[3]=4'h 9; testdat_mtx[0]=4'hf; testdat_mtx[1]=4'h2; testdat_mtx[2]=4'hi; testdat_mtx[3]=4'hl; assign temp = 0; assign traindat_mtx_l = 0; assign traindat_mtx_2 = 0; assign iter_curr = 1; assign nurturec =0; assign k = 0; assign enable_train = 0; assign wl_old =0; assign w2_old =0; assign wl = 0; assign w2 =0; assign count =0; end else begin deassign temp; deassign traindat_mtx_l; deassign traindat_mtx_2; deassign iter_curr; deassign numrec; deassign k; deassign enable_train; deassign wl_old; deassign w2_old; deassign wl; deassign w2; deassign count; end function[11:0] mult; input[3:0] X; input[3:0] W; reg[3:0] x_l; reg [3:0] w_l; reg[11:0] temp; reg sign, signl, sign2; begin w_l[3:0]= W[3:0]; x_l[3:0]= X[3:0];


if (w_1[3] == 1)
begin
  w_1[3:0] = ~w_1[3:0] + 1'b1;
  sign1 = 1'b1;
end
else
begin
  w_1[3:0] = w_1[3:0];
  sign1 = 1'b0;
end
if (x_1[3] == 1)
begin
  x_1[3:0] = ~x_1[3:0] + 1'b1;
  sign2 = 1'b1;
end
else
begin
  sign2 = 1'b0;
  x_1[3:0] = x_1[3:0];
end
temp[11:0] = w_1[3:0] * x_1[3:0];
temp[11:0] = temp[11:0] << 1;
sign = sign1 ^ sign2;
if (sign == 1'b1)
  temp[11:0] = ~temp[11:0] + 1'b1;
else
  temp[11:0] = temp[11:0];
mult[11:0] = temp[11:0];
end
endfunction
endmodule

module training(done, w1, w2, x1, x2, out, clk, reset, enable_train);
input [3:0] x1, x2, out;
input clk, reset, enable_train;
output [3:0] w1, w2;
output done;
reg done, sign, signw, signx;
reg [3:0] X_1, X_2, OUT;
reg [3:0] W_1, W_2, w1, w2;
reg [11:0] temp1, temp2;
reg [11:0] temp, result;

always @(posedge clk)
begin: training
  if (!reset) disable training;
  if (enable_train)
  begin
    @(posedge clk)
    X_1[3:0] <= x1[3:0];
    X_2[3:0] <= x2[3:0];
    OUT[3:0] <= out[3:0];
    @(posedge clk)            // first multiplication
    signw <= w1[3];
    signx <= X_1[3];
    @(posedge clk)
    if (signw)
      w1[3:0] <= ~w1[3:0] + 1;
    else
      w1[3:0] <= w1[3:0];
    if (signx)
      X_1[3:0] <= ~X_1[3:0] + 1;
    else
      X_1[3:0] <= X_1[3:0];
    sign <= signx ^ signw;
    result[11:0] <= 12'h000;
    @(posedge clk)
    begin
      if (X_1[0])
        result[11:0] <= result[11:0] + w1[3:0];
      w1[3:0] <= w1[3:0] << 1;
    end
    @(posedge clk)
    begin
      if (X_1[1])
        result[11:0] <= result[11:0] + w1[3:0];
      w1[3:0] <= w1[3:0] << 1;
    end
    @(posedge clk)
    begin
      if (X_1[2])
        result[11:0] <= result[11:0] + w1[3:0];
      w1[3:0] <= w1[3:0] << 1;
    end
    @(posedge clk)
    begin
      if (X_1[3])
        result[11:0] <= result[11:0] + w1[3:0];


wl [3:0] <= wl [3:0] 1; end @(posedge elk) result [11:0] <= result[11:0] << 1; //adjust for fractional data @(posedge elk) if(sign) templ[ll:0] <= -result[11:0] + 1; else tempi[11:0] <= result[11:0]; @(posedge elk) #0 $display("w*x_inside multiplier_templ=%h \n", tempi); //start second multiplication signw <= w2[3]; signx <= X __2 [3]; @(posedge elk) if(signw) w2[3:0] <= ~w2[3:0] + 1; else w2[3:0] <= w2[3:0]; if(signx) X_2[3:0] <= ~X_2[3:0] + 1; else X_2[3:0] <= X_2[3:0]; sign <= signx * signw; result[11:0] <= 12'h000; @(posedge elk) begin if(X_2[0]) result[11:0] <= result[ll:0] + w2[3:0]; w2 [3:0] <= w2 [3:0] 1; end @(posedge elk) begin if(X_2[l]) result[11:0] <= result[11:0] + w2[3:0]; w2 [3:0] <= w2 [3:0] 1; end @(posedge elk) begin if (X_2 [.?]) result[11:0] <= result[11:0] + w2[3:0]; w 2 [ 3 : 0 ] < = w 2 [ 3 : 0 ] 1 ; end 0(posedge elk) begin if (X_2 [3]) result[11:0] <= result[11:0] + w2[3:0]; w 2 [ 3 : 0 ] < = w 2 [ 3 : 0 ] 1 ; end //adjust result for fractional data 0(posedge elk) result[11:0] <= result[11:0] 1; 0 (posedge elk) if(sign) temp2[ll:0] <= ~result[ll:0] + 1; else temp2 [11:0] <= result [11:0]; 0(posedge elk) #0 $display("w*x_inside_multiplier_temp2= %h \n", temp2); //accumulate results of multiplications: temp[11:0] <= tempi[11:0] + temp2[11:0]; 0(posedge elk) Sdisplay("x_l= %h x_2 = %h wl=%h w2= %h out= %h wl*xl+w2*x2= %h \n" x_i, X_2, W_l,W_2, OUT,temp); if (OUT==4'h9) //class C2, output=-0.875; begin if(temp[11]==0) begin 0(posedge elk) W_2[3:0]<=W_2[3:0]-X_2[3:0]; W_1 [3:0]<=W_1[3:0]-X_l[3:0] ; end else begin 0(posedge elk) W_2[3:0]<=W_2[3:0]; W_1[3:0]<=W_1[3:0]; end end else


if (OUT == 4'h7)              // class C1, output = +0.875
begin
  if (temp[11] == 1)
  begin
    @(posedge clk)
    W_2[3:0] <= W_2[3:0] + X_2[3:0];
    W_1[3:0] <= W_1[3:0] + X_1[3:0];
  end
  else
  begin
    @(posedge clk)
    W_2[3:0] <= W_2[3:0];
    W_1[3:0] <= W_1[3:0];
  end
end
@(posedge clk)
w1[3:0] <= W_1[3:0];
w2[3:0] <= W_2[3:0];
done <= 1'b1;
@(posedge clk)
done <= 1'b0;
// outputs result using "old" weights
#0 $display("W_1_new= %b; X_1=%b; \nW_2_new= %b; X_2=%b\n", W_1, X_1, W_2, X_2);
end
end

always @(reset)
if (!reset)
begin
  disable training;
  assign X_1 = 0;
  assign X_2 = 0;
  assign OUT = 0;
  assign W_1 = 0;
  assign W_2 = 0;
  assign done = 0;
  assign temp = 0;
  assign w1 = 0;
  assign w2 = 0;
end
else
begin
  deassign X_1;
  deassign X_2;
  deassign OUT;
  deassign W_1;
  deassign W_2;
  deassign done;
  deassign w1;
  deassign w2;
  deassign temp;
end

task multi; output[11:0] temp; input[3:0] W, X; reg signw, signx, sign; reg[11:0] result,temp; reg[3:0] x, w; begin @(posedge elk) x[3:0] <= X [ 3:0 ];

w[3:0] <= W[3:0]; @(posedge elk) signw<= w[3]; signx <= x[3]; @(posedge elk) $display("x=%h ; w=%h\n", x, w); if(signw) w[3:0] <= ~w[3:0] + 1; else w[3:0] <= w[3:0]; @(posedge elk) if(signx) x[3:0] <= ~x[3:0] + 1; else x[3:0] <= x[3:0]; sign <= signx A signw; result <= 12'h000; @(posedge elk) $display(sign=%b; signw=%b; signx=%b\n", sign, signw, signx); begin if (x[0])


result[ll:0] <= result[ll:0] + w[3:0]; w[3:0] <= w[3:0] 1; end 6 (posedge elk) begin if(x[l]) result[ll:0] <= result[ll:0] + w[3:0]; w[3:0] <= w[3:0] 1; end @(posedge elk) begin if(x[2]) result[11:0] <= result[ll:0] + w[3:0]; w[3:0] <= w[3:0] 1; end 0 (posedge elk) begin if(x[3]) result[11:0] <= result[11:0] + w[3:0]; end // 0(posedge elk) // result[ll:0] <= result[11:0]1; @(posedge elk) if(sign) begin temp[11:0] <= -result[11:0] + 1; end else begin temp[ll:0] <= result[11:0]; end /* #0 $display("w*x inside multiplier = %h (times) %h = % h ; signw= %b signx = %b sign=%b\n", w, x, temp,signw, signx, sign); */ end endtask /*
function[11:0] mult; input[3:0] X; input[3:0] W;

reg[3:0] x_l; reg[3:0] w_l; reg[11:0] tenp3;


reg sign, signl, sign2; begin w_l[3:0] = W[3:0]; x_l [3:0] = X [ 3:0]; if(w_l[3]==1) .begin w_l [3:0]=~w_l[3:0]+l'bl; signl=l'bl;

end else
begin signl=l'b0; w _ l [ 3 : 0 ] = w _ l [ 3 : 0 ] ; end if(x_l[3]==l) begin x_l[3:0]=~x_l[3:0]+l'bl; sign2=l'bl;


end else begin sign2=l'b0; x_l[3:0]=x_l[3:0]; end temp [ 11:0]=w_l [3:0] *x_l [3:0]; temp[ll:0]=temp[ll:0] 1; s ign=s ignl*s ign2; if(sign==l) temp[11:0]=~temp[11:0]+l'bl; else temp=temp; $display("w*x_inside_mult_in_training= %b\n",temp); mult = temp; end endfunction
*/

endmodule

module m(clk);
output clk;
reg clk;
initial
begin
  clk = 1'b0;
  #5 clk = 1'b1;
end
always #25 clk = ~clk;
endmodule

module power_on_reset(reset);
output reset;
reg reset;
initial
begin
  reset = 1'b1;
  #5 reset = 1'b0;
  #100 reset = ~reset;
end
endmodule

module net_top();
wire done, enable_train, clk, reset;
wire [3:0] x1, x2, out;
wire [3:0] w1, w2;
power_on_reset power_on(reset);
m clock_generator(clk);
training train(done, w1, w2, x1, x2, out, clk, reset, enable_train);
control_unit c_u(enable_train, out, x1, x2, w1, w2, clk, reset, done);
endmodule

Host command: verilog
Command arguments: finalmult.v
VERILOG-XL 2.2.13 log file created Apr 1, 1996 16:37:59
VERILOG-XL 2.2.13 Apr 1, 1996 16:37:59



Compiling source file "finalmult.v"
Highest level modules:
net_top

count=0
numrec= 0
traindat_mtx_1=2 traindat_mtx_2=1 out(desired)=7
w*x_inside multiplier_temp1=000
w*x_inside_multiplier_temp2= 000
x_1= 2 x_2 = 1 w1=0 w2= 0 out= 7 w1*x1+w2*x2= 000
W_1_new= 0000; X_1=0010;
W_2_new= 0000; X_2=0001
w1= 0 w2= 0 epoch= 2 w1_old= 0 w2_old = 0
count=1 w1_old = 0 w2_old= 0
numrec= 1
traindat_mtx_1=1 traindat_mtx_2=f out(desired)=7
w*x_inside multiplier_temp1=000
w*x_inside_multiplier_temp2= 000
x_1= 1 x_2 = 1 w1=0 w2= 0 out= 7 w1*x1+w2*x2= 000
W_1_new= 0000; X_1=0001;
W_2_new= 0000; X_2=0001
w1= 0 w2= 0 epoch= 3 w1_old= 0 w2_old = 0
count=2 w1_old = 0 w2_old= 0
numrec= 2
traindat_mtx_1=0 traindat_mtx_2=2 out(desired)=9
w*x_inside multiplier_temp1=000
w*x_inside_multiplier_temp2= 000
x_1= 0 x_2 = 2 w1=0 w2= 0 out= 9 w1*x1+w2*x2= 000
W_1_new= 0000; X_1=0000;
W_2_new= 1110; X_2=0010
w1= 0 w2= e epoch= 4 w1_old= 0 w2_old = 0
count=2 w1_old = 0 w2_old= 0
numrec= 3
traindat_mtx_1=e traindat_mtx_2=1 out(desired)=9
w*x_inside multiplier_temp1=000
w*x_inside_multiplier_temp2= ffc
x_1= 2 x_2 = 1 w1=0 w2= e out= 9 w1*x1+w2*x2= ffc
W_1_new= 0000; X_1=0010;
W_2_new= 1110; X_2=0001
w1= 0 w2= e epoch= 5 w1_old= 0 w2_old = e
count=3 w1_old = 0 w2_old= e
count=0
numrec= 0
traindat_mtx_1=2 traindat_mtx_2=1 out(desired)=7
w*x_inside multiplier_temp1=000
w*x_inside_multiplier_temp2= ffc
x_1= 2 x_2 = 1 w1=0 w2= e out= 7 w1*x1+w2*x2= ffc
W_1_new= 0010; X_1=0010;
W_2_new= 1111; X_2=0001
w1= 2 w2= f epoch= 6 w1_old= 0 w2_old = e
count=0 w1_old = 0 w2_old= e
numrec= 1
traindat_mtx_1=1 traindat_mtx_2=f out(desired)=7
w*x_inside multiplier_temp1=004


w*x_inside_multiplier_temp2= 002
x_1= 1 x_2 = 1 w1=2 w2= f out= 7 w1*x1+w2*x2= 006
W_1_new= 0010; X_1=0001;
W_2_new= 1111; X_2=0001
w1= 2 w2= f epoch= 7 w1_old= 2 w2_old = f
count=1 w1_old = 2 w2_old= f
numrec= 2
traindat_mtx_1=0 traindat_mtx_2=2 out(desired)=9
w*x_inside multiplier_temp1=000
w*x_inside_multiplier_temp2= ffc
x_1= 0 x_2 = 2 w1=2 w2= f out= 9 w1*x1+w2*x2= ffc
W_1_new= 0010; X_1=0000;
W_2_new= 1111; X_2=0010
w1= 2 w2= f epoch= 8 w1_old= 2 w2_old = f
count=2 w1_old = 2 w2_old= f
numrec= 3
traindat_mtx_1=e traindat_mtx_2=1 out(desired)=9
w*x_inside multiplier_temp1=ff8
w*x_inside_multiplier_temp2= ffe
x_1= 2 x_2 = 1 w1=2 w2= f out= 9 w1*x1+w2*x2= ff6
W_1_new= 0010; X_1=0010;
W_2_new= 1111; X_2=0001
w1= 2 w2= f epoch= 9 w1_old= 2 w2_old = f
count=3 w1_old = 2 w2_old= f
count=0
numrec= 0
traindat_mtx_1=2 traindat_mtx_2=1 out(desired)=7
w*x_inside multiplier_temp1=008
w*x_inside_multiplier_temp2= ffe
x_1= 2 x_2 = 1 w1=2 w2= f out= 7 w1*x1+w2*x2= 006
W_1_new= 0010; X_1=0010;
W_2_new= 1111; X_2=0001
w1= 2 w2= f epoch= 10 w1_old= 2 w2_old = f
count=1 w1_old = 2 w2_old= f
numrec= 1


traindat_mtx_1=1 traindat_mtx_2=f out(desired)=7
w*x_inside multiplier_temp1=004
w*x_inside_multiplier_temp2= 002
x_1= 1 x_2 = 1 w1=2 w2= f out= 7 w1*x1+w2*x2= 006
W_1_new= 0010; X_1=0001;
W_2_new= 1111; X_2=0001
w1= 2 w2= f epoch= 11 w1_old= 2 w2_old = f
count=2 w1_old = 2 w2_old= f
numrec= 2
traindat_mtx_1=0 traindat_mtx_2=2 out(desired)=9
w*x_inside multiplier_temp1=000
w*x_inside_multiplier_temp2= ffc
x_1= 0 x_2 = 2 w1=2 w2= f out= 9 w1*x1+w2*x2= ffc
W_1_new= 0010; X_1=0000;
W_2_new= 1111; X_2=0010
w1= 2 w2= f epoch= 12 w1_old= 2 w2_old = f
count=3 w1_old = 2 w2_old= f
numrec= 3
traindat_mtx_1=e traindat_mtx_2=1 out(desired)=9
w*x_inside multiplier_temp1=ff8
w*x_inside_multiplier_temp2= ffe
x_1= 2 x_2 = 1 w1=2 w2= f out= 9 w1*x1+w2*x2= ff6
W_1_new= 0010; X_1=0010;
W_2_new= 1111; X_2=0001
w1= 2 w2= f epoch= 13 w1_old= 2 w2_old = f
count=4 w1_old = 2 w2_old= f

testing

*************************************************
class=C2, output= -0.875
w_1= 2 testdat_mtx[0] = f
w_2 = f testdat_mtx[1]=2 class=9
*************************************************
class=C1, output= +0.875
w_1= 2 testdat_mtx[2]= 1
w_2 = f testdat_mtx[3] = 1 class = 7
*************************************************
L134 "finalmult.v": $finish at simulation time 19650 6726 simulation events CPU time: 2.2 secs to compile + 2.2 secs to link + 0.6 secs in simulation End of VERILOG-XL 2.2.13 Apr 1, 1996 16:38:18
A.2. Sigmoid HDL behavioral description

module sigmoid_int(x, y, in, clk, reset);
// 2/12/96
// Laura Ruiz
input clk, reset;
input [7:0] x;
input in;
output [15:0] y;
reg [7:0] X;
reg [15:0] y;
parameter xsat1 = 8'hfd, xsat2 = 8'h02, zero = 8'h7f;

always @(posedge clk)
begin: sigmoid
  if (reset) disable sigmoid;
  if (in)
  begin
    X[7:0] <= x[7:0];
    // test loop: tabulates the transfer function over the input range
    for (X = 0; X <= 8'hff; X = X + 1)
    begin
      if ((y != 8'h0) && (y != 8'hff))
        $display("%d %d %h %h ", X, y, X, y);
      @(posedge clk)
      if (X >= xsat1)
        y[15:0] = 16'hffff;
      else if (X <= xsat2)
        y[15:0] = 16'h0;
      else if ((X <= xsat1) && (X >= zero))
        y[15:0] = ((X - zero) * (X - zero)) >> 5;
      else if ((X >= xsat2) && (X <= zero))
        y[15:0] = (16'h7e02 - ((zero - X) * (zero - X))) >> 5;
      else
        $display("error");
    end
  end
  else
    $display("in=0");
end

always @(reset)
if (!reset)
begin
  assign X = 0;
  assign y = 0;
end
else
begin
  deassign X;
  deassign y;
end
endmodule
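In general terms (a summary of the structure of the listing above, with the constants left symbolic rather than asserted as exact values), the module computes a second-order, piecewise approximation of the sigmoid:

\[
y(x) =
\begin{cases}
0, & x \le x_{\mathrm{sat2}}, \\
c_1 - k_1 (x_0 - x)^2, & x_{\mathrm{sat2}} < x \le x_0, \\
c_2 + k_2 (x - x_0)^2, & x_0 < x < x_{\mathrm{sat1}}, \\
2^{16} - 1, & x \ge x_{\mathrm{sat1}},
\end{cases}
\]

where \(x_0\) corresponds to the parameter zero and the constants \(c_1, c_2, k_1, k_2\) are fixed by the hexadecimal literals and shift amounts in the code.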
A.3. Alopex C language implementation and results


/* Program: hidnew8.c
 * Laura Ruiz
 * 4/15/95
 */
#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#include <string.h>


#define MAXHIDNEURON 10
#define MAXINPNEURON 10
#define MAXOUTNEURON 10
#define MAXTRAINREC 10
#define MAXTARREC MAXTRAINREC
#define MAXITER 100000
#define true 1

double Delta;
double P;
double w_hid_to_inp[MAXHIDNEURON][MAXINPNEURON][3];
double w_hid_to_out[MAXOUTNEURON][MAXHIDNEURON][3];
double w_hid2_to_hid[MAXHIDNEURON][MAXHIDNEURON][3];
double hid2out[MAXHIDNEURON];
double hid_out[MAXHIDNEURON];
double out_out[MAXTARREC][MAXOUTNEURON];
double traindat_mtx[MAXTRAINREC][MAXINPNEURON];
double targetdat_mtx[MAXTRAINREC][MAXOUTNEURON];
double drand48();
double error_curr[3];
double out_error[MAXTRAINREC][MAXOUTNEURON];
double threshold, T, To;
long iter_curr;
double delta_error;
int maxrec, maxreclenth;
int max_input_neuron;
int max_hidden_neuron;
int max_output_neuron;
int readinput(char *trfile);
void initweights();
void preferences(long *iter_max, double *ETA, double *error_max, double *DELTA,
                 double *T, double *To, char *trfile, char *ttfile, int *action);
void train(double ETA, long iter_max, double error_max, double DELTA,
           double T, double To);
void saveweights(char *trfile);
int restoreweights(char *trfile);
void decode(char *ttfile, char *trfile, double To);
void change_ext(char infilename[], char outfilename[], char extension[]);

/* MAIN:
   ext files to read IN:  training data: p.readinput
                          testing data:  p.decode
                          weights data:  p.restoreweights
                          keyboard:      user's preferences
   files to write OUT:    weights data:  p.saveweights
*/
void main()
{
    double error_max, ETA, DELTA, Delta, T, To;
    int n, action;
    long iter_max, iter_curr;
    char ttfile[50], trfile[50];

    threshold = 0.8;
    while (true)
    {
        preferences(&iter_max, &ETA, &error_max, &DELTA, &T, &To,
                    trfile, ttfile, &action);
        if (action == 1)                     /* 1 = test net */
        {
            printf("test file = %s\n", ttfile);
            printf("To = %lf\n", To);
            printf("action=testing \n");
            decode(ttfile, trfile, To);
        }
        else if (action == 2)                /* 2 = train net */
        {
            printf("action = training \n");
            if (readinput(trfile))
            {
                initweights();
                train(ETA, iter_max, error_max, DELTA, T, To);
                saveweights(trfile);
            }
        }
        else
            exit(1);
    } /* end infinite loop main */
}

/EEADINPUT: read training file & store it in array, called by MAIN calling NONE input file TRAIN decided by user */ int readinput(char *trfile) { FILE *traindat; int numrec; int reclength; int inp_idx, out_idx; if ((traindat = fopen(trfile, "rt)) == NULL) printf("Unable to open training input file: %s", trfile); exit(0);
> !

fscanf(traindat,"%d, fimaxrec); /printf("%dn,maxrec); */ fscanf(traindat,"%d",&max_input_neuron);


/* print f("%d",max_input_neuron);*/ fscanf(traindat, "%d",&max_hidden_neuron);

/* printf (%d,inax_}iidden_neuron); */ fscanf(traindat, "%d",&max_output_neuron); /* printf (%d\n'',max_output_neuron); */ for (numrec=0; numrec<=maxrec-l;nunirec++)


1

for (inp_idx=l; inp_idx<=max_input_neuron; inp__idx++) {fscanf(traindat,"%lf",itraindat_mtx[numrec][inp_idx]) ; /* printf("%lf\n,traindat_mtx[numrec][inp_idx]); */


}

for (out_idx=l; out_idx<=max_output_neuron; out_idx++) {fscanf(traindat,"%lf",fitargetdat_mtx[numrec][out_idx]);


} }

fclose(traindat); return(1); /INITWEIGHTS: initializes weights to random numbers between 0 and 1, called by MAIN calling NONE */


void initweights () { int inp_idx, hid_idx,hid2idx, out_idx; for (hid idx=l; hid_idx<=max_hidden_neuron; hid idx++)
I

for (inp_idx=l; inp_idx<=max_input_neuron;inp_idx++)


{

w_hid_to_inp[hid_idx] [inp_idx] [0]=drand48 (); w_hid_to_inp[hid_idx] [inp_idx] [l]=drand48 () ; w_hid_to_inp [hid_idx ] [ inp_idx ] [ 2 ] =drand4 8 0; /*
*/ } }

printf ("weight[%d] [%d] [0] =% 1.151f\n",hid_idx,inp_idx,w_hid to inplhid idx] [inp

for (hid2idx=l; hid2idx<=max_hidden_neuron; hid2idx++)


{

for(hid_idx=l; hid_idx<=max_hidden_neuron; hid_idx++) i w_hid2_to_hid[hid2idx] [hid_idx] [0]=drand48(); w_hid2_to_hid[hid2idx] [hid_idx] [l]=drand48 (); w_hid2_to_hid[hid2idx] [hid_idx] [2]=drand48 (); /* printf ("weight [%d] [%d] [0]=%1.151f\n",hid2idx,hid_idx,w_hid2_to_hid[ hid2idx][hid_idx][0]);
*/ } }

for (hid2 idx=l; hid2 idx<=max_hidden_neuron; hid2 idx++)


{

for(out idx=l;out idx<=max output neuron;out idx++)


{!

w_hid_to_out [out_idx] [hid2idx] [21 =drand48 (); w_hid_to_out [out_idx] [hid2idx] [l]=drand48(); w_hid_to_out [out_idx] [hid2idx] [0]=drand48 () ; printf ("weight[%d] [%d] [0]=%1.151f\n",out idx,hid2idx,w hid to out[out idx] [hi

/*
>'_ } }

/PREFERENCES: get info from user called by MAIN calling NONE */ void preferences(long *iter_max, double *ETA, double *error_max, double *DELTA, double *T,double *To, char *trfile, char *ttfile, int *actio
{

action =0; printf("Enter action: 1 for testing, 2 for training, 0 for quiting "); scanf("%d",action); printf("action selected is: %d\n",*action); if(*action == 2)
{

printf("Enter max # of iterations: "); scanf("%u", iter_max); printf("max # of iterations selected is: %u\n",*iter_max); printf("Enter learning rate: "); scanf("%lf",ETA); printf("learning rate selected is: %lf\n",*ETA); printf("Enter max allowed error: "); scanf("%lf n,error_max); printf("max allowed error selected is: %lf\n", *error_max);


printf("Enter step size: "); scanf("%lf,DELTA); printf("step size selected is: %lf\n", *DELTA); printf("Enter value for T: "); scanf("%lfn,T); printf("T selected is: %lf\n",*T); printf("Enter value for To for training: ); scanf("%lf", To); printf("To selected is: %lf\n",*To); printf("Enter name of training file: "); scanf("%s", trfile); printf(trfile); printf("\n");
} {

else if(*action == 1) printf("Enter name of testing file: "); scanf("%s", ttfile) ; printf(ttfile); printf("Enter value of To for testing: "); scanf("%lf",To); printf("To = %lf\n", *To); printf("\n");
}

else exit(1);

/TRAIN*/ void train(double ETA, long iter_max, double error max, double DELTA, double T, double To) { int hid_idx, inp_idx,hid2idx, out_idx; int n, numrec=0, numrecr; double sum, sumerror, sumT=0; long iter_curr=l; double temp=0.0; error_currf1]=error_curr[0]=1.0 ; threshold=0.8; while ((iter_curr<=iter_max) & s (error_max<error_curr [ 0 ])) { /*Anneal T*/ /* printf("T- %1.151f\n",T) ;*/ if(((iter_curr%10)==0)&&(numrec==maxrec-l))
{

T=sumT/10; if (T==0.0) T=0.00000001; sumT=0;


for (hid_idx=l;hid_idx<=max_hidden_neuron; hid_idx++)


{

/calculate output of hidden neurons for hidden layer #1 */ sum=0; for(inp_idx=l; inp_idx<=max_input_neuron; inp__idx++) sum=(w_hid_to_inp[hid_idx][inp_idx][0] *traindat_mtx[numrec][inp_idx])+sum; /calculate error for output neuron*/
} <

sum=(sum-threshold)/To; if(sum>40) sum=40; if(sum<=-40) sum=-40; hid_out [hid_idx] =1/ (1+exp (-sum));
}

for(hid2idx=l;hid2idx<=max_hidden_neuron;hid2idx++) /calculate output of hidden neurons for hidden layer #2 */ sum=0; for(hid_idx=l;hid_idx<=max_hidden_neuron;hid_idx++) { sum=(w_hid2_to_hid[hid2idx][hid_idx][0] *hid_out[hid_idx])+sum; sum=(sum-threshold)/To; if (sum>40) sum=40; if(sum<=-40) sum=-40; hid2out[hid2idx]=l/(1+exp(-sum));
<

for (out_idx=l; out_idx<=max_output_neuron; out_idx++) { sum=0; for(hid2idx=l;hid2idx<=max_hidden_neuron;hid2idx++)


{ }

sum=(w_hid_to_out[out_idx][hid2idx][0] *hid2out[hid2idx])+sum; if(sum>40) sum=40; if (sum<=-40) sum=-40;

sum=(sum-threshold)/To;

/*
*/

out_out [numrec] [out_idx]=l/ (1+exp (-sum)); printf("out[%d][%d]= %1.151f\n",numrec,out idx,out out[numrec][out idx]) /output for output neuron*/


for (out_idx=l; out_idx<=max_output_neuron; out_idx++)


{

/squared error measure*/ out_error [numrec] [out_idx] = (targetdatjmtx [numrec] [out_idx] -out_out [numrec] /* out_error [numrec] [out_idx]=loglO (1/ (l-out_out [numrec] [out_idx])); else out_error [numrec] [out_idx]=loglO(l/out_out [numrec] [out idx]);
/

/* printf ("numrec = %d out_error [%d] [%d] = %1.151f \n", numrec, numrec, out idx, out errori */
>

numrec++; /* after training all patterns once,find global error and adjust the weightsV if(numrec==maxrec)
{

sumerror=0; for (numrecr=0;numrecr<=maxrec-l;numrecr++) for (out_idx=l; out_idx<=max_output_neuron; out_idx++)


{ >

sumerror=out_error[numrecr][out_idx]+sumerror;

error_curr[0]=sumerror/maxrec; / printf("%1.151f error[%d]; %d iterations ; %1.151f delta_error; %1.151f Deltc if (iter_curr%100=0) printf("%1.151f error; %d iterations\n",error_curr[0], iter_curr); numrec-0; delta_error=error_curr[0]error_curr[l]; error_curr[1]=error_curr[0]; /calculate the new weights for hidden neurons layer #1 */ for (hid_idx=l; hid_idx<=max_hidden_neuron; hid_idx++)
{

for (inp_idx=l; inp_idx<=max_input_neuron; inp_idx++)


{

Delta=(w_hid_to_inp[hid_idx] [inp_idx] [1] w_hid_to_inp[hid_idx][inp_idx][2]) delta_error; / printf("w[2] - w[l] = %1.151f %1.151f = %1.151f \n",w_hid_to_inp[hid_idx][ w_hid_to_inp[hid_idx][inp_idx][1]); printf("Delta=%l.151f ", Delta);/ / printf("w_hid_to_inp(-1)= %1.151f , w_hid_to_inp(-2)= %1.151f\n", w_hid_to_inp[hid_idx][inp_idx][(iter_curr-l)%3] ,w_hid_to_inp[hic (iter_curr-2)%3]);
/

temp=Delta/T; if (teii55<40) temp=-40; if(temp>=40)


temp=40;

P=l/(1+exp(ETA*temp)); /* printf("P= %lf ",P);*/ temp=drand48(); /* print f C'rand#= %lf\n",temp);*/ if(P>temp) w_hid_to_inp[hid_idx] [inp_idx] [0] = w_hid_to_inp[hid_idx] [inp_idx] [1] +DELTA;
} { {

else

w_hid_to_inp[hid_idx][inp_idx] [ 0] = w_hid_to_inp[hid_idx][inp_idx][1]-DELTA;
}/* */

printf("new w_hid_to_inp = %1.151f\n",w hid to inp[hid idx][inp idx] w_hid_to_inp[hid_idx] [inp_idx] [2]=w_hid_to_inp[hid_idx] [inp_idx] [ 1 ]; w_hid_to_inp[hid_idx] [inp_idx] [l]=w_hid_to_inp [hid_idx] [inp_idx] [0];
} }

for (hid2idx=l;hid2idx<=max_hidden_neuron;hid2idx++)
{

for (hid_idx=l;hid_idx<=max_hidden_neuron;hid idx++) { Delta=(w_hid2_to_hid[hid2idx] [hid_idx] [1] w__hid2_to_hid[hid2idx] [hid_idx] [2]) * delta_error; * * printf("Delta=%l.151f delta_error=%l.151f\n",Delta,delta_error);*/ printf("w_hid_to_inp(-1) = *1.151f , w_hid_to_inp(-2)= %1.151f\n", w_hid_to_inp[hid_idxl [inp_idxj [ (iter_curr-l)<%3],w_hid_to_inp[hiri temp=Delta /T; if(temp<40) temp=-40; if(temp>=40) temp=40; printf("P= %lf",P); */ if(P>temp)
{

[ /* /*

P=l/ (1+exp (ETA*temp)) ;

t emp=drand4 8 0; printf(rand#= %lf\n",temp); */ w_hid2_to__hid[hid2idx] [hid_idx] [0] = w_hid2_to_hid[hid2idx][hid_idx][1]+DELTA;


} {

else w_hid2_to_hid[hid2idx][hid_idx] [0] = w_hid2_to_hid[hid2idx][hid_idx][1]-DELTA;


}/*

printf("new w_hid_to_inp = %1.151f\n",w_hid_to_inp[hid_idx][inp_idx] w_hid2_to_hid[hid2idx] [hid_idx] [2]=w_hid2_to_hid[hid2idx] [hid_idx] [1]


}


weoutput=fopen(outdatname,"wt");

/calculate the new weights for output neurons*/ for (out_idx=l; out_idx<=max_output_neuron; out_idx++)
{

for (hid2 idx=l; hid2 idx<=max_hidden_neuron; hid2 idx++)


{

/*

Delta=(w_hid_to_out[out_idx] [hid2idx] [1]w_hid_to_out [out_idx] [hid2idx] [2]) * delta_error; printf("Delta' = %1.151f\n", Delta);*/ sumT=sumT+fabs (Delta); temp=Delta/ T; if(temp<40) tentp=40; if(temp>=40 ) temp=40; P=l/ (1+exp (ETA*temp)); /update weights*/ temp=drand4 8 0; if(P>temp) w_hid_to_out[out_idx][hid2idx][0]= w_hid_to_out [out_idx] [hid2idx] [1]+DELTA; else w_hid_to_out [out_idx] [hid2idx] [0] = w_hid_to_out [out_idx] [hid2idx] [1 ] -DELTA;

')

/* |
} >

printf("new w_hid_to_out = %1.151f\n",w_hid_to_out[out_idx][hid_idx][ w_hid_to_out [out_idx] [hid2idx] [2]=w_hid_to_out [out_idx] [hid2idx] [1]; w_hid_to_out[out_idx][hid2idx][1]= w_hid_to_out [out_idx] [hid2idx] [0];

iter curr++; } /*endif*/ } /*endwhile*/


}

/SAVEWEIGHTS*/

void saveweights(char trfile) {FILE weoutput; int hid_idx,hid2idx, inp_idx, out_idx; char outdatname[50]; printf("saveweights\n");


w_hid2_to_hid[hid2idx] [hid_idx] [1] = w_hid2_to_hid[hid2idx] [hid_idx] [0];

change_ext(trfile,outdatname,"wgh"); for(hid_idx=l;hid_idx<=max_hidden_neuron;hid_idx++)
{ }

for (inp_idx=l; inp_idx<=max_input_neuron; inp_idx++) fprintf(weoutput,"%1.151f\n",w_hid to_inp [hid_idx] [inp_idx] [0]);

for (hid2idx=l; hid2idx<=max_hidden_neuron; hid2idx++) i for (hid_idx=l; hid_idx<=max_hidden_neuron;hid_idx++) fprintf (weoutput, "%1.151f\n", w_hid2_to_hid[hid2idx] [hid_idx] [0]) /
}

for (out_idx=l; out_idx<=inax_output_neuron; out_idx++)


{ } }

for (hid2idx=l;hid2idx<=max_hidden_neuron;hid2idx++) fprintf(weoutput,"%1.151f\n",w_hid_to_out[out_idx][hid2idx][0]);

fclose(weoutput);

/*RESTOREWEIGHTS*/ int restoreweights(char *trfile) (FILE *weinput; int hid_idx, inp_idx, hid2idx,out_idx; char outweinameTsO]; printf("Retrieving weight values\n"); change_ext(trfile, outweiname,"wgh"); if((weinput=fopen(outweiname,"rt"))==NULL) printf("Unable to open weights input file\n"); return(0);
}

for (hid_idx=i; hid_idx<=max_hidden_neuron; hid_idx++)


{ }

for (inp_idx=l; inp_idx<=max_input_neuron; inp_idx++) fscanf (weinput, "%1.151f", &w_hid_to_inp[hid_idx] [inp_idx] [0]) ;

for (hid2 idx=l; hid2 idx<=max_hidden_neuron; hid2 idx++) for (hid_idx=l; hid_idx<=inax_hidden_neuron; hid_idx++) fscanf(weinput,"%1.151f",&w_hid2_to_hid[hid2idx][hid idx][0]);

for(out_idx=l;out_idx<=max_output_neuron;out_idx++) for (hid2idx=l;hid2idx<=max_hidden_neuron;hid2idx++) fscanf(weinput,"%1.151f\n",4w_hid_to_out[out_idx][hid2idx][0])


} }
;

fclose(weinput); return(1); void decode(char *ttfile,char *trfile, double To) {FILE *tinput; int numrec=l; int hid_idx, inp_idx,hid2idx, out_idx; double sum;


sum=(w hid2_to_hid[hid2idx] [hid_idx] [0]*hid out [hid idx]) + sum; }

if((tinput=fopen(ttfile,"rt"))==NULL)
{

printf("Unable to open test file:%s",ttfile); exit(0);

fscanf(tinput,"%d",fimaxrec); fscanf(tinput,"%d",&max_input_neuron); fscanf(tinput,"%d",&max_hidden_neuron); fscanf(tinput,"%d", &max_output_neuron); printf("max_rec: %d \n",maxrec); printf("max_input_neuron: %d\n ",max_input_neuron); printf("max_hidden_neuron: %d \n",max_hidden_neuron); printf("max_output_neuron: %d \n",max_output_neuron); restoreweights(trfile); printf("calculating outputs for test file : \n"); threshold=0.8; while(!feof(tinput)&&(numrec<=maxrec))
{

for(inp_idx=l;inp_idx<=max_input neuron;inp_idx++)
{

fscanf (tinput, "%lf", &traindat_mtx [numrec] [inp_idx]); /* printf("traindat_mtx[%d][%d]= %lf \n", numrec, inp_idx, traindat_mtx(numrec][inp_ic
*/ }

for (hid_idx=l; hid_idx<=max_hidden_neuron; hid_idx++)


{ { }

/calculate output for hidden neurons layer #1 */ sum=0; for (inp_idx=l; inp_idx<=max_input_neuron; inp_idx++) sum= (w_hid_to_inp [hid_idx] [inp_idx] [0] *traindat_mtx[numrec] [inp_idx]) +sum;

/calculate sigmoid for hidden neuron*/ sum=(sumthreshold) /To; if(sum>40) sum=40; if(sum<=40) sum=40; hid_out[hid_idx]=1/(1+exp(-sum));
}

for (hid2idx=l; hid2idx<=max_hidden_neuron; hid2idx++)


{

/calculate output for hidden neurons layer #2 */ sum=0; for (hid_idx=l; hid_idx<=max_hidden_neuron; hid_idx++) { /calculate sigmoid for hidden neuron*/ sum=(sum-threshold)/To; if(sum>40) sum=4 0; if(sum<=-40) sum=-40; hid2out [hid2idx] =1/ (1+exp (-sum));


for (out_idx=l; out_idx<=max_output_neuron; out_idx++)


{

/calculate output for output neuron*/ sum=0; for (hid2idx=l;hid2idx<=max_hidden_neuron;hid2idx++) sum= (w__hid_to_out [out_idx] [hid2idx] [0] hid2out [hid2idx]) + sum; /calculate sigmoid for output neuronV sum= (sum-threshold) /To; if(sum>40) sum=40 ; if(sum<=-40) sum=40; out_out [numrec] [out_idx] =1/ (1+exp (-sum)); printf ("%d %lf\n",numrec,out_out[numrec][out_idx]); numrec++;
}

.' )
}

/print results^/ fclose(tinput);

void change_ext(char infilename[], char outfilename[], char extension!]) /change the iextensin of the file*/ [int i=0,j,found=0; while(! found)
{

if(infilename [i] != '.') outfilename [i] = infilename[i++] ; else { found=l; outfilename [i++] =
>

>

for(j=0;j<=2;j++) outfilename[i++]=extension[ j]; out filename[i]=0x0 0;


}


6 2 2 1
1 0 0.1
2 1 0.9
1 2 0.9
3 1 0.9
1 3 0.9
1 0.1


Enter action: 1 for testing, 2 for training, 0 for quiting
action selected is: 2
Enter max # of iterations:
max # of iterations selected is: 2000
Enter learning rate:
learning rate selected is: 1.500000
Enter max allowed error:
max allowed error selected is: 0.001000
Enter step size:
step size selected is: 0.005000
Enter value for T:
T selected is: 10.000000
Enter value for To for training:
To selected is: 0.500000
Enter name of training file: trfile4.w
action = training
0.083790722105794 error; 100 iterations
0.061472119018033 error; 200 iterations
0.057704180817928 error; 300 iterations
0.055681928729816 error; 400 iterations
0.054415647164308 error; 500 iterations
0.054062445716231 error; 600 iterations
0.053870712183253 error; 700 iterations
0.053819672094691 error; 800 iterations
0.053846631848974 error; 900 iterations
0.053752152550334 error; 1000 iterations
0.053797034192116 error; 1100 iterations
0.053851260079234 error; 1200 iterations
0.053759940532910 error; 1300 iterations
0.053775085690418 error; 1400 iterations
0.053804955312124 error; 1500 iterations
n c c n \J U J . J / J O J 7 0 error; 1600 iterations
0.053775554467712 error; 1700 iterations
0.053855153908538 error; 1800 iterations
0.053756628790888 error; 1900 iterations
0.053864869964830 error; 2000 iterations
saveweights
Enter action: 1 for testing, 2 for training, 0 for quiting
action selected is: 1
Enter name of testing file: ttfile4.w
Enter value of To for testing: To = 0.100000
test file = ttfile4.w
To = 0.100000
action=testing
max_rec: 5
max_input_neuron: 2
max_hidden_neuron: 2
max_output_neuron: 1
Retrieving weight values
calculating outputs for test file :
0 0.999984
1 0.999984
2 0.000339
3 0.000338
4 0.999984
Enter action: 1 for testing, 2 for training, 0 for quiting
action selected is: 0

2 2 1
0 0 0.1
0 1 0.9
1 0 0.9
1 1 0.1



Enter action: 1 for testing, 2 for training, 0 for quiting
action selected is: 2
Enter max # of iterations:
max # of iterations selected is: 5000
Enter learning rate:
learning rate selected is: 1.500000
Enter max allowed error:
max allowed error selected is: 0.001000
Enter step size:
step size selected is: 0.005000
Enter value for T:
T selected is: 10.000000
Enter value for To for training:
To selected is: 0.600000
Enter name of training file: trfile.w
action = training
0.168259779473345 error; 100 iterations
0.165123067650220 error; 200 iterations
0.163767297370619 error; 300 iterations
0.162377073031144 error; 400 iterations
0.160173738737463 error; 500 iterations
0.158311034351551 error; 600 iterations
0.155659845968456 error; 700 iterations
0.151625071646242 error; 800 iterations
0.148287449356743 error; 900 iterations
0.144071391644504 error; 1000 iterations
0.142323385319672 error; 1100 iterations
0.139245496745843 error; 1200 iterations
0.136778671938104 error; 1300 iterations
0.134677589403852 error; 1400 iterations
0.133209904166981 error; 1500 iterations
0.131973476868168 error; 1600 iterations
0.129988832190930 error; 1700 iterations
0.128066881771030 error; 1800 iterations
0.126680779350106 error; 1900 iterations
0.124794608265575 error; 2000 iterations
0.123115432449045 error; 2100 iterations
0.119522515898817 error; 2200 iterations
0.117265260072319 error; 2300 iterations
0.113269186460996 error; 2400 iterations
0.108225605479347 error; 2500 iterations
0.104950813095311 error; 2600 iterations
0.101249497624252 error; 2700 iterations
0.097751523379228 error; 2800 iterations
0.095977911976831 error; 2900 iterations
0.092202291232662 error; 3000 iterations
0.089119999812971 error; 3100 iterations
0.085475981059144 error; 3200 iterations
0.082869463476562 error; 3300 iterations
0.081030114686507 error; 3400 iterations
0.078843868195085 error; 3500 iterations
0.077541037752258 error; 3600 iterations
0.075960538466115 error; 3700 iterations
0.074625807256360 error; 3800 iterations
0.074089346533248 error; 3900 iterations
0.073451932923013 error; 4000 iterations
0.072811441663876 error; 4100 iterations
0.072051114859130 error; 4200 iterations
0.071179903296751 error; 4300 iterations
0.070798448531594 error; 4400 iterations
0.070463242842834 error; 4500 iterations
0.069531238666617 error; 4600 iterations
0.068987389350651 error; 4700 iterations
0.068073912342076 error; 4800 iterations
0.067400359390076 error; 4900 iterations
0.066357900663609 error; 5000 iterations
saveweights
Enter action: 1 for testing, 2 for training, 0 for quiting
action selected is: 1
Enter name of testing file: ttfile.w
Enter value of To for testing: To = 0.200000
test file = ttfile.w
To = 0.200000
action=testing
max_rec: 4
max_input_neuron: 2
max_hidden_neuron: 2
max_output_neuron: 1
Retrieving weight values
calculating outputs for test file :


1 0.021487
2 0.996726
3 0.961094
4 0.020268
Enter action: 1 for testing, 2 for training, 0 for quiting
action selected is:
[PLEASE NOTE: Page(s) not included with original material and unavailable from author or university. Filmed as received. UMI]


A.4. Alopex HDL behavioral/structural description




module neural_net_top (in); //March 14, 1996 ; March 29; April 1 ; April 12 //Laura Ruiz errorll.v; linearly separable // with overflow check and negative weights and inputs // weights=siii.ffff inputs=s.fffffff // initializing weights with small pos.or neg.values, not random, all same; // multiplier designed by hand // input in; wire elk, enable_train, reset, delta_E_ready, noise_serial_l, noise_serial_2, noise_serial_3, noise_serial_4, noise_serial_5, noise_serial_6, noise_serial 7, noise_serial_8, not_noise_serial_5, not_noise_serial_7, done wl, done_w2, done_w3, done_w4, done_w9, done_wlO, en_l, en_2, en_5, ready_nl, ready_n2, ready_n5, enable_l, enable_2, enable_5, finished_train, enable_test, ready_nl_and_n2, finished_train_and_enable_test; wire[7:0] w_new_l, w_new_2, w_new_3, w_new_4, w_new_9, w_new_l0; wire[7:0] in_l, in_2, out_nl, out_n2, out_n5; wire[15:0] delta E; // power on reset power_on_reset // system clock m clock_generator(elk); power_on(reset);

// NM810 serial number generator noise_serial bit_serial_l(elk,noise_serial_l) noise_serial bit_serial_2(elk,noise_serial_2) noise_serial bit_serial_3(elk, noise_serial_3) noise_serial bit_serial_4(elk,noise_serial_4) noise_serial bit_serial_5(elk,noise_serial_5) noise_serial bit_serial_6(elk,noise_serial_6) noise_serial bit_serial_7(elk,noise_serial_7) noise_serial bit_serial_8 (elk,noise_serial_8) // bank of weight units weight3 weight_unit_l (enable_train, elk, reset, delta_E_ready, delta_E,not_noise_serial_5, w_new_l, done_wl); weight3 weight_unit_2(enable_train,elk,reset, delta_E_ready,delta_E, noise_serial_8, w_new_2, done_w2); weight3 weight_unit_3 (enable_train, elk, reset, delta_E_ready, deltaJE, noise_serial_3, w_new_3, done_w3); weight3 weight_unit_4 (enable_train, elk, reset, delta_E_ready, delta_E, not_noise_serial_7, w_new_4, done_w4); weight3 weight_unit_9 (enable_train,elk,reset,delta_E_ready,delta_E, noise_serial_4, w_new_9, done_w9); weight3 weight_unit_10(enable_train,elk,reset, delta_E_ready, delta_E, noise_serial_l, w_new_l0, done_wl0); // bank of output calculation units neuro2 n_l(in_l, in_2, w_new_l,w_new_3,out_nl,en_l, ready_nl,reset, elk); neuro2 n_2(in_l, in_2, w_new_2,w_new_4,out_n2,en_2, ready n2,reset,elk); neuro2 n_5 (out_nl,out_n2, w_new_9,w_new_10, out_n5, en_5, ready_n5, reset, elk); // control unit control_unit sequencer(elk,done_w4,reset, out_n5,ready_n5,enable_train, enable_test, delta_E, delta_E_ready,in_l,in_2, finished_train) //glue logic

not not_1(not_noise_serial_5, noise_serial_5);
not not_2(not_noise_serial_7, noise_serial_7);
and and_1(enable_1, done_w1, done_w3, enable_test);
and and_2(enable_2, done_w4, done_w2, enable_test);
and and_4(ready_n1_and_n2, ready_n2, ready_n1);
and and_5(finished_train_and_enable_test, finished_train, enable_test);
and and_8(enable_5, enable_test, ready_n1_and_n2, done_w9, done_w10);
or or_1(en_1, finished_train_and_enable_test, enable_1);
or or_2(en_2, finished_train_and_enable_test, enable_2);
or or_5(en_5, finished_train_and_enable_test, enable_5);

endmodule

// Verilog HDL for "Class", "weight3" "_functional" module weight3(Enable, elk, reset, delta_E_ready, delta_E, noise, W_new, done); input Enable, elk, reset, noise, delta_E_ready; input [15:0] delta_E; output [7:0] W_new; output done; reg reg reg reg [7:0] W_old, W_new, weight_id, delta_W; reg [9:0] W_temp; reg [15:0] deltaE; [15:0] Noise; , [20:0] x, result; done,sign_deltaE, sign_delta_W, sign; reg[4:0] i;

parameter T = 2; always 0(posedge elk) begin: training if(!reset) disable training; done <= l'bO; x[20:0] <= 21'hO; if (Enable) begin 0(posedge elk) done <= l'bO; for(i=16; i; i=i-l) begin @(posedge elk) Noise[i-l]<=noise; end 0(posedge elk) $display("Noise= %h ",Noise[15:0]); deltaE[15:0] <= delta_E[15:0]; 0(posedge elk) $display("delta_E_received_by_weight[%d] = %h - ",weight_id,deltaE); if(deltaE[15]==1) //take absolute value of deltaE begin @(posedge elk)


deltaE[15:0] <= -deltaE[15:0] + 1; sign_deltaE <= 1; end else 0(posedge elk) sign_deltaE <= 0; $display("delta_E=%h\n",deltaE); 0(posedge elk) begin // multiplier designed by hand starts here // multiplies a positive 16-bit number by a positive 5-bit number //result is 21 bits and replaces:
// //

// x[20:0] <= deltaE [15:0] * delta_W[4:0]; $display("deltaE=%h ; deltaW= %h \n",deltaE,delta_W); result[20:0] <= 21'hOOOOOO; if(delta w[0]) result[20:0] <= result[20:0] + {deltaE[15:0],5'hOO}; else result[20:0] <= result[20:0]; @ (posedge elk) result[20:0] <= result[20:0] 1; end 0 (posedge elk) begin f(delta W[1]) result[20:0] <= result[20:0] + [deltaE[15:0], 5'h00}; else result[20:0J <= result[20:0]; 0 (posedge elk) result[20:0] <= result[20:0] 1; end @ (posedge elk) begin if(delta_W[2]) result[20:0] <= result[20:0] + {deltaE[15:0], 5'h00}; else result[20:0] <= result [20:0]; 0(posedge elk) result[20:0] <= result[20:0] 1; end 0(posedge elk) begin if <delta_W[3]) result[20:0] <= result[20:0] + [deltaE[15:0], 5' hOO}; else result[20:0] <= result[20:0]; 0(posedge elk) result[20:0] < result [20:0] 1; end @(posedge elk) begin if (delta_W[4]) result[20:0] <= result[20:0] + [deltaE[15:0], 5'hOO]; else result[20:0] <= result[20:0]; 0 (posedge elk) result[20:0] <= result[20:0] 1; end 0(posedge elk) result[20:0] <= result[20:0] << 1; sign <= sign_deltaE A sign_delta_W; @(posedge elk) $display("result = %h \n",result); x[20:0] <= result[20:0];


@(posedge clk)
if (sign == 1)
  x[20:0] = ~x[20:0] + 1;
else
begin
  if (x[19:16] == 4'h0)
    x[20:0] = x[20:0] << 4;
  else
    x[20:0] = x[20:0];
end

end 0(posedge elk) $display ("deltaE*deltaW=x[?d] = %h %d \n \n ",weight_id, x,x); $display ("deltaE*deltaW = %h - sign_deltaE= %b - sign_delta_W = %b sign = %b x, sign_deltaE, sign_delta_W, sign);
// //

//new weight adjustment (March 14) 0 (posedge elk)

W_temp[9:0] <= {{2{W_old[7]}}, W_old[7:0]} - {{5{x[20]}}, x[20:16]} + Noise[15:0];

0(posedge elk) $display("W_temp=%h %d W_old= %h %d x= %h %d ANoise=%b\n",W_temp, W_temp, W_old, A w_old, x,x, Noise); //saturation arithmetic if(W_temp[9]==0) begin if(W temp[8:7] != 2'b00) W_new[7:0] <= 8'h7f; else W_new[7:0] <= W_temp[7:0]; end begin if(W temp[8:7] != 2'bll) W_new[7:0] <= 8'h81; else W_new[7:0] <= W_temp[7:0]; end

else

0(posedge elk) Sdisplay("W_new[%d]= %h, %d, W_old[%d]= %h, %d, x=%h %d, ANoise= %h\n", weight_id, W_new, W_new, weight_id, W_old, W old, x,x, ANoise); delta_W[7:0] <= W_new[7:0]-W_old[7:0] ; 0(posedge elk) Sdisplay("delta_W[%d]=%h\n",weight_id,delta_W); if(delta_W[7]==1) begin 0(posedge elk) delta_W[7:0] <= ~delta_W[7:0] + 1; sign_delta_W <= 1; end else 0 (posedge elk) sign_delta_w <=0; 0(posedge elk) $display("sign_delta_W = %b delta_W[%d] = %h \n",sign_delta_W,v;eight_id,delta_W); W_old[7:0] <= W_new[7:0]; done <= l'bl; wait(delta_E_ready) @(posedge elk) done <= 1'bO;


end

end

always @(reset) if(reset) begin disable training; assign delta_W = 8'h01; assign W_new = 8'hO; assign done = 1'bO; assign deltaE = 16'hlO; assign Noise = 16'hO; assign x = 21'hO; assign sign_deltaE = 1'bO; assign sign_delta_W = 1'bO; assign sign = 1'bO; assign W_old = 8'hfO; assign weight_id = 8'hff; assign result = 21'h000000; end else begin deassign delta_W; deassign w_old; deassign W_new; deassign done; deassign deltaE; deassign Noise; deassign x; deassign sign_deltaE; deassign sign_delta_W; deassign sign; deassign result; end endmodule

//*************************************************************************+*+*************** module

neuro2(ini, in2, weightl, weight2, out, en, ready, reset, elk); parameter To=l, threshold=0; input[7:0] ini, in2; input[7:0] weightl, weight2; input en, reset, elk; output ready; output[7:0] out; reg [7:0] Ini, In2, In, out; reg [7:0] Weightl, Weight2, Weight; reg [19:0] temp; reg ready, signw, signin, sign; reg [15:0] result,tempi,temp2; reg [3:0] i; //may be >0 or <0 // may be >0 or <0

always 6(posedge elk) begin: neuron if(!reset) disable neuron; if(en) begin 6(posedge elk)

begin

end @ (posedge elk) $display("ini = %d %h , in2= %d %h, weightl = %d %h, weight2 = %d %h\n", Ini, ini, In2, in2, Weightl, weightl, Weight2, weight2); @ (posedge elk) //multiplier starts here, replaces: //tempi[15:0] <= mult(Ini[7:0],Weightl[7:0]); signw <= Weightl[7]; signin <= Ini[7]; @ (posedge elk) if(signw) Weightl[7:0] <= ~Weightl[7:0] + 1; if(signin) Ini[7:0] <= ~Inl[7:0] + 1; sign <= signw A signin; result <= 16'h0000; [3:0] <= 4'hO; @(posedge elk) In[7:0] <= Ini[7:0]; Weight[7:0] <= Weightl[7:0]; $display("signw=%b,signin=%b,sign=%b\n",signw, signin,sign); for(i=0; i<=7; i=i+l) begin $display("i=%h; Inl=%h; Wl=%h;result=%h\n",i,In,Weight,result); G(posedge elk) begin if (ln[0]) result[15:0] <= result[15:0] + {Weight[7:0],8'h00]; else result [15:0] <= result[15:0]; @(posedge elk) result[15:0] <= result[15:0] 1; In [7 : 0] <= In [7:0] 1; end end @(posedge elk) result[15:0] <= result[15:0] 1; //adjust for fractional data @(posedge elk) if(sign) tempi[15:0] <= -result[15:0] + 1; else tempi[15:0] <= result[15:0]; @(posedge elk) // temp2[15:0] <= mult(In2[7:0],Weight2[7:0]); Sdisplay("tempi[15:0]= %d %h\n",tempi, tempi); signw <= Weight2[7]; signin <= In2[7]; @(posedge elk) if(signw) Weight2[7:0] <= ~Weight2[7:0] + 1; if(signin) In2[7:0] <= ~In2[7:0] + 1; sign <= signw A signin; result <= 16'h0000; i [3:0] <= 4'hO; @(posedge elk) In[7:0] <= In2 [7:0]; Weight[7:0] <- Weight2[7:0];

tempi [15:0] = 16'h0; temp2[15:0] = 16'h0; temp[19:0] = 20'h0; Ini[7:0] <= ini[7:0]; Xn2[7:0] <= in2[7:0]; Weightl[7:0] <= weightl[7:0]; Weight2[7:0] <= weight2[7:0];


$display("signw2=%b, signin2=%b, sign=%b\n",signw,signin,sign) ; for(i=0; i<=7; i=i+l) begin @(posedge elk) $display("i=%h; In2= %h; W2= %h; result = %h\n", i, In, Weight, result); begin if (In [0]) result[15:0] <= result[15:0] + {Weight[7:0],8'h00}; else result[15:0] <= result[15:0]; @(posedge elk) result [15: 0] <= result [15:0] >> 1; In[7 :0] <= In[7:0] 1; end end @ (posedge elk) result[15:0] <= result[15:0] << 1; //adjust for fractional data @(posedge elk) if(sign) temp2[15:0] <= -result[15:0] + 1; else temp2[15:0] <=* result [ 15: 0]; 0(posedge elk) $display("temp2=%h \n", temp2); temp [19:0] <= {{4 {tempi [15]}}, tempi [15:0]} + {{4{tenrp2 [15]}}, temp2[15:0]} @(posedge elk) if(temp[19]==0) begin if(temp[18:12] != 7'b0000000) out[7:0] <= 8'h7f; else begin if(temp[4:l] < 4'h8) out[7:0] <= {temp[19],temp[ll:5]}; else if(temp[4:l] > 4'h8) out[7:0] <= {temp[19], temp[ll:5]} + 1; else if((temp[4:l]==4'h8)&&(temp[5]==0)) out[7:0] <= {temp[19],temp[ll:5]]; else if((temp[4:1]==4'h8)&& (temp[5]==l)) out[7:0] <= {temp[19],temp[ll:5]} + 1; else Sdisplay ("error in out value"); end end else begin i f(t emp[18:12] != 7'blllllll) out[7:0] <= 8'h81; else begin //rounding: if (temp[4:1] < 4'h8) out [7:0] <= {temp[19],temp[11:5]}; else if(temp[4:l] > 4'h8) out [7:0] <= {temp[19],temp[11:5]} + 1;. else if((temp[4:1]==4'h8)&&(temp[5]==0)) out[7:0] <= {temp[19],temp[11:5]} ; else if((temp[4:l]==4'h8)S&(temp[5]==l)) out [7:0] <= {temp[19],temp[11:5]} + 1; else

*$

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

end end

$display ("error \n");

end

@ (posedge elk) $display("temp = %h %d ; out = %h %d \n",temp,temp,out,out); ready <=l'bl; @(posedge elk) ready <= l'bl; @ (posedge elk) ready <= 1'bO; end always @ (reset) if(reset) begin disable neuron; assign out=0; assign ready=l'b0; assign temp =0; assign temp2 = 0; assign tempi = 0; end deassign out; deassign ready; deassign temp; deassign temp2; deassign tempi; begin

else

end

endmodule
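The en/ready handshake of this module is easiest to see in simulation. The following test fixture is a sketch only; the names tb and u1 and the stimulus values are illustrative, not taken from the dissertation, and the reset sequencing mirrors the power_on_reset module listed further below.

module tb;
  reg        clk, reset, en;
  reg  [7:0] in1, in2, weight1, weight2;
  wire [7:0] out;
  wire       ready;

  // device under test: one two-input neuron
  neuro2 u1(in1, in2, weight1, weight2, out, en, ready, reset, clk);

  always #25 clk = ~clk;             // same period as module m below

  initial begin
    clk = 1'b0; en = 1'b0;
    in1 = 8'h20; in2 = 8'h10;        // example fractional operands
    weight1 = 8'h40; weight2 = 8'hc0;
    reset = 1'b1; #5 reset = 1'b0;   // mirrors power_on_reset
    #1000 reset = 1'b1;
    @(posedge clk) en = 1'b1;        // start one evaluation
    wait(ready);
    $display("out = %h", out);
    $finish;
  end
endmodule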

//********************************************************************************************
// Verilog HDL for "Class", "control_unit" "_functional"

module control_unit(clk, done_w, reset, out_out, done, enable_train, enable_test,
                    delta_E, delta_E_ready, traindat_mtx_1, traindat_mtx_2,
                    finished_train);
  input  clk, reset, done, done_w;
  input  [7:0] out_out;
  output enable_train, enable_test, delta_E_ready, finished_train;
  output [7:0] traindat_mtx_1, traindat_mtx_2;
  output [15:0] delta_E;
  reg [7:0]  traindat_mtx_1, traindat_mtx_2;
  reg [7:0]  temp, temp_out, temp1;
  reg [15:0] temp3;
  reg [15:0] iter_curr;
  reg [15:0] delta_E;
  reg [19:0] error_curr, error_curr_n1, saved_error, delta_E_1;
  reg [9:0]  temp4;
  reg [19:0] tempo, sumerror;
  reg [19:0] temp2;
  reg [3:0]  numrec, k;
  reg enable_train, enable_test, delta_E_ready, finished_train;
  // RAM for training & testing vectors:
  reg [7:0] traindat_mtx [0:15], targetdat_mtx [0:15], save_results_mtx [0:15];
  parameter To = 8'h1, threshold = 8'h8, iter_max = 16'h70,
            error_max = 20'h0000f, maxrec = 4;

  always @(posedge clk) begin: control_unit
    if(!reset) disable control_unit;
    //begin training
    begin
      while((iter_curr <= iter_max) && (error_curr >= error_max)) begin
        @(posedge clk)
        begin
          sumerror[19:0] <= 20'h0;
          k <= 4'h0;
        end
        @(posedge clk) enable_train <= 1'b1;
        @(posedge clk) enable_train <= 1'b0;
        for(numrec=0; numrec<=maxrec-1; numrec=numrec+1) begin
          @(posedge clk)
          begin
            $display("numrec= %d \n", numrec);
            traindat_mtx_1[7:0] <= traindat_mtx[k];
          end
          @(posedge clk) k <= k+1;
          @(posedge clk) traindat_mtx_2[7:0] <= traindat_mtx[k];
          @(posedge clk) k <= k+1;
          // $display("traindat_mtx_1=%d traindat_mtx_2=%d\n", traindat_mtx_1, traindat_mtx_2);
          wait(done_w)
          @(posedge clk) enable_test <= 1'b1;
          wait(done)
          begin
            @(posedge clk)
            begin
              enable_test <= 1'b0;
              temp_out[7:0] <= out_out[7:0];
              temp2[19:0] <= 20'h0;
              tempo[19:0] <= 20'h0;
              temp3[7:0]  <= targetdat_mtx[numrec];
            end
            @(posedge clk)
            // error term: sign-extended (target - output)
            temp4[9:0] <= {temp3[7], temp3[7], temp3[7:0]} -
                          {temp_out[7], temp_out[7], temp_out[7:0]};
            @(posedge clk)
            begin
              $display("delta_error=%h %d \n", temp4, temp4);
              if(temp4[9]==1) temp4[9:0] <= ~temp4[9:0] + 1'b1;
            end
            @(posedge clk)
            begin
              $display("target-temp_out= %h %d ", temp4, temp4);
              temp2[19:0] <= temp4[9:0] * temp4[9:0];
            end
            @(posedge clk) temp2[19:0] <= temp2[19:0] >> 1;
            @(posedge clk)
            begin
              $display("squared error = %h %d ", temp2, temp2);
              tempo[19:0] <= temp2[19:0] + sumerror[19:0];
            end
            @(posedge clk) sumerror[19:0] <= tempo[19:0];
            @(posedge clk)
            begin
              $display("sumerror = %h %d \n", tempo, tempo);
              $display("targetdat_mtx[%d] (desired_output) = %d %h ; out_out (calculated_output) = %d %h \n",
                       numrec, temp3, temp3, temp_out, temp_out);
            end
          end
        end //next record
        //after all records done, calculate new delta_E and adjust weights
        @(posedge clk) tempo[19:0] <= sumerror[19:0] >> 2;
        @(posedge clk) error_curr[19:0] <= tempo[19:0];
        @(posedge clk)
        begin
          $display("sumerror = %d %h ; error_curr = %d %h\n",
                   sumerror, sumerror, error_curr, error_curr);
          delta_E_1[19:0] <= error_curr[19:0] - error_curr_n1[19:0];
        end
        @(posedge clk)
        begin
          $display("error_curr= %d %h error_n1= %d %h \n",
                   error_curr, error_curr, error_curr_n1, error_curr_n1);
          //saturation arithmetic:
          if(delta_E_1[19] == 1'b0) begin
            if(delta_E_1[16:15] != 2'h0) temp3[15:0] <= 16'h7fff;
            else temp3[15:0] <= delta_E_1[15:0];
          end
          else if(delta_E_1[19] == 1'b1) begin
            if(delta_E_1[16:15] != 2'h3) temp3[15:0] <= 16'h8001;
            else temp3[15:0] <= delta_E_1[15:0];
          end
          else begin
            $display("error\n");
            $finish;
          end
        end
        @(posedge clk)
        begin
          delta_E[15:0] <= temp3[15:0];
          delta_E_ready <= 1'b1;
        end
        @(posedge clk)
        begin
          $display("delta_E= %h %d delta_E_1 = %h %d \n",
                   delta_E, delta_E, delta_E_1, delta_E_1);
          if(error_curr[19:0] < saved_error)
            saved_error[19:0] <= error_curr[19:0];
          $display("saved_error=%d %h\n", saved_error, saved_error);
        end
        @(posedge clk) error_curr_n1[19:0] <= error_curr[19:0];
        @(posedge clk) delta_E_ready <= 1'b0;
        @(posedge clk) iter_curr <= iter_curr + 1;
        @(posedge clk)
        // if((iter_curr%10)==0)
        $display($time,, "iteration = %d, error_current = %d %h saved_error=%d %h \n",
                 iter_curr, error_curr, error_curr, saved_error, saved_error);
      end //next iteration

      /*
      //begin testing now
      //reading test vectors; for now, use the same as the training vectors
      @(posedge clk) k <= 4'h0;
      for(numrec=0; numrec<=maxrec-1; numrec=numrec+1) begin
        @(posedge clk) traindat_mtx_1 <= traindat_mtx[k];
        @(posedge clk)
        begin
          $display("traindat_mtx_1= %h ; traindat_mtx[%d] = %h",
                   traindat_mtx_1, k, traindat_mtx[k]);
          k <= k+1;
        end
        @(posedge clk) traindat_mtx_2 <= traindat_mtx[k];
        @(posedge clk)
        begin
          $display("traindat_mtx_2=%h ; traindat_mtx[%d] = %h \n",
                   traindat_mtx_2, k, traindat_mtx[k]);
          k <= k+1;
        end
        @(posedge clk)
        begin
          finished_train <= 1'b1;
          enable_test <= 1'b1;
        end
        wait(done)
        @(posedge clk) temp <= out_out;
        @(posedge clk)
        begin
          $display("temp = %h \n", temp);
          save_results_mtx[numrec] <= temp;
        end
        @(posedge clk)
        begin
          enable_test <= 1'b0;
          finished_train <= 1'b0;
        end
        @(posedge clk)
        $display($time,, "result[%d] = %h\n", numrec, save_results_mtx[numrec]);
      end
      */

      $finish;
    end
  end

  always @(reset)
    if(!reset) begin
      disable control_unit;
      traindat_mtx[0] = 8'hff; traindat_mtx[1] = 8'h90;
      traindat_mtx[2] = 8'h20; traindat_mtx[3] = 8'hc0;
      traindat_mtx[4] = 8'h40; traindat_mtx[5] = 8'h60;
      traindat_mtx[6] = 8'h10; traindat_mtx[7] = 8'h20;
      targetdat_mtx[0] = 8'h81; targetdat_mtx[1] = 8'h81;
      targetdat_mtx[2] = 8'h7f; targetdat_mtx[3] = 8'h7f;
      assign delta_E = 16'h10;
      assign temp = 0;
      assign temp_out = 0;
      assign temp3 = 0;
      assign temp1 = 0;
      assign traindat_mtx_1 = 0;
      assign traindat_mtx_2 = 0;
      assign iter_curr = 16'h1;
      assign sumerror = 0;
      assign error_curr = 20'h70ff0;
      assign error_curr_n1 = 20'h70ffe;
      assign delta_E_1 = 0;
      assign temp2 = 0;
      assign tempo = 0;
      assign enable_train = 1'b0;
      assign enable_test = 1'b0;
      assign finished_train = 1'b0;
      assign saved_error = 20'h3ffff;
    end
    else begin
      deassign delta_E;
      deassign temp;
      deassign temp_out;
      deassign temp3;
      deassign temp1;
      deassign traindat_mtx_1;
      deassign traindat_mtx_2;
      deassign iter_curr;
      deassign sumerror;
      deassign error_curr;
      deassign error_curr_n1;
      deassign delta_E_1;
      deassign temp2;
      deassign tempo;
      deassign enable_train;
      deassign enable_test;
      deassign finished_train;
      deassign saved_error;
    end

  function [18:0] mult;
    input [7:0] x, w;
    reg [7:0]  X, W;
    reg [18:0] temporary;
    reg sign, signx, signw;
    begin
      X[7:0] = x[7:0];
      W[7:0] = w[7:0];
      // take magnitudes of the two's-complement operands; remember the signs
      if(W[7]==1) begin
        W[7:0] = ~W[7:0] + 1'b1;
        signw = 1'b1;
      end
      else begin
        W[7:0] = W[7:0];
        signw = 1'b0;
      end
      if(X[7]==1) begin
        X[7:0] = ~X[7:0] + 1'b1;
        signx = 1'b1;
      end
      else begin
        signx = 1'b0;
        X[7:0] = X[7:0];
      end
      temporary[18:0] = X[7:0] * W[7:0];
      sign = signw ^ signx;
      // reapply the sign to the magnitude product
      if(sign==1) temporary[18:0] = ~temporary[18:0] + 1'b1;
      else        temporary[18:0] = temporary[18:0];
      mult[18:0] = temporary[18:0];
    end
  endfunction
endmodule
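For reference, the serial datapath in neuro2 and the mult function both compute an ordinary signed product through sign-magnitude arithmetic. A compact combinational equivalent, useful as a simulation-only checking model, is sketched below; the module name mult_ref and its internal wire names are assumptions, not part of the design.

module mult_ref(x, w, p);
  input  [7:0]  x, w;
  output [15:0] p;
  wire        sx = x[7], sw = w[7];        // operand sign bits
  wire [7:0]  mx = sx ? (~x + 1'b1) : x;   // operand magnitudes
  wire [7:0]  mw = sw ? (~w + 1'b1) : w;
  wire [15:0] m  = mx * mw;                // unsigned magnitude product
  assign p = (sx ^ sw) ? (~m + 1'b1) : m;  // reapply product sign
endmodule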

module m(clk);
  output clk;
  reg clk;
  initial begin
    clk = 1'b0;
    #5 clk = 1'b1;
  end
  always #25 clk = ~clk;  // 50-time-unit period after the initial edge
endmodule

// Verilog HDL for "Class", "noise_serial" "_functional"

module noise_serial(clk, noise_serial);
  input  clk;
  output noise_serial;
  reg noise_serial;
  reg [15:0] noise;
  reg [4:0]  i;
  always begin
    noise = $random;                // draw a new 16-bit pseudo-random word
    for(i=16; i; i=i-1) begin
      @(posedge clk)
      noise_serial <= noise[i-1];   // shift the word out, MSB first
    end
  end
endmodule
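Note that $random is a simulation-only system function and cannot be synthesized; on silicon a hardware pseudo-random source is needed, such as the CALMOS random bit generator cited in the bibliography or a linear feedback shift register. A minimal LFSR sketch follows; the module name, tap positions and seed are illustrative assumptions, not part of the design.

module noise_lfsr(clk, reset, noise_serial);
  input  clk, reset;
  output noise_serial;
  reg [15:0] lfsr;
  // maximal-length 16-bit polynomial x^16 + x^14 + x^13 + x^11 + 1
  wire fb = lfsr[15] ^ lfsr[13] ^ lfsr[12] ^ lfsr[10];

  assign noise_serial = lfsr[15];   // one pseudo-random bit per clock

  always @(posedge clk)
    if(!reset) lfsr <= 16'hace1;    // any nonzero seed works
    else       lfsr <= {lfsr[14:0], fb};
endmodule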

module power_on_reset(reset);
  output reset;
  reg reset;
  initial begin
    reset = 1'b1;
    #5 reset = 1'b0;       // hold reset low (active) during initialization
    #1000 reset = ~reset;  // release
  end
endmodule
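A small top-level fixture shows how these utility modules are wired together for simulation. This is a sketch only; the module name top and the instance names are illustrative assumptions.

module top;
  wire clk, reset, noise;

  m              u_clk(clk);           // free-running clock source
  power_on_reset u_rst(reset);         // power-on reset pulse
  noise_serial   u_noise(clk, noise);  // serial pseudo-random stream

  initial begin
    $monitor($time, " reset=%b noise=%b", reset, noise);
    #2000 $finish;
  end
endmodule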


Chapter 8 APPENDIX B
B.1 Design, synthesis and analysis tools
Design Framework II(1) is a design environment for hardware development. It provides the designer with database management, a common look and feel across the various tools, and a way to interact with these tools. These tools allow the user to compile hardware descriptions in Verilog HDL and to simulate, synthesize, place and route system designs. The environment facilitates the use of modular, hierarchical structures, making it ideal for top-down design of digital systems. At the time of this writing, Design Framework II runs on Sun SPARCstations under version 3 of the OpenWindows environment. Figure B.1 shows the data flow in the Cadence Design Framework II environment.




Design Flow Within the Design Framework II Environment

Figure B.1: Data Flow in Cadence Framework II Environment


(from Cadence, 1993)

(1) A product of Cadence Design Systems, Inc.

Design Framework II is built on the concept of component libraries. To create a design, the user can access standard libraries of components that are included within the system software, or create custom component libraries. A custom component may represent an entire system or any part of it, at any level of complexity. Furthermore, any custom component may be made up of any number of other custom components or parts from the standard libraries, supporting a hierarchical, top-down design methodology. A system with many levels of hierarchy can be simulated and synthesized as a whole or in parts. Each custom component may have different cellviews: behavioral or functional, which is a text file description of the component in Verilog HDL; schematic, which is a diagram of the parts that make up the custom component, automatically generated from a behavioral or structural description by the synthesis tools, or manually entered by the designer; symbol, which is a simple black-box representation of the component with inputs and outputs, which can be automatically generated or entered manually; and layout, which is the netlist describing the placed and routed design. When the user edits one of the views, Design Framework II automatically compares it with the other existing cellviews for input/output pin consistency, external bus width matching, name consistency, etc.

Design Framework II is centered around a main window, the Command Input Window (CIW). From this window, the user may access the tools available within the design environment and open new or existing designs. One of the environment's basic tools is the Library Browser. This tool enables the user to manage component libraries and open component files for editing. When the user opens a behavioral cellview, Design Framework II invokes a text editor for making revisions or a file viewer for reading, printing, simulation or synthesis. The environment includes tools for editing and plotting schematics and symbols. The Verilog-XL Integration tool runs simulations on designs. Simulation results can then be viewed as waveforms using the cWaves tool. The HDL Synthesizer and Optimizer takes the behavioral description of a component and synthesizes a corresponding schematic, optimized for minimum area or other parameters set by the designer. The placement and routing tools used are outside this environment, so the synthesized views must be exported and the files edited for use by Cascade's Epoch Place & Route environment. The back-annotation of the design is once again done under the Cadence Design Framework II environment, so the placed and routed design has to be exported and edited for format. (Freytag, 1995)



B.2 Tutorials

B.2.1 Introduction to software tools and environment (CDA, 1995)


Create a directory example for your work:
Click the right mouse button in a blank area of the screen: Programs => Command Tool. In the shell window that appears, type mkdir example and press Return. Change to the new directory: cd example.

Use textedit to create the files and2str.v and testand2.v:
Click the right mouse button in a blank area of the screen: Programs => Text Editor. Type the following program into the text editor window:

//structural
module AND2 (in1, in2, out);
  input in1;
  input in2;
  output out;
  wire in1, in2, out;
  and u1 (out, in1, in2);
endmodule

Click the right mouse button on the menu bar of the editor: File => Store As New File. Add /example to the pathname in the Save window. Tab to the filename line, type and2str.v and press Return. Click the right mouse button on the menu bar of the editor: File => Empty Document. Type the following program into the text editor window (from Sternheim, 1993):

module test_and2;
  reg j1, j2;
  wire o;
  AND2 u2 (j1, j2, o);
  initial begin
    j1 = 0; j2 = 0;
    #1 $display("j1= %b, j2= %b, o= %b\n", j1, j2, o);
    j1 = 0; j2 = 1;
    #1 $display("j1= %b, j2= %b, o= %b\n", j1, j2, o);
    j1 = 1; j2 = 0;
    #1 $display("j1= %b, j2= %b, o= %b\n", j1, j2, o);
    j1 = 1; j2 = 1;
    #1 $display("j1= %b, j2= %b, o= %b\n", j1, j2, o);
  end
endmodule

Save the program the way you saved the first one. Leave the editor open in case you need it.

Run Verilog from the command line:
In the shell window, type verilog -c and2str.v testand2.v and press Return. Verilog will compile the programs and report errors; use textedit to fix problems. To compile and run, type verilog and2str.v testand2.v and press Return. Verilog compiles the programs and shows the output in text form.

Run Cadence and create your design library for this tutorial:
In a shell window, type icds & and press Return. Wait for the icds window. In the shell window, type pwd to find the path to your example directory. Click the left mouse button on the icds window menu bar: Open => Library. Library name: trial. Library path: (path to your example directory).


Click OK. After creating the library: Design Manager => Set Search Path on the icds menu bar. Add the path to your example directory to the search path. On the icds menu bar: Design Manager => Library Browser. The trial library should appear in the Library Browser window.

Import your programs into Cadence:
On the icds menu bar: Translators => Verilog In. Target Library: trial. Design Files: (paths/names of your Verilog files). Verilog Cell Modules: click on the diamond next to Import. Verilog Structural Modules: click on the diamond next to functional. On the next line, erase functional and type behavioral. Click OK. When the import is done, you should find the files under trial in the Library Browser. Click on them one at a time. You will see two views for each: behavioral and symbol. Click on any of these views and you'll see a version number. These versions get updated as a designer revises a design. Use the middle mouse button to open a pull-down menu on the behavioral view of testand2. Release the middle button on read. A new window opens up to show the testand2.v file you imported. Click on Design => Check and Save in this read window. This binds the two modules, in case there were any discrepancies or time-stamp mismatches. The icds window will display the message: check and save completed for cell view testand2 behavioral.

Verilog-XL control window invocation:
Open the READ window for the test fixture module, testand2.v, by pointing at its behavioral view in the Library Browser and clicking the middle mouse button to select read. Click on Tools => Verilog-XL. A Setup Environment window opens up. Click on OK. In Verilog-XL Integration: Setup => Netlist. Verify that the first entry under Netlist these views is behavioral. Click on More. Deselect the Use Test Fixture option by clicking on the right-hand side button. Change the Design Instance Path to test (erase the extension .top). Click OK. In Verilog-XL Integration: Setup => Record Signals. Press the left button on Top Level Primary I/O, drag to All Signals and release. Click OK.


In Verilog-XL Integration, click on Start Interactive. This will netlist the modules, i.e., extract connectivity information on which module is connected to which module by which wire. The icds window, upon successful completion, will display:

CELL NAME   VIEW NAME    NOTE
AND2        behavioral   "stopping view"
testand2    behavioral

The Verilog-XL Integration window will display:

Highest level modules:
testand2
Type ? for help
C1 >

View the output in cWaves:
In the Verilog-XL window, click on Debug -> Utilities -> View Waveforms. The cWaves window appears. In the cWaves window, click Edit -> Browser/Display Tool. Use the Browser/Display Tool to locate the signals to be displayed (pressing Top and/or Down after highlighting a module instance in the Subscopes list). When a signal appears in the Signals list, click on the signal name, then click on the Add To button. The signal should appear in the Strips list and in the cWaves window. Click Cancel in the Browser/Display Tool when all the desired signals have been selected. Click on Continue to continue the simulation.

To run the HDL Synthesizer on a design:

In the Library Browser, open the behavioral cellview of the component or design to be synthesized using the Read command on the popup menu. A large window appears, listing the behavioral description. In this window, click on Tools -> Design Synthesis -> HDL Synthesizer. The HDL Synthesizer may take a few minutes to initialize. When the Initialize Run Directory window appears, you can use the default run directory name or change it if desired. Click OK. In the HDL Synthesizer window, click on Session -> Options -> Synthesis Library. In the HDL Synthesizer Options window, select the Source Type to be None. Select the target library to be epoch_std_library. Click OK. Click on Constraints -> FSM. Type the name of the reset signal if it exists, and indicate whether it is asynchronous or synchronous. In the HDL Synthesizer window, click on Run -> Synthesizer. Click on generate schematic (Composer). Do a synthesis check only first. After it passes the synthesis check, click on Run -> Synthesizer -> Normal. To monitor the progress of the synthesis, click on Show -> Log -> Synthesizer Log and/or Session -> Job Monitor. After the synthesis completes, choose to install the schematic. Indicate the design library you want. The schematic is temporarily saved in the Opt library; you must install it in your design library. Look in the run directory to find the reports about area, gate count and timing.




To place and route a design:
Create a directory called cascade at the top of your home directory. Change directory to cascade. Invoke Cascade's Epoch (i.e., type epoch &). When the Cascade window appears, click on PROJECT -> PROJECT -> NEW to create a new project for your design. In the New Project requester, give the entire path to your cascade run directory and append the name of your design. Give a unique project ID: one lower-case letter followed by a number. Choose a suitable ruleset for layout generation: in the New Project requester, pull down PROJECT -> RULESET and choose one (for example CDA.5u3mlp3V, which means 3 V logic with a feature size of 1/2 micron, 3-micron-wide metal and 1-micron-wide poly). Check your project directory and make sure the files chip_config, chip_data, chip_logs, layout_parts, and lib_parts were created.

ADAPTING THE CADENCE NETLIST FOR CASCADE:
a) Back in the Unix shell, run a Cascade script called synfilter on the netlist file in your synthesis directory:
synfilter <path to your hierarchical netlist file> <path to your output file>
e.g.: synfilter ~/synthesis/weight3.run1/syn.v ~/cascade/newname.v
synfilter removes some illegal constructs from the code, such as underscores, that would cause compilation errors in Cascade.
b) When synfilter is finished, enter your cascade directory, open this file in a text editor and comment out all `timescale directives using a double slash. For vi use:
:1,$ s/`timescale/\/\/`timescale/g
Save the file and quit the editor.

IMPORTING THE VERILOG NETLIST FOR COMPILATION:
a) Back in the Epoch window, open the INPUT -> VERILOG COMPILE menu item.
a.1) Under LIBRARY DIRECTORY, hit the DIRECTORY button and choose the path to where newname.v is (it should be <your home directory>/cascade/). Click on that item and make sure the library directory that appears in the main form is the path to the cascade directory in your home account, where newname.v is.
a.2) Make sure the VERILOG FILE EXTENSION is v.
a.3) Choose newname.v from the EXISTING VERILOG FILES table to the right of the form. This puts the path to newname.v in the DESIGN ROOT FILE field, and newname in the DESIGN ROOT MODULE field.
a.4) Change the DESIGN ROOT MODULE to your_module_name_imported_from_cadence.
a.5) Hit RUN and close the form. Cascade will create a directory inside your project directory named your_module_name.
a.6) Look inside this new directory to make sure there are files with the extensions .net (the netlist file), .def (the definition file), .vcmd and .ref.

NETLIST INPUT:
a) Choose INPUT -> NETLIST INPUT to import the netlist.
b) When the form opens, choose your_module_name and hit RUN.


GENERATING THE LAYOUT:
a) Generate the layout by choosing AUTOMATIC COMPILE from the PHYSICAL DESIGN menu.
b) Choose your_module_name in the form and make sure PLACEMENT, ROUTING, and BUFFER SIZING/POWER are chosen. Don't choose NETLIST INPUT.

VIEWING THE LAYOUT:
a) Choose PHYSICAL DESIGN -> MANUAL COMPILE to open the Floorplan tool.
b) Open the layout by choosing PROJECT MANAGER -> PARTS -> EDIT.
c) Choose VIEW -> GEO VIEWING -> EXPAND ALL to see the contents of each cell.

HIERARCHICAL BACK-ANNOTATION:
a) EXPORTING THE VERILOG FILE:
Open Epoch on your project. Open the SIMULATION -> PARAMETERS menu option. In the form set:
RC DELAY ANALYSIS THRESHOLD = 0
RC RESISTANCE SEGMENT LIMIT = 100 ohms
Select the ruleset. In the SIMULATION menu, select SIMULATION OUTPUT -> VERILOG and make sure the RC INTERCONNECT DELAY option is chosen (two directories are created: .sdf and .v). Quit Cascade.
b) REFORMATTING THE VERILOG FILE:
In a Unix shell, run the script called fixepoch on the Verilog output file name.v:
fixepoch name.v
c) IMPORTING THE BACK-ANNOTATED VERILOG FILE:
Re-enter Cadence and create a new library (set the paths). In the browser, use the middle button to open the menu on the new library. Choose CREATE VIEW and call this new view epochNetlist, with a view type of netlist.v. Use the middle button to open the pulldown menu on the new library. Choose CREATE CELL and call the cell your_module_name. Create a cellview in your_module_name with a view name of epochNetlist. This is your template Verilog file. Edit the epochNetlist cellview in your_module_name and remove all the lines in the file except the comment header. Include the file name.vf (located in your directory). Save the file and quit the editor. You can now run Verilog-XL to verify the design.
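After the edit, the template cellview reduces to a single compiler directive. A sketch, assuming the back-annotated netlist file is name.vf as in the step above:

// epochNetlist cellview body after editing (comment header retained above)
`include "name.vf"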




Chapter 9
BIBLIOGRAPHY
Aylor, J.H., et al., "A Semicustom VLSI System Design Course Supported by the General Electric Microelectronics Center," IEEE Transactions on Education, Vol. E-29, No. 2, May 1986, pp. 85-89.
Berkes, O., and Williams, R., "Magic as a PC Layout Tool for Small Budget VLSI Circuit Design," IEEE Transactions on Education, Vol. 34, No. 1, Feb. 1991, pp. 52-55.
Cadence IC Design Composition and Analysis & Conversion, Video Manual, Version 4.2.2, 1993.
CDA 6214, class notes for course CDA 6214, Structured VLSI Design, F.A.U., Fall 1995.
Chen, T., "From System Design to IC Design in 14 Weeks - Teamwork Makes it Possible," IEEE Transactions on Education, Vol. 36, No. 1, February 1993, pp. 137-140.
Comer, D., "Application of Top-Down Principles to Digital System Design," IEEE Transactions on Education, Vol. E-26, No. 4, November 1983, pp. 170-172.
De Yong, M., Findley, R., and Fields, C., "The Design, Fabrication, and Test of a New VLSI Hybrid Analog-Digital Neural Processing Element," IEEE Trans. on Neural Networks, Vol. 3, No. 3, pp. 363-374, May 1992.
Fausett, L., Fundamentals of Neural Networks, Prentice Hall, New Jersey, 1994.
Franca, J., "Integrated Circuit Teaching Through Top-Down Design," IEEE Transactions on Education, Vol. 37, No. 4, November 1994, pp. 351-357.
Freytag, G., MSEE Thesis, FAU, Dec. 1995.
Fukushima, K., "A hierarchical neural network model for associative memory," Biological Cybernetics, 50, 1984, pp. 105-113.
Gander, R., et al., "An Electrical Engineering Design Course Sequence Using Top-Down Design Methodology," IEEE Transactions on Education, Vol. 37, No. 1, Feb. 1994, pp. 30-35.
Grossberg, S., "Competitive Learning: From interactive activation to adaptive resonance," Cognitive Science, 11, 1987, pp. 23-63.
Grossberg, S., editor, The Adaptive Brain, Vols. I and II, North Holland, Amsterdam, 1987.
Hamilton, A., et al., "Integrated Pulse Stream Neural Networks: Results, Issues and Pointers," IEEE Trans. on Neural Networks, Vol. 3, No. 3, pp. 385-393, May 1992.
Harrer, H., Nossek, J., and Stelzl, R., "An Analog Implementation of Discrete-Time Cellular Neural Networks," IEEE Trans. on Neural Networks, Vol. 3, No. 3, pp. 466-476, 1992.
Harth, E., "Visual Perception: a dynamic theory," Biological Cybernetics, Vol. 22, pp. 169-180, 1976.
Harth, E., Unnikrishnan, K. P., and Pandya, A., "The inversion of sensory processing by feedback pathways: A model of visual cognitive functions," Science, Vol. 237, pp. 187-189, 1987.
Harth, E., and Pandya, A., "Dynamics of the Alopex process: Applications to the optimization problem," Biomathematics and Related Computational Problems, L.M. Ricciardi (Ed.), Reidel Publ., Amsterdam, pp. 459-471, 1988.
He, Y., and Cilingiroglu, U., "A Charge-Based On-Chip Adaptation Kohonen Neural Network," IEEE Trans. on Neural Networks, Vol. 4, No. 3, pp. 462-469, May 1993.
Herman et al., "Optimization for pattern classification using biased random search techniques," Annals of Operations Research, 1990.
Hinton, G., "Connectionist Learning Procedures," Artificial Intelligence, pp. 184-234, 1989.
Hopfield, J. J., and Tank, D. W., "Neural computation of decisions in optimization problems," Biological Cybernetics, 52, 1985, pp. 141-152.
Khanna, T., Foundations of Neural Networks, Addison-Wesley, Massachusetts, 1990.
Kohonen, T., Self-Organization and Associative Memory, 2nd ed., Springer-Verlag, New York, 1988.
Kondo, Y., and Sawada, Y., "Functional Abilities of a Stochastic Logic Neural Network," IEEE Trans. on Neural Networks, Vol. 3, No. 3, pp. 434-443, May 1992.
Kosko, B., Neural Networks and Fuzzy Systems, Prentice Hall, New Jersey, 1992, pp. 197-210.
Lehman, C., Viredaz, M., and Blayo, F., "A Generic Systolic Array Building Block For Neural Networks with On-Chip Learning," IEEE Trans. on Neural Networks, Vol. 4, No. 3, pp. 400-407, 1993.
Linares-Barranco, B., Sanchez-Sinencio, E., Rodriguez-Vazquez, A., and Huertas, J., "A CMOS Analog Adaptive BAM with On-Chip Learning and Weight Refreshing," IEEE Trans. on Neural Networks, Vol. 4, No. 3, pp. 445-455, 1993.
Macq, D., Verleysen, M., Jespers, P., and Legat, J., "Analog Implementation of a Kohonen Map with On-Chip Learning," IEEE Trans. on Neural Networks, Vol. 4, No. 3, pp. 456-461, 1993.
Mauduit, N., Duranton, J., Gobert, J., and Sirat, J., "Lneuro 1.0: A Piece of Hardware LEGO for Building Neural Network Systems," IEEE Trans. on Neural Networks, Vol. 3, No. 3, pp. 414-422, May 1992.
Melton, M., Phan, T., Reeves, D., and Van den Bout, D., "The TInMANN VLSI Chip," IEEE Trans. on Neural Networks, Vol. 3, No. 3, pp. 375-384, May 1992.
Minsky, M., and Papert, S., Perceptrons, MIT Press, Cambridge, MA, 1969.
Moon, G., et al., "VLSI Implementation of Synaptic Weighting and Summing in Pulse Coded Neural-Type Cells," IEEE Trans. on Neural Networks, Vol. 3, No. 3, pp. 394-403, May 1992.
Motorola DSP56001 User's Manual.


Mumford, M., Andes, D., and Kern, L., "The Mod 2 Neurocomputer System Design," IEEE Trans. on Neural Networks, Vol. 3, No. 3, pp. 423-433, May 1992.
Newbridge Microsystems, CALMOS RBG1210 Random Bit Generator data sheets, pp. 4.75-4.78, 1992.
Pandya, A., Shankar, R., and Freytag, L., "SIMD architecture for Alopex neural network," in Parallel Architectures for Image Processing, J. Ghosh (Ed.), Proceedings of SPIE 1246, pp. 275-287, 1990.
Pandya, A., and Venugopal, K., "A Stochastic Parallel Algorithm for Supervised Learning in Neural Networks," IEICE Trans. Inf. & Systems, Vol. E77-D, No. 4, April 1994.
Pandya, A., and Macy, R., Neural Networks for Pattern Recognition in C++, CRC Press, Boca Raton; copublished by IEEE Press, Piscataway, NJ, 1995.
Quick Reference for Verilog HDL, Automata Publishing Company, San Jose, California.
Ramacher, U., and Ruckert, U., VLSI Design of Neural Networks, Kluwer Academic Publishers, Boston, MA, 1991.
Reid, R., "Computer-Aided Engineering for Computer Architecture Laboratories," IEEE Transactions on Education, Vol. 34, No. 1, February 1991, pp. 56-61.
Rosenblatt, F., Principles of Neurodynamics, Spartan Books, New York, 1962.
Rosenfeld, A., "Picture processing: 1986," Computer Vision, Graphics and Image Processing, Vol. 38, pp. 147-225, 1987.
Rucinski, A., "A Course in VLSI Semicustom Design in a Small School Environment," IEEE Transactions on Education, Vol. 31, No. 2, May 1988, pp. 93-99.
Ruiz, L., and Pandya, A., "VLSI implementable parallel stochastic learning algorithm," Proceedings of the Conference on Neural, Morphological and Stochastic Methods in Image and Signal Processing, San Diego, 1995.
Sackinger, E., Boser, B., Bromley, J., LeCun, Y., and Jackel, L., "Application of the ANNA Neural Network Chip to High-Speed Character Recognition," IEEE Trans. on Neural Networks, Vol. 3, No. 3, pp. 498-505, 1992.
Sait, S., "Integrating UAHPL-DA Systems with VLSI Design Tools to Support VLSI DA Courses," IEEE Transactions on Education, Vol. 35, No. 4, Nov. 1992, pp. 321-330.
Sanchez-Sinencio, E., and Newcomb, R., "Guest Editorial: Neural Network Circuit Implementations," IEEE Trans. on Neural Networks, Vol. 4, No. 3, p. 385, 1993.
Sandige, R., "Top-Down Design Process for Gate-Level Combinational Logic Design," IEEE Transactions on Education, Vol. 35, No. 3, August 1992, pp. 247-252.
Soma, M., "A PC-Based VLSI Design and Test System for Education," IEEE Transactions on Education, Vol. 31, No. 1, Feb. 1988, pp. 26-30.
Sternheim, E., Singh, R., Madhavan, R., and Trivedi, Y., Digital Design and Synthesis with Verilog HDL, Automata, San Jose, CA, 1993.



Thomas, D., and Moorby, P., The Verilog Hardware Description Language, Kluwer Academic Publishers, Norwell, MA, 1991.
Thrun, S. B., Bala, J., et al., "The MONK's Problems: A performance comparison of different learning algorithms," Carnegie Mellon University, CMU-CS-91-197, 1991.
Unnikrishnan, K. P., and Venugopal, K. P., "Alopex: A Correlation-Based Learning Algorithm for Feed-Forward and Recurrent Neural Networks," Neural Computation, Vol. 6, 1994.
Werbos, P. J., Beyond regression: New tools for prediction and analysis in the behavioral sciences, Ph.D. thesis, Harvard University, 1974.
Williams, R., "An Undergraduate VLSI CMOS Circuit Design Laboratory," IEEE Transactions on Education, Vol. 34, No. 1, Feb. 1991, pp. 47-51.
Wirth, N., "On the composition of well-structured programs," Computing Surveys, Vol. 6, pp. 247-259, Dec. 1974.
Wolf, W., "Synthesis Tools Help Teach Systems Concepts in VLSI Design," IEEE Transactions on Education, Vol. 35, No. 1, February 1992, pp. 11-16.

Curriculum Vita
LAURA V. RUIZ

EDUCATION:
1996 Ph.D. in Computer Engineering, Florida Atlantic University, Boca Raton, Florida. Dissertation research: A VLSI Implementable Learning Algorithm.
1985-1986 Master of Science in Electrical Engineering, GPA 3.6/4.0, Florida International University, Miami, Florida. Master's thesis: Hardware & software design of a cost-effective microprocessor-based (MC68000) data acquisition and digital signal processing system.
1982-1985 Bachelor of Science in Electrical Engineering, GPA 3.8/4.0, High Honors, Florida International University, Miami, Florida.

WORK EXPERIENCE:

1994-present Florida Engineering Education Delivery System (FEEDS) Director, College of Engineering and Design, Florida International University, Miami, Florida. Direct the College of Engineering and Design distance learning program by coordinating various tasks such as promoting distance education among private corporations and public and private educational institutions, contract negotiations, scheduling of TV studios, supervision of two full-time and eight part-time employees, overseeing the operation of two TV studios, recommending purchasing of equipment, and managing several projects related to improving the quality of courses delivered, including teaching and presentation skills and video production aspects of distance education.

1987-1994 Instructor, Laboratory Coordinator, Student Counselor, Department of Electrical & Computer Engineering, Florida International University, Miami, Florida. Taught undergraduate courses and laboratories; involved in curriculum development, program feasibility, planning and implementation proposals; ABET accreditation reviews; committee participation (Status of Women, Academic Conduct Review Board, Search and Screen for the Electrical, Industrial and Civil Engineering Departments, Teaching Incentive Program); involved in public service activities with the FIU Student Chapter of the Society of Women Engineers (and responsible for its initiation in 1988); career counseling and recruitment at community colleges, middle and high schools; science fair judge. Courses taught: Microcomputers I & II (hardware and software design), Computer Design, Logic Design, Electronics I, Circuits. Received excellent student evaluations.

1988-present Faculty Advisor, Society of Women Engineers, Florida International University, Miami, Florida.

1992 Consultant, Freshman Programs Division of the American Society for Engineering Education.

1992 Program Coordinator, MU-SPIN @ FIU. Minority University-Space Interdisciplinary Network, Goddard Space Flight Center, Greenbelt, Maryland.

1991 Editor and co-author of the Solutions Manual, 16/32 bit Microprocessors 68000/68010/68020, by Subbarao Wunnava.

Adjunct Instructor, Practical Applications of the MC68000 Microprocessor Seminar, FIU/Florida Power & Light, Miami, Florida.

1986-1987 Research Assistant, hardware & software design of a microprocessor-based system for a defibrillator, Cordis Corporation, Miami, Florida.

1985-1987 Teaching Assistant, taught several undergraduate courses and laboratories for the Electrical Engineering Department at FIU.


RELEVANT PUBLICATIONS: Teaching a Structured VLSI Design Course (Ruiz/Pandya/Shankar), 1996; and Application of Top-Down Design Principles to Neural Network Hardware Implementations (Ruiz/Pandya/Shankar) (to be submitted for publication).
1) A Stochastic Parallel Optimization Algorithm, IS&T/SPIE's Symposium on Electronic Imaging: Science & Technology, January 1996. (Sen/Pandya/Ruiz)
2) VLSI Implementable Parallel Stochastic Learning Algorithm, Proceedings of the Neural, Morphological and Stochastic Methods in Image & Signal Processing Conference, San Diego, California, July 1995. (Ruiz/Pandya)
3) Implementing a Teaching Incentive Program, The NEA Higher Education Journal, 1995. (Baker/Ruiz/Munroe/Tsihrintzis)
4) Fault Tolerant Software Algorithms, SOUTHCON, Orlando, Florida, March 1990. (Ruiz)
5) The Evolution of Advising Electrical Engineering Students at Florida International University, Southeastern Advising Association, NACADA, Sixth Annual Conference, Miami, Florida, March 1990. (Ruiz)
6) Microprocessor-Based System Design Courses for Graduate and Undergraduate College Education, 11th World Computer Congress, San Francisco, California, August 1989. (Salinger/Subbarao/Salinas/Ruiz)
7) Non-Invasive Sensors for the Development of an Expert System for Measurement of Fatigue in an Automated Environment, International Symposium on Automotive Technology and Automation (ISATA), May-June 1988. (Subbarao/Ruiz)
8) Intelligent I/O Interface Schemes for the MC68000 Family of Microprocessors, SOUTHCON 88, Orlando, Florida, March 1988. (Ruiz/Subbarao)
9) Ultra Low Power Activation of Microprocessors for Bio Instrumentation, Proceedings of the IEEE 1987 Engineering in Medicine and Biology Society, Boston, November 1987. (Subbarao/Ruiz/Yomtov)
10) Subroutines and Macros in Microprocessor Program Development, Proceedings of the 1987 Miami Technicon, 1987. (Ruiz/Salinger/Subbarao)
11) Microcomputer Design Laboratories in College Education, Proceedings of the 1987 Miami Technicon Conference, 1987. (Ruiz/Salinger)


12) Pseudo Expert System for Signal Recognition, MSEE Thesis, December 1986.

HONORS & AWARDS:
Treasurer, Society of Women Engineers, Miami Section, 1995/96.
President, Society of Women Engineers, Miami Section, 1992/94, and responsible for its chartering in 1992.
Faculty Development Award, FIU, 1992/93.
Faculty Excellence in Advising Award, FIU, 1991/92.
Grant-In-Aid, FIU, 1988/89 and 1994/95.
Ralph Sanchez Endowed Graduate Fellowship, FIU, 1986.
Delta Mu Omega, Electrical Engineering Honor Society, FIU, 1985.

MEMBERSHIP IN PROFESSIONAL SOCIETIES:
Society of Women Engineers (SWE), since 1991.
Institute of Electrical & Electronics Engineers (IEEE) and IEEE Computer Society, 1988-1992.
National Academic Advising Association (NACADA), 1991-1992.
American Society for Engineering Education (ASEE), 1991-1992.
National Society of Professional Engineers, student chapter at MDCC, 1981.

