
CHAPTER 1

1. INTRODUCTION
An embedded system is a computer system designed to perform one or a few dedicated functions, often with real-time computing constraints. It is embedded as part of a complete device, often including hardware and mechanical parts. By contrast, a general-purpose computer, such as a personal computer (PC), is designed to be flexible and to meet a wide range of end-user needs. Embedded systems control many devices in common use today. The last few decades have seen the rise of computers to a position of prevalence in human affairs, with computing leaving its mark in every field, from personal and home affairs to business, industrial process automation, communications, entertainment, and defense. An embedded system is a combination of hardware, software, and perhaps other mechanical parts designed to perform a specific function.

A microwave oven is a good example of such a system. This is in direct contrast to a personal computer: although it also comprises hardware, software, and mechanical components, it is not designed for a specific purpose. A personal computer is general purpose and able to do many different things. An embedded system is generally a system within a larger system. Modern cars and trucks contain many embedded systems: one controls the antilock brakes, another monitors and controls the vehicle's emissions, and a third drives the dashboard display. Even the general-purpose personal computer itself is made up of numerous embedded systems; the keyboard, mouse, video card, modem, hard drive, floppy drive, and sound card are each embedded systems. Tracing back the history, the birth of the microprocessor in 1971 marked the start of the digital era. Early embedded applications included unmanned space probes, computerized traffic lights, and aircraft flight-control systems. In the 1980s, embedded systems brought microprocessors into every part of our personal and professional lives.

Presently, numerous gadgets are emerging to make our lives easier and more comfortable because of advances in embedded systems. Mobile phones, personal digital assistants, and digital cameras are only a small segment of this emerging field. One major subclass of embedded systems is real-time embedded systems. A real-time system is one that has timing constraints; its performance is specified in terms of the ability to make calculations or decisions in a timely manner. These important calculations have deadlines for completion, and a missed deadline is just as bad as a wrong answer. The damage caused by such a miss depends on the application.

1.1.1. EMBEDDED SYSTEM


All embedded systems contain a processor and software. The processor may be an 8051 microcontroller or a Pentium 4 processor with a clock speed of 2.4 GHz. Certainly, in order to have software there must be a place to store the executable code and temporary storage for run-time data manipulation; these take the form of ROM and RAM, respectively. If the memory requirement is small, it may be contained in the same chip as the processor; otherwise one or both types of memory will reside in external memory chips. All embedded systems also contain some type of inputs and outputs (Fig. 1). For example, in a microwave oven the inputs are the buttons on the front panel and a temperature probe, and the outputs are the human-readable display and the microwave radiation. Inputs to the system generally take the form of sensors and probes, communication signals, or control knobs and buttons. Outputs are generally displays, communication signals, or changes to the physical world.
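As a minimal illustration of this structure, the sketch below shows the classic embedded "super-loop": read the inputs, make a control decision, and drive the outputs. It is a hypothetical sketch only; the oven-flavoured routine names and threshold values are invented, and the hardware access is stubbed so the example can be compiled and run on a PC.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical input/output routines.  On real hardware these would touch
 * memory-mapped peripheral registers; here they are stubbed for a PC demo. */
static bool    read_start_button(void)      { return true; }
static bool    read_door_closed(void)       { return true; }
static int16_t read_temperature_probe(void) { return 250;  }  /* 25.0 C */
static void    set_magnetron(bool on)       { printf("magnetron %s\n", on ? "on" : "off"); }
static void    update_display(int16_t t, bool cooking)
{
    printf("display: %d.%d C, %s\n", t / 10, t % 10, cooking ? "cooking" : "idle");
}

int main(void)
{
    bool cooking = false;

    /* The embedded "super-loop": read inputs, decide, drive outputs.
     * Bounded to 3 iterations only so this PC demo terminates. */
    for (int i = 0; i < 3; ++i) {
        bool    start = read_start_button();
        bool    door  = read_door_closed();
        int16_t temp  = read_temperature_probe();

        if (start && door)        cooking = true;   /* start request           */
        if (!door || temp > 1200) cooking = false;  /* safety cut-off (120.0 C) */

        set_magnetron(cooking);
        update_display(temp, cooking);
    }
    return 0;
}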

1.1.2 CHARACTERISTICS OF EMBEDDED SYSTEM


1. Embedded systems are designed to do a specific task, rather than be a general-purpose computer for multiple tasks. Some also have real-time performance constraints that must be met, for reasons such as safety and usability; others may have low or no performance requirements, allowing the system hardware to be simplified to reduce costs.

2. Embedded systems are not always standalone devices. Many embedded systems consist of small, computerized parts within a larger device that serves a more general purpose. For example, the Gibson Robot Guitar features an embedded system for tuning the strings, but the overall purpose of the Robot Guitar is, of course, to play music. Similarly, an embedded system in an automobile provides a specific function as a subsystem of the car itself.

3. The program instructions written for embedded systems are referred to as firmware and are stored in read-only memory or flash memory chips. They run with limited computer hardware resources: little memory and a small or non-existent keyboard and/or screen.

1.1.3. ASIC AND FPGA SOLUTIONS


A common configuration for very-high-volume embedded systems is the system on a chip (SoC), which contains a complete system consisting of multiple processors, multipliers, caches, and interfaces on a single chip. SoCs can be implemented as an application-specific integrated circuit (ASIC) or using a field-programmable gate array (FPGA).

Peripherals

Embedded systems talk to the outside world via peripherals, such as:

- Serial communication interfaces (SCI): RS-232, RS-422, RS-485, etc.
- Synchronous serial communication interfaces: I2C, SPI, SSC, and ESSI (Enhanced Synchronous Serial Interface)
- Universal Serial Bus (USB)
- Multimedia cards (SD cards, CompactFlash, etc.)
- Networks: Ethernet, LonWorks, etc.
- Fieldbuses: CAN bus, LIN bus, PROFIBUS, etc.
- Timers: PLL(s), capture/compare, and time processing units
- Discrete I/O, also known as General Purpose Input/Output (GPIO)
- Analog-to-digital and digital-to-analog converters (ADC/DAC)
- Debugging interfaces: JTAG, ISP, ICSP, BDM port, BITP, and DB-9 ports
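Most of these peripherals are ultimately programmed through memory-mapped registers. The fragment below shows the usual C idiom for driving a discrete I/O (GPIO) port; the base address, register layout, and pin numbers are invented for illustration and differ for every microcontroller, so treat it as a sketch rather than a real device driver.

#include <stdint.h>

/* Hypothetical GPIO register block.  The base address 0x40020000 and the
 * register layout are illustrative only; the datasheet of the actual device
 * defines the real map. */
#define GPIOA_BASE  0x40020000u

typedef struct {
    volatile uint32_t DIR;   /* direction: 1 = output, 0 = input */
    volatile uint32_t OUT;   /* output data register             */
    volatile uint32_t IN;    /* input data register              */
} gpio_t;

#define GPIOA  ((gpio_t *)GPIOA_BASE)

void led_init(void)
{
    GPIOA->DIR |= (1u << 5);          /* make pin 5 an output */
}

void led_toggle(void)
{
    GPIOA->OUT ^= (1u << 5);          /* flip the output pin  */
}

int button_pressed(void)
{
    return (GPIOA->IN >> 3) & 1u;     /* read pin 3           */
}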

Tools

As with other software, embedded system designers use compilers, assemblers, and debuggers to develop embedded system software. However, they may also use some more specific tools:

- In-circuit debuggers or emulators (see the next section).
- Utilities to add a checksum or CRC to a program, so the embedded system can check whether the program image is valid (see the sketch after this list).
- For systems using digital signal processing, a math workbench such as Scilab/Scicos, MATLAB/Simulink, EICASLAB, MathCad, Mathematica, or FlowStone DSP to simulate the mathematics. Developers might also use libraries for both the host and the target, which eliminates developing DSP routines, as done in DSPnano RTOS and the Unison Operating System.
- Custom compilers and linkers, which may be used to improve optimization for the particular hardware.
- A special language or design tool of the embedded system's own, or enhancements added to an existing language such as Forth or Basic.
- Another alternative is to add a real-time operating system or embedded operating system, which may have DSP capabilities, like DSPnano RTOS.
- Modeling and code-generating tools, often based on state machines.
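The checksum/CRC idea can be sketched as follows: a build utility appends a CRC over the program image, and at start-up the firmware recomputes the CRC and compares it with the stored value. The bitwise CRC-32 below (reflected polynomial 0xEDB88320) is one common choice, and the assumed image layout (code followed by a 4-byte little-endian CRC) is an illustrative assumption, not a standard.

#include <stdint.h>
#include <stddef.h>

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320).  Slow but tiny, which is
 * often exactly what a boot-time integrity check wants. */
static uint32_t crc32(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int b = 0; b < 8; b++) {
            if (crc & 1u)
                crc = (crc >> 1) ^ 0xEDB88320u;
            else
                crc >>= 1;
        }
    }
    return crc ^ 0xFFFFFFFFu;
}

/* Assumed image layout: application code followed by a 4-byte little-endian
 * CRC appended by the build tool.  Returns 1 if stored and computed values
 * agree, 0 otherwise. */
int firmware_is_valid(const uint8_t *image, size_t total_len)
{
    if (total_len < 4)
        return 0;
    size_t   code_len = total_len - 4;
    uint32_t stored   = (uint32_t)image[code_len]
                      | ((uint32_t)image[code_len + 1] << 8)
                      | ((uint32_t)image[code_len + 2] << 16)
                      | ((uint32_t)image[code_len + 3] << 24);
    return crc32(image, code_len) == stored;
}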

1.1.4. TOOL ARCHITECTURE OVERVIEW


Figure 1-1 depicts the embedded software tool architecture. Multiple tools based on a common framework allow the user to design the complete embedded system. System design consists of the creation of the hardware and software components of the embedded processor system, and optionally, a verification or simulation component as well. The hardware component consists of an automatically generated hardware platform that can be optionally extended to include other hardware functionality specified by the user. The software component of the design consists of the software platform generated by the tools, along with the user designed application software. The verification component consists of automatically generated simulation models targeted to a specific simulator, based on the hardware and software components.

Tool Flows

A typical embedded system design project involves the following phases: hardware platform creation, hardware platform verification (simulation), software platform creation, software application creation, and software verification (debugging). Xilinx provides tools to assist in all of the above design phases. These tools work together with other, third-party tools, such as simulators and text editors, that may be used by the designers.

Xilinx Platform Studio

The Xilinx Platform Studio (XPS) tool provides a GUI for creating the MHS and MSS files for the hardware and software flows. XPS also provides source file editing capability and project and process management capability. XPS is used for managing the complete tool flow, that is, both the hardware and software implementation flows.

Platform Generator

The embedded processor system, in the form of hardware netlists (HDL and EDIF files), is customized and generated by the Platform Generator (PlatGen).

HDL Synthesis

PlatGen generates hierarchical NGC netlists in the default mode. This means that each instance of a peripheral in the MHS file is synthesized. The default mode leaves the top-level HDL file untouched, allowing any synthesis tool to be used for it; the Platform Generator itself supports only XST (Xilinx Synthesis Technology).

Simulation Model Generator

The Simulation Platform Generation tool (SimGen) generates and configures various simulation models for the hardware. It takes a Microprocessor Hardware Specification (MHS) file as input.

Library Generator

XPS calls the Library Generator tool to configure the software flow. The Library Generator (LibGen) tool configures libraries, device drivers, file systems, and interrupt handlers for the embedded processor system. The input to LibGen is an MSS file.

1.1.5. DEBUGGING
Embedded debugging may be performed at different levels, depending on the facilities available. From simplest to most sophisticated, they can be roughly grouped into the following areas:

- Interactive resident debugging, using the simple shell provided by the embedded operating system (e.g. Forth and Basic).
- External debugging, using logging or serial-port output to trace operation, with either a monitor in flash or a debug server like the Remedy Debugger, which also works for heterogeneous multi-core systems.
- An in-circuit debugger (ICD), a hardware device that connects to the microprocessor via a JTAG or Nexus interface. This allows the operation of the microprocessor to be controlled externally, but is typically restricted to the specific debugging capabilities in the processor.
- An in-circuit emulator (ICE), which replaces the microprocessor with a simulated equivalent, providing full control over all aspects of the microprocessor.
- A complete emulator, which provides a simulation of all aspects of the hardware, allowing all of it to be controlled and modified, and allowing debugging on a normal PC.

Unless restricted to external debugging, the programmer can typically load and run software through the tools, view the code running in the processor, and start or stop its operation. The view of the code may be high-level-language source code, assembly code, or a mixture of both.

Because an embedded system is often composed of a wide variety of elements, the debugging strategy may vary. For instance, debugging a software- (and microprocessor-) centric embedded system is different from debugging an embedded system where most of the processing is performed by peripherals (DSP, FPGA, co-processor). An increasing number of embedded systems today use more than one processor core. A common problem in multi-core development is the proper synchronization of software execution. In such a case, the embedded system designer may wish to check the data traffic on the buses between the processor cores, which requires very low-level debugging at the signal/bus level, with a logic analyzer, for instance.

Tracing

Real-time operating systems (RTOS) often support tracing of operating-system events. A graphical view is presented by a host PC tool, based on a recording of the system behavior. The trace recording can be performed in software, by the RTOS, or by special tracing hardware. RTOS tracing allows developers to understand timing and performance issues of the software system and gives a good understanding of the high-level system behavior. A good example is RTXCview for RTXC Quadros by Quadros Systems, Inc.

1.2. COARSE GRAINED RECONFIGURABLE ARCHITECTURES


While the first systems for reconfigurable computation featured fine-grained FPGAs, it was soon discovered that FPGAs have several disadvantages for computational tasks. First, due to the bit-level operations, operators for wide data paths have to be composed of several bit-level processing units. This typically entails a large routing overhead for the interconnect between these units and leads to low silicon-area efficiency of FPGA computing solutions. Also, the switched routing wires use more power than hardwired connections. A second drawback of the fine granularity is the high volume of configuration data needed for the large number of both processing units and routing switches. This implies the need for a large configuration memory, with corresponding power dissipation. The resulting long configuration time makes execution models that rely on a steady change of the configuration impossible.
8

As a third disadvantage, application development for FPGAs is very similar to VLSI design due to the programmability at the logic level. Mapping applications from common high-level languages is difficult compared to compilation onto a standard microprocessor, as the granularity of the target FPGA does not match that of the operations in the source code. The standard way of specifying an application is still a hardware description language, which requires a hardware expert. In the subsequent design process, the large number of processing units leads to a complex synthesis, which consumes much computation time. Coarse-grained reconfigurable architectures try to overcome the disadvantages of FPGA-based computing solutions by providing multiple-bit-wide data paths and complex operators instead of bit-level configurability. In contrast to FPGAs, the wide data path allows the efficient implementation of complex operators in silicon. Thus, the routing overhead generated by having to compose complex operators from bit-level processing units is avoided. Regarding the interconnect between processing elements, coarse-grained architectures also differ in several ways from FPGAs. The connections are multiple bits wide, which implies a higher area usage for a single line. On the other hand, the number of processing elements is typically several orders of magnitude lower than in an FPGA. Thus, far fewer lines are needed, resulting in a globally lower area usage for routing. The lower number and higher granularity of communication lines also allow communication resources that would be quite inefficient for fine-grained architectures; examples of such resources are time-multiplexed buses or global buses, which connect every processing element. In the past years, several approaches to coarse-grained reconfigurable architectures have been published. In this chapter, several example architectures are presented to give an overview of developments in the area of coarse-grained reconfigurable computing. A timeline of the presented architectures is given in Table 1.2, together with basic properties of each architecture. The properties considered are:

- The basic interconnect structure
- The width of the data paths
- The reconfiguration model

Table 1.2 Timeline of example CGRA architectures

The basic interconnect structure is determined by the global arrangement of the processing elements. This, in turn, is often motivated by the targeted applications for the architecture or by implementation considerations. Obviously, the type of communication architecture has a direct impact on the complexity of application mapping. The predominant structure among the architectures presented here is a two-dimensional mesh. In other architectures, the processing elements are arranged in one or more linear arrays. The third interconnect structure found is a crossbar switch used to connect the processing elements, which allows essentially arbitrary connections. The width of the data path ranges from two bits to 32 bits. The selection of the data-path width for an architecture is a trade-off between flexibility and efficiency.
10

While a wide data path allows an efficient implementation of the processing element for operations on a whole data word, the execution of operations on parts of a data word, which occur in certain applications, typically requires several execution steps, including the extraction of the relevant bit range. A smaller data path allows the direct execution of such operations or implies less extraction overhead. On the other hand, operators for wider data words then have to be composed of several processing elements, which incurs routing overhead. The reconfiguration model determines when a new configuration is loaded into the architecture. For architectures featuring static reconfiguration, a configuration is loaded at the beginning of the execution and stays in place during the computation phase; when a new configuration has to be loaded, the execution must stop. The dynamic reconfiguration model allows a new configuration to be loaded while the application is running; this includes the case where the execution relies on steady reconfiguration of the processing elements. Some architectures normally use static reconfiguration but allow dynamic reconfiguration for special purposes, such as configuring unused parts for the next task while a computation is running. Such architectures are denoted as static, as this remains the main reconfiguration model. The presentation of the example systems is ordered by interconnect structure. In the next subsection, nine mesh-based architectures are described. Then, two systems based on linear arrays are presented, and finally, two systems employing a crossbar switch are shown.

1.2.1 MESH-BASED ARCHITECTURES


Mesh-based architectures arrange their processing elements in a rectangular array, featuring horizontal and vertical connections. This structure allows efficient parallelism and a good use of communication resources. However, the advantages of a mesh are traded for the need for an efficient placement and routing step, and the quality of this step can have a remarkable impact on application performance. Still, due to the relatively low number of processing elements, the placement and routing is often much less complex than it is for FPGAs.

The arrangement of the processing elements encourages nearest-neighbor links between adjacent elements as an obvious communication resource. Typically, longer lines of different lengths are added, which allow connections over several processing elements. The first architecture, called DP-FPGA [ChLe94], was published by D. Cherepacha and D. Lewis from the University of Toronto. This approach shows clearly its relation to FPGAs, featuring bit-sliced ALUs and a routing architecture quite similar to its fine-grained ancestors. The KressArray-I [Kres96] by R. Kress of the University of Kaiserslautern is an approach with a very wide data path, which required reducing the communication resources in order to achieve a feasible chip design. However, the KressArray also features a serial bus as a high-level connection, which connects all processing elements and is globally scheduled. The KressArray-I was accompanied by a control unit and integrated into the Xputer prototype MoM-3 [Rein99]. The Colt [BAM96] architecture by R. Bittner, P. Athanas, and M. Musgrove from Virginia State University features a single integer multiplier in addition to the mesh of processing elements. This architecture is a study for the wormhole run-time reconfiguration computing paradigm, which relies on heavy dynamic reconfiguration. The MATRIX [MiDe96] system by E. Mirsky and A. DeHon of MIT is the first to extend the processing elements to small processors, featuring local instruction memories in each element as well as small data memories. While the previous systems resembled stand-alone architectures (sometimes accompanied by a control unit), the Garp [HaWa97] architecture by J. Hauser and J. Wawrzynek from the University of California, Berkeley is a system composed of a normal microprocessor with a reconfigurable coprocessor. It features lookup-table-based processing elements with only two-bit-wide data paths. In fact, this architecture comes very close to an FPGA and is often considered fine-grained. It appears in this chapter because the basic unit of reconfiguration is a whole row of processing elements, which form a kind of reconfigurable ALU. Moreover, the available routing resources of this approach classify it as a mesh-based architecture.

The Raw machine paradigm [WTSS97], developed by E. Waingold et al. from MIT, features the most complex processing elements, with a complete RISC processor, data and instruction memory, and a programmable switch for time-multiplexed nearest-neighbor interconnects. In contrast to other approaches, which try to provide a more complex FPGA, this architecture tries to realize a mesh of less complex microprocessors, with the main responsibility for proper execution moved to the software environment at compile time. The REMARC [MiOl98a] architecture by T. Miyamori and K. Olukotun from Stanford University is another coprocessor, consisting of 64 small processors with memory. The MorphoSys [LSLB99] machine by G. Lu et al. from the University of California, Irvine is also a hybrid system consisting of a RISC processor with a reconfigurable coprocessor. Here, the reconfigurable mesh is accompanied by a frame buffer for intermediate data. The last architecture shown is the CHESS [MVS99] array by A. Marshall et al. from Hewlett-Packard Laboratories. This architecture features a unique PE arrangement in the form of a chessboard, with embedded memories to support multimedia applications.

CGRA Architecture

A CGRA is essentially an array of processing elements (PEs) connected through a mesh-like network. Each PE can execute an arithmetic or logic operation, a multiplication, or a load/store. PEs can load or store data from the on-chip local memory, but they can also operate on the output of a neighboring PE connected through the interconnect network. Many resource-constrained CGRA designs dedicate some PEs to specific functionality. For example, in each row, typically a few PEs are reserved for multiplication in addition to ALU operations, and a few can perform loads and stores from/to the local memory. The functionality of a PE, i.e., the choice of source operands, the destination of the result, and the operation it performs, is specified in the configuration, which is generated by compiling the application onto the CGRA.
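To make the notion of a per-PE configuration concrete, the sketch below models a small mesh in C: each PE has an operation, two operand sources, and an optional constant, and one array cycle computes every PE's new output from its neighbors' previous outputs. The field names, encodings, and 4x4 size are invented for illustration; real CGRAs pack this information into architecture-specific configuration words.

#include <stdint.h>

#define ROWS 4
#define COLS 4

/* Where a PE takes each operand from (illustrative encoding). */
typedef enum { SRC_NORTH, SRC_SOUTH, SRC_EAST, SRC_WEST, SRC_SELF, SRC_CONST } src_t;
typedef enum { OP_ADD, OP_SUB, OP_MUL, OP_ROUTE } op_t;   /* OP_ROUTE = routing PE */

typedef struct {
    op_t    op;
    src_t   src_a, src_b;
    int32_t constant;           /* used when a source is SRC_CONST */
} pe_config_t;

/* Fetch an operand for PE (r,c) from the previous cycle's outputs. */
static int32_t operand(const int32_t out[ROWS][COLS], int r, int c,
                       src_t s, int32_t constant)
{
    switch (s) {
    case SRC_NORTH: return (r > 0)        ? out[r - 1][c] : 0;
    case SRC_SOUTH: return (r < ROWS - 1) ? out[r + 1][c] : 0;
    case SRC_WEST:  return (c > 0)        ? out[r][c - 1] : 0;
    case SRC_EAST:  return (c < COLS - 1) ? out[r][c + 1] : 0;
    case SRC_SELF:  return out[r][c];
    default:        return constant;      /* SRC_CONST */
    }
}

/* One array cycle: every PE reads its neighbors' previous outputs and
 * produces a new output according to its configuration word. */
void cgra_step(const pe_config_t cfg[ROWS][COLS],
               const int32_t prev[ROWS][COLS], int32_t next[ROWS][COLS])
{
    for (int r = 0; r < ROWS; r++)
        for (int c = 0; c < COLS; c++) {
            int32_t a = operand(prev, r, c, cfg[r][c].src_a, cfg[r][c].constant);
            int32_t b = operand(prev, r, c, cfg[r][c].src_b, cfg[r][c].constant);
            switch (cfg[r][c].op) {
            case OP_ADD:   next[r][c] = a + b; break;
            case OP_SUB:   next[r][c] = a - b; break;
            case OP_MUL:   next[r][c] = a * b; break;
            case OP_ROUTE: next[r][c] = a;     break;   /* pass-through */
            }
        }
}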


Fig. 1.2.1 CGRA architecture and application mapping

A CGRA processor is used as a coprocessor to a main processor. The main processor manages CGRA execution, such as loading CGRA configurations and initiating CGRA execution, through memory-mapped I/O. Once the CGRA starts execution, the main processor can perform other tasks; interrupts can be used to signal the completion of CGRA execution. The local memory of the CGRA is managed by the CGRA through DMA. Hardware double buffering allows for full overlap between computation and data transfer on the CGRA, as well as quick switches between buffers; this becomes very critical for large loops, which may require multiple buffer switches during the execution of a single loop.
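A host-side view of this arrangement might look like the following sketch. The base address, register offsets, and bit meanings are entirely hypothetical (no particular CGRA defines this map); the point is only to show control through memory-mapped registers with polling and a double-buffer swap.

#include <stdint.h>

/* Hypothetical memory-mapped CGRA control block (illustrative addresses). */
#define CGRA_BASE        0x60000000u
#define CGRA_CTRL        (*(volatile uint32_t *)(CGRA_BASE + 0x00)) /* bit0 = start */
#define CGRA_STATUS      (*(volatile uint32_t *)(CGRA_BASE + 0x04)) /* bit0 = done  */
#define CGRA_CFG_ADDR    (*(volatile uint32_t *)(CGRA_BASE + 0x08)) /* config base  */
#define CGRA_BUF_SELECT  (*(volatile uint32_t *)(CGRA_BASE + 0x0C)) /* 0 or 1       */

/* Kick off one accelerated loop nest on the CGRA and wait for completion.
 * With double buffering, DMA can fill the inactive buffer while the array
 * computes on the active one. */
void cgra_run(uint32_t config_addr)
{
    CGRA_CFG_ADDR = config_addr;     /* tell the CGRA where its configuration lives */
    CGRA_CTRL    |= 1u;              /* start execution                              */

    /* The main processor is free to do other work here; an interrupt could
     * replace this polling loop. */
    while ((CGRA_STATUS & 1u) == 0)
        ;                            /* busy-wait for the done flag */

    CGRA_BUF_SELECT ^= 1u;           /* swap local-memory buffers for the next tile */
}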

Application Mapping

CGRAs are typically used to accelerate the innermost loops of applications, thereby saving runtime and energy. The innermost loop of a perfectly nested loop can be represented as a data flow graph, in which the nodes represent micro-operations (arithmetic and logic operations, multiplication, and load/store) and the edges represent the data dependencies between the operations; in Fig. 1.2.1, the dark nodes represent memory operations. While the dependencies are not loop-carried for this particular loop, in general they can be. The task of mapping an application onto a CGRA traditionally comprises mapping the nodes of the data flow graph onto the PE array of the CGRA and mapping the edges onto the connections between the PEs. Since the mesh-like interconnection can be restrictive for application mapping, most CGRAs allow PEs to be used for routing data (routing PEs). In routing mode, a PE does not perform any operation, but simply transfers one of its inputs to its output. This flexibility can be exploited by allowing the edges of the data flow graph to be mapped onto paths in the CGRA, composed of alternating interconnect links and free PEs, starting and ending with an interconnect link. Pipelining is explicit in the CGRA, in the sense that the result of a computation inside one PE can be used by the neighboring PEs in the next cycle. For effective application mapping, the compiler must software-pipeline the loop before mapping it onto the PEs. Thus, in addition to the problem of expressing the application in terms of the functionality of the PEs, a CGRA compiler must explicitly perform resource allocation, pipelining, and routing of data dependencies on the CGRA. It is for these reasons that the problem of application mapping on a CGRA is challenging.
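The routing aspect of the mapping problem can be illustrated with a toy legality check: every data-flow edge must either connect PEs that are mesh neighbors or be realized through one or more routing PEs. The data structures and the placement below are invented for illustration and ignore scheduling, timing, and PE capabilities.

#include <stdio.h>
#include <stdlib.h>

typedef struct { int row, col; } placement_t;     /* PE assigned to a DFG node   */
typedef struct { int src, dst; } edge_t;          /* data dependency src -> dst  */

/* A DFG edge can be realized directly only if producer and consumer sit on
 * neighboring PEs (Manhattan distance 1) or on the same PE across cycles. */
static int needs_routing_pe(placement_t a, placement_t b)
{
    int dist = abs(a.row - b.row) + abs(a.col - b.col);
    return dist > 1;
}

int main(void)
{
    /* Toy DFG: n0 and n1 feed n2 (e.g. t = a*b; u = c+d; y = t+u). */
    placement_t place[3] = { {0, 0}, {1, 2}, {1, 1} };
    edge_t      edges[2] = { {0, 2}, {1, 2} };

    for (int i = 0; i < 2; i++) {
        placement_t p = place[edges[i].src], q = place[edges[i].dst];
        if (needs_routing_pe(p, q))
            printf("edge n%d -> n%d: not adjacent, insert a routing PE\n",
                   edges[i].src, edges[i].dst);
        else
            printf("edge n%d -> n%d: mapped onto a direct PE-to-PE link\n",
                   edges[i].src, edges[i].dst);
    }
    return 0;
}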


CHAPTER 2
2. LITERATURE SURVEY
2.1. Reducing Power Consumption for Dynamically Reconfigurable Processor Array with Partially Fixed Configuration Mapping, by Kazuei Hironaka, Masayuki Kimura, Yoshiki Saito, Toru Sano, Masaru Kato, Vasutan Tunbunheng, Yoshihiro Yasuda, and Hideharu Amano, 2010

Partially Fixed Configuration Mapping (PFCM) is a context-mapping technique for Dynamically Reconfigurable Processor Arrays (DRPAs) that focuses on reducing power consumption. It assigns operations to Processing Elements (PEs) so as to keep as much of the previous context's configuration as possible. This reduces the part of the data-path structure on the PE array that changes, as well as its switching frequency. Preliminary evaluation results show that it can reduce the computing power by 6.7% to 11.3%. The demonstration shows the power reduction directly by using the real chip MuCCRA-3, a prototype DRPA, executing signal-processing applications with and without PFCM applied. The design environment for using PFCM is also exhibited. A coarse-grained Dynamically Reconfigurable Processor Array (DRPA) has advantages in area and power efficiency over traditional field-programmable devices such as FPGAs in certain application fields. These advantages are achieved by the following structures: (1) a simple coarse-grained processor, consisting of an ALU, a data manipulator, a register file, etc., is used as the Processing Element of the array, and (2) dynamic reconfiguration is introduced for time-multiplexed execution of the PE array.

2.2. Power-Efficient Reconfiguration Control in Coarse-Grained Dynamically Reconfigurable Architectures, by Dmitrij Kissler, Andreas Strawetz, Frank Hannig, and Jurgen Teich, 2009

Coarse-grained reconfigurable architectures deliver high performance and energy efficiency for computationally intensive applications like mobile multimedia and wireless communication. This paper deals with power-efficient dynamic reconfiguration control techniques in such architectures. Proper clock-domain partitioning with custom clock gating, combined with automatic clock gating, resulted in a 35% total power reduction. This is more than threefold compared to the single clock-gating techniques applied separately. The corresponding case-study application, with 0.064 mW/MHz and 124 MOPS/mW power efficiency, outperforms the major coarse-grained and general-purpose embedded processor architectures by a factor of 1.7 to 28. The authors evaluate the achieved power efficiency of the proposed WPPA architecture and compare it with some well-established commercial and academic embedded processor architectures (see Table 3 of that paper). With a power efficiency of 0.064 mW/MHz in a case-study image-processing algorithm (1024x768 XGA resolution, at 30 fps, greyscale), the array outperforms the TMS320C6454 DSP from Texas Instruments (core power) by a factor of 28 and the low-power ARM946E-S embedded processor (with cache) by a factor of 1.7 on average. With a performance-power efficiency of up to 124 MOPS/mW for a case-study FIR and edge-detection implementation, it definitely provides ASIC-like performance while simultaneously offering domain-specific flexibility and reconfiguration features.

2.3. Power Reduction Techniques for Dynamically Reconfigurable Processor Arrays, by T. Nishimura, K. Hirai, Y. Saito, T. Nakamura, Y. Hasegawa, S. Tsutsusmi, V. Tunbunheng, and H. Amano, 2008

The power consumption of a Dynamically Reconfigurable Processor Array (DRPA) is quantitatively analyzed using a real chip layout and applications, taking the reconfiguration power into account. The evaluation shows that the processing power for the PEs is dominant and that the reconfiguration power is about 20.7% of the total dynamic power consumption. Based on these evaluation results, the authors proposed two dynamic power reduction techniques: functional-unit-level operand isolation and selective context fetch.

Evaluation results demonstrate that functional-unit-level operand isolation can reduce the dynamic power by up to 20.8% with only a 2.2% area overhead. For selective context fetch, the power reduction is limited by the additional hardware it requires.

2.4. Architectural Exploration of the ADRES Coarse-Grained Reconfigurable Array, by Frank Bouwens, Mladen Berekovic, Andreas Kanstein, and Georgi Gaydadjiev, 2007

Reconfigurable computational architectures are envisioned to deliver power-efficient, high-performance, flexible platforms for embedded systems design. The coarse-grained reconfigurable architecture ADRES (Architecture for Dynamically Reconfigurable Embedded Systems) and its compiler offer a tool flow to design sparsely interconnected 2D array processors with an arbitrary number of functional units, register files, and interconnection topologies. This article presents an architectural exploration methodology and its results for the first implementation of the ADRES architecture in a 90 nm standard-cell technology. The authors analyze performance, energy, and power trade-offs for two typical kernels from the multimedia and wireless domains: IDCT and FFT. Architecture instances of different sizes and interconnect structures are evaluated with respect to their power versus performance trade-offs, an optimized architecture is derived, and a detailed power breakdown for the individual components of the selected architecture is presented. The main contributions of the paper are:

- A tool flow for energy, power, and performance exploration of ADRES-based CGRA architectures;

- A novel methodology for power analysis that replaces RTL simulation with instruction-set simulation to obtain activity stimuli;

- Analysis of performance, power, and energy trade-offs for different array sizes and interconnect topologies;

- Derivation of an optimized architecture for key multimedia and wireless kernels, together with the power breakdown of its components.

In this paper, the authors explored various architectures for the ADRES coarse-grained reconfigurable array and selected the appropriate one based on power, energy, and performance trade-offs. A specially defined methodology and tool flow (1) maps the FFT and IDCT benchmarks onto the array using the DRESC compiler for high-performance, low-energy execution, (2) synthesizes each architecture into a front-end design, and (3) simulates with either the compiled ISA simulator, ModelSim v6.0a, or the Esterel simulator for performance and power evaluation. Power is calculated by annotating the switching activity after Esterel or RTL simulation onto the gate-level design. This three-fold simulation approach allows results to be obtained quickly, which is important for architecture exploration. Fourteen different ADRES instances were created based on 7 different architectural features and evaluated with the available tool flow. The obtained power, energy, and performance charts show that a combination of mesh and mesh-plus interconnection topologies, with diagonal connections between functional units and local data register files, results in good performance of 10.35-17.51 MIPS/mW and power of 73.28-80.45 mW, with the least amount of energy, 0.619-37.72 uJ, for FFT and IDCT, respectively.

2.5. A Highly Parameterizable Parallel Processor Array Architecture, by D. Kissler, F. Hannig, A. Kupriyanov, and J. Teich, in Proc. IEEE Int. Conf. on Field-Programmable Technology, Bangkok, Thailand, 2006, pp. 105-112


In this paper, a new class of highly parameterizable coarse-grained reconfigurable architectures called weakly programmable processor arrays is discussed. The main advantages of the proposed architecture template are the possibility of partial and differential reconfiguration and the systematic classification of the different architectural parameters, which allows flexibility and hardware cost to be traded off. The applicability of the approach is tested in a case study with different interconnect topologies on an FPGA platform. The results show substantial flexibility gains with only marginal additional hardware cost. The contributions of the paper can be summarized as follows:

- A novel, highly parameterizable processor array template is presented.
- The ability to perform partial and differential reconfiguration of the processor array is provided.
- The architecture is systematically described by different static and dynamic parameters.
- A flexibility/hardware-cost analysis based on the architectural parameters is performed.

In summary, a new class of highly parameterizable parallel embedded processor architectures called weakly programmable processor arrays was discussed, as well as a novel approach to dynamically reconfigurable interconnection schemes for coarse-grained reconfigurable architectures. Furthermore, a classification of architectural parameters was made.

2.6. Energy-Aware Exploration of Coarse-Grained Reconfigurable Processors, by Praveen Raghavan, Andy Lambrechts, Murali Jayapala, Francky Catthoor, and Diederik Verkest, 2005

In recent years, Coarse-Grained Reconfigurable Architectures (CGRAs) have emerged as a viable option in embedded systems. In this paper, a power breakdown analysis of such a CGRA is presented, along with an energy-aware exploration of one of the most important, but often neglected, parts of processor architectures: the interconnect.

The results show that the choice of interconnection topology has a significant influence on the resulting quality metrics, such as performance and energy efficiency. A comprehensive power breakdown analysis for a CGRA processor was performed. It was shown that interconnect energy is dominant in CGRA processors, and that changing the interconnect topology affects both performance and energy consumption for the processor as a whole. It was identified that there is a Pareto trade-off between energy and performance, and this was demonstrated using three different case studies.


CHAPTER 3
3. METHODOLOGY
Xilinx ISE is a software tool produced by Xilinx for synthesis and analysis of HDL designs, which enables the developer to synthesize ("compile") their designs, perform timing analysis, examine RTL diagrams, simulate a design's reaction to different stimuli, and configure the target device with the programmer.
Xilinx FPGAs can run a regular embedded OS (such as Linux or VxWorks) and can implement processor peripherals in programmable logic. Xilinx's IP cores range from simple functions (BCD encoders, counters, etc.) and domain-specific cores (digital signal processing, FFT and FIR cores) to complex systems (multi-gigabit networking cores, the MicroBlaze soft microprocessor, and the compact PicoBlaze microcontroller). Xilinx also creates custom cores for a fee. The ISE Design Suite is the central electronic design automation (EDA) product family sold by Xilinx. Its features include design entry and synthesis supporting Verilog or VHDL, place-and-route (PAR), verification and debug using the ChipScope Pro tools, and creation of the bit files that are used to configure the chip. Xilinx's Embedded Development Kit (EDK) supports the embedded PowerPC 405 and 440 cores (in Virtex-II Pro and some Virtex-4 and -5 chips) and the MicroBlaze core. Xilinx's System Generator for DSP implements DSP designs on Xilinx FPGAs. A freeware version of its EDA software, called ISE WebPACK, is available for use with some of its non-high-performance chips.

3.1. REQUIREMENTS:
Software: Xilinx ISE
Operating systems: FreeBSD, Linux, Microsoft Windows
Type: EDA

REFERENCES
1. Kazuei Hironaka, Masayuki Kimura, Yoshiki Saito, Toru Sano, Masaru Kato, Vasutan Tunbunheng, Yoshihiro Yasuda, and Hideharu Amano, "Reducing power consumption for Dynamically Reconfigurable Processor Array with partially fixed configuration mapping," 2010.
2. D. Kissler, A. Strawetz, F. Hannig, and J. Teich, "Power-efficient reconfiguration control in coarse-grained dynamically reconfigurable architectures," J. Low Power Electron., vol. 5, no. 1, pp. 96-105, 2009.
3. T. Nishimura, K. Hirai, Y. Saito, T. Nakamura, Y. Hasegawa, S. Tsutsusmi, V. Tunbunheng, and H. Amano, "Power Reduction Techniques for Dynamically Reconfigurable Processor Arrays," 2008.
4. Frank Bouwens, Mladen Berekovic, Andreas Kanstein, and Georgi Gaydadjiev, "Architectural Exploration of the ADRES Coarse-Grained Reconfigurable Array," 2007.
5. D. Kissler, F. Hannig, A. Kupriyanov, and J. Teich, "A highly parameterizable parallel processor array architecture," in Proc. IEEE Int. Conf. Field-Programmable Technology, Bangkok, Thailand, 2006, pp. 105-112.
6. Praveen Raghavan, Andy Lambrechts, Murali Jayapala, Francky Catthoor, and Diederik Verkest, "Energy-Aware Exploration of Coarse Grained Reconfigurable Processors," 2005.

