
Simulink based Hardware-Software Codesign Flow for Heterogeneous MPSoC

Katalin Popovici
TIMA Laboratory
46 Avenue Felix Viallet
38031, Grenoble, France
katalin.popovici@imag.fr

Ahmed Amine Jerraya
CEA LETI Minatec
17 Rue des Martyrs
38054, Grenoble, France
ahmed.jerraya@cea.fr

Keywords: hardware-software gradual refinement, multimedia applications, abstraction levels

Abstract

SystemC based design methodology has been widely adopted for heterogeneous multiprocessor System on Chip (MPSoC) design. However, SystemC is a hardware-oriented language and it is not the standard language used by designers to specify complex applications at algorithm level. On the other hand, Simulink is a popular choice for algorithm designers to specify complex systems, but there are few design tools to implement Simulink models on MPSoC. To deal with the increasing complexity of embedded applications and MPSoC architectures, concurrent hardware/software design and verification at different abstraction levels is an essential technique. In this paper, we present a Simulink-SystemC based multiprocessor SoC design flow that enables mixed hardware/software refinement and simulation at different abstraction levels, in addition to opening new facilities like communication mapping exploration and interconnection component refinement. We applied the proposed approach for software and communication architecture refinement for three multimedia applications: MP3, Motion JPEG and H.264.

1. INTRODUCTION

Current embedded systems require a flexible and high-performance architecture to execute concurrently specific applications, including MPEG 2/4, H.263/4, CDMA 2000, WCDMA, and MP3. Heterogeneous multi-processor System on Chip (MPSoC) architecture is becoming an attractive solution because it provides highly concurrent computation and flexible programmability [1]. However, as the complexity of the system increases, it becomes more difficult and time consuming to develop and verify MPSoC software and hardware. To solve this problem, new design methodologies and languages are needed.

SystemC [2] has become the preferred development language for hardware/software to overcome the design complexity problem. SystemC, which is based on C/C++, provides the abstraction and constructs needed for high-level system modeling and verification. Such abstraction, primarily at the transaction level, allows much faster simulations and analysis, and enables design issues to be detected early in the process. Several academic [3] and industrial research projects [4] proposed automatic MPSoC design flows from a SystemC model. Therefore, SystemC can be a good candidate to design complex MPSoC. However, SystemC is still a hardware-oriented language and is not suitable to specify, analyze, design and verify the overall application at algorithm level.

For algorithm modeling, simulation and validation, Simulink has become a popular choice for many system designers [5]. Simulink offers a block set of algorithms for a variety of applications. A system designer can develop a model easily by combining the pre-defined blocks or user-defined blocks called S-Functions. Simulink also provides system designers with a simulation environment to verify the developed algorithm model. Real-Time Workshop (RTW) and Simulink HDL Coder can automatically generate software (C code) and hardware (VHDL/Verilog) code from the algorithm model. However, mapping and refining algorithm models onto complete MPSoCs is still an open issue in the Simulink community.

In this paper, we propose a hardware-software co-design flow that enables gradual MPSoC design from a high level application model. The high level application model combines the algorithm with partitioning and mapping information. From this initial model captured in Simulink, the hardware and software are gradually refined and simulated at different abstraction levels using SystemC. Additionally, the proposed flow allows for experimentation with different communication mapping schemes and integration of different types of network components for the global interconnection: bus based or Network on Chip (NoC) based architectures. The main contribution of this work is the definition of the representation models at the different abstraction levels.

The rest of the paper is organized as follows. Section II presents previous work on Simulink and SystemC based design flows. Section III defines the four abstraction levels adopted in the proposed flow and the hardware/software representation models at each abstraction level. Section IV

SCSC 2007 497 ISBN # 1-56555-316-0


describes the main steps of the overall flow. Section V shows experiment results for three multimedia applications (MP3, Motion JPEG and H.264). Finally, section VI draws some conclusions.

2. RELATED WORKS

To handle the ever increasing complexity of MPSoC design, System Level Design (SLD) methodology has been proposed as the solution [5]. The key techniques are raising the abstraction level and decoupling communication from computation. SystemC allows communication and computation modeling at both the transaction level and the RT (Register Transfer) level, and it is well suited to implement SLD methodologies. Several academic (for instance ROSES [3]) and industrial (for instance Coware [4]) tools have been proposed to generate MPSoC automatically from SystemC transaction level models. However, SystemC is still oriented to hardware designers and is not the standard language used by designers to specify complex systems at algorithm level.

Simulink is widely accepted as a system level design language and there are several design flows to generate hardware and software automatically. However, the target architecture is a single processor. For instance, Real Time Workshop [5] can generate C code from a Simulink model, but the generated C code supports only a single task on a single processor. The approach in [6] can generate optimized RTL from a MATLAB/Simulink model, but it accepts only a restricted subset of MATLAB/Simulink models and does not handle software refinement for a multiprocessor architecture.

MATCH [7] can generate software and hardware code automatically from a MATLAB description for configurable computing systems. MATCH is targeted to a fixed heterogeneous architecture composed of FPGAs (Field-Programmable Gate Arrays), RISCs (Reduced Instruction Set Computers), and DSPs (Digital Signal Processors) and it does not generate SystemC models to allow design space exploration.

This paper provides a design flow that can generate a SystemC model from a Simulink model for a multitasking heterogeneous multiprocessor architecture. The generated SystemC model allows early performance estimation and communication mapping exploration.

3. HARDWARE/SOFTWARE DESIGN FLOW

3.1. MPSoC Hardware/Software architecture

The MPSoC design flow proposed in this paper is based on a generic heterogeneous MPSoC hardware-software architecture (figure 1-a) [8].

The different processor and hardware processing subsystems interact via a complex communication network.

A processor subsystem (figure 1-b) is a programmable subsystem that includes a parallel processing unit for computation and specific hardware components such as local memories, peripherals and hardware accelerators. In heterogeneous MPSoC architectures, different kinds of processing units may be needed (e.g. DSP for data oriented operations, RISC for control oriented operations and ASIP for application specific computation).

Each processor subsystem executes a specific software stack organized in two layers, the application layer and the hardware dependent software (HdS) layer (figure 1-c). The top layer is the software application, which makes use of the HdS Application Programming Interface (HDS_API) to abstract the underlying platform.

The HdS allows the execution of the high level code on the hardware platform. The HdS layer is associated with the hardware dependent low level software behavior, such as interrupt service routines, context switches and specific I/O control. It is made of lower software layers that may include an operating system (OS), specific I/O communication software and a hardware abstraction layer (HAL).

This mixed hardware-software representation of the MPSoC architecture can be modeled using different levels of detail by abstracting the low level software layers, processors and communication topology, as presented in the following section.

[Figure 1. MPSoC Hardware/Software Architecture: a) Generic architecture (HW subsystems and CPU subsystems connected by a communication network); b) HW architecture of a CPU subsystem (CPU, memory, peripherals, network interface); c) SW stack executed by a CPU subsystem (tasks 1 to n of the application, HDS_API, HdS made of OS and specific I/O software, HAL_API, Hardware Abstraction Layer)]

3.2. MPSoC abstraction levels

The mixed hardware-software architecture concept allows modeling a heterogeneous MPSoC at different abstraction levels, independent from the description language used by the designer. In this paper we will consider four abstraction levels [9-11].

The highest abstraction level is called System Architecture; the lowest and fully detailed one is called Virtual Prototype. Between these levels there is a huge gap in terms of hardware-software interfaces, as detailed in the next sections. In order to reduce the gap between these levels, we defined two intermediate abstraction levels: the Virtual Architecture and the Transaction Accurate Architecture level [12].
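The layering of the software stack described in section 3.1 (application tasks on top of an HdS API, itself built on a HAL) can be illustrated with a minimal sketch. This is only an illustration under stated assumptions: the class names `Hal` and `HdsApi`, the channel table and the buffer addresses are hypothetical and are not taken from the paper or its tools.

```python
# Hypothetical sketch of the two-layer software stack of figure 1-c.
# All names (Hal, HdsApi, "comm1", addresses) are illustrative assumptions.

class Hal:
    """Hardware abstraction layer: the only layer that touches the memory."""

    def __init__(self, size):
        self.mem = bytearray(size)

    def write_mem(self, addr, src):
        self.mem[addr:addr + len(src)] = src

    def read_mem(self, addr, size):
        return bytes(self.mem[addr:addr + size])


class HdsApi:
    """Hardware dependent software: maps channel names to buffer addresses."""

    def __init__(self, hal, channel_map):
        self.hal = hal
        self.channel_map = channel_map  # channel name -> (address, size)

    def send(self, channel, data):
        addr, size = self.channel_map[channel]
        if len(data) > size:
            raise ValueError("message larger than channel buffer")
        # Pad with zero bytes so the whole buffer is overwritten.
        self.hal.write_mem(addr, data.ljust(size, b"\0"))

    def recv(self, channel):
        addr, size = self.channel_map[channel]
        return self.hal.read_mem(addr, size)


# Application layer: tasks only see send/recv, never addresses.
def producer_task(hds):
    hds.send("comm1", b"frame")


def consumer_task(hds):
    return hds.recv("comm1").rstrip(b"\0")


hal = Hal(64)
hds = HdsApi(hal, {"comm1": (16, 8)})
producer_task(hds)
result = consumer_task(hds)
```

The point of the layering is that the task code stays unchanged when the buffer is moved to another address or another memory: only the channel table of the HdS layer changes.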



[Figure 2. MPSoC at different abstraction levels: a) System Architecture; b) Virtual Architecture; c) Transaction Accurate Architecture; d) Virtual Prototype]

Figure 2 illustrates the MPSoC abstraction levels for a simplified application made of 3 tasks (T1, T2 and T3) that are mapped on an architecture made of 2 processing units and several memory hardware subsystems. Simulation at each level is used to validate the software and hardware components at the corresponding abstraction level. The main differences between these levels are the level of detail of the hardware and software, the way of specifying the hardware-software interfaces and the communication.

The highest level is called system architecture level (SA) (figure 2.a). This combines the application with partitioning and mapping information. The application is made of a set of functions grouped into tasks. A function is an abstract view of the behavior of an aspect of the application. Several tasks may be mapped on the same processor subsystem. The communication between tasks mapped on the same subsystem (intra-subsystem) and the communication between the different subsystems (inter-subsystem) make use of abstract communication links, e.g. standard Simulink signals or explicit communication units that correspond to specific communication paths of the target platform. The corresponding hardware platform consists of the set of subsystems and inter-subsystem communication units. The simulation at this level allows validation of the application's algorithm. In this paper we use the standard Simulink environment to represent the System Architecture model.

The second level is the virtual architecture level (VA) (figure 2.b). The application functions are refined to C code for each task. In order to abstract the underlying hardware architecture, the task code uses HdS APIs. The communication primitives of the HdS API (send/recv) access explicit communication components. The software is executed using an abstract model of the hardware architecture that provides an emulation of the HdS API. The hardware platform is composed of these abstract subsystems, an abstract interconnection component and storage resources. The simulation at this level allows validation of the final code of the tasks and may give useful statistics about the communication requirements, such as the total amount of data exchanged during the execution.

The second intermediate level is called transaction accurate architecture level (TA) (figure 2.c). The task code is linked with an OS and a communication library to implement the communication units and task management services. The resulting software makes use of hardware abstraction layer primitives (HAL_API). The data transfers use explicit addresses, e.g. read_mem(addr, dst, size)/write_mem(addr, src, size). The hardware is a more detailed platform which includes the network component (NoC or bus), the explicit peripherals accessed by the HAL APIs and an abstract computation model of the processor. The simulation at this level allows validating the integration of the application with the OS and the communication layer. It may also provide precise information about the communication performance, such as conflicts on the bus or NoC congestion.

The lowest level is called Virtual Prototype (figure 2.d). The HAL API and processor are implemented through the use of a HAL software layer and the corresponding processor model for each software subsystem. At the virtual prototype level the communication consists of physical I/Os, e.g. load/store. The hardware platform is fully detailed with RTL components for the hardware resources. The simulation at this level allows performance validation and it corresponds to classical hardware/software cosimulation models with Instruction Set Simulators [13] [14] for the processors.

4. MPSOC DESIGN STEPS

Figure 3 illustrates the main steps of the proposed MPSoC design flow. The flow starts with a manual modeling step to build the hierarchical Simulink model at the System Architecture level. This input Simulink model captures the grouping of functions into tasks and the tasks into subsystems. The Simulink model may use explicit communication units to abstract the intra-subsystem communication (between the tasks mapped on the same processor) and inter-subsystem communication (between the tasks mapped on different processors). The proposed flow allows incremental generation and validation of the software, hardware and the communication architecture. At each step, the flow generates a specific software component and a hardware platform that allows debugging the corresponding software component and exploring the communication architecture.

[Figure 3. MPSoC Design Flow. Inputs: Application, Architecture.
  Modeling (application modeling, tasks mapping information, comm. mapping information) -> SA Model -> Simulation (application algorithm validation)
  VA Design (task code generation, abstract HW generation) -> VA Model -> Simulation (tasks code validation: deadlock free execution; partitioning validation; communication architecture: number of exchanged bytes)
  TA Design (OS and comm integration, TLM HW generation) -> TA Model -> Simulation (HdS integration validation; communication architecture: number of exchanged bytes, number of conflicts on the bus, amount of NoC congestion)
  VP Design (HAL integration, RTL HW generation) -> VP Model -> Simulation (final SW binary validation, performance measurements, memory mapping validation)]

4.1. System Architecture Design and Simulation

The System Architecture model represents a high level application model annotated with partitioning and mapping information. We use the Simulink environment to represent this level. The functional model of the application may be described using both standard Simulink blocks and user defined S-functions and represents the application's algorithm.

Aspects related to the architecture model (e.g. processing units or storage resources available in the target hardware platform) are combined into the application model (i.e. multiple tasks executed on the processing units), resulting in a combined architecture/application model. Thus, the System Architecture model expresses parallelism in the target application by capturing the mapping of the application functions into tasks and the tasks into subsystems. It also makes explicit the communication units between the tasks, to abstract the implementation of the communication protocol used for the data exchange between them.

The System Architecture model is annotated with software and hardware architecture parameters to allow the automatic generation and validation of the software stack, hardware platform and communication architecture.

Examples of software architecture parameters are: OSType, which specifies the name of the OS running on the target processor (Linux, Mutek, DwarfOS, eCos); CommType, which identifies the type of the communication library used during the HdS integration (POSIX APIs, MPI [15], read/write [16], etc.); SchedulerType, which identifies the type of the scheduler (preemptive, cooperative) in case the OS provides more than one type of scheduler; SchedulerAlgorithm, which defines the algorithm used for task management by the OS (round-robin, priority based), etc.

Examples of hardware architecture parameters that annotate the System Architecture model are: ResourceType, which specifies the type of the hardware resource in the case of a subsystem (CPU-ARM, CPU-DSP, CPU-XTENSA, I/O) or the storage resource of the communication buffer in the case of an inter-subsystem communication unit (SRAM, register, global memory); MemoryAccessType, which identifies whether the memory access is performed directly by the processor or by using a DMA (Direct Memory Access) component, etc.

The Simulink model is used as a reference model for debugging the application's algorithm. At the System Architecture level, we perform a discrete-time simulation to validate the functionality of the application. For the inter-subsystem and intra-subsystem communication units, Simulink uses an abstract simulation model for each of these units based on generic Simulink I/Os.

4.2. Virtual Architecture Design and Simulation

At the Virtual Architecture level, the model is described in SystemC and is generated according to the parameters specified in the initial Simulink model.

At this level, the software is refined to task C code by using specific C code generators [17]. Each task is encapsulated within an SC_THREAD and uses communication primitives (HdS APIs).

The hardware is refined to a set of abstract SystemC modules (SC_MODULE), one for each subsystem. The SC_MODULE of the processor includes the task modules that are mapped on the processor and the communication channels for the intra-subsystem communication between the tasks inside the same processor. The communication channels between the tasks mapped on the same processor are implemented using standard SystemC channels. For the inter-subsystem communication, the hardware architecture also integrates the resources addressed explicitly by the HdS APIs. Typical examples are memories that serve to store the communication buffers. The interconnection between the different components uses an abstract model of the communication network that allows the data transfer from the source to the destination module.
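To make the annotation of section 4.1 more concrete, the following sketch shows one possible way to hold the annotated System Architecture model as plain data and to derive which communication units are intra-subsystem and which are inter-subsystem. The parameter names (OSType, CommType, SchedulerAlgorithm, ResourceType, MemoryAccessType) come from the text; the data classes, the example task mapping and the `classify` helper are assumptions made for illustration and do not describe the authors' tool.

```python
from dataclasses import dataclass, field


@dataclass
class Subsystem:
    name: str
    resource_type: str                          # ResourceType, e.g. "CPU-ARM"
    os_type: str = "DwarfOS"                    # OSType
    comm_type: str = "MPI"                      # CommType
    scheduler_algorithm: str = "round-robin"    # SchedulerAlgorithm
    tasks: list = field(default_factory=list)   # tasks mapped on this subsystem


@dataclass
class CommUnit:
    name: str
    producer: str                               # producing task
    consumer: str                               # consuming task
    resource_type: str = "global memory"        # ResourceType of the buffer
    memory_access_type: str = "direct"          # MemoryAccessType: "direct" or "DMA"


def classify(comm_units, subsystems):
    """Split communication units into intra- and inter-subsystem lists."""
    owner = {task: s.name for s in subsystems for task in s.tasks}
    intra, inter = [], []
    for cu in comm_units:
        same = owner[cu.producer] == owner[cu.consumer]
        (intra if same else inter).append(cu.name)
    return intra, inter


# Hypothetical mapping: two tasks on one processor, one task on another.
subsystems = [
    Subsystem("SS1", "CPU-ARM", tasks=["T1", "T3"]),
    Subsystem("SS2", "CPU-DSP", tasks=["T2"]),
]
comm_units = [
    CommUnit("commA", "T1", "T3"),
    CommUnit("commB", "T1", "T2"),
    CommUnit("commC", "T2", "T3", memory_access_type="DMA"),
]
intra, inter = classify(comm_units, subsystems)
```

A generator working from such a description could then emit one abstract module per subsystem and one standard channel per intra-subsystem unit, as described for the Virtual Architecture level.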



The simulation at the Virtual Architecture level allows validating the task C code of the refined software and the partitioning. It uses the pure SystemC engine for task scheduling.

The Virtual Architecture simulation also gives important statistics regarding the communication requirements, such as the total number of bytes exchanged between the subsystems during the whole execution of the application, the amount of data passing through the global interconnect component, the worst case buffer size requirements for the storage resources in order to support the communication mapping specified at the System Architecture level, or the number of read/write operations performed at the storage modules. Based on the application requirements and the communication traffic observed during the virtual architecture simulation, the designer can fix the topology of the interconnect component that will be included in the hardware platform at the next abstraction level (bus-based topology or NoC based topology).

4.3. Transaction Accurate Architecture Design and Simulation

The Transaction Accurate Architecture model is described using SystemC TLM and is generated according to the annotated architecture parameters of the initial Simulink model and the results of the Virtual Architecture model simulation.

At the Transaction Accurate Architecture level, the software is composed of task code, an OS, and a communication layer which implements the HdS APIs. The OS and communication components make use of HAL APIs to access the hardware resources. The intra-subsystem communication is managed fully by the OS and communication libraries. The tasks are scheduled by the OS.

At the Transaction Accurate Architecture level, the hardware is refined to a more detailed architecture. This includes the local components of the different subsystems, such as peripherals, synchronization components, local memories and network interfaces. The different subsystems are interconnected by an explicit network communication component (bus or NoC). During the generation, design decisions such as NoC size definition, subsystem positioning over the global interconnect component, NoC topology, NoC routing algorithm and communication buffer size are implemented at the Transaction Accurate Architecture level.

The simulation at the Transaction Accurate Architecture level is based on native software execution and SystemC simulation for the hardware platform [18]. Each software stack is a SystemC thread which creates a UNIX process for the software execution. At the beginning of the simulation, the SystemC platform launches a GNU standard debugger (gdb) UNIX process for each software stack in order to start its execution. The software stack interacts with the corresponding SystemC abstract processor module through the UNIX IPC layer. The hardware-software interface uses UNIX shared memory for the interaction between the software and hardware.

The simulation allows validation of the integration of the task code with the OS and communication libraries. On the hardware side, it gives more precise statistics on the communication architecture, such as the number of conflicts on the shared global bus due to simultaneous access requests in the case of a bus-based architecture topology. For a NoC based architecture topology, useful information obtained during the simulation includes the amount of NoC congestion, the number of routing requests, the number of transmitted packets, the average number of transmitted bytes per packet or the number of times some routers failed to transmit a packet due to conflicts. For both topologies (bus and NoC), the Transaction Accurate Architecture simulation allows extracting the total amount of transmitted bytes.

4.4. Virtual Prototype Design and Simulation

The Virtual Prototype model is described in SystemC at RTL level. The software stack is fully explicit, including the HAL layer to access the hardware resources, and it is detailed to ISA (Instruction Set Architecture) level for a specific processor.

The hardware architecture incorporates an ISS for each processor to execute the final binary code.

The simulation performed at this level is cycle accurate. It allows validating the memory mapping of the target architecture and the final software code. It also provides precise performance information such as software execution time, computation load for the processors, the number of cycles spent on communication, etc.

5. EXPERIMENTAL RESULTS

To show the efficiency of the proposed MPSoC design flow, we applied the presented approach to three real multimedia applications: an MP3 Decoder, a Motion JPEG Decoder and an H.264 Main Profile Encoder. First, we developed the System Architecture model to validate the application's algorithm and to specify the partitioning and mapping.

As target architecture we used a heterogeneous architecture based on [19]. The architecture is composed of an ARM9 processor, an ATMEL DSP subsystem, an I/O subsystem and a global memory. The communication between the processors may use different resources for mapping the data buffers (i.e. local memories of both processors or global memory). The different subsystems may be interconnected using different components, such as an AMBA bus or a NoC (Mesh or Torus topology), according to the application's requirements.
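The bus statistics mentioned for the Transaction Accurate Architecture level in section 4.3 (number of bus accesses and number of conflicts caused by simultaneous requests) can be pictured with a toy model. The sketch below is an assumption-laden illustration, not the SystemC TLM model of the paper: it assumes one grant per cycle, first-come first-served arbitration, and counts one conflict for every request that is still waiting after a grant.

```python
from collections import deque


def simulate_shared_bus(arrivals):
    """Count accesses and conflicts on a single shared bus.

    arrivals[c] lists the masters that issue a new request in cycle c.
    Assumptions of this sketch: one grant per cycle, oldest request first,
    one conflict counted per request left waiting in a cycle.
    """
    pending = deque()
    accesses = 0
    conflicts = 0
    cycle = 0
    while cycle < len(arrivals) or pending:
        if cycle < len(arrivals):
            pending.extend(arrivals[cycle])
        if pending:
            pending.popleft()           # grant the bus to the oldest request
            accesses += 1
            conflicts += len(pending)   # requests that lost arbitration
        cycle += 1
    return accesses, conflicts


# Hypothetical traffic: ARM and DSP collide in cycle 0, ARM is alone in cycle 2.
light = simulate_shared_bus([["ARM", "DSP"], [], ["ARM"]])
burst = simulate_shared_bus([["ARM", "DSP", "IO"]])
```

Even this simple model shows why the ratio of conflicts to accesses is a useful indicator when deciding whether a shared bus is sufficient or a NoC is needed.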



After the System Architecture model, we refined the hardware and the software to the Virtual Architecture level in order to estimate by simulation the total number of bytes exchanged during the execution of the whole application. Based on the communication requirements of the application, we decided the topology of the global interconnect component and then we further refined the hardware and software to the Transaction Accurate Architecture level to validate the implementation of the communication protocol and build the software stack. These steps are detailed in the following paragraphs.

5.1. System Architecture Simulation

We developed the MPEG-1 Audio (layer 3) decoder (in short, the MP3 Decoder) model in Simulink. During this step, the main functions of the MP3 decoder were isolated into separate tasks (figure 4). Then, we mapped two tasks (T1 and T3) on the ARM processor and task T2 on the DSP. To represent the communication protocol between the tasks, we inserted 4 communication units: 1 between the 2 tasks mapped on the ARM and 3 between the processors. The simulation time in Simulink for an 80KB input MP3 audio file was 5s on a PC running at 1.73GHz with 1 GB of RAM. The simulation allowed validating the application algorithm.

[Figure 4. MP3 Decoder: blocks Packet Decoder, Huffman Decoder, IQ, IMDCT and Synthesis Filter bank, with MP3 input and PCM output, grouped into tasks T1, T2 and T3]

In the same manner, we built the Motion-JPEG Decoder in Simulink using 7 S-functions and we grouped them into 3 tasks mapped on the two processors. The decoded image is displayed using an LCD panel connected to the I/O Peripherals subsystem (POT). The System Architecture Model of MJPEG is illustrated in figure 5. It contains 8 communication units: 5 between the tasks mapped on the ARM processor and 3 between the different subsystems. The simulation time for a 10 frames input bitstream encoded using QVGA YUV 444 format was 17s in Simulink.

[Figure 5. Motion JPEG System Architecture in Simulink: ARM, DSP and POT subsystems with communication units comm1 to comm8]

To build the H.264 encoder we used the x264 open source code as reference C code. The CABAC part of the encoding algorithm was mapped on the DSP, while the other computation and control functions were mapped on the ARM processor (figure 6). The communication between the 2 tasks mapped on the processors requires 19 communication units. The simulation time to encode in Simulink a 10 frames Main Profile QCIF YUV 420 video sequence was 45s.

[Figure 6. H.264 Encoder algorithm: tasks T1 and T2; blocks include ME, MC, inter and intra prediction, IT, Q, Reorder, CABAC, NAL, inverse Q, inverse IT and Filter, with .yuv input]

During the experimentation, we tried different communication mapping schemes between the processors. In the first scheme, the data exchange is made only via the global memory. The second communication scheme makes use of the local memories to store the communication buffers. The third case uses both local and global memories for the communication (mixed case). In all the situations, the communication between tasks mapped on the same processor makes use of the software FIFO protocol. In the following sections, we will consider the worst case scenario, when all the communication buffers are mapped on the global memory. Thus, every data exchange between the processors has to pass at least two times through the global interconnect component (the producer writes to the global memory, the consumer reads the data from the global memory).

5.2. Virtual Architecture Simulation

We generated the Virtual Architecture model. The software is composed of C code for the application tasks. The hardware is composed of three abstract subsystems (DSP, ARM and I/O), the global memory and an abstract network component, described in SystemC. At this level, the software tasks are scheduled by the SystemC simulation engine.

The simulation at the virtual architecture level allowed gathering of important early performance measurements, e.g. the amount of data exchanged between the processors, the buffer size requirements and the number of read/write operations performed at the storage modules.

Table 1 illustrates the total amount of exchanged bytes during the execution of the applications and the simulation time for the 3 multimedia applications.
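The effect of the communication mapping schemes of section 5.1 on the global interconnect can be estimated with simple arithmetic. The sketch below is a rough illustration: the factor of two for buffers in global memory follows the text (one write by the producer, one read by the consumer), while the factor of one for buffers in a local memory, the communication unit names and the payload sizes are assumptions chosen only for the example.

```python
# Number of times a payload crosses the global interconnect, per buffer location.
# "global": 2 follows the text; "local": 1 is an assumption of this sketch
# (only one of the two processors accesses the buffer remotely).
PASSES = {"global": 2, "local": 1}


def interconnect_bytes(exchanges, mapping):
    """Total bytes crossing the global interconnect.

    exchanges: {communication unit: payload bytes exchanged}
    mapping:   {communication unit: "global" or "local"}
    """
    return sum(PASSES[mapping[cu]] * size for cu, size in exchanges.items())


# Hypothetical payloads for two inter-subsystem communication units.
exchanges = {"comm1": 1000, "comm2": 500}

worst_case = interconnect_bytes(exchanges, {"comm1": "global", "comm2": "global"})
mixed_case = interconnect_bytes(exchanges, {"comm1": "local", "comm2": "global"})
```

Under these assumptions the all-global mapping gives the upper bound on interconnect traffic, which is why it is a reasonable worst case for dimensioning the interconnect.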



Table 1. Virtual Architecture simulation results obtain a routing definition. In this simulation, 42.623.519
requests of routing were demanded. The first column of
Application Total bytes Simulation time
Table 2 illustrates the percentage of routing requests at each
MP3 Decoder 3014k ~2s
router, while the other details this information for each
80KB MP3 audio file
router port.
MJPEG Decoder 5256k ~14s The H.264 is the most communication intensive
10 frames QVGA YUV 444 application between the three case studies. In this case, the
H.264 Encoder 2488270k ~4m46s interconnect component between the different subsystems
10 frames QCIF YUV 420 was a NoC similar with the one used for the MJPEG
application, but it implements a Torus topology. In this NoC
5.3. Transaction Accurate Architecture Simulation
At the Transaction Accurate level, we used a tiny OS for task management and MPI-based semantics for the communication APIs. On the hardware side, the architecture is more detailed and includes the local peripherals of each processor subsystem (local memories, local bus, network interface, interrupt controller and mailbox for synchronization). The network component becomes explicit, based on the communication requirements determined during the Virtual Architecture level simulation.

The execution of the MP3 application requires around 3MB of data to be exchanged between the processors. Therefore, an AMBA bus is suitable as the interconnect component to ensure the required communication performance. During the simulation of the MP3 application on the Transaction Accurate Architecture with AMBA, the global bus was accessed 4366k times and 2489k conflicts were detected due to simultaneous access requests for the shared global bus.

For the MJPEG Decoder application, the interconnect component adopted at the Transaction Accurate level was a Hermes-based NoC [20]. This NoC employs a 2D Mesh topology with 6 routers (3x2). The routers have from three to five ports, depending on the router position relative to the limits of the mesh. The NoC uses a pure XY routing algorithm shared by all the ports and a round robin scheduler to arbitrate simultaneous transmission requests.

Table 2 presents the relative number of routing requests per input port of each router in the mesh NoC during the MJPEG simulation.

Table 2. Relative routing requests during the simulation
Router   TOTAL    LOCAL    NORTH    SOUTH    EAST     WEST
0x0      24.0%    11.3%    5.7%     0.0%     7.0%     0.0%
0x1      11.4%    2.9%     0.0%     1.3%     7.2%     0.0%
1x0      18.3%    0.0%     0.0%     0.0%     7.0%     11.3%
1x1      10.1%    0.0%     0.0%     0.0%     7.2%     2.9%
2x0      20.0%    7.2%     1.5%     0.0%     0.0%     11.3%
2x1      16.1%    7.4%     0.0%     5.8%     0.0%     2.9%

A routing request is performed at least once when a packet arrives at a router and, depending on the NoC specification, can request as many times as necessary to [...]

[...] model, every router has five bidirectional ports to implement a 2D Torus topology with wraparound links at the edges of the network. The routing algorithm is a deadlock-free version of the well-known non-minimal west-first algorithm proposed in [21]. The total number of routing requests was 14410M.

Table 3 summarizes the simulation time at the Transaction Accurate Architecture level required to execute each of the 3 multimedia applications. Because the transaction accurate hardware/software architecture contains more details than the Virtual Architecture, its simulation is slower than the Virtual Architecture simulation, but it is more accurate.

Table 3. Transaction Accurate Architecture simulation results
Application                               Interconnect Component   Simulation time
MP3 Decoder (80KB MP3 audio file)         AMBA bus                 ~22s
MJPEG Decoder (10 frames QVGA YUV 444)    NoC Mesh                 ~5m10s
H.264 Encoder (10 frames QCIF YUV 420)    NoC Torus                ~28h10m

6. CONCLUSION

In this paper we presented a Simulink-based MPSoC design flow that allows gradual hardware, software and communication refinement. The flow starts with a high-level application model described in Simulink. During the refinement, the presented approach allows mixed hardware-software validation at different abstraction levels and opens up new facilities such as communication mapping exploration and network component refinement. The efficiency of the proposed flow has been illustrated with 3 case studies using the MP3, Motion JPEG and H.264 multimedia applications, executed on architectures with different interconnect topologies (AMBA, NoC Mesh, NoC Torus).

Biography
Katalin Popovici is currently a PhD student in Microelectronics at TIMA Laboratory, National Polytechnic Institute of Grenoble, France. She received the Computer

SCSC 2007 503 ISBN # 1-56555-316-0


Science Engineering Degree from Polytechnic University of Oradea, Romania in 2004. Her current research interests are architecture/application modeling and simulation for heterogeneous multiprocessor SoC design running multimedia applications.

Dr. Ahmed Amine Jerraya received the Engineer degree from the University of Tunis in 1980 and the D.E.A., "Docteur Ingénieur", and "Docteur d'Etat" degrees from the University of Grenoble, France in 1981, 1983, and 1989 respectively, all in computer sciences. In 1986, he held a full research position with the CNRS (Centre National de la Recherche Scientifique). From April 1990 to March 1991, he was a Member of the Scientific Staff at Nortel in Canada, working on linking system design tools and hardware design environments. He served as General Chair for DATE 2001. He also served as general chair or program chair of several international workshops and symposia. He organized several international courses including MPSoC (Multiprocessor SoC), the pluri-disciplinary summer school. He has published more than 227 papers in international conferences and journals, and 7 books. He received the Best Paper Award at the 1994 ED&TC for his work on Hardware/Software Co-simulation.
Dr. Jerraya is currently Head of the design programs at CEA-LETI, Minatec, Grenoble, France.

References
[1] H. Meyr “Application Specific Processors (ASIP): On design and implementation Efficiency”, Invited talk of SASIMI 06, Nagoya, Japan
[2] SystemC, http://www.systemc.org
[3] W.O. Cesario, D. Lyonnard, G. Nicolescu, Y. Paviot, S. Yoo, L. Gauthier, M. Diaz-Nava, A.A. Jerraya, “Multiprocessor SoC Platforms: A Component-Based Design Approach”, IEEE Design & Test of Computers, Vol. 19, Nov-Dec, 2002
[4] Coware, http://www.coware.com
[5] MathWorks Simulink, http://www.mathworks.com
[6] Y. Vanderperren and W. Dehaene, “From UML/SysML to Matlab/Simulink: Current State and Future Perspectives”, Proceedings of Design Automation and Test in Europe, DATE 2006, 6-10 March, Munich, Germany, 93-93
[7] Brian Etscheid, “Enabling the use of an embedded processor in a Simulink-based design flow”, PhD Thesis, UC-Berkeley, 2001
[8] D. Culler, J.P. Singh, A. Gupta “Parallel Computer Architecture: A Hardware/Software Approach”, Morgan Kaufmann, August 1998, ISBN 1558603433.
[9] P. van der Wolf, E. de Kock, T. Henriksson, W. Kruijtzer, G. Essink “Design and programming of embedded multiprocessor: an interface-centric approach”, Proceedings of CODES+ISSS 2004, Stockholm, Sweden, 206-217
[10] Y. Takagi, S. Honda, H. Tomiyama, H. Takada “Communication interfaces for System Level Design”, Proceedings of SASIMI 2006, April 3-4, Nagoya, Japan, 21-28
[11] A. Jerraya, A. Bouchhima, F. Petrot “Programming models and HW-SW Interfaces abstraction for Multi-Processor SoC”, Proceedings of DAC 2006, San Francisco, USA, 280-285
[12] K. Popovici et al. “Efficient Software Development Platforms for Multimedia Applications at Different Abstraction Levels”, Proceedings of the 18th IEEE/IFIP International Workshop on Rapid System Prototyping, RSP 2007, Porto Alegre, Brazil, 113-122
[13] J.A. Rowson “Hardware/Software cosimulation”, Proceedings of DAC 1994, San Diego, USA, 439-440
[14] L. Semeria, A. Ghosh “Methodology for hardware/software co-verification in C/C++”, Proceedings of ASPDAC 2000, Yokohama, Japan, 405-408
[15] MPI, http://www-unix.mcs.anl.gov/mpi
[16] E.A. de Kock et al. “Yapi: Application modeling for signal processing systems”, Proceedings of DAC 2000, USA, 402-405
[17] S.-I. Han et al. “Buffer memory optimization for video codec application modeled in Simulink”, Proceedings of DAC 2006, San Francisco, USA, 689-694
[18] G. Nicolescu, “Specification et validation des systemes heterogenes embarques”, PhD Thesis, TIMA Laboratory, 2002
[19] P. Paolucci et al. “SHAPES: a tiled scalable Software Hardware Architecture Platform for Embedded Systems”, Proceedings of CODES-ISSS 2006, 167-172
[20] F. Moraes et al. “HERMES: an Infrastructure for Low Area Overhead Packet-switching Networks on Chip Integration”, Integration, The VLSI Journal, v.38(1), 2004, 69-93
[21] C. Glass, L. Ni “The turn model for adaptive routing”, Proceedings of the 19th annual international symposium on Computer architecture, 1992, 278-287

