Igor Macedo Silva
Natal – RN
December 2018
To my family
ACKNOWLEDGEMENTS
I thank my parents Ivanildo and Gorete and my brother Ian for the tremendous support
they gave me throughout my undergrad course. If I am here today, it is because of you.
I thank my advisor Samuel for accepting me in his project and tutoring me towards
this final work. I also express my deep gratitude to my colleagues in the Cevero project,
especially Diego, Otávio and Kallil, who dedicated their time to review this document and
help me progress in the experiments.
Finally, I thank all the professors, colleagues and organizations I was a part of
during my undergrad course (IIN-ELS, Include Engenharia, Projeto SmartMetropolis and
Projeto Forró na UFRN), for they all contributed to the ensemble of knowledge I am today.
Thank you all!
“I suppose there are people who are so ‘lucky’ that they are not touched by phantoms and
are not troubled by fleeting memory and know not nostalgia and care not for the ache of
the past and are spared the feather-hit of the sweet, sweet pain of the lost, and I am sorry
for them - for to weep over what is gone is to have had something prove worth the
weeping.”
( Isaac Asimov, It’s Been a Good Life )
ABSTRACT
The ever-increasing demand for faster computers encouraged the development of a new
hardware paradigm in computing: multicore processors. This new approach suggested
that problems could be broken down and processed separately on each core, achieving the
end result in a fraction of the time. However, this technique brought along a number of
new problems, such as race conditions, scalability limitations and synchronization protocols,
among others. In an attempt to better understand this new paradigm, this work develops
and explores a multicore system with shared-memory access. The base processing core,
called Zero-riscy, was developed by the PULP Platform, a project by ETH Zürich. This
core implements the RISC-V instruction set architecture, which is used to develop a
sample assembly program so as to test the parallel programming capabilities of the multicore
system. The system is tested both in an online simulation environment and on FPGA
hardware to assert its memory access functionalities.
1 Module Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2 Clock and Core information signals settings . . . . . . . . . . . . . . . . . 29
3 Memory Model Instantiation . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4 RISC-V Assembly Program . . . . . . . . . . . . . . . . . . . . . . . . . . 34
5 Zero-riscy design file . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
6 Memory initialization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
7 Test procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
LIST OF ABBREVIATIONS AND
ACRONYMS
IF Instruction Fetch
1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.1 Rise of Parallel Computing . . . . . . . . . . . . . . . . . . . . . . . . 12
1.2 Parallel Computing Architecture . . . . . . . . . . . . . . . . . . . . . 13
1.2.1 Shared Memory and Interconnects . . . . . . . . . . . . . . . . . . . . . . 14
1.3 Motivation and Objectives . . . . . . . . . . . . . . . . . . . . . . . . 16
1.3.1 Specific Goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.4 Document Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3 ZERO-RISCY CORE . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.1 Microarchitecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.2 Core Main Signals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4 A MULTICORE SYSTEM . . . . . . . . . . . . . . . . . . . . . . . . 27
4.1 Development Environment . . . . . . . . . . . . . . . . . . . . . . . . 27
4.2 Cores configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.3 Memory module design . . . . . . . . . . . . . . . . . . . . . . . . . . 29
5 DESIGN VERIFICATION . . . . . . . . . . . . . . . . . . . . . . . . 33
5.1 Instruction Program Code . . . . . . . . . . . . . . . . . . . . . . . . . 33
5.2 Online Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
5.2.1 Online simulation results . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
5.3 Porting to FPGA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
5.3.1 Hardware emulation results . . . . . . . . . . . . . . . . . . . . . . . . . . 41
6 CONCLUSION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
1 INTRODUCTION
Parallel computing is now a very solid computation paradigm. However, this trend
started as an alternative to a few different restrictions in older computer organizations
and a growing problem with transistor technology. This chapter introduces how parallel
computing became a viable solution, presents a few architectural decisions and explains
the motivation behind this work.
In time, his predictions were treated as law by researchers and the press alike.
Later, Moore revisited his statements to fit what was observed in the industry: the rate
of growth began to slow and the number of transistors in a given chip was expected to
double every two years, as opposed to every year in his first prediction.
This trend was accompanied by an equally steady increase in transistor frequency. In
1971, the Intel 4004 processor was introduced with a clock frequency of 107 kHz. By
2006, Intel had launched its Pentium D processor with an astonishing 3.2 GHz clock
frequency, a more than ten-thousand-fold increase in less than forty years.
However, as time passed, this industry pattern began to slow down considerably.
On one hand, shrinking transistor size in order to fit more of them in the same space was
restrained by technological factors and even physical limitations imposed by the materials
themselves. On the other hand, increasing transistor frequency indefinitely is also
impractical. In order to increase clock frequency, one needs to increase the supply voltage
to the chip, as shown by Equation 1.1

fmax = k1 (V − Vth)^h / V        (1.1)
where fmax is the maximum possible frequency, V is the supply voltage, Vth is the
transistor threshold voltage, h is a constant dependent on the semiconductor technology
and k1 is a proportionality constant. Moreover, an increased supply voltage causes another
problem: a rise in dissipated power, given by the equation P = k2 A C V^2 f. The end
result is heat generated at a steeply growing rate, a problem that most personal computers
cannot withstand for prolonged periods. It thus became clear that manufacturers could
not keep shrinking transistors and raising clock frequencies in order to create better and
more capable processors. Moore's law was no longer viable.
Amidst those barriers, the world saw the surge of a new computational paradigm
during the first decade of the 21st century. It was the beginning of parallel computing.
Manufacturers realized they could replicate functional blocks in processors in order to
increase their throughput and achieve the same end goal without necessarily having to
commit to shrinking transistors and increased clock frequencies.
block to be available. This is called the von Neumann bottleneck (PACHECO, 2011, p.
15-17).
The Harvard architecture offers one simple solution: to distinguish between instruction
and data memory. Although this organization does not solve all problems between memory
and CPU, it allows the CPU to fetch instructions and data concurrently, saving several
clock cycles in execution (GANSSLE et al., 2007). Furthermore, with separate memories
and buses, the CPU can pipeline more than one instruction at a time, which leads to
increased performance because the critical path can be reduced and the processor may
attain higher clock frequencies.
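The benefit of concurrent instruction and data access can be illustrated with a toy cycle-count model. The Python sketch below is a deliberate oversimplification, assuming full serialization on a shared bus and perfect overlap on split buses; the numbers are illustrative only.

```python
# Toy cycle-count model contrasting a unified (von Neumann) memory with
# split (Harvard) instruction/data memories. Illustrative assumption:
# every access costs one cycle.

def cycles_unified(n_fetches, data_accesses):
    # One shared bus: instruction fetches and data accesses serialize.
    return n_fetches + data_accesses

def cycles_split(n_fetches, data_accesses):
    # Separate buses: a data access can overlap the next instruction fetch.
    return max(n_fetches, data_accesses)

print(cycles_unified(100, 40))  # -> 140
print(cycles_split(100, 40))    # -> 100
```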
Many improvements over the von Neumann and Harvard architectures, such as
caching and virtual memory, aimed to mitigate the memory connection bottleneck by
decreasing the time it takes to fetch data. Many others, such as pipelining and hardware
threads, aimed to execute as many instructions as possible and reduce the impact of
instruction stalls. All of them contributed incrementally to the complex engineering feat of a parallel
computer architecture.
[..] Although crossbars incur a higher wiring overhead than buses, they
allow multiple messages simultaneously to be in transit, thus increasing
the network bandwidth. Given this, crossbars serve as the basic switching
element within switched-media network routers. [...] A crossbar circuit
takes N inputs and connects each input to any of the M possible outputs.
[...] the circuit is organized as a grid of wires, with inputs on the left,
and outputs on the bottom. Each wire can be thought of as a bus with a
unique master, i.e., the associated input port (KIELMANN et al., 2011).
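The quoted behavior can be sketched as a toy model. The Python class below illustrates only the N × M switching idea, with distinct input-to-output routes in transit simultaneously; the class and method names are invented for this example and do not model any real router.

```python
# Minimal sketch of an N x M crossbar: each input can be routed to any
# output, and distinct input->output pairs may be in transit at once.

class Crossbar:
    def __init__(self, n_inputs, m_outputs):
        self.n, self.m = n_inputs, m_outputs
        self.routes = {}  # input port -> output port

    def connect(self, inp, out):
        # Each output is a bus with a unique master: refuse a second driver.
        if out in self.routes.values():
            raise ValueError("output %d already driven by another input" % out)
        self.routes[inp] = out

    def transfer(self, messages):
        """messages: {input port: payload}; returns {output port: payload}."""
        return {self.routes[i]: msg for i, msg in messages.items()}

xbar = Crossbar(2, 2)
xbar.connect(0, 1)
xbar.connect(1, 0)
# Both messages cross the switch in the same "cycle":
print(xbar.transfer({0: "core0 write", 1: "core1 read"}))
```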
there. It also shows the motivation and the main goals intended for this work.
After that, chapter 2, named The RISC-V Architecture, presents the RISC-V and
PULP projects, which serve as a basis for the work in development.
Chapter 3, called Zero-riscy Core Architecture, dives further into the PULP project
to explain the architecture of the Zero-riscy, a processor core that implements the RISC-V
ISA.
In sequence, chapter 4, Developing a Multicore Processor, explores the hidden
complexities in developing a multicore with two Zero-riscy instances with access to one
memory block.
Chapter 5 shows the execution of a program and its results in a simulation tool and
how to implement the FPGA hardware emulation.
Finally, chapter 6 summarizes the work and discusses the results achieved.
Before starting to code, one of the first decisions a general-purpose processor hardware
designer must make is which instruction set architecture (ISA) will be used. The instruction
set is the interface agreement between software and hardware. It defines how bits of
software code should be organized, so that hardware will know how to interpret and
process information.
There are many well-known, successful ISAs on the market. To list a few, x86 and
ARM are probably the most widespread instruction sets available today, which implies
tremendous hardware and software support by market and developers. However, a major
issue with most ISAs is their licensing. Commercial ISAs are proprietary, and any individual
or institution that aspires to develop a product based on their specifications must have a
license, which generally involves negotiating terms and costs of up to 10 million dollars.
The alternative is to use open-source ISAs, which have very permissive licenses but
not yet broad acceptance from companies and developers. This scene is starting to
change with a relatively new architecture that has gathered attention and support from
both academia and market: the RISC-V instruction set.
This instruction set was developed by Krste Asanovic, Andrew Waterman, and
Yunsup Lee as an alternative to other proprietary ISAs, aiming to warm up the open
hardware initiative. The motivation to create a free instruction set is very well explained in
"Instruction Sets Should Be Free: The Case For RISC-V" by Asanović and Patterson. They
explain that, although instruction set architectures are historically proprietary, there is no
technical reason for such a lack of open-source ISAs, for "neither do companies exclusively
have the experience to design a competent ISA", nor are their ISAs the most "wonderful
This feature of RISC-V can be very well observed in its base instruction set and
standard extensions definition (WATERMAN et al., 2012). As Table 1 shows, there are
four base instruction sets, based on the width of their address space and register count:
two with a 32-bit address space, one with 64-bit and another one with 128-bit address
space. This last 128-bit base instruction set is a precaution against the ever evolving
need for memory space. Although current Internet of Things (IoT) and mobile devices
are completely covered by a 32-bit or 64-bit address space, there is also the case of
Warehouse-Scale Computers (WSCs). Even if, currently, WSCs still do not need a 128-bit
address space, it is reasonable to assume that, in years to come, a 64-bit address
space will not suffice for the tremendous amount of data such computers have to handle. So,
considering that address size is a hard-to-recover problem in ISAs, it is prudent to
account for a possibly bigger address space now (ASANOVIĆ; PATTERSON, 2014).
Besides its base instruction sets, the RISC-V architecture also defines a number of
standard extensions. Table 1 gives an explanation of each one of
them. The purpose of these extensions is to give the ISA the flexibility it needs in order
to address demands from resource-constrained low-end hardware implementations and
high-performance ones alike. Take for example the A and D standard extensions. On
one hand, a microcontroller designed for IoT applications most probably will not need to
perform atomic or double-precision floating-point operations. On the other hand, a WSC
definitely needs atomic operations to assure parallel instructions are executed correctly and
double-precision floating-point operations so as to guarantee the credibility of its results.
Therefore, with its modular configuration, RISC-V presents itself as a good alternative for
a variety of applications.
[Table 1 – Source: RISC-V]
The organization of instruction encoding is also taken into account, as in other RISC
architectures. RISC-V has fixed 32-bit instructions (with the exception of its compressed
standard extension) that should be naturally aligned in memory. It has 6 instruction formats,
4 of which are shown in Figure 2. These four instruction formats enclose the most commonly
used instructions and thus are a sensible design choice in the ISA. The encoding of each
instruction is highly regular. The designers were very careful to maintain consistency
between formats, creating fixed locations for operands and opcode. This strategy allows
the register fetch operation to proceed in parallel with instruction decoding, improving a
critical path for many hardware implementations (WATERMAN, 2016, p. 17-18).
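This encoding regularity can be illustrated with a small decoder sketch. The field positions below follow the published RV32I R-type layout; the helper function itself is invented for this example.

```python
# The fixed field positions of the RISC-V base encoding let a decoder pull
# out register indices before it even knows which instruction it has.
# R-type layout (RV32I):
#   funct7[31:25] rs2[24:20] rs1[19:15] funct3[14:12] rd[11:7] opcode[6:0]

def decode_rtype(word):
    return {
        "opcode": word & 0x7F,
        "rd":     (word >> 7)  & 0x1F,
        "funct3": (word >> 12) & 0x7,
        "rs1":    (word >> 15) & 0x1F,
        "rs2":    (word >> 20) & 0x1F,
        "funct7": (word >> 25) & 0x7F,
    }

# 0x002081B3 encodes "add x3, x1, x2":
fields = decode_rtype(0x002081B3)
print(fields["rd"], fields["rs1"], fields["rs2"])  # -> 3 1 2
```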
It started as a joint effort between ETH Zürich and the University of Bologna to explore
new architectures for ultra-low-power microprocessors and meet the demand for such
devices in IoT applications (PULP PLATFORM, 2018a). However, the support and open
nature of the RISC-V ISA and the PULP Platform helped this project
to aim higher. Nowadays, the PULP Platform has a number of cores, microcontrollers and
processors that encompass both IoT and High-Performance Computing (HPC) applications.
Figure 3 shows an updated graph illustrating the whole PULP family.
The PULP Platform has very simple cores as Zero-riscy, "an area-optimized 2-stage
32-bit core", and complex cores as Ariane, that presents "a 6-stage, single issue, in-order
64-bit CPU which fully implements I, M, C and D extensions" and other features (PULP
PLATFORM, 2018b). These cores may be used in diverse platforms, from single-core
IoT-focused systems to multi-cluster HPC-focused ones, as shown in Figure 3. As a result,
there are a number of choices for anyone who intends to use the PULP Platform in a
research project.
3 ZERO-RISCY CORE
There are many possible choices among the cores within the PULP project. Therefore,
a researcher has to consider their needs and find a suitable contender. For the purpose
of this document, which is to develop a multicore system with shared memory, the
Zero-riscy is a good starting point. This core implements the 32-bit base RISC-V instruction
set and two standard extensions, M and C, which allow for multiplication/division
operations and compressed instruction encoding, respectively.
In order to use this core in the proposed multicore system, one must first understand
its microarchitecture and how to control the core signals so as to achieve the desired
behavior. This chapter explores the inner workings of the Zero-riscy core and how external
hardware may communicate with it.
3.1 Microarchitecture
The Zero-riscy is an area-optimized core, focused on low energy consumption. As
a consequence, it is possible to see how these decisions affected its internal structure. The core
has 2 pipeline stages: Instruction Fetch (IF) and Instruction Decode and Execution (IDE).
Figure 4 shows the microarchitecture of the Zero-riscy.
During IF, as the name says, the core fetches instruction data from memory through
its instruction memory interface. The main block is the prefetch buffer, which uses the
memory protocol (explained in depth in section 3.2) to access memory and store instructions
in a FIFO if the next stage is not yet ready. In the IF stage, the prefetch buffer handles
any compressed instructions, along with the instruction address and the program counter
(which might differ from the expected one due to misalignment caused by compressed
instructions).
After that comes the IDE stage. This stage reads the operands from the register
file (in Figure 4, the GPR (General Purpose Register) block), prepares them for execution
in the ALU or multiplier units and executes the instruction. The register file is a 2-read-
1-write unit that may be either latch based or flip-flop based, depending on the target
(ASIC or FPGA).
The ALU is also a result of the area optimization. It contains only the necessary
blocks to implement the base instruction set and the multiplication/division standard
extension. Figure 5 shows it contains an adder, a shifter and a logic unit. The adder selects
its operands from a multiplexer that defines the output based on a control signal, whereas
both the shifter and the logic unit always compute their results from values provided by the
decoder or the register file. The end result is selected by another multiplexer that receives
all three computations.
At last, the Control and Status Registers (CSR block) holds information used by
the RISC-V privileged instructions, in exceptions, interrupts and performance metrics;
and the Load and Store Unit (LSU) is responsible for loading and storing memory data.
same cycle as the request was sent or any number of cycles later. After
a grant was received, the address may be changed in the next cycle by
the LSU. In addition, the data_wdata_o, data_we_o and data_be_o
signals may be changed as it is assumed that the memory has already
processed and stored that information. After receiving a grant, the memory
answers with a data_rvalid_i set high if data_rdata_i is valid. This may
happen one or more cycles after the grant has been received. Note that
data_rvalid_i must also be set when a write was performed, although the
data_rdata_i has no meaning in this case. (SCHIAVONE, 2017).
Knowledge of this memory access protocol is vital for the design of the shared-memory
multicore system in this document. Chapter 4 will go through how the design
process occurred and its final architecture.
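As an illustration, the request/grant/rvalid handshake described in the quote above can be modeled cycle by cycle. The Python sketch below assumes a memory that always grants in the same cycle and answers one cycle later, which is only one of the timings the protocol permits; the class is invented for this example.

```python
# Behavioral sketch of the Zero-riscy memory handshake: a request is
# granted (here, in the same cycle), and the read data arrives with
# rvalid asserted in a later cycle.

class SimpleMemory:
    def __init__(self, words):
        self.mem = dict(words)
        self.pending = None  # address granted, awaiting rvalid

    def cycle(self, req=False, addr=None):
        """Advance one clock cycle; returns (gnt, rvalid, rdata)."""
        rvalid, rdata = False, None
        if self.pending is not None:       # answer last cycle's grant
            rvalid, rdata = True, self.mem[self.pending]
            self.pending = None
        gnt = False
        if req:                            # grant in the same cycle (an
            gnt = True                     # assumption; may also be later)
            self.pending = addr
        return gnt, rvalid, rdata

mem = SimpleMemory({0x10: 0xDEADBEEF})
print(mem.cycle(req=True, addr=0x10))  # -> (True, False, None)
print(mem.cycle())                     # data returns one cycle later
```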
4 A MULTICORE SYSTEM
simulation tools is that each tool has its own algorithm to resolve this execution order,
and having a hardware design tested in more than one tool helps create a more robust
design.
As mentioned in section 3.2, there are a few signals which have to be set in order
to start both cores. Listing 2 shows how the clock enable signals are set to one, test
enable is set to zero and the Core ID and Cluster ID signals are set to identify each
core independently. Later, in the core instantiation, each signal is assigned to its respective
input port in the module interface. Note the information these signals provide
to each core. As mentioned in section 3.2, core_id_i and cluster_id_i in Listing 2, lines
4, 5, 11 and 12, store information regarding the cores' identification. This information
is stored in a special register within the Control and Status Register (CSR) bank. For these
signals in particular, a 32-bit register holds both values in reserved sections of the register.
This information can later be retrieved with CSR access instructions.
8   // Core 2 Signals
9   logic clock_en_i_2 = 1;  // enable clock, otherwise it is gated
10  logic test_en_i_2 = 0;   // enable all clock gates for testing
11  logic [3:0]  core_id_i_2 = 1;
12  logic [5:0]  cluster_id_i_2 = 1;
13  logic [31:0] boot_addr_i_2 = 0;
Besides those signals, all of the other module interface signals were created with an
identification tag (ending in _1 or _2) in order to identify each core's signals. For now,
most of the signal variables do not need to hold any specific value, since they will be
provided later by other modules or during the testbench execution procedure.
mechanism that routes the signal to where it is intended, but it can manage multiple
signals at the same time.
In this project there are only two cores and one memory block for both of them. The
complexity of connecting three components should be kept as low as possible. Therefore, a
crossbar is not a sensible design choice, for it adds unnecessary complexity to a system that
does not have many components. A viable and simple solution is to use a bus interconnect
between the cores and the memory model.
Nevertheless, connecting two cores in a shared-memory system with a single bus
interconnect still has its problems. Both cores have to share the bus to write and read
from memory, alternating each access, which might make the system slower.
The proposed solution is to use a dedicated memory bus interconnect for each core,
as in Figure 9. In this case, this decision is only possible due to the small number of cores
and the fact that the memory model is also designed by the author. With this architectural
choice, the system can perform more efficiently, increasing communication bandwidth.
Both cores may access the memory for read or write operations at exactly the same
time. The memory module interface reflects this decision: the
end result is simply a duplication of input and output variables to allow for simultaneous
access, and the same happens within the module definition. Figure 8 shows a graphical
representation of the memory module.
However, another possible problem arises. Consider the case when both cores try
to access the same memory address at the same time, for reading or writing operations. If
both cores try to read the same memory address, both modules will be able to perform
this action and acquire the information, which is beneficial for the whole system. Yet, in
the case when both cores try to write data to the same memory address or even one core
reads while the other writes to the same memory address, there are conflicting operations.
In both cases, the parallel nature of hardware and HDL simulation does not guarantee
which operation will be executed first. Note, however, that this may also happen in a
shared bus interconnect. If cores try to operate in the same memory address, the bus will
allow one operation at a time, but there is no guarantee on the order of execution. In such
cases, the solution is expected to come from software rather than hardware. Thus, a
duplicated memory bus is still a very compelling choice for the proposed system architecture.
Unfortunately, it does not solve common interconnect problems, but it greatly improves
communication bandwidth by providing truly simultaneous access to memory.
As discussed previously, the race condition that happens when two or more cores
try to access the same memory address affects both buses and crossbars. This is in
most cases a software problem; however, hardware can also be used to avoid it.
Atomic instructions are used to prevent more than one simultaneous access to memory.
They are implemented in such a way that, during the same data transfer, the core may
both read and write to memory. By this implementation, an atomic instruction appears
as a single operation to all other cores and peripherals, and it solves the race condition
by asserting that no other core may write to the same place at the same moment, changing
the data to be processed. This is indeed a very good solution but, although the RISC-V
instruction set supports such instructions, the Zero-riscy does not implement this standard
extension, which brings the responsibility for dealing with memory access problems back
to software.
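The lost-update problem that atomic read-modify-write prevents can be shown deterministically by interleaving a non-atomic update by hand. The sketch below is illustrative only; the values are arbitrary.

```python
# Deterministic illustration of the race described above: two "cores"
# interleave a non-atomic read-modify-write of the same memory word,
# and one update is lost.

memory = {"sum": 0}

core0_read = memory["sum"]          # core 0 reads 0
core1_read = memory["sum"]          # core 1 also reads 0 (before core 0 writes)
memory["sum"] = core0_read + 10     # core 0 writes 10
memory["sum"] = core1_read + 18     # core 1 overwrites: core 0's update is lost

print(memory["sum"])  # -> 18, not the expected 28
```

An atomic read-modify-write would force the second core's read to happen only after the first core's write, making the lost update impossible.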
5 DESIGN VERIFICATION
As with any engineering project, the design phase is just a fraction of the whole
development cycle. A test phase is always needed to verify design behavior. This chapter
explores two simulation methods used to ascertain the hardware design and the development
of a parallel program to stress both cores at the same time.
1. The program shall be easily parallelizable, for the purpose was not to study the
parallel program, but the hardware.
2. The program shall have few instructions, for writing an assembly program can get
really complex for simple procedures and converting them to binary code might be a
necessity.
3. The program shall deal only with simple integer operations such as add and subtract,
for the core does not support floating-point operations and multiplication/division
are part of a standard extension.
With these requirements in mind, the chosen problem was to sum the numbers
from 1 to N, the sum of the first N natural numbers. This problem has a rather simple and
known solution seen in Equation 5.1. However, our implementation will deal specifically
with the left side of this equation, summing all the parts from 1 to N.
sum_{k=1}^{n} k = n(n + 1)/2        (5.1)
It was chosen because it only involves add operations and can be simply solved
with a partial sum loop procedure, which is itself very easily parallelizable. To solve this
problem, the following parallel RISC-V assembly program was written. Refer to Listing 4.
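As a reference model for the assembly program, the strategy can be sketched in Python: split the range between two cores, accumulate each partial sum with the same loop structure as the assembly version, and check the combined result against the closed form of Equation 5.1. The split point below (1–4 and 5–7 for N = 7) is consistent with the results reported later in this chapter, but any split works.

```python
# Reference model of the parallel partial-sum strategy. The helper name
# and the split point are chosen for illustration.

def partial_sum(first, last):
    total = 0
    k = first
    while k <= last:       # same loop shape as the assembly: branch, add,
        total += k         # increment, jump back
        k += 1
    return total

N = 7
core1 = partial_sum(1, 4)   # core 1's share of the range
core2 = partial_sum(5, N)   # core 2's share of the range

# Combined partials must match the closed form of Equation 5.1:
assert core1 + core2 == N * (N + 1) // 2
print(core1, core2, core1 + core2)  # -> 10 18 28
```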
The assembly program starts by fetching the core ID from the Control and Status
Registers. This information is passed via the module interface and stored in those registers
for later use. After storing the content of the whole register 0xF14 in register a0, a masked
AND operation is used to extract only the first four bits, which contain the core ID, to
the final register. In this case, the core ID is used to distinguish which core should execute
which part of the program and is used throughout the program.
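This masking step can be sketched outside the assembly. Only the 0xF mask (keeping the lowest four bits) is taken from the text; the example CSR values below are made up for illustration.

```python
# Sketch of the core-ID extraction the assembly performs: the CSR word is
# AND-masked so only its lowest four bits (the core ID) survive.

def extract_core_id(csr_value):
    return csr_value & 0xF   # keep bits [3:0] only

print(extract_core_id(0x00000001))  # -> 1 (core with ID 1)
print(extract_core_id(0x000003F0))  # -> 0 (upper fields masked away)
```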
39 inf :
40 bge zero , zero , inf # branch infinitely
Next, lines 4 and 5 set the initial values for the memory addresses the program uses:
one stores a flag indicating whether a core has already accessed memory,
and the other holds the final sum. Line 6 branches the code to the region each
core has to execute in order to calculate its partial sum.
In Listing 4, there are code labels called core1 and core2, in lines 8 and 18,
respectively. Each one marks the start of the block that shall be performed by each core
independently. Both blocks of code perform a partial sum that accumulates from the initial
value up to the last value, which sets the loop stop condition. The only difference is the
initial setup, in lines 9 through 11 and 19 through 21, which sets the values for the registers
that hold the counter, the partial sum and the maximum value for the loop. The
procedure in each loop starts by branching if the counter is greater than or equal to the
maximum value. If the counter is less than the maximum value, the program proceeds to the next
line and executes a sum between the current partial sum and the counter. Afterwards, the
counter is incremented by one and the program executes an unconditional jump to the
beginning of the loop, which resets the loop for another iteration. When the branch in line 13
or 23 finds the counter to be equal to or greater than the maximum value, the code branches to
the end label, finishing the procedure.
The end code label denotes the starting point of the conclusion procedure, where
each partial sum is added and stored in a specific memory address. In order to avoid
any race conditions in memory operations, the author decided that each core adheres
to a specific access order: the first core to access has core ID equal to zero, and the
second core to access has core ID equal to one. This order guarantees that reading memory,
fetching the partial sum from memory and adding it to the partial sum in the core
occur without any simultaneous access from another core. The instruction in line 29
loads a flag that signals which core ID has the
right to access memory at the moment. If the core ID in register a0 is different from the
flag value, the code performs an unconditional jump, re-fetching the flag and comparing
both values until they are equal. This procedure guarantees that the program only
proceeds when it is its turn to access memory. When flag and core ID are equal, the core
executes the next block of code.
The last code block loads the partial sum from memory, in line 33, and adds it
to the partial sum in the core's registers. Then, this value is stored again in the same
memory region that holds the partial sum. Now that the core has performed its partial
sum, it releases access for the other cores by updating the flag status. This flag holds the
value of the core ID that may access the memory at that time, and by increasing the flag
by one, as in line 36, the current core allows the next core ID (which was stuck in a
loop in lines 29, 30 and 31 up until this moment) to write to memory. By incrementing
and storing the flag value, the program creates a queue that all cores respect in order
to write to memory. The entire program flow is represented in Figure 10.
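The queue built around the flag can be modeled as a small simulation. The Python sketch below is a behavioral model of the software protocol only, with invented function names; it does not model the hardware.

```python
# Model of the software queue built around the flag word: each core spins
# until the flag equals its own ID, adds its partial sum to the shared
# total, then increments the flag to pass the turn to the next core.

def run_cores(partials):
    """partials[i] is the partial sum computed by core with ID i."""
    shared = {"flag": 0, "total": 0}
    done = [False] * len(partials)
    while not all(done):
        for core_id, value in enumerate(partials):
            if done[core_id] or shared["flag"] != core_id:
                continue               # not this core's turn: keep spinning
            shared["total"] += value   # load, add partial sum, store
            shared["flag"] += 1        # release the turn to the next core
            done[core_id] = True
    return shared["total"]

print(run_cores([10, 18]))  # -> 28
```

Because a core only touches the shared total while the flag holds its own ID, the read-modify-write sequences can never interleave, which is exactly the guarantee the assembly protocol provides.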
After designing the RISC-V assembly program, it had to be converted to
binary code in order to make it ready for upload to memory. The Ripes program
(RIPES, 2018) is just the tool needed for this task. It is a program that receives
assembly code and executes the instructions in a simulated pipelined processor, suitable
for educational purposes. Opportunely, this software allows access to the binary code
generated from the assembly code and, in possession of the binary text file, the parallel
program was ready for memory upload and execution.
The testbench file is responsible for instantiating all necessary modules (cores and
memories), connecting them, and executing the parallel program. Chapter 4 explained
how the modules were linked with each other; this section shows how the parallel
program is uploaded to memory and executed.
Both Verilog and SystemVerilog have a structure called an initial block. This type
of code structure has a very specific purpose: the initial block is executed once, at the
beginning of the simulation. In the case of the design file, the initial block loads the
program into memory, as in Listing 6. With a for loop, it loads a
high-impedance signal to all memory addresses and then reads the
text file that contains the binary parallel program created in section 5.1 and uploads it
to memory.
In the testbench file, the initial block is responsible for setting up and changing any
signals the multicore system needs to execute. In this case, this block is responsible for
initiating and toggling the clock signal, as in lines 1 and 2 of Listing 7. Next, the
initial block sets the dump variable and the name of the VCD file, which contains the
waveforms.
4  initial begin
5    $dumpfile("wave.vcd");
6    $dumpvars(0, zeroriscy_tb);
7
8    rst_ni = 0;
9    fetch_enable_i_1 = 1;
10   fetch_enable_i_2 = 1;
11   #1; rst_ni = 1; #1;
12
13   #130;
14   $finish;
15 end
Finally, the initial block drives the active-low reset signal to reset both cores simultaneously, and then asserts each fetch_enable_i signal to start instruction fetching, in lines 9 and 10. The last statement, in line 13, simply keeps the simulation running for 130 time units so that the whole program executes. After that, the result can be seen in the wave.vcd file declared at the beginning of the initial block.
The waveform in Figure 11 shows the ALU operands a and b and their result, as alu_operand_a_ex, alu_operand_b_ex, and alu_adder_result_ex, respectively. The first three signals from top to bottom belong to core 1 and the last three to core 2. With these signals, one can follow both core loops and the final result of each partial sum in the parallel code.
Each loop iteration in Figure 11 is marked with a color for better visualization. The second clock cycle of each loop iteration is marked with a white stroke box to indicate the cycle in which the partial sum is executed, that is, the moment when the previous sum is added to the current counter within the loop iteration. As expected, core 1 performs the sums 0 + 0 = 0, 0 + 1 = 1, 1 + 2 = 3, 3 + 3 = 6, and then 6 + 4 = A (in hexadecimal representation), while core 2 performs 0 + 5 = 5, 5 + 6 = B, and B + 7 = 12.
After all partial sums are executed, both cores prepare for memory access, each one taking its turn to add and write its partial sum to the specified memory address. During time unit 100, the first core adds its partial sum to the one it has just fetched from memory. This operation is marked with a dashed red box in Figure 11 and reads A + 0 = A. The next partial-sum aggregation happens at time unit 132, as shown in Figure 11. It is also marked with a dashed red box and displays the operation 12 + A = 1C. This last operation adds the first core's partial sum (stored in memory) to the second core's partial sum. The result is then stored in memory and the program enters the final infinite loop.
cycles. The signal value is converted to hexadecimal representation and displayed on the available seven-segment LED displays. Lastly, the zeroriscy_soc module includes all the components mentioned in chapter 4, the cores and memory blocks. The top-level architecture is shown in Figure 12.
Besides these modifications to the organization of the project, a few files were altered so that the core could be compiled for the FPGA board. The reason is that the original project contained assert statements and other non-synthesizable SystemVerilog constructs. After removing these statements, the project was ready to be used in the FPGA IDE.
The LED displays and the switches are the only input and output interfaces available to verify the correct execution of the code.
In order to start the execution of the code in memory, switch 0, which holds the reset signal, was turned to logical low and back to logical high. This action resets the cores and prepares them for start-up, since the reset signal is active low. After that, both fetch enable signals are set to active high and the cores start requesting instruction data.
The four leftmost LED displays on the board, grouped in pairs, represent the ALU result from core 0 and core 1, respectively. As each core executes instructions, the observer can see these values change each cycle. Throughout the experiment, the author followed the LED display output to confirm that the values of the ALU adder result signal in each clock cycle were equal to those in the simulation, thus reinforcing the accuracy of the hardware design.
The correct result was confirmed by the LED display on the far right, labeled Total Sum Result in Figure 13. It shows 1C as the result, just as expected from the simulation in EDA Playground. To its left is the Memory Access flag, which starts at zero, turns to one, and then to two after core 1 has written its value to memory.
6 CONCLUSION
the workload and accessed the same memory region in order to aggregate its partial sum to the result. This solution, however simple, reveals intricacies that are usually not exposed to the final consumer but are essential for the core to correctly perform its job. By exposing these hidden complexities, this work aids the development of new parallel hardware designs.
BIBLIOGRAPHY
ASANOVIĆ, K.; PATTERSON, D. A. Instruction Sets Should Be Free: The Case For RISC-V. [S.l.], 2014. Available at: <http://www2.eecs.berkeley.edu/Pubs/TechRpts/2014/EECS-2014-146.html>.
GANSSLE, J. et al. Embedded Hardware: Know It All. Elsevier Science, 2007. (Newnes Know It All). ISBN 9780080560748. Available at: <https://books.google.com.br/books?id=HLpTtLjEXqcC>.
RISC-V. RISC-V Cores and SoC Overview. 2018. Available at: <https://riscv.org/risc-v-cores/>. Accessed: 1 December 2018.
ROSSI, D. et al. PULP: A parallel ultra low power platform for next generation IoT applications. 2015 IEEE Hot Chips 27 Symposium, HCS 2015, 2016.
SCHIAVONE, P. D. et al. Slow and steady wins the race? A comparison of ultra-low-power RISC-V cores for internet-of-things applications. 2017 27th International Symposium on Power and Timing Modeling, Optimization and Simulation, PATMOS 2017, v. 2017-Janua, p. 1–8, 2017.
WATERMAN, A. et al. The RISC-V Instruction Set Manual v2.1. 2012 IEEE International Conference on Industrial Technology, ICIT 2012, Proceedings, I, p. 1–32, 2012.