
Igor Macedo Silva

Multicore System with Shared Memory

Natal – RN
December 2018
Igor Macedo Silva

Multicore System with Shared Memory

Undergraduate thesis (Trabalho de Conclusão de Curso) in Computer Engineering at the Universidade Federal do Rio Grande do Norte, presented as a partial requirement for the degree of Bachelor in Computer Engineering.

Advisor: Samuel Xavier de Souza

Universidade Federal do Rio Grande do Norte – UFRN


Departamento de Engenharia de Computação e Automação – DCA
Curso de Engenharia de Computação

Natal – RN
December 2018
Igor Macedo Silva

Multicore System with Shared Memory

Undergraduate thesis (Trabalho de Conclusão de Curso) in Computer Engineering at the Universidade Federal do Rio Grande do Norte, presented as a partial requirement for the degree of Bachelor in Computer Engineering.

Advisor: Samuel Xavier de Souza

Thesis approved. Natal – RN, December 13, 2018:

Prof. Dr. Samuel Xavier de Souza - Advisor


UFRN

Prof. Dr. Luiz Felipe de Queiroz Silveira - Examiner


UFRN

Prof. MSc. Diego Vinicius Cirilo do Nascimento - Examiner


IFRN

Natal – RN
December 2018
To my family
ACKNOWLEDGEMENTS

I thank my parents Ivanildo and Gorete and my brother Ian for the tremendous support they gave me throughout my undergraduate course. If I am here today, it is because of you.
I thank my advisor Samuel for accepting me into his project and tutoring me towards this final work. I also express my deep gratitude to my colleagues in the Cevero project, especially Diego, Otávio and Kallil, who dedicated their time to reviewing this document and helping me progress with the experiments.
Finally, I thank all the professors, colleagues and organizations I was a part of during my undergraduate course (IIN-ELS, Include Engenharia, Projeto SmartMetropolis and Projeto Forró na UFRN), for they all contributed to the ensemble of knowledge I am today.
Thank you all!
“I suppose there are people who are so ‘lucky’ that they are not touched by phantoms and
are not troubled by fleeting memory and know not nostalgia and care not for the ache of
the past and are spared the feather-hit of the sweet, sweet pain of the lost, and I am sorry
for them - for to weep over what is gone is to have had something prove worth the
weeping."
( Isaac Asimov, It’s Been a Good Life )
ABSTRACT
The ever-increasing demand for faster computers encouraged the development of a new hardware paradigm in computing: multicore processors. This new approach suggested that problems could be broken apart and processed separately on each core, achieving the end result in a fraction of the time. However, this technique brought along a number of new problems, such as race conditions, scalability limitations, and synchronization protocols, among others. In an attempt to better understand this new paradigm, this work develops and explores a multicore system with shared-memory access. The base processing core, called Zero-riscy, was developed by the PULP Platform, a project by ETH Zurich. This core implements the RISC-V instruction set architecture, which is used to develop a sample assembly program so as to test the parallel programming capabilities of the multicore system. The system is tested both in an online simulation environment and in FPGA hardware to assert its memory access functionalities.

Keywords: Multicore. Parallel Computing. Shared-memory. RISC-V.


LIST OF FIGURES

Figure 1 – von Neumann and Harvard Architecture schematic
Figure 2 – Basic RISC-V instruction formats
Figure 3 – The PULP Family
Figure 4 – Simplified block diagram of Zero-riscy
Figure 5 – Simplified block diagram of the Zero-riscy ALU
Figure 6 – Memory protocol time diagram
Figure 7 – EDA Playground Interface
Figure 8 – Memory module
Figure 9 – Memory and Cores Interconnect
Figure 10 – Flow Diagram of the parallel program
Figure 11 – Simulation waveform
Figure 12 – Top level architecture
Figure 13 – FPGA Emulation result
LIST OF TABLES

Table 1 – RISC-V base instruction sets and extensions
Table 2 – LSU Signals
Table 3 – FPGA and Project Specifications
Table 4 – Flow Elapsed Time
Table 5 – Quartus Fitter Result
LISTINGS

Listing 1 – Module Parameters
Listing 2 – Clock and Core information signals settings
Listing 3 – Memory Model Instantiation
Listing 4 – RISC-V Assembly Program
Listing 5 – Zero-riscy design file
Listing 6 – Memory initialization
Listing 7 – Test procedure
LIST OF ABBREVIATIONS AND
ACRONYMS

CPU Central Processing Unit

UMA Uniform Memory Access

NUMA Non-Uniform Memory Access

ISA Instruction Set Architecture

BSD Berkeley Software Distribution

GPU Graphics Processing Unit

DSP Digital Signal Processor

RISC Reduced Instruction Set Computer

IoT Internet of Things

WSC Warehouse-Scale Computers

PULP Parallel Ultra Low Power

NTC Near-Threshold Computing

HPC High Performance Computing

IF Instruction Fetch

IDE Instruction Decode and Execution

FIFO First In First Out

ALU Arithmetic Logic Unit

ASIC Application Specific Integrated Circuits

FPGA Field Programmable Gate Array

HDL Hardware Description Language

LED Light Emitting Diode


CONTENTS

1 INTRODUCTION
1.1 Rise of Parallel Computing
1.2 Parallel Computing Architecture
1.2.1 Shared Memory and Interconnects
1.3 Motivation and Objectives
1.3.1 Specific Goals
1.4 Document Structure

2 THE RISC-V ARCHITECTURE
2.1 What is RISC-V?
2.2 The PULP Project

3 ZERO-RISCY CORE
3.1 Microarchitecture
3.2 Core Main Signals

4 A MULTICORE SYSTEM
4.1 Development Environment
4.2 Cores configuration
4.3 Memory module design

5 DESIGN VERIFICATION
5.1 Instruction Program Code
5.2 Online Simulation
5.2.1 Online simulation results
5.3 Porting to FPGA
5.3.1 Hardware emulation results

6 CONCLUSION

BIBLIOGRAPHY

1 INTRODUCTION

Parallel computing is now a very solid computation paradigm. However, this trend started as an alternative to a few different restrictions in older computer organizations and a growing problem with transistor technology. This chapter introduces how parallel computing became a viable solution, a few architectural decisions, and the motivation for this work.

1.1 Rise of Parallel Computing


During the end of the 20th century and the beginning of the 21st century, Moore's law was the best way to predict the rate of improvement of each processor generation. Moore made his first speculation about this theme in 1965, for a special edition of Electronics magazine. He said:

The complexity for minimum component costs has increased at a rate of roughly a factor of two per year [...]. Certainly over the short term this rate can be expected to continue, if not to increase. Over the longer term, the rate of increase is a bit more uncertain, although there is no reason to believe it will not remain nearly constant for at least 10 years. That means by 1975, the number of components per integrated circuit for minimum cost will be 65,000. I believe that such a large circuit can be built on a single wafer. (MOORE, 1965).

After a while, his predictions were treated as law by researchers and the press alike. Later, Moore revisited his statements to fit what was observed in the industry: the rate of growth began to slow, and the number of transistors in a given chip was expected to double every two years, as opposed to every year in his first prediction.
This trend was accompanied by ever-increasing transistor frequencies. In 1971, the Intel 4004 processor was introduced with a clock frequency of 107 kHz. By 2006, Intel had launched its Pentium D processor with an astonishing 3.2 GHz clock frequency, a roughly thirty-thousand-fold increase in less than forty years.
However, as time passed, this industry pattern began to slow down considerably. On one hand, shrinking transistor size in order to fit more of them in the same space was restrained by technology factors and even physical limitations imposed by the materials themselves. On the other hand, increasing transistor frequency indefinitely is also impractical. In order to increase the clock frequency, one needs to increase the supply voltage to the
chip, as shown by Equation 1.1

\[ f_{max} = k_1 \frac{(V - V_{th})^h}{V} \tag{1.1} \]
where $f_{max}$ is the maximum possible frequency, $V$ is the supply voltage, $V_{th}$ is the transistor threshold voltage, $h$ is a constant dependent on the semiconductor technology and $k_1$ is a proportionality constant. Moreover, an increased supply voltage causes another problem: a rise in dissipated power, given by the equation $P = k_2 A C V^2 f$. The end result is heat generation that grows rapidly with voltage and frequency, a problem that most personal computers cannot withstand for prolonged times. Thus, it became clear that manufacturers could not continue to decrease transistor size and increase clock frequency in order to create better and more capable processors. Moore's law was no longer viable.
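To put illustrative numbers on this (an arbitrary example, not measured data): since $P = k_2 A C V^2 f$, if raising the clock frequency by 30% requires raising the supply voltage by 20%, the dissipated power grows by a factor of $1.2^2 \times 1.3 \approx 1.87$; a modest performance gain nearly doubles the heat to be removed.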
Amidst those barriers, the world saw the surge of a new computational paradigm during the first decade of the 21st century. It was the beginning of parallel computing. Manufacturers realized they could replicate functional blocks in processors in order to increase their throughput and achieve the same end goal without having to rely on ever-shrinking transistors and increased clock frequencies.

1.2 Parallel Computing Architecture


Parallel hardware has outgrown the common sequential hardware researchers and enthusiasts used to know. In order to understand how this change occurred, let us analyze how most computers worked before parallel devices emerged. Two of the most classic computer architectures are the von Neumann architecture and the Harvard architecture. Consider Figure 1. The von Neumann computer organization defines a CPU (Central Processing Unit), a Memory Unit, an Input Device and an Output Device. The main focus here is the CPU. It is composed of a Control Unit, managing how and when each computer instruction should run, an Arithmetic Unit, executing each computer instruction, and Registers that store information. Yet, the most distinctive difference between the von Neumann architecture and the Harvard architecture is how the CPU and Memory Unit communicate, and what kind of information is stored in the Memory Unit.
As shown in Figure 1, the von Neumann architecture has only one Memory Unit and a single bus for communication. Therefore, each and every piece of data has to be stored in this memory block, which means that both computer instructions and work data are kept there. Such a connection may be troublesome, because CPUs are generally much faster at running instructions and requesting data than memory blocks are at retrieving stored information.
Commonly, an instruction may stall for several clock cycles waiting for a specific data block to become available. This is known as the von Neumann bottleneck (PACHECO, 2011, p. 15-17).

Figure 1 – von Neumann and Harvard Architecture schematic.

Source – Author
The Harvard architecture has one simple solution: distinguishing between instruction and data memory. Although this organization does not solve all problems between memory and CPU, it allows the CPU to fetch instructions and data concurrently, saving several clock cycles in execution (GANSSLE et al., 2007). Furthermore, with separate memories and buses, the CPU can pipeline more than one instruction at a time, which leads to increased performance because the critical path can be reduced and the processor may attain higher clock frequencies.
Many improvements over the von Neumann and Harvard architectures, such as
caching and virtual memory, aimed to mitigate the memory connection bottleneck by
decreasing the time it takes to fetch data. Many others, such as pipelining and hardware threads, aimed to execute as many instructions as possible and to reduce the impact of instruction stalls. All of them contributed incrementally to the complex engineering feat of a parallel
computer architecture.

1.2.1 Shared Memory and Interconnects


The problem of memory and CPU communication takes on a much larger proportion in parallel computing. Having two or more cores share the same memory block or other components requires management to avoid conflicting requests.


The simplest organization in this case would be to use some kind of interconnect and link each core directly to the memory block; this is called uniform memory access (UMA) because it takes the same time for all cores to access the memory block. Another solution is to have each core possess its own private memory block, allowing other cores to access this private memory through a communication protocol between cores rather than memory. That is called non-uniform memory access (NUMA) because accessing a core's own private memory takes less time than accessing data in another core's memory. UMA systems are generally easier to program, since the programmer does not need to worry about different access times for different memory locations. However, NUMA systems have the potential to hold larger amounts of memory because each core needs to manage only its own private memory block (PACHECO, 2011). Which is best is ultimately a design decision.
After choosing which memory organization to use, the hardware designer has one last decision to make: how to connect memory and processing cores. The interconnect is the main interface responsible for carrying signals from multiple cores to memory and, most importantly, managing which core has "control" of the interconnect to send its request.
There are two main interconnects used for this task, a bus and a crossbar. The simplest of them is the bus. It is typically a collection of wires shared by several different components that allows one sender at a time to communicate with all sharers of that medium. If the network of components needs scalability, the crossbar is a more suitable choice. It uses switched media to route signals and allows concurrent access by several components. The Encyclopedia of Parallel Computing has a definition for both of them.

A bus comprises a shared medium with connections to multiple entities. An interface circuit allows each of the entities either to place signals on the medium or sense (listen to) the signals already present on the medium. In a typical communication, one of the entities acquires ownership of the bus (the entity is now known as the bus master) and places signals on the bus. Every other entity senses these signals and, depending on the content of the signals, may choose to accept or discard them. [..] Buses often are collections of electrical wires, where each wire is typically organized as "data", "address", or "control". (KIELMANN et al., 2011).

[..] Although crossbars incur a higher wiring overhead than buses, they
allow multiple messages simultaneously to be in transit, thus increasing
the network bandwidth. Given this, crossbars serve as the basic switching
element within switched-media network routers. [...] A crossbar circuit
takes N inputs and connects each input to any of the M possible outputs.
[...] the circuit is organized as a grid of wires, with inputs on the left,
and outputs on the bottom. Each wire can be thought of as a bus with a
unique master, i.e., the associated input port (KIELMANN et al., 2011).

1.3 Motivation and Objectives


The surge of parallel computing has brought brand new concepts to the world of
computation. Researchers created new hardware constructs in order to address the von
Neumann bottleneck and keep processors evolving with each generation. A state-of-the-art processor design nowadays may take several years and many engineers to make sure everything works as intended. Understanding the multiple facets of parallel computing architecture is enough work for a lifetime.
This work takes a single step and tries to elucidate a topic that is fundamental to parallel computers today, especially those with a shared-memory architecture: how does one design a multicore processor with only one memory block? This question reveals some of the underlying architectural decisions made in order to fasten together two or more whole processors and make them work seamlessly with each other.
Besides, this work is intended as an initial move towards a not-so-conventional take on parallel computing. Most parallel platforms are developed to perform given instructions and respond with the intended result. In this sense, they are concerned with performance, executing as many instructions at a time as possible and rushing towards the next one. In other cases, however, parallel hardware might serve a problem that is concerned not with how many instructions execute, but with whether those instructions generated the correct result. Having multiple instances of the same core, or at least of a few functional blocks inside the core, that can execute the same instruction at the same time might help assert whether an instruction performed correctly or whether any error occurred during its execution.
With this in mind, this document is motivated by the investigation of a very fundamental concept underlying multicore processors and memory, yet it also lays the groundwork for future work that puts parallel computing to an unusual use.

1.3.1 Specific Goals


As its main goal, this work develops a multicore system based on an open-source instruction set architecture. This multicore system aggregates two core instances and a memory shared between both of them. This document elaborates on the process used to develop and test the system's features. The outcome is tested in a software simulation tool, as well as in a hardware prototype.

1.4 Document Structure


This work presents an introduction to the parallel computing theme, a historical take on the necessity of parallel computing, and how parallel hardware evolved from there. It also shows the motivation and the main goals intended for this work.
After that, chapter 2, named The RISC-V Architecture, presents the RISC-V and PULP projects, which serve as the basis for this work.
Chapter 3, called Zero-riscy Core, dives further into the PULP project to explain the architecture of Zero-riscy, a processor core that implements the RISC-V ISA.
In sequence, chapter 4, A Multicore System, explores the hidden complexities of developing a multicore system with two Zero-riscy instances sharing access to one memory block.
Chapter 5 shows the execution of a program and its results in a simulation tool, and how the FPGA hardware emulation was implemented.
Finally, chapter 6 summarizes the work and discusses the results achieved.

2 THE RISC-V ARCHITECTURE

Before starting to code, one of the first decisions a general-purpose processor designer must make is which instruction set architecture (ISA) to use. The instruction set is the interface agreement between software and hardware. It defines how the bits of software code should be organized, so that the hardware will know how to interpret and process information.
There are many well-known, successful ISAs on the market. To list a few, x86 and ARM are probably the most widespread instruction sets available today, which implies tremendous hardware and software support from the market and developers. However, the problem with most ISAs is their licensing. Commercial ISAs are proprietary, and any individual or institution that aspires to develop a product based on their specifications must have a license, which generally involves negotiating terms and costs of up to 10 million dollars.
The alternative is to use open source ISAs, which have very permissive licenses but not yet very good acceptance from companies and developers. This scene is starting to change with a relatively new architecture that has gathered attention and support from both academia and the market: the RISC-V instruction set.

2.1 What is RISC-V?


RISC-V is a general-purpose instruction set architecture. It is Berkeley Software Distribution (BSD) licensed, a very permissive free software license that imposes minimal barriers to replication, modification, and even commercial use and distribution. As Waterman noted, its designers desired to create an ISA that was simple, free and modular.

Leveraging three decades of hindsight, RISC-V builds and improves on the original Reduced Instruction Set Computer (RISC) architectures. The result is a clean, simple, and modular ISA that is well suited to low-power embedded systems and high-performance computers alike. (WATERMAN, 2016).

This instruction set was developed by Krste Asanović, Andrew Waterman, and Yunsup Lee as an alternative to proprietary ISAs, aiming to boost the open hardware initiative. The motivation to create a free instruction set is very well explained in "Instruction Sets Should Be Free: The Case For RISC-V" by Asanović and Patterson. They explain that, although instruction set architectures are historically proprietary, there is no technical reason for such a lack of open source ISAs, for "neither do companies exclusively have the experience to design a competent ISA", nor are their ISAs the most "wonderful ISAs" (ASANOVIĆ; PATTERSON, 2014). In fact, according to Asanović, industry could benefit from freely available open ISAs, just as it benefited from free open source software. Such an ISA could generate greater innovation, with shared open core designs and affordable processors for more devices.
Alongside its goal of creating a free and open ISA, RISC-V also tries to become a "universal" architecture through its highly modular setup. Most mobile devices nowadays, such as smartphones and tablets, are a combination of different processing units: a general multicore processor, Graphics Processing Units (GPUs), Digital Signal Processors (DSPs), power management cores, etc. Each of them might have its own instruction set and, commonly, it is a software problem to deal with each architecture in order to keep everything working. As designers create ever more heterogeneous platforms (which probably will keep happening), this complexity will continue to increase. A possible solution would be having all of them use variants of a single ISA, possibly reusing both hardware and software. RISC-V is the one ISA that aims to unify such different hardware in a modular interface that adapts to the needs of its users.

Table 1 – RISC-V base instruction sets and extensions

Base ISA    Instructions   Description
RV32I       47             32-bit address space and integer instructions
RV32E       47             Subset of RV32I, restricted to 16 registers
RV64I       59             64-bit address space and integer instructions, along with several 32-bit integer instructions
RV128I      71             128-bit address space and integer instructions, along with several 64- and 32-bit instructions

Extension   Instructions   Description
M           8              Integer multiply and divide
A           11             Atomic memory operations, load-reserve/store-conditional
F           26             Single-precision (32-bit) floating point
D           26             Double-precision (64-bit) floating point; requires F extension
Q           26             Quad-precision (128-bit) floating point; requires F and D extensions
C           46             Compressed integer instructions; reduces instruction size to 16 bits

Source – RISC-V

This feature of RISC-V can be very well observed in its base instruction sets and standard extensions definition (WATERMAN et al., 2012). As Table 1 shows, there are four base instruction sets, distinguished by the width of their address space and their register count: two with a 32-bit address space, one with a 64-bit and another one with a 128-bit address space. This last 128-bit base instruction set is a precaution against the ever-evolving need for memory space. Although current Internet of Things (IoT) and mobile devices are completely covered by a 32-bit or 64-bit address space, there is also the case of Warehouse-Scale Computers (WSC). Even if WSCs currently do not need a 128-bit address space, it is reasonable to assume that, in years to come, a 64-bit address space will not suffice for the tremendous amount of data such computers have to handle. So, considering that address size is one hard-to-recover problem with ISAs, it is prudent to account for a possibly bigger address space now (ASANOVIĆ; PATTERSON, 2014).
Besides its base instruction sets, the RISC-V architecture also defines a number of standard extensions. Table 1 gives an explanation of each one of them. The purpose of these extensions is to give the ISA the flexibility it needs to address demands from resource-constrained low-end hardware implementations and high-performance ones alike. Take for example the A and D standard extensions. On one hand, a microcontroller designed for IoT applications most probably will not need to perform atomic or double-precision floating-point operations. On the other hand, a WSC definitely needs atomic operations to ensure parallel instructions execute correctly, and double-precision floating-point operations to guarantee the credibility of its results. Therefore, with its modular configuration, RISC-V presents itself as a good alternative for a variety of applications.

Figure 2 – Basic RISC-V instruction formats.

Source – RISC-V

The encoding of each instruction is also carefully organized, as in other RISC architectures. RISC-V has fixed 32-bit instructions (with the exception of its compressed standard extension) that must be naturally aligned in memory. It has 6 instruction formats, 4 of which are shown in Figure 2. These four instruction formats enclose the most commonly used instructions and thus are a sensible design choice in the ISA. The encoding of each instruction is highly regular. Hardware designers were very careful to maintain consistency between formats, creating fixed locations for operands and opcode. This strategy allows the register fetch operation to proceed in parallel with instruction decoding, improving a critical path for many hardware implementations (WATERMAN, 2016, p. 17-18).
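As an illustration of this regularity, consider the R-type instruction add x3, x1, x2, which can be assembled by plain field concatenation. A small SystemVerilog sketch (field values follow the base ISA definition; the module itself is hypothetical, for demonstration only):

module encode_demo;
  // R-type layout: funct7 | rs2 | rs1 | funct3 | rd | opcode
  localparam logic [31:0] ADD_X3_X1_X2 =
      {7'b0000000,  // funct7 distinguishes ADD from SUB
       5'd2,        // rs2 = x2
       5'd1,        // rs1 = x1
       3'b000,      // funct3 for ADD
       5'd3,        // rd = x3
       7'b0110011}; // OP opcode for register-register ALU instructions
  initial $display("add x3, x1, x2 = %h", ADD_X3_X1_X2); // prints 002081b3
endmodule

Because rs1, rs2 and rd sit at the same bit positions in every format, a decoder can start reading the register file before it even knows which instruction it holds.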

RISC-V has a number of other features designed to improve hardware implementation. It had an advantage that older ISAs, such as x86 or ARM, did not have: seeing the pros and cons of many past ISAs and planning ahead for what is to come in the world of computing. Therefore, it holds a special place among current ISA designs and gains much support from academia and industry alike.

2.2 The PULP Project


The surge of RISC-V was a great event for academia. Researchers can now develop their own chips based on a very robust and well-planned interface, adapting it to their needs. The RISC-V website lists a number of academic and commercial projects that took this ISA as the foundation of their development (RISC-V, 2018).
The PULP Platform goes along with this trend, but with a very specific question in mind. The name itself reveals the project's goal: PULP stands for Parallel Ultra Low Power. The project had in mind the demands of near-threshold computing (NTC); thus, managing power from transistor switching and leakage during operation, recovering performance with parallel execution, and finding the most energy-efficient number of issues and execution order were concerns from architecture design to silicon implementation (ROSSI et al., 2016).

Figure 3 – The PULP Family.

Source – PULP Platform

It started as a joint effort between ETH Zürich and the University of Bologna to explore new architectures for ultra-low power microprocessors and meet the demand for such devices in IoT applications (PULP PLATFORM, 2018a). However, the support around the RISC-V ISA and the open nature of both it and the PULP Platform helped this project aim higher. Nowadays, the PULP Platform has a number of cores, microcontrollers and processors that encompass IoT and High Performance Computing (HPC) applications. Figure 3 shows an updated graph illustrating the whole PULP family.
The PULP Platform has very simple cores, such as Zero-riscy, "an area-optimized 2-stage 32-bit core", and complex cores, such as Ariane, "a 6-stage, single issue, in-order 64-bit CPU which fully implements I, M, C and D extensions" among other features (PULP PLATFORM, 2018b). Those cores can be used in diverse platforms, from single-core IoT-focused systems to multi-cluster HPC-focused ones, as shown in Figure 3. As a result, there are a number of choices for anyone who intends to use the PULP Platform in a research project.

3 ZERO-RISCY CORE

There are many possible choices among the cores within the PULP project; therefore, a researcher has to consider their needs and find a suitable contender. For the purpose of this document, which is to develop a multicore system with shared memory, the Zero-riscy is a good starting point. This core implements the 32-bit base RISC-V instruction set and two standard extensions, M and C, which allow for multiplication/division operations and the compressed instruction encoding, respectively.
In order to use this core in the proposed multicore system, one must first understand its microarchitecture and how to control the core's signals. This chapter explores the inner workings of the zero-riscy core and how external hardware may communicate with it.

3.1 Microarchitecture

The zero-riscy is an area-optimized core, focused on low energy consumption, and it is possible to see how these decisions shaped its internal structure. The core has 2 pipeline stages: Instruction Fetch (IF) and Instruction Decode and Execution (IDE). Figure 4 shows the microarchitecture of the zero-riscy.

Figure 4 – Simplified block diagram of Zero-riscy.

Source – (SCHIAVONE et al., 2017)

During IF, as the name says, the core fetches instruction data from memory through its instruction memory interface. The main block is the prefetch buffer, which uses the memory protocol (explained in depth in section 3.2) to access memory and stores fetched instructions in a FIFO if the next stage is not yet ready. In the IF stage, the prefetch buffer also handles any compressed instructions, along with the instruction address and the program counter (which might differ from the expected value due to misalignment caused by compressed instructions).
After that comes the IDE stage. This stage reads the operands from the register file (in Figure 4, the GPR (General Purpose Register) block), prepares them for execution in the ALU or multiplier units, and executes the instruction. The register file is a 2-read-1-write unit that may be either latch based or flip-flop based, depending on the target (ASIC or FPGA).
The ALU is a result of the area optimization. It contains only the blocks necessary to implement the base instruction set and the multiplication/division standard extension. Figure 5 shows that it contains an adder, a shifter and a logic unit. The adder selects its operands from a multiplexer that defines the output based on a controller signal, whereas both the shifter and the logic unit always compute their results with values provided by the decoder or the register file. The end result is selected by another multiplexer that receives all three computations.

Figure 5 – Simplified block diagram of the Zero-riscy ALU.

Source – (SCHIAVONE et al., 2017)

Finally, the Control and Status Registers (the CSR block) hold information used by the RISC-V privileged instructions, in exceptions, interrupts and performance metrics; and the Load and Store Unit (LSU) is responsible for loading and storing memory data.

3.2 Core Main Signals


The Zero-riscy interface defines 37 input and output signals. They can be categorized as clock signals, core information, instruction memory interface, data memory interface, interrupt inputs, debug signals and CPU control signals. In order to initiate the core, the clock signals, core information and CPU control signals need to be set.
In more detail, the clock signal category contains clk_i, rst_ni, clock_en_i and test_en_i. The first two signals, clk_i and rst_ni, receive the input clock as active high (hence the _i suffix) and the reset signal as active low (hence the _ni suffix), respectively; the other two signals, clock_en_i and test_en_i, are enable signals that allow the clock to propagate normally or under a test methodology.
The core information signals are core_id_i, cluster_id_i and boot_addr_i. Signals core_id_i and cluster_id_i define the core and cluster information to be stored inside the Control and Status Registers (CSR; refer to Figure 4). This information can be accessed through privileged instructions described in the ISA manual. Furthermore, boot_addr_i sets part of the initial boot address. It is important to observe that the complete address is a composition of boot_addr_i and an exception offset parameter within the core code.
At last, there are the CPU control signals: fetch_enable_i and ext_perf_counters_i. The fetch_enable_i signal is set so as to enable the instruction fetch process; ext_perf_counters_i is a port which may connect an external signal to a performance counter, which is incremented every time the port receives a logical high. Performance counters are commonly used to count internal events such as stalls, but in this case the port counts an external, user-defined event.
The instruction memory interface and the data memory interface are the ports through which data comes in and out of the core. They have very similar signals, as shown in Table 2, with the addition of signals that allow write operations to data memory and byte-aligned reads and writes, neither of which is needed for instruction memory access. In Table 2, signals listed in the same row serve the same purpose, differing only by the name of the memory module they connect to.
After the aforementioned signals are set, the program fetching process can start. The zero-riscy manual (SCHIAVONE, 2017) specifies a memory access protocol, valid for both instruction and data memory, described as follows. Figure 6 shows a graphical time representation of this protocol.

The protocol that is used by the LSU to communicate with a memory works as follows: The LSU provides a valid address in data_addr_o and sets data_req_o high. The memory then answers with a data_gnt_i set high as soon as it is ready to serve the request. This may happen in the same cycle as the request was sent or any number of cycles later. After a grant was received, the address may be changed in the next cycle by the LSU. In addition, the data_wdata_o, data_we_o and data_be_o signals may be changed as it is assumed that the memory has already processed and stored that information. After receiving a grant, the memory answers with a data_rvalid_i set high if data_rdata_i is valid. This may happen one or more cycles after the grant has been received. Note that data_rvalid_i must also be set when a write was performed, although the data_rdata_i has no meaning in this case. (SCHIAVONE, 2017).

Table 2 – LSU Signals

Signal                                     Direction   Description
data_req_o / instr_req_o                   output      Request ready; must stay high until data_gnt_i is high for one cycle
data_addr_o[31:0] / instr_addr_o[31:0]     output      Address
data_we_o                                  output      Write enable, high for writes, low for reads; sent together with data_req_o
data_be_o[3:0]                             output      Byte enable; set for the bytes to write/read, sent together with data_req_o
data_wdata_o[31:0]                         output      Data to be written to memory, sent together with data_req_o
data_rdata_i[31:0] / instr_rdata_i[31:0]   input       Data read from memory
data_rvalid_i / instr_rvalid_i             input       data_rdata_i holds valid data when data_rvalid_i is high; this signal is high for exactly one cycle per request
data_gnt_i / instr_gnt_i                   input       The other side accepted the request; data_addr_o may change in the next cycle

Source – (SCHIAVONE, 2017)

Figure 6 – Memory protocol time diagram.

Source – (SCHIAVONE et al., 2017)
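To make the protocol concrete, the following is a minimal, hypothetical SystemVerilog sketch of a memory that grants every request in the same cycle and returns data one cycle later (port names mirror Table 2; the 256-word storage size is an arbitrary assumption):

module simple_mem (
  input  logic        clk,
  input  logic        req_i,     // data_req_o from the core
  input  logic [31:0] addr_i,    // data_addr_o
  input  logic        we_i,      // data_we_o
  input  logic [31:0] wdata_i,   // data_wdata_o
  output logic        gnt_o,     // data_gnt_i into the core
  output logic        rvalid_o,  // data_rvalid_i
  output logic [31:0] rdata_o    // data_rdata_i
);
  logic [31:0] mem [0:255];

  assign gnt_o = req_i;  // always ready: grant in the same cycle

  // One cycle after a granted request, raise rvalid and present the data.
  // rvalid is also raised after writes, as the protocol requires.
  always_ff @(posedge clk) begin
    rvalid_o <= req_i & gnt_o;
    if (req_i & gnt_o) begin
      if (we_i) mem[addr_i[9:2]] <= wdata_i;  // word-addressed write
      else      rdata_o <= mem[addr_i[9:2]];  // registered read
    end
  end
endmodule

Such a zero-wait-state memory is the simplest legal implementation of the protocol; a real memory could delay gnt_o or rvalid_o by any number of cycles.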

Knowledge of this memory access protocol is vital for the design of the shared-memory multicore system in this document. Chapter 4 goes through how the design process occurred and its final architecture.

4 A MULTICORE SYSTEM

Chapter 3 analyzed the microarchitecture and external signals provided by the zero-riscy core. With this information, one may now start to envision how to combine cores and memory modules in a single system.

4.1 Development Environment


The code for the zero-riscy core is open source and available on a GitHub page (PULP PLATFORM, 2018c) from the PULP Platform. The core was designed in SystemVerilog, a hardware description language (HDL) that extends the Verilog language.
In order to design the multicore system that uses the zero-riscy as its base core, a tool supporting the SystemVerilog language was needed. Unfortunately, most open source and free tools aimed at hardware development and simulation do not support SystemVerilog, and even those that announce support for the language do not implement the whole SystemVerilog specification. Initial tests made with tools such as Icarus Verilog did not render positive results concerning SystemVerilog support.
Regarding commercial tools, there are well-known companies that offer hardware development tools, IDEs and simulators with SystemVerilog support. To list a few, there are Cadence, Mentor and Synopsys, three of the main hardware development companies. Their software tools are used by companies throughout the world to develop their own hardware.
Fortunately, there is an online tool which provides free access to such commercial tools: EDA Playground (EDA PLAYGROUND, 2018). This online framework provides access to Cadence Incisive and Synopsys VCS, two of the main SystemVerilog simulators available to date. Figure 7 shows the main interface. The tool allows its users to define the verification framework, set compile and run options, develop, upload and download files, and see the final simulated waveform through an integrated system.
The main advantage of using such a system, besides its free availability, is the possibility of simulating hardware code in two different simulators almost seamlessly. This advantage arises from a very common problem in HDL simulation tools. Simulation tools, in general, have to account for the concurrent execution that happens in real hardware, and the way both Verilog and SystemVerilog deal with this is by defining event regions. Each event region executes a specific category of instructions; however, inside each region there is no guarantee of execution order, which may cause unintended hardware behavior. The advantage of using more than one simulation tool is that each tool has its own algorithm to resolve this execution order, so having a hardware design tested in more than one tool helps create a more robust design.
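As a small, hypothetical illustration of such an ordering hazard (this module is not part of the thesis design): two processes triggered by the same clock edge, both using blocking assignments, can legally produce different results in different simulators:

module race_demo;
  logic clk = 0;
  logic a = 0, b = 0;

  always #5 clk = ~clk;  // free-running clock

  // Both processes wake in the same event region on posedge clk.
  // With blocking assignments, whether 'b' samples the old or the
  // new value of 'a' depends on the simulator's (legal) ordering.
  always @(posedge clk) a = ~a;
  always @(posedge clk) b = a;

  initial begin
    repeat (8) @(posedge clk);
    $display("a=%b b=%b", a, b);
    $finish;
  end
endmodule

Running such code in more than one simulator, as EDA Playground allows, helps flush out design constructs that depend on a particular scheduling order.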

Figure 7 – EDA Playground Interface.

Source – (EDA PLAYGROUND, 2018)

4.2 Cores configuration


The cores were instantiated in a testbench file in the EDA Playground environment. The zero-riscy core takes three parameters during module instantiation: N_EXT_PERF_COUNTERS, the number of external counters for performance metrics, and RV32E and RV32M, which enable support for the reduced integer register count instruction set and the multiplication/division standard extension, respectively. As none of these parameters is of interest for this project for now, they were all set to zero, as in Listing 1.

Listing 1 – Module Parameters

1 parameter N_EXT_PERF_COUNTERS = 0;
2 parameter RV32E = 0;
3 parameter RV32M = 0;

As noted in section 3.2, there are a few signals which have to be set in order to start both cores. Listing 2 shows how the clock enabling signals are set to one, test enable is set to zero, and the Core ID and Cluster ID signals are set to identify each core independently. Later, in the core instantiation, each signal is assigned to its respective input port in the module interface. Note the information that these signals provide to each core. As mentioned in section 3.2, core_id_i and cluster_id_i (Listing 2, lines 4, 5, 11 and 12) store information regarding the cores' identification. This information is kept in a special register in the Control and Status Register (CSR) bank. For these signals in particular, a single 32-bit register holds both values in reserved sections of the register. This information can be later retrieved with CSR access instructions.

Listing 2 – Clock and Core information signals settings

1  // Core 1 Signals
2  logic clock_en_i_1 = 1; // enable clock, otherwise it is gated
3  logic test_en_i_1 = 0;  // enable all clock gates for testing
4  logic [3:0] core_id_i_1 = 0;
5  logic [5:0] cluster_id_i_1 = 1;
6  logic [31:0] boot_addr_i_1 = 0;
7
8  // Core 2 Signals
9  logic clock_en_i_2 = 1; // enable clock, otherwise it is gated
10 logic test_en_i_2 = 0;  // enable all clock gates for testing
11 logic [3:0] core_id_i_2 = 1;
12 logic [5:0] cluster_id_i_2 = 1;
13 logic [31:0] boot_addr_i_2 = 0;

Besides those signals, all of the other module interface signals were created with an identification tag (ending in _1 or _2) in order to identify each core's signals. For now, most of the signal variables do not need to hold any specific value, since they will be driven later by other modules or during the testbench execution procedure.
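To make the wiring concrete, here is a hypothetical sketch of how one core instance might be connected. The port list is abbreviated and the exact form is an assumption based on the signal names of Listing 2, not the verbatim testbench code:

zeroriscy_core #(
  .N_EXT_PERF_COUNTERS (0),
  .RV32E               (0),
  .RV32M               (0)
) core_1 (
  .clk_i          (clk_i),
  .rst_ni         (rst_ni),
  .clock_en_i     (clock_en_i_1),
  .test_en_i      (test_en_i_1),
  .core_id_i      (core_id_i_1),
  .cluster_id_i   (cluster_id_i_1),
  .boot_addr_i    (boot_addr_i_1),
  .fetch_enable_i (fetch_enable_i_1)
  // ... instruction/data memory ports, interrupt and debug
  // signals omitted for brevity
);

A second instance, core_2, would be identical except for the _2-tagged signals.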

4.3 Memory module design


One of the most important decisions in this project was the design of the memory model and its interconnects with both cores. In this case, there is the advantage of designing the module to best fit the project's needs.
As discussed in chapter 1, there are two main network interfaces to connect multiple hardware components: a bus and a crossbar. A bus is typically smaller and simpler, connecting all components in a broadcast network where each component listens to all signals but only accepts those addressed to it, and where only one component at a time sends messages. On the other side, a crossbar is more complex due to the switching mechanism needed to route each signal to its destination, but it can manage multiple signals at the same time.
In this project there are only two cores and one memory block shared between them. The complexity of connecting three components should be kept as low as possible. Therefore, a crossbar is not a mindful project choice, for it adds unnecessary complexity to a system that does not have many components. A viable and simple solution is to use a bus interconnect between the cores and the memory model.
Nevertheless, connecting two cores in a shared-memory system with a single bus interconnect still has its problems. Both cores would have to share the bus to write to and read from memory, alternating each access, which might slow the system down.
The proposed solution is to use a dedicated memory bus for each core, as in Figure 9. This decision is only possible due to the small number of cores and the fact that the memory model is also designed by the author. With this architectural choice, the system can perform more efficiently, increasing communication bandwidth: both cores may access the memory for read or write operations at exactly the same time. Listing 3 shows how this decision reflects on the memory module interface: the end result is simply a duplication of input and output ports to allow for simultaneous access, and the same happens within the module definition. Figure 8 shows a graphical representation of the memory module.

Listing 3 – Memory Model Instantiation

1  mem_mod inst_mem (
2    // Clock and Reset
3    /* input logic */ .clk (clk_i),
4    /* input logic */ .rst_n (1'b1),
5
6    /* input logic */ .port1_req_i (instr_req_o_1),
7    /* output logic */ .port1_gnt_o (instr_gnt_i_1),
8    /* output logic */ .port1_rvalid_o (instr_rvalid_i_1),
9    /* input logic [ADDR_WIDTH-1:0] */ .port1_addr_i (instr_addr_o_1),
10   /* input logic */ .port1_we_i (1'b0),
11   /* output logic [IN1_WIDTH-1:0] */ .port1_rdata_o (instr_rdata_i_1),
12   /* input logic [IN1_WIDTH-1:0] */ .port1_wdata_i (32'b0),
13
14   /* input logic */ .port2_req_i (instr_req_o_2),
15   /* output logic */ .port2_gnt_o (instr_gnt_i_2),
16   /* output logic */ .port2_rvalid_o (instr_rvalid_i_2),
17   /* input logic [ADDR_WIDTH-1:0] */ .port2_addr_i (instr_addr_o_2),
18   /* input logic */ .port2_we_i (1'b0),
19   /* output logic [IN1_WIDTH-1:0] */ .port2_rdata_o (instr_rdata_i_2),
20   /* input logic [IN1_WIDTH-1:0] */ .port2_wdata_i (32'b0)
21 );

Figure 8 – Memory module.

Source – Author

Figure 9 – Memory and Cores Interconnect.

Source – Author
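The mem_mod body itself is not reproduced in this document; the following is a hypothetical sketch of how such a dual-port memory with truly simultaneous access could be written, inferred from the interface in Listing 3 (the array size and word addressing are assumptions):

module mem_mod #(
  parameter ADDR_WIDTH = 32,
  parameter IN1_WIDTH  = 32
) (
  input  logic clk, rst_n,
  // Port 1
  input  logic                  port1_req_i,
  output logic                  port1_gnt_o,
  output logic                  port1_rvalid_o,
  input  logic [ADDR_WIDTH-1:0] port1_addr_i,
  input  logic                  port1_we_i,
  output logic [IN1_WIDTH-1:0]  port1_rdata_o,
  input  logic [IN1_WIDTH-1:0]  port1_wdata_i,
  // Port 2 (a duplicate of port 1)
  input  logic                  port2_req_i,
  output logic                  port2_gnt_o,
  output logic                  port2_rvalid_o,
  input  logic [ADDR_WIDTH-1:0] port2_addr_i,
  input  logic                  port2_we_i,
  output logic [IN1_WIDTH-1:0]  port2_rdata_o,
  input  logic [IN1_WIDTH-1:0]  port2_wdata_i
);
  logic [31:0] mem [0:255];  // shared storage behind both ports

  // Each port grants independently, so both cores can be served at once.
  assign port1_gnt_o = port1_req_i;
  assign port2_gnt_o = port2_req_i;

  always_ff @(posedge clk) begin
    port1_rvalid_o <= port1_req_i;
    port2_rvalid_o <= port2_req_i;
    if (port1_req_i) begin
      if (port1_we_i) mem[port1_addr_i[9:2]] <= port1_wdata_i;
      else            port1_rdata_o <= mem[port1_addr_i[9:2]];
    end
    if (port2_req_i) begin
      if (port2_we_i) mem[port2_addr_i[9:2]] <= port2_wdata_i;
      else            port2_rdata_o <= mem[port2_addr_i[9:2]];
    end
  end
endmodule

Note that nothing in this sketch arbitrates two simultaneous writes to the same word; as discussed next, that responsibility falls to software.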

However, another possible problem arises. Consider the case when both cores try to access the same memory address at the same time, for read or write operations. If both cores try to read the same memory address, both will be able to perform this action and acquire the information, which is beneficial for the whole system. Yet, when both cores try to write data to the same memory address, or even when one core reads while the other writes to the same address, the operations conflict. In both cases, the parallel nature of hardware and HDL simulation does not guarantee which operation will be executed first. Note, however, that this may also happen with a shared bus interconnect: if the cores try to operate on the same memory address, the bus will allow one operation at a time, but there is no guarantee of the order of execution. In such cases, the solution is considered to come from software rather than hardware. Thus, a duplicated memory bus is still a very compelling choice for the proposed system architecture. It does not solve common interconnect problems, but it greatly improves communication bandwidth by providing truly simultaneous access to memory.
As discussed previously, the race condition that happens when two or more cores try to access the same memory address is common to both buses and crossbars. This is in most cases a software problem; however, hardware can also be used to avoid it. Atomic instructions prevent more than one simultaneous access to memory: they are implemented in such a way that, during a single data transfer, the core may both read and write memory. Through this implementation, the operation appears to all other cores and peripherals as a single instruction being executed, and it solves the race condition by asserting that no other core may write to the same place at the same moment, changing the data being processed. This is indeed a very good solution, but although the RISC-V instruction set supports such instructions, the zero-riscy does not implement this standard extension, which hands the responsibility of dealing with memory access conflicts back to software.
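For contrast, if the A extension were implemented, the final accumulation in the test program of chapter 5 could be protected by a load-reserved/store-conditional pair instead of a flag-based queue. A hypothetical RISC-V assembly sketch (not executable on zero-riscy; the register roles are assumptions, with a6 holding the address of the shared sum and a3 the core's partial sum):

retry:
    lr.w  a5, (a6)      # load-reserve the current shared sum
    add   a5, a5, a3    # add this core's partial sum
    sc.w  a4, a5, (a6)  # store-conditional: writes 0 to a4 on success
    bnez  a4, retry     # another core intervened; retry the update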

5 DESIGN VERIFICATION

As with any engineering project, the design phase is just a fraction of the whole development cycle. A test phase is always needed to assert design behavior. This chapter explores two simulation methods used to verify the hardware design, along with the development of a parallel program to stress both cores at the same time.

5.1 Instruction Program Code


In order to choose a program to test the multicore system, the author took a few requirements into consideration:

1. The program shall be easily parallelizable, for the purpose is not to study the parallel program, but the hardware.

2. The program shall have few instructions, for writing an assembly program can get really complex even for simple procedures, and converting it to binary code might be a necessity.

3. The program shall deal only with simple integer operations such as addition and subtraction, for the core does not support floating-point operations, and multiplication/division are part of a standard extension.

With these requirements in mind, the chosen problem was to sum the numbers from 1 to N, the sum of the first N natural numbers. This problem has a rather simple and well-known closed-form solution, seen in Equation 5.1. However, our implementation deals specifically with the left side of this equation, summing all the parts from 1 to N.

\[ \sum_{k=1}^{n} k = \frac{n(n+1)}{2} \tag{5.1} \]

This problem was chosen because it involves only add operations and can be solved simply with a partial-sum loop, which is itself very easily parallelizable. To solve it, the parallel RISC-V assembly program in Listing 4 was written.
The assembly program starts by fetching the core ID from the Control and Status Registers. This information is passed via the module interface and stored in those registers for later use. After the content of the whole register 0xF14 is stored in register a0, an AND operation with a mask is used to extract only the first four bits, which contain the core ID, into the final register. The core ID is used to distinguish which core should execute which part of the program and is used throughout the program.

Listing 4 – RISC-V Assembly Program

1  main:
2      csrr a0, 0xf14          # access CSR register
3      andi a0, a0, 0xF        # get the core ID with a mask
4      sw   zero, 0(zero)      # sets zero to access flag
5      sw   zero, 4(zero)      # sets zero to total sum
6      bne  a0, zero, core2    # branches to core 2 loop
7
8  core1:
9      li a1, 0                # a1 holds counter start
10     li a2, 5                # a2 holds counter max
11     li a3, 0                # a3 holds partial sum
12 forloop1:
13     bge  a1, a2, end        # branches if greater or equal
14     add  a3, a3, a1         # add counter to partial sum
15     addi a1, a1, 1          # increment counter
16     j    forloop1           # restart for loop
17
18 core2:
19     li a1, 5                # a1 holds counter start
20     li a2, 8                # a2 holds counter max
21     li a3, 0                # a3 holds partial sum
22 forloop2:
23     bge  a1, a2, end        # branches if greater or equal
24     add  a3, a3, a1         # add counter to partial sum
25     addi a1, a1, 1          # increment counter
26     j    forloop2           # restart for loop
27
28 end:
29     lw   a4, 0(zero)        # loads flag Core ID value
30     beq  a0, a4, store      # branch if flag equal to core ID
31     j    end                # restart loop
32 store:
33     lw   a5, 4(zero)        # load partial sum from mem
34     add  a5, a5, a3         # add with core partial sum
35     sw   a5, 4(zero)        # store partial sum in memory
36     addi a4, a4, 1          # increment flag value
37     sw   a4, 0(zero)        # store flag value
38
39 inf:
40     bge  zero, zero, inf    # branch infinitely

Next, lines 4 and 5 set the initial values of the memory addresses the program uses: a flag indicating which core may access memory, and a variable which will hold the final sum. Line 6 branches the code to the region each core has to execute in order to calculate its partial sum.
In Listing 4, there are code labels called core1 and core2, on lines 8 and 18, respectively. Each one marks the start of the block that shall be performed by each core independently. Both blocks of code perform a partial sum that accumulates from the initial value until the last value, which sets the loop stop condition. The only difference is the initial setup, in lines 9 through 11 and 19 through 21, which sets the registers that hold the counter, the partial sum and the maximum value for the loop. Each loop iteration starts by branching if the counter is greater than or equal to the maximum value. If the counter is less than the maximum value, the program proceeds to the next line and adds the counter to the current partial sum. Afterwards, the counter is incremented by one and the program executes an unconditional jump to the beginning of the loop, resetting it for another iteration. If the branch in line 13 or 23 finds the counter equal to or greater than the maximum value, the code branches to the end label, finishing the procedure. For the bounds used here, core 0 accumulates 0 + 1 + 2 + 3 + 4 = 10 and core 1 accumulates 5 + 6 + 7 = 18, so the final sum stored in memory should be 28.
The end code label denotes the starting point of the conclusion procedure, where each partial sum is added and stored in a specific memory address. In order to avoid any race conditions in memory operations, the author decided that the cores will adhere to a specific access order: the first core to access has core ID equal to zero, and the second core to access has core ID equal to one. This order guarantees that reading memory, fetching the partial sum from memory and adding it to the partial sum in the core will occur without any simultaneous access from another core. The instruction in line 29 loads a flag that signals the core ID, and this value is used to decide which core has the right to access memory at the moment. If the core ID in register a0 is different from the flag value, the code performs an unconditional jump and re-fetches the flag, comparing both values until they are equal. This procedure guarantees that the program only proceeds when it is its turn to access memory. When flag and core ID are equal, the core executes the next block of code.
The last code block loads the partial sum from memory, in line 33, and adds it to the partial sum in the core's registers. This value is then stored again in the same memory location that holds the partial sum. Now that the core has performed its partial sum, it releases access for the other cores by updating the flag status. This flag holds the value of the core ID that may access the memory at that time, and by increasing the flag by one, as in line 36, the current core allows the next core ID (which was stuck in a loop in lines 29, 30 and 31 up until this moment) to write to memory. By incrementing and storing the flag value, the program creates a queue that all cores respect in order to write to memory. The entire program flow is represented in Figure 10.

Figure 10 – Flow diagram of the parallel program.

Source – Author

After designing the RISC-V assembly program, it had to be converted to binary code so it could be uploaded to memory. The Ripes program (RIPES, 2018) is well suited for this task: it receives assembly code and executes the instructions in a simulated pipelined processor, primarily for educational purposes. Conveniently, the software also exposes the binary code generated from the assembly, and with that binary text file in hand, the parallel program was ready for memory upload and execution.
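For reference, the $readmemb task used later consumes a plain text file with one binary word per line (line comments are allowed). The excerpt below is a hand-assembled sketch of how three instructions from Listing 4 would appear in instmem.bin; it illustrates only the file format and is not the actual output generated for this work.

// one 32-bit instruction word per line, most significant bit first
00000000010000000010011110000011 // lw  a5, 4(zero)
00000000110101111000011110110011 // add a5, a5, a3
00000000111100000010001000100011 // sw  a5, 4(zero)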

5.2 Online Simulation


The EDA Playground environment, as explained before, is capable of simulating SystemVerilog code, and it was the first simulation environment to receive the multicore system. Its graphical interface splits development between a testbench file and design files, as seen in Figure 7. Because the interface restricts access to a single design entry point, this work gathers all the necessary design files through one file called design.sv, as shown in Listing 5.

Listing 5 – Zero-riscy design file


1  `include "zeroriscy_defines.sv"
2  `include "cluster_clock_gating.sv"
3  `include "zeroriscy_tracer_defines.sv"
4  `include "zeroriscy_alu.sv"
5  `include "zeroriscy_compressed_decoder.sv"
6  `include "zeroriscy_controller.sv"
7  `include "zeroriscy_cs_registers.sv"
8  `include "zeroriscy_debug_unit.sv"
9  `include "zeroriscy_decoder.sv"
10 `include "zeroriscy_int_controller.sv"
11 `include "zeroriscy_ex_block.sv"
12 `include "zeroriscy_id_stage.sv"
13 `include "zeroriscy_if_stage.sv"
14 `include "zeroriscy_load_store_unit.sv"
15 `include "zeroriscy_multdiv_slow.sv"
16 `include "zeroriscy_multdiv_fast.sv"
17 `include "zeroriscy_prefetch_buffer.sv"
18 `include "zeroriscy_fetch_fifo.sv"
19 `include "zeroriscy_register_file.sv"
20 `include "zeroriscy_tracer.sv"
21 `include "zeroriscy_core.sv"
22 `include "memory.sv"

The testbench file is responsible for instantiating all necessary modules (cores and memories), connecting them together, and executing the parallel program. Chapter 4 explained how the modules are linked to each other; this section shows how the parallel program is uploaded to memory and executed.
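As a point of reference, a minimal sketch of such a testbench is shown below. The port lists are heavily abbreviated, and apart from clk_i, rst_ni, and the two fetch_enable_i signals quoted in this chapter, the instance and port names are assumptions; the full wiring follows the interconnect presented in Chapter 4.

module zeroriscy_tb;
  logic clk_i, rst_ni;
  logic fetch_enable_i_1, fetch_enable_i_2;

  // core 0 and core 1, distinguished by their core ID inputs
  // (instruction and data bus ports omitted for brevity)
  zeroriscy_core core_0 (
    .clk_i          (clk_i),
    .rst_ni         (rst_ni),
    .core_id_i      (4'd0),
    .fetch_enable_i (fetch_enable_i_1)
    // ...
  );
  zeroriscy_core core_1 (
    .clk_i          (clk_i),
    .rst_ni         (rst_ni),
    .core_id_i      (4'd1),
    .fetch_enable_i (fetch_enable_i_2)
    // ...
  );

  // shared memory exposing one duplicated bus interface per core
  memory mem_i ( /* both bus interfaces */ );
endmodule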
Both Verilog and SystemVerilog have a structure called an initial block. This construct has a very specific purpose: it is executed once, at the beginning of the simulation. In the case of the design file, the initial block loads the program into memory, as in Listing 6. With a for loop, it writes a high-impedance signal to all memory addresses and then reads the text file containing the binary parallel program created in section 5.1 and uploads it to memory.

Listing 6 – Memory initialization


1 initial begin
2   for (int i = 0; i != 255; i = i + 1) begin
3     mem[i] = 32'bz;
4   end
5   $readmemb("instmem.bin", mem);
6 end

In the testbench file, the initial block is responsible for setting up and changing any signals the multicore system needs to execute. In this case, it initiates and toggles the clock signal, as in lines 1 and 2 of Listing 7. Next, the initial block sets the dump variable and the name of the vcd file, which contains the wave forms.

Listing 7 – Test procedure


1  initial clk_i = 1;
2  always #1 clk_i = ~clk_i;
3
4  initial begin
5    $dumpfile("wave.vcd");
6    $dumpvars(0, zeroriscy_tb);
7
8    rst_ni = 0;
9    fetch_enable_i_1 = 1;
10   fetch_enable_i_2 = 1;
11   #1; rst_ni = 1; #1;
12
13   #130;
14   $finish;
15 end

Finally, the initial block drives the active-low reset signal to reset both cores simultaneously and then asserts each fetch_enable_i signal to start instruction fetching, in lines 9 and 10. The last statement, in line 13, simply keeps the simulation running for 130 time units, long enough to execute the whole code. After that, the result can be inspected in the wave.vcd file declared at the beginning of the initial block.

5.2.1 Online simulation results


The result is accessible in the wave form dumped by the initial block. Using the integrated wave form visualizer, the end result was confirmed, as shown in Figure 11.

Figure 11 – Simulation wave form.

Source – Author

The wave form in Figure 11 shows the ALU operands a and b and the adder result, as alu_operand_a_ex, alu_operand_b_ex and alu_adder_result_ex respectively. The first three signals from top to bottom belong to core 1 and the last three to core 2. With these signals, one can follow both core loops and the final result of each partial sum in the parallel code.
Each loop iteration in Figure 11 is marked with a color for better visualization. The second clock cycle of each loop iteration is marked with a white stroke box to indicate the cycle in which the partial sum is executed, that is, the moment when the previous sum is added to the current counter within the loop iteration. As expected, core 1 performs the sums 0 + 0 = 0, 0 + 1 = 1, 1 + 2 = 3, 3 + 3 = 6 and then 6 + 4 = A (using hexadecimal representation), accumulating 0 through 4 into 0xA (10 in decimal). Core 2 performs 0 + 5 = 5, 5 + 6 = B and B + 7 = 12, accumulating 5 through 7 into 0x12 (18 in decimal).
After all partial sums are executed, both cores prepare for memory access, each one taking its turn to add and write its partial sum to the specified memory address. During time unit 100, the first core adds its partial sum to the one it has just fetched from memory. This operation is marked with a dashed red box in Figure 11 and reads A + 0 = A. The next partial sum aggregation happens at time unit 132, as shown in Figure 11. It is also marked with a dashed red box and displays the operation 12 + A = 1C, which adds the first core's partial sum (stored in memory) to the second core's partial sum; in decimal, 18 + 10 = 28, the sum of all integers from 0 to 7. This result is then stored in memory and the program enters the final infinite loop.

5.3 Porting to FPGA


After simulating the multicore system in software, the project moved one step closer to a robust implementation. The main goal is a functional multicore system that can execute RISC-V instructions and be used in real parallel applications. To achieve that goal, the next target is to emulate the hardware design on a Field Programmable Gate Array (FPGA) development board.

Table 3 – FPGA and Project Specifications

Quartus Prime Version   18.1.0 Build 625 09/12/2018 SJ Standard Edition
Revision Name           zeroriscy_core
Top-level Entity Name   zeroriscy_toplevel
Family                  Cyclone IV E
Device                  EP4CE115F29C7
Source – Author

The emulation hardware is a DE2 Development and Educational Board, developed by Terasic. It includes a Cyclone IV E FPGA, as shown in Table 3. To develop for this board, the author used Quartus Prime Standard Edition.
The DE2 board offers a few peripherals, such as switches, 7-segment Light Emitting Diode (LED) displays and buttons, that help debug the hardware and assert the program execution. Still, in order to compile the hardware design for the FPGA, a few modifications had to be made. The new design has to be a top-level entity that contains all modules: two cores, instruction memory, data memory, and auxiliary output modules such as the seven-segment LED display drivers and the clock divider.
The top-level entity comprises a zeroriscy_soc, seven LED display drivers, and a clock divider. The clock divider has the sole function of dividing the board's internal 50 MHz clock signal down to a 1 Hz output, so that observers can follow each instruction execution. The LED display drivers receive strategic signals, such as alu_adder_result_ex, in order to assert correct execution in specific clock cycles. Each signal value is converted to hexadecimal representation and displayed on the available seven-segment LED displays. Finally, the zeroriscy_soc includes all the components mentioned in chapter 4, cores and memory blocks. The top-level architecture is shown in Figure 12.

Figure 12 – Top level architecture.

Source – Author
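Both auxiliary modules are simple enough to sketch here. The code below shows one possible implementation of each, assuming the 50 MHz input clock mentioned above and active-low display segments, as is common on the DE2; module and signal names are illustrative and need not match the ones used in the project.

module clock_divider (
  input  logic clk_50mhz_i,
  input  logic rst_ni,
  output logic clk_1hz_o
);
  // toggle the output every 25,000,000 input cycles: 50 MHz -> 1 Hz
  logic [24:0] count;
  always_ff @(posedge clk_50mhz_i or negedge rst_ni) begin
    if (!rst_ni) begin
      count     <= '0;
      clk_1hz_o <= 1'b0;
    end else if (count == 25_000_000 - 1) begin
      count     <= '0;
      clk_1hz_o <= ~clk_1hz_o;
    end else begin
      count <= count + 1'b1;
    end
  end
endmodule

module led_display_driver (
  input  logic [3:0] hex_i,      // one hexadecimal digit of the tapped signal
  output logic [6:0] segments_o  // active-low segments g..a
);
  always_comb begin
    unique case (hex_i)
      4'h0: segments_o = 7'b1000000;
      4'h1: segments_o = 7'b1111001;
      4'h2: segments_o = 7'b0100100;
      4'h3: segments_o = 7'b0110000;
      4'h4: segments_o = 7'b0011001;
      4'h5: segments_o = 7'b0010010;
      4'h6: segments_o = 7'b0000010;
      4'h7: segments_o = 7'b1111000;
      4'h8: segments_o = 7'b0000000;
      4'h9: segments_o = 7'b0010000;
      4'hA: segments_o = 7'b0001000;
      4'hB: segments_o = 7'b0000011;
      4'hC: segments_o = 7'b1000110;
      4'hD: segments_o = 7'b0100001;
      4'hE: segments_o = 7'b0000110;
      4'hF: segments_o = 7'b0001110;
    endcase
  end
endmodule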

Besides these changes to the organization of the project, a few files were altered so that the core would compile for the FPGA board. The reason is that the original project contained assert statements and other non-synthesizable SystemVerilog constructs. After removing these statements, the project was ready to be used in the FPGA IDE.
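One common alternative to deleting such statements outright is to fence them off with synthesis pragmas, which Quartus recognizes. A small example of the idiom is sketched below; the checked condition and signal names are placeholders, not code from the original project.

// synthesis translate_off
always_ff @(posedge clk_i) begin
  // simulation-only sanity check; the synthesis tool ignores
  // everything between the translate_off/translate_on pragmas
  assert (rst_ni || !data_req_o)
    else $error("bus request asserted while in reset");
end
// synthesis translate_on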

5.3.1 Hardware emulation results


The entire design was compiled and executed using the Quartus Prime IDE. Even though the zero-riscy is one of the smallest cores in the PULP project, it is still a very complex device by itself, as the metrics reported after compilation confirm.
As Table 4 shows, the total compilation time was 27 minutes and 25 seconds, with the Fitter stage being the most time consuming, taking as much as 19 minutes and 44 seconds to finish. The end result occupies a total of 37,892 logic elements, about one third of the FPGA's logic element count, as detailed in Table 5.
After compilation, the program was uploaded to the FPGA board and executed.
Chapter 5. Design Verification 42

Table 4 – Flow Elapsed Time

Module Name            Elapsed Time
Analysis & Synthesis   00:06:36
Fitter                 00:19:44
Assembler              00:00:05
Timing Analyzer        00:00:23
EDA Netlist Writer     00:00:37
Total                  00:27:25
Source – Author

Table 5 – Quartus Fitter Result

Timing Models          Final
Total logic elements   37,892 / 114,480 (33%)
Total registers        14,705
Total pins             106 / 529 (20%)
Source – Author

Figure 13 – FPGA Emulation result.

Source – Author

The LED displays and the switches, shown in Figure 13, are the only input and output available to verify the correct execution of the code.
In order to start executing the code in memory, switch 0, which holds the reset signal, was toggled to logical low and back to logical high. Since the reset signal is active low, this action resets the cores and prepares them for start-up. After that, both fetch enable signals are set high and the cores start requesting instruction data.
The four leftmost LED displays on the board, grouped in pairs, show the ALU results from core 0 and core 1, respectively. As each core executes instructions, these values change every cycle. Throughout the experiment, the author followed the LED display output to confirm that the ALU adder result in each clock cycle matched the values from the simulation, reinforcing the accuracy of the hardware design.
The correct result was confirmed by the LED display on the far right, labeled Total Sum Result in Figure 13. It shows 1C, just as expected from the simulation in EDA Playground. To its left is the Memory Access flag, which starts at zero, turns to one, and finally to two after core 1 has written its value to memory.

6 CONCLUSION

Parallel computing is the dominant computer organization nowadays. As time progresses and new systems are launched to market, it becomes clear that hardware specialization and parallelization are necessary to overcome the physical restrictions of current transistor technology and materials. Yet all these new computer organizations and specialized hardware introduce a great deal of complexity into the design.
This work was proposed as a way to cut through that complexity by investigating the design of a parallel hardware system: a dual processor with shared memory access. The expectation was that developing such a system would reveal design intricacies that a purely theoretical study usually does not, and that the system could then be put under verification in commercial simulators and in hardware emulation.
During development, the author decided to use a core developed by the PULP platform, compliant with the RISC-V instruction set architecture. This decision was motivated by the growing demand for open-source hardware designs as well as the increasing support behind the ISA, which was designed to be stable and extensible enough for applications ranging from IoT to HPC. These characteristics, along with other advantages over traditional architectures, have drawn attention from industry and academia alike, making this ISA a very compelling choice for new projects.
The development of the multicore system considered a few different memory organizations, such as a bus and a crossbar. In the end, a duplicated bus interconnect proved a viable way to keep the project simple while allowing both cores to access memory at exactly the same time, which is not possible with a single interconnect. As a consequence, the memory module takes both bus interfaces and duplicates its internal signals to allow this exchange between the cores and memory, as sketched below.
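The sketch below summarizes that memory organization. It is a simplified behavioral model under assumed port names, ignoring the grant/valid handshake of the real bus interface.

module memory #(
  parameter int DEPTH = 256
) (
  input  logic        clk_i,
  // bus interface for core 0
  input  logic        req_a_i, we_a_i,
  input  logic [31:0] addr_a_i, wdata_a_i,
  output logic [31:0] rdata_a_o,
  // duplicated bus interface for core 1
  input  logic        req_b_i, we_b_i,
  input  logic [31:0] addr_b_i, wdata_b_i,
  output logic [31:0] rdata_b_o
);
  logic [31:0] mem [DEPTH];

  // both interfaces are served on the same clock edge, so the cores
  // can access memory simultaneously; write races are avoided by the
  // flag protocol of the test program, not by the memory itself
  always_ff @(posedge clk_i) begin
    if (req_a_i) begin
      if (we_a_i) mem[addr_a_i[9:2]] <= wdata_a_i;
      rdata_a_o <= mem[addr_a_i[9:2]];
    end
    if (req_b_i) begin
      if (we_b_i) mem[addr_b_i[9:2]] <= wdata_b_i;
      rdata_b_o <= mem[addr_b_i[9:2]];
    end
  end
endmodule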
In order to test the multicore system, the author developed a parallel program that produces the sum of all natural numbers from 1 to N. This problem was chosen because it involves only integer addition and is easily parallelizable, since it can simply be divided into partial sums, one for each core. With the binary code uploaded to memory, the experiment was run in a commercial SystemVerilog simulator and produced the expected results. Later, after a few design modifications, the system was also emulated on an FPGA board and rendered correct results.
In conclusion, the proposed multicore system worked as expected, allowing both cores to simultaneously access memory and modify its contents. The final result of the assembly program is correct, but only because both cores worked in consonance, divided the workload, and accessed the same memory region to aggregate their partial sums into the result. However simple, this solution reveals intricacies that are usually not exposed to the end user but are essential for the core to perform its job correctly. By bringing these hidden complexities to light, this work aids the development of new parallel hardware designs.

BIBLIOGRAPHY

ASANOVIĆ, K.; PATTERSON, D. A. Instruction Sets Should Be Free: The Case For RISC-V. [S.l.], 2014. Available at: <http://www2.eecs.berkeley.edu/Pubs/TechRpts/2014/EECS-2014-146.html>.

EDA PLAYGROUND. EDA Playground. 2018. Available at: <https://www.edaplayground.com>. Accessed: 4 December 2018.

GANSSLE, J. et al. Embedded Hardware: Know It All. Elsevier Science, 2007. (Newnes Know It All). ISBN 9780080560748. Available at: <https://books.google.com.br/books?id=HLpTtLjEXqcC>.

KIELMANN, T. et al. Buses and crossbars. In: Encyclopedia of Parallel Computing. Springer US, 2011. p. 200–205. Available at: <https://doi.org/10.1007/978-0-387-09766-4_476>.

MOORE, G. E. Cramming More Components Onto Integrated Circuits. 1965.

PACHECO, P. An Introduction to Parallel Programming. 1st ed. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2011. ISBN 9780123742605.

PULP PLATFORM. About PULP. 2018. Available at: <https://www.pulp-platform.org/projectinfo.html>. Accessed: 1 December 2018.

PULP PLATFORM. Processors - RISC-V compatible cores written in SystemVerilog. 2018. Available at: <https://www.pulp-platform.org/implementation.html>. Accessed: 1 December 2018.

PULP PLATFORM. zero-riscy: RISC-V Core. 2018. Available at: <https://www.overleaf.com/7765347961bjdtytnfcvrj>. Accessed: 4 December 2018.

RIPES. 2018. Available at: <https://github.com/mortbopet/Ripes>. Accessed: 1 December 2018.

RISC-V. RISC-V Cores and SoC Overview. 2018. Available at: <https://riscv.org/risc-v-cores/>. Accessed: 1 December 2018.

ROSSI, D. et al. PULP: A parallel ultra low power platform for next generation IoT applications. 2015 IEEE Hot Chips 27 Symposium, HCS 2015, 2016.

SCHIAVONE, P. D. zero-riscy: User Manual. June 2017. Available at: <https://pulp-platform.org/docs/zeroriscy_user_manu>.

SCHIAVONE, P. D. et al. Slow and steady wins the race? A comparison of ultra-low-power RISC-V cores for internet-of-things applications. 2017 27th International Symposium on Power and Timing Modeling, Optimization and Simulation, PATMOS 2017, v. 2017-January, p. 1–8, 2017.

WATERMAN, A. Design of the RISC-V Instruction Set Architecture. PhD Thesis, EECS Department, University of California, Berkeley, Jan 2016. Available at: <http://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-1.html>.

WATERMAN, A. et al. The RISC-V Instruction Set Manual v2.1. 2012 IEEE International Conference on Industrial Technology, ICIT 2012, Proceedings, I, p. 1–32, 2012.
