
ECE 699: Lecture 7

Efficient Communication
Between
Hardware Accelerators and PS
Recommended Videos & Slides
M.S. Sadri, ZYNQ Training
Lesson 12: AXI Memory Mapped Interfaces and Hardware Debugging
Lesson 7: AXI Stream Interface In Detail (RTL Flow)
Lesson 9: Software Development for ZYNQ Using Xilinx SDK (Transfer Data from ZYNQ PL to PS)

Xilinx Advanced Embedded System Design on Zynq
Memory Interfacing (see Resources on Piazza)
Recommended Paper & Slides

M. Sadri, C. Weis, N. Wehn, and L. Benini,
"Energy and Performance Exploration of Accelerator Coherency Port Using Xilinx ZYNQ,"
Proc. 10th FPGAworld Conference, Stockholm, 2013,
available at http://www.googoolia.com/wp/2014/03/07/my-cv/
Mapping of an Embedded SoC
Hardware Architecture to Zynq

Source: Xilinx White Paper: Extensible Processing Platform


Simple Custom Peripheral

Source: M.S. Sadri, Zynq Training


Simple Custom Accelerator

Source: M.S. Sadri, Zynq Training


Example of a Custom Accelerator

Source: M.S. Sadri, Zynq Training


Block Diagram of the Pattern Counter

Source: M.S. Sadri, Zynq Training


Ways of Implementing AXI4 Slave Units

Source: M.S. Sadri, Zynq Training


Pixel Processing Engine

Source: M.S. Sadri, Zynq Training


PS-PL Interfaces and Interconnects

Source: The Zynq Book


General-Purpose Port Summary

GP ports are designed for maximum flexibility
Allow register access from PS to PL or PL to PS
Good for synchronization
Prefer the ACP or HP ports for data transport
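As a concrete illustration of register access over a GP port, the sketch below writes a control register and polls a status register of a memory-mapped PL peripheral from the PS. Only Xil_Out32/Xil_In32 are standard xil_io.h calls; the base address macro and register offsets (MYACC_*) are hypothetical placeholders for whatever your block design generates.

#include "xil_io.h"
#include "xparameters.h"

/* Hypothetical peripheral reached through an AXI GP port; replace the base
 * address and offsets with the values generated for your block design. */
#define MYACC_BASEADDR      XPAR_MYACC_0_BASEADDR
#define MYACC_CTRL_OFFSET   0x00   /* control register (assumed) */
#define MYACC_STATUS_OFFSET 0x04   /* status register  (assumed) */
#define MYACC_START_MASK    0x01
#define MYACC_DONE_MASK     0x01

void myacc_run(void)
{
    /* Kick off the accelerator by writing its control register ... */
    Xil_Out32(MYACC_BASEADDR + MYACC_CTRL_OFFSET, MYACC_START_MASK);

    /* ... and use a status-register read for synchronization. */
    while ((Xil_In32(MYACC_BASEADDR + MYACC_STATUS_OFFSET) & MYACC_DONE_MASK) == 0)
        ;   /* poll until the PL signals completion */
}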


High-Performance Port Summary
HP ports are designed for maximum-bandwidth access to external memory and OCM
When combined, they can saturate external memory and OCM bandwidth
HP ports: 4 * 64 bits * 150 MHz * 2 = 9.6 GByte/s
External DDR: 1 * 32 bits * 533 MHz * 2 = 4.3 GByte/s
OCM: 64 bits * 222 MHz * 2 = 3.5 GByte/s
Optimized for large burst lengths and many outstanding transactions
Large data buffers to amortize access latency
Efficient upsizing/downsizing for 32-bit accesses
Using Central DMA

Source: M.S. Sadri, Zynq Training


Central DMA

High-bandwidth Direct Memory Access (DMA) between a memory-mapped source address
and a memory-mapped destination address
Optional Scatter Gather (SG)
Initialization, status, and control registers are accessed through an AXI4-Lite slave interface

Source: Xilinx Advanced Embedded System Design on Zynq
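A minimal polled-mode use of the Central DMA from the PS might look like the sketch below. It assumes the standalone xaxicdma driver (XAxiCdma_LookupConfig, XAxiCdma_CfgInitialize, XAxiCdma_SimpleTransfer, XAxiCdma_IsBusy); the device ID macro, buffer pointers, and transfer length are placeholders, and the explicit cache maintenance is needed because the CDMA master is not cache coherent.

#include "xaxicdma.h"
#include "xil_cache.h"
#include "xparameters.h"

#define CDMA_DEV_ID   XPAR_AXICDMA_0_DEVICE_ID   /* from xparameters.h (placeholder) */
#define TRANSFER_LEN  4096                       /* placeholder transfer length      */

static XAxiCdma Cdma;

int cdma_copy(u32 *Src, u32 *Dst)
{
    XAxiCdma_Config *Cfg = XAxiCdma_LookupConfig(CDMA_DEV_ID);
    if (Cfg == NULL ||
        XAxiCdma_CfgInitialize(&Cdma, Cfg, Cfg->BaseAddress) != XST_SUCCESS)
        return XST_FAILURE;

    /* CDMA is not cache coherent: flush the source buffer before the transfer */
    Xil_DCacheFlushRange((UINTPTR)Src, TRANSFER_LEN);

    /* Memory-to-memory copy, polled mode (no callback) */
    if (XAxiCdma_SimpleTransfer(&Cdma, (UINTPTR)Src, (UINTPTR)Dst,
                                TRANSFER_LEN, NULL, NULL) != XST_SUCCESS)
        return XST_FAILURE;

    while (XAxiCdma_IsBusy(&Cdma))
        ;                                        /* wait for completion */

    /* ... and invalidate the destination buffer after the transfer */
    Xil_DCacheInvalidateRange((UINTPTR)Dst, TRANSFER_LEN);
    return XST_SUCCESS;
}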


Using Central DMA in the Scatter-Gather Mode

Source: M.S. Sadri, Zynq Training


Scatter Gather DMA Mode

Source: Symbian OS Internals/13. Peripheral Support
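Conceptually, scatter-gather mode replaces a single (address, length) pair with a chain of buffer descriptors that the DMA engine walks on its own. The struct below is illustrative only (not the exact AXI DMA descriptor layout, which PG021 defines); it just shows the idea of a linked list of transfer requests.

#include <stdint.h>

/* Illustrative scatter-gather descriptor: each entry describes one buffer
 * fragment and links to the next, so one DMA start can service a whole chain. */
struct sg_descriptor {
    uint32_t next_desc;   /* physical address of the next descriptor (0 = end) */
    uint32_t buffer_addr; /* physical address of this fragment                 */
    uint32_t control;     /* fragment length, start/end-of-frame bits          */
    uint32_t status;      /* transferred-bytes count and error flags, set by DMA */
};

/* The driver builds the chain in DMA-visible memory, hands the address of the
 * first descriptor to the engine, and the engine fetches, executes, and
 * updates each descriptor in turn without further CPU involvement. */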


Custom Accelerator with the Master
AXI4 Interface

Source: M.S. Sadri, Zynq Training


Ways of Implementing AXI4 Master Units

Source: M.S. Sadri, Zynq Training


AXI4-Full

Source: M.S. Sadri, Zynq Training


Image Rotation Unit

Source: M.S. Sadri, Zynq Training


FFT Unit

Source: M.S. Sadri, Zynq Training


Sample Generator

Source: M.S. Sadri, Zynq Training


PL-PS Interfaces

Source: M.S. Sadri, Zynq Training


Accelerator Architecture with DMA

Source: Building Zynq Accelerators with Vivado HLS, FPL 2013 Tutorial
AXI DMA-based Accelerator Communication

Write to Accelerator
processor allocates buffer
processor writes data into buffer
processor flushes cache for buffer
processor initiates DMA transfer

Read from Accelerator


processor allocates buffer
processor initiates DMA transfer
processor waits for DMA to complete
processor invalidates cache for buffer
processor reads data from buffer
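A minimal polled-mode sketch tying these steps to driver calls is shown below. It assumes the standalone xaxidma driver and the Xil_DCache* routines used later in these slides; buffer allocation and XAxiDma_CfgInitialize are assumed to have been done elsewhere, and the byte counts are placeholders.

#include "xaxidma.h"
#include "xil_cache.h"

#define BYTES_TO_SEND 1024   /* placeholder sizes */
#define BYTES_TO_RCV  1024

static XAxiDma AxiDma;       /* assumed initialized with XAxiDma_CfgInitialize() */

int run_accelerator(u8 *TxBufferPtr, u8 *RxBufferPtr)
{
    /* Write to accelerator: flush the filled buffer, start the MM2S channel */
    Xil_DCacheFlushRange((UINTPTR)TxBufferPtr, BYTES_TO_SEND);
    if (XAxiDma_SimpleTransfer(&AxiDma, (UINTPTR)TxBufferPtr,
                               BYTES_TO_SEND, XAXIDMA_DMA_TO_DEVICE) != XST_SUCCESS)
        return XST_FAILURE;

    /* Read from accelerator: start the S2MM channel, wait, then invalidate */
    if (XAxiDma_SimpleTransfer(&AxiDma, (UINTPTR)RxBufferPtr,
                               BYTES_TO_RCV, XAXIDMA_DEVICE_TO_DMA) != XST_SUCCESS)
        return XST_FAILURE;

    while (XAxiDma_Busy(&AxiDma, XAXIDMA_DEVICE_TO_DMA) ||
           XAxiDma_Busy(&AxiDma, XAXIDMA_DMA_TO_DEVICE))
        ;   /* poll both channels instead of waiting on interrupt flags */

    Xil_DCacheInvalidateRange((UINTPTR)RxBufferPtr, BYTES_TO_RCV);
    return XST_SUCCESS;    /* processor can now read results from RxBufferPtr */
}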
Flushing and Invalidating Cache

/* Flush the SrcBuffer before the DMA transfer */


Xil_DCacheFlushRange((u32)TxBufferPtr, BYTES_TO_SEND);

........

/* Invalidate the DstBuffer after the DMA transfer */


Xil_DCacheInvalidateRange((u32)RxBufferPtr, BYTES_TO_RCV);
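One caveat worth noting (not on the original slide): cache maintenance works on whole 32-byte cache lines on the Cortex-A9, so a DMA buffer that is not cache-line aligned and padded can share a line with unrelated variables and interact badly with the invalidate. A common precaution, with the buffer names purely illustrative:

/* Cache-line aligned DMA buffers (Cortex-A9 L1 data cache line is 32 bytes);
 * names and sizes are placeholders. */
static u8 TxBuffer[BYTES_TO_SEND] __attribute__((aligned(32)));
static u8 RxBuffer[BYTES_TO_RCV]  __attribute__((aligned(32)));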
Simple DMA Transfer
Programming Sequence for MM2S channel (1)

1. Start the MM2S channel by setting the run/stop bit to 1 (MM2S_DMACR.RS = 1).
2. If desired, enable interrupts by writing a 1 to MM2S_DMACR.IOC_IrqEn and MM2S_DMACR.Err_IrqEn.
3. Write a valid source address to the MM2S_SA register.
4. Write the number of bytes to transfer to the MM2S_LENGTH register.
The MM2S_LENGTH register must be written last; all other MM2S registers can be written in any order.
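A bare-metal sketch of this sequence, using raw register writes rather than the driver, is shown below. The offsets follow the AXI DMA register map in PG021 (MM2S_DMACR = 0x00, MM2S_SA = 0x18, MM2S_LENGTH = 0x28) but should be checked against your core version and configuration; DMA_BASEADDR is a placeholder.

#include "xil_io.h"
#include "xparameters.h"

#define DMA_BASEADDR      XPAR_AXI_DMA_0_BASEADDR  /* placeholder base address     */
#define MM2S_DMACR        0x00                     /* control register    (PG021)  */
#define MM2S_SA           0x18                     /* source address      (PG021)  */
#define MM2S_LENGTH       0x28                     /* transfer length     (PG021)  */
#define DMACR_RS          0x00000001               /* run/stop bit                 */
#define DMACR_IOC_IRQ_EN  0x00001000               /* interrupt-on-complete enable */
#define DMACR_ERR_IRQ_EN  0x00004000               /* error interrupt enable       */

void mm2s_start(u32 SrcAddr, u32 Bytes)
{
    /* Steps 1 and 2: set MM2S_DMACR.RS = 1 and (optionally) the interrupt enables */
    Xil_Out32(DMA_BASEADDR + MM2S_DMACR,
              Xil_In32(DMA_BASEADDR + MM2S_DMACR)
              | DMACR_RS | DMACR_IOC_IRQ_EN | DMACR_ERR_IRQ_EN);

    /* Step 3: source address */
    Xil_Out32(DMA_BASEADDR + MM2S_SA, SrcAddr);

    /* Step 4: length in bytes -- written last, this starts the transfer */
    Xil_Out32(DMA_BASEADDR + MM2S_LENGTH, Bytes);
}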
Simple DMA Transfer
Programming Sequence for S2MM channel (1)

1. Start the S2MM channel by setting the run/stop bit to 1 (S2MM_DMACR.RS = 1).
2. If desired, enable interrupts by writing a 1 to S2MM_DMACR.IOC_IrqEn and S2MM_DMACR.Err_IrqEn.
3. Write a valid destination address to the S2MM_DA register.
4. Write the length in bytes of the receive buffer to the S2MM_LENGTH register.
The S2MM_LENGTH register must be written last; all other S2MM registers can be written in any order.
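The receive side mirrors the MM2S sketch above (reusing its DMA_BASEADDR and DMACR_RS definitions); the snippet below also shows one way to detect completion by polling the status register, as the xaxidma driver's XAxiDma_Busy does. Offsets again follow PG021 (S2MM_DMACR = 0x30, S2MM_DMASR = 0x34, S2MM_DA = 0x48, S2MM_LENGTH = 0x58) and should be verified for your configuration.

#define S2MM_DMACR   0x30         /* control register       (PG021) */
#define S2MM_DMASR   0x34         /* status register        (PG021) */
#define S2MM_DA      0x48         /* destination address    (PG021) */
#define S2MM_LENGTH  0x58         /* receive buffer length  (PG021) */
#define DMASR_IDLE   0x00000002   /* channel idle: transfer finished */

void s2mm_receive(u32 DstAddr, u32 Bytes)
{
    Xil_Out32(DMA_BASEADDR + S2MM_DMACR,
              Xil_In32(DMA_BASEADDR + S2MM_DMACR) | DMACR_RS);   /* step 1 */
    Xil_Out32(DMA_BASEADDR + S2MM_DA, DstAddr);                  /* step 3 */
    Xil_Out32(DMA_BASEADDR + S2MM_LENGTH, Bytes);                /* step 4, arms the channel */

    /* Poll S2MM_DMASR.Idle until the stream packet has been written to memory */
    while ((Xil_In32(DMA_BASEADDR + S2MM_DMASR) & DMASR_IDLE) == 0)
        ;
}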
Transmitting and Receiving a Packet
Using High-Level Functions
/* Transmit a packet */
Status = XAxiDma_SimpleTransfer(&AxiDma,(u32) TxBufferPtr,
BYTES_TO_SEND, XAXIDMA_DMA_TO_DEVICE);
if (Status != XST_SUCCESS) { return XST_FAILURE; }
while (!TxDone);   /* TxDone is set by the DMA interrupt handler */

......

/* Receive a packet */
Status = XAxiDma_SimpleTransfer(&AxiDma,(u32) RxBufferPtr,
BYTES_TO_RCV, XAXIDMA_DEVICE_TO_DMA);
if (Status != XST_SUCCESS) { return XST_FAILURE; }
while (!RxDone);   /* RxDone is set by the DMA interrupt handler */
Transmitting a Packet
Using Lower-Level Functions

/* Transmit a packet */
/* Write the source address (MM2S_SA) */
Xil_Out32(AxiDma.TxBdRing.ChanBase + XAXIDMA_SRCADDR_OFFSET,
          (u32) TxBufferPtr);
/* Set the run/stop bit in the control register (MM2S_DMACR.RS = 1) */
Xil_Out32(AxiDma.TxBdRing.ChanBase + XAXIDMA_CR_OFFSET,
          Xil_In32(AxiDma.TxBdRing.ChanBase + XAXIDMA_CR_OFFSET)
          | XAXIDMA_CR_RUNSTOP_MASK);
/* Write the length last (MM2S_LENGTH) -- this starts the transfer */
Xil_Out32(AxiDma.TxBdRing.ChanBase + XAXIDMA_BUFFLEN_OFFSET,
          BYTES_TO_SEND);
/* TxDone is set by the DMA interrupt handler */
while (TxDone == 0);
Receiving a Packet
Using Lower-Level Functions

/* Receive a packet */
/* Write the destination address (S2MM_DA) */
Xil_Out32(AxiDma.RxBdRing.ChanBase + XAXIDMA_DESTADDR_OFFSET,
          (u32) RxBufferPtr);
/* Set the run/stop bit in the control register (S2MM_DMACR.RS = 1) */
Xil_Out32(AxiDma.RxBdRing.ChanBase + XAXIDMA_CR_OFFSET,
          Xil_In32(AxiDma.RxBdRing.ChanBase + XAXIDMA_CR_OFFSET)
          | XAXIDMA_CR_RUNSTOP_MASK);
/* Write the length last (S2MM_LENGTH) -- this arms the receive channel */
Xil_Out32(AxiDma.RxBdRing.ChanBase + XAXIDMA_BUFFLEN_OFFSET,
          BYTES_TO_RCV);
/* RxDone is set by the DMA interrupt handler */
while (RxDone == 0);
PL-PS Interfaces

Source: M.S. Sadri, Zynq Training


Accelerator Architecture with Coherent DMA

Source: Building Zynq Accelerators with Vivado HLS, FPL 2013 Tutorial
Coherent AXI DMA-based Accelerator
Communication
Write to Accelerator
processor allocates buffer
processor writes data into buffer
processor flushes cache for buffer (not needed with coherent ACP access)
processor initiates DMA transfer

Read from Accelerator
processor allocates buffer
processor initiates DMA transfer
processor waits for DMA to complete
processor invalidates cache for buffer (not needed with coherent ACP access)
processor reads data from buffer
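Under the assumption that the DMA master sits on the ACP and issues coherent (AxCACHE/AxUSER) transactions, the earlier polled-mode sketch simplifies: the Xil_DCacheFlushRange/Xil_DCacheInvalidateRange calls disappear, and only the transfers and the completion wait remain. A minimal sketch, reusing the AxiDma instance from before:

#include "xaxidma.h"

extern XAxiDma AxiDma;   /* initialized elsewhere, as in the earlier sketch */

/* Coherent (ACP) variant: no cache maintenance, assuming the DMA master's
 * AxCACHE/AxUSER settings and the SCU are configured for coherent ACP access. */
int run_accelerator_coherent(u8 *TxBufferPtr, u32 TxBytes, u8 *RxBufferPtr, u32 RxBytes)
{
    if (XAxiDma_SimpleTransfer(&AxiDma, (UINTPTR)TxBufferPtr, TxBytes,
                               XAXIDMA_DMA_TO_DEVICE) != XST_SUCCESS)
        return XST_FAILURE;
    if (XAxiDma_SimpleTransfer(&AxiDma, (UINTPTR)RxBufferPtr, RxBytes,
                               XAXIDMA_DEVICE_TO_DMA) != XST_SUCCESS)
        return XST_FAILURE;

    while (XAxiDma_Busy(&AxiDma, XAXIDMA_DEVICE_TO_DMA))
        ;                 /* result is already coherent with the caches */
    return XST_SUCCESS;
}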
Accelerator Coherency Port (ACP) Summary
ACP allows limited support for hardware coherency
Allows a PL accelerator to access the caches of the Cortex-A9 processors
PL has access through the same path as the CPUs, including caches, OCM, DDR, and peripherals
Access is low latency (assuming data is in the processor cache); no switches in the path
ACP does not provide full coherency
PL is not notified of changes in the processor caches
Use a write to a PL register for synchronization
ACP is a compromise between bandwidth and latency
Optimized for cache-line-length transfers
Low latency for L1/L2 hits
Minimal buffering to hide external memory latency
One shared 64-bit interface, limit of 8 masters
AXI-based DMA Services

Four AXI-based DMA services are provided:
Central DMA (CDMA) - memory-to-memory operations
DMA - memory to/from AXI Stream peripherals
FIFO Memory Mapped To Streaming - streaming AXI interface alternative to traditional DMA
Video DMA - optimized for streaming video applications to/from memory

Source: Xilinx Advanced Embedded System Design on Zynq


Streaming FIFO

Source: Xilinx Advanced Embedded System Design on Zynq


Streaming FIFO

The general AXI interconnect has no support for the AXI Stream interface
axi_fifo_mm_s provides this facility (FIFO included)
Added from the IP Catalog, like all other types of IP
Features:
AXI4/AXI4-Lite slave interface
Independent internal 512 B - 128 KB TX and RX data FIFOs
Full-duplex operation

Source: Xilinx Advanced Embedded System Design on Zynq


Streaming FIFO

Slave AXI connection
RX/TX FIFOs
Interrupt controller
Control registers
Three user-side AXI Stream interfaces:
TX data
RX data
TX control
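To give a sense of how software drives this core, the sketch below pushes one packet through the TX path with raw register accesses. The offsets follow the AXI4-Stream FIFO register map in PG080 (TDFV = 0x0C, TDFD = 0x10, TLR = 0x14) but are assumptions to verify against your core version; FIFO_BASEADDR is a placeholder, and in practice the Xilinx xllfifo driver wraps these accesses.

#include "xil_io.h"
#include "xparameters.h"

#define FIFO_BASEADDR  XPAR_AXI_FIFO_0_BASEADDR  /* placeholder base address        */
#define FIFO_TDFV      0x0C   /* TX data FIFO vacancy, in 32-bit words     (PG080)  */
#define FIFO_TDFD      0x10   /* TX data FIFO write port                   (PG080)  */
#define FIFO_TLR       0x14   /* TX length register, in bytes              (PG080)  */

/* Push one packet of 'Words' 32-bit words out of the TX stream interface */
void fifo_send(const u32 *Buf, u32 Words)
{
    while (Xil_In32(FIFO_BASEADDR + FIFO_TDFV) < Words)
        ;                                        /* wait for enough FIFO space      */

    for (u32 i = 0; i < Words; i++)
        Xil_Out32(FIFO_BASEADDR + FIFO_TDFD, Buf[i]);

    /* Writing the byte count to TLR commits the packet to the AXI Stream side */
    Xil_Out32(FIFO_BASEADDR + FIFO_TLR, Words * 4);
}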
AXI Video DMA Controller

Source: Xilinx Advanced Embedded System Design on Zynq


Design Goal

A hardware accelerator capable of working for arbitrary values of the parameters
lm, ln, lp, defined in software, with the only limitations imposed by
the total size and the word size of the internal memories.
Passing Parameters to an Accelerator

Option 1: Parameters (e.g., lm, ln, lp) are passed using AXI_Lite

Option 2: Parameters (e.g., lm, ln, lp) are passed in the header of the input data

Option 3: Parameters are inferred from the size of the transmitted input data
(not possible in the general case of matrix multiplication)

Input size: (2^(lm+ln) + 2^(lp+lm)) * 8 bits

Output size: 2^(lp+ln) * 32 bits (for lm <= 16)

Example: for lm = ln = lp = 10, the input is (2^20 + 2^20) * 8 bits = 2 MiB
and the output is 2^20 * 32 bits = 4 MiB.
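A sketch of Option 1 is shown below: the PS writes lm, ln, lp into AXI4-Lite slave registers of the accelerator before streaming the data in. The base address macro and register offsets are hypothetical; a real core generated in Vivado (or by Vivado HLS) defines its own register map.

#include "xil_io.h"
#include "xparameters.h"

/* Hypothetical AXI4-Lite register map of the matrix-multiply accelerator */
#define MATMUL_BASEADDR  XPAR_MATMUL_0_S_AXI_BASEADDR
#define MATMUL_REG_LM    0x00
#define MATMUL_REG_LN    0x04
#define MATMUL_REG_LP    0x08

/* Option 1: pass the problem dimensions over AXI4-Lite before sending the data */
void matmul_set_params(u32 lm, u32 ln, u32 lp)
{
    Xil_Out32(MATMUL_BASEADDR + MATMUL_REG_LM, lm);
    Xil_Out32(MATMUL_BASEADDR + MATMUL_REG_LN, ln);
    Xil_Out32(MATMUL_BASEADDR + MATMUL_REG_LP, lp);
}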
Choosing Optimal Parameters

Source: M.S. Sadri, Zynq Training


Energy and Performance
Exploration of Accelerator
Coherency Port Using Xilinx ZYNQ

Mohammadsadegh Sadri, Christian Weis, Norbert Wehn, and Luca Benini
Department of Electrical, Electronic and Information Engineering (DEI), University of Bologna, Italy
Microelectronic Systems Design Research Group, University of Kaiserslautern, Germany
{mohammadsadegh.sadr2,luca.benini}@unibo.it, {weis,wehn}@eit.uni-kl.de
Processing Task Definition

We define different methods to accomplish the task and measure execution time & energy.

[Figure: processing task. Source and result image buffers (image_size bytes each) are
allocated with kmalloc or dma_alloc_coherent, depending on the memory-sharing method.
Packet selection (addressing) is either normal or bit-reversed. Each iteration reads
from the source address, runs the data through an FIR processing stage (128K FIFO),
and writes to the destination address; the loop repeats N times while the execution
interval is measured. Image sizes: 4 KB, 16 KB, 64 KB, 128 KB, 256 KB, 1 MB, 2 MB.]
Memory Sharing Methods

ACP Only: Accelerator → ACP → SCU → L2 → DRAM
(HP Only is similar, but there is no SCU or L2 in the path)

CPU Only (with & without cache)

CPU→ACP (CPU→HP is similar): the CPU and the accelerator cooperate,
with the data alternating ACP → CPU → ACP → ...
Speed Comparison

[Chart: throughput vs. image size (4 KB to 1 MB). Annotations: "ACP loses!";
CPU→OCM lies between CPU→ACP and CPU→HP; marked rates: 298 MBytes/s and 239 MBytes/s.]
Energy Comparison

CPU-only methods: worst case!

CPU→OCM is always between CPU→ACP and CPU→HP

CPU→ACP always has better energy than CPU→HP0

As the image size grows, CPU→ACP converges to CPU→HP0
Lessons Learned & Conclusion

If a specific task should be done by the accelerator only:
For small arrays, ACP Only and OCM Only can be used
For large arrays (> size of the L2 cache), HP Only always performs better

If a specific task should be done by the cooperation of CPU and accelerator:
CPU→ACP and CPU→OCM are always better than CPU→HP in terms of energy
If we are running other applications that depend heavily on caches,
CPU→OCM and then CPU→HP are preferred!
