
FPGA PCIe Bandwidth

Mike Rose Department of Computer Science and Engineering University of California San Diego June 9, 2010

Abstract
The unique fusion of hardware and software that characterizes FPGAs has had a strong impact on the research community. Here at UCSD, there have been many successful applications of FPGAs, especially in the implementation of Machine Learning techniques that react in real time. I have demonstrated that with the addition of DMA transfers, the bandwidth between the FPGA and the host computer is tripled for large data transfers. This increase in communication bandwidth will enhance the capabilities of previous work and will permit otherwise impossible future applications.

Introduction

Field Programmable Gate Arrays (FPGAs) have already shown an unparalleled ability to provide inexpensive real-time video processing solutions that are simply unavailable with competing technologies. This allows for unrivaled human-machine interaction by achieving visual response times of less than 100 ms, which are undetectable to the user. Many ongoing projects at UCSD have been utilizing these FPGA devices, made by Xilinx, with great success. An especially impressive UCSD project that exploited FPGAs is the real-time face detector written by Junguk Cho et al. [1], which achieved a thirty-five times increase in system performance over an equivalent software implementation. Another exciting use of FPGA boards currently in development is a 60 frames per second point tracker by Matt Jacobson and Sunsern Cheamanunkul [4]. Mayank Cabra has also been looking into incorporating FPGAs into his work on analyzing pathology images for cancer detection.

Most of our FPGA solutions rely on both hardware and software components. With purely software systems, the latency is simply too high for real-time use because you are forced to buffer and parse large amounts of incoming data. And while it is relatively easy to integrate small, specialized hardware modules to accelerate simple tasks, the development effort and resources required to implement complex systems entirely in hardware are often prohibitive. This means that in the majority of complex systems we must rely on hardware and software cooperation.

There are two ways that software programs are run in our systems: on Microblaze, a small processor that runs directly on the FPGA using a portion of the programmable hardware on the chip, or on a workstation (desktop computer) attached to and powering the FPGA. Microblaze is much closer to our hardware, since it runs on the same chip. This results in low communication latency, but Microblaze is limited by a very slow clock speed of only about 100 MHz and a small amount of available memory. Desktop computers, in comparison, have easier programming environments, seemingly limitless memory, multiple CPUs, and fast clock speeds. For these reasons, it was my goal to increase the bandwidth of communication between our FPGA devices and our workstations, so that we can take advantage of the superior performance of powerful workstation computers and increase the capabilities of the projects designed by my fellow students.

PCIe Overview

On our Xilinx Virtex-5 FPGA board, the Peripheral Component Interconnect Express (PCIe) interface is ideal for our communication needs because it is capable of both high-bandwidth and low-latency data transfer. Its major advantage over our other communication options is that it plugs the FPGA directly into the motherboard itself (like an expansion card). PCIe connections are formed from serial links (called lanes) rather than a parallel interconnect. The new Xilinx FPGA boards support multiple lanes (up to 16X) on the PCIe interconnect, which allows for greater communication bandwidth, especially for unrelated or bidirectional data streams. Above the physical layer lies the Transaction Layer, which is the heart of the PCIe communication protocol. The Transaction Layer segments data into Transaction Layer Packets (TLPs), which guarantees in-order transmission across the bus.

Even though the protocol claims it is capable of 250 MB/s throughput across a single lane of the PCIe, this is usually unattainable in practice. In fact, in preliminary tests here at UCSD, we were only able to send, process, and send back data at a rate of about 3.5 MB/sec [2]. My first goal when I undertook this project was to investigate why our throughput numbers were so low when the protocol had given us such high hopes. Upon reading about the PCIe, it quickly became evident that it is optimized for large data transfers (such as those that come constantly streaming from a graphics card). Since the entire memory size of the system we were testing was only 64KB, it was impossible for data transfers larger than that to occur. I also discovered that using only normal read and write calls via Programmed Input/Output (PIO) communication in the drivers means that each individual TLP packet can only contain a payload of 4 bytes, instead of something closer to the defined maximum of 4KB.

Figure 1: TLP Packet Breakdown

Xilinx performed a theoretical analysis of the protocol based solely on the space overhead present in every packet (see Figure 1) and showed that the maximum theoretical efficiency can be computed with the following equation:

    Efficiency = Payload * 100 / (Payload + 12 + 6 + 2)

So if we use only 4-byte payloads, this already drops even the theoretical maximum efficiency down to 16.7%, instead of the 99% achieved with 2 KB payloads [8]. But what complicated things in our case was that the systems we were using were not actually using just basic reads and writes through the driver. We had a memory mapping in place that allowed us to access the memory of the FPGA as if it were our own. How memory mappings are implemented is completely machine dependent, so it was very difficult to know whether the same limitations applied in this case or whether Linux was using this mapping to optimize our data transfers. The recommended way to take advantage of the full payload size of TLP packets is to use Direct Memory Access (DMA). In a DMA transfer the PCIe core knows the amount of data to be transferred beforehand and can segment the data accordingly.
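As a quick sanity check on these numbers, the efficiency formula above can be evaluated directly. The short C program below (a minimal sketch, not part of the benchmark) reproduces the 16.7% and 99% figures for 4-byte and 2 KB payloads.

    #include <stdio.h>

    /* Theoretical TLP efficiency: payload over payload plus 20 bytes of
       per-packet overhead (12 + 6 + 2), expressed as a percentage. */
    static double tlp_efficiency(double payload_bytes)
    {
        return payload_bytes * 100.0 / (payload_bytes + 12 + 6 + 2);
    }

    int main(void)
    {
        printf("4 B payload : %.1f%%\n", tlp_efficiency(4.0));    /* ~16.7% */
        printf("2 KB payload: %.1f%%\n", tlp_efficiency(2048.0)); /* ~99.0% */
        return 0;
    }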

Related Work

Patrick Lai, Matt Jacobson, and I began exploring the capabilities and possibilities of using FPGAs to implement Machine Learning algorithms under the direction of Yoav Freund here at UCSD in the Fall of 2007. It was at this time that we uncovered and created much of the framework still used today when beginning UCSD's FPGA-integrated projects [4]. At first we were only able to communicate with our workstation through an extremely limited JTAG interface, but Patrick went on to find and edit a driver by Xilinx allowing us to communicate via PCIe with much more ease and higher bandwidth [3]. I cannot stress enough that the technologies and ideas used in this project were not groundbreaking or new. Xilinx has had the framework to do this DMA communication via the PCIe for quite some time, and others have managed to implement similar systems in the past [7]. However, my objective was to investigate and design a solution so that those of us researching here at UCSD could attain analogous results for our systems. Evan Ettinger designed the initial bandwidth benchmark of our PCIe communication, which I edited and used for the baseline performance and on which I drew my comparisons [2].

Design and Implementation

The main communication concern that we wished to address was how to maximize the throughput (data per second) that could be sent from the FPGA to the workstation, processed, and then returned to the FPGA. Since we wanted to benchmark the bandwidth of a round trip, we reasoned that the communication bandwidth would be the same whether the data originated on the workstation or the FPGA, as long as a full cycle was completed. It was easier at the time to have the controlling application reside on the workstation, since development of a normal C program is much quicker and taking timing measurements is far simpler. The other reason it was easier to create a benchmark with the workstation in control was that our communication with the FPGA was still asymmetric: the workstation was completely in control of the source and destination addresses of any data transmissions across the PCIe. An important byproduct of my research was enabling the FPGA components to control reads and writes of workstation memory, creating a fully symmetric communication system.

Figure 2: Original Design Flow

With these goals in mind, the following benchmark was devised by Evan Ettinger (see Figure 2). First, the workstation fills a buffer with test data. It then begins timing and sends the data over the PCIe using its mapping into BRAM (the FPGA's fast access memory). When the transmission is complete, the benchmark sets a flag in a specific location in BRAM signaling that the data is ready for processing. Meanwhile, a hardware module on the FPGA named the bitflipper polls this location. Once it has been updated, the bitflipper wakes and begins negating each word of data that was received. The bitflipper then signals the workstation by writing to a different location in BRAM. In the meantime, the workstation must poll BRAM across the PCIe until it receives notification that the processing is complete. The workstation then reads all of the data back from the FPGA. Lastly, it stops the timer and checks the validity of the data by comparing the returned data to the inverse of the input data. The bitflipper was added because it proved to be a useful tool for verifying that our data was indeed received by the FPGA and that no caching or errors were occurring in our data transfers.

Based on my initial research, there were two areas that I perceived as the most likely bottlenecks in the benchmark. The first was using the memory map as our main mechanism for data transfers. Since the mechanisms used inside the memory map were such an unknown area, it seemed very plausible that the resulting transmissions were only sending one word at a time. I decided that the most worthwhile optimization I could hope to achieve would be to convert the system to use DMA transfers, which are capable of taking full advantage of the PCIe protocol. The other issue was having the workstation completely in control of data transmission. This not only created a strange, unnatural dependency and an ugly interface between the two halves of our systems, but I also believed that having the workstation poll across the PCIe channel was creating latency that could dominate the running time of small data transfers.
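To make the baseline flow concrete, the workstation side of the original benchmark can be sketched roughly as below. This is a simplified sketch, not the actual benchmark code: the BRAM word offsets are invented placeholders, and bram stands for the driver's memory mapping of the FPGA's BRAM.

    #include <stddef.h>
    #include <stdint.h>

    /* Placeholder BRAM layout; the real offsets are fixed by the hardware
       design and the driver, not here. */
    #define DATA_WORDS  0x0000   /* start of the test data region        */
    #define READY_FLAG  0x0FFE   /* workstation -> FPGA: data is ready   */
    #define DONE_FLAG   0x0FFF   /* FPGA -> workstation: flipping done   */

    /* One round trip of the original memory-mapped (PIO) benchmark. */
    static int round_trip(volatile uint32_t *bram, const uint32_t *in,
                          uint32_t *out, size_t words)
    {
        size_t i;

        for (i = 0; i < words; i++)      /* send test data over the PCIe  */
            bram[DATA_WORDS + i] = in[i];
        bram[READY_FLAG] = 1;            /* flag the bitflipper           */

        while (bram[DONE_FLAG] == 0)     /* poll for completion across    */
            ;                            /* the PCIe                      */

        for (i = 0; i < words; i++)      /* read the results back         */
            out[i] = bram[DATA_WORDS + i];

        for (i = 0; i < words; i++)      /* verify: each word negated     */
            if (out[i] != ~in[i])
                return -1;
        return 0;
    }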

4.1 DMA Daemon

To realize these optimizations, the first step was to incorporate a DMA controller into the system. Luckily, Xilinx provides an IP core that can be generated for exactly this purpose. The XPS Central DMA Controller [6] transfers a programmable amount of data from a source address to a destination address. I generated the core and attached it to the PLB (Processor Local Bus). The PLB is the central bus of the FPGA; it can be accessed from the Microblaze and is used to reach other peripherals like BRAM and the PCIe bridge. The next step was completing a local standalone DMA transfer. To achieve this, I wrote a small program running in the Microblaze on the FPGA that put some data into BRAM. For the Microblaze to initiate a DMA transfer it must edit four registers, located at the XPS Central DMA Controller's base address on the PLB plus corresponding offsets for the source address, destination address, DMA control (DMACR), and length. The source and destination in this case were both addresses within BRAM, so the transfer would be local. Lastly, when the length register is set, the DMA transfer commences. It was quite useful to discover that while the transaction is underway you can poll the DMA status register to find out when the transfer has completed. Upon finishing and debugging this code, I had my first successful DMA transfer.

The bitflipper IP core was only ever connected directly to BRAM and never interfaced to the PLB. This created an interesting problem, since it meant the bitflipper was incapable of using the DMA Controller on its own. A colleague advised me that writing hardware-level code to interface with the PLB was achievable but would not be a good use of my time. So instead of having the bitflipper IP core communicate with the DMA Controller directly, I made a daemon that runs on the Microblaze, polls specific BRAM locations, and passes the parameters it finds there on to the DMA Controller. I realized this would also be very useful for ordering DMA transfers from the workstation, since we already had working access to BRAM. This simple daemon provides a clean and useful interface to the DMA Controller and bridges the communication gap.
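A rough sketch of the Microblaze-side code is shown below. The base address, register offsets, busy bit, and the BRAM "mailbox" layout are all illustrative assumptions; the authoritative register map is in the XPS Central DMA Controller data sheet [6].

    #include <stdint.h>

    /* Illustrative addresses only -- substitute the values generated for
       your own hardware design. */
    #define DMA_BASE     0x80200000u  /* XPS Central DMA base on the PLB     */
    #define DMACR_OFF    0x04u        /* DMA control register                */
    #define SRC_OFF      0x08u        /* source address register             */
    #define DST_OFF      0x0Cu        /* destination address register        */
    #define LEN_OFF      0x10u        /* length; writing this starts the DMA */
    #define DMASR_OFF    0x14u        /* status register                     */
    #define DMASR_BUSY   0x80000000u  /* assumed "transfer in progress" bit  */

    #define MBOX_BASE    0x00004000u  /* BRAM mailbox polled by the daemon   */

    static inline void reg_write(uint32_t off, uint32_t val)
    {
        *(volatile uint32_t *)(DMA_BASE + off) = val;
    }

    static inline uint32_t reg_read(uint32_t off)
    {
        return *(volatile uint32_t *)(DMA_BASE + off);
    }

    /* Order one DMA transfer and spin until the controller reports done. */
    static void dma_transfer(uint32_t src, uint32_t dst, uint32_t bytes)
    {
        reg_write(DMACR_OFF, 0);      /* default word-to-word transfer      */
        reg_write(SRC_OFF, src);
        reg_write(DST_OFF, dst);
        reg_write(LEN_OFF, bytes);    /* setting the length starts the DMA  */
        while (reg_read(DMASR_OFF) & DMASR_BUSY)
            ;                         /* poll the DMA status register       */
    }

    /* Daemon loop: the bitflipper (or the workstation driver) drops a
       (src, dst, len) request into the mailbox; a nonzero length means a
       request is pending. */
    static void dma_daemon(void)
    {
        volatile uint32_t *mbox = (volatile uint32_t *)MBOX_BASE;
        for (;;) {
            if (mbox[2] != 0) {                    /* mbox[2] = length      */
                dma_transfer(mbox[0], mbox[1], mbox[2]);
                mbox[2] = 0;                       /* mark request handled  */
            }
        }
    }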

4.2 Workstation Driver

Now that I had verified my ability to make local DMA transfers, it was necessary to make the workstation's memory addressable from the FPGA. The only place this can be done is at the kernel driver level, because user-level applications live in a world of virtual addresses that are process specific and would mean nothing to the out-of-context hardware on the FPGA. At the kernel level you can allocate true physical addresses, but here I uncovered another interesting problem. The workstation we were using had a 64-bit address space, so a buffer allocated with a simple call to kmalloc could very well lie in a region not addressable with only 32 bits. After some searching, I found exactly the system call required for this situation. The pci_alloc_consistent call takes a size and returns both a CPU virtual memory address to be used by the PC and a hardware address (guaranteed to be within the first 32 bits of address space) that maps to the same memory and can be used by the device. This memory is also managed to ensure consistency, meaning that either the workstation or the FPGA could be accessing it, but not both at any given time (using a mutex locking system).

With this useful call in mind, I could now make some changes to the driver. Upon loading, the driver now allocates two buffers: one for sending data to the FPGA, and one for receiving data from the FPGA. Using the previously working mechanism for writing across the PCIe (writel), I manually write the hardware addresses to specific locations in BRAM. These locations are picked up by the daemon and stored for later use.

I could then begin changing the read and write functions of the driver. Instead of issuing individual writes for each word, the write function now copies the user data into my preallocated workstation-to-FPGA buffer and simply sends the DMA parameters of source, destination, and length to my DMA daemon. I was happy to discover that with the new system in place, reads became even easier. The bitflipper is now responsible for initiating transfers from the FPGA to the workstation on its own, so the only functionality required in the workstation driver's read function is to copy the contents of the shared buffer back to the user. This means that for a read operation there is no longer a need to transmit data across the PCIe.
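The driver-side changes can be sketched as follows. This is a simplified sketch rather than the real driver: the BRAM offsets are placeholders, error paths are trimmed, bram stands for the driver's existing I/O mapping of BRAM, and notify_dma_daemon() is a hypothetical stand-in for the BRAM handshake that hands (source, destination, length) to the daemon.

    #include <linux/fs.h>
    #include <linux/io.h>
    #include <linux/pci.h>
    #include <linux/uaccess.h>

    #define BUF_SIZE      (32 * 1024)
    #define BRAM_TX_SLOT  0x3FE0      /* placeholder BRAM offsets where the  */
    #define BRAM_RX_SLOT  0x3FE4      /* daemon expects the bus addresses    */

    static void __iomem *bram;        /* existing iomap of the FPGA's BRAM   */
    static void *tx_buf, *rx_buf;     /* kernel virtual addresses            */
    static dma_addr_t tx_bus, rx_bus; /* 32-bit-safe bus (hardware) addresses */

    /* Hypothetical helper that writes (src, dst, len) into BRAM for the
       DMA daemon running on the Microblaze. */
    static void notify_dma_daemon(dma_addr_t src, size_t len);

    static int alloc_dma_buffers(struct pci_dev *pdev)
    {
        /* Coherent buffers usable by both the CPU and the FPGA. */
        tx_buf = pci_alloc_consistent(pdev, BUF_SIZE, &tx_bus);
        rx_buf = pci_alloc_consistent(pdev, BUF_SIZE, &rx_bus);
        if (!tx_buf || !rx_buf)
            return -ENOMEM;

        /* Publish the hardware addresses so the FPGA side can find them. */
        writel((u32)tx_bus, bram + BRAM_TX_SLOT);
        writel((u32)rx_bus, bram + BRAM_RX_SLOT);
        return 0;
    }

    static ssize_t fpga_write(struct file *f, const char __user *ubuf,
                              size_t len, loff_t *off)
    {
        if (len > BUF_SIZE)
            len = BUF_SIZE;
        if (copy_from_user(tx_buf, ubuf, len))  /* stage data in the buffer */
            return -EFAULT;
        notify_dma_daemon(tx_bus, len);         /* let the daemon start DMA */
        return len;
    }

    static ssize_t fpga_read(struct file *f, char __user *ubuf,
                             size_t len, loff_t *off)
    {
        /* The bitflipper has already DMA'd its results into rx_buf, so a
           read is just a copy back to user space -- no PCIe traffic here. */
        if (len > BUF_SIZE)
            len = BUF_SIZE;
        if (copy_to_user(ubuf, rx_buf, len))
            return -EFAULT;
        return len;
    }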

4.3 PCIe Bridge

Now most of the pieces of my project were in place. The DMA Controller was complete and the workstation driver was able to accept data transfers. But the next and final piece needed to connect it all together proved very challenging. The PCIe Bridge (more specifically, the Xilinx PLBv46 PCIe core [5]) that we had always relied on for PCIe communication was still only configured to allow data requests that originated from the workstation. We had always known it should be capable of allowing the FPGA the same freedom, but we had never understood how to accomplish this. The PCIe bridge itself is actually broken up into two parts: the PCIe Controller, which we had used before to enable the workstation to access BRAM, and the PCIe Bar, which is meant to be an address window on the PLB that maps directly to workstation memory. The Bar had so far gone unused, and it was now my task to use the Controller to initialize and configure the PCIe Bar.

The first issue I faced was that, initially, I could only find a mechanism to configure the workstation memory location that the Bar should map to at compile time. This was not very useful, since I did not know the hardware address of my buffer until the driver was up and running. About half the time it would get the same address, but occasionally it would not, and it was unacceptable to consider a solution dependent on something so arbitrary. I finally found that if I rebuilt the Controller with the parameter C_INCLUDE_BAROFFSET_REG set to 1, there would be an allocated register named IPIFBAR2PCIBAR_0L where I could write my destination address at runtime [5].

The next problem I confronted can be summed up in one word: endianness. The Microblaze architecture is big endian, while the workstation, like any x86 computer, is a little endian system. This means that when I transferred over the hardware address that the Bar should point to, it was received but scrambled. A bug like this is easily fixable with a short method that flips the endianness of your data, though it was a difficult bug to detect. The hardest part was confirming which endianness I should even be using, since I could find no documentation stating which endianness the PCIe Controller expects an address to be written in.

The final difficulty I dealt with in altering the PCIe Bridge was due to alignment. A subtlety of how the PCIe Bar is designed is that it can only be configured to begin at addresses that are a multiple of its allocated address range. At first, I had given it a very large address range without much thought. I later found out this was causing the Bar to ignore many of the least significant bits of the hardware address. This made the bridge appear not to be working when in fact it was; the beginning of the window was simply located far before the beginning of my buffers. This problem can be alleviated either by lowering the address size of the Bar or by using bit manipulation on the hardware location to add the correct relative offset to the Bar.
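For illustration, the runtime Bar setup might look like the sketch below. The register name comes from the core's documentation [5], but the bridge base address, the register offset, and the exact point at which the byte swap is applied are assumptions made for the sketch, not values taken from our design.

    #include <stdint.h>

    /* Swap the byte order of a 32-bit word (Microblaze is big endian,
       the x86 workstation is little endian). */
    static uint32_t swap32(uint32_t x)
    {
        return ((x & 0x000000FFu) << 24) |
               ((x & 0x0000FF00u) <<  8) |
               ((x & 0x00FF0000u) >>  8) |
               ((x & 0xFF000000u) >> 24);
    }

    /* Placeholder addresses for the PLBv46 PCIe bridge and the BAR offset
       register exposed when C_INCLUDE_BAROFFSET_REG is set to 1. */
    #define PCIE_BRIDGE_BASE       0x85E00000u
    #define IPIFBAR2PCIBAR_0L_OFF  0x208u

    /* Point the PCIe Bar at the workstation buffer whose bus address was
       delivered through BRAM.  Remember that the Bar only honors addresses
       aligned to its configured range, so any low-order bits must be
       carried as a relative offset instead. */
    static void set_pcie_bar(uint32_t bus_addr_from_bram)
    {
        volatile uint32_t *reg = (volatile uint32_t *)
            (PCIE_BRIDGE_BASE + IPIFBAR2PCIBAR_0L_OFF);
        *reg = swap32(bus_addr_from_bram);  /* correct the endianness first */
    }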

4.4 The Completed Benchmark

The resulting data flow of the benchmark can be seen in Figure 3. The blue arrows represent the process of sending data to the FPGA, and the red arrows represent the return trip. As the diagram shows, the driver and the bitflipper both communicate with the DMA Daemon, but neither one is responsible for sending the bulk of the data being transferred. This system also allows both the workstation and the FPGA to poll only local memory, which alleviates the more costly process of polling across the PCIe.

Figure 3: Modified Design Flow

Results

The results presented in this section were obtained by running the baseline benchmark, which did not include the configured PCIe Bar or DMA cores, against the new system that I designed. All of the timings included in this paper were calculated using the RDTSC instruction to count CPU clock cycles; the cycle counts were divided by the workstation's clock speed to yield times. I also checked these values using gettimeofday, which returned very similar results. Each test consisted of 10 trials, each of which requested 1000 DMA transfers in both directions for each of the varying data sizes. I recorded both averages and standard deviations to ensure the quality of my results. Figure 4 shows the results of running the original benchmark in green (on the left) and the results of running the new benchmark in blue (on the right). The error bars represent the standard deviation in the test data, which remained very small. Table 1 shows the results from the same trials, with the ratio of improvement added. It can clearly be seen that throughput improved for all sizes of data processed after the optimizations were added.

Figure 4: Processed Data Throughput

It was especially interesting to observe that even the small 4-byte transfers showed a slight improvement, despite the extra overhead incurred to begin a DMA transfer. This can only be explained by the fact that we have eliminated the need to poll for data readiness across the PCIe. As the data size increases, the advantage of using DMA transfers begins to dominate: by the time we transfer 2 kilobytes of data, we have surpassed a 3x improvement in our data processing capabilities. Even though the results show great improvement, the numbers still do not approach the maximum bandwidth limits imposed by the PCIe protocol. However, I do not believe this in any way undercuts the impact that this added bandwidth will have.

    Bytes   Original   Modified   Ratio
    4        0.28       0.42      1.50
    8        0.75       0.83      1.10
    16       1.23       1.61      1.32
    32       1.85       3.02      1.63
    64       2.47       4.85      1.96
    128      2.98       6.74      2.26
    256      3.19       8.30      2.60
    512      3.36       9.50      2.83
    1K       3.46      10.26      2.96
    2K       3.49      10.70      3.06
    4K       3.52      10.96      3.11
    8K       3.53      11.08      3.14
    16K      3.54      11.16      3.15
    32K      3.55      11.19      3.16

Table 1: Processed Data Throughput (MB/sec)
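For completeness, the cycle-counter timing used for the measurements above can be sketched as follows (a minimal sketch; CPU_HZ stands for the workstation's measured clock speed, and the transfer loop is elided).

    #include <stdint.h>
    #include <stdio.h>

    /* Read the x86 time-stamp counter. */
    static inline uint64_t rdtsc(void)
    {
        uint32_t lo, hi;
        __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
    }

    #define CPU_HZ 2.4e9   /* placeholder for the workstation's clock speed */

    int main(void)
    {
        uint64_t start = rdtsc();
        /* ... perform 1000 round-trip transfers here ... */
        uint64_t elapsed = rdtsc() - start;
        printf("elapsed: %.6f s\n", (double)elapsed / CPU_HZ);
        return 0;
    }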

Future Work

Accelerating bulk communication between the workstation and the FPGA will remain an open research problem here at UCSD, as it will always limit the abilities of our researchers. I believe the next step is to contact Xilinx on their forums and ask what can be done to improve our results further. An interesting topic that I was unable to test due to time constraints was the impact of using wider lane widths for the PCIe communication on the FPGA. The ML506 board we are currently using only supports 1X communication, but the slightly newer ML555 series has 8X capability, and the just-released Virtex-6 FPGA boards have been reported to support up to 16X. I have read mixed figures on how much is gained by increasing the width, but using 8X lane widths did result in close to an 8 times speedup for this bandwidth test run by Xilinx [8], and wider links are said to be especially beneficial for bidirectional communication.

The other great improvement to the benchmark would be to take advantage of the time in which the PCIe is not full. The benchmark could be pipelined so that there are multiple buffers of data and one buffer could be transferred while the data in another is being processed. I think this is the only true way to maximize the usage of the PCIe. I hope that Xilinx is able to diagnose the limiting factors in our current communication method. But at the very least, when pipelined data flow is incorporated along with the wider lane widths of newer boards, we will see far superior performance.

References

[1] Cho, J., Mirzaei, S., Oberg, J., and Kastner, R. FPGA-based face detection system using Haar classifiers. In FPGA '09: Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays (New York, NY, USA, 2009), ACM, pp. 103-112.

[2] Ettinger, E. CSE291 BRAM PCIe. http://seed.ucsd.edu/mediawiki/index.php/CSE291_BRAM_PCIe.

[3] Lai, P. PatrickLabDiary. http://seed.ucsd.edu/mediawiki/index.php/PatrickLabDiary.

[4] UCSD. FPGA. http://seed.ucsd.edu/twiki/bin/view/FPGAWeb/WebHome.

[5] Xilinx. LogiCORE IP PLBv46 RC/EP Bridge for PCI Express (v4.04a). http://www.xilinx.com/support/documentation/ip_documentation/plbv46_pcie.pdf.

[6] Xilinx. LogiCORE IP XPS Central DMA Controller (v2.01c). http://www.xilinx.com/support/documentation/ip_documentation/xps_central_dma.pdf.

[7] Xilinx. PCI Express forum. http://xlnx.lithium.com/t5/PCI-Express/bd-p/PCIe;jsessionid=CD9813F58F3E0898E51FD9AD783737A4.

[8] Xilinx. Spartan-6 FPGA Connectivity Targeted Reference Design Performance. http://www.xilinx.com/support/documentation/ip_documentation/xps_central_dma.pdf.
