
Generating FPGA Accelerators for Chemical Similarity Assessment

Nikolaos Alachiotis
Dept. of Electrical and Computer Engineering
Carnegie Mellon University
Pittsburgh, PA 15217
Email: nalachio@ece.cmu.edu

Abstract: Numerical measures of similarity/distance between objects represented by binary vectors are common in a wide range of disciplines. Searching in large-scale chemical databases requires billions of comparisons between molecules that are represented by binary fingerprints to capture the atomic structure. The performance bottleneck here is the enumeration of set bits in vectors (population count). Due to the discrete representation, similarity measures between binary fingerprints should fit well to FPGAs. We present an architecture to accelerate binary similarity assessment, evaluate various design points, and compare performance to highly optimized CPU and GPU implementations. We implement an RTL generation software, SimGenRTL, to generate accelerators of various sizes based on the proposed architecture. We find that accelerators with fewer and wider population counters allow better distribution of the hardware resources, significantly outperforming accelerators with more and narrower bit-enumeration components. SimGenRTL is available for download to allow rapid design space exploration of the computational core ahead of a full custom system implementation.

I. INTRODUCTION

A wide range of applications in chemical informatics rely on similarity calculations as a critical subroutine. Virtual screening, which relies on compound database searching, has emerged as a cost-effective in silico method for early-stage drug design and discovery. Database clustering is also widely used in pharmaceutical chemistry to create ensembles of molecules that share similar biological activity. Thus, finding similarity in the multimillion-compound databases currently available (e.g., PubChem [1], 63,160,076 compounds, last accessed March 4, 2015) has a key role in chemical informatics, but also requires immense computational power.

A common approach to assess chemical similarity between compounds involves the calculation of similarity measures on 2D fingerprints, i.e., binary vectors of fixed length where each bit represents the existence or absence of a substructural fragment in a molecule. Albeit conceptually simple, fingerprint-based similarity assessment has proven very effective, and several measures have been proposed [2], with the most widely used being the Jaccard index/Tanimoto coefficient. Similarity assessment between objects represented by binary vectors is employed in other disciplines as well. In population genetics, genetic polymorphisms (SNPs) are represented by binary vectors to facilitate association studies.

Assessing similarity between two objects A and B represented by binary vectors consists of two distinct stages. During the first stage, a number of population count and accumulate operations are required to calculate the total number of set bits in A, B, and their intersection A ∩ B. In the second stage, a similarity coefficient is calculated. Typically, the first stage is common not only among different chemical similarity measures (e.g., Jaccard/Tanimoto, Dice, and Cosine) but also among different applications (e.g., SNP association). However, it has been reported that 2D similarity calculations can achieve only up to 65% of the theoretical peak performance on modern CPU architectures [3]. Based on a performance evaluation of the SNP association subroutine in the software OmegaPlus [4], we conclude that such a peak performance deviation might be associated, to a certain extent, with possible architectural restrictions of modern x86 CPUs that do not allow the efficient combination, at the microkernel level, of SSE or AVX vector instructions with the population count instruction supported in hardware, which operates on regular registers with single-cycle throughput (available since Intel Nehalem and AMD K10).

Here, we present a versatile reconfigurable architecture to accelerate fingerprint-based similarity calculations. Further, we implement SimGenRTL (available at https://github.com/alachins/simgenrtl), an RTL generation software that generates fully synthesizable Verilog for the proposed architecture for a variety of parameters such as the number and topology of the processing units, the width and latency of the population counters, as well as the required multi-bank memory subsystem. Therefore, it allows rapid design space exploration for the similarity kernel, and significantly reduces the engineering effort to adapt the architecture for a device of different size and/or an FPGA board with different DRAM bandwidth.

II. 2D FINGERPRINT SIMILARITY

According to the similar property principle [2], structurally similar molecules tend to exhibit similar biological properties and activity. Given a similarity measure S and a cut-off threshold value D, compounds A and B are considered similar if S_AB ≥ D. Most frequently, chemical similarity is evaluated in silico using fixed-length binary fingerprints, i.e., series of binary digits. A binary fingerprint is generated using a subgraph isomorphism algorithm. The atoms and the bonds between neighboring atoms are examined and organized into patterns. Thereafter, a bit is set for each different pattern. Binary fingerprints are usually called 2D fingerprints to distinguish them from 3D fingerprint schemes that depend on the compound shape.

When fingerprints have been generated to numerically represent the compounds under investigation, a measure is
employed to evaluate the similarity between two compounds from their fingerprints. Given two compounds A and B, with p, q, and x being the numbers of set bits in A, B, and A ∩ B, respectively, the similarity between A and B can be calculated by several measures. The most widely used measure is the Tanimoto coefficient, which relies on the common set bits:

\[ \mathrm{Tanimoto}_{AB} = \frac{x}{p + q - x}. \tag{1} \]

Other rarely used measures are Manhattan, Dice, Cosine, and Euclidean. Typical fingerprint schemes comprise between 164 (MDS fingerprints) and 2,048 bits (Daylight fingerprints).

TABLE I. ARCHITECTURE PARAMETERS

Parameter   Description
m           population counter input bus width (bits)
k           population counter LUT port width (bits)
t           population counter latency (cycles)
n           population counter output width (bits)
x           number of processing units horizontally
y           number of processing units vertically
z           number of words in query memory bank
r           number of words in result memory bank
o           number of result memory banks grouped together
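
To make the two-stage computation and Eq. (1) concrete, the following is a minimal software sketch in Python. It is not the accelerator's implementation: fingerprints are modeled as plain Python integers, the function names `popcount` and `tanimoto` are chosen here for illustration, and returning 0.0 for two all-zero fingerprints is an assumption that the text does not specify.

```python
def popcount(v: int) -> int:
    """Stage 1 helper: number of set bits in a non-negative integer."""
    return bin(v).count("1")


def tanimoto(a: int, b: int) -> float:
    """Stage 2: Tanimoto coefficient x / (p + q - x) of two fingerprints."""
    p = popcount(a)      # set bits in A
    q = popcount(b)      # set bits in B
    x = popcount(a & b)  # set bits common to A and B (intersection)
    union = p + q - x
    # Assumption: two all-zero fingerprints are assigned similarity 0.0.
    return x / union if union else 0.0


if __name__ == "__main__":
    a = 0b10110110  # p = 5
    b = 0b10011100  # q = 4
    print(tanimoto(a, b))  # x = 3, so 3 / (5 + 4 - 3) = 0.5
```

Only the three population counts depend on the fingerprint length; the final division is a fixed cost per pair, which matches the observation that bit enumeration is the stage worth accelerating.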
III. RELATED WORK

Haque et al. [3] described a series of fast methods for population count on modern x86 architectures and a GPU implementation to compute the Tanimoto coefficient. Employing OpenMP and SSE4.2 intrinsics, a throughput of 217.0 × 10^6 coefficients/second (1,024-bit fingerprints) was achieved on a 4-core Intel Core i7-920 CPU running at 2.93 GHz. Furthermore, an Nvidia GTX 480 GPU implementation achieved a throughput of 1,157 × 10^6 coefficients/second. Maggioni et al. [5] presented a GPU algorithm and a series of optimizations to compute all-to-all chemical similarity matrices implementing the Tanimoto, Dice, Cosine, Euclidean, and Manhattan coefficients. The optimized GPU implementation, on an Nvidia GTX 460 GPU, can achieve a throughput of up to 214.0 × 10^6 coefficients/second for computing the Tanimoto coefficient on 256-bit fingerprints. Ma et al. [6] evaluated Tanimoto calculations on GPUs when processing low-sparsity fingerprints. They implemented an integer-based as well as a sparse vector algorithm. The highest reported throughput was 288.9 × 10^6 coefficients/second, processing 992-bit fingerprints, and it was achieved on an Nvidia GTS 250 GPU.

Szanto et al. [7] presented an FPGA accelerator architecture for reducing virtual screening processing times by accelerating Tanimoto calculations. The proposed architecture was mapped to a Virtex-4 LX200 FPGA, and a maximum operating frequency of 100 MHz was reported. The authors employ a lookup table-based approach associated with the similarity cut-off threshold in order to avoid the floating-point division, a design choice that significantly restricts the capability of the architecture to accelerate similar or even identical similarity calculations in other domains. Reported speedups range between 100X and 200X, from the comparison of a single FPGA device against industrial software; however, no benchmarking methodology is provided.

IV. RECONFIGURABLE ARCHITECTURE

The proposed computational core is generated by SimGenRTL based on the parameters summarized in Table I. These parameters specify a single design point.

a) Population counter: The population count pipeline (POPCNT) is outlined in Figure 1. At the lowest level, it consists of an array of lookup tables (ROM blocks) that are, by default, mapped to dual-port block RAMs in order to reduce logic slice utilization by exploiting the large number of available block RAMs in state-of-the-art FPGAs. The number of block RAMs per ROM table is controlled by the k parameter since each table comprises 2^k entries. The number of ROM blocks in the array depends on the m parameter. The output of each ROM block is the number of set bits in the corresponding k-bit input segment, and the final population count value is computed via an adder-tree reduction structure with increasing adder width. To achieve a high operating clock frequency, register stages are inserted in between the adders in the reduction structure.

Fig. 1. Parameterized population counter (left) and processing pipeline unit (PPU, right) generated by SimGenRTL.

b) Processing Pipelined Unit: The main processing pipelined unit (PPU) is illustrated in Figure 1. Initially, a Combining Logic Block (CLB) implements the required operations to construct a single binary vector to capture the common set bits between the objects under comparison. The CLB output vector is validated via the vldin input port to ensure correctness of the accumulation of the POPCNT outputs via the accumulator unit (ACCUM) in the case of input flow disruptions. In addition to the population counter output width, the n parameter also specifies the width of the accumulator. The rst ACCUM input port resets the accumulator when the two objects A and B are streamed in. The final PPU stage is the similarity measure calculation (SMC), which implements the Tanimoto coefficient. The objects A and B are streamed in cycle by cycle in m-bit segments via the separate input ports, and a similarity coefficient is computed every l/m cycles, where l is the smallest multiple of m that is not smaller than the length of the objects under comparison.

c) Similarity Core: The similarity core architecture generated by SimGenRTL is illustrated in Figure 2. The Query Mem comprises x dual-port memory banks (QBanks), and each bank can store up to z m-bit segments. Each object is stored contiguously in a QBank, occupying l/m words, where l is the smallest multiple of m that is not smaller than the length of the object. All QBank ports are connected to the same read/write address and write enable wires. However,
each QBank is assigned a bank ID to allow distributing the query objects to the available banks during the Query Mem initialization phase. During processing, the two available ports in the QBanks provide query objects to the CNT blocks, which accumulate the number of set bits and pass the count and the query objects to the PPU blocks. The CNT block pipeline is identical to the PPU pipeline without the SMC stage. Each of the two PPU blocks comprises y rows of x PPUs each. A PPU column computes similarity coefficients between the same query object and a subset of y database objects, while each PPU row computes similarity coefficients between the same database object and x query objects that are loaded in parallel from the x QBanks. The two PPU blocks comprise in total 2xy processing units. An equal number of output memory banks (RBanks) are used to store the results. The output memory that comprises the 2xy RBanks is organized into two memory blocks, one for every PPU block, of xy dual-port banks each. All RBank ports are assigned a group ID and are driven by the same read/write addresses and write enable wires. For clarity, the write address and write enable generation logic for the RBanks is omitted in Figure 2. Each RBank group consists of o memory banks.

Fig. 2. The similarity core architecture generated by SimGenRTL.

d) Top-level Architecture: The similarity core operates on a subset of query and database objects. To address the generic problem of all-to-all similarity calculations, computations are conducted on chunks of data iteratively. Each iteration computes the similarity coefficients between a limited number of query objects and the entire database. The top-level architecture is depicted in Figure 3. To facilitate the analysis of arbitrary problem sizes and object lengths, we adopt the concept of an accelerator descriptor as introduced by Guo et al. [8]. We implement a light-weight version of the descriptor management logic (FETCH, IMEM, and DECODE units in Fig. 3) to set the number of query/database objects and the object length. The FETCH unit retrieves the descriptor from DRAM, stores it in the IMEM instruction memory, and notifies the DECODE unit. The DECODE unit extracts the required parameters from 8-byte long instructions and configures the control FSMs, i.e., IControl, CControl, and OControl. All objects are stored in memory in chunks whose size does not exceed the DRAM row buffer size to achieve high bandwidth. When all control units are configured, the DECODE unit initiates the similarity calculations by enabling the CControl, which coordinates the operation of the similarity core with the incoming and outgoing data chunks via synchronization with the IControl and OControl units, respectively. For each iteration, a number of query chunks is initially loaded into Query Mem. Thereafter, the database objects are streamed in chunks. We employ double buffering for streaming in the database. Each of the two DMem memory blocks stores a database chunk, and its objects are reused multiple times depending on z, i.e., the size of the QBanks. To provide fingerprints to the y PPU rows, each DMem memory is organized into y banks of width m. When results start being stored in the RBanks using one of the two available ports, the CControl notifies the OControl to start transferring data to DRAM via the second port. Data from multiple RBanks are read per cycle, depending on the o parameter, to saturate the available DRAM bandwidth.

Fig. 3. The top-level similarity evaluation architecture.

V. CODE GENERATION AND EVALUATION

a) Code Generation: The RTL generation software SimGenRTL is implemented in the C language and generates Verilog for the similarity core and the underlying hierarchy of module instances. The command line below represents an invocation example that was used to generate a small similarity core (8 PPUs, henceforth denoted by test8 core) for testing.

    SimGenRTL -name test8 -m 16 -k 8 -t 4 -n 32 -x 4 -y 1 -z 256 -r 256 -o 2

In addition to the required parameters that specify the RTL design (Table I), optional parameters allow selecting between block RAMs and distributed memory individually for each of the modules that instantiate memory blocks, or between using FPGA LUTs and DSP48 slices in the adder-tree reduction structures. The test8 similarity core was mapped on a Virtex5 SX95T-2 FPGA. To verify correctness, we conducted extensive post-place and route simulations using ModelSim 6.3f, and also employed the test environment provided by an open-source PC-FPGA communication platform [9].

b) Evaluation: A series of design points were generated and mapped on a Virtex7 VX980T-2 FPGA. We assess throughput performance in terms of billion Tanimotos per second (bTan/sec) for varying number of PPUs and population count sizes when 1,024-bit fingerprints are processed. Figures 4 and 5 illustrate post-place and route (PAR) resource utilization, maximum operating clock frequency (post-PAR static timing report), and estimated throughput performance. Table II shows the resources for the entire control logic (Control), the similarity measure calculation stage (SMC), and the population count module with m = 16 and t = 4. Note the disproportionately large amount of slice resources occupied by the SMC floating-point operators in comparison to the
population count module. The latency of the bit enumeration stage is proportional to the length of the fingerprints, and rapidly becomes the bottleneck as the fingerprint length increases. Consequently, similarity cores with a large number of PPUs and narrow population counters exhibit poor throughput performance, as can be observed in Figure 4, since the largest fraction of resources is allocated for floating-point operations, which are not performance critical. On the other hand, a moderate number of PPUs allows for significantly wider population count modules, leading to dramatic latency reduction for the bit enumeration stage and a significant throughput performance boost, as can be observed in Figure 5. Table III provides a performance comparison between two accelerator instances (A: 44 PPUs, m=512, t=9, 135 MHz, and B: 80 PPUs, m=256, t=8, 155 MHz, based on 12 GB/s memory bandwidth) and previous works on microprocessors and GPUs.

[Plot: Similarity Core Evaluation for m=16 and t=4. X-axis: Number of Processing Units (PPUs=2xy) at 32, 64, 128, 256. Y-axes: Resource Utilization (%) and Frequency (MHz). Series: occupied slices, slice registers, slice LUTs, BRAMs, DSP48, frequency. Annotated throughput in order of increasing PPU count: T = 0.1, 0.18, 0.32, 0.53.]
Fig. 4. Resource utilization, maximum frequency, and throughput (bTan/sec, 1,024-bit fingerprints, denoted by T) for constant population counter size.

[Plot: Similarity Core Evaluation for x=16 and y=1. X-axis: Population counter size (width m, latency t=log2(m)) at 64, 128, 256, 512. Y-axes: Resource Utilization (%) and Frequency (MHz). Series: occupied slices, slice registers, slice LUTs, BRAMs, DSP48, frequency. Annotated throughput in order of increasing m: T = 0.34, 0.76, 1.29, 2.42.]
Fig. 5. Resource utilization, maximum frequency, and throughput (bTan/sec, 1,024-bit fingerprints, denoted by T) for constant number of PPUs.

TABLE II. OCCUPIED RESOURCES

Resources            Control   SMC     POPCNT (m=16, t=4)
Occupied Slices      868       448     1
Slice Registers      1,218     1,888   4
Slice LUTs           1,877     1,316   0
BRAM (18Kb)/DSP48    4/0       0/0     1/1

TABLE III. PERFORMANCE COMPARISON WITH PREVIOUS WORKS

Platform                    Length (bits)   bTan/sec   FPGA instance   FPGA bTan/sec   Speedup
Haque et al. [3], CPU       1,024           0.217      A               2.97            13.7x
Haque et al. [3], GPU       1,024           1.157      A               2.97            2.6x
Ma et al. [6], GPU          992             0.288      A               2.97            10.3x
Maggioni et al. [5], GPU    256             0.214      B               3.19            14.9x

VI. CONCLUSION

Similarity calculations between binary objects are widely employed in various domains. While current microprocessor architectures provide hardware support for population count operations, deployment with vector intrinsics at the microkernel level leads to suboptimal kernel performance. To this end, a high-performance accelerator architecture for FPGAs was presented and compared to third-party highly optimized software and GPU codes. Additionally, an RTL generation software was developed, which allowed extensive design space exploration. We found that custom FPGA designs are up to 2.6 and 13.7 times faster than GPU and software implementations, respectively, while performance benefits become more prevalent when FPGA resources are allocated for wider pipelined population count modules instead of a high number of independent pairwise comparison units.

REFERENCES

[1] Y. Wang, J. Xiao, T. O. Suzek, J. Zhang, J. Wang, and S. H. Bryant, "PubChem: a public information system for analyzing bioactivities of small molecules," Nucleic acids research, vol. 37, no. suppl 2, pp. W623-W633, 2009.
[2] A. R. Leach and V. J. Gillet, An introduction to chemoinformatics. Springer Science & Business Media, 2007.
[3] I. S. Haque, V. S. Pande, and W. P. Walters, "Anatomy of high-performance 2D similarity calculations," Journal of chemical information and modeling, vol. 51, no. 9, pp. 2345-2351, 2011.
[4] N. Alachiotis, A. Stamatakis, and P. Pavlidis, "OmegaPlus: a scalable tool for rapid detection of selective sweeps in whole-genome datasets," Bioinformatics, vol. 28, no. 17, pp. 2274-2275, 2012.
[5] M. Maggioni, M. D. Santambrogio, and J. Liang, "GPU-accelerated Chemical Similarity Assessment for Large Scale Databases," Procedia Computer Science, vol. 4, pp. 2007-2016, 2011.
[6] C. Ma, L. Wang, and X.-Q. Xie, "GPU accelerated chemical similarity calculation for compound library comparison," Journal of chemical information and modeling, vol. 51, no. 7, pp. 1521-1527, 2011.
[7] P. Szanto, B. Feher, and A. Berces, "Accelerating virtual screening of compound libraries," in Many-Core and Reconfigurable Supercomputing Conference (MRSC '09). Berlin, Germany, 2009.
[8] Q. Guo, N. Alachiotis, B. Akin, F. Sadi, G. Xu, T. M. Low, L. Pileggi, J. C. Hoe, and F. Franchetti, "3D-stacked Memory-side Acceleration: Accelerator and System Design," in WoNDP, in conjunction with the 47th IEEE/ACM International Symposium on Microarchitecture, 2014.
[9] N. Alachiotis, S. A. Berger, and A. Stamatakis, "Efficient PC-FPGA communication over Gigabit Ethernet," in Computer and Information Technology (CIT), 2010 IEEE 10th International Conference on. IEEE, 2010, pp. 1727-1734.
