Nikolaos Alachiotis
Dept. of Electrical and Computer Engineering
Carnegie Mellon University
Pittsburgh, PA 15217
Email: nalachio@ece.cmu.edu
Abstract: Numerical measures of similarity/distance between objects represented by binary vectors are common in a wide range of disciplines. Searching in large-scale chemical databases requires billions of comparisons between molecules that are represented by binary fingerprints to capture the atomic structure. The performance bottleneck here is the enumeration of set bits in vectors (population count). Due to the discrete representation, similarity measures between binary fingerprints should fit well to FPGAs. We present an architecture to accelerate binary similarity assessment, evaluate various design points, and compare performance to highly optimized CPU and GPU implementations. We implement an RTL generation software, SimGenRTL, to generate accelerators of various sizes based on the proposed architecture. We find that accelerators with fewer and wider population counters allow better distribution of the hardware resources, significantly outperforming accelerators with more and narrower bit-enumeration components. SimGenRTL is available for download to allow rapid design space exploration of the computational core ahead of a full custom system implementation.

I. INTRODUCTION

A wide range of applications in chemical informatics rely on similarity calculations as a critical subroutine. Virtual screening, which relies on compound database searching, has emerged as a cost-effective in silico method for early-stage drug design and discovery. Database clustering is also widely used in pharmaceutical chemistry to create ensembles of molecules that share similar biological activity. Thus, finding similarity in the multimillion compound databases currently available (e.g., PubChem [1], 63,160,076 compounds, last accessed March 4, 2015) has a key role in chemical informatics, but also requires immense computational power.

A common approach to assess chemical similarity between compounds involves the calculation of similarity measures on 2D fingerprints, i.e., binary vectors of fixed length where each bit represents the existence or absence of a substructural fragment in a molecule. Albeit conceptually simple, fingerprint-based similarity assessment has been proven very effective, and several measures have been proposed [2], with the most widely used being the Jaccard index/Tanimoto coefficient. Similarity assessment between objects represented by binary vectors is employed in other disciplines as well. In population genetics, genetic polymorphisms (SNPs) are represented by binary vectors to facilitate association studies.

Assessing similarity between two objects A and B represented by binary vectors consists of two distinct stages. During the first stage, a number of population count and accumulate operations are required to calculate the total number of set bits in A, B, and their intersection $A \cap B$. In the second stage, a similarity coefficient is calculated. Typically, the first stage is common not only among different chemical similarity measures (e.g., Jaccard/Tanimoto, Dice, and Cosine) but also among different applications (e.g., SNP association). However, it has been reported that 2D similarity calculations can achieve only up to 65% of the theoretical peak performance on modern CPU architectures [3]. Based on a performance evaluation of the SNP association subroutine in the software OmegaPlus [4], we conclude that such a peak performance deviation might be associated, to a certain extent, with possible architectural restrictions of modern x86 CPUs that do not allow the efficient combination, at the microkernel level, of SSE or AVX vector instructions with the population count instruction supported in hardware, which operates on regular registers with single-cycle throughput (available since Intel Nehalem and AMD K10).

Here, we present a versatile reconfigurable architecture to accelerate fingerprint-based similarity calculations. Further, we implement SimGenRTL (available at https://github.com/alachins/simgenrtl), an RTL generation software that generates fully synthesizable Verilog for the proposed architecture for a variety of parameters such as the number and topology of the processing units, the width and latency of the population counters, as well as the required multi-bank memory subsystem. Therefore, it allows rapid design space exploration for the similarity kernel, and significantly reduces the engineering effort to adapt the architecture for a device of different size and/or an FPGA board with different DRAM bandwidth.

II. 2D FINGERPRINT SIMILARITY

According to the similar property principle [2], structurally similar molecules tend to exhibit similar biological properties and activity. Given a similarity measure S and a cut-off threshold value D, compounds A and B are considered similar if $S_{AB} \geq D$. Most frequently, chemical similarity is evaluated in silico using fixed length binary fingerprints, i.e., series of binary digits. A binary fingerprint is generated using a subgraph isomorphism algorithm. The atoms and the bonds between neighboring atoms are examined and organized into patterns. Thereafter, a bit is set for each different pattern. Binary fingerprints are usually called 2D fingerprints to distinguish them from 3D fingerprint schemes that depend on the compound shape.

When fingerprints have been generated to numerically represent the compounds under investigation, a measure is
employed to evaluate the similarity between two compounds from their fingerprints. Given two compounds A and B, with p, q, and x being the numbers of set bits in A, B, and $A \cap B$, respectively, the similarity between A and B can be calculated by several measures. The most widely used measure is the Tanimoto coefficient, which relies on the common set bits:

$$\mathrm{Tanimoto}_{AB} = \frac{x}{p + q - x}. \qquad (1)$$

Other rarely used measures are Manhattan, Dice, Cosine, and Euclidean. Typical fingerprint schemes comprise between 164 (MDS fingerprints) and 2,048 bits (Daylight fingerprints).

TABLE I. ARCHITECTURE PARAMETERS

Parameter   Description
m           population counter input bus width (bits)
k           population counter LUT port width (bits)
t           population counter latency (cycles)
n           population counter output width (bits)
x           number of processing units horizontally
y           number of processing units vertically
z           number of words in query memory bank
r           number of words in result memory bank
o           number of result memory banks grouped together
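The two-stage computation behind Eq. (1) can be sketched in software as follows. This is a minimal reference model, not the authors' implementation: fingerprints are assumed to be stored as lists of fixed-width integer words, the helper names (`popcount`, `tanimoto_words`) are illustrative, and returning 0.0 for two all-zero fingerprints is a convention chosen here to avoid a division by zero.

```python
def popcount(word: int) -> int:
    """Number of set bits in a non-negative integer."""
    return bin(word).count("1")


def tanimoto_words(a_words, b_words) -> float:
    """Tanimoto coefficient of two fingerprints given as equal-length word lists.

    Stage 1: population count and accumulate over the words of A, B, and A AND B.
    Stage 2: evaluate x / (p + q - x) from the three accumulated counts.
    """
    if len(a_words) != len(b_words):
        raise ValueError("fingerprints must have the same number of words")
    p = q = x = 0
    for a, b in zip(a_words, b_words):
        p += popcount(a)       # set bits in A
        q += popcount(b)       # set bits in B
        x += popcount(a & b)   # set bits common to A and B
    denom = p + q - x
    # Assumption: two empty fingerprints are reported as similarity 0.0.
    return x / denom if denom else 0.0


def tanimoto(a: int, b: int) -> float:
    """Convenience wrapper for fingerprints that fit in a single integer."""
    return tanimoto_words([a], [b])


if __name__ == "__main__":
    # A = 11110000, B = 00111100: p = 4, q = 4, x = 2
    print(tanimoto(0b11110000, 0b00111100))
```

Only stage 2 changes when a different coefficient (e.g., Dice or Cosine) is needed, because all of these measures are functions of the same three counts p, q, and x.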
III. RELATED WORK

Haque et al. [3] described a series of fast methods for population count on modern x86 architectures and a GPU implementation to compute the Tanimoto coefficient. Employing OpenMP and SSE4.2 intrinsics, a throughput of $217.0 \times 10^6$ coefficients/second (1,024-bit fingerprints) was achieved on a 4-core Intel Core i7-920 CPU running at 2.93 GHz. Furthermore, an Nvidia GTX 480 GPU implementation achieved
each QBank is assigned a bank ID to allow distributing the query objects to the available banks during the Query Mem initialization phase. During processing, the two available ports in the QBanks provide query objects to the CNT blocks, which accumulate the number of set bits and pass the count and the query objects to the PPU blocks. The CNT block pipeline is identical to the PPU pipeline without the SMC stage. Each of the two PPU blocks comprises y rows of x PPUs each. A PPU column computes similarity coefficients between the same query object and a subset of y database objects, while each PPU row computes similarity coefficients between the same database object and x query objects that are loaded in parallel from the x QBanks. The two PPU blocks comprise 2xy processing units in total. An equal number of output memory banks (RBanks) is used to store the results. The output memory that comprises the 2xy RBanks is organized into two memory blocks, one for every PPU block, of xy dual-port banks each. All RBank ports are assigned a group ID and are driven by the same read/write addresses and write enable wires. For clarity, the write address and write enable generation logic for the RBanks is omitted in Figure 2. Each RBank group consists of o memory banks.

Fig. 2. The similarity core architecture generated by SimGenRTL.

d) Top-level Architecture: The similarity core operates on a subset of query and database objects. To address the generic problem of all-to-all similarity calculations, computations are conducted on chunks of data iteratively. Each iteration computes the similarity coefficients between a limited number of query objects and the entire database. The top-level architecture is depicted in Figure 3. To facilitate the analysis of arbitrary problem sizes and object lengths, we adopt the concept of an accelerator descriptor as introduced by Quo et al. [8]. We implement a light-weight version of the descriptor [...] object length. The FETCH unit retrieves the descriptor from DRAM, stores it in the IMEM instruction memory, and notifies the DECODE unit. The DECODE unit extracts the required parameters from 8-byte long instructions and configures the control FSMs, i.e., IControl, CControl, and OControl. All objects are stored in memory in chunks whose size does not exceed the DRAM row buffer size to achieve high bandwidth. When all control units are configured, the DECODE unit initiates the similarity calculations by enabling the CControl, which coordinates the operation of the similarity core with the incoming and outgoing data chunks via synchronization with the IControl and OControl units, respectively. For each iteration, a number of query chunks is initially loaded into Query Mem. Thereafter, the database objects are streamed in chunks. We employ double buffering for streaming in the database. Each of the two DMem memory blocks stores a database chunk, and its objects are reused multiple times depending on z, i.e., the size of the QBanks. To provide fingerprints to the y PPU rows, each DMem memory is organized into y banks of width m. When results start being stored in the RBanks using one of the two available ports, the CControl notifies the OControl to start transferring data to DRAM via the second port. Data from multiple RBanks are read per cycle, depending on the o parameter, to saturate the available DRAM bandwidth.

Fig. 3. The top-level similarity evaluation architecture.

V. CODE GENERATION AND EVALUATION

a) Code Generation: The RTL generation software SimGenRTL is implemented in the C language and generates Verilog for the similarity core and the underlying hierarchy of module instances. The command line below represents an invocation example that was used to generate a small similarity core (8 PPUs, henceforth denoted by test8 core) for testing.

    SimGenRTL -name test8 -m 16 -k 8 -t 4 -n 32 -x 4 -y 1 -z 256 -r 256 -o 2

In addition to the required parameters that specify the RTL design (Table I), optional parameters allow selecting between block RAMs and distributed memory individually for each of the modules that instantiate memory blocks, or between using FPGA LUTs and DSP48 slices in the adder-tree reduction structures. The test8 similarity core was mapped on a Virtex5 SX95T-2 FPGA. To verify correctness, we conducted extensive post-place and route simulations using ModelSim 6.3f and employed the test environment provided by an open-source PC-FPGA communication platform [9].

b) Evaluation: A series of design points was generated and mapped on a Virtex7 VX980T-2 FPGA. We assess throughput performance in terms of billion Tanimotos per second (bTan/sec) for varying numbers of PPUs and population count sizes when 1,024-bit fingerprints are processed. Figures 4 and 5 illustrate post-place and route (PAR) resource utilization, maximum operating clock frequency (post-PAR static timing report), and estimated throughput performance. Table II shows the resources for the entire control logic (Control), the similarity measure calculation stage (SMC), and the population count module with m = 16 and t = 4. Note the disproportionately large amount of slice resources occupied by the SMC floating-point operators in comparison to the
Fig. 4. Resource utilization, maximum frequency, and throughput (bTan/sec, 1,024-bit fingerprints, denoted by T) for constant population counter size (m = 16 and t = 4). Horizontal axis: number of processing units (PPUs = 2xy), from 32 to 256. Vertical axes: resource utilization (%) and frequency (MHz).

Fig. 5. Resource utilization, maximum frequency, and throughput (bTan/sec, 1,024-bit fingerprints, denoted by T) for constant number of PPUs (x = 16 and y = 1). Horizontal axis: population counter size (width m, latency t = log2(m)), from 64 to 512. Vertical axes: resource utilization (%) and frequency (MHz).
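A rough way to reason about the estimated throughput reported in Figs. 4 and 5 is sketched below. It rests on assumptions that are not stated in the text: each PPU is taken to consume one m-bit word of a fingerprint per clock cycle, so a fingerprint of L bits needs ceil(L/m) cycles, and all 2xy PPUs are taken to be busy in every cycle. Pipeline fill latency and memory stalls are ignored. The function name and the 150 MHz clock in the usage example are illustrative and not values taken from the paper.

```python
def estimate_throughput(num_ppus: int, m_bits: int, clock_hz: float,
                        fingerprint_bits: int = 1024) -> float:
    """Idealized coefficients per second for a core with num_ppus PPUs.

    Assumes one m-bit word per PPU per cycle and full PPU occupancy.
    """
    if num_ppus <= 0 or m_bits <= 0 or fingerprint_bits <= 0:
        raise ValueError("all sizes must be positive")
    # Ceiling division: cycles one PPU needs to consume one fingerprint.
    cycles_per_coefficient = -(-fingerprint_bits // m_bits)
    return num_ppus * clock_hz / cycles_per_coefficient


if __name__ == "__main__":
    # x = 16, y = 1 gives 2xy = 32 PPUs; m = 512; hypothetical 150 MHz clock.
    ppus = 2 * 16 * 1
    t = estimate_throughput(ppus, m_bits=512, clock_hz=150e6)
    print(f"{t / 1e9:.2f} bTan/sec")
```

Under this model, doubling m halves the cycles per coefficient at a fixed PPU count, which is one way to read the abstract's finding that fewer, wider population counters use the hardware resources better.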
TABLE II. OCCUPIED RESOURCES

Resources            Control   SMC     POPCNT (m = 16, t = 4)
Occupied Slices      868       448     1
Slice Registers      1,218     1,888   4
Slice LUTs           1,877     1,316   0
BRAM (18Kb)/DSP48    4/0       0/0     1/1

TABLE III. PERFORMANCE COMPARISON WITH PREVIOUS WORKS

Platform                    Length   bTan/sec   FPGA Throughput (bTan/sec)   Speedup
Haque et al. [3], CPU       1,024    0.217      Instance A: 2.97             13.7x
Haque et al. [3], GPU       1,024    1.157      Instance A: 2.97             2.6x
Ma et al. [6], GPU          992      0.288      Instance A: 2.97             10.3x
Maggioni et al. [5], GPU    256      0.214      Instance B: 3.19             14.9x
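The speedup column of Table III is consistent with dividing the FPGA throughput by the throughput of each reference platform. The short check below makes that arithmetic explicit. It assumes that Instance A (2.97 bTan/sec) is the design compared against the first three rows and Instance B (3.19 bTan/sec) against the last row, which is how the flattened table reads. The dictionary layout and the function name are illustrative.

```python
# (reference throughput in bTan/sec, FPGA throughput in bTan/sec)
TABLE_III = {
    "Haque et al. [3], CPU": (0.217, 2.97),
    "Haque et al. [3], GPU": (1.157, 2.97),
    "Ma et al. [6], GPU": (0.288, 2.97),
    "Maggioni et al. [5], GPU": (0.214, 3.19),
}


def speedup(reference_btan: float, fpga_btan: float) -> float:
    """Ratio of FPGA throughput to reference throughput, to one decimal."""
    return round(fpga_btan / reference_btan, 1)


if __name__ == "__main__":
    for platform, (ref, fpga) in TABLE_III.items():
        print(f"{platform}: {speedup(ref, fpga)}x")
```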