Smartphones
Editor: Nayeem Islam, Qualcomm, nayeem.islam@gmail.com
Approximate Computing: Making Mobile Systems More Efficient
Thierry Moreau, Adrian Sampson, and Luis Ceze, University of Washington
annotating legacy software with intuitive approximate datatype annotations.

An Approximate Compiler
The Accept compiler framework combines programmer annotations, code analysis, optimizations, and profiling feedback to make approximation safe and keep control in the hands of programmers. Its front end, built atop the LLVM compiler infrastructure, extends the syntax of C and C++ to incorporate an APPROX keyword that programmers use to annotate datatypes. Accept's analysis identifies code that can affect only variables marked as APPROX. Optimizations use these analysis results to avoid transforming the precise parts of the program. An autotuning component measures program executions and uses heuristics to identify program variants that maximize performance and output quality. The final output is a set of Pareto-optimal versions of the input program that reflect its efficiency-quality tradeoff space.

Safety Constraints and Feedback
Because program relaxations can have significant effects on program behavior, programmers need visibility into—and control over—the transformations the compiler applies. To give the programmer fine-grained control over relaxations, Accept extends an existing lightweight annotation system for approximate computing based on type qualifiers.2 Accept gives programmers visibility into the relaxation process via feedback that identifies which transformations can be applied and which annotations are constraining it. Through annotation and feedback, the programmer iterates toward an annotation set that unlocks new performance benefits while relying on an assurance that critical computations are unaffected.

Automatic Program Transformations
Based on programmer annotations, Accept's compiler passes can apply program transformations that involve only approximate data. To this end, Accept provides a compiler analysis library that finds regions of code that are amenable to transformations. An ensemble of optimization strategies transforms these regions. One critical optimization targets Snnap, our neural accelerator (described in more detail later).

Autotuning
Although a set of annotations might permit many different safe program relaxations, not all of them are beneficial in the quality-performance tradeoff they offer. A practical approximation mechanism must help programmers choose from among many candidate relaxations for a given program to strike an optimal balance between performance and quality. Accept's autotuner heuristically explores the space of possible relaxed programs to identify Pareto-optimal variants.

Neural Acceleration
Neural acceleration is a powerful approach to approximate computing that works by substituting entire regions of code in a program with machine-learning models.2 Neural acceleration trains neural networks to mimic and replace regions of approximate imperative code. Once the neural network is trained, the system no longer executes the original code and instead invokes the neural network model on a neural processing unit (NPU) accelerator. Neural networks have efficient hardware implementations, so this workflow can offer significant energy savings over traditional execution.

Neural acceleration consists of three phases: programming, compilation, and execution.

Programming
To use neural acceleration in Accept, the programmer uses profiling information and type annotations to mark code that's amenable to approximation. For many applications, it's easy to identify the "core" approximate data that dominates the program's execution—such as the pixel array in an image-filter algorithm. The programmer also provides a quality metric that measures the accuracy of the program's overall output.

Compilation
The compiler implements neural acceleration in four phases: region selection, execution observation, training, and code generation. Accept first identifies large regions of code that are safe to approximate and nominates them as candidates for neural acceleration. Next, it executes the program with test cases and records the inputs and outputs to each target code region. It then uses this input-output data to train a neural network that mimics the original code. Training can use standard techniques for neural networks—we use the standard backpropagation algorithm.

Finally, the compiler generates an executable that replaces the original code with invocations of a special accelerator (the NPU), which implements the trained neural network.

Execution
During deployment, the transformed program begins execution on the main core and configures the NPU. Throughout execution, the program invokes the NPU to perform a neural network evaluation in lieu of executing the code region it replaced. Invoking the NPU is faster and more energy-efficient than executing the original code region on the CPU, so the program as a whole runs faster.

Hardware Support for Approximate Acceleration
Our NPU implementation, Snnap, runs on off-the-shelf FPGAs. Using existing, affordable hardware means that Snnap can provide benefits today, without waiting for new silicon. Snnap uses an emerging class of heterogeneous computing devices called programmable system-on-chips (PSoCs). These devices combine a set of hard processor cores with programmable logic on the same die.
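To make the annotation model described above concrete, here is a minimal sketch of the kind of code a programmer might write: the pixel buffer in a tiny image filter is marked APPROX while indices and bounds stay precise, and a quality metric compares precise and relaxed outputs. This is illustrative only; Accept's real front end interprets APPROX as a type qualifier, whereas here it expands to nothing so the snippet compiles with any C compiler, and the function names are our own.

```c
#include <math.h>
#include <stddef.h>

/* Hypothetical stand-in: under a stock compiler, APPROX expands to nothing.
 * Accept's front end would instead treat it as a type qualifier marking
 * data that relaxations are allowed to affect. */
#define APPROX

/* 3-tap blur over a grayscale scanline: the pixel data is the "core"
 * approximate data, so it is annotated; loop indices stay precise. */
void blur(const APPROX float *in, APPROX float *out, size_t n) {
    for (size_t i = 1; i + 1 < n; ++i)
        out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0f;
    out[0] = in[0];
    out[n - 1] = in[n - 1];
}

/* A quality metric of the kind Accept asks the programmer to supply:
 * mean absolute pixel difference between precise and relaxed outputs
 * (0 means the outputs are identical). */
float quality(const float *precise, const float *relaxed, size_t n) {
    float err = 0.0f;
    for (size_t i = 0; i < n; ++i)
        err += fabsf(precise[i] - relaxed[i]);
    return err / (float)n;
}
```

Accept's analysis would then check that nothing flowing out of the APPROX pixels reaches precise state, and the autotuner would use the quality metric to rank candidate relaxations.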
Figure 1. The system diagram for the Systolic Neural Network Accelerator in Programmable Logic (Snnap). Each processing unit (PU) contains a chain of processing elements (PEs) feeding into a sigmoid unit (SIG).

Snnap is implemented on a commercially available PSoC: the Xilinx Zynq-7020 on the ZC702 evaluation platform (www.xilinx.com/products/silicon-devices/soc.html).4 The Zynq includes a dual-core ARM Cortex-A9 and an FPGA fabric. The CPU-NPU interface composes three communication mechanisms on the Zynq PSoC for high bandwidth and low latency. First, when the program starts, it configures Snnap using the medium-throughput general-purpose I/Os (GPIOs) interface. Then, to use Snnap
during execution, the program sends inputs using the high-throughput ARM Accelerator Coherency Port (ACP). The processor then uses the ARMv7 SEV/WFE signaling instructions to invoke Snnap and enter sleep mode. Finally, the accelerator writes outputs back to the processor's cache via the ACP interface and, when finished, signals the processor to wake up.

Micro-Architecture
Our design, shown in Figure 1, consists of a cluster of processing units (PUs) connected through a bus. Each PU is composed of a control block, a chain of processing elements (PEs), and a sigmoid unit, denoted by the SIG block. The PEs form a one-dimensional systolic array that feeds into the sigmoid unit. Systolic arrays excel at exploiting the regular data-parallelism found in neural networks, and they're amenable to efficient implementation on modern FPGAs.

When evaluating a layer of a neural network, PEs read the neuron weights from a local scratchpad memory where temporary results can also be stored. The sigmoid unit implements a nonlinear neuron-activation function using a lookup table. The PU control block contains a configurable sequencer that orchestrates communication between the PEs and the sigmoid unit. The PUs can be programmed to operate independently, so different PUs can be used to either parallelize the invocations of a single neural network or evaluate different neural networks concurrently.

Experience and Results
We applied Accept and Snnap to a set of approximable benchmarks. Our goal was to show that programmers can unlock significant efficiency gains at a small accuracy cost with minimal effort.

Writing Approximate Programs
To evaluate the effort required to apply approximation, we annotated a set of benchmarks for Accept's language. The programmers included three undergraduate researchers, all of whom were beginners with the C and C++ languages and new to approximate computing, as well as graduate students more familiar with the field.

Programmers tended to approach annotation by finding the central approximable data in the program—for example, the vector coordinates in a clustering algorithm, or pixels in imaging code. Accept's type errors guided programmers toward other parts of the code that needed annotation. Programmers needed to balance effort with potential reward during annotation, so auxiliary tools, such as profilers and call graph generators, were useful to find hot spots.

Snnap Acceleration Efficiency
Our evaluation targeted seven benchmarks from many application domains: option pricing, signal processing, robotics (the inverse kinematics for a 2-joint arm—inversek2j), lossy image compression, machine learning (k-means), and image processing. We compared