1. INTRODUCTION
In the past, research in digital systems was directed towards increasing speed and throughput
and reducing area. In the 1990s the trend changed: a growing demand for high-performance
portable computing systems made power consumption reduction one of the most critical issues.
Providing a low-power solution without sacrificing the computational output needed to perform a
desired algorithm became the need of the hour. Power optimization can be achieved at
the following levels: circuit and logic level, architecture level, algorithm level, system level, and
process level. Since CMOS circuits do not dissipate power if they are not switching, a major
focus of low power design is to reduce the switching activity to the minimal level required to
perform the computation. This can range from simply powering down the complete circuit or
parts of it, to more complex schemes of gating the clock or optimal architectures with reduced
number of transitions.
The report is organized as follows: Section 2 introduces the sources of power dissipation;
Section 3 discusses the different levels of power optimization; Section 4 deals with the
reliability of low-power systems; Section 5 covers near-threshold computing; and Section 6
describes the research project on low-power systems undertaken during the course of this
semester.
2. SOURCES OF POWER DISSIPATION:
There are three major sources of power dissipation in digital CMOS circuits which are
summarized in the following equation [1]:

P = α0→1·CL·Vdd²·fclk + Isc·Vdd + Ileakage·Vdd    (1)

α0→1·CL·Vdd²·fclk : α0→1 is the switching activity, i.e. the average number of times a node
makes a power-consuming transition (0→1) in one clock period; CL is the load capacitance, Vdd is
the supply voltage and fclk is the clock frequency. This switching component is the dominant one.
Isc.Vdd : Isc is the short circuit current which arises when both nMOS and the pMOS are on. There
exists a conducting path from Vdd to ground and hence this gives rise to a short circuit and hence
power dissipation
Ileakage·Vdd : Ileakage is the leakage current, associated primarily with the subthreshold effect.
It is the current that flows even when the transistor is nominally off, i.e. IDS > 0 even when
VGS is zero.
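As a quick illustration of equation (1), the three components can be computed for a hypothetical gate; all device values below are assumed, purely illustrative numbers, not figures from the report.

```python
# Worked example of the three CMOS power components in Eq. (1).
# All device values here are assumed, illustrative numbers.

def cmos_power(alpha, c_load, vdd, f_clk, i_sc, i_leak):
    """Return (switching, short-circuit, leakage) power in watts."""
    p_switch = alpha * c_load * vdd ** 2 * f_clk  # alpha * CL * Vdd^2 * fclk
    p_sc = i_sc * vdd                             # Isc * Vdd
    p_leak = i_leak * vdd                         # Ileakage * Vdd
    return p_switch, p_sc, p_leak

# 10% switching activity, 1 pF load, 1.1 V supply, 1 GHz clock,
# 1 uA short-circuit current, 0.1 uA leakage current
p_sw, p_sc, p_lk = cmos_power(0.1, 1e-12, 1.1, 1e9, 1e-6, 1e-7)
```

With these (assumed) numbers the switching term dominates the other two by roughly two orders of magnitude, matching the report's observation.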
3. LEVELS OF POWER OPTIMIZATION
3.1 Physical, logic and circuit level: The switching component of the power can be varied by
varying the term α0→1·CL, the effective capacitance (Ceff). This depends on the choice of logic
style: if the inputs do not change from the previous clock period, a static gate does not switch;
the gates of a dynamic CMOS circuit, however, can still switch. The aim of logic optimization is
to choose logic that minimizes switching, position registers to reduce glitching, choose gates
from a library to minimize switching, etc., such that Ceff is minimized.
Circuit topology: The overall switching activity is influenced by the manner in which logic gates
are interconnected. It has two components:
a. Static component: This is a function of topology and signal statistics.
b. Dynamic component: This is more concerned with timing behavior like spurious
transitions and glitches.
Also, the circuit can be implemented in either a chain topology or a tree topology. A chain
topology has higher switching activity than a tree topology because of the extra transitions in
the chain implementation; a tree topology, in contrast, is balanced and glitch-free.
Placement and Routing optimization:
Placement and routing algorithms should assign short wires to signals with high switching
activity and longer wires to signals with low switching activity.
3.2 ARCHITECTURE LEVEL OPTIMIZATION & MEMORY OPTIMIZATION:
This has different strategies like architecture driven voltage scaling, choice of number
representation, exploitation of signal correlations and resynchronization to minimize glitching.
3.2.1 Architecture driven voltage scaling:
The aim is to operate the architecture at a reduced supply voltage. Since gate delay increases as
the supply voltage is lowered, the operating speed must also be reduced: Vdd and f are decreased
together, thereby reducing power dissipation. To compensate for the loss of throughput caused by
the reduction in Vdd and f, pipelining and/or parallel processing are incorporated.
Figure 1: A simple adder datapath [1]
Parallel Processing: The datapath is duplicated so that two operations are processed in parallel,
allowing each unit to run at half the original rate and the supply voltage to drop from 5V to
about 2.9V, while the effective capacitance rises to 2.15Cref [1]. Therefore,
Ppar = CparVpar²fpar = (2.15Cref)(0.58Vref)²(0.5fref) ≈ 0.36Pref    (3)
Pipelining:
Figure 3: Pipelined implementation of the simple datapath [1]
A pipeline latch is added in the critical path, so the adder and comparator can operate at a
lower rate. Since the delays in the adder and comparator are equal, the supply voltage can be
reduced from 5V to 2.9V, while the effective capacitance increases to 1.15Cref. Therefore,
Ppipe = CpipeVpipe²fpipe = (1.15Cref)(0.58Vref)²fref ≈ 0.39Pref    (4)
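The pipelined estimate of equation (4) is easy to verify numerically; this sketch just multiplies out the scaling factors given in the text (1.15x capacitance, 5 V dropped to 2.9 V, unchanged clock).

```python
# Reproduces the pipelined-datapath estimate of Eq. (4):
# Cpipe = 1.15*Cref, Vpipe = 2.9/5.0 = 0.58*Vref, fpipe = fref.

c_scale = 1.15           # Cpipe / Cref
v_scale = 2.9 / 5.0      # Vpipe / Vref
f_scale = 1.0            # fpipe / fref

p_ratio = c_scale * v_scale ** 2 * f_scale   # Ppipe / Pref, about 0.39
```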
On-chip leakage power: The main idea is to power down cache lines that do not contain useful
data. Vdd is turned off using a gating transistor; the control algorithm switches off Vdd for any
cache line that has not been accessed for a long time.
Dynamic Voltage Scaling approach: In the gated-Vdd approach, the state of the cache line is lost
when it is turned off, so there is a significant power and performance overhead associated with
reloading the cache line. To avoid this, a drowsy cache puts cache lines into a low-power mode in
which the contents of the line are preserved. When the contents need to be accessed, the cache
line is reinstated to the high-power mode.
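A minimal sketch of a drowsy-line controller, assuming a simple idle-cycle threshold policy and a one-cycle wake-up penalty (both assumed details; the actual policy and penalty are implementation specific):

```python
# Sketch of a drowsy cache line: lines idle longer than a threshold drop
# to a low-power "drowsy" mode that preserves contents (unlike gated-Vdd);
# an access must first reinstate the line to high-power mode.

class CacheLine:
    def __init__(self):
        self.mode = "high"        # "high" (full Vdd) or "drowsy" (low Vdd)
        self.idle_cycles = 0

    def tick(self, drowsy_threshold=4):
        """One clock cycle without an access to this line."""
        self.idle_cycles += 1
        if self.idle_cycles >= drowsy_threshold:
            self.mode = "drowsy"  # contents are preserved in this mode

    def access(self):
        """Access the line; returns the wake-up penalty in cycles."""
        wakeup_penalty = 1 if self.mode == "drowsy" else 0
        self.mode = "high"        # reinstate to high-power mode
        self.idle_cycles = 0
        return wakeup_penalty
```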
Figure 5: Matrix array [19]
Shutdown happens once the component has been idle for longer than the predicted timeout value. A
timeout scheme is needed to periodically reevaluate Tpred, and a saturation condition is needed
to handle overprediction.
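The predictive shutdown idea can be sketched as follows. The exponential-average update rule and the numeric constants are assumptions for illustration; the text only requires that Tpred be periodically reevaluated and capped to handle overprediction.

```python
# Sketch of a predictive shutdown policy: power down once idle time
# exceeds the predicted timeout T_pred. The update rule (exponential
# average of observed idle periods) and all constants are assumed.

class ShutdownPredictor:
    def __init__(self, t_init=10.0, alpha=0.5, t_max=100.0):
        self.t_pred = t_init
        self.alpha = alpha
        self.t_max = t_max    # saturation: never predict beyond this

    def observe_idle(self, idle_time):
        # Periodic reevaluation of T_pred from the latest idle period.
        self.t_pred = self.alpha * idle_time + (1 - self.alpha) * self.t_pred
        self.t_pred = min(self.t_pred, self.t_max)  # handle overprediction

    def should_shut_down(self, idle_so_far):
        return idle_so_far > self.t_pred
```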
4. RELIABILITY OF LOW POWER SYSTEMS
4.1 RAZOR METHOD:
When the architecture is operated at voltages lower than Vdd-critical, errors crop up in the
system when longer paths are excited. This can be addressed by shaving voltage margins with
Razor, whose goal is to reduce voltage margins through in-situ error detection and correction of
delay failures. In the proposed approach, the processor voltage is tuned based on the observed
error rate: safety margins are eliminated and the circuit is purposely run below the critical
voltage, with data-dependent delay margins. There is a trade-off between the power saved by
voltage scaling and the overhead of error correction.
5. NEAR-THRESHOLD COMPUTING
Figure 9: Energy and delay in different supply voltage operating regions [8]
In the superthreshold regime (Vdd > Vth), energy is highly sensitive to Vdd due to the quadratic
scaling of switching energy with Vdd. Hence voltage scaling down to the near-threshold regime
(Vdd ≈ Vth) yields an energy reduction on the order of 10X at the expense of approximately 10X
performance degradation, as seen in Figure 9 [8]. However, the dependence of energy on Vdd
becomes more complex as voltage is scaled below Vth. In the subthreshold regime (Vdd < Vth),
circuit delay increases exponentially as Vdd decreases, causing leakage energy (the product of
leakage current, Vdd, and delay) to increase in a near-exponential fashion. This rise in leakage
energy eventually dominates any reduction in switching energy, creating the energy minimum seen
in Figure 9.
5.1 NTC BARRIERS:
a. Performance loss
The performance loss observed at NTC poses one of the most challenging problems in utilizing
NTC: an FO4 inverter at an NTC supply is roughly 10X slower than at 1.1 V.
b. Increased performance variation
In the near-threshold regime, MOSFET drive current exhibits an exponential dependency
on Vth, Vdd, and temperature. As a result, NTC designs display a dramatic increase in
performance uncertainty.
c. Increased functional failure
2. Device optimization: At the lowest level of abstraction, the performance of NTC systems
can be greatly improved through modifications and optimizations of the transistor structure and
its fabrication process.
b. Addressing performance variation
1. Soft edge clocking
The device variation inherent to semiconductor manufacturing continues to increase from such
causes as dopant fluctuation and other random sources, limiting the performance and yield of
ASIC designs. One potential solution to address timing variation while minimizing overhead is a
type of soft-edge flip-flop (SFF) that maintains synchronization at a clock edge, but has a small
transparency window, or softness. In one particular approach to soft-edge clocking, tunable
inverters are used in a master-slave flip-flop to delay the incoming master clock edge with
respect to the slave edge. As a result of this delay, a small window of transparency is generated
in the edge-triggered register that accommodates paths in the preceding logic that were too slow
for the nominal cycle time, in essence allowing time borrowing within an edge-triggered register
environment. Hence, soft-edge clocking trades off short paths against long paths and is effective
at mitigating random, uncorrelated variations in delay, which are significant in NTC.
2. Body Biasing
While already effective in the superthreshold domain, body biasing becomes particularly powerful
in the NTC domain, where device sensitivity to the threshold voltage increases exponentially.
Body biasing is therefore a strong lever for modulating frequency and performance in NTC, and is
ideally suited as a technique for addressing the increased impact of process variation in NTC.
nominal case. This in turn reduces cell stability and heightens sensitivity to Vth variation, which
is generally high in SRAM devices to begin with due to the particularly aggressive device sizing
necessary for high density.
The read and write operations can be decoupled by relying on extra complexity in the
periphery of the core array. This results in a cell that is functional below 200 mV and that
achieves relatively high energy efficiency.
2. To retain the robustness of the NTC SRAM but with minimal impact on die space, a cache in
which only a subset of the cache ways is implemented in larger, NTC-tolerant cells is proposed.
This modified cache structure can dynamically reconfigure its access semantics to act like a
traditional cache when needed for performance, and like a filter cache to save energy in
low-power mode. When
performance is not critical, power can be reduced by accessing the low-voltage cache way first,
with the other ways of the cache only accessed on a miss. This technique is similar to that of
filter caches; while it provides power savings, it increases the access time for hits in the
higher-voltage cache ways. When performance is critical, the access methodology is changed to
access all ways of the cache in parallel, providing fast single-cycle access to all data.
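A sketch of the two access modes, with the way organization and cycle counts assumed for illustration:

```python
# Sketch of the reconfigurable cache access described above: in low-power
# mode only the NTC-tolerant low-voltage way is probed first, and the
# remaining ways only on a miss; in performance mode all ways are probed
# in parallel in a single cycle. Way layout and cycle counts are assumed.

def cache_access(tag, ways, low_power):
    """ways: list of sets of tags; ways[0] is the low-voltage way.
    Returns (hit, cycles)."""
    if low_power:
        if tag in ways[0]:
            return True, 1                 # hit in the low-voltage way
        hit = any(tag in w for w in ways[1:])
        return hit, 2                      # extra cycle for the other ways
    # performance mode: all ways accessed in parallel
    return any(tag in w for w in ways), 1
```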
The goal was to parallelize three important classification algorithms, namely SVM, K-means
clustering and the Restricted Boltzmann Machine, and to implement them on a multi-core platform.
The platform used here was the Odroid-XU+E, which is based on the big.LITTLE architecture
developed by ARM. This part is organized as follows: first an introduction to multi-core systems,
followed by a discussion of the big.LITTLE architecture and the ODROID; then the different
algorithms are discussed, followed by how they were parallelized; finally, the results are
presented.
6.2 MULTI CORE SYSTEMS
Figure 11: A basic block diagram of a generic multi core processor [17]
A multi-core processor is generally defined as an integrated circuit to which two or more
independent processors (called cores) are attached. This term is distinct from but related to the
term multi-CPU, which refers to having multiple CPUs which are not attached to the same
integrated circuit. The term uniprocessor generally refers to having one processor per system,
and that the processor has one core; it is used to contrast with multiprocessing architectures, i.e.
either multi-core, multi-CPU, or both. Figure 11 above, illustrates the basic components of a
generic multi-core processor.
There are many ways to operate a multicore processor so as to minimize the power consumed. One of
them is power gating, i.e., turning off the non-active cores. When a core is active, it is
operated at its optimal voltage and frequency. This scheme allows fine-grained power management
with two levels of voltage or two levels of frequency, which makes load balancing easier and
gives better control of temperature.
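The per-core scheme can be sketched as follows; the two voltage/frequency pairs and the effective capacitance are assumed, illustrative values:

```python
# Sketch of per-core power gating with two V/f levels: an idle core is
# gated off entirely; an active core runs at one of two voltage/frequency
# pairs. The V/f pairs and capacitance below are assumed toy values.

def core_power(active, high_perf, c_eff=1e-9):
    """Dynamic power of one core in watts (C * Vdd^2 * f)."""
    if not active:
        return 0.0                       # power-gated: core fully off
    vdd, f = (1.1, 2.0e9) if high_perf else (0.9, 1.4e9)
    return c_eff * vdd ** 2 * f
```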
NEED FOR MULTI CORE:
As different tasks demand higher processing speeds, single-core processors need to be clocked at
higher frequencies, but they cannot be clocked beyond a certain point. Another technique to
improve processing speed is pipelining; as the number of pipeline stages increases, however,
designs face problems of heat dissipation and of complex design and verification, and demand
correspondingly large design teams. Multi-core has emerged as the solution to these problems.
There is a general trend in computer architecture wherein more and more applications are becoming
parallelized and multi-threaded. Multicore processors address the deficiencies of single-core
processors by increasing bandwidth while decreasing power consumption.
LEVELS OF PARALLELISM
1. TASK LEVEL PARALLELISM
Task level parallelism focuses on distributing execution processes (threads) across different
parallel computing nodes. In a multiprocessor system, this type of parallelism is achieved when
each processor executes a different thread (or process) on the same or different data. The threads
may execute the same or different code. In the general case, different execution threads
communicate with one another as they work. Communication usually takes place by passing data
from one thread to the next as part of a workflow. This parallelism emphasizes the distributed
(parallelized) nature of the processing (i.e., threads).
Hardware can exploit these kinds of application parallelism in four major ways:
1. Instruction level parallelism: exploits data level parallelism at modest levels with the help
of the compiler, using ideas like pipelining, and at medium levels using ideas like speculative
execution.
2. Vector architectures and graphics processing units (GPUs): exploit data level parallelism by
applying a single instruction to a collection of data in parallel.
3. Thread level parallelism: exploits either data level parallelism or task level parallelism in
a tightly coupled hardware model that allows interaction among parallel threads.
4. Request level parallelism: exploits parallelism among largely decoupled tasks specified by the
programmer or the operating system.
The Odroid-XU+E comes with four current/voltage sensors that measure the power consumption of the
big A15 cores, LITTLE A7 cores, GPU and DRAM individually, so developers can monitor CPU, GPU and
DRAM power consumption via the included on-board power measurement circuit. By using the
integrated power analysis tool, developers can reduce the need for repeated trials when debugging
power consumption, and can enhance and optimize the performance of their CPU/GPU compute
applications while keeping power consumption as low as possible.
K-means clustering is an algorithm that aims to find the positions μi, i = 1,...,k of the cluster
centers that minimize the distance from the data points to their cluster mean. K-means clustering
solves

arg min over c of Σ(i=1..k) Σ(x∈ci) ||x − μi||²    (7)

The algorithm:
1. Initialize the centers of the clusters.
2. Assign each data point to the closest cluster center:
ci = {j : d(xj, μi) ≤ d(xj, μl) for all l ≠ i, j = 1,...,n}
3. Set each center to the mean of the points assigned to it, and repeat steps 2 and 3 until the
assignments no longer change.
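The steps above can be sketched in plain Python (1-D points for brevity):

```python
# Minimal k-means following the steps above: alternate an assignment step
# (each point goes to its nearest center) and an update step (each center
# moves to the mean of its cluster). 1-D points keep the sketch short.

def kmeans(points, centers, iters=10):
    for _ in range(iters):
        # assignment step: cluster i collects the points closest to center i
        clusters = [[] for _ in centers]
        for x in points:
            i = min(range(len(centers)), key=lambda i: abs(x - centers[i]))
            clusters[i].append(x)
        # update step: move each center to the mean of its cluster
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers
```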
PARALLEL K-MEANS
A single program multiple data (SPMD) approach is employed, based on a message-passing model.
Sending and receiving messages between a master and the concurrently created processes are done
in an asynchronous manner.
1. Set the initial global centroid C = <C1, C2, ..., CK>
2. Partition the data into P subgroups, each of equal size
3. For each subgroup, create a process
4. Send C to each created process for calculating distances and assigning cluster members
5. Collect the local results, recalculate the global centroids, and repeat until the centroids
stabilize
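A sketch of the parallel assignment step in the SPMD style described above, using Python threads and a shared queue as the message channel (the report's implementation used C and POSIX threads; this is an illustrative analogue, not that code):

```python
# SPMD sketch of the parallel assignment step: the master partitions the
# data, sends the global centroid set C to every worker, and collects the
# per-partition assignments asynchronously through a queue.

from queue import Queue
from threading import Thread

def worker(part_id, points, centers, out):
    labels = [min(range(len(centers)), key=lambda i: abs(x - centers[i]))
              for x in points]
    out.put((part_id, labels))          # "send" the result to the master

def parallel_assign(points, centers, n_workers=2):
    size = (len(points) + n_workers - 1) // n_workers
    parts = [points[i * size:(i + 1) * size] for i in range(n_workers)]
    out = Queue()
    threads = [Thread(target=worker, args=(i, part, centers, out))
               for i, part in enumerate(parts)]
    for t in threads:
        t.start()
    collected = dict(out.get() for _ in threads)   # asynchronous receives
    for t in threads:
        t.join()
    # reassemble labels in the original point order
    return [label for i in range(n_workers) for label in collected[i]]
```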
Maximum-margin hyperplane and margins for an SVM trained with samples from two classes.
Samples on the margin are called the support vectors.
If the training data are linearly separable, we can select two hyperplanes in a way that they
separate the data and there are no points between them, and then try to maximize their distance.
The region bounded by them is called "the margin". These hyperplanes can be described by the
equations
w·x − b = 1    (9)
and
w·x − b = −1    (10)

By geometry, the distance between these two hyperplanes is 2/||w||, so we want to minimize ||w||
to maximize the margin. As we also have to prevent data points from falling into the margin, we
add the following constraint: for each i, either
w·xi − b ≥ 1 for xi of the first class, or
w·xi − b ≤ −1 for xi of the second.
This can be rewritten as:
yi(w·xi − b) ≥ 1, for all 1 ≤ i ≤ n.
We can put this together to get the optimization problem:
Minimize ||w|| (in w, b) subject to yi(w·xi − b) ≥ 1 for i = 1,...,n.
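The margin width 2/||w|| and the combined constraint yi(w·xi − b) ≥ 1 can be checked directly on a toy example (all values assumed):

```python
# Checks the hard-margin SVM quantities from the text: the margin width
# 2/||w|| and the constraint y_i(w.x_i - b) >= 1, on toy data.

import math

def margin_width(w):
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

def satisfies_constraints(w, b, samples):
    """samples: list of (x_vector, y) pairs with y in {+1, -1}."""
    return all(y * (sum(wi * xi for wi, xi in zip(w, x)) - b) >= 1
               for x, y in samples)
```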
MULTICLASS SVM
Multiclass SVM aims to assign labels to instances by using support vector machines, where the
labels are drawn from a finite set of several elements. The dominant approach for doing so is to
reduce the single multiclass problem into multiple binary classification problems. Define a multi
category classifier which classifies a pattern by computing c linear discriminant functions
defined as
gi(x) = wiT·x + wi0,  i = 1,...,c,
and assigning x to the category corresponding to the largest discriminant.
PARALLELIZATION OF MULTICLASS SVM
Since there is no data dependency between the matrix multiplications, the computation of each
linear discriminant function, i.e. the multiplication with one weight vector, is done on one
core. The results are compared globally and the appropriate label is assigned according to the
largest value.
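A sketch of this scheme with one thread per discriminant function; the weights in the test are assumed toy values, and Python threads stand in for the cores:

```python
# Parallel multiclass step: each thread evaluates one linear discriminant
# g_i(x) = w_i.x + w_i0; the label with the largest score wins.

from threading import Thread

def classify(x, weights, biases):
    scores = [0.0] * len(weights)

    def discriminant(i):
        # one discriminant function per thread (per core, conceptually)
        scores[i] = sum(w * xi for w, xi in zip(weights[i], x)) + biases[i]

    threads = [Thread(target=discriminant, args=(i,))
               for i in range(len(weights))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # global comparison: pick the largest discriminant value
    return max(range(len(scores)), key=lambda i: scores[i])
```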
6.6 RESTRICTED BOLTZMANN MACHINE
A Boltzmann machine is a bidirectionally connected network of stochastic processing units, which
can be interpreted as a neural network model. Artificial neural networks are computational models
capable of machine learning and pattern recognition. A Boltzmann machine comprises neuron-like
units whose binary activations depend on the neighbors they are connected to.
Stochastic means that a probabilistic element is present in the activation. Stochastic neural
networks are a type of artificial neural networks that are built by introducing random variations
into the network, either by giving the network's neurons stochastic transfer functions, or by
giving them stochastic weights. Stochastic neural networks that are built by using stochastic
transfer functions are often called Boltzmann machines.
A Restricted Boltzmann Machine is a stochastic neural network consisting of one layer of visible
units and one layer of hidden units. It is used to learn the probability distribution over a set of
inputs
Figure 13: Diagram of an RBM with three visible units and four hidden units[11]
RBM FOR CLASSIFICATION:
When unit i is given the opportunity to update its binary state, it first computes its total
input zi, which is the sum of its own bias bi and the weights on connections coming from other
active units:

zi = bi + Σj sj·wij    (11)

where wij is the weight on the connection between i and j, and sj is 1 if unit j is on and 0
otherwise. Unit i then turns on with a probability given by the logistic function:

prob(si = 1) = 1/(1 + e^(−zi))    (12)

For classification, the network computes three such logistic stages in sequence:

data_l1 = 1/(1+(exp(-(data.w_l1) - b_l1T)))    (13)

data is the input data, a 1xd matrix; w_l1 is the weight matrix, of dimension dxm; data_l1 is a
1xm matrix that is fed to the second stage.
data_l2 = 1/(1+(exp(-(data_l1.w_l2) - b_l2T)))
(14)
w_l2 is an mxp matrix and data_l2 is a 1xp matrix that is fed into the third stage.
data_l3 = 1/(1+(exp(- (data_l2.w_l3) - b_l3T)))
(15)
w_l3 is a px1 matrix; hence data_l3 is a single number. A data point is classified as belonging
to class 1 if data_l3 > 0.5, and to class 0 otherwise.
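The three-stage forward pass can be written out in plain Python. Note that 1/(1 + exp(−(x·w) − b)) is the logistic function applied to x·w + b, matching the bias-plus-weighted-sum form of equation (11); the dimensions and weights in the test are assumed toy values, and the name rbm_classify is ours, not the report's.

```python
# Three-stage logistic forward pass as in Eqs. (13)-(15), for a network
# with layer sizes d -> m -> p -> 1. Plain Python lists, no libraries.

import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))    # Eq. (12)

def layer(x, w, b):
    """x: 1xd row, w: dxm matrix, b: length-m bias -> 1xm row."""
    return [logistic(sum(xi * w[i][j] for i, xi in enumerate(x)) + b[j])
            for j in range(len(b))]

def rbm_classify(x, w1, b1, w2, b2, w3, b3):
    h1 = layer(x, w1, b1)        # Eq. (13): 1xm
    h2 = layer(h1, w2, b2)       # Eq. (14): 1xp
    out = layer(h2, w3, b3)[0]   # Eq. (15): a single number
    return 1 if out > 0.5 else 0
```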
PARALLELIZATION OF RBM: There is an inter-stage dependency, as each stage depends on the previous
stage for its inputs. The input data is therefore divided into as many groups as there are cores,
and the above series of multiplications for each group is carried out on a separate core.
6.7 RESULT
The non-threaded versions of the three algorithms were coded in C++ and the threaded versions
were coded using a combination of C and POSIX thread programming[18]. Both the threaded
and non-threaded codes were run on the Odroid and Core i7 processors and the execution time
for the codes was compiled into the following table:
ALGORITHM    CORE i7                        ODROID-XU+E
             THREADED    NON-THREADED       THREADED     NON-THREADED
K-MEANS      13020       13217              ongoing      ongoing
SVM          86018       23602969           277122704    really large value
RBM          ongoing     18801              -            -

TABLE 1: Number of clock cycles taken to execute the threaded and non-threaded versions on the
ODROID and Core i7
6.8 CONCLUSION:
From the table, it is seen that the execution time taken by the Odroid for the parallelized versions
of the algorithms is less than the time taken by it to execute the non-parallelized versions. The
threads were allocated to the cores by the compiler based on the existing process load in each of
the cores. The heavy and complex multiplications were carried out by the Big core and small
multiplications were carried out by the LITTLE core. When the amount of work per thread is small,
the non-threaded code takes much less time than the threaded code, because of the overhead
involved in creating the threads. The same code was therefore run for a large number of
iterations to amortize the thread-creation overhead; the advantage of threading was observed only
when the execution time of the code was large.
7. REFERENCES:
[17] ... for Multi-core Processors, http://www.cse.wustl.edu/~jain/cse567-11/ftp/multcore/#sec1.1
[18] https://computing.llnl.gov/tutorials/pthreads/
[19] Tutorial by Dr. Chaitali on Architecture and Algorithm-level Techniques to Reduce
Power/Energy