1. INTRODUCTION
In the past, research in digital systems was directed towards increasing speed and throughput
and reducing area. In the 1990s the trend changed: a growing demand for high-performance
portable computing systems made power consumption reduction one of the most critical issues.
Providing a low-power solution without sacrificing the computational output needed to perform a
desired algorithm became the need of the hour. Power optimization can be achieved at
the following levels: circuit and logic level, architecture level, algorithm level, system level, and
process level. Since CMOS circuits do not dissipate power if they are not switching, a major
focus of low power design is to reduce the switching activity to the minimal level required to
perform the computation. This can range from simply powering down the complete circuit or
parts of it, to more complex schemes of gating the clock or optimal architectures with reduced
number of transitions.
The report is organized as follows: Section 2 introduces the sources of power dissipation;
Section 3 discusses the different levels of power optimization; Section 4 deals with the
reliability of low-power systems; Section 5 covers near-threshold computing; and Section 6
describes the research project on low-power systems undertaken during the course of this
semester.
2. SOURCES OF POWER DISSIPATION:
There are three major sources of power dissipation in digital CMOS circuits which are
summarized in the following equation [1]:

P = α0→1·CL·Vdd²·fclk + Isc·Vdd + Ileakage·Vdd    (1)

α0→1·CL·Vdd²·fclk : α0→1 is the switching activity, i.e. the average number of times a node
makes a power-consuming transition (0→1) in one clock period; CL is the load capacitance, Vdd is
the supply voltage and fclk is the clock frequency. This switching component is the dominant one.
Isc.Vdd : Isc is the short circuit current which arises when both nMOS and the pMOS are on. There
exists a conducting path from Vdd to ground and hence this gives rise to a short circuit and hence
power dissipation
Ileakage·Vdd : Ileakage is the leakage current, associated primarily with the subthreshold effect.
It is the current that flows even when the transistor is nominally off, i.e. IDS > 0 even when
VGS is zero.
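As a quick illustration of equation (1), the three components can be computed for a hypothetical gate; all device values below are assumed, purely illustrative numbers, not figures from the report.

```python
# Worked example of the three CMOS power components in Eq. (1).
# All device values here are assumed, illustrative numbers.

def cmos_power(alpha, c_load, vdd, f_clk, i_sc, i_leak):
    """Return (switching, short-circuit, leakage) power in watts."""
    p_switch = alpha * c_load * vdd ** 2 * f_clk  # alpha * CL * Vdd^2 * fclk
    p_sc = i_sc * vdd                             # Isc * Vdd
    p_leak = i_leak * vdd                         # Ileakage * Vdd
    return p_switch, p_sc, p_leak

# 10% switching activity, 1 pF load, 1.1 V supply, 1 GHz clock,
# 1 uA short-circuit current, 0.1 uA leakage current
p_sw, p_sc, p_lk = cmos_power(0.1, 1e-12, 1.1, 1e9, 1e-6, 1e-7)
```

With these (assumed) numbers the switching term dominates the other two by roughly two orders of magnitude, matching the report's observation.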
3. LEVELS OF POWER OPTIMIZATION
3.1 Physical, logic and circuit level: The switching component of the power can be varied by
varying the term α0→1·CL, the effective capacitance (Ceff). This depends on the choice of logic
style: if the inputs do not change from the previous clock period, a static gate does not switch;
the gates of a dynamic CMOS circuit, however, can still switch. The aim of logic optimization is
to choose logic that minimizes switching, position registers to reduce glitching, choose gates
from a library to minimize switching, etc., such that Ceff is minimized.
Circuit topology: The overall switching activity is influenced by the manner in which logic gates
are interconnected. It has two components:
a. Static component: This is a function of topology and signal statistics.
b. Dynamic component: This is more concerned with timing behavior like spurious
transitions and glitches.
Also, the circuit can be implemented in either a chain topology or a tree topology. A chain
topology has higher switching activity than a tree topology because of the extra transitions in
the chain implementation; a tree topology, in contrast, is balanced and glitch-free.
Placement and Routing optimization:
Placement and routing algorithms should assign short wires to signals with high switching
activity and longer wires to signals with low switching activity.
3.2 ARCHITECTURE LEVEL OPTIMIZATION & MEMORY OPTIMIZATION:
This has different strategies like architecture driven voltage scaling, choice of number
representation, exploitation of signal correlations and resynchronization to minimize glitching.
3.2.1 Architecture driven voltage scaling:
The aim is to operate the architecture at a reduced supply voltage. Since gate delay increases as
the supply voltage is lowered, the operating speed must also be reduced: Vdd and f are decreased
together, thereby reducing power dissipation. To compensate for the loss of throughput caused by
the reduction in Vdd and f, pipelining and/or parallel processing are incorporated.
Figure 1: A simple adder datapath [1]
Parallel Processing: The datapath is duplicated so that two operations are processed in parallel,
allowing each unit to run at half the original rate and the supply voltage to drop from 5V to
about 2.9V, while the effective capacitance rises to 2.15Cref [1]. Therefore,
Ppar = CparVpar²fpar = (2.15Cref)(0.58Vref)²(0.5fref) ≈ 0.36Pref    (3)
Pipelining:
Figure 3: Pipelined implementation of the simple datapath [1]
A pipeline latch is added in the critical path, so the adder and comparator can operate at a
lower rate. Since the delays in the adder and comparator are equal, the supply voltage can be
reduced from 5V to 2.9V, while the effective capacitance increases to 1.15Cref. Therefore,
Ppipe = CpipeVpipe²fpipe = (1.15Cref)(0.58Vref)²fref ≈ 0.39Pref    (4)
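The pipelined estimate of equation (4) is easy to verify numerically; this sketch just multiplies out the scaling factors given in the text (1.15x capacitance, 5 V dropped to 2.9 V, unchanged clock).

```python
# Reproduces the pipelined-datapath estimate of Eq. (4):
# Cpipe = 1.15*Cref, Vpipe = 2.9/5.0 = 0.58*Vref, fpipe = fref.

c_scale = 1.15           # Cpipe / Cref
v_scale = 2.9 / 5.0      # Vpipe / Vref
f_scale = 1.0            # fpipe / fref

p_ratio = c_scale * v_scale ** 2 * f_scale   # Ppipe / Pref, about 0.39
```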
On-chip leakage power: The main idea is to power down cache lines that do not contain useful
data. Vdd is turned off using a gating transistor; the control algorithm switches off Vdd for any
cache line that has not been accessed for a long time.
Dynamic Voltage Scaling approach: In the gated-Vdd approach, the state of the cache line is lost
when it is turned off, so there is a significant power and performance overhead associated with
reloading the cache line. To avoid this, a drowsy cache puts cache lines into a low-power mode in
which the contents of the line are preserved. When the contents need to be accessed, the cache
line is reinstated to the high-power mode.
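A minimal sketch of a drowsy-line controller, assuming a simple idle-cycle threshold policy and a one-cycle wake-up penalty (both assumed details; the actual policy and penalty are implementation specific):

```python
# Sketch of a drowsy cache line: lines idle longer than a threshold drop
# to a low-power "drowsy" mode that preserves contents (unlike gated-Vdd);
# an access must first reinstate the line to high-power mode.

class CacheLine:
    def __init__(self):
        self.mode = "high"        # "high" (full Vdd) or "drowsy" (low Vdd)
        self.idle_cycles = 0

    def tick(self, drowsy_threshold=4):
        """One clock cycle without an access to this line."""
        self.idle_cycles += 1
        if self.idle_cycles >= drowsy_threshold:
            self.mode = "drowsy"  # contents are preserved in this mode

    def access(self):
        """Access the line; returns the wake-up penalty in cycles."""
        wakeup_penalty = 1 if self.mode == "drowsy" else 0
        self.mode = "high"        # reinstate to high-power mode
        self.idle_cycles = 0
        return wakeup_penalty
```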
Figure 5: Matrix array [19]
Shutdown happens once the component has been idle for longer than the predicted timeout value. A
timeout scheme is needed to periodically reevaluate Tpred, and a saturation condition is needed
to handle overprediction.
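The predictive shutdown idea can be sketched as follows. The exponential-average update rule and the numeric constants are assumptions for illustration; the text only requires that Tpred be periodically reevaluated and capped to handle overprediction.

```python
# Sketch of a predictive shutdown policy: power down once idle time
# exceeds the predicted timeout T_pred. The update rule (exponential
# average of observed idle periods) and all constants are assumed.

class ShutdownPredictor:
    def __init__(self, t_init=10.0, alpha=0.5, t_max=100.0):
        self.t_pred = t_init
        self.alpha = alpha
        self.t_max = t_max    # saturation: never predict beyond this

    def observe_idle(self, idle_time):
        # Periodic reevaluation of T_pred from the latest idle period.
        self.t_pred = self.alpha * idle_time + (1 - self.alpha) * self.t_pred
        self.t_pred = min(self.t_pred, self.t_max)  # handle overprediction

    def should_shut_down(self, idle_so_far):
        return idle_so_far > self.t_pred
```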
4. RELIABILITY OF LOW POWER SYSTEMS
4.1 RAZOR METHOD:
When the architecture is operated at voltages lower than Vdd-critical, errors crop up in the
system when longer paths are excited. This can be addressed by shaving voltage margins with
Razor, whose goal is to reduce voltage margins through in-situ error detection and correction of
delay failures. In the proposed approach, the processor voltage is tuned based on the observed
error rate: safety margins are eliminated and the circuit is purposely run below the critical
voltage, with data-dependent delay margins. There is a trade-off between the power saved by
voltage scaling and the overhead of error correction.
5. NEAR-THRESHOLD COMPUTING
Figure 9: Energy and delay in different supply voltage operating regions [8]
In the superthreshold regime (Vdd > Vth), energy is highly sensitive to Vdd due to the quadratic
scaling of switching energy with Vdd. Hence voltage scaling down to the near-threshold regime
(Vdd ≈ Vth) yields an energy reduction on the order of 10X at the expense of approximately 10X
performance degradation, as seen in Figure 9 [8]. However, the dependence of energy on Vdd
becomes more complex as voltage is scaled below Vth. In the subthreshold regime (Vdd < Vth),
circuit delay increases exponentially as Vdd decreases, causing leakage energy (the product of
leakage current, Vdd, and delay) to increase in a near-exponential fashion. This rise in leakage
energy eventually dominates any reduction in switching energy, creating the energy minimum seen
in Figure 9.
5.1 NTC BARRIERS:
a. Performance loss
The performance loss observed at NTC poses one of the most challenging problems in utilizing
NTC: an FO4 inverter at an NTC supply is roughly 10X slower than at 1.1 V.
b. Increased performance variation
In the near-threshold regime, MOSFET drive current exhibits an exponential dependency
on Vth, Vdd, and temperature. As a result, NTC designs display a dramatic increase in
performance uncertainty.
c. Increased functional failure
2. Device optimization: At the lowest level of abstraction, the performance of NTC systems
can be greatly improved through modifications and optimizations of the transistor structure and
its fabrication process.
b. Addressing performance variation
1. Soft edge clocking
The device variation inherent to semiconductor manufacturing continues to increase from such
causes as dopant fluctuation and other random sources, limiting the performance and yield of
ASIC designs. One potential solution to address timing variation while minimizing overhead is a
type of soft-edge flip-flop (SFF) that maintains synchronization at a clock edge, but has a small
transparency window, or softness. In one particular approach to soft-edge clocking, tunable
inverters are used in a master-slave flip-flop to delay the incoming master clock edge with
respect to the slave edge. As a result of this delay, a small window of transparency is generated
in the edge-triggered register that accommodates paths in the preceding logic that were too slow
for the nominal cycle time, in essence allowing time borrowing within an edge-triggered register
environment. Hence, soft-edge clocking trades off short paths against long paths and is effective
at mitigating random, uncorrelated variations in delay, which are significant in NTC.
2. Body Biasing
While already effective in the superthreshold domain, body biasing becomes particularly powerful
in the NTC domain, where device sensitivity to the threshold voltage increases exponentially.
Body biasing is therefore a strong lever for modulating frequency and performance in NTC, and is
ideally suited as a technique for addressing the increased impact of process variation in NTC.
nominal case. This in turn reduces cell stability and heightens sensitivity to Vth variation, which
is generally high in SRAM devices to begin with due to the particularly aggressive device sizing
necessary for high density.
The read and write operations can be decoupled by relying on extra complexity in the
periphery of the core array. This results in a cell that is functional below 200 mV and that
achieves relatively high energy efficiency.
2. To retain the robustness of the NTC SRAM but with minimal impact on die space, a cache in
which only a subset of the cache ways is implemented in larger, NTC-tolerant cells is proposed.
This modified cache structure can dynamically reconfigure its access semantics to act like a
traditional cache when needed for performance, and like a filter cache to save energy in
low-power mode. When
performance is not critical, power can be reduced by accessing the low-voltage cache way first,
with the other ways of the cache only accessed on a miss. This technique is similar to that of
filter caches; while it provides power savings, it increases the access time for hits in the
higher-voltage cache ways. When performance is critical, the access methodology is changed to
access all ways of the cache in parallel, providing fast single-cycle access to all data.
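A sketch of the two access modes, with the way organization and cycle counts assumed for illustration:

```python
# Sketch of the reconfigurable cache access described above: in low-power
# mode only the NTC-tolerant low-voltage way is probed first, and the
# remaining ways only on a miss; in performance mode all ways are probed
# in parallel in a single cycle. Way layout and cycle counts are assumed.

def cache_access(tag, ways, low_power):
    """ways: list of sets of tags; ways[0] is the low-voltage way.
    Returns (hit, cycles)."""
    if low_power:
        if tag in ways[0]:
            return True, 1                 # hit in the low-voltage way
        hit = any(tag in w for w in ways[1:])
        return hit, 2                      # extra cycle for the other ways
    # performance mode: all ways accessed in parallel
    return any(tag in w for w in ways), 1
```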
The goal was to parallelize three important classification algorithms, namely SVM, K-means
clustering and the Restricted Boltzmann Machine, and to implement them on a multi-core platform.
The platform used here was the Odroid-XU+E, which is based on the big.LITTLE architecture
developed by ARM. This part is organized as follows: first an introduction to multi-core systems,
followed by a discussion of the big.LITTLE architecture and the ODROID; then the different
algorithms are discussed, followed by how they were parallelized; finally, the results are
presented.
6.2 MULTI CORE SYSTEMS
Figure 11: A basic block diagram of a generic multi core processor [17]
A multi-core processor is generally defined as an integrated circuit to which two or more
independent processors (called cores) are attached. This term is distinct from but related to the
term multi-CPU, which refers to having multiple CPUs which are not attached to the same
integrated circuit. The term uniprocessor generally refers to having one processor per system,
and that the processor has one core; it is used to contrast with multiprocessing architectures, i.e.
either multi-core, multi-CPU, or both. Figure 11 above, illustrates the basic components of a
generic multi-core processor.
There are many ways to operate a multicore processor so as to minimize the power consumed. One of
them is power gating, i.e., turning off the non-active cores. When a core is active, it is
operated at its optimal voltage and frequency. This scheme allows fine-grained power management
with two levels of voltage or two levels of frequency, which makes load balancing easier and
gives better control of temperature.
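The per-core scheme can be sketched as follows; the two voltage/frequency pairs and the effective capacitance are assumed, illustrative values:

```python
# Sketch of per-core power gating with two V/f levels: an idle core is
# gated off entirely; an active core runs at one of two voltage/frequency
# pairs. The V/f pairs and capacitance below are assumed toy values.

def core_power(active, high_perf, c_eff=1e-9):
    """Dynamic power of one core in watts (C * Vdd^2 * f)."""
    if not active:
        return 0.0                       # power-gated: core fully off
    vdd, f = (1.1, 2.0e9) if high_perf else (0.9, 1.4e9)
    return c_eff * vdd ** 2 * f
```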
NEED FOR MULTI CORE:
As different tasks demand higher processing speeds, single-core processors need to be clocked at
higher frequencies, but they cannot be clocked beyond a certain point. Another technique to
improve processing speed is pipelining; as the number of pipeline stages increases, however,
designs face problems of heat dissipation and of complex design and verification, and demand
correspondingly large design teams. Multi-core has emerged as the solution to these problems.
There is a general trend in computer architecture wherein more and more applications are becoming
parallelized and multi-threaded. Multicore processors address the deficiencies of single-core
processors by increasing bandwidth while decreasing power consumption.
LEVELS OF PARALLELISM
1. TASK LEVEL PARALLELISM
Task level parallelism focuses on distributing execution processes (threads) across different
parallel computing nodes. In a multiprocessor system, this type of parallelism is achieved when
each processor executes a different thread (or process) on the same or different data. The threads
may execute the same or different code. In the general case, different execution threads
communicate with one another as they work. Communication usually takes place by passing data
from one thread to the next as part of a workflow. This parallelism emphasizes the distributed
(parallelized) nature of the processing (i.e., threads).
Hardware can exploit these kinds of application parallelism in four major ways:
1. Instruction level parallelism: exploits data level parallelism at modest levels with the help
of the compiler, using ideas like pipelining, and at medium levels using ideas like speculative
execution.
2. Vector architectures and graphics processing units (GPUs): exploit data level parallelism by
applying a single instruction to a collection of data in parallel.
3. Thread level parallelism: exploits either data level parallelism or task level parallelism in
a tightly coupled hardware model that allows interaction among parallel threads.
4. Request level parallelism: exploits parallelism among largely decoupled tasks specified by the
programmer or the operating system.
The Odroid-XU+E comes with four current/voltage sensors that measure the power consumption of the
big A15 cores, LITTLE A7 cores, GPU and DRAM individually, so developers can monitor CPU, GPU and
DRAM power consumption via the included on-board power measurement circuit. By using the
integrated power analysis tool, developers can reduce the need for repeated trials when debugging
power consumption, and can enhance and optimize the performance of their CPU/GPU compute
applications while keeping power consumption as low as possible.
K-means clustering is an algorithm that aims to find the positions μi, i = 1,...,k of the cluster
centers that minimize the distance from the data points to their cluster mean. K-means clustering
solves

arg min over c of Σ(i=1..k) Σ(x∈ci) ||x − μi||²    (7)

The algorithm:
1. Initialize the centers of the clusters.
2. Assign each data point to the closest cluster center:
ci = {j : d(xj, μi) ≤ d(xj, μl) for all l ≠ i, j = 1,...,n}
3. Set each center to the mean of the points assigned to it, and repeat steps 2 and 3 until the
assignments no longer change.
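The steps above can be sketched in plain Python (1-D points for brevity):

```python
# Minimal k-means following the steps above: alternate an assignment step
# (each point goes to its nearest center) and an update step (each center
# moves to the mean of its cluster). 1-D points keep the sketch short.

def kmeans(points, centers, iters=10):
    for _ in range(iters):
        # assignment step: cluster i collects the points closest to center i
        clusters = [[] for _ in centers]
        for x in points:
            i = min(range(len(centers)), key=lambda i: abs(x - centers[i]))
            clusters[i].append(x)
        # update step: move each center to the mean of its cluster
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers
```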
PARALLEL K-MEANS
A single program multiple data (SPMD) approach is employed, based on a message-passing model.
Sending and receiving messages between a master and the concurrently created processes are done
in an asynchronous manner.
1. Set the initial global centroid C = <C1, C2, ..., CK>
2. Partition the data into P subgroups, each of equal size
3. For each subgroup, create a process
4. Send C to each created process for calculating distances and assigning cluster members
5. Collect the local results, recalculate the global centroids, and repeat until the centroids
stabilize
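A sketch of the parallel assignment step in the SPMD style described above, using Python threads and a shared queue as the message channel (the report's implementation used C and POSIX threads; this is an illustrative analogue, not that code):

```python
# SPMD sketch of the parallel assignment step: the master partitions the
# data, sends the global centroid set C to every worker, and collects the
# per-partition assignments asynchronously through a queue.

from queue import Queue
from threading import Thread

def worker(part_id, points, centers, out):
    labels = [min(range(len(centers)), key=lambda i: abs(x - centers[i]))
              for x in points]
    out.put((part_id, labels))          # "send" the result to the master

def parallel_assign(points, centers, n_workers=2):
    size = (len(points) + n_workers - 1) // n_workers
    parts = [points[i * size:(i + 1) * size] for i in range(n_workers)]
    out = Queue()
    threads = [Thread(target=worker, args=(i, part, centers, out))
               for i, part in enumerate(parts)]
    for t in threads:
        t.start()
    collected = dict(out.get() for _ in threads)   # asynchronous receives
    for t in threads:
        t.join()
    # reassemble labels in the original point order
    return [label for i in range(n_workers) for label in collected[i]]
```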
Maximum-margin hyperplane and margins for an SVM trained with samples from two classes.
Samples on the margin are called the support vectors.
If the training data are linearly separable, we can select two hyperplanes in a way that they
separate the data and there are no points between them, and then try to maximize their distance.
The region bounded by them is called "the margin". These hyperplanes can be described by the
equations
w·x − b = 1    (9)
and
w·x − b = −1    (10)

By geometry, the distance between these two hyperplanes is 2/||w||, so we want to minimize ||w||
to maximize the margin. As we also have to prevent data points from falling into the margin, we
add the following constraint: for each i, either
w·xi − b ≥ 1 for xi of the first class, or
w·xi − b ≤ −1 for xi of the second.
This can be rewritten as:
yi(w·xi − b) ≥ 1, for all 1 ≤ i ≤ n.
We can put this together to get the optimization problem:
Minimize ||w|| (in w, b) subject to yi(w·xi − b) ≥ 1 for i = 1,...,n.
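The margin width 2/||w|| and the combined constraint yi(w·xi − b) ≥ 1 can be checked directly on a toy example (all values assumed):

```python
# Checks the hard-margin SVM quantities from the text: the margin width
# 2/||w|| and the constraint y_i(w.x_i - b) >= 1, on toy data.

import math

def margin_width(w):
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

def satisfies_constraints(w, b, samples):
    """samples: list of (x_vector, y) pairs with y in {+1, -1}."""
    return all(y * (sum(wi * xi for wi, xi in zip(w, x)) - b) >= 1
               for x, y in samples)
```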
MULTICLASS SVM
Multiclass SVM aims to assign labels to instances by using support vector machines, where the
labels are drawn from a finite set of several elements. The dominant approach for doing so is to
reduce the single multiclass problem into multiple binary classification problems. Define a multi
category classifier which classifies a pattern by computing c linear discriminant functions
defined as
gi(x) = wiT·x + wi0,  i = 1,...,c,
and assigning x to the category corresponding to the largest discriminant.
PARALLELIZATION OF MULTICLASS SVM
Since there is no data dependency between the matrix multiplications, the computation of each
linear discriminant function, i.e. the multiplication with one weight vector, is done on one
core. The results are compared globally and the appropriate label is assigned according to the
largest value.
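A sketch of this scheme with one thread per discriminant function; the weights in the test are assumed toy values, and Python threads stand in for the cores:

```python
# Parallel multiclass step: each thread evaluates one linear discriminant
# g_i(x) = w_i.x + w_i0; the label with the largest score wins.

from threading import Thread

def classify(x, weights, biases):
    scores = [0.0] * len(weights)

    def discriminant(i):
        # one discriminant function per thread (per core, conceptually)
        scores[i] = sum(w * xi for w, xi in zip(weights[i], x)) + biases[i]

    threads = [Thread(target=discriminant, args=(i,))
               for i in range(len(weights))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # global comparison: pick the largest discriminant value
    return max(range(len(scores)), key=lambda i: scores[i])
```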
6.6 RESTRICTED BOLTZMANN MACHINE
A Boltzmann machine is a bidirectionally connected network of stochastic processing units, which
can be interpreted as a neural network model. Artificial neural networks are computational models
capable of machine learning and pattern recognition. A Boltzmann machine comprises neuron-like
units whose binary activations depend on the neighbors they are connected to.
Stochastic means that a probabilistic element is present in the activation. Stochastic neural
networks are a type of artificial neural networks that are built by introducing random variations
into the network, either by giving the network's neurons stochastic transfer functions, or by
giving them stochastic weights. Stochastic neural networks that are built by using stochastic
transfer functions are often called Boltzmann machines.
A Restricted Boltzmann Machine is a stochastic neural network consisting of one layer of visible
units and one layer of hidden units. It is used to learn the probability distribution over a set of
inputs
Figure 13: Diagram of an RBM with three visible units and four hidden units[11]
RBM FOR CLASSIFICATION:
When unit i is given the opportunity to update its binary state, it first computes its total
input zi, which is the sum of its own bias bi and the weights on connections coming from other
active units:

zi = bi + Σj sj·wij    (11)

where wij is the weight on the connection between i and j, and sj is 1 if unit j is on and 0
otherwise. Unit i then turns on with a probability given by the logistic function:

prob(si = 1) = 1/(1 + e^(−zi))    (12)

For classification, the network computes three such logistic stages in sequence:

data_l1 = 1/(1+(exp(-(data.w_l1) - b_l1T)))    (13)

data is the input data, a 1xd matrix; w_l1 is the weight matrix, of dimension dxm; data_l1 is a
1xm matrix that is fed to the second stage.
data_l2 = 1/(1+(exp(-(data_l1.w_l2) - b_l2T)))
(14)
w_l2 is an mxp matrix and data_l2 is a 1xp matrix that is fed into the third stage.
data_l3 = 1/(1+(exp(- (data_l2.w_l3) - b_l3T)))
(15)
w_l3 is a px1 matrix; hence data_l3 is a single number. A data point is classified as belonging
to class 1 if data_l3 > 0.5, and to class 0 otherwise.
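The three-stage forward pass can be written out in plain Python. Note that 1/(1 + exp(−(x·w) − b)) is the logistic function applied to x·w + b, matching the bias-plus-weighted-sum form of equation (11); the dimensions and weights in the test are assumed toy values, and the name rbm_classify is ours, not the report's.

```python
# Three-stage logistic forward pass as in Eqs. (13)-(15), for a network
# with layer sizes d -> m -> p -> 1. Plain Python lists, no libraries.

import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))    # Eq. (12)

def layer(x, w, b):
    """x: 1xd row, w: dxm matrix, b: length-m bias -> 1xm row."""
    return [logistic(sum(xi * w[i][j] for i, xi in enumerate(x)) + b[j])
            for j in range(len(b))]

def rbm_classify(x, w1, b1, w2, b2, w3, b3):
    h1 = layer(x, w1, b1)        # Eq. (13): 1xm
    h2 = layer(h1, w2, b2)       # Eq. (14): 1xp
    out = layer(h2, w3, b3)[0]   # Eq. (15): a single number
    return 1 if out > 0.5 else 0
```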
PARALLELIZATION OF RBM: There is an inter-stage dependency, as each stage depends on the previous
stage for its inputs. The input data is therefore divided into as many groups as there are cores,
and the above series of multiplications for each group is carried out on a separate core.
6.7 RESULT
The non-threaded versions of the three algorithms were coded in C++ and the threaded versions
were coded using a combination of C and POSIX thread programming[18]. Both the threaded
and non-threaded codes were run on the Odroid and Core i7 processors and the execution time
for the codes was compiled into the following table:
ALGORITHM    CORE i7                        ODROID-XU+E
             THREADED    NON-THREADED       THREADED     NON-THREADED
K-MEANS      13020       13217              ongoing      ongoing
SVM          86018       23602969           277122704    really large value
RBM          ongoing     18801              -            -

TABLE 1: Number of clock cycles taken to execute the threaded and non-threaded versions on the
ODROID and Core i7
6.8 CONCLUSION:
From the table, it is seen that the execution time taken by the Odroid for the parallelized versions
of the algorithms is less than the time taken by it to execute the non-parallelized versions. The
threads were allocated to the cores by the compiler based on the existing process load in each of
the cores. The heavy and complex multiplications were carried out by the Big core and small
multiplications were carried out by the LITTLE core. When the amount of work per thread is small,
the non-threaded code takes much less time than the threaded code, because of the overhead
involved in creating the threads. The same code was therefore run for a large number of
iterations to amortize the thread-creation overhead; the advantage of threading was observed only
when the execution time of the code was large.
7. REFERENCES:
[17] ... for Multi-core Processors, http://www.cse.wustl.edu/~jain/cse567-11/ftp/multcore/#sec1.1
[18] https://computing.llnl.gov/tutorials/pthreads/
[19] Tutorial by Dr. Chaitali on Architecture and Algorithm-level Techniques to Reduce
Power/Energy