
Published in IET Computers & Digital Techniques

Received on 25th May 2011


Revised on 15th January 2012
doi: 10.1049/iet-cdt.2011.0089
Special Issue: High-Performance Computing System
Architectures: Design and Performance
ISSN 1751-8601
Decimal floating-point antilogarithmic converter based
on selection by rounding: algorithm and architecture
D. Chen L. Han S.-B. Ko
Department of Electrical and Computer Engineering, University of Saskatchewan, Campus Drive 57, Saskatoon,
SK S7N 5A9, Canada
E-mail: seokbum.ko@usask.ca
Abstract: This study presents the algorithm and architecture of a decimal floating-point (DFP) antilogarithmic converter based on the digit-recurrence algorithm with selection by rounding. The proposed approach can compute faithful DFP antilogarithmic results for any one of the three DFP formats specified in the IEEE 754-2008 standard. The proposed architecture is synthesised with an STM 90-nm standard cell library, and the results show that the critical path delay and the number of clock cycles of the proposed Decimal64 antilogarithmic converter are 1.26 ns (28.0 FO4) and 19, respectively, and that the total hardware complexity is 29325 NAND2 gates. The delay estimation results show that the proposed architecture has a significantly lower latency than recently published high-performance decimal CORDIC implementations.
1 Introduction
Nowadays, there are many commercial demands for decimal floating-point (DFP) arithmetic operations in areas such as financial analysis, tax calculation, currency conversion, Internet-based applications and e-commerce [1]. This trend gives rise to further development of DFP arithmetic units that perform accurate computations with exact decimal operands. Owing to its significance, DFP arithmetic has been included in the specifications of the IEEE 754-2008 standard [2]. As the main part of a decimal microprocessor, the basic decimal arithmetic units, such as the decimal adder/subtracter, multiplier and divider, are attracting more and more researchers' attention. A complete survey of hardware designs for the basic decimal arithmetic units is given in [3]. Recently, the hardware components of the basic DFP arithmetic units have been implemented in IBM's System z9 [4], POWER6 [5] and z10 [6] microprocessors.
Transcendental functions, including logarithm, antilogarithm, exponential, reciprocal, sine, cosine, tangent, arctangent and so on, are useful in many areas of science and engineering, such as 3D computer graphics, scientific computation, artificial neural networks, digital signal processing and logarithmic number systems. Decimal transcendental function computation is also useful for some specific applications, such as computations used in financial applications in banks [7], scientific decimal calculators [8] and some pocket computers [9]. The decimal transcendental functions have been specified in IEEE 754-2008 as recommended decimal arithmetic operations. Recently, Intel Corporation has provided the first software solution to compute the DFP transcendental functions using an existing and well-established binary floating-point (BFP) transcendental function mathematical library [10]. However, as requirements on computational speed and accuracy become stricter, hardware components may be included in high-end microprocessors to support decimal transcendental computation.
Muller [11] presents both software- and hardware-oriented algorithms to compute transcendental functions, and discusses issues related to accurate BFP implementations of these functions. Hardware-oriented algorithms based on digit recurrence with selection by rounding have been introduced for high-radix binary division and square root [12–14], CORDIC [15], logarithm [16] and exponential [17], [33] operations. This method can efficiently decrease the cost of implementation, in particular the complexity of the selection function for redundant digits. We have presented in [17] a DFP logarithmic converter based on a radix-10 digit-recurrence algorithm with selection by rounding. In this paper, the same approach is analysed to implement the DFP antilogarithmic converter in order to achieve faithful antilogarithmic results for DFP operands specified in IEEE 754-2008. The design described in this paper improves on our previous research presented in [18] and includes the following novelties: (i) using the redundant carry-save representation in the data-path; (ii) selecting redundant digits by rounding estimated residuals; (iii) retiming and balancing the delay of the proposed architecture; (iv) implementing the novel subcomponents in the carry-save data-path; and (v) processing the normalisation, the parallel final addition and the rounding operation to produce the DFP antilogarithmic results.
This paper is organised as follows: Section 2 gives an overview of the DFP antilogarithm operation. In Section 3, the proposed algorithm and the error analysis for the DFP antilogarithm computation are presented. Section 4 describes the architecture of the proposed DFP antilogarithmic converter with details of its hardware implementation. In Section 5, we analyse the area-delay evaluation results of the proposed architecture, and then compare the performance of the proposed design with our previous design [18], the recent decimal CORDIC designs [19, 20] and the software implementation [10]. Section 6 gives conclusions.
IET Comput. Digit. Tech., 2012, Vol. 6, Iss. 5, pp. 277–289
doi: 10.1049/iet-cdt.2011.0089 © The Institution of Engineering and Technology 2012
www.ietdl.org
2 DFP antilogarithm operation
IEEE 754-2008 specifies three DFP interchange formats: a 32-bit storage format called Decimal32, and 64-bit and 128-bit computational formats called Decimal64 and Decimal128, respectively. The value of a DFP operand $v$ compliant with the DFP format is represented as follows

\[ v = (-1)^{s} \times 10^{e} \times \mathrm{significand} \tag{1} \]

In (1), $s$ is the 1-bit sign of the DFP operand. The real exponent, the $(w+2)$-bit $e$, is calculated by subtracting an exponent bias from the value of the encoded exponent, where $w$ is equal to 6, 8 and 12, respectively, in the three DFP formats. In IEEE 754-2008, $emax$ and $emin$ represent the maximum and minimum values of the real exponent, where $emax$ is equal to +96, +384 and +6144, respectively, and $emin$ is equal to $-(emax - 1)$ for the three DFP formats. The significand is a $q$-digit non-normalised unsigned decimal fixed-point (DXP) number in the form $d_0, d_1, d_2, \ldots, d_{q-1}$, $0 \le d_i < 10$, where $q$ is equal to 7, 16 and 34, respectively, for the three DFP formats. In IEEE 754-2008, the decimal significand can be encoded by densely packed decimal (DPD) encoding [21] or binary integer decimal encoding [22]. In this paper, we choose the DFP format in DPD encoding so that the decimal significand of a DFP operand can be decoded to binary-coded decimal (BCD) representation in hardware.
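The format parameters quoted above can be gathered in a few lines of Python. This is only an illustrative sketch: the names `DfpFormat`, `FORMATS` and `dfp_value` are hypothetical, and passing the significand as a ready-made `Decimal` (rather than decoding DPD) is an assumption made for brevity.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class DfpFormat:
    """Parameters of one decimal interchange format (illustrative container)."""
    name: str
    w: int      # the real exponent e is a (w + 2)-bit quantity
    q: int      # number of significand digits
    emax: int   # maximum real exponent

    @property
    def emin(self) -> int:
        # emin = -(emax - 1), as stated in the text
        return -(self.emax - 1)


FORMATS = {
    "Decimal32": DfpFormat("Decimal32", w=6, q=7, emax=96),
    "Decimal64": DfpFormat("Decimal64", w=8, q=16, emax=384),
    "Decimal128": DfpFormat("Decimal128", w=12, q=34, emax=6144),
}


def dfp_value(s: int, e: int, significand: Decimal) -> Decimal:
    """Evaluate (1): v = (-1)^s * 10^e * significand."""
    return (-1) ** s * Decimal(10) ** e * significand
```

For instance, `dfp_value(1, -16, Decimal(8576308882936892))` evaluates the operand used in the worked example of Table 3.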
2.1 Exception handling
A valid DFP antilogarithm operation is defined as

\[ R = \mathrm{Antilog}_{10}(v) = 10^{v} \tag{2} \]

There are some exceptional cases that need to be dealt with during a DFP antilogarithm operation:
If $v$ is a NaN, the DFP antilogarithm operation returns NaN and signals the invalid operation exception.
If $v$ is positive infinity, the antilogarithm operation simply returns $+\infty$, and if $v$ is negative infinity, the antilogarithm operation simply returns $+0$ with no exception.
If $v$ is in the range $(\log_{10}(|v_{max}|), +\infty]$, the antilogarithm operation satisfies the overflow condition and returns the maximum representable DFP number or $+\infty$, depending on the rounding mode.
If $v$ is in the range $[-\infty, \log_{10}(|v_{min}|)]$, the antilogarithm operation satisfies the underflow condition and rounds the intermediate result down to zero or to the minimum representable DFP number, depending on the rounding mode.
If $v$ is in the range $[\log_{10}(|v_{min}|), \log_{10}(|v_{max}|)]$, a normal DFP antilogarithm operation takes place. The rest of this paper details the computation on this interval in particular.
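The case analysis above can be sketched in software with Python's `decimal` module. The function name `classify_antilog_input` and its return labels are hypothetical. The sketch assumes that $|v_{max}|$ is the largest finite magnitude of the format and that $|v_{min}|$ is the smallest positive (subnormal) magnitude, and it treats the interval end points as belonging to the normal case; the text does not settle these details.

```python
from decimal import Decimal, localcontext


def classify_antilog_input(v: Decimal, q: int, emax: int) -> str:
    """Label the case that 10**v falls into for a format with q digits."""
    if v.is_nan():
        return "invalid: return NaN"
    if v.is_infinite():
        return "return +inf" if v > 0 else "return +0"
    emin = -(emax - 1)
    with localcontext() as ctx:
        ctx.prec = q + 10
        # Assumed extremes of the format (see the lead-in).
        v_max = (Decimal(10) ** q - 1).scaleb(emax - q + 1)
        v_min = Decimal(1).scaleb(emin - q + 1)
        log_max = v_max.log10()
        log_min = v_min.log10()
    if v > log_max:
        return "overflow"
    if v < log_min:
        return "underflow"
    return "normal"
```

With the Decimal32 parameters ($q = 7$, $emax = 96$), an input of 97 is classified as overflow and an input of $-102$ as underflow.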
The DFP antilogarithm operation can be transformed to a decimal fixed-point (DXP) antilogarithm computation

\[ R = 10^{(-1)^{s} \times 10^{e} \times d_0.d_1 d_2 \ldots d_{q-1}} = 10^{\pm v_{int}} \times 10^{\pm v_{frac}} \tag{3} \]

In (3), $v_{int}$ is an $n$-digit decimal integer in the range $[emin - q + 1, emax]$, where $n$ is equal to 3, 3 and 4 for the Decimal32, Decimal64 and Decimal128 formats, respectively. The sum of $v_{int}$ and $k$ represents the real exponent of the DFP antilogarithmic result, where $k$ is obtained by normalising the DXP result of $10^{v_{frac}}$ to the decimal significand of the DFP antilogarithmic result. Since a valid value of $v$ can be very close to zero, the fraction $v_{frac}$ can be represented by several leading zeros followed by the $q$-digit decimal significand, $v_{frac} = \pm 0.00\ldots00\,d_0 d_1 \ldots d_{q-1}$. Therefore $v_{frac}$ is a decimal fraction in the range $(-1, 1)$, which can be completely represented by at most $(emin - q + 1)$ digits or at least $(q - n)$ digits. Since the results of $10^{v_{frac}}$ are in the range $(0.1, 10)$, the $q$-digit faithful DXP antilogarithmic result is enough to represent the $q$-digit decimal significand of the DFP antilogarithmic result. When $v$ is very close to zero, $v_{frac}$ cannot be processed in hardware with the limited width of the data-path. Therefore $v_{frac}$ is truncated to at least a $(q+1)$-digit $v'_{frac}$ so that the approach can still guarantee a faithfully rounded DFP antilogarithmic result. In the following, we focus on the algorithm and the architecture of the $(q+1)$-digit DXP antilogarithmic converter, which can produce the $q$-digit faithful decimal significand of the DFP antilogarithmic result.
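The split of the operand into an integer part and a truncated fraction can be illustrated as follows. This is a software sketch only: `split_operand` is a hypothetical helper, and truncation toward zero is an assumption about how the (q+1)-digit fraction is formed.

```python
from decimal import Decimal, ROUND_DOWN, localcontext


def split_operand(v: Decimal, q: int):
    """Return (v_int, truncated v_frac) with q + 1 fractional digits kept."""
    with localcontext() as ctx:
        ctx.prec = 4 * q + 20                      # generous working precision
        v_int = v.to_integral_value(rounding=ROUND_DOWN)   # truncate toward zero
        v_frac = v - v_int                         # same sign as v, |v_frac| < 1
        ulp = Decimal(1).scaleb(-(q + 1))          # weight of the last kept digit
        v_frac_trunc = v_frac.quantize(ulp, rounding=ROUND_DOWN)
    return int(v_int), v_frac_trunc
```

For the Decimal64 operand of Table 3, `split_operand(Decimal("-0.8576308882936892"), 16)` gives an integer part of 0 and a 17-digit fraction, so that $10^{v} = 10^{v_{int}} \times 10^{v_{frac}}$ reduces to the fixed-point problem treated in the rest of the paper.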
2.2 Rounding
IEEE 754-2008 specifies five rounding modes [2]. A common requirement for the DFP antilogarithmic operation in IEEE 754-2008 is the ability to compute exactly rounded results (within 0.5 ulp of precision). To achieve exactly rounded results in any of the rounding modes, it is necessary to determine whether the exact (infinite-precision) result is less than or greater than the midpoint between the two nearest DFP numbers. However, if the exact result is very close to the midpoint, exact rounding is difficult to perform unless we can determine the maximum length of the chain of nines or zeros after the rounding digit for every possible DFP result (the Table Maker's Dilemma [23]). On the other hand, providing additional guard digits before rounding can not only guarantee results much closer to a half-ulp error, but also reduce the probability of incorrect rounding to near zero. In this paper, we mainly focus on delay optimisation of the proposed DFP antilogarithmic converter, so we design a digit-recurrence algorithm that achieves faithfully rounded results (within 1 ulp of precision) for the DFP antilogarithmic operation using the roundTiesToEven mode.
3 Algorithm
A digit-recurrence algorithm to compute $10^{v'_{frac}}$ is summarised as follows

\[ \lim_{j \to \infty} \left\{ v'_{frac} - \sum \log_{10}(f_j) \right\} \to 0 \tag{4} \]

If (4) is satisfied

\[ \lim_{j \to \infty} \left\{ \sum \log_{10}(f_j) \right\} \to v'_{frac} \tag{5} \]

Thus

\[ 10^{v'_{frac}} = \prod_{j=1}^{\infty} f_j \tag{6} \]

$f_j$ is defined as $f_j = 1 + e_j 10^{-j}$, by which $v'_{frac}$ is transformed to 0 through successive subtraction of $\log_{10}(f_j)$. This form of $f_j$ allows the use of a decimal shift-and-add implementation.
According to (5) and (6), the corresponding recurrences for transforming $v'_{frac}$ and computing the antilogarithm are presented in (7) and (8), where $j \ge 1$, $L[1] = v'_{frac}$ and $E[1] = 1$

\[ L[j+1] = L[j] - \log_{10}(1 + e_j 10^{-j}) \tag{7} \]

\[ E[j+1] = E[j] \times (1 + e_j 10^{-j}) \tag{8} \]

The digits $e_j$ are selected so that $L[j+1]$ converges to 0. One digit of accuracy is therefore obtained in each iteration. After the last iteration of the recurrence, the results are

\[ L[j+1] \approx 0 \tag{9} \]

\[ E[j+1] \approx 10^{v'_{frac}} \tag{10} \]

To obtain a selection function for $e_j$, a scaled remainder is defined in (11), where $g$ is a scaling constant.

\[ W[j] = 10^{j} \times L[j] \times g \tag{11} \]

Thus

\[ L[j] = W[j] \times 10^{-j} \times g^{-1} \tag{12} \]

Substituting (12) into (7) gives

\[ W[j+1] = 10 W[j] - 10^{j+1} \times g \times \log_{10}(1 + e_j 10^{-j}) \tag{13} \]
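As a worked step that the text does not spell out, the first terms of the Taylor expansion of the logarithm indicate why a scaling constant close to $\ln 10 \approx 2.3026$ makes the scaled remainder suitable for digit selection. The shorthand $x = e_j 10^{-j}$ is introduced here only for this derivation.

```latex
\begin{align*}
x &= e_j 10^{-j}, \qquad
\log_{10}(1+x) = \frac{1}{\ln 10}\left(x - \frac{x^{2}}{2} + \cdots\right) \\
10^{j+1}\, g \,\log_{10}(1 + e_j 10^{-j})
  &= \frac{g}{\ln 10}\left(10\,e_j - 5\,e_j^{2}\,10^{-j} + \cdots\right) \\
W[j+1] &\approx 10\,(W[j] - e_j)
  + 10\,e_j\left(1 - \frac{g}{\ln 10}\right)
  + \frac{g}{\ln 10}\,5\,e_j^{2}\,10^{-j}
\end{align*}
```

With the value $g = 2.3$ adopted in Section 3.1, the factor $1 - g/\ln 10$ is about 0.0011, so when $e_j$ is the integer nearest to $W[j]$ the next remainder is dominated by $10(W[j] - e_j)$, and the second-order term shrinks by a factor of 10 with every iteration.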
3.1 Selection by rounding
The selection of the digit $e_j$ is achieved by rounding the scaled residual to its integer part. In order to reduce the delay of the selection function, the rounding is performed on an estimate $\hat{W}[j]$, which is obtained by truncating $W[j]$ to $t$ fractional digits (truncating $W[j]$ at the position $10^{-t}$). The selection function is

\[ e_j = \mathrm{round}(\hat{W}[j]) \tag{14} \]

In (14), round indicates that if the digit of $\hat{W}[j]$ at the position $10^{-1}$ is larger than or equal to 5, the digit $e_j$ is obtained by adding 1 to the integer part of $\hat{W}[j]$; otherwise it is the integer part of $\hat{W}[j]$. In this work, selection by rounding is performed with the maximally redundant digit set $e_j \in \{-9, -8, \ldots, 0, \ldots, 8, 9\}$. Since $|e_j| \le 9$

\[ -9.5 < \hat{W}[j] < 9.5 \tag{15} \]

For (15) to be satisfied, the range of $W[j]$ is

\[ -9.5 + \delta_t < W[j] < 9.5 + \delta_t \tag{16} \]

In (16), $\delta_t$ is the truncation error. It should be noted that $0 \le \delta_t < (10/9) \times 10^{-t}$, regarding the sign-magnitude carry-save representation of $W[j]$. Therefore the bounds of $W[j] - e_j$ are

\[ -0.5 < W[j] - e_j < 0.5 + \frac{10}{9} \times 10^{-t} \tag{17} \]

Since (13) can be represented as

\[ W[j+1] = 10\,(W[j] - e_j) - 10^{j+1} \times g \times \log_{10}(1 + e_j 10^{-j}) + 10 e_j \tag{18} \]

to keep $-9.5 < \hat{W}[j+1] < 9.5$, we must keep

\[ -9.5 + \frac{10}{9} \times 10^{-t} < W[j+1] < 9.5 \tag{19} \]

According to (17), (18) and (19), the numerical analysis proceeds as follows

\[ 10^{j+1} \times g \times \log_{10}(1 + e_j 10^{-j}) - 10 e_j > -4.5 + \frac{10}{9} \times 10^{-t+1} \tag{20} \]

\[ 10^{j+1} \times g \times \log_{10}(1 + e_j 10^{-j}) - 10 e_j < 4.5 - \frac{10}{9} \times 10^{-t} \tag{21} \]

The numerical analysis shows that, with $g = 2.3$, conditions (20) and (21) are satisfied only if $j \ge 3$ and $t \ge 1$. As a result, selection by rounding is only valid for iterations $j \ge 3$, and $e_1$ and $e_2$ can only be obtained from look-up tables. However, using two look-up tables for $j = 1, 2$ significantly increases the overall hardware cost. Therefore a restriction on $e_1$ is defined so that $e_2$ can be obtained by selection by rounding and one look-up table is saved. Since $W[1] = 10 \times 2.3 \times v'_{frac}$, $W[2]$ can be obtained as

\[ W[2] = 230 \times v'_{frac} - 10^{2} \times 2.3 \times \log_{10}(1 + e_1 10^{-1}) \tag{22} \]

When $j = 2$ and $t = 1$, $e_2$ must be in the range $-8 \le e_2 \le 8$ for (20) and (21) to be satisfied. Substituting $-8 \le e_2 \le 8$ and $t = 1$ in (17) yields

\[ -8.5 < W[2] < 8.5 + \frac{1}{9} \tag{23} \]

According to (22) and (23), we obtain

\[ 230 \times v'_{frac} - 10^{2} \times 2.3 \times \log_{10}(1 + e_1 10^{-1}) < 8.5 + \frac{1}{9} \tag{24} \]

\[ 230 \times v'_{frac} - 10^{2} \times 2.3 \times \log_{10}(1 + e_1 10^{-1}) > -8.5 \tag{25} \]

The numerical analysis of (24) and (25) shows that the decimal input operand $v'_{frac}$ is restricted to the range $-1.03 \le v'_{frac} \le 0.31$ so that $e_2$ can be obtained by selection by rounding. Since the value of $v'_{frac}$ is in the range $-1 < v'_{frac} < 1$, a positive $v'_{frac}$ must be made negative: the fraction part of the positive $v'_{frac}$ is first adjusted to a negative value by $v'_{frac} - 1$, and its corresponding integer part $v_{int}$ is then adjusted to $v_{int} + 1$. Table 1 shows the selection of $e_1$. Since a 1-digit $e_1$ fails to provide continuous ranges in Table 1 that cover all negative $v'_{frac}$, $e_1$ is extended to two digits so that all negative $v'_{frac}$ are covered.
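The complete digit recurrence can be simulated in software. The sketch below is a reference model written with Python's `decimal` module, not the proposed hardware: the names `TABLE1`, `select_e1`, `select_by_rounding` and `antilog10_frac` are hypothetical, the logarithms are evaluated at high precision instead of being read from look-up tables, the truncation of the estimate is done toward zero on a non-redundant value, and the Table 1 intervals are tested against the full value of the input. The input is assumed to have been adjusted already to the range $(-1, 0]$.

```python
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_UP, localcontext

G = Decimal("2.3")  # scaling constant g of (11)

# Table 1 as (largest |v_frac| covered, e_1), in increasing order of |v_frac|.
TABLE1 = [
    ("0.02", "0.0"), ("0.07", "-1.0"), ("0.12", "-2.0"), ("0.19", "-3.0"),
    ("0.24", "-4.0"), ("0.28", "-4.5"), ("0.32", "-5.0"), ("0.37", "-5.5"),
    ("0.42", "-6.0"), ("0.49", "-6.5"), ("0.55", "-7.0"), ("0.61", "-7.4"),
    ("0.67", "-7.7"), ("0.72", "-8.0"), ("0.77", "-8.2"), ("0.82", "-8.4"),
    ("0.89", "-8.6"), ("0.94", "-8.8"), ("0.98", "-8.9"), ("1.00", "-9.0"),
]


def select_e1(v_frac: Decimal) -> Decimal:
    """Look up the 2-digit e_1 for a v_frac in (-1, 0]."""
    if not (-1 < v_frac <= 0):
        raise ValueError("v_frac must lie in (-1, 0]")
    for bound, e1 in TABLE1:
        if abs(v_frac) <= Decimal(bound):
            return Decimal(e1)
    raise AssertionError("unreachable: Table 1 covers (-1, 0]")


def select_by_rounding(w: Decimal, t: int = 1) -> int:
    """Selection function (14): truncate W[j] to t fractional digits, then round."""
    estimate = w.quantize(Decimal(1).scaleb(-t), rounding=ROUND_DOWN)
    return int(estimate.quantize(Decimal(1), rounding=ROUND_HALF_UP))


def antilog10_frac(v_frac: Decimal, q: int):
    """Run recurrences (8) and (13) for q + 1 iterations; return (E, digits)."""
    with localcontext() as ctx:
        ctx.prec = 2 * q + 20                      # software working precision
        w = 10 * G * v_frac                        # W[1] = 10 * 2.3 * v_frac
        e_acc = Decimal(1)                         # E[1] = 1
        digits = []
        for j in range(1, q + 2):
            e_j = select_e1(v_frac) if j == 1 else Decimal(select_by_rounding(w))
            factor = 1 + e_j.scaleb(-j)            # f_j = 1 + e_j * 10^-j
            w = 10 * w - Decimal(10) ** (j + 1) * G * factor.log10()   # (13)
            e_acc = e_acc * factor                 # (8)
            digits.append(e_j)
        return e_acc, digits
```

Calling `antilog10_frac(Decimal("-0.85763088829368920"), 16)` starts with the digits $-8.6$, $-1$ and $+1$, which agrees with the first iterations listed in Table 3.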
3.2 Error analysis and evaluation
The errors in the proposed antilogarithmic digit-recurrence algorithm can be produced in four ways. The first type of error is the inherent error of the algorithm, $\varepsilon_i$, resulting from the difference between the antilogarithm results obtained from finite iterations and the exact results obtained from infinite iterations. The second is the inexact input error, $\varepsilon_v$, produced by the difference between the antilogarithmic results of the inexact input $v'_{frac}$ and the real input $v_{frac}$. The third is the quantisation error, $\varepsilon_q$, resulting from the finite precision of the intermediate values in the hardware implementation. The fourth is the final rounding error, $\varepsilon_r$, whose maximum value is 0.5 ulp ($|\varepsilon_r| \le 0.5 \times 10^{-q}$).
In order to achieve a $q$-digit decimal significand of the faithful DFP antilogarithmic result, the following condition must be satisfied

\[ \varepsilon_t = \varepsilon_i + \varepsilon_v + \varepsilon_q \le 10^{-q} \tag{26} \]
3.2.1 Inherent error of algorithm: Since each DXP antilogarithmic result is obtained after the $(q+1)$th iteration, $\varepsilon_i$ can be defined as

\[ \varepsilon_i = \prod_{j=1}^{\infty} (1 + e_j 10^{-j}) - \prod_{j=1}^{q+1} (1 + e_j 10^{-j}) \tag{27} \]

Thus, (27) can be written as

\[ \varepsilon_i = \prod_{j=1}^{\infty} (1 + e_j 10^{-j}) \times \left\{ 1 - \frac{1}{\prod_{j=q+2}^{\infty} (1 + e_j 10^{-j})} \right\} \tag{28} \]

In (28), since the proposed DXP antilogarithmic algorithm handles input values in the range $(-1, 0)$, the exact antilogarithmic results, obtained after infinite iterations, are in the range $(0.1, 1)$. In order to use the static error analysis method, we substitute the case $e_j = 9$ or $-9$ and the maximum value of the exact antilogarithmic results into (28); the maximum $\varepsilon_i$ is then

\[ \varepsilon_i \le 1 - \frac{1}{\prod_{j=q+2}^{\infty} (1 + 9 \times 10^{-j})} \tag{29} \]

In (29), it is obvious that

\[ \prod_{j=q+2}^{\infty} (1 + 9 \times 10^{-j}) = e^{\sum_{j=q+2}^{\infty} \ln(1 + 9 \times 10^{-j})} \tag{30} \]

Since (30) is satisfied and

\[ \sum_{j=q+2}^{\infty} \ln(1 + 9 \times 10^{-j}) < 9 \times (10^{-q-2} + 10^{-q-3} + \cdots) \tag{31} \]

we obtain

\[ \prod_{j=q+2}^{\infty} (1 + 9 \times 10^{-j}) < e^{9 \times (10^{-q-2} + 10^{-q-3} + \cdots)} \tag{32} \]

Thus, the maximum absolute $\varepsilon_i$ is

\[ |\varepsilon_i| < 1 - \frac{1}{e^{9 \times (10^{-q-2} + 10^{-q-3} + \cdots)}} \approx 1 \times 10^{-q-1} \tag{33} \]
3.2.2 Inexact input error: If a DFP operand $v$ is very close to zero, the whole digit-width of $v_{frac} = \pm 0.00\ldots00\,d_0 d_1 \ldots d_{q-1}$ can be too long to be implemented. $v_{frac}$ has to be truncated to at least a $(q+1)$-digit $v'_{frac}$ in the DXP antilogarithmic operation. Therefore the inexact input error can be defined as

\[ \varepsilon_v = 10^{v_{frac}} - 10^{v'_{frac}} \tag{34} \]

It is evident that the maximum $\varepsilon_v$ is obtained when (i) $v_{frac}$ consists of $(q+1)$ leading zeros followed by the $q$-digit decimal significand; and (ii) each decimal significand digit $d_0, d_1, \ldots, d_{q-1}$ is equal to 9

\[ \varepsilon_v \le 10^{\pm 0.\overbrace{00\ldots00}^{q+1}\overbrace{99\ldots99}^{q}} - 10^{\pm 0.\overbrace{00\ldots00}^{q+1}} \tag{35} \]
Table 1 Selection of $e_1$
Range of $v'_{frac}$ | $e_1$ (BCD) | Range of $v'_{frac}$ | $e_1$ (BCD)
[-0.00, -0.02] | -0.0 (00000000) | (-0.49, -0.55] | -7.0 (00110000)
(-0.02, -0.07] | -1.0 (10010000) | (-0.55, -0.61] | -7.4 (00100110)
(-0.07, -0.12] | -2.0 (10000000) | (-0.61, -0.67] | -7.7 (00100011)
(-0.12, -0.19] | -3.0 (01110000) | (-0.67, -0.72] | -8.0 (00100000)
(-0.19, -0.24] | -4.0 (01100000) | (-0.72, -0.77] | -8.2 (00011000)
(-0.24, -0.28] | -4.5 (01010101) | (-0.77, -0.82] | -8.4 (00010110)
(-0.28, -0.32] | -5.0 (01010000) | (-0.82, -0.89] | -8.6 (00010100)
(-0.32, -0.37] | -5.5 (01000101) | (-0.89, -0.94] | -8.8 (00010010)
(-0.37, -0.42] | -6.0 (01000000) | (-0.94, -0.98] | -8.9 (00010001)
(-0.42, -0.49] | -6.5 (00110101) | (-0.98, -1.00) | -9.0 (00010000)
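Equation (35) can be written as

\[ \varepsilon_v \le \left( 10^{\pm 0.\overbrace{99\ldots99}^{q}} \right)^{10^{-q-1}} - 1 \tag{36} \]

Thus

\[ \log_{10}(1 + \varepsilon_v) \le \pm 0.\overbrace{99\ldots99}^{q} \times 10^{-q-1} \tag{37} \]

According to the Taylor series expansion of the logarithm function $\log_{10}(1 + x)$, we obtain

\[ \left( \varepsilon_v - \frac{\varepsilon_v^{2}}{2} + \cdots \right) / \ln(10) < \varepsilon_v / \ln(10) \le \pm 0.\overbrace{99\ldots99}^{q} \times 10^{-q-1} \tag{38} \]

Therefore the maximum absolute $\varepsilon_v$ is

\[ |\varepsilon_v| \le 2.303 \times 10^{-q-1} \tag{39} \]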
3.2.3 Quantisation error: Since only finite precision of the intermediate values is processed in the hardware implementation, a quantisation error is produced. In this paper, we define FD as the minimal number of fractional digits for each intermediate value. The DXP antilogarithmic result is obtained by $(q+1)$ successive multiplications

\[ 10^{v'_{frac}} = \prod_{j=1}^{q+1} (1 + e_j 10^{-j}) \tag{40} \]

The fractional digits of the intermediate multiplication results are held in the carry-save representation, in which a carry may occur in the FD-th digit that is shifted out of the data-path in the first iteration. Therefore a truncation error of $10^{-FD}$ is produced from the first iteration. After $(q+1)$ iterations, the maximum quantisation error, $\varepsilon_q$, can be represented as

\[ \varepsilon_q = 10^{-FD} \times \left\{ \prod_{j=1}^{q+1} (1 + e_j 10^{-j}) + \cdots + \prod_{j=q+1}^{q+1} (1 + e_j 10^{-j}) + 1 \right\} \tag{41} \]

According to the same mathematical method as in (30), (31) and (32), each successive multiplication in (41) satisfies

\[ 10^{-FD} \times \prod_{j=1}^{q+1} (1 + e_j 10^{-j}) < 10^{-FD} \times e^{\sum_{j=1}^{q+1} e_j 10^{-j}} \tag{42} \]

Thus, the maximal quantisation error, $\varepsilon_q$, satisfies

\[ \varepsilon_q < 10^{-FD} \times \left( e^{\sum_{j=1}^{q+1} e_j 10^{-j}} + \cdots + e^{\sum_{j=q+1}^{q+1} e_j 10^{-j}} + 1 \right) \tag{43} \]

Considering the case $e_j = 9$ or $-9$ in (43), we obtain the maximum absolute $\varepsilon_q$

\[ |\varepsilon_q| < (q + 2) \times 10^{-FD} \tag{44} \]
3.2.4 Error evaluation: Having obtained $\varepsilon_i$, $\varepsilon_v$ and $\varepsilon_q$ in (33), (39) and (44), respectively, we obtain the maximum absolute error $\varepsilon_t$ as

\[ |\varepsilon_t| = |\varepsilon_i| + |\varepsilon_v| + |\varepsilon_q| \le 0.331 \times 10^{-q} + (q + 2) \times 10^{-FD} \tag{45} \]

We substitute the digit-width of the decimal significand of the three DFP formats, $q = 7$, 16 and 34, into (45). The results indicate that the maximum absolute errors $|\varepsilon_t|$ obtained in the three DFP formats are smaller than 0.5 ulp, which satisfies condition (26). Thus, the final rounded results meet the accuracy requirement of 1 ulp after the final rounding error is taken into account. Table 2 shows the error analysis for the three DFP interchange formats. The error analysis in Table 2 shows that the proposed algorithm can guarantee $q$-digit accuracy for the DXP antilogarithm operation only when the minimal number of fractional digits for each intermediate value (FD) is at least $(q+2)$ for Decimal32 or $(q+3)$ for Decimal64 and Decimal128; a $q$-digit decimal significand of the faithful DFP antilogarithmic result can then be achieved.
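The bound (45) is easy to evaluate numerically. The following sketch uses exact rational arithmetic and expresses the bound in units of $10^{-q}$; the function names `total_error_bound` and `minimal_fd` are hypothetical, and the 0.5 threshold is the half-ulp budget mentioned in the text.

```python
from fractions import Fraction


def total_error_bound(q: int, fd: int) -> Fraction:
    """Bound (45) in units of 10^-q: 0.331 + (q + 2) * 10^(q - FD)."""
    return Fraction(331, 1000) + (q + 2) * Fraction(10) ** (q - fd)


def minimal_fd(q: int) -> int:
    """Smallest FD for which the bound stays below 0.5 (in units of 10^-q)."""
    fd = q
    while total_error_bound(q, fd) >= Fraction(1, 2):
        fd += 1
    return fd
```

Under these assumptions, `minimal_fd` returns 9, 19 and 37 for $q = 7$, 16 and 34, and `total_error_bound` evaluates to 0.421, 0.349 and 0.367 at those widths, which matches the entries of Table 2.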
3.3 Guard digit of scaled residual
Since the scaled residual $W[j]$ is held with only finite precision in the hardware implementation, we need to analyse how many guard digits $g$ are enough to prevent the rounding error of $W[j]$, $\varepsilon_w$, from affecting the correct selection of the digits $e_j$. Since $W[j]$ stays in the range $(-9.5, 9.5)$, we define the digit-width of $W[j]$ as $(q + g + 3)$ digits, consisting of a three-digit integer part and a $(q + g)$-digit fraction part.
The values of the logarithm term $-2.3 \times \log_{10}(1 + e_j 10^{-j})$ in (13) can be obtained by storing them in a look-up table. With an increasing number of iterations, however, the size of the table becomes prohibitively large. Therefore a method is needed that can reduce the table size and achieve a significant reduction in the overall hardware requirement. A Taylor series expansion of the logarithm function $\log_{10}(1 + x)$ is given in (46)

\[ \log_{10}(1 + x) = \left( x - \frac{x^{2}}{2} + \cdots \right) / \ln(10) \tag{46} \]
Table 2 Error analysis of DFP antilogarithm for DFP interchange formats
Format name | Decimal32 | Decimal64 | Decimal128
significand ($q$ digits) | 7 | 16 | 34
no. of iterations ($q + 1$) | 8 | 17 | 35
accuracy ($q$ digits) | 7 | 16 | 34
FD (digits) | 9 (a) | 19 (b) | 37 (b)
max. error $|\varepsilon_t|$ ($\times 10^{-q}$) | 0.421 | 0.349 | 0.367
(a) $(q + 2)$ digits
(b) $(q + 3)$ digits

After $h$ iterations, the values of $-2.3 \times \log_{10}(1 + e_j 10^{-j})$ do not need to be stored in the look-up table; the values $-2.3 \times e_j 10^{-j} / \ln(10)$ are used as an approximation instead.
In iterations $j = 1$ to $j = q + 1$, because $(q + g + 3)$-digit rounded values of $-2.3 \times \log_{10}(1 + e_j 10^{-j})$ and $-2.3 \times e_j 10^{-j} / \ln(10)$ are obtained from the look-up tables, a rounding error of $\pm 0.5 \times 10^{-q-g}$ is produced in each iteration. The maximum quantisation error, $\varepsilon_{wq}$, is

\[ |\varepsilon_{wq}| \le \sum_{j=1}^{q+1} 0.5 \times 10^{-q-g} \tag{47} \]

The value of $-2.3 \times \log_{10}(1 + e_j 10^{-j})$ is approximated by the value of $-2.3 \times e_j 10^{-j} / \ln(10)$ in iterations $j = h + 1$ to $j = q + 1$. According to the series expansion of the logarithmic function in (46), an approximation error, $\varepsilon_{wa}$, is therefore produced in each iteration

\[ \varepsilon_{wa} = 2.3 \times \sum_{j=h+1}^{q+1} \left\{ -\frac{(e_j 10^{-j})^{2}}{2} + \frac{(e_j 10^{-j})^{3}}{3} - \cdots \right\} / \ln(10) \tag{48} \]

Keeping only the term $(e_j 10^{-j})^{2} / (2 \ln(10))$ to analyse $\varepsilon_{wa}$

\[ \varepsilon_{wa} \approx 2.3 \times \sum_{j=h+1}^{q+1} \left\{ \frac{(e_j 10^{-j})^{2}}{2} \right\} / \ln(10) \tag{49} \]

Considering the worst case ($e_j = 9$ or $-9$), we obtain the maximum $\varepsilon_{wa}$

\[ |\varepsilon_{wa}| \le 4.01 \times 10^{-2h-1} \tag{50} \]

Therefore, according to (13), after the $(q+1)$th iteration, the truncation error of $W[j]$, $\varepsilon_w$, is obtained as

\[ |\varepsilon_w| \le 10^{q+1} \times (|\varepsilon_{wq}| + |\varepsilon_{wa}|) = (0.5 q + 0.5) \times 10^{1-g} + 4.01 \times 10^{q-2h} \tag{51} \]

Since the digit $e_j$ is selected by rounding the scaled residual estimate $\hat{W}[j]$ to its integer part in each iteration, $\varepsilon_w$ needs to satisfy the condition $|\varepsilon_w| < 1$ in order to guarantee the correct selection of the digits $e_j$. This condition is satisfied for the three DFP interchange formats when $h$ is equal to 4, 9 and 18 and $g$ is equal to 2, 2 and 3, respectively.
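The choice of $h$ and $g$ can be checked against (51) with a short search. The sketch below uses exact rational arithmetic; the function names are hypothetical, and the search order (smallest $h$ first, then the smallest $g$ for that $h$) is an assumption about how the reported values were chosen.

```python
from fractions import Fraction


def residual_error_bound(q: int, h: int, g: int) -> Fraction:
    """Right-hand side of (51): (0.5q + 0.5) * 10^(1-g) + 4.01 * 10^(q-2h)."""
    rounding_part = Fraction(q + 1, 2) * Fraction(10) ** (1 - g)
    approximation_part = Fraction(401, 100) * Fraction(10) ** (q - 2 * h)
    return rounding_part + approximation_part


def smallest_h_and_g(q: int):
    """Smallest h whose approximation term is below 1, then smallest g for it."""
    h = 1
    while Fraction(401, 100) * Fraction(10) ** (q - 2 * h) >= 1:
        h += 1
    g = 0
    while residual_error_bound(q, h, g) >= 1:
        g += 1
    return h, g
```

Under this search order, `smallest_h_and_g` returns (4, 2), (9, 2) and (18, 3) for $q = 7$, 16 and 34, in agreement with the values quoted above.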
4 Architecture
Fig. 1 shows the top-level architecture of the proposed DFP antilogarithmic converter. Since issues such as exception handling and packing to and unpacking from the IEEE 754-2008 DFP format are straightforward, we only detail the architecture for the computation of the sign bit ($R_{sign}$), the real exponent ($R_{exp}$) and the decimal significand ($R_{significand}$) of the DFP antilogarithmic result in this paper. To represent signed decimal intermediate values, all variables in the architecture are represented in the 10's complement number system in BCD encoding. To speed up the execution of the recurrences, all intermediate values in the data-path are represented using the redundant decimal carry-save representation, where ssss and c represent a 1-digit sum and a 1-bit carry, respectively. As a consequence of this representation, the delays of the addition and the multiple operation in the recurrence are independent of the computational precision.
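The reason the delay does not depend on the precision can be seen in a small behavioural model of a decimal carry-save addition. The sketch below is an arithmetic model only, not the gate-level 3:2 compressor of the design; `decimal_csa_3to2` and `to_digits` are hypothetical names, and dropping the carry out of the most significant digit is an assumption that follows from the 10's complement representation.

```python
def decimal_csa_3to2(sum_digits, carry_bits, addend_digits):
    """Add a carry-save operand (sum_digits, carry_bits) and a BCD operand.

    All three inputs hold one entry per digit position, most significant first.
    Each position is handled independently, so no carry ripples along the word.
    """
    totals = [s + c + a for s, c, a in zip(sum_digits, carry_bits, addend_digits)]
    new_sum = [t % 10 for t in totals]
    # The carry generated at position i is stored one position to the left;
    # the carry out of the most significant digit is dropped.
    new_carry = [t // 10 for t in totals[1:]] + [0]
    return new_sum, new_carry


def to_digits(text):
    """Helper: '792.64' -> [7, 9, 2, 6, 4] (the decimal point is ignored)."""
    return [int(ch) for ch in text if ch.isdigit()]
```

Feeding the model with the Table 3 values of $10W_s[1]$, $10W_c[1]$ and the first look-up table output reproduces the listed $W_s[2]$ and $W_c[2]$ digit for digit.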
Fig. 1 Architecture of the proposed DFP antilogarithmic converter
4.1 Data-path
The data-path of the proposed architecture is pipelined and retimed into four stages in order to minimise and balance the critical path delay. The initial processing stage (stage 1) obtains the initial digit $e_1$. The digit recurrence stage (stage 2) obtains the remaining digits $e_j$. The antilogarithm computation stage (stage 3) computes the $(q+3)$-digit intermediate decimal significand of the DFP antilogarithmic result. Finally, the 1-bit $R_{sign}$, the $(w+2)$-bit $R_{exp}$ and the $q$-digit $R_{significand}$ of the DFP antilogarithmic result are obtained in the final processing stage (stage 4). The cycle-based sequence of operations is summarised as follows:
Stage 1, in the first clock cycle (iteration $j = 1$): The 1-bit sign, the $(w+2)$-bit real exponent and the $q$-digit non-normalised decimal significand are obtained as input operands from the input registers. The $q$-digit decimal significand and the $(w+2)$-bit real exponent are processed in the range reduction logic to obtain the $(q+1)$-digit DXP operand $v'_{frac}$. Meanwhile, the value of $v_{int}$ is obtained in the $v_{int}$ generator and sent to stage 4 through a register. If the DFP operand is positive, its fraction part $v'_{frac}$ is adjusted to a negative fraction by $v'_{frac} - 1$ in a 10's complement converter and then becomes the input of the DXP antilogarithmic converter. Meanwhile, its corresponding integer part, $v_{int}$, is adjusted to $v_{int} + 1$ and sent to stage 4. The digit $e_1$ is obtained from look-up table I based on the two MSDs of $v'_{frac}$. $v'_{frac}$ is multiplied by the 2-digit constant 2.3 in a multiple logic (Mult1) to obtain the $(q+3)$-digit value $m = 2.3 \times v'_{frac}$ in the carry-save representation ($m_s$, $m_c$). The output $m$ of Mult1 is shifted two digits to the left to obtain $10W[1]$ ($W[1] = 10 \times 2.3 \times v'_{frac}$), and the corresponding value of $-230 \times \log_{10}(1 + e_1 10^{-1})$ is obtained from look-up table II. Then, the values of $10W[1]$ and $-230 \times \log_{10}(1 + e_1 10^{-1})$ are sent to stage 2 through registers.
Stage 2, from the second to the $(q+1)$th clock cycles (iterations $j = 2$ to $j = q + 1$): In the second clock cycle, the residual $W[j]$ is obtained by adding $10W[1]$ and $-230 \times \log_{10}(1 + e_1 10^{-1})$ in a decimal 3:2 CSA compressor. Then, the digit $e_j$ is obtained by rounding the 3-digit $\hat{W}[j]$ in a rounding $e_j$ logic. This can be expressed by

\[ (W_s[j], W_c[j]) = 10W[1] - 230 \times \log_{10}(1 + e_1 10^{-1}) \]
\[ e_j = \mathrm{round}(\hat{W}[j]) \]

$W[j]$ in the carry-save representation is shifted one digit to the left to obtain $10W[j]$, which is sent back to Mux2 for the next iteration. From the $(j = 2)$th to the $(j = h)$th iteration, the value of $-2.3 \times 10^{j+1} \times \log_{10}(1 + e_j 10^{-j})$ is obtained from look-up table II and sent back to Mux1 for the next iteration. This can be expressed by

\[ (W_s[j+1], W_c[j+1]) = 10W[j] - 2.3 \times 10^{j+1} \times \log_{10}(1 + e_j 10^{-j}) \]
\[ e_{j+1} = \mathrm{round}(\hat{W}[j+1]) \]

From the $(j = h + 1)$th to the $(j = q + 1)$th iteration, the value of $-2.3 \times 10 \times e_j / \ln(10)$ is obtained from look-up table III and sent back to Mux1. This can be expressed by

\[ (W_s[j+1], W_c[j+1]) = 10W[j] - 2.3 \times 10 \times e_j / \ln(10) \]
\[ e_{j+1} = \mathrm{round}(\hat{W}[j+1]) \]

After the $(q+1)$th clock cycle, all the digits $e_j$ have been obtained by selection by rounding.
Stage 3, from the second to the $(q+2)$th clock cycles (iterations $j = 1$ to $j = q + 1$): In the second clock cycle, the 2-digit $e_1$ is concatenated with 9 and zeros, shifted one digit to the right in a barrel shifter to obtain $e_1 10^{-1}$, and then selected by Mux4. Meanwhile, $E[1] = 1$ is selected by Mux5. The decimal significand result $E[2]$ of the first iteration is obtained in the $(q+3)$-digit decimal 4:2 CSA compressor. This can be expressed by

\[ (E_s[2], E_c[2]) = 1 + e_1 10^{-1} \]

From the third to the $(q+2)$th clock cycles, the intermediate value $e_j E[j]$ output by a multiple logic (Mult2) is shifted $j$ digits to the right in a barrel shifter to obtain $e_j E[j] 10^{-j}$. The value of $E[j]$ is selected by Mux5 for the computation of $E[j+1]$ in the next iteration. This can be expressed by

\[ ((e_j E[j] 10^{-j})_s, (e_j E[j] 10^{-j})_c) = e_j E[j] 10^{-j} \]
\[ (E_s[j+1], E_c[j+1]) = e_j E[j] 10^{-j} + E[j] \]

After the $(q+2)$th clock cycle, the $(q+3)$-digit decimal significand of the DFP antilogarithm result is obtained.
Stage 4, in the $(q+3)$th clock cycle: The sum and carry of the $(q+1)$ MSDs of the fractional part of $E_s[j]$ and $E_c[j]$, $E'_s$ and $E'_c$, are added together in a $q$-digit decimal compound adder to obtain $\hat{E}$. At the same time, $\hat{E}$ is rounded to the faithful decimal significand $R_{significand}$ in a rounding logic, based on the increment inc generated at the rounding position. Since we consider the roundTiesToEven mode in this design, the rounding logic generates the increment inc based on

\[ inc = \begin{cases} 1 & \text{if } (r_d > 5 \text{ or } (r_d = 5 \text{ and } \mathrm{LSB}(L) = 1)) \\ 0 & \text{if } (r_d < 5 \text{ or } (r_d = 5 \text{ and } \mathrm{LSB}(L) = 0)) \end{cases} \]

where $r_d$ represents the LSD of $E[j]'$. The $(w+2)$-bit exponent $R_{exp}$ and the 1-bit sign $R_{sign}$ are obtained in a sign & exponent generator.
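The increment rule can be modelled on a string of decimal digits. In this sketch the function name `round_ties_to_even` is hypothetical, and $L$ is assumed to be the least significant digit that is kept (the digit immediately to the left of $r_d$), which the text does not state explicitly.

```python
def round_ties_to_even(digits: str) -> str:
    """Round a (q + 1)-digit string to q digits using the inc rule above."""
    kept, r_d = digits[:-1], int(digits[-1])
    lsb_is_one = int(kept[-1]) % 2 == 1        # LSB of the last kept digit L
    inc = 1 if (r_d > 5 or (r_d == 5 and lsb_is_one)) else 0
    rounded = int(kept) + inc
    # A carry out of the top digit (e.g. 9999 -> 10000) yields q + 1 digits;
    # the renormalisation that would follow is outside this sketch.
    return str(rounded).zfill(len(kept))
```

For example, `round_ties_to_even("12345")` keeps `1234` because the last kept digit is even, whereas `round_ties_to_even("12355")` increments to `1236`.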
Table 3 shows some iterations of a 64-bit DFP
antilogarithm operation executed in the proposed architecture.
4.2 Hardware implementation
The details of the hardware implementation for each stage of the proposed DFP antilogarithmic converter are presented in this section. In the rest of this paper, the symbols $\oplus$, $\wedge$, $\vee$ and & represent logical XOR, logical AND, logical OR and concatenation, respectively. The symbol $(A)^{y}_{x}$ refers to the $y$th bit in the $x$th digit position of a decimal number $A$, where the least significant bit (LSB) and the least significant digit (LSD) have index 0. For example, $(W[j])^{3}_{2}$ is the third bit of the second digit in $W[j]$. The symbol $\overline{A}$ refers to the logical NOT of a number $A$.
4.2.1 Initial processing stage: Fig. 2 shows the details of the hardware implementation of the initial processing stage (stage 1).
In stage 1: The range reduction logic consists of a decimal bi-directional barrel shifter, a 10's complement converter, a 9's complement converter, a trailing-zeros detector and a 3-to-1 multiplexer. The $v_{int}$ generator consists of a leading-zero counter (LZC), a decimal barrel shifter, an encoder, decimal add-one logic and a 3-to-1 multiplexer. The LZC is implemented based on [24], whereas the decimal barrel shifter is implemented with $\log_2(q)$ levels of multiplexers. The trailing-zeros detector is implemented by a prefix tree based on [24] to generate the control signal sel for the 3-to-1 multiplexer. Note that overflow and underflow of the DFP antilogarithm operation can be detected by the $v_{int}$ generator, and the implementation of their detection is straightforward.
Table 3 Example of a 64-bit DFP antilogarithm operation

Operand and initial values: $v = (-1)^1 \times 8576308882936892 \times 10^{-16}$, $R = 10^{v_{\text{int}} + v_{\text{frac}}} = 10^{0} \times 10^{-0.8576308882936892}$, $v_{\text{int}} = 0$, $v_{\text{frac}} = -0.85763088829868920$, $e_1 = -8.6$, $m = 2.3 \times v_{\text{frac}}$ ($m_s = 7926448956923414740$, $m_c = 0101000000001100100$), $h = 9$, $g = 2$

| Clock cycle | Residual term | Estimate of $W[j]$ | $W[j]$ | Antilogarithm term | $E[j]$ |
|---|---|---|---|---|---|
| 1st (first iteration) | $W_s[1]$ | | 979.264489569234147400 | | |
| | $W_c[1]$ | | 001.010000000011001000 | | |
| | $10W_s[1]$ | | 792.644895692341474000 | | |
| | $10W_c[1]$ | | 010.100000000110010000 | $E_s[1]$ | 1.000000000000000000 |
| | $-2.3 \log_{10}(1 + e_1 10^{-1}) \times 10^2$ | | +196.390551794005254100 | $E_c[1]$ | 0.000000000000000000 |
| | $W_s[2]$ | 980 | 898.034346386456638100 | $(E[1]e_1 10^{-1})_s$ | 9.140000000000000000 |
| | $W_c[2]$ | 011 | 101.101101100000100000 | $(E[1]e_1 10^{-1})_c$ | +0.000000000000000000 |
| | Estimate of $W[2]$ | $-08$, $e_2 = -1$ | | $E_s[2]$ | 0.140000000000000000 |
| 2nd (second iteration) | $10W_s[2]$ | | 980.343463864566381000 | $E_c[2]$ | 0.000000000000000000 |
| | $10W_c[2]$ | | 011.011011000001000000 | | |
| | $-2.3 \log_{10}(1 + e_2 10^{-2}) \times 10^3$ | | +010.039052425635195000 | | |
| | $W_s[3]$ | 013 | 001.383426289192476000 | $(E[2]e_2 10^{-2})_s$ | 9.887488888888888888 |
| | $W_c[3]$ | 000 | 000.010101001010100000 | $(E[2]e_2 10^{-2})_c$ | +0.111111111111111111 |
| | Estimate of $W[3]$ | $+13$, $e_3 = +1$ | | $E_s[3]$ | 9.038599999999999999 |
| 3rd (third iteration) | $10W_s[3]$ | | 013.834262891924760000 | $E_c[3]$ | 1.100000000000000000 |
| | $10W_c[3]$ | | 000.101010010101000000 | | |
| | $-2.3 \log_{10}(1 + e_3 10^{-3}) \times 10^4$ | | +990.026217975671270000 | | |
| ... | | | | | |
| | $W_s[10]$ | 021 | 002.199071104000000000 | $(E[9]e_9 10^{-9})_s$ | 9.999999984725084930 |
| | $W_c[10]$ | 010 | 001.001011000000000000 | $(E[9]e_9 10^{-9})_c$ | +0.000000011111110111 |
| | Estimate of $W[10]$ | $+31$, $e_{10} = +3$ | | $E_s[10]$ | 0.028782483451741196 |
| 10th (10th iteration) | $10W_s[10]$ | | 021.990711040000000000 | $E_c[10]$ | 0.110011011010010000 |
| | $10W_c[10]$ | | 010.010110000000000000 | | |
| | $-2.3 \times 10 \times e_{10}/\ln(10)$ | | +970.033680748675623892 | | |
| | $W_s[11]$ | 019 | 001.933401788675623892 | | |
| | $W_c[11]$ | 001 | 000.101100000000000000 | | |
| | Estimate of $W[11]$ | $+20$, $e_{11} = +2$ | | | |
| 11th (11th iteration) | $10W_s[11]$ | | 019.334017886756238920 | | |
| | $10W_c[11]$ | | 001.011000000000000000 | | |
| | $-2.3 \times 10 \times e_{11}/\ln(10)$ | | +980.022453832450415928 | | |
| ... | | | | | |
| 17th (17th iteration) | $W_s[17]$ | 036 | 993.644904860602385560 | $E_s[17]$ | 9.027693494806399868 |
| | $W_c[17]$ | 011 | 111.110100110111100000 | $E_c[17]$ | 1.111110000100001100 |
| | Estimate of $W[17]$ | $+47$, $e_{17} = +5$ | | | |
| 18th | | | | $(E[17]e_{17} 10^{-17})_s$ | 9.999999999999999906 |
| | | | | $(E[17]e_{17} 10^{-17})_c$ | +0.000000000000000100 |
| | | | | $E_s[18]$ | 0.138682384895290964 |
| | | | | $E_c[18]$ | 0.000111110011110010 |

In the 19th clock cycle: $E_s = .13868238489529096$ (whose last digit is the rounding digit $r_d$) and $E_c = .00011111001111001$ are added in the compound adder with the increment inc = 1 to give $E = .1387934949064010$. With $v_{\text{int}} = 0$, the result fields are $R_{\text{sign}} = 0$, $R_{\text{exp}} = v_{\text{int}} - 16 = -16$ ("111110000") and $R_{\text{significand}} = 1387934949064010$.
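The iterations listed in Table 3 can be followed with a behavioural Python sketch of the recurrence. This is a minimal model under stated assumptions, not the carry-save BCD data-path: it uses binary floating-point numbers, takes $W[1] = 10m = 23 v_{\text{frac}}$ as read from the first rows of Table 3, selects each digit by round-half-up of the full residual instead of a 3-digit estimate, and takes $e_1$ as an input because Table 1 lies outside this section. The function name `antilog_digit_recurrence` is hypothetical.

```python
import math


def antilog_digit_recurrence(v_frac, e1, iterations=17, h=9):
    """Behavioural model: returns (E[iterations + 1], [e_1, ..., e_iterations])."""
    ln10 = math.log(10.0)
    w = 23.0 * v_frac          # W[1] = 10 * m, with m = 2.3 * v_frac (assumed from Table 3)
    e = e1                     # first digit, supplied by the caller
    big_e = 1.0                # E[1] = 1
    digits = []
    for j in range(1, iterations + 1):
        digits.append(e)
        x = e * 10.0 ** (-j)
        if j <= h:
            log_term = math.log1p(x) / ln10   # log10(1 + e_j 10^-j), look-up table I/II
        else:
            log_term = x / ln10               # linear approximation, look-up table III
        # W[j+1] = 10 W[j] - 2.3 * 10^(j+1) * log10(1 + e_j 10^-j)
        w = 10.0 * w - 2.3 * 10.0 ** (j + 1) * log_term
        # E[j+1] = E[j] + e_j E[j] 10^-j
        big_e = big_e + big_e * x
        # selection by rounding (round half up, a simplification)
        e = math.floor(w + 0.5)
    return big_e, digits


v_frac = -0.8576308882936892
result, digits = antilog_digit_recurrence(v_frac, e1=-8.6)
print(digits[:3])                 # first selected digits
print(result, 10.0 ** v_frac)     # recurrence result next to the direct evaluation
```

For the operand of Table 3, the sketch selects $e_2 = -1$ and $e_3 = +1$, as in the table, and the final value agrees with a direct evaluation of $10^{v_{\text{frac}}}$ to within double-precision accuracy.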
The $e_1$ generator is implemented straightforwardly according to Table 1. The corresponding $(q+g+3)$-digit $-230 \log_{10}(1 + e_1 10^{-1})$ is obtained from look-up table I. Since $g$ is equal to 2, 2 and 3 for the three formats, the size of look-up table I is $(2^5 \times 48)$-bit, $(2^5 \times 84)$-bit and $(2^5 \times 160)$-bit for Decimal32, Decimal64 and Decimal128, respectively.
The multiple logic (Mult1) is applied to compute $m = 2.3 \times v_{\text{frac}}$, where $v_{\text{frac}}$ is a $(q+1)$-digit negative value in the range of $-1 < m \le 0$. Mult1 is implemented based on the partial product generation logic presented in [25]. The multiples are formed by adding two of an initial multiple set: $-3m$ (achieved by adding $-5m$ and $2m$ in a 3:2 CSA counter) and $-20m$ (achieved by shifting 1-digit to the right of $-2m$). Both $2m$ and $5m$ can be generated with only a few logic delays. The Boolean equations for generating the double and quintuple of a BCD number are presented in [26]. To decrease the delay of the addition, two levels of decimal CSA adders are implemented to develop the multiples $m_s$ and $m_c$. The Boolean equation for computing the 1-digit decimal addition of BCD numbers is presented in [26]. The signals cin1 and cin2 are generated to supplement the LSD owing to the 9's complement conversion ($-2m$ and $-5m$). The signals cin1 and cin2 are added in the LSD and the second LSD of the first level of CSA adders, respectively.
4.2.2 Digit recurrence stage: Fig. 3 shows the details of the hardware implementation of the digit recurrence stage (stage 2).
In stage 2: The 3:2 decimal CSA compressor, applied to obtain the residual ($W_s[j]$, $W_c[j]$), is implemented by one level of $(q+g+3)$-digit 3:2 CSA counter. Then, the 1-digit sign ($S_s$, $S_c$), the 1-digit integer part ($I_s$, $I_c$) and the 1-digit fraction part ($f_s^{\text{MSD}}$, $f_c^{\text{MSB}}$) of the residual are sent to the rounding $e_j$ logic, which selects the digit $e_j$ by rounding the residual ($W_s[j]$, $W_c[j]$). The sign of the digit $e_j$ is obtained by the sign detector block, which is implemented based on the equation

$$\text{sign} = (S_s^0 \oplus S_c) \oplus (I_s^3 \wedge I_s^0 \wedge I_c)$$

The 1-digit fraction ($f_s^{\text{MSD}}$, $f_c^{\text{MSB}}$) and the value 5 are added together in the 1-digit decimal full adder to generate the signal carry, which determines the rounding operation. The signals carry and sign are sent to the selection generator to obtain the control signal sel of a 4-to-1 multiplexer. The value of $|e_j|$ is obtained in four parallel full adders by adding the values 0, 1, 6 and 5 to the signals $f_s^{\text{MSD}}$ and $f_c^{\text{MSB}}$, respectively

$$|e_j| = \begin{cases} I_s + I_c + 0 & \text{if sign} = 0 \wedge \text{carry} = 0 \\ I_s + I_c + 1 & \text{if sign} = 0 \wedge \text{carry} = 1 \\ I_s + I_c + 6 & \text{if sign} = 1 \wedge \text{carry} = 0 \\ I_s + I_c + 5 & \text{if sign} = 1 \wedge \text{carry} = 1 \end{cases}$$

Thus, the digit $e_j$ is obtained by concatenating the 1-bit sign with the 1-digit $|e_j|$.
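The sign detector can be sketched in Python as follows. The sketch assumes that the two operators missing from the sign equation are logical-XORs, that $S_c$ and $I_c$ are single carry bits, and that the sign and integer digits are the tens and units digits of the residual estimates listed in Table 3. The function names are hypothetical.

```python
def bit(digit: int, k: int) -> int:
    """Bit k (0 = LSB) of a BCD digit."""
    return (digit >> k) & 1


def residual_sign(s_s: int, s_c: int, i_s: int, i_c: int) -> int:
    """Sign of the carry-save residual estimate (1 = negative).

    A carry leaves the integer digit only when I_s = 9 (bits 3 and 0 set)
    and I_c = 1; it then toggles the parity of the sign digit.
    """
    carry_into_sign = bit(i_s, 3) & bit(i_s, 0) & i_c
    return (bit(s_s, 0) ^ s_c) ^ carry_into_sign


# Estimates from Table 3: W[2] uses digits 98.0 + 01.1, W[3] uses 01.3 + 00.0.
print(residual_sign(9, 0, 8, 1))  # W[2] is negative
print(residual_sign(0, 0, 1, 0))  # W[3] is positive
```

Under these assumptions the sketch reproduces the signs of the estimates of $W[2]$, $W[3]$ and $W[10]$ in Table 3.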
Fig. 2 Details of hardware implementation of Stage 1 in Fig. 1
Fig. 3 Details of hardware implementation of Stage 2 in Fig. 1
The look-up table II stores all the $(q+g+3)$-digit values of $-2.3 \times 10^{j+1} \log_{10}(1 + e_j 10^{-j})$, where $j$ is in the range $1 \le j \le h$. Since $e_j$ is in the range $-9 \le e_j \le 9$, there are 18 different values (excluding the value for $e_j = 0$) that need to be stored in the look-up table for each iteration. Since $h$ is equal to 4, 9 and 18 and $g$ is equal to 2, 2 and 3 for the three formats, the size of look-up table II is $(2^6 \times 48)$-bit, $(2^8 \times 84)$-bit and $(2^9 \times 160)$-bit for Decimal32, Decimal64 and Decimal128, respectively. In order to reduce the size and delay of look-up table II, the $(q+g+3)$-digit values of $-2.3 \times 10^{j+1} \log_{10}(1 + e_j 10^{-j})$ can be efficiently distributed over multiple tables. For Decimal64, shown in Fig. 3, the single look-up table II is split into two parts: the first part (TabII 1) stores all the values of $-2.3 \times 10^{j+1} \log_{10}(1 + e_j 10^{-j})$ when $2 \le j \le 9$ and $e_j = \pm 1$; the second part (TabII 2) stores the values when $2 \le j \le 9$, and $2 \le e_j \le 9$ and $-9 \le e_j \le -2$. The sizes of TabII 1 and TabII 2 are $(2^4 \times 84)$ and $(2^7 \times 84)$ bits, respectively. Thus, the optimised size of look-up table II is reduced from 2.64 to 1.48 kB. Look-up table III stores 19 values of the $(q+g+3)$-digit $-2.3 \times e_j/\ln(10) \times 10$, thus it is implemented by a $2^5 \times 84$-bit look-up table. The total optimised size of the look-up tables is therefore about 2.14 kB for Decimal64. The implementations of the address generators that address look-up table II and look-up table III based on the values of $j$ and $e_j$ are straightforward.
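The table dimensions quoted above can be checked with a short Python sketch. It assumes 4 bits per BCD digit, the IEEE 754-2008 precisions $q$ = 7, 16 and 34, and that look-up table II covers $j = 2, \ldots, h$ because $j = 1$ is served by look-up table I; the last point is an inference from the quoted sizes rather than a statement in the text. All names are hypothetical.

```python
import math

# q: precision in digits (IEEE 754-2008); g and h as quoted in the text.
FORMATS = {
    "Decimal32": {"q": 7, "g": 2, "h": 4},
    "Decimal64": {"q": 16, "g": 2, "h": 9},
    "Decimal128": {"q": 34, "g": 3, "h": 18},
}


def word_bits(q: int, g: int) -> int:
    """Width of one (q + g + 3)-digit BCD entry in bits."""
    return 4 * (q + g + 3)


def lut2_address_bits(h: int) -> int:
    """Address width for 18 non-zero digits at each of j = 2..h (assumed range)."""
    entries = 18 * (h - 1)
    return math.ceil(math.log2(entries))


for name, p in FORMATS.items():
    print(name, word_bits(p["q"], p["g"]), lut2_address_bits(p["h"]))

# Split Decimal64 table: TabII 1 (2^4 words) plus TabII 2 (2^7 words) of 84 bits.
split_bits = (2**4 + 2**7) * 84
print(round(split_bits / 8 / 1024, 2), "kB")
```

The sketch gives 48-, 84- and 160-bit words with 6, 8 and 9 address bits, and about 1.48 kB for the split Decimal64 table.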
4.2.3 Antilogarithm computation stage: Fig. 4 shows the details of the hardware implementation of the antilogarithm computation stage (stage 3).
In stage 3: The multiple logic (Mult2) is applied to compute $E_s[j]e_j + E_c[j]e_j$, where $e_j$ is in the range $-9 \le e_j \le 9$. The multiple $E_s[j]e_j$ is formed by adding two members of an initial multiple set $\{m, -m, 2m, -2m, 5m, -5m, 10m, -10m\}$, selected by sel1 and sel2, which are generated by a recoder. The $4m$ logic can be implemented by connecting two $2m$ logics in series. The signals cin1 and cin2 are generated by a recoder to supplement the LSD owing to the 10's complement conversion. The signals cin1 and cin2 are added in the LSD of the two levels of CSA adders, respectively. Since each digit of $E_c[j]$ is only zero or one, the signal $E_c[j]|e_j|$ can be obtained in a carry extend block, which can be implemented by a series of logical-AND gates. If the digit $e_j < 0$ ($\text{sign}(e_j) = 1$), $E_c[j]e_j$ is obtained by the 9's complement conversion of $E_c[j]|e_j|$, and the signal $\text{sign}(e_j)$ is then supplemented in the LSD of the signal $(e_j E[j])_c$; otherwise, $E_c[j]e_j$ is directly obtained from the signal $E_c[j]|e_j|$.

$$(E_c[j]e_j)_i^{3:0} = \begin{cases} E_c[j]_i \wedge e_j^{3:0} & \text{if } \text{sign}(e_j) = 0 \\ \text{9's com}(E_c[j]_i \wedge e_j^{3:0}) & \text{if } \text{sign}(e_j) = 1 \end{cases}$$

Thus, $E[j]e_j$ ($(e_j E[j])_s$, $(e_j E[j])_c$) is obtained by adding $E_s[j]e_j$ and $E_c[j]e_j$ in a decimal CSA adder. Finally, $(E[j]e_j 10^{-j})_s$ and $(E[j]e_j 10^{-j})_c$ are obtained in a decimal barrel shifter. The 4:2 decimal CSA compressor, applied to add $E[j]$ ($E_s[j]$, $E_c[j]$) and $E[j]e_j 10^{-j}$ ($(E[j]e_j 10^{-j})_s$, $(E[j]e_j 10^{-j})_c$) together, is implemented by two levels of $(q+3)$-digit 3:2 CSA counters.
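The carry extend block can be illustrated with a small Python sketch that works on lists of decimal digits (MSD first). It assumes that each digit of $E_c[j]$ is 0 or 1, so the logical-AND with the 4-bit magnitude $|e_j|$ yields either 0 or $|e_j|$; the names `carry_extend` and `digits_to_int` are hypothetical.

```python
def carry_extend(ec_digits, e_j):
    """Return (digits of E_c[j] * e_j, sign bit to be added at the LSD)."""
    magnitude = abs(e_j)
    negative = e_j < 0
    out = []
    for c in ec_digits:
        d = magnitude if c else 0             # E_c[j]_i AND |e_j|
        out.append(9 - d if negative else d)  # 9's complement for negative e_j
    return out, int(negative)


def digits_to_int(digits):
    """Interpret an MSD-first digit list as a non-negative integer."""
    value = 0
    for d in digits:
        value = 10 * value + d
    return value


digits, cin = carry_extend([1, 1, 0], -3)
# 9's complement plus the sign bit at the LSD gives the 10's complement of 330.
print(digits, cin, digits_to_int(digits) + cin)
```

For a 3-digit carry vector 110 and $e_j = -3$, the block outputs 669 and the supplementary sign bit 1, which together represent $1000 - 330 = 670$, the 10's complement of the product magnitude.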
4.2.4 Final processing stage: Fig. 5 shows the details of the hardware implementation for the final processing stage (stage 4).
Fig. 4 Details of hardware implementation of Stage 3 in Fig. 1
In stage 4: The decimal compound adder is implemented based on the conditional speculative method [27]. A prefix tree is implemented based on the binary Kogge-Stone network [28]. The additions $E_{s,i} + E_{c,i}$ and $E_{s,i} + E_{c,i} + 1$ can be implemented using three binary half adders and a binary full adder connected as a ripple carry chain. The logic for adding the value 6 is used to correct $E_j$ to the proper BCD encoding, and can be implemented using two binary half adders and two binary full adders. Since the value of $R_{\text{significand}}$ may be rounded up to the value of 1 ($\text{inc}_{q+1} = 1$), when this happens $R_{\text{significand}}$ is directly set to one.
A BCD to binary converter, which is used to convert the decimal value of $v_{\text{int}}$ to the $(w+2)$-bit binary format, is implemented based on [29]. The $(w+2)$-bit exponent $R_{\text{exp}}$ is selected based on the value of $\text{inc}_{q+1}$ in a 2-to-1 multiplexer

$$R_{\text{exp}} = \begin{cases} v_{\text{int}} - 16 & \text{if } \text{inc}_{q+1} = 0 \\ v_{\text{int}} - 15 & \text{if } \text{inc}_{q+1} = 1 \end{cases}$$

Since the DFP antilogarithm result should be positive, the sign bit ($R_{\text{sign}}$) is zero.
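The exponent selection can be sketched in Python as follows. The sketch assumes that $\text{inc}_{q+1}$ is set when rounding carries the $q$-digit significand up to $10^q$, that "set to one" means the $q$-digit pattern $10^{q-1}$ (so that the represented value is unchanged), and that the constants 16 and 15 quoted for Decimal64 generalise to $q$ and $q-1$. These are assumptions for illustration; the function name `finalize` is hypothetical.

```python
from fractions import Fraction


def finalize(rounded_significand: int, v_int: int, q: int = 16):
    """Return (R_sign, R_exp, R_significand) for a rounded q-digit significand."""
    overflow = rounded_significand == 10**q      # models inc_{q+1} = 1
    if overflow:
        return 0, v_int - (q - 1), 10 ** (q - 1)
    return 0, v_int - q, rounded_significand


def value(sign: int, exp: int, significand: int) -> Fraction:
    """Exact value of (-1)^sign * significand * 10^exp."""
    return (-1) ** sign * significand * Fraction(10) ** exp


print(finalize(1387934949064010, 0))   # the Table 3 example
print(finalize(10**16, 0))             # rounding overflow case
```

For the example of Table 3 this gives a zero sign, an exponent of $-16$ and the significand 1387934949064010.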
5 Implementation and comparisons
The proposed improved DFP antilogarithmic converter, which can compute operands in the Decimal32, Decimal64 and Decimal128 formats, was modelled in VHDL and then simulated using ModelSim. A comprehensive testbench, which includes special test cases (NaN, infinite, subnormal or zero operands), corner test cases and valid random DFP operands, was used to verify the correctness of the design. The proposed architectures were synthesised using Synopsys Design Compiler with the STM 90-nm CMOS standard cell library [30] under the typical condition ($V_{\text{DD}}$ = 1.2 V core voltage and 25 °C operating temperature). The clock, input signals and output signals are assumed to be ideal. Inputs and outputs of the proposed design are registered and the design is optimised for delay.
The delay model is based on the logical effort method [31], which expresses the delay of the proposed architecture in a technology-independent unit, FO4 (the delay of an inverter of the minimum drive strength (1×) with a fanout of four 1× inverters). To measure the total hardware cost in terms of the number of gates, the area of the proposed architecture is estimated as the number of equivalent 1× two-input NAND gates (NAND2). Note that 1 FO4 ≈ 45 ps and 1 NAND2 ≈ 4.4 μm² in the STM 90-nm CMOS standard cell library under the typical condition.
Table 4 summarises the delay and the area (without the look-up tables) estimated using the area and delay evaluation model for the Decimal64 antilogarithmic converter. The synthesis report shows that the power consumption of the proposed DFP and DXP architectures is 26.1 and 16.9 mW, respectively. The worst path delay of each stage is highlighted in the corresponding figure by a dashed thick line. The evaluation results show that the critical path of the proposed architecture is located in stage 3 (highlighted in Fig. 4), and the details of the critical path in the Decimal64 implementation are reported in Table 5.
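The unit conversions used in this section can be reproduced with a few lines of Python. The sketch takes the conversion factors (45 ps per FO4, 4.4 μm² per NAND2) and the block delays of Table 5 from the text; the variable and function names are hypothetical.

```python
FO4_PS = 45.0        # delay of one FO4 in ps (STM 90-nm, typical condition)
NAND2_UM2 = 4.4      # area of one NAND2 in square micrometres


def fo4_to_ns(fo4: float) -> float:
    """Convert a delay in FO4 units to nanoseconds."""
    return fo4 * FO4_PS / 1000.0


def nand2_to_um2(gates: int) -> float:
    """Convert an equivalent NAND2 count to square micrometres."""
    return gates * NAND2_UM2


# Blocks on the critical path (Table 5), in ns.
critical_path_ns = {
    "Reg": 0.07, "Mult2": 0.53, "Mux": 0.06,
    "Shift": 0.23, "CSA 4:2": 0.29, "setup": 0.08,
}

print(round(fo4_to_ns(28.0), 2))                  # critical path of 28.0 FO4 in ns
print(round(sum(critical_path_ns.values()), 2))   # sum of the Table 5 blocks in ns
print(round(nand2_to_um2(29325), 1))              # total area in square micrometres
```

Both the 28.0 FO4 figure and the sum of the Table 5 entries give 1.26 ns.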
The results of the proposed design are also compared with those of our previous design reported in [18]. The area and critical path delay of the previous 16-digit DXP antilogarithmic converter are about 140 732 μm² and 8.25 ns, respectively (synthesised with the TSMC 0.18 μm standard cell library in the typical condition, for which 1 FO4 ≈ 75 ps and 1 NAND2 ≈ 10.0 μm²). The comparison results reported in Table 6 show that the improved DXP antilogarithmic converter based on the redundant data-path is about 3.91 times faster than the previous design based on the non-redundant data-path in terms of latency, at the expense of 1.85 times more area.
Fig. 5 Details of hardware implementation of Stage 4 in Fig. 1
Table 5 Details of critical path of the Decimal64 antilogarithmic converter

| Block in the critical path | Reg | Mult2 | Mux | Shift | CSA 4:2 | setup | Total |
|---|---|---|---|---|---|---|---|
| Delay (ns) | 0.07 | 0.53 | 0.06 | 0.23 | 0.29 | 0.08 | 1.26 |

Table 4 Delay and area of Decimal64 antilogarithmic converter

| Stage | Worst delay (FO4) | Area (NAND2) |
|---|---|---|
| initial processing stage (Fig. 2) | 25.3 | 7323 |
| digit recurrence stage (Fig. 3) | 23.6 | 4667 |
| antilog computation stage (Fig. 4) | 28.0 | 15 485 |
| final processing stage (Fig. 5) | 18.6 | 1178 |
| top-level control logic (FSM^a) | 3.5 | 672 |
| total | 28.0^b | 29 325 |

^a FSM, finite-state machine
^b Critical path delay
With respect to the existing works based on the CORDIC algorithm in [19, 20], it is quite difficult to compare the hardware performance of two different algorithms. In [19], only the number of clock cycles and the critical path delay are presented, which are 200 and 13 FO4 (taking the Power 6 processor as a reference), respectively. The comparison results reported in Table 6 show that the digit-recurrence approach proposed in this work is 4.89 times faster than the unit [19] based on the CORDIC approach in terms of latency. Compared with the design [20], which is an improved CORDIC fixed-point unit based on [19], the proposed architecture is 2.40 times faster and needs 0.48 times the memory, at the expense of 1.39 times more area.
For further analysis, we compare the performance of the proposed architecture with the software approach reported in [10]. The software DFP transcendental function computation library is compiled by the Intel C++ Compiler (IA32 version 11.1) [32]. It takes about 1060 clock cycles to compute a Decimal64 exponential result on an Intel Core(TM) 2 Quad microprocessor running at 2.66 GHz. The comparison results reported in Table 6 show that the proposed hardware implementation is about 45.8 times faster than the software implementation.
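The latency figures behind these ratios follow from the cycle times and cycle counts listed in Table 6, as the short Python sketch below shows. Latency is taken as cycle time multiplied by the number of cycles; the ratio for [20] is taken against the 16-digit DXP version of the proposed design, which is the pairing that reproduces the quoted 2.40. The dictionary and function names are hypothetical.

```python
# (cycle time in FO4, number of cycles), copied from Table 6.
DESIGNS = {
    "proposed": (28.0, 19),
    "proposed DXP": (28.0, 18),
    "previous [18]": (110.0, 18),
    "CORDIC [20]": (34.62, 35),
    "CORDIC [19]": (13.0, 200),
    "software [10]": (23.0, 1060),
}


def latency_fo4(name: str) -> float:
    """Latency in FO4: cycle time multiplied by the number of cycles."""
    cycle_time, cycles = DESIGNS[name]
    return cycle_time * cycles


def speedup(slow: str, fast: str) -> float:
    """How many times lower the latency of `fast` is than that of `slow`."""
    return latency_fo4(slow) / latency_fo4(fast)


print(round(speedup("CORDIC [19]", "proposed"), 2))
print(round(speedup("CORDIC [20]", "proposed DXP"), 2))
print(round(speedup("software [10]", "proposed"), 1))
```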
6 Conclusions
In this work, we presented a DFP antilogarithmic converter that is based on the digit-recurrence algorithm with selection by rounding. We developed the radix-10 algorithm, improved the architecture and implemented it with the STM 90-nm CMOS standard cell library. The implementation results show that the improved architecture is 3.91 times faster than our previous design [18] in terms of latency. To provide a reference for floating-point-unit designers who are considering a fast radix-10 implementation, we compared the proposed architecture with recent high-performance implementations based on the decimal CORDIC algorithm [19, 20]. Although a comparison between two different algorithms depends on many parameters, the design presented in this paper shows a latency 4.89 and 2.40 times lower than that of the units based on the CORDIC algorithm. In addition, compared with the software DFP transcendental function computation library [10], the proposed hardware implementation is about 45.8 times faster than the software implementation.
7 Acknowledgments
This work was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC). The authors would like to thank the anonymous reviewers for their valuable comments.
8 References
1 Cowlishaw, M.F.: Decimal floating-point: algorism for computers. 16th IEEE Symp. on Computer Arithmetic (ARITH16), 2003, pp. 104-111
2 IEEE Working Group of the Microprocessor Standards Subcommittee: IEEE 754-2008 Standard for Floating-Point Arithmetic (August 2008)
3 Wang, L.-K., Erle, M.A., Tsen, C., Schwarz, E.M., Schulte, M.J.: A survey of hardware designs for decimal arithmetic, J. IBM Res. Dev., 2010, 54, (3), pp. 8:1-8:15
4 Duale, A.Y., Decker, M.H., Zipperer, H.-G., Aharoni, M., Bohizic, T.J.: Decimal floating-point in z9: an implementation and testing perspective, J. IBM Res. Dev., 2007, 51, (1/2), pp. 217-227
5 Eisen, L., J.W.W. III, Tast, H.-W., et al.: IBM POWER6 accelerators: VMX and DFU, J. IBM Res. Dev., 2007, 51, (6), pp. 663-683
6 Schwarz, E.M., Kapernick, J.S., Cowlishaw, M.F.: Decimal floating-point support on the IBM system z10 processor, J. IBM Res. Dev., 2009, 53, (1), pp. 4:1-4:10
7 Harrison, J.: Presentation: decimal transcendentals via binary (June 2009). http://www.ac.usc.es/arith19/sites/default/files/S7P2DecimalTranscendentalsViaBinary.pdf
8 Kropa, J.C.: Calculator algorithms, Math. Mag., 1978, 51, (2), pp. 106-109
9 Imbert, L., Muller, J.M., Rico, F.: A radix-10 BKM algorithm for computing transcendentals on pocket computers, J. VLSI Signal Process. Syst., 2000, 25, (2), pp. 179-186
10 Harrison, J.: Decimal transcendentals via binary. 19th IEEE Symp. on Computer Arithmetic (ARITH19), 2009, pp. 187-194
11 Muller, J.M.: Elementary functions, algorithms and implementation (Birkhauser Verlag, Boston, USA, 2005, 2nd edn.)
12 Ercegovac, M.D., Lang, T., Montuschi, P.: Very high-radix division with prescaling and selection by rounding, IEEE Trans. Comput., 1994, 43, (8), pp. 909-918
13 Lang, T., Montuschi, P.: Very-high radix square root with prescaling and rounding and a combined division/square root unit, IEEE Trans. Comput., 1999, 48, (8), pp. 827-841
14 Antelo, E., Lang, T., Bruguera, J.D.: Computation of $\sqrt{x/d}$ in a very-high radix combined division/square-root unit with scaling and selection by rounding, IEEE Trans. Comput., 1998, 47, (2), pp. 152-161
15 Antelo, E., Lang, T., Bruguera, J.: High-radix CORDIC rotation based on selection by rounding, J. VLSI Signal Process. Syst., 2000, 25, (2), pp. 141-153
16 Pineiro, A., Ercegovac, M.D., Bruguera, J.D.: High-radix logarithm with selection by rounding: algorithm and implementation, J. VLSI Signal Process. Syst., 2005, 40, (1), pp. 109-123
17 Chen, D., Han, L., Choi, Y., Ko, S.: Improved decimal floating-point logarithmic converter based on selection by rounding, IEEE Trans. Comput., 2012, 61, (5), pp. 607-621
18 Chen, D., Zhang, Y., Teng, D., Wahid, K., Lee, M.H., Ko, S.-B.: A new decimal antilogarithmic converter. IEEE Symp. on Circuit and System (ISCAS09), 2009, pp. 445-448
19 Vazquez, A., Villalba, J., Antelo, E., Zapata, E.L.: Redundant floating-point decimal CORDIC algorithm, IEEE Trans. Comput., PrePrint, 2012
20 Kaivani, A., Jaberipur, G.: Decimal cordic rotation based on selection by rounding: algorithm and architecture, Comput. J., 2011, 54, (11), pp. 1798-1809
Table 6 Comparison of the Decimal64 antilogarithmic converter with other designs

| Works | Cycle time (FO4) | Cycles (No.) | Latency (FO4) | Ratio | Area (NAND) | Ratio | ROM (kB) |
|---|---|---|---|---|---|---|---|
| proposed | 28.0 | 19 | 532.0 | 1.00 | 29 325 | 1.00 | 2.14 |
| proposed^a | 28.0 | 18 | 504.0 | 0.95 | 26 197 | 0.89 | 2.14 |
| previous [18]^a | 110.0 | 18 | 1980.0 | 3.72 | 14 073 | 0.48 | 2.14 |
| CORDIC [20]^b | 34.62 | 35 | 1211.7 | 2.40 | 18 826 | 0.64 | 4.50 |
| CORDIC [19] | 13.0 | 200 | 2600.0 | 4.89 | N/A | N/A | N/A |
| software [10] | 23.0 | 1060 | 24 380 | 45.8 | N/A | N/A | N/A |

Software library running on an Intel Core(TM) 2 Quad @ 2.66 GHz
^a 16-digit DXP antilogarithmic converter
^b 16-digit DXP CORDIC unit
21 Cowlishaw, M.F.: Densely packed decimal encoding, J. IEE Comput. Digit. Tech., 2002, 149, (3), pp. 102-104
22 Cornea, M., Harrison, J., Anderson, C., Tang, P.T.P., Schneider, E., Gvozdev, E.: A software implementation of the IEEE 754R decimal floating-point arithmetic using the binary encoding format, IEEE Trans. Comput., 2009, 58, (2), pp. 148-162
23 Lefevre, V., Muller, J.M., Tisserand, A.: Toward correctly rounded transcendentals, IEEE Trans. Comput., 1998, 47, (11), pp. 1235-1243
24 Oklobdzija, V.G.: An algorithmic and novel design of a leading zero detector circuit: comparison with logic synthesis, IEEE Trans. Very Large Scale Integr. (VLSI) Syst., 1994, 2, (1), pp. 124-128
25 Lang, T., Nannarelli, A.: A radix-10 combinational multiplier. IEEE Asilomar Conf. on Signals, Systems and Computers (ACSSC06), 2006, pp. 313-317
26 Erle, M.A., Schulte, M.J.: Decimal multiplication via carry-save addition. 14th IEEE Int. Conf. on Application-Specific Systems, Architectures, and Processors (ASAP03), 2003, pp. 348-358
27 Vazquez, A., Antelo, E.: Conditional speculative decimal addition. Seventh Conf. on Real Numbers and Computers (RNC 7), 2006, pp. 47-57
28 Kogge, P.M., Stone, H.S.: A parallel algorithm for the efficient solution of a general class of recurrence equations, IEEE Trans. Comput., 1973, C-22, (8), pp. 786-793
29 Deschamps, J.-P., Bioul, G.J.A., Sutter, G.D.: Synthesis of arithmetic circuits: FPGA, ASIC and embedded systems (Wiley, 2006, 1st edn.)
30 STMicroelectronics, 90 nm CMOS090 Design Platform, 2007
31 Sutherland, I., Sproull, R., Harris, D.: Logical effort: designing fast CMOS circuits (Morgan Kaufmann, 1999, 1st edn.)
32 Intel Corporation, Using decimal floating-point with Intel C++ compiler, http://software.intel.com/en-us/articles/using-decimal-floating-point-with-intel-c-compiler, 2010
33 Pineiro, A., Ercegovac, M.D., Bruguera, J.D.: Algorithm and architecture for logarithm, exponential, and powering computation, IEEE Trans. Comput., 2004, 53, (9), pp. 1085-1096