
APZ 212 30: Ericsson’s new high-capacity AXE central processor
Per Holmberg and Nils Isaksson

The APZ 212 30 is a completely new central processor for the AXE system with three to four times the execution capacity of its predecessors. The high capacity is achieved by combining unique APZ performance features with state-of-the-art processor architecture and innovative design.

Other features in the new processor are improved memory capacity and a new ring network for external communication interfaces that greatly extends the flexibility of regional processor bus configurations and enables the APZ processors to make use of new, high-speed communication interfaces. The implementation of highly integrated CMOS circuits gives low power dissipation and improves reliability.

Being fully compatible with existing applications and installed hardware, the new processor can be used in new installations and for upgrading the system execution and memory capacity of previously installed systems. With this new processor, the AXE 10 system satisfies the demands for greater capacity created by the increasing number of subscribers in mobile networks and by new, revenue-generating service offerings.

The authors describe the architecture and implementation of the new APZ 212 30 processor, paying special regard to its advanced execution and communication mechanisms.

APZ 212 30 architecture

The APZ 212 30 is a completely new design (Figure 1). It retains and further enhances the unique high-performance architecture of the APZ 212 series of processors, implements a state-of-the-art execution pipeline, and introduces several new performance features. Its entire architecture has been optimized for the characteristics of telecommunications: efficient context switching, memory access and communication enable the processor to execute thousands of tasks in parallel. Instead of relying on a single processor unit to do all the work, the task of execution has been divided between two dedicated processors:
• the instruction processor unit (IPU), which executes application code; and
• the signal processor unit (SPU), which terminates protocols and schedules jobs (in conventional computers, these functions are usually associated with the operating system).

Another feature retained from previous APZ 212 processors is the pure Harvard architecture, in which the IPU has separate instruction and data caches and separate memory for instructions (program store, PS) and data (data store, DS). This design permits parallel access to instructions and data even at cache misses.

Program execution in the IPU is very advanced: instructions are decoded and executed in parallel (superscalar execution); the instructions are also dynamically reordered (“out-of-order” execution) for optimum performance. Instructions from application programs are decoded into internal RISC-style (reduced instruction-set computer) instructions. To handle jumps in the code, the processor employs dynamic branch prediction, executing on the predicted path (speculative execution). Innovative features include the pre-decoding of instructions as they are loaded into the program store, and a high-performance data store architecture.

The APZ 212 30 communicates with the AXE system via the regional processor bus handler (RPH), which implements a new ring network that allocates communication bandwidth to serial and parallel regional processor (RP) bus interfaces and high-capacity networks.
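The dynamic branch prediction mentioned above can be illustrated with a small model. The article gives only the size of the prediction table (64 K entries, quoted later in the text); the 2-bit saturating counters, the indexing by low address bits and all names below are textbook assumptions, not the APZ 212 30 design.

```python
class BranchPredictor:
    """Table of 2-bit saturating counters (a textbook scheme, assumed here)."""

    def __init__(self, entries=64 * 1024):
        if entries <= 0 or entries & (entries - 1):
            raise ValueError("entries must be a power of two")
        self.mask = entries - 1
        # Counter values 0-1 predict "not taken", 2-3 predict "taken".
        self.table = [1] * entries

    def predict(self, address):
        """Return True if the branch at this address is predicted taken."""
        return self.table[address & self.mask] >= 2

    def update(self, address, taken):
        """Move the counter one step toward the actual outcome."""
        i = address & self.mask
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)


def count_mispredictions(predictor, address, outcomes):
    """Replay a sequence of branch outcomes and count wrong predictions."""
    misses = 0
    for taken in outcomes:
        if predictor.predict(address) != taken:
            misses += 1
        predictor.update(address, taken)
    return misses


# A loop branch that is taken nine times and then falls through, three times over.
loop_pattern = ([True] * 9 + [False]) * 3
misses = count_mispredictions(BranchPredictor(), 0x1234, loop_pattern)
```

In this toy model the predictor learns the loop branch after its first outcome and then mispredicts only the loop exit, which is the behaviour speculative execution relies on.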

BOX A, ABBREVIATIONS

ALU     Arithmetic logic unit
ASIC    Application-specific integrated circuit
BGA     Ball grid array
CMOS    Complementary metal-oxide semiconductors
CP      Central processor
CPS     Central processor operating system
DMA     Direct memory access
DRAM    Dynamic random-access memory
DS      Data store
ECC     Error-correcting code
I/O     Input-output
IPU     Instruction processor unit
ISP     In-service performance
MAS     Maintenance system
MAU     Maintenance unit
MTBSF   Mean time between system failures
POWC    Power controller
PRS     Program and reference store
PS      Program store
RISC    Reduced instruction set computer
RP      Regional processor
RPH     Regional processor handler
RS      Reference store
SDRAM   Synchronous DRAM
SPU     Signal processor unit
SRAM    Static RAM
SSRAM   Synchronous, static RAM
UMB     Update and match bus

Job signal flow

The regional processor bus handler connects directly to up to 32 regional processor bus branches. External job signals (messages) that arrive over the RP bus are forwarded to the signal processor unit, which analyzes the signal, assigns a priority to it, and queues it in the job buffer where it awaits execution in the instruction processor unit. The SPU loads one job signal at a time into the IPU. When a job signal arrives, the IPU identifies it, looks up the start address of related program code in the program and reference store (PRS) table, and then begins executing the program. Programs that execute in

148 Ericsson Review No. 3, 1999


Figure 1
The APZ 212 30 architecture. (Diagram labels: RP bus, RPH, SPU, job buffers, IPU with execution pipe, PS, RS, DS. Legend: job signal flow, instruction flow.)

the IPU can send new job signals to the SPU. Signals that are designated to other program blocks are queued in job buffers; signals designated to other processors in the system are routed for transmission over the RP buses.

Job signal pipeline

The IPU-SPU interface has been optimized to support high throughput of job signals, which are transported through a pipeline between the SPU and IPU. While the IPU executes one job, the SPU preloads the next job signal directly into an extra bank of processor registers in the IPU. Thus, when the IPU finishes the first job, it swaps register banks and immediately begins executing the preloaded job without first having to copy registers (Figure 2).

Job signals to the SPU are also transported through a pipeline. The processor registers included in the signals are copied to a send buffer using the full internal bandwidth on the IPU processor chip. The SPU fetches signals from the send buffer while the IPU continues executing the same job or switches to the next. The SPU uses autonomous direct-memory-access (DMA) engines to transport job signals.

Figure 2
The IPU-SPU interface: the job signal pipeline. (Diagram labels: IPU, processor registers, extra register bank, send buffer, DMA, SPU, job buffers.)

The combination of off-loading job scheduling and protocol termination from the main CPU to the SPU and of job signal pipelines enables the APZ 212 30
• to switch contexts and start executing a new job in as few as 30 clock cycles; and
• to send a signal in just 15 clock cycles.
Compare this to the hundreds or thousands of clock cycles that a standard microprocessor needs to do the same task. The APZ 212 30 can thus efficiently switch context 300,000 times per second and still devote most of its time to executing application code.

IPU structure

Unlike ordinary microprocessors, the APZ 212 30’s instruction processor circuit is not limited to a single processor bus for communicating with main memory and other processors and networks. Instead, separate high-capacity buses connect to the program and reference store memory, the data store memory boards, and the SPU. Each bus operates at full processor frequency (Figure 3).

Figure 3
The IPU block diagram. (Diagram labels: IPU; PS and RS (SDRAM, SSRAM); address and data buses marked 128 bit + ECC, 16+16 bit + parity, 36 bit + parity and 32 bit + parity; UMB; redundant processor side; SPU; DS (1–8 DS memory boards, DRAM/SRAM).)

The IPU memory interfaces have been optimized to suit the characteristics of telecommunications applications. Access to the program and reference store is often sequential and concentrated to a narrow range of addresses. To support this kind of access, the memory system implements a wide bus and uses “page mode” in modern, synchronous, dynamic, random-access memory (SDRAM). This combination gives almost instantaneous access (three clock cycles) to data within an 8 Kword (16 Kbyte) range of addresses. Similarly, frequently used tables and program blocks are copied to synchronous, static RAM (SSRAM) for access in just two clock cycles.

By contrast, access to the data store is usually non-sequential and distributed between several different memory addresses. The IPU supports this kind of access by dividing the data store into banks and allowing up to eight parallel access attempts, provided no two attempts address the same bank simultaneously. The memory area on each data store memory board is divided into 16 banks, which means that one board can give full memory bandwidth in the system. The data store, which is highly configurable, can hold any combination of one to eight memory boards. Two types of memory board are currently available: a 512 MWord (1 GByte) DRAM board, and a 32 MWord (64 MByte) high-speed SRAM board.

Execution of instructions

Instructions are executed in the instruction processor unit, which has a short, six-stage instruction-execution pipeline (Figure 4) in which each stage corresponds to one clock cycle. Instructions are shown flowing from top to bottom, with a new pair of instructions starting each clock cycle. In telecommunication applications, which are characterized by many changes in control flow (short jobs and frequent jumps and calls to other software blocks), the pipeline must be short.

Internally, the processor uses the equivalent of internal RISC microinstructions. Application program instructions are decoded into these microinstructions before they are executed. Complex instructions are decoded into a stream of microinstructions.
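As a rough illustration of why a short pipeline suits code with frequent jumps, the following sketch counts clock cycles in an idealized model of the six-stage, two-wide pipeline described above. It assumes one microinstruction per instruction and no cache misses or operand stalls, and it charges a hypothetical refill penalty of (stages - 1) cycles per pipeline restart; none of these simplifications come from the article.

```python
import math

STAGES = 6        # six-stage pipeline, one clock cycle per stage
ISSUE_WIDTH = 2   # a new pair of instructions starts each clock cycle


def ideal_cycles(n_instructions, stages=STAGES, width=ISSUE_WIDTH):
    """Cycles to run n instructions with no stalls (idealized model)."""
    if n_instructions <= 0:
        return 0
    groups = math.ceil(n_instructions / width)
    # The first group needs `stages` cycles; each later group finishes
    # one cycle after the previous one.
    return stages + groups - 1


def cycles_with_restarts(n_instructions, restarts, stages=STAGES,
                         width=ISSUE_WIDTH):
    """Add an assumed refill penalty of (stages - 1) cycles per restart."""
    return ideal_cycles(n_instructions, stages, width) + restarts * (stages - 1)


smooth = ideal_cycles(100)              # 100 instructions, no restarts
jumpy = cycles_with_restarts(100, 10)   # same code with 10 pipeline restarts
```

Under these assumptions every restart costs time proportional to the pipeline depth, so a deeper pipeline would pay more for each mispredicted jump or context change.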



The first stage, called Instruction fetch, fetches instructions in 128 bit memory words (eight 16 bit words) from the on-chip program cache, the external second-level cache, or from the program and reference store. Given an average instruction length of 1.5 words, there are approximately five instructions in each memory word. Thus, five new instructions are loaded every clock cycle.

In the second stage, called Partition, up to two instructions are extracted from the memory word. These are decoded in the third stage, Decode. Instructions that perform simple operations (such as an ADD instruction, which adds the contents of two processor registers) are directly decoded into a single microinstruction. Instructions that perform complex operations (for instance, an end program (EP) instruction, which ends the execution of the current job and switches context) are decoded into a stream of microinstructions.

The fourth stage is called Opread. Depending on the type of operand, the microinstructions are written to one of five queues, called reservation stations. Here, the instructions wait for their operands to be fetched from the register file or from memory, or they wait for the results of earlier instructions. Up to eight instructions can be active in this stage simultaneously.

When an instruction has received all its operands, it is passed to the fifth stage, Execute. In this stage, up to two instructions can be executed in parallel, in separate arithmetic logic units (ALU).

The final stage, Commit, writes the results of the instructions to a register or memory.

Figure 4
Instruction pipeline. (Diagram labels: instruction execution pipeline; instruction processor circuit; Instruction fetch, Partition, and two parallel columns of Decode, Opread, Execute and Commit; out-of-order execution engine; PS, RS, UMB, DS, SPU.)

The ultramodern design of the execution pipeline features the following characteristics:
• superscalar execution: two instructions can be decoded, executed and committed in the same clock cycle;
• branch prediction: when the processor performs a conditional jump, it does not wait until the branch condition is known; instead, it predicts the most probable branch and continues execution on that branch. Branch prediction is based on a very large 64 K entry prediction table used to achieve high prediction accuracy in the telecom application;
• speculative execution: execution on a conditional branch is speculative until the branch condition is known. The results from executed instructions are stored in temporary registers. If the processor failed to predict the correct branch, then the registers are cleared and execution is restarted. Otherwise, the results are committed in the last stage of the pipeline (Commit);
• out-of-order (non-sequential) execution: if one instruction is delayed (for example, from waiting for data to arrive from memory), then subsequent instructions are allowed to bypass the waiting instruction. The hardware first checks inter-instruction dependencies to ensure that an instruction does not bypass any instructions on which it depends for data;
• register renaming: the processor has access to more physical registers than the programmer can see. Dependencies are avoided by assigning a new, temporary register to the results of each instruction. This further enhances the capacity of out-of-order execution;
• multilevel instruction cache system: support for fetching instructions is provided by a small, single-clock-cycle access, on-chip, level-one cache and a larger SRAM-based level-two cache; and
• data cache: support for fetching data is provided by a small, on-chip data cache together with optional, low-latency SRAM boards.

BOX B, MAIN FEATURES OF THE APZ 212 30 PROCESSOR

The main features of the APZ 212 30 processor are:
• Large processing capacity: three to four times more capacity than the APZ 212 20 (depending on application characteristics). To expand capacity, designers combined a new, advanced processor architecture, a higher clock frequency and high-speed static RAM data-storage boards.
• Large memory capacity: 4 GWord (8 GByte) of data storage (up from 1.5 GWord), 96 MWord (192 MByte) of program storage (up from 64 MWord), and 32 MWord (64 MByte) of reference storage (up from 4 MWord).
• New functions: a generic communication bus interface based on a ring network that allows for adaptation to high-speed networks. A new communication-buffer mechanism allows program blocks to share
  – data in a buffer; and
  – access, using new, rapid-communication buffer read-and-write instructions.
• Optional configurations with high-speed SRAM boards and standard dynamic RAM boards for data storage. By using additional SRAM boards, the system can be configured from optimal price/capacity to best capacity.
• In-service performance (ISP): integration into custom complementary metal-oxide semiconductors (CMOS) improves system reliability. In most configurations, mean time between system failures (MTBSF) is now more than 10,000 years. Well-defined interfaces and fewer boards improve diagnostics when hardware faults occur.
• Improved hardware maintenance: board number, revision.
• Reduced size: the processor fits into a single 600 mm, double-sided cabinet (half the size of its predecessor).

Figure 5
Duplicated hardware structure of the APZ 212 30 central processor. (Diagram labels on each of the two sides, CP-A and CP-B: RPH with RHC; SPU with SMC and SSC; IPU with IPC, UBC and MIC; POWC and power; DSU with DSC; DS boards 0 to 7. Labels between the two sides: UMB, MAI, MAU.)

DS      Data store
DSC     Data store circuit
DSU     Data store unit
IPC     Instruction processor circuit
IPU     Instruction processor unit
MAI     Maintenance unit interface
MAU     Maintenance unit
MIC     Maintenance interface circuit
POU     Power unit
POWC    Power control unit
PRS     Program and reference store
RHC     RPH circuit
RPH     Regional processor handler
SMC     SPU master circuit
SPU     Signal processor unit
SSC     SPU slave circuit
UBC     Update bus circuit

In addition to these features, the APZ instruction processor implements the following unique features to further improve capacity:
• Harvard architecture: the design of separate instruction and data memory allows simultaneous access to instructions and data.
• Load-time pre-decode: instructions are mapped to a new optimized format when loaded into the program memory. This action, which is performed by the loader in the operating system, is not visible to the user. Thanks to the new format, the IPU is able to extract two instructions from a memory word in just one clock cycle.
• Early jump extraction: jump instructions and their target addresses are identified in the partition stage before the instructions have been decoded. This enables the IPU to fetch instructions from the new path earlier and minimizes the penalty for taking the jump. This feature is especially important in processors for telecom applications, since the associated code has a high frequency of jump instructions.
• Early load-extraction: the one factor that most affects capacity in modern processors that run applications such as a telecommunications control system (which uses a large amount of data storage) is the access time for reading data from



DRAM. The new, optimized instruction format enables the IPU to identify and extract the variable address early on in the pipeline (during the partition stage), which decreases the access time for reading data.
• Loop unroller: instead of running loops in the microprogram, the loop unroller generates sequential instructions on the fly. This completely eliminates jumps in loops and improves capacity when data is copied to and from registers; when data is copied from memory to memory; and during linear search operations in memory.

Signal processor unit

The signal processor unit (SPU) is equipped with two specialized processors: the SPU master processor and the SPU slave processor, each of which is a micro-programmed RISC processor whose instruction set has been optimized for its specific duties.

The SPU master processor schedules jobs and preloads them for execution in the IPU. It also schedules periodic jobs in the system by scanning the job table and creating and scheduling the job signals to start them.

The SPU slave processor administers the RPH ring network and terminates the RP bus protocol (for example, retransmission). Outgoing messages are routed to the correct RPH. Incoming messages are forwarded to the SPU master processor.

Direct memory access engines automatically transport signals between SPU buffer memory and the IPU as well as to the RPH.

Regional processor handler

The regional processor handler connects the central processor to the regional processors by providing interfaces to up to 32 RP bus branches. Internally, the regional processor handler implements a new ring network for communication between the SPU and interface boards. The ring network yields greater bandwidth and facilitates flexible configuration of interface boards and flexible allocation of communication bandwidth. There are currently two interface board types: one for connecting to two parallel RP buses; and one for connecting to four of the new, serial RP buses. The ring network in the RPH also supports the addition of new, high-speed data-communication interfaces (Box C).

In-service performance

With its fault-tolerant configuration, using two central-processor sides (CP-A and CP-B) that execute in parallel, the APZ 212 30 furthers the tradition of APZ robustness (Figure 5). A maintenance unit (MAU) supervises operation, selecting one side to execute and the other side to operate on standby. The standby side performs the same operations as the executing side, trailing it by 12 clock cycles. Having two sides guarantees tolerance against hardware faults and enables operators to conduct maintenance activities without loss of service. For example, one side can be extended with new hardware or software

BOX C, DESIGN AND POTENTIAL OF THE RPH/RING NETWORK

Logical design
Several point-to-point connections from the SPU to the RPH with two kinds of communication channel:
• a signaling channel
• a broadcast channel (from the SPU to every RPH)

Physical design
• Ring topology with a time-slot protocol that guarantees bandwidth per RPH.
• Synchronous clock operation supports fault tolerance in the central processor.
• 160 Mbit/s bandwidth
• Highly configurable
  – 1 to 16 interface boards of available types
  – free configuration of parallel and serial RP bus interfaces
  – (prepared for) dynamically allocable bandwidth per board
  – support for new high-speed interfaces
  – automatic configuration identifies new boards



Figure 6
APZ 212 30 cabinet. (Labels: central processor unit A; central processor unit B; regional processor handler A (in front); regional processor handler B (behind).)

while the other side continues executing system operations.

In addition to the fault-tolerant configuration, each CP side has been designed to provide high availability. Memory (SRAM and SSRAM) is protected by error-correcting code (ECC), which corrects single bit errors. The ECC also corrects faults in whole circuits. The data store, for example, contains a large number of memory circuits, providing up to 8 GByte of memory. Consequently, if faults are detected in these circuits, the ECC corrects them.

The extensive use of application-specific integrated circuits (ASIC) makes for a very clean design (circuit boards solely contain custom circuits and memory) and reduces power dissipation. These features contribute toward exceptional mean time between system failures (MTBSF) for hardware faults: 10,000 years in most configurations.

Technology

The APZ 212 30 is composed of eight separate circuit designs: one for the data store unit, two for the IPU boards, two for the SPU board, one for the RPH, one for the power controller (POWC) board, and one for the test unit (MIT trace equipment). All circuits have been implemented in high integration 0.35 micron CMOS. This circuit technology facilitated the advanced architecture of the APZ 212 30 that was needed to attain high capacity. To a large extent, the processor capacity of telecommunications applications is limited by the access time to large external data stores. At 80 MHz system frequency, the APZ 212 30 with its advanced, superscalar IPU, Harvard architecture, SRAM boards, and use of load-time pre-decode procedures, can easily keep up with memory access attempts.

The processor circuit of the IPU is housed in a 735-pin ball grid array (BGA) package. It is the largest circuit, with 2.8 million transistors in logic and 7.4 million transistors in memory.

The APZ 212 30 processor is housed in a single cabinet (600 mm wide) that holds four subracks, two for each CP side (Figure 6). The CPU subrack of each side holds the processor and memory boards; the RPH subrack holds the interface cards to the RP buses.



Software support and adaptation

Accompanying the new hardware are new releases of the operating system (CPS) and the maintenance system (MAS). These include support for the new processor hardware and functionality:
• Communication buffers: operating system support for allocating and de-allocating communication buffers and for managing buffer pools.
• Data store SRAM boards: the operating system measures and allocates frequently used data to SRAM.
• Measurement functions: support for built-in performance counters and for measuring software behavior; capacity-enhancing mechanisms in the processor.
• RPH ring network: configuration of interface boards.

Software design support

While hardware designers developed the new APZ processor, other developers worked on an upgrade for the software design support:
• PLEX compiler: the new release offers better capacity and supports the new communication buffers.
• EMU CP emulator: a new version, based on new emulation technology, speeds up emulation and more exactly emulates the APZ processor.
• MIT trace equipment: this single-board trace device, which fits into a spare slot in the CPU subrack, can record every operation in the CP, including the execution of all application instructions, the execution of microinstructions in the IPU and the two SPU processors, and signals between units. This device is invaluable when debugging the new system. In real systems, collected traces give input for detailed analyses of application behavior.

Future directions

The APZ 212 30 is fully upgradable. Its advanced architecture will deliver substantially greater capacity when adapted to future silicon processes. Moreover, new features can be included for further decreasing system downtime and simplifying system handling.

Subsequent designs will make greater use of standard computer interfaces and components and support standard Internet protocols.

Conclusion

The APZ 212 30 is a completely new processor design that is now in operation in several markets around the world. It uses an advanced architecture for achieving high capacity in telecommunications applications. The processor yields three to four times greater execution capacity and extends data store capacity to a full 4 Gword (8 Gbyte).

A new, generic communication bus interface gives greater flexibility and opens the system to new types of communication buses.

The advanced processor architecture, implemented through standard CMOS technology, fulfills objectives for performance, integration, and power dissipation, and offers exceptionally high mean time between hardware-related system failures.

