
The ARM Cortex-M4 Processor

Architecture

Module Syllabus
ARM Architectures and Processors
What is ARM Architecture
ARM Processor Families
ARM Cortex-M Series
Cortex-M4 Processor
ARM Processor vs. ARM Architectures

ARM Cortex-M4 Processor


Cortex-M4 Processor Overview
Cortex-M4 Block Diagram
Cortex-M4 Registers

ARM ARCHITECTURES AND PROCESSORS

What is ARM Architecture


ARM architecture is a family of RISC-based processor architectures
Well-known for its power efficiency;
Hence widely used in mobile devices, such as smartphones and tablets
Designed and licensed to a wide ecosystem by ARM

ARM Holdings
The company designs ARM-based processors;
Does not manufacture, but licenses designs to semiconductor partners, who add their own Intellectual Property (IP) on top of ARM's IP, then fabricate and sell to customers;
Also offers other IP apart from processors, such as physical IP, interconnect IP, graphics cores, and development tools.

ARM Processor Families

Cortex-A series (Application)
High performance processors capable of full Operating System (OS) support;
Applications include smartphones, digital TV, smart books, home gateways etc.

Cortex-R series (Real-time)
High performance for real-time applications;
High reliability
Applications include automotive braking systems, powertrains etc.

Cortex-M series (Microcontroller)
Cost-sensitive solutions for deterministic microcontroller applications;
Applications include microcontrollers, mixed signal devices, smart sensors, automotive body electronics and airbags;

SecurCore series
High security applications.

Previous classic processors
Include ARM7, ARM9, ARM11 families

Processors by family (as of Dec 2013)
Cortex-A: Cortex-A57, Cortex-A53, Cortex-A15, Cortex-A9, Cortex-A8, Cortex-A7, Cortex-A5
Cortex-R: Cortex-R7, Cortex-R5, Cortex-R4
Cortex-M: Cortex-M4, Cortex-M3, Cortex-M1, Cortex-M0+, Cortex-M0
SecurCore: SC000, SC100, SC300
Classic: ARM11, ARM9, ARM7

Cortex-M processors are the optimal solution for low-power embedded computing applications. The 32-bit Cortex-M
processor family is the key to transforming all sorts of embedded systems into smart and connected systems. Often
provided as a black box with pre-loaded applications, they have limited capability to expand hardware functionality and in
most cases no screen.
Merchant MCUs
Automotive Control Systems
White Goods controllers
Smart Meters
Sensors
Motor Control Systems
Internet of Things

Equipment Adopting ARM Cores

IR Fire Detector
Intelligent toys
Utility Meters
Exercise Machines
Energy Efficient Appliances
Tele-parking
Intelligent Vending

Source: ARM University Program Overview

With More Than 50 Billion

Over 10 Billion ARM-powered chips shipped in 2013 alone.

Strong and Consistent Growth Since 1993

This curve shows overall shipments leading to the 50 billion milestone. There's been an upward trend as shipments skyrocketed in recent years.

[Figure: shipment curve from 1993 to 2013, with 10 Billion and 50 Billion markers]

www.50billionchips.com

Source: ARM University Program Overview

Markets we're POWERING

20% | Embedded
Applications including automotive, touchscreen controllers, industrial equipment, connectivity and smartcards

16% | Enterprise
Applications such as hard disk drives, and wireless/wireline networking infrastructure equipment

6% | Home
Consumer devices such as smart TVs, game consoles and home networking gateways

58% | Mobile
Devices including smartphones, mobile phones, tablets, e-readers and wearables

www.50billionchips.com

Source: ARM University Program Overview

Based on Lecture Notes by Marilyn Wolf



ARM Architecture versions


(From arm.com)

Design an ARM-based SoC

Select a set of IP cores from ARM and/or other third-party IP vendors
Integrate IP cores into a single chip design
Give design to semiconductor foundries for chip fabrication

[Figure: design flow from IP libraries to SoC Design to Chip Manufacture]
IP libraries (licensable IPs): Cortex-A9, Cortex-R5, Cortex-M4, ARM7, ARM9, ARM11; DRAM ctrl, FLASH ctrl, SRAM ctrl; AXI bus, AHB bus, APB bus; GPIO, I/O blocks, Timer
ARM-based SoC: ARM processor, ROM, RAM, System bus, Peripherals, External Interface

ARM Cortex-M Series


Cortex-M series: Cortex-M0, M0+, M1, M3, M4.
Energy-efficiency
Lower energy cost, longer battery life

Smaller code
Lower silicon costs

Ease of use
Faster software development and reuse

Embedded applications
Smart metering, human interface devices, automotive and industrial control systems, white goods, consumer products and medical instrumentation

As of Dec 2013

ARM Processors vs. ARM Architectures


ARM architecture
Describes the details of the instruction set, programmer's model, exception model, and memory map
Documented in the Architecture Reference Manual

ARM processor
Developed using one of the ARM architectures
Adds implementation details, such as timing information
Documented in the processor's Technical Reference Manual

Architecture versions and example processors (as of Dec 2013)
ARMv4/v4T Architecture: e.g. ARM7TDMI
ARMv5/v5E Architecture: e.g. ARM926EJ-S
ARMv6 Architecture: e.g. ARM1136
    ARMv6-M: e.g. Cortex-M0, M1
ARMv7 Architecture
    ARMv7-A: e.g. Cortex-A9
    ARMv7-R: e.g. Cortex-R4
    ARMv7-M: e.g. Cortex-M4
ARMv8 Architecture
    ARMv8-A: e.g. Cortex-A53, Cortex-A57
    ARMv8-R

ARM Cortex-M Series Family

Processor    ARM            Core           Thumb    Thumb-2   Hardware        Hardware   Saturated   DSP          Floating
             Architecture   Architecture                      Multiply        Divide     Math        Extensions   Point
Cortex-M0    ARMv6-M        Von Neumann    Most     Subset    1 or 32 cycle   No         No          Software     No
Cortex-M0+   ARMv6-M        Von Neumann    Most     Subset    1 or 32 cycle   No         No          Software     No
Cortex-M1    ARMv6-M        Von Neumann    Most     Subset    3 or 33 cycle   No         No          Software     No
Cortex-M3    ARMv7-M        Harvard        Entire   Entire    1 cycle         Yes        Yes         Software     No
Cortex-M4    ARMv7E-M       Harvard        Entire   Entire    1 cycle         Yes        Yes         Hardware     Optional

RISC CPU Characteristics

32-bit load/store architecture
Fixed instruction length
Fewer/simpler instructions than CISC CPU
Limited addressing modes, operand types
Simple design easier to speed up, pipeline & scale

ARM assembly language

Fairly standard RISC assembly language:

label   LDR r0,[r8]     ; a comment
        ADD r4,r0,r1    ; r4=r0+r1

In ADD r4,r0,r1: r4 is the destination, r0 is the source/left operand, r1 is the source/right operand.

ARM data types

Word is 32 bits long.
Word can be divided into four 8-bit bytes.
ARM addresses can be 32 bits long.
Address refers to byte.
Address 4 starts at byte 4.
Configured at power-up in either little- or big-endian mode.

Endianness
Relationship between bit and byte/word ordering defines endianness:

little-endian (default):   bit 31 | byte 3 | byte 2 | byte 1 | byte 0 | bit 0
big-endian:                bit 0  | byte 0 | byte 1 | byte 2 | byte 3 | bit 31
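As a rough illustration of the byte ordering described above, the sketch below inspects how the machine running it lays out a 32-bit word in memory, and converts between the two orderings by reversing the bytes. The function names `is_little_endian` and `swap_bytes` are hypothetical helpers introduced here, not part of any ARM library.

```c
#include <stdint.h>
#include <string.h>

/* Returns 1 if the least-significant byte of a word is stored at the
   lowest address (little-endian), 0 otherwise. */
int is_little_endian(void)
{
    uint32_t word = 0x03020100u;      /* byte n of the value holds n */
    uint8_t bytes[4];
    memcpy(bytes, &word, sizeof word);
    return bytes[0] == 0x00;
}

/* Reverses the four bytes of a word, converting a value from one
   byte ordering to the other. */
uint32_t swap_bytes(uint32_t w)
{
    return (w >> 24) |
           ((w >> 8) & 0x0000FF00u) |
           ((w << 8) & 0x00FF0000u) |
           (w << 24);
}
```

Applying `swap_bytes` twice returns the original word, which is a quick sanity check when exchanging data with a system of the opposite endianness.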

ARM CORTEX-M4 PROCESSOR OVERVIEW

Cortex-M4 Processor Overview


Cortex-M4 Processor
Introduced in 2010
Designed with a large variety of highly efficient signal processing features
Features extended single-cycle multiply accumulate instructions, optimized SIMD arithmetic, saturating arithmetic and an optional Floating Point Unit.

High Performance Efficiency
1.25 DMIPS/MHz (Dhrystone Million Instructions Per Second / MHz) at the order of µWatts / MHz

Low Power Consumption
Longer battery life, especially critical in mobile products

Enhanced Determinism
The critical tasks and interrupt routines can be served quickly in a known number of cycles

Cortex-M4 Processor Features


32-bit Reduced Instruction Set Computing (RISC) processor
Harvard architecture
Separated data bus and instruction bus

Instruction set
Includes the entire Thumb-1 (16-bit) and Thumb-2 (16/32-bit) instruction sets

3-stage + branch speculation pipeline

Performance efficiency
1.25 to 1.95 DMIPS/MHz (Dhrystone Million Instructions Per Second / MHz)

Supported Interrupts
Non-maskable Interrupt (NMI) + 1 to 240 physical interrupts
8 to 256 interrupt priority levels

Cortex-M4 Processor Features

Supports Sleep Modes
Up to 240 Wake-up Interrupts
Integrated WFI (Wait For Interrupt) and WFE (Wait For Event) Instructions and Sleep On Exit capability (to be covered in more detail later)
Sleep & Deep Sleep Signals
Optional Retention Mode with ARM Power Management Kit

Enhanced Instructions
Hardware Divide (2-12 Cycles)
Single-Cycle 16, 32-bit MAC, Single-cycle dual 16-bit MAC
8, 16-bit SIMD arithmetic

Debug
Optional JTAG & Serial-Wire Debug (SWD) Ports
Up to 8 Breakpoints and 4 Watchpoints

Memory Protection Unit (MPU)
Optional 8 region MPU with sub regions and background region

Cortex-M4 Processor Features

Cortex-M4 processor is designed to meet the challenges of low dynamic power constraints while retaining light footprints
180 nm ultra low power process: 157 µW/MHz
90 nm low power process: 33 µW/MHz
40 nm G process: 8 µW/MHz

ARM Cortex-M4 Implementation Data

Process             180ULL                          90LP                            40G
                    (7-track, typical 1.8v, 25C)    (7-track, typical 1.2v, 25C)    (9-track, typical 0.9v, 25C)
Dynamic Power       157 µW/MHz                      33 µW/MHz                       8 µW/MHz
Floorplanned Area   0.56 mm²                        0.17 mm²                        0.04 mm²

Cortex-M4 Block Diagram

[Figure: ARM Cortex-M4 Microprocessor block diagram]
Processor core, with optional FPU
Nested Vector Interrupt Controller (NVIC), with optional WIC
Optional Memory protection unit
Optional debug components: Debug Access Port, Embedded Trace Macrocell, Serial Wire Viewer, Flash patch, Data watchpoints
Bus matrix, connecting to the Code interface and to the SRAM and peripheral interface

Cortex-M4 Block Diagram


Processor core
Contains internal registers, the ALU, data path, and some control logic
Registers include sixteen 32-bit registers for both general and special usage

Processor pipeline stages
Three-stage pipeline: fetch, decode, and execution
Some instructions may take multiple cycles to execute, in which case the pipeline will be stalled
The pipeline will be flushed if a branch instruction is executed
Up to two instructions can be fetched in one transfer (16-bit instructions)

Time ---->
Instruction 1    Fetch    Decode   Execute
Instruction 2             Fetch    Decode   Execute
Instruction 3                      Fetch    Decode   Execute
Instruction 4                               Fetch    Decode   Execute

Cortex-M4 Block Diagram


Nested Vectored Interrupt Controller (NVIC)
Up to 240 interrupt request signals and a non-maskable interrupt (NMI)
Automatically handles nested interrupts, such as comparing priorities between interrupt requests and the current priority level

Wakeup Interrupt Controller (WIC)
For low-power applications, the microcontroller can enter sleep mode by shutting down most of the components.
When an interrupt request is detected, the WIC can inform the power management unit to power up the system.

Memory Protection Unit (optional)
Used to protect memory content, e.g. making some memory regions read-only or preventing user applications from accessing privileged application data

Cortex-M4 Block Diagram


Bus interconnect
Allows data transfer to take place on different buses simultaneously
Provides data transfer management, e.g. a write buffer, bit-oriented operations (bit-band)
May include bus bridges (e.g. AHB-to-APB bus bridge) to connect different buses into a network using a single global memory space
Includes the internal bus system, the data path in the processor core, and the AHB LITE interface unit
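To make the bit-band idea concrete: each bit in a bit-band region is mapped to its own word in an alias region, so a single word write can set or clear one bit. The sketch below computes the alias address for a bit in SRAM, assuming the usual Cortex-M3/M4 memory map (bit-band region starting at 0x20000000, alias region starting at 0x22000000) and that the device implements the optional bit-band feature. The function name `bitband_alias` is a hypothetical helper.

```c
#include <stdint.h>

/* Assumed addresses from the standard Cortex-M3/M4 memory map. */
#define SRAM_BITBAND_BASE  0x20000000u   /* start of the SRAM bit-band region */
#define SRAM_ALIAS_BASE    0x22000000u   /* start of the SRAM alias region    */

/* Returns the alias-region address of bit `bit` (0..7) of the byte at
   `byte_addr`. Each bit occupies one 32-bit word in the alias region,
   so one byte spans 32 bytes of alias space. */
uint32_t bitband_alias(uint32_t byte_addr, unsigned bit)
{
    uint32_t byte_offset = byte_addr - SRAM_BITBAND_BASE;
    return SRAM_ALIAS_BASE + byte_offset * 32u + bit * 4u;
}
```

For example, bit 2 of the byte at 0x20000300 maps to alias address 0x22006008. On a real device, writing 1 or 0 to that word would set or clear only that bit.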

Debug subsystem
Handles debug control, program breakpoints, and data watchpoints
When a debug event occurs, it can put the processor core in a halted state, where developers can analyse the status of the processor at that point, such as register values and flags

STM32F4xx Block Diagram

Cortex-M4 w/ FPU, MPU and ETM

Memory
Up to 1MB Flash memory
192KB RAM (including 64KB CCM data RAM)
FSMC up to 60MHz

New application specific peripherals
USB OTG HS w/ ULPI interface
Camera interface
HW Encryption**: DES, 3DES, AES 256-bit, SHA-1 hash, RNG.

Enhanced peripherals
USB OTG Full speed
ADC: 0.416 µs conversion/2.4Msps, up to 7.2Msps in interleaved triple mode
ADC/DAC working down to 1.8V
Dedicated PLL for I2S precision
Ethernet w/ HW IEEE1588 v2.0
32-bit RTC with calendar
4KB backup SRAM in VBAT domain
2 x 32bit and 8 x 16bit Timers
high speed USART up to 10.5Mb/s
high speed SPI up to 37.5Mb/s

RDP (JTAG fuse)
More I/Os in UFBGA 176 package

[Figure: STM32F4xx block diagram. Blocks shown:]
Core: CORTEX-M4 CPU + FPU + MPU, 168 MHz; JTAG/SW Debug; ETM; Nested vect IT Ctrl; 1 x SysTick Timer; D-bus, I-bus, S-bus
Interconnect: ARM 32-bit multi-AHB bus matrix; Arbiter (max 168MHz); DMA, 16 Channels; Clock Control; AHB1 (max 168MHz); AHB2 (max 168MHz); bridges to APB1 (max 42MHz) and APB2 (max 84MHz)
Memory: 512kB-1MB Flash Memory with Flash I/F; 128KB SRAM; 64KB CCM data RAM; 4KB backup RAM; External Memory Interface
Power and clocks: Power Supply Reg 1.2V, POR/PDR/PVD; XTAL oscillators 32KHz + 8~25MHz; Int. RC oscillators 32KHz + 16MHz; PLL; RTC / AWU
Connectivity: USB 2.0 OTG FS/HS; USB 2.0 OTG FS; Ethernet MAC 10/100, IEEE1588; Camera Interface; Encryption**; 1x SDIO; 2x CAN 2.0B; 1 x SPI; 2 x SPI / I2S; 2 x USART/LIN; 4x USART/LIN; 3x I2C
Timers and I/O: 51/82/114/140 I/Os; Up to 16 Ext. ITs; 2x6x 16-bit PWM Synchronized AC Timer; 3 x 16bit Timer; 5x 16-bit Timer; 2x 32-bit Timer; 2x Watchdog (independent & window)
Analog: 3x 12-bit ADC, 24 channels / 2Msps; 2x DAC + 2 Timers; Temp Sensor

HS requires an external PHY connected to ULPI interface
** Encryption is only available on STM32F415 and STM32F417

STM32F4 Series highlights 1/3


Based on Cortex M4 core
New DSP and FPU instructions combined with 168MHz operation
Over 30 new part numbers, pin-to-pin and software compatible with the existing STM32 F2 Series.

Advanced technology and process from ST:
Memory accelerator: ART Accelerator
Multi AHB Bus Matrix
90nm process

Outstanding results:
210DMIPS at 168MHz (1.25 DMIPS/MHz x 168MHz).
Execution from Flash equivalent to 0-wait state performance up to 168MHz thanks to ST ART Accelerator

STM32F4 Series highlights 2/3


More Memory
Up to 1MB Flash with optional permanent readout protection (JTAG fuse),
192kB SRAM: 128kB on bus matrix + 64kB (Core Coupled Memory) on data bus dedicated to CPU usage

Advanced peripherals
USB OTG High speed 480Mbit/s
Ethernet MAC 10/100 with IEEE1588
PWM High speed timers: 168MHz max frequency
Crypto/Hash processor, 32-bit random number generator (RNG)
32-bit RTC with calendar: with sub 1 second accuracy, and <1 µA

STM32F4 Series highlights 3/3


Further improvements
Low voltage: 1.8V to 3.6V VDD, down to 1.7*V on most packages
Full duplex I2S peripherals
12-bit ADC: 0.41 µs conversion/2.4Msps (7.2Msps in interleaved mode)
High speed USART up to 10.5Mbits/s
High speed SPI up to 37.5Mbits/s
Camera interface up to 54MBytes/s

*external reset circuitry required to support 1.7V

STM32F4 portfolio

Extensive tools and SW

Evaluation board for full product feature evaluation: STM3240G-EVAL, $349
Hardware evaluation platform for all interfaces
Possible connection to all I/Os and all peripherals

Discovery kit for cost-effective evaluation and prototyping: STM32F4DISCOVERY, $14.90

Starter kits from 3rd parties available soon

Large choice of development IDE solutions from the STM32 and ARM ecosystem

Tools for development SW (examples)

Commercial ones:
IAR eval 32kB/30days for test
[RK-System]
Keil (ARM) eval 32kB for test
[WG Electronics]
Based on GCC commercial:
Atollic Lite (no hex/bin, limited debug), [Kamami]
Raisonance debug limited to 32kB
Rowley Crossworks 30 days for test
Free
STVP FLASH prog.
STLink utility FLASH prog.
(+cmd line)
ST FlashLoader FLASH prog.
Libraries (free)
Standard peripherals library with CMSIS
USB device library


Cortex-M processors binary compatible

Cortex-M feature set comparison

Feature                          Cortex-M0                             Cortex-M3         Cortex-M4
Architecture Version             v6M                                   v7M               v7E-M
Instruction set architecture     Thumb, Thumb-2 System Instructions    Thumb + Thumb-2   Thumb + Thumb-2, DSP, SIMD, FP
DMIPS/MHz                        0.9                                   1.25              1.25
Bus interfaces                   1                                     3                 3
Integrated NVIC                  Yes                                   Yes               Yes
Number interrupts                1-32 + NMI                            1-240 + NMI       1-240 + NMI
Interrupt priorities             4                                     8-256             8-256
Breakpoints, Watchpoints         4/2/0, 2/1/0                          8/4/0, 2/1/0      8/4/0, 2/1/0
Memory Protection Unit (MPU)     No                                    Yes (Option)      Yes (Option)
Integrated trace option (ETM)    No                                    Yes (Option)      Yes (Option)
Fault Robust Interface           No                                    Yes (Option)      No
Single Cycle Multiply            Yes (Option)                          Yes               Yes
Hardware Divide                  No                                    Yes               Yes
WIC Support                      Yes                                   Yes               Yes
Bit banding support              No                                    Yes               Yes
Single cycle DSP/SIMD            No                                    No                Yes
Floating point hardware          No                                    No                Yes
Bus protocol                     AHB Lite                              AHB Lite, APB     AHB Lite, APB
CMSIS Support                    Yes                                   Yes               Yes

ARM CORTEX-M4 PROCESSOR REGISTERS

Cortex-M4 Registers
Processor registers
The internal registers are used to store and process temporary data within the processor core
All registers are inside the processor core, hence they can be accessed quickly
Load-store architecture
To process memory data, it first has to be loaded from memory into registers, processed inside the processor core using register data only, and then written back to memory if needed

Cortex-M4 registers
Register bank
Sixteen 32-bit registers (thirteen are used for general-purpose);
Special registers

Cortex-M4 Registers

Register bank
R0-R7          Low registers (general purpose registers)
R8-R12         High registers (general purpose registers)
R13 (banked)   Stack Pointer (SP): MSP (Main Stack Pointer), PSP (Process Stack Pointer)
R14            Link Register (LR)
R15            Program Counter (PC)

Special registers
xPSR           Program Status Registers (PSR): APSR (Application PSR), EPSR (Execution PSR), IPSR (Interrupt PSR)
PRIMASK, FAULTMASK, BASEPRI    Interrupt mask registers
CONTROL        Stack definition

Cortex-M4 Registers
R0-R12: general purpose registers
Low registers (R0-R7) can be accessed by any instruction
High registers (R8-R12) sometimes cannot be accessed, e.g. by some Thumb (16-bit) instructions

R13: Stack Pointer (SP)
Records the current address of the stack
Used for saving the context of a program while switching between tasks
Cortex-M4 has two SPs: the Main SP, used in applications that require privileged access (e.g. OS kernel and exception handlers), and the Process SP, used in base-level application code (when not running an exception handler)

Program Counter (PC)
Records the address of the current instruction code
Automatically incremented by 4 at each operation (for 32-bit instruction code), except branching operations
A branching operation, such as a function call, changes the PC to a specific address and meanwhile saves the current PC to the Link Register (LR)

[Figure: memory map along the address axis (Low to High) showing Code, Data, Heap and Stack regions, with the PC, the SP, and PUSH/POP data transfers marked]

Cortex-M4 Registers
R14: Link Register (LR)
The LR is used to store the return address of a subroutine or a function call
The program counter (PC) will load the value from LR after a function is finished

Call a subroutine
1. Save current PC to LR
2. Load PC with the starting address of the subroutine

Return from a subroutine to the main program
Load PC with the address in LR to return to the main program

Cortex-M4 Registers
xPSR, combined Program Status Register
Provides information about program execution and ALU flags
Application PSR (APSR)
Interrupt PSR (IPSR)
Execution PSR (EPSR)

Bit layout (bit31 on the left, bit0 on the right):
APSR:   NZCVQ | Reserved
IPSR:   Reserved | ISR number
EPSR:   Reserved | ICI/IT | T | Reserved | ICI/IT | Reserved
xPSR:   NZCVQ | ICI/IT | T | Reserved | ICI/IT | Reserved | ISR number

Cortex-M4 Registers
APSR
N: negative flag, set to one if the result from ALU is negative
Z: zero flag, set to one if the result from ALU is zero
C: carry flag, set to one if an unsigned overflow occurs
V: overflow flag, set to one if a signed overflow occurs
Q: sticky saturation flag, set to one if saturation has occurred in saturating arithmetic instructions, or overflow has occurred in certain multiply instructions

IPSR
ISR number: number of the currently executing interrupt service routine

EPSR
T: Thumb state, always one since Cortex-M4 only supports the Thumb state (more on processor states in the next module)
ICI/IT: Interrupt-Continuable Instruction (ICI) bits, IF-THEN instruction status bits
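The fields above can be pulled out of a raw xPSR value with shifts and masks. The sketch below assumes the ARMv7-M bit positions (N, Z, C, V, Q in bits 31 to 27, T in bit 24, ISR number in bits 8 to 0). The type `xpsr_fields` and the function `decode_xpsr` are hypothetical names used only for illustration, and the ICI/IT and reserved bits are left undecoded.

```c
#include <stdint.h>

typedef struct {
    unsigned n, z, c, v, q;   /* APSR flags                      */
    unsigned t;               /* EPSR Thumb state bit            */
    unsigned isr_number;      /* IPSR: current exception number  */
} xpsr_fields;

/* Splits a combined xPSR value into its APSR, EPSR (T only) and IPSR fields. */
xpsr_fields decode_xpsr(uint32_t xpsr)
{
    xpsr_fields f;
    f.n = (xpsr >> 31) & 1u;
    f.z = (xpsr >> 30) & 1u;
    f.c = (xpsr >> 29) & 1u;
    f.v = (xpsr >> 28) & 1u;
    f.q = (xpsr >> 27) & 1u;
    f.t = (xpsr >> 24) & 1u;
    f.isr_number = xpsr & 0x1FFu;   /* 9-bit exception number */
    return f;
}
```

A value such as 0x61000000 would decode to Z = 1, C = 1, T = 1 and ISR number 0, i.e. code running outside any exception handler after an operation that produced zero with a carry.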

ARM status bits

Every arithmetic, logical, or shifting operation can set CPSR bits:
N (negative), Z (zero), C (carry), V (overflow)
Examples:
-1 + 1 = 0: NZCV = 0110.
2^31-1+1 = -2^31: NZCV = 1001.
Setting status bits must be explicitly enabled on each instruction
ex. adds sets status bits, whereas add does not
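The two NZCV examples above can be reproduced in portable C. This is a minimal model of how a flag-setting 32-bit addition derives its status bits, not a description of the hardware implementation; `adds_flags` is a hypothetical helper that returns the flags packed as a 4-bit value in N, Z, C, V order.

```c
#include <stdint.h>

/* Models the flags produced by a 32-bit flag-setting add (a + b).
   Returns the four flags packed as binary NZCV. */
unsigned adds_flags(uint32_t a, uint32_t b)
{
    uint32_t r = a + b;                       /* wraps modulo 2^32 */
    unsigned n = r >> 31;                     /* result negative as signed */
    unsigned z = (r == 0);                    /* result is zero */
    unsigned c = (r < a);                     /* unsigned overflow (carry out) */
    /* signed overflow: operands share a sign that differs from the result's */
    unsigned v = (~(a ^ b) & (a ^ r)) >> 31;
    return (n << 3) | (z << 2) | (c << 1) | v;
}
```

With this model, `adds_flags(0xFFFFFFFF, 1)` gives binary 0110 (the -1 + 1 case) and `adds_flags(0x7FFFFFFF, 1)` gives binary 1001 (the 2^31-1 + 1 case).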

Cortex-M4 Registers
Interrupt mask registers
1-bit PRIMASK
When set to one, blocks all interrupts apart from the non-maskable interrupt (NMI) and the hard fault exception

1-bit FAULTMASK
When set to one, blocks all interrupts apart from the NMI

BASEPRI (priority level field)
When set to a non-zero priority level, blocks all interrupts of the same or lower priority (only interrupts with higher priorities are allowed); zero disables this masking

CONTROL: special register
1-bit stack definition
Set to one: use the process stack pointer (PSP)
Clear to zero: use the main stack pointer (MSP)
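The combined effect of the three mask registers can be expressed as one decision function. The sketch below is a simplified software model, assuming the ARMv7-M conventions that a lower priority number means a higher priority, that NMI and hard fault have the fixed priorities -2 and -1, and that BASEPRI holds a priority threshold where zero means "no masking". The name `is_masked` and the priority constants are hypothetical.

```c
#define PRIO_NMI        (-2)   /* fixed priority of the non-maskable interrupt */
#define PRIO_HARDFAULT  (-1)   /* fixed priority of the hard fault exception   */

/* Returns 1 if an exception of the given priority is blocked by the
   current mask register settings, 0 if it can be taken. */
int is_masked(int priority, unsigned primask, unsigned faultmask, unsigned basepri)
{
    if (priority == PRIO_NMI)       return 0;  /* NMI is never masked             */
    if (faultmask)                  return 1;  /* blocks everything except NMI    */
    if (priority == PRIO_HARDFAULT) return 0;  /* PRIMASK/BASEPRI do not block it */
    if (primask)                    return 1;  /* blocks all configurable levels  */
    if (basepri != 0 && priority >= (int)basepri)
        return 1;                              /* same or lower priority blocked  */
    return 0;
}
```

For instance, with BASEPRI set to 4, an interrupt at priority 3 still gets through while interrupts at priority 4 or 5 are held off.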

Cortex-M4 Registers

Bit layout (bit31 on the left, bit0 on the right):
PRIMASK:     Reserved | PRIMASK
FAULTMASK:   Reserved | FAULTMASK
BASEPRI:     Reserved | BASEPRI
CONTROL:     Reserved | Stack definition

Cortex-M4 Operation Modes


Cortex M4

DSP features

Cortex-M4 processor architecture

ARMv7E-M Architecture
Thumb-2 Technology
DSP and SIMD extensions
Single cycle MAC (Up to 32 x 32 + 64 -> 64)
Optional single precision FPU
Integrated configurable NVIC
Compatible with Cortex-M3

Microarchitecture
3-stage pipeline with branch speculation
3x AHB-Lite Bus Interfaces

Configurable for ultra low power
Deep Sleep Mode, Wakeup Interrupt Controller
Power down features for Floating Point Unit

Flexible configurations for wider applicability
Configurable Interrupt Controller (1-240 Interrupts and Priorities)
Optional Memory Protection Unit
Optional Debug & Trace

Cortex-M4 overview
Main Cortex-M4 processor features
ARMv7E-M architecture revision
Fully compatible with Cortex-M3 instruction set

Single-cycle multiply-accumulate (MAC) unit
Optimized single instruction multiple data (SIMD) instructions
Saturating arithmetic instructions
Optional single precision Floating-Point Unit (FPU)
Hardware Divide (2-12 Cycles), same as Cortex-M3
Barrel shifter (same as Cortex-M3)

Single-cycle multiply-accumulate unit

The multiplier unit allows any MUL or MAC instruction to be executed in a single cycle
Signed/Unsigned Multiply
Signed/Unsigned Multiply-Accumulate
Signed/Unsigned Multiply-Accumulate Long (64-bit)

Benefits: Speed improvement vs. Cortex-M3
4x for 16-bit MAC (dual 16-bit MAC)
2x for 32-bit MAC
up to 7x for 64-bit MAC

Cortex-M4 extended single cycle MAC

OPERATION                           INSTRUCTIONS                          CM3    CM4
16 x 16 = 32                        SMULBB, SMULBT, SMULTB, SMULTT        n/a    1
16 x 16 + 32 = 32                   SMLABB, SMLABT, SMLATB, SMLATT        n/a    1
16 x 16 + 64 = 64                   SMLALBB, SMLALBT, SMLALTB, SMLALTT    n/a    1
16 x 32 = 32                        SMULWB, SMULWT                        n/a    1
(16 x 32) + 32 = 32                 SMLAWB, SMLAWT                        n/a    1
(16 x 16) ± (16 x 16) = 32          SMUAD, SMUADX, SMUSD, SMUSDX          n/a    1
(16 x 16) ± (16 x 16) + 32 = 32     SMLAD, SMLADX, SMLSD, SMLSDX          n/a    1
(16 x 16) ± (16 x 16) + 64 = 64     SMLALD, SMLALDX, SMLSLD, SMLSLDX      n/a    1
32 x 32 = 32                        MUL                                   1      1
32 ± (32 x 32) = 32                 MLA, MLS                              2      1
32 x 32 = 64                        SMULL, UMULL                          5-7    1
(32 x 32) + 64 = 64                 SMLAL, UMLAL                          5-7    1
(32 x 32) + 32 + 32 = 64            UMAAL                                 n/a    1
32 ± (32 x 32) = 32 (upper)         SMMLA, SMMLAR, SMMLS, SMMLSR          n/a    1
(32 x 32) = 32 (upper)              SMMUL, SMMULR                         n/a    1

All the above operations are single cycle on the Cortex-M4 processor

Saturated arithmetic
Intrinsically prevents overflow of a variable by clipping to min/max boundaries, and removes the CPU burden of software range checks
Benefits
Audio applications
[Figure: a waveform plotted from -1.5 to 1.5, without saturation and with saturation]
Control applications
The PID controller's integral term is continuously accumulated over time. The saturation automatically limits its value and saves several CPU cycles per regulator
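The clipping behaviour can be modelled in portable C. The sketch below shows what a saturating 32-bit add computes, using a wider intermediate type and explicit range checks, which is exactly the software work a saturating instruction removes. The function name `sat_add32` is a hypothetical helper, not a CMSIS or compiler intrinsic.

```c
#include <stdint.h>

/* Adds two signed 32-bit values and clips the result to the
   representable range instead of letting it wrap around. */
int32_t sat_add32(int32_t a, int32_t b)
{
    int64_t wide = (int64_t)a + (int64_t)b;   /* cannot overflow in 64 bits */
    if (wide > INT32_MAX) return INT32_MAX;   /* clip at the upper boundary */
    if (wide < INT32_MIN) return INT32_MIN;   /* clip at the lower boundary */
    return (int32_t)wide;
}
```

An integrator built on this kind of add, such as a PID integral term, stays pinned at the boundary once it saturates rather than flipping sign.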

Single-cycle SIMD instructions

Stands for Single Instruction Multiple Data
It operates with packed data
Allows several operations on 8-bit or 16-bit data to be performed simultaneously
e.g. dual 16-bit MAC (Result = 16x16 + 16x16 + 32)

Benefits
Parallelizes operations (2x to 4x speed gain)
Minimizes the number of Load/Store instructions for exchanges between memory and register file (2 or 4 data transferred at once), if 32-bit is not necessary
Maximizes register file use (1 register holds 2 or 4 values)
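The dual 16-bit MAC mentioned above (Result = 16x16 + 16x16 + 32) can be written out in plain C to show what the single SIMD instruction computes. This is a portable model for illustration only; `pack16` and `dual_mac16` are hypothetical names, and on a Cortex-M4 the same arithmetic would come from one instruction rather than this sequence. The casts to `int16_t` assume the usual two's-complement conversion that mainstream compilers implement.

```c
#include <stdint.h>

/* Packs two signed 16-bit values into one 32-bit word:
   `lo` in bits 15:0, `hi` in bits 31:16. */
uint32_t pack16(int16_t lo, int16_t hi)
{
    return (uint32_t)(uint16_t)lo | ((uint32_t)(uint16_t)hi << 16);
}

/* Dual 16-bit multiply-accumulate on packed operands:
   acc + (x.lo * y.lo) + (x.hi * y.hi) */
int32_t dual_mac16(uint32_t x, uint32_t y, int32_t acc)
{
    int32_t xl = (int16_t)(x & 0xFFFFu), xh = (int16_t)(x >> 16);
    int32_t yl = (int16_t)(y & 0xFFFFu), yh = (int16_t)(y >> 16);
    return acc + xl * yl + xh * yh;
}
```

With x holding (3, -4), y holding (5, 6) and an accumulator of 10, the result is 10 + 15 - 24 = 1, obtained from two multiplies that share one pair of packed registers.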

Packed data types

Byte or halfword quantities packed into words
Allows more efficient access to packed structure types
SIMD instructions can act on packed data
Instructions to extract and pack data
[Figure: Extract produces a zero-extended value (00......00 A) from a packed word; Pack combines such values (00......00 B and A) into one packed word]
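As a small sketch of packing and extracting, the functions below place four 8-bit quantities into one 32-bit word and pull a single lane back out as a zero-extended value. The names `pack_bytes` and `extract_byte` are hypothetical, and the lane numbering (lane 0 in bits 7:0) is an assumption made for this example.

```c
#include <stdint.h>

/* Packs four bytes into one word; b0 lands in bits 7:0, b3 in bits 31:24. */
uint32_t pack_bytes(uint8_t b0, uint8_t b1, uint8_t b2, uint8_t b3)
{
    return (uint32_t)b0 |
           ((uint32_t)b1 << 8) |
           ((uint32_t)b2 << 16) |
           ((uint32_t)b3 << 24);
}

/* Extracts byte lane `lane` (0..3) as a zero-extended value (00......00 A). */
uint32_t extract_byte(uint32_t word, unsigned lane)
{
    return (word >> (8u * lane)) & 0xFFu;
}
```

One load or store of the packed word then moves all four values between memory and the register file at once.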

IIR single cycle MAC benefit

Code                       Cortex-M3      Cortex-M4
                           cycle count    cycle count
xN = *x++;                 2              2
yN = xN * b0;              3-7            1
yN += xNm1 * b1;           3-7            1
yN += xNm2 * b2;           3-7            1
yN -= yNm1 * a1;           3-7            1
yN -= yNm2 * a2;           3-7            1
*y++ = yN;                 2              2
xNm2 = xNm1;               1              1
xNm1 = xN;                 1              1
yNm2 = yNm1;               1              1
yNm1 = yN;                 1              1
Decrement loop counter     1              1
Branch                     2              2

Only looking at the inner loop, making these assumptions
Function operates on a block of samples
Coefficients b0, b1, b2, a1, and a2 are in registers
Previous states, x[n-1], x[n-2], y[n-1], and y[n-2] are in registers

Inner loop on Cortex-M3 takes 27-47 cycles per sample
Inner loop on Cortex-M4 takes 16 cycles per sample

y[n] = b0 x[n] + b1 x[n-1] + b2 x[n-2] - a1 y[n-1] - a2 y[n-2]
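The inner loop above can be wrapped into a complete function for experimentation. The sketch below keeps the same statement order and variable names, but uses `float` samples and a small state structure, both of which are assumptions made here so the example is self-contained; the type `biquad` and the function `biquad_process` are hypothetical.

```c
/* Coefficients and previous states of one second-order IIR section. */
typedef struct {
    float b0, b1, b2, a1, a2;        /* filter coefficients         */
    float xNm1, xNm2, yNm1, yNm2;    /* x[n-1], x[n-2], y[n-1], y[n-2] */
} biquad;

/* Processes a block of samples:
   y[n] = b0 x[n] + b1 x[n-1] + b2 x[n-2] - a1 y[n-1] - a2 y[n-2] */
void biquad_process(biquad *f, const float *x, float *y, int blockSize)
{
    for (int n = 0; n < blockSize; n++) {
        float xN = x[n];
        float yN = xN * f->b0;
        yN += f->xNm1 * f->b1;
        yN += f->xNm2 * f->b2;
        yN -= f->yNm1 * f->a1;
        yN -= f->yNm2 * f->a2;
        y[n] = yN;
        f->xNm2 = f->xNm1;           /* shift the input history  */
        f->xNm1 = xN;
        f->yNm2 = f->yNm1;           /* shift the output history */
        f->yNm1 = yN;
    }
}
```

Feeding an impulse through a section with a1 = a2 = 0 returns b0, b1, b2 in sequence, which is a convenient check that the state updates are in the right order.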

Further optimization strategies


Circular addressing alternatives
Loop unrolling
Caching of intermediate variables
Extensive use of SIMD and intrinsics


FIR Filter Standard C Code

/* state[] is the circular delay line (filtLen entries), defined elsewhere */
void fir(q31_t *in, q31_t *out, q31_t *coeffs, int *stateIndexPtr,
         int filtLen, int blockSize)
{
    int sample; int k; q31_t sum;
    int stateIndex = *stateIndexPtr;
    for(sample=0; sample < blockSize; sample++)
    {
        state[stateIndex] = in[sample]; sum=0;     /* store the newest sample */
        for(k=0;k<filtLen;k++)
        {
            sum += coeffs[k] * state[stateIndex]; stateIndex--;
            if (stateIndex < 0)
            {
                stateIndex = filtLen-1;            /* wrap: circular addressing */
            }
        }
        stateIndex++;                              /* advance to the next write slot */
        if (stateIndex >= filtLen) stateIndex = 0;
        out[sample]=sum;
    }
    *stateIndexPtr = stateIndex;
}

Block based processing
Inner loop consists of:
Dual memory fetches
MAC
Pointer updates with circular addressing
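For readers who want to run the block-based structure described above, here is a self-contained sketch of a circular-buffer FIR. Several things are assumptions made for this example rather than details from the slides: the fixed length `FILT_LEN`, the `fir_state` structure, plain integer arithmetic with a 64-bit accumulator instead of Q31 fixed-point scaling, and an index scheme in which `coeffs[0]` multiplies the newest sample.

```c
#include <stdint.h>

#define FILT_LEN 4                   /* number of taps (hypothetical) */

typedef struct {
    int32_t state[FILT_LEN];         /* circular delay line          */
    int stateIndex;                  /* slot for the next new sample */
} fir_state;

/* Filters blockSize samples; the inner loop performs two memory
   fetches, one MAC, and a pointer update with circular wrap. */
void fir_block(fir_state *s, const int32_t *coeffs,
               const int32_t *in, int32_t *out, int blockSize)
{
    for (int sample = 0; sample < blockSize; sample++) {
        int idx = s->stateIndex;
        int64_t sum = 0;
        s->state[idx] = in[sample];              /* store newest sample */
        for (int k = 0; k < FILT_LEN; k++) {
            sum += (int64_t)coeffs[k] * s->state[idx];
            idx--;
            if (idx < 0) idx = FILT_LEN - 1;     /* circular wrap */
        }
        out[sample] = (int32_t)sum;
        s->stateIndex++;                         /* advance write slot */
        if (s->stateIndex >= FILT_LEN) s->stateIndex = 0;
    }
}
```

Starting from a zeroed state, an impulse input reproduces the coefficients in order at the output, which confirms that the wrap-around indexing visits the samples from newest to oldest.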

FIR Filter DSP Code

32-bit DSP processor assembly code
Only the inner loop is shown, executes in a single cycle
Optimized assembly code, cannot be achieved in C

    lcntr=r2, do FIRLoop until lce;
    FIRLoop: f12=f0*f4, f8=f8+f12, f4=dm(i1,m4), f0=pm(i12,m12);

Zero overhead loop: lcntr=r2, do FIRLoop until lce
Multiply and accumulate previous: f12=f0*f4, f8=f8+f12
Coeff fetch with linear addressing and State fetch with circular addressing: the two memory fetches, f4=dm(i1,m4) and f0=pm(i12,m12)

Cortex-M4 - Final FIR Code

    sample = blockSize/4;
    do
    {
        sum0 = sum1 = sum2 = sum3 = 0;
        statePtr = stateBasePtr;
        coeffPtr = (q31_t *)(S->coeffs);
        x0 = *(q31_t *)(statePtr++);
        x1 = *(q31_t *)(statePtr++);
        i = numTaps>>2;
        do
        {
            c0 = *(coeffPtr++);
            x2 = *(q31_t *)(statePtr++);
            x3 = *(q31_t *)(statePtr++);
            sum0 = __SMLALD(x0, c0, sum0);
            sum1 = __SMLALD(x1, c0, sum1);
            sum2 = __SMLALD(x2, c0, sum2);
            sum3 = __SMLALD(x3, c0, sum3);

            c0 = *(coeffPtr++);
            x0 = *(q31_t *)(statePtr++);
            x1 = *(q31_t *)(statePtr++);
            sum0 = __SMLALD(x0, c0, sum0);
            sum1 = __SMLALD(x1, c0, sum1);
            sum2 = __SMLALD(x2, c0, sum2);
            sum3 = __SMLALD(x3, c0, sum3);
        } while(--i);
        *pDst++ = (q15_t) (sum0>>15);
        *pDst++ = (q15_t) (sum1>>15);
        *pDst++ = (q15_t) (sum2>>15);
        *pDst++ = (q15_t) (sum3>>15);
        stateBasePtr = stateBasePtr + 4;
    } while(--sample);

Uses loop unrolling, SIMD intrinsics, caching of states and coefficients, and works around circular addressing by using a large state buffer.
Inner loop is 26 cycles for a total of 16, 16-bit MACs.
Only 1.625 cycles per filter tap!

Cortex-M4 - FIR performance

DSP assembly code = 1 cycle
Cortex-M4 standard C code takes 12 cycles
Using circular addressing alternative = 8 cycles
After loop unrolling < 6 cycles
After using SIMD instructions < 2.5 cycles
After caching intermediate values ~ 1.6 cycles
Cortex-M4 C code now comparable in performance

Useful Resources
Architecture Reference Manual:
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0403c/index.html

Cortex-M4 Technical Reference Manual:
http://infocenter.arm.com/help/topic/com.arm.doc.ddi0439d/DDI0439D_cortex_m4_processor_r0p1_trm.pdf

Cortex-M4 Devices Generic User Guide:
http://infocenter.arm.com/help/topic/com.arm.doc.dui0553a/DUI0553A_cortex_m4_dgug.pdf