
Engineering Concurrent Software Guided by Statistical Performance Analysis

Bernd Scheuermann, SAP AG
ParCo 2011

Clemens Grelck (a), Kevin Hammond (b), Heinz Hertlein (c), Philip Hölzenspies (b), Chris Jesshope (a), Raimund Kirner (d), Bernd Scheuermann (e), Alex Shafarenko (d), Iraneus te Boekhorst (d), Volkmar Wieser (f)

a: Institute for Informatics, University of Amsterdam, The Netherlands
b: School of Computer Science, University of St Andrews, United Kingdom
c: BioID GmbH, Nürnberg, Germany
d: School of Computer Science, University of Hertfordshire, United Kingdom
e: SAP AG, SAP Research Center Karlsruhe, Germany
f: Software Competence Center Hagenberg (SCCH), Austria

Challenges in Multicore Software Engineering

Applications
- Increasing degree of complexity
- Increasing demand for performance
- Increasing masses of data

Multicore Software Engineering
- Too many sequential blocks (not easily parallelizable)
- Often non-deterministic performance
- Exploiting parallelism requires adaptation
- Vendor dependence (low portability)
- High demand on technical expertise
- Poor usability of engineering tools
- Lack of qualified multicore developers

Multicore Hardware
- Increasing number of cores
- Increasing heterogeneity
- Main memory access contention

ADVANCE Approach
- Express performance requirements on average, in a probabilistic manner
- Match them with statistical performance analysis at execution time
- Dynamic reconfiguration during execution to match performance requirements
- Support automatic parallelization
- Avoid vendor lock-in
- Support heterogeneous devices, including FPGAs and GPGPUs
- Increase productivity of multicore developers
- Foster industrial adoption of multicore software engineering
- Thorough top-down design
- Provide an integrated tool chain and languages for componentization and coordination
- Separation of concerns supports division of labor: decouple concurrency engineering from algorithm engineering

EU-project ADVANCE: www.project-advance.eu

Concurrent Software Engineering with ADVANCE

Concurrent Software Engineering = Algorithm Engineering + Concurrency Engineering

Algorithm Engineering
- Algorithm design
- Computation
- Functional, stateless programming
- Resource-independent
- Algorithm engineer: expertise in programming
- Tools: C, Single Assignment C (SAC); URL: www.sac-home.org

Concurrency Engineering
- Algorithm as black box
- Communication
- Synchronisation
- Component deployment
- Concurrency engineer: expertise in concurrent system design
- Tools: S-Net; URL: www.snet-home.org

Shared between both roles: interface definition

System Architecture

[Figure: system architecture. Labels: Modeling domain (Concurrency Eng.: S-Net; Algorithm Eng.: SAC, C, ...; Performance Management Environment: CAL), Compilation, Dynamic Compiler Transformations, Execution domain (Resource Mgmt, Virtualization, Monitoring, SVP, multicore system, attached devices), Feedback and analytics domain (Statistical analysis), User interfaces (Console). Roles: concurrency engineer, algorithm engineer, user, admin, device user.]

S-Net
- Declarative language
- Boxes are atomic or terminal symbols
- Inductive modeling of streaming networks

CAL: Constraint Aggregation Language
- Describes functional and extra-functional behavior of S-Net boxes
- Defines cause-and-effect relations between box input and output, defines constraints

SAC: Single Assignment C
- Functional, array-based programming language
- Automatic parallelization
- Any other language may be used; constraint: statelessness

SVP: SANE Virtual Processor
- Developed by the AETHER project
- Targeted by high-level programming languages
- Dynamic creation of tasks and sub-tasks (sequential, parallel or pipelined)
- Static and dynamic mapping of tasks to resources (at compile time or run-time)
- Self-adaptive mapping

Agenda
1. Single Assignment C (SAC)
2. S-Net
3. Statistical Performance Analysis
4. Initial Evaluation Results
5. Conclusion

SAC: Single Assignment C

Partners: Univ. of Amsterdam, Univ. of Hertfordshire

Language characteristics: functional programming
- SaC compiler: sac2c
- Implementations for pthreads, MPI, and Compute Unified Device Architecture (CUDA)

What, not how
- Close to the algorithm
- No resource or cost awareness
- Side-effect free
- No local variable declarations required
- Function overloading

APL-like operations
- Everything is an array
- Generalised selections
- Index-free, combinator-style computations
- Shape-invariant programming (functions can be applied to arbitrarily shaped arguments)

Advantages
- High programming productivity
- Excellent code maintainability
- High code reuse
- High potential for data-parallelism

Platforms supported: Solaris (SPARC, x86), Linux (x86 32-bit, x86 64-bit), DEC (Alpha), Mac OS X (PPC, x86), NetBSD (under preparation).

Example: Shape-Generic Operations
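
The code shown on this slide did not survive extraction. As a rough illustration of what shape-generic programming means, the following Python sketch (not SAC; the helper names `shape` and `add` are made up for this example) defines element-wise addition once and applies it unchanged to a scalar, a vector and a matrix represented as nested lists.

```python
def shape(a):
    """Return the shape of a rectangular nested list; a scalar has shape ()."""
    if isinstance(a, list):
        return (len(a),) + (shape(a[0]) if a else ())
    return ()


def add(a, b):
    """Element-wise addition of two arrays of identical, arbitrary shape."""
    if isinstance(a, list):
        # Recurse one dimension deeper; the rank is never hard-coded.
        return [add(x, y) for x, y in zip(a, b)]
    return a + b


# The same definition serves every rank (hypothetical sample data).
scalar = add(1, 2)
vector = add([1, 2, 3], [10, 20, 30])
matrix = add([[1, 2], [3, 4]], [[10, 20], [30, 40]])
```

The point of the sketch is only that one definition covers all ranks; in SAC this property comes from the language and its array type system rather than from explicit recursion.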

Fundamental idea: formulate algorithms in terms of SPACE rather than TIME.
Example: factorial
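
The factorial code from the slide is not present in the extracted text. A minimal Python sketch of the contrast, under the assumption that "space" means building the whole index vector first and then reducing it, could look as follows; the function names are illustrative only.

```python
import math


def factorial_time(n):
    """Formulation in TIME: n sequential steps, each depending on the previous one."""
    result = 1
    for i in range(1, n + 1):
        result *= i
    return result


def factorial_space(n):
    """Formulation in SPACE: materialise the vector 1..n, then reduce it with a product."""
    values = [i + 1 for i in range(n)]  # the index vector 0..n-1, shifted by one
    return math.prod(values)            # a reduction has no prescribed evaluation order
```

Because the product is a reduction over an array, a compiler is free to evaluate it in parallel, which is the motivation for the space-oriented style.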

Different hardware architectures require different code generation strategies

SaC at a Glance


S-Net: Language Summary


Partners: Univ. of Amsterdam, Univ. of Hertfordshire

- S-Net as a coordination language provides separation of concerns:
  - application logic is programmed in a box language (e.g., SaC or ISO C)
  - concurrency is programmed in S-Net
- Synchronization mechanisms: implicit (via streams) and explicit (via SynchroCell)
- Type system determines the routing of messages (with flow inheritance as additional glue)
- Works on shared-memory and on distributed systems
- S-Net webpage: http://www.snet-home.org

How to exploit parallelism in practice?

- Applications are modelled as streaming networks
- Coordinator (S-Net): allows explicit synchronization, but most synchronization is implicit via streams

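S-Net Boxes

Boxes are implemented in a box language such as SaC or ISO C.

box foo {A,B,<T>} -> {X,<T>}

- Input record {A,B,<T>}: the box consumes fields A and B and tag <T>, and emits {X,<T>}.
- Input record {A,B,C,<T>}: the record carries an extra field C that the box signature does not mention.
- Flow inheritance: the excess field C bypasses the box and is attached to the output, giving {X,C,<T>}.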
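
Flow inheritance can be sketched in a few lines of Python. This is an illustrative model only, not the S-Net runtime: records are plain dictionaries, the helper `apply_box` and the box body `foo` are hypothetical, and letting produced fields win on a name clash is an assumption of this sketch.

```python
def apply_box(box_fn, input_labels, record):
    """Run a box on a record and re-attach the labels the box did not consume."""
    missing = input_labels - set(record)
    if missing:
        raise ValueError(f"record lacks labels: {sorted(missing)}")
    consumed = {k: record[k] for k in input_labels}
    excess = {k: v for k, v in record.items() if k not in input_labels}
    produced = box_fn(consumed)
    # Flow inheritance: excess fields bypass the box and join the output.
    return {**excess, **produced}


def foo(rec):
    """Hypothetical box body with signature {A,B,<T>} -> {X,<T>}."""
    return {"X": rec["A"] + rec["B"], "<T>": rec["<T>"]}


# Input {A,B,C,<T>}: C is not in the signature, so it is inherited by the output.
out = apply_box(foo, {"A", "B", "<T>"}, {"A": 1, "B": 2, "C": "extra", "<T>": 7})
```

Here `out` carries the labels X, C and <T>, matching the {X,C,<T>} record on the slide.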

Network Combinators

Serial combination: net X connect foo1 .. foo2
  box foo1 {A} -> {B}
  box foo2 {B} -> {C}
  Type of X: {A} -> {C}

Parallel combination: net X connect foo1 | foo2
  box foo1 {A} -> {B}
  box foo2 {B} -> {C}
  Type of X: {A} -> {B}, {B} -> {C}
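
The two combinators can be modelled as higher-order functions. The sketch below is a simplification made for illustration: every network maps one input record to exactly one output record, and the parallel combinator picks the first branch whose input labels are all present, whereas the real S-Net type system works on streams and selects the best match. The names `serial`, `parallel`, `foo1` and `foo2` are hypothetical Python helpers, not S-Net API.

```python
def serial(first, second):
    """foo1 .. foo2: the output of the first network feeds the second."""
    return lambda record: second(first(record))


def parallel(*branches):
    """foo1 | foo2: route a record to a branch whose input labels it carries.

    Each branch is a pair (input_labels, network).
    """
    def run(record):
        for labels, net in branches:
            if labels.issubset(record):
                return net(record)
        raise ValueError("no branch accepts this record")
    return run


def foo1(rec):
    """Hypothetical box {A} -> {B}."""
    return {"B": rec["A"] + 1}


def foo2(rec):
    """Hypothetical box {B} -> {C}."""
    return {"C": rec["B"] * 2}


pipeline = serial(foo1, foo2)                  # behaves as {A} -> {C}
choice = parallel(({"A"}, foo1), ({"B"}, foo2))  # behaves as {A} -> {B}, {B} -> {C}
```

Feeding `{"A": 1}` to `pipeline` passes through both boxes, while `choice` sends the same record to `foo1` only.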

Network Combinators

Parallel replication: net X connect foo ! <T>
  box foo {A,<T>} -> {B}
  Records are routed to replicas of foo according to the value of tag <T>.

Serial replication: net X connect foo * {E}
  box foo {A} -> {A}; {E}
  Replicas of foo are chained one after another; a record leaves the chain once it matches the exit pattern {E}.
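
A minimal Python model of the two replication combinators, assuming (as in the other sketches in this document's style) that records are dictionaries and that a network maps one record to one record. The function names and the sample boxes are hypothetical; the real runtime unfolds replicas lazily over streams.

```python
def serial_replication(net, exit_labels):
    """foo * {E}: pass a record through successive copies of net
    until it matches the exit pattern."""
    def run(record):
        while not exit_labels.issubset(record):
            record = net(record)
        return record
    return run


def parallel_replication(make_net, tag):
    """foo ! <T>: keep one replica of the network per observed tag value."""
    replicas = {}

    def run(record):
        key = record[tag]
        if key not in replicas:
            replicas[key] = make_net()  # replica created on first use
        return replicas[key](record)
    return run, replicas


def countdown(rec):
    """Hypothetical box {A} -> {A}; {E}: decrement A, emit E when A reaches 0."""
    return {"A": rec["A"] - 1} if rec["A"] > 0 else {"E": "done"}


def make_scaler():
    """Factory for a hypothetical box {A,<T>} -> {B}."""
    return lambda rec: {"B": rec["A"] * 10}


star = serial_replication(countdown, {"E"})
bang, replicas = parallel_replication(make_scaler, "<T>")
results = [bang({"A": 1, "<T>": 0}), bang({"A": 2, "<T>": 1}), bang({"A": 3, "<T>": 0})]
```

After the three calls, `replicas` holds two entries, one per distinct tag value, which mirrors the fan-out drawn on the slide.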

Explicit Synchronization
- The one-shot synchronization cell: [|{A,B},{C,D}|]
- Pattern for continuous synchronization: [|{A,B}|] * {A,B}

[Figure: a sync cell joining {A,B} and {C,D}, and a sync cell under serial replication * {A,B}.]
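
The behaviour of a one-shot synchronization cell can be approximated as follows. This is a sketch under stated assumptions, not the S-Net implementation: the class name `SyncCell` and its `push` method are invented here, records are dictionaries, and the cell is assumed to act as the identity once it has fired and to pass through records for which no free pattern slot exists.

```python
class SyncCell:
    """One-shot synchronisation cell for two patterns, e.g. [|{A,B},{C,D}|]."""

    def __init__(self, pattern_a, pattern_b):
        self.patterns = [frozenset(pattern_a), frozenset(pattern_b)]
        self.stored = [None, None]
        self.fired = False

    def push(self, record):
        """Return the list of records emitted in response to this input."""
        if self.fired:
            return [record]                  # a spent cell forwards everything
        for i, pattern in enumerate(self.patterns):
            if self.stored[i] is None and pattern.issubset(record):
                self.stored[i] = record      # keep the record until its partner arrives
                break
        else:
            return [record]                  # no free matching slot: pass through
        if all(s is not None for s in self.stored):
            self.fired = True
            merged = {}
            for s in self.stored:
                merged.update(s)
            return [merged]                  # the joined record
        return []


cell = SyncCell({"A", "B"}, {"C", "D"})
first = cell.push({"A": 1, "B": 2})
second = cell.push({"C": 3, "D": 4})
third = cell.push({"A": 9, "B": 9})
```

The first record is held back, the second triggers the merge, and the third passes through unchanged because the cell has already fired.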

S-Net Tool Chain


[Figure: S-Net tool chain. Labels: Statistical Analysis, CAL description, S-Net Compiler, S-Net Runtime, Box Interface, Box Compiler, S-Net Module, Library / Objects, Executable.]


The Problem
[Partner: University of St Andrews]

- Need to find statistically valid cost information
  - For S-Net networks composed from sub-networks
  - For alternative hardware configurations
  - Based on sample execution data
- Use this to predict future behaviours
  - For alternative combinations of sub-networks
  - And/or different hardware combinations
- Integrate this with S-Net and SVP
  - Predict costs for virtual hardware configurations

Analysis Approach
- Build a type-based cost model for S-Net networks
- Determine probability distribution functions (pdfs)
  - For the key metrics latency, jitter and throughput
  - For sub-networks and networks
  - For different abstract hardware
- Combine them using a type-based analysis
  - Based on the cost model
  - Exploiting well-known type-and-effect technology
- Use the combined pdfs to predict future costs

[Figure: analysis workflow. Measure: histograms for Box A on hardware H1 (A/H1), Box A on hardware H2 (A/H2) and Box B on hardware H1 (B/H1). Obtain probability density function (pdf): p(A/H1), p(B/H1). Combine pdfs: p(B/H1 || A/H1), p(B/H1 || A/H2).]

Measurements
- Obtained from execution on hardware platforms
- Key metrics: latency, throughput and jitter
- Associated with networks/boxes
- Depending on
  - physical hardware
  - key execution parameters
- Aggregated to give probability distributions
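
As a simplified illustration of the measure, aggregate and combine steps, the Python sketch below turns latency samples into an empirical probability distribution and combines two boxes in series by discrete convolution. Two things are assumptions of this sketch and not taken from the slides: the sample values are invented, and the convolution presumes that the two latencies are independent, which is exactly the restriction that the copula-based approach of the project is meant to lift.

```python
from collections import Counter


def empirical_pmf(samples):
    """Aggregate integer-valued measurements into a probability mass function."""
    counts = Counter(samples)
    total = len(samples)
    return {value: count / total for value, count in sorted(counts.items())}


def serial_latency_pmf(pmf_a, pmf_b):
    """Distribution of the summed latency of two boxes in series,
    assuming independence (discrete convolution)."""
    out = {}
    for a, pa in pmf_a.items():
        for b, pb in pmf_b.items():
            out[a + b] = out.get(a + b, 0.0) + pa * pb
    return out


# Hypothetical latency samples in milliseconds for two boxes on one platform.
pmf_a = empirical_pmf([2, 2, 3, 3])
pmf_b = empirical_pmf([1, 1, 1, 2])
combined = serial_latency_pmf(pmf_a, pmf_b)
```

The combined distribution can then be queried for predictions such as the probability that the pipeline latency stays below a given bound.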


Copulas [Nelsen 1999]


- Need to combine two marginal distributions F and G
- Joint distribution is: H = P( Z = f(x,y) )
- H can be built using a copula C: H(x, y) = C(F(x), G(y))
- For given F, G and H there is always a copula C; it is unique when F and G are continuous
- Copulas capture all kinds of dependence relations, including independence and partial dependence
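
To make the role of the copula concrete, the following Python sketch builds a joint distribution function from two marginals via Sklar's construction H(x, y) = C(F(x), G(y)) for three textbook copulas: independence and the two Fréchet-Hoeffding bounds. The uniform marginals and all function names are chosen purely for illustration and say nothing about which copula the project actually uses.

```python
def independence_copula(u, v):
    """C(u, v) = u * v: the two variables are independent."""
    return u * v


def comonotonic_copula(u, v):
    """Upper Frechet-Hoeffding bound: perfect positive dependence."""
    return min(u, v)


def countermonotonic_copula(u, v):
    """Lower Frechet-Hoeffding bound: perfect negative dependence."""
    return max(u + v - 1.0, 0.0)


def joint_cdf(copula, F, G):
    """Sklar's construction: H(x, y) = C(F(x), G(y))."""
    return lambda x, y: copula(F(x), G(y))


def uniform_cdf(low, high):
    """Marginal CDF of a uniform distribution on [low, high]."""
    def cdf(x):
        if x <= low:
            return 0.0
        if x >= high:
            return 1.0
        return (x - low) / (high - low)
    return cdf


F = uniform_cdf(0.0, 10.0)   # hypothetical marginal of the first metric
G = uniform_cdf(0.0, 20.0)   # hypothetical marginal of the second metric
h_indep = joint_cdf(independence_copula, F, G)(5.0, 5.0)
h_upper = joint_cdf(comonotonic_copula, F, G)(5.0, 5.0)
h_lower = joint_cdf(countermonotonic_copula, F, G)(5.0, 5.0)
```

With identical marginals, the joint probability at the same point differs substantially between the three copulas, which is why the choice of copula matters.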

Some Caveats [Mikosch 2006]


- Copulas must be chosen properly
  - Copulas model specific relationships between distributions
  - Don't choose a standard copula carelessly
- Copulas predict the future based on past results
  - So rare outcomes may not be predicted
- Copulas can be expensive to calculate
  - But may be amenable to better programming technology
- Poor use of copulas affected major financial decisions
  - Sub-prime mortgages
  - Complex financial instruments

Why Our Situation is Different


- Worst-case result is different
  - No high-stakes irrevocable investments
  - Failure just causes bad performance
- Fall-back solutions to be investigated
  - Persistent failure: fall back to a simpler model
- We can dynamically improve our cost model
  - Run-time verification/recalibration
  - Machine learning technology

Comparison with Bernat's Work

[G. Bernat, A. Burns, and M. Newby. Probabilistic timing analysis: An approach using copulas. J. Embedded Computing, 1(2):179-194, 2005.]

- Bernat deals with worst-case execution times
  - We deal with both average case and worst case
- Bernat deals with sequential code only
  - We deal with parallel code, including all S-Net operations
- The structured nature of S-Net allows us to deal with pdfs at multiple levels
  - Boxes, sub-networks and networks can all have pdfs


Applications
Overview
- Interventional X-Ray Processing
- Biometric Identification
- Supply Chain Management
- Quality Inspection of Textured Surfaces

Biometric Test and Optimization Framework by BioID


- Test protocol, implemented in SAC, to compute the recognition accuracy of speaker classification
- Experimental result: computation time to compute a single identification rate
- Using the SAC feature -mt to automatically set the thread count

Speed-up

Number of threads          2      3      6      12     48
Intel system (6 cores)     1.99   2.91   5.66   5.97*  -
AMD system (48 cores)      1.98   2.91   5.55   10.20  25.24

* Intel system with 6 cores and Hyper-Threading

Conclusion: For the off-the-shelf Intel system, the SAC compiler achieves near-optimal speed-up for a real-world problem with an automatic parallelization of the identification classifier calls.
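
One way to quantify "near-optimal" is parallel efficiency, i.e. speed-up divided by the number of threads. The short Python sketch below applies this to the speed-up figures on this slide; the assignment of each figure to a system and thread count is a reading of the flattened slide layout and should be treated as an assumption.

```python
def efficiency(speedup, threads):
    """Parallel efficiency: fraction of the ideal linear speed-up achieved."""
    return speedup / threads


# Speed-up per thread count, as read from the slide (assignment assumed).
intel = {2: 1.99, 3: 2.91, 6: 5.66, 12: 5.97}
amd = {2: 1.98, 3: 2.91, 6: 5.55, 12: 10.20, 48: 25.24}

intel_eff = {t: efficiency(s, t) for t, s in intel.items()}
amd_eff = {t: efficiency(s, t) for t, s in amd.items()}
```

Under this reading, the Intel system reaches roughly 94% efficiency when all 6 physical cores are used, while the 12-thread Hyper-Threading run adds little beyond that.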

High-Performance Inline Quality Inspection of Textured Surfaces for Manufacturing Industry, SCCH
Problem Description
In the field of quality inspection of textured surfaces, e.g., for the production of foils or industrial woven fabrics, we have to cope with a high scanning speed of up to 300 m/min, i.e. about 80 MB/s per camera system, and a complex phenomenology of textures and defects. This requires advanced, cost-intensive image-processing and machine-learning algorithms, high-performance computational hardware such as GPUs or multi-core systems, and the exploitation of parallelization potential. Analysing the performance of the whole processing pipeline (image acquisition, preprocessing, feature extraction, registration, defect detection and classification) with standard languages is a resource- and time-intensive challenge.

Processing Pipeline, SCCH


hpFilter, SCCH
Benchmarking setup
CPU: DELL Precision 690, 8 cores, 3.2 GHz

[Figure: Benchmarking CPU. Execution time in seconds (axis 0 to 50) for image sizes 1024x1024, 2048x2048 and 4096x4096 px; series: OpenCV, SAC-SEQ, SAC-MT.]

OpenCV: http://opencv.willowgarage.com/wiki/

hpFilter, SCCH
Benchmarking setup
GPU: NVIDIA GeForce GTX 480 Ultra, 480 cores, 1.4 GHz

[Figure: Benchmarking GPU. Execution time in seconds (axis 0 to 1) by data dimension 1024x1024, 2048x2048 and 4096x4096 px; series: SAC-CUDA, CUDA-manual.]

hpFilter, SCCH
Benchmarking setup
UNIBENCH, 1-48 cores, 2.2 GHz

[Figure: Benchmarking Scalability. Speedup (axis 0 to 45) by number of cores (2, 8, 24, 48) for image size 2048 x 2048 px.]


Conclusion and Outlook


- Presented the ADVANCE approach to concurrent multicore software engineering
- Will enable auto-parallelization and dynamic reconfiguration driven by a statistical cost model and performance analysis
- Created the Constraint Aggregation Language (CAL) as an interface to specify (extra-)functional behaviour of boxes and subsystems
- Maintaining developers' productivity and vendor independence fosters industrial adoption of multicore programming
- Scalability tests with auto-parallelization through SAC showed encouraging results
- Initial experimental evaluation with SAC and S-Net prototypes indicates competitive performance for execution on CPU
- The manual CUDA implementation (hpFilter) on GPU still outperforms the SAC implementation for CUDA
- Next steps include (amongst others):
  - Extend the statistical model to streams: combinators transform hidden Markov models of the input stream
  - Build in a test platform and statistically profile a range of applications

THANK YOU!


The information in this document is proprietary to the following ADVANCE StatArch consortium members funded by means of European Union within the 7th Framework Program: University of Hertfordshire (HERTS), University of St Andrews (USTAN), University of Twente (TWENTE), Universiteit van Amsterdam (UvA), Philips Medical Systems Nederland BV (PHILIPS), Technion: Israeli Institute of Technology (TECHNION), University of California at Irvine (UCI), SAP AG (SAP), BioID GmbH (BioID) and Software Competence Center Hagenberg (SCCH). The information in this document is provided "as is", and no guarantee or warranty is given that the information is fit for any particular purpose. The above referenced consortium members shall have no liability for damages of any kind including without limitation direct, special, indirect, or consequential damages that may result from the use of these materials subject to any liability which is mandatory due to applicable law. Copyright 2011 by HERTS, USTAN, TWENTE, UvA, PHILIPS, TECHNION, UCI, SAP, BioID and SCCH. All rights reserved.
