CUDA Fortran for Scientists and Engineers: Best Practices for Efficient CUDA Fortran Programming
About this ebook
CUDA Fortran for Scientists and Engineers shows how high-performance application developers can leverage the power of GPUs using Fortran, the familiar language of scientific computing and supercomputer performance benchmarking. The authors presume no prior parallel computing experience, and cover the basics along with best practices for efficient GPU computing using CUDA Fortran.
To help you add CUDA Fortran to existing Fortran codes, the book explains how to understand the target GPU architecture, identify computationally intensive parts of the code, and modify the code to manage the data and parallelism and optimize performance. All of this is done in Fortran, without having to rewrite in another language. Each concept is illustrated with actual examples so you can immediately evaluate the performance of your code in comparison.
- Leverage the power of GPU computing with PGI’s CUDA Fortran compiler
- Gain insights from members of the CUDA Fortran language development team
- Includes multi-GPU programming in CUDA Fortran, covering both peer-to-peer and message passing interface (MPI) approaches
- Includes full source code for all the examples and several case studies
- Download source code and slides from the book's companion website
Gregory Ruetsch
Greg Ruetsch is a Senior Applied Engineer at NVIDIA, where he works on CUDA Fortran and performance optimization of HPC codes. He holds a Bachelor’s degree in mechanical and aerospace engineering from Rutgers University and a Ph.D. in applied mathematics from Brown University. Prior to joining NVIDIA, he held research positions at Stanford University’s Center for Turbulence Research and Sun Microsystems Laboratories.
CUDA Fortran for Scientists and Engineers
Best Practices for Efficient CUDA Fortran Programming
First Edition
Massimiliano Fatica
Gregory Ruetsch
NVIDIA Corporation, Santa Clara, CA
Table of Contents
Cover image
Title page
Copyright
Dedication
Acknowledgments
Preface
Companion Site
Part I: CUDA Fortran Programming
Chapter 1. Introduction
Abstract
1.1 A brief history of GPU computing
1.2 Parallel computation
1.3 Basic concepts
1.4 Determining CUDA hardware features and limits
1.5 Error handling
1.6 Compiling CUDA Fortran code
Chapter 2. Performance Measurement and Metrics
Abstract
2.1 Measuring kernel execution time
2.2 Instruction, bandwidth, and latency bound kernels
2.3 Memory bandwidth
Chapter 3. Optimization
Abstract
3.1 Transfers between host and device
3.2 Device memory
3.3 On-chip memory
3.4 Memory optimization example: matrix transpose
3.5 Execution configuration
3.6 Instruction optimization
3.7 Kernel loop directives
Chapter 4. Multi-GPU Programming
Abstract
4.1 CUDA multi-GPU features
4.2 Multi-GPU Programming with MPI
Part II: Case Studies
Chapter 5. Monte Carlo Method
Abstract
5.1 CURAND
5.2 Computing π with CUF kernels
5.3 Computing π with reduction kernels
5.4 Accuracy of summation
5.5 Option pricing
Chapter 6. Finite Difference Method
Abstract
6.1 Nine-Point 1D finite difference stencil
6.2 2D Laplace equation
Chapter 7. Applications of Fast Fourier Transform
Abstract
7.1 CUFFT
7.2 Spectral derivatives
7.3 Convolution
7.4 Poisson Solver
Part III: Appendices
Appendix A. Tesla Specifications
Appendix B. System and Environment Management
B.1 Environment variables
B.2 nvidia-smi System Management Interface
Appendix C. Calling CUDA C from CUDA Fortran
C.1 Calling CUDA C libraries
C.2 Calling User-Written CUDA C Code
Appendix D. Source Code
D.1 Texture memory
D.2 Matrix transpose
D.3 Thread- and instruction-level parallelism
D.4 Multi-GPU programming
D.5 Finite difference code
D.6 Spectral Poisson Solver
References
Index
Copyright
Acquiring Editor: Todd Green
Development Editor: Lindsay Lawrence
Project Manager: Punithavathy Govindaradjane
Designer: Matthew Limbert
Morgan Kaufmann is an imprint of Elsevier
225 Wyman Street, Waltham, MA 02451, USA
Copyright © 2014 Gregory Ruetsch/NVIDIA Corporation and Massimiliano Fatica/NVIDIA Corporation. Published by Elsevier Inc. All rights reserved.
No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher’s permissions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions.
This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).
Notices
Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods, professional practices, or medical treatment may become necessary.
Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information, methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility.
To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.
Library of Congress Cataloging-in-Publication Data
Ruetsch, Gregory.
CUDA Fortran for scientists and engineers : best practices for efficient CUDA Fortran programming / Gregory Ruetsch, Massimiliano Fatica.
pages cm
Includes bibliographical references and index.
ISBN 978-0-12-416970-8 (alk. paper)
1. FORTRAN (Computer program language) I. Fatica, Massimiliano. II. Title. III. Title: Best practices for efficient CUDA Fortran programming.
QA76.73.F25R833 2013
005.13’1--dc23
2013022226
British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library
ISBN: 978-0-12-416970-8
Printed and bound in the United States of America
14 15 16 17 18 10 9 8 7 6 5 4 3 2 1
For information on all MK publications visit our website at www.mkp.com
Dedication
To Fortran programmers, who know a good thing when they see it.
Acknowledgments
Writing this book has been an enjoyable and rewarding experience for us, largely due to the interactions with the people who helped shape the book into the form you have before you. There are many people who have helped with this book, both directly and indirectly, and at the risk of leaving someone out we would like to thank the following people for their assistance.
Of course, a book on CUDA Fortran would not be possible without CUDA Fortran itself, and we would like to thank The Portland Group (PGI), especially Brent Leback and Michael Wolfe, for literally giving us something to write about. Working with PGI on CUDA Fortran has been a delightful experience.
The authors often reflect on how computations used in their theses, which required many, many hours on large-vector machines of the day, can now run on an NVIDIA graphics processing unit (GPU) in less time than it takes to get a cup of coffee. We would like to thank those at NVIDIA who helped enable this technological breakthrough. We would like to thank past and present members of the CUDA software team, especially Philip Cuadra, Mark Hairgrove, Stephen Jones, Tim Murray, and Joel Sherpelz for answering the many questions we asked them.
Much of the material in this book grew out of collaborative efforts in performance-tuning applications. We would like to thank our collaborators in such efforts, including Norbert Juffa, Patrick Legresley, Paulius Micikevicius, and Everett Phillips.
Many people reviewed the manuscript for this book at various stages in its development, and we would like to thank Roberto Gomperts, Mark Harris, Norbert Juffa, Brent Leback, and Everett Phillips for their comments and suggestions.
We would like to thank Ian Buck for allowing us to spend time at work on this endeavor, and we would like to thank our families for their understanding while we also worked at home.
Finally, we would like to thank all of our teachers. They enabled us to write this book, and we hope in some way that by doing so, we have continued the chain of helping others.
Preface
This document is intended for scientists and engineers who develop or maintain computer simulations and applications in Fortran and who would like to harness the parallel processing power of graphics processing units (GPUs) to accelerate their code. The goal here is to provide the reader with the fundamentals of GPU programming using CUDA Fortran as well as some typical examples, without having the task of developing CUDA Fortran code become an end in itself.
The CUDA architecture was developed by NVIDIA to allow use of the GPU for general-purpose computing without requiring the programmer to have a background in graphics. There are many ways to access the CUDA architecture from a programmer’s perspective, including through C/C++ using CUDA C or through Fortran using The Portland Group’s (PGI’s) CUDA Fortran. This document pertains to the latter approach. PGI’s CUDA Fortran should be distinguished from the PGI Accelerator and OpenACC Fortran interfaces to the CUDA architecture, which are directive-based approaches to using the GPU. CUDA Fortran is simply the Fortran analog to CUDA C.
The reader of this book should be familiar with Fortran 90 concepts, such as modules, derived types, and array operations. For those familiar with earlier versions of Fortran but looking to upgrade to a more recent version, there are several excellent books that cover this material (e.g., Metcalf, 2011). Some features introduced in Fortran 2003 are used in this book, but these concepts are explained in detail. Although this book does assume some familiarity with Fortran 90, no experience with parallel programming (on the GPU or otherwise) is required. Part of the appeal of parallel programming on GPUs using CUDA is that the programming model is simple and novices can get parallel code up and running very quickly.
Often one comes to CUDA Fortran with the goal of porting existing, sometimes rather lengthy, Fortran code to code that leverages the GPU. Because CUDA is a hybrid programming model, where both GPU and CPU are utilized, CPU code can be incrementally ported to the GPU. CUDA Fortran is also used by those porting applications to GPUs mainly using the directive-based OpenACC approach, but who want to improve the performance of a few critical sections of code by hand-coding CUDA Fortran. Both OpenACC and CUDA Fortran can coexist in the same code.
This book is divided into two main parts. The first part is a tutorial on CUDA Fortran programming, from the basics of writing CUDA Fortran code to some tips on optimization. The second part is a collection of case studies that demonstrate how the principles in the first part are applied to real-world examples.
This book makes use of the PGI 13.x compilers, which can be obtained from http://pgroup.com. Although the examples can be compiled and run on any supported operating system in a variety of development environments, the examples included here are compiled from the command line as one would do under Linux or Mac OS X.
Companion Site
Supplementary materials for readers can be downloaded from Elsevier: http://store.elsevier.com/product.jsp?isbn=9780124169708.
Part I: CUDA Fortran Programming
Outline
Chapter 1 Introduction
Chapter 2 Performance Measurement and Metrics
Chapter 3 Optimization
Chapter 4 Multi-GPU Programming
Chapter 1
Introduction
Abstract
After a short discussion of the history of parallel computation on graphics processing units, or GPUs, this chapter goes through a sequence of simple examples that illustrate the fundamental aspects of computation on GPUs using CUDA Fortran. The hybrid nature of CUDA Fortran programming is illustrated, which contains both host code that is run on the CPU and device code that is executed on the GPU. Ways to determine hardware features and capabilities from within CUDA Fortran code are presented, as are error handling, compilation of CUDA Fortran code, and system management.
Keywords
Data parallelism; Hybrid computation; Host and device code; Kernel; Execution configuration; Compute capability; Error handling; Compilation; Device management
1.1 A brief history of GPU computing
Parallel computing has been around in one form or another for many decades. In the early stages it was generally confined to practitioners who had access to large and expensive machines. Today, things are very different. Almost all consumer desktop and laptop computers have central processing units, or CPUs, with multiple cores. Even most processors in cell phones and tablets have multiple cores. The principal reason for the nearly ubiquitous presence of multiple cores in CPUs is the inability of CPU manufacturers to increase performance in single-core designs by boosting the clock speed. As a result, since about 2005 CPU designs have scaled out to multiple cores rather than scaled up to higher clock rates. Although CPUs are available with a few to tens of cores, this amount of parallelism pales in comparison to the number of cores in a graphics processing unit (GPU). For example, the NVIDIA Tesla® K20X contains 2688 cores. GPUs were highly parallel architectures from their beginning, in the mid-1990s, since graphics processing is an inherently parallel task.
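Core counts and other hardware limits vary considerably from one device to another. As a preview of Section 1.4, such properties can be queried from within CUDA Fortran through the cudafor module; the following is a minimal sketch (the program name and output format are ours, not taken from the book's examples):

```fortran
! Sketch: querying device properties via the cudafor module.
program deviceQuery
  use cudafor
  implicit none
  type(cudaDeviceProp) :: prop
  integer :: istat

  ! Query properties of device 0
  istat = cudaGetDeviceProperties(prop, 0)
  write(*,"('Device name: ',a)") trim(prop%name)
  write(*,"('Multiprocessors: ',i0)") prop%multiProcessorCount
  write(*,"('Compute capability: ',i0,'.',i0)") prop%major, prop%minor
end program deviceQuery
```

The number of cores per multiprocessor depends on the compute capability, so the total core count is derived from these fields rather than reported directly.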
The use of GPUs for general-purpose computing, often referred to as GPGPU, was initially a challenging endeavor. One had to program to the graphics application programming interface (API), which proved to be very restrictive in the types of algorithms that could be mapped to the GPU. Even when such a mapping was possible, the programming required to make this happen was difficult and not intuitive for scientists and engineers outside the computer graphics vocation. As such, adoption of the GPU for scientific and engineering computations was slow.
Things changed for GPU computing with the advent of NVIDIA’s CUDA® architecture in 2007. The CUDA architecture included both hardware components on NVIDIA’s GPU and a software programming environment that eliminated the barriers to adoption that plagued GPGPU. Since CUDA’s first appearance in 2007, its adoption has been tremendous, to the point where, in November 2010, three of the top five supercomputers in the Top 500 list used GPUs. In the November 2012 Top 500 list, the fastest computer in the world was also GPU-powered. One of the reasons for this very fast adoption of CUDA is that the programming model was very simple. CUDA C, the first interface to the CUDA architecture, is essentially C with a few extensions that can offload portions of an algorithm to run on the GPU. It is a hybrid approach where both CPU and GPU are used, so porting computations to the GPU can be performed incrementally.
In late 2009, a joint effort between The Portland Group® (PGI®) and NVIDIA led to the CUDA Fortran compiler. Just as CUDA C is C with extensions, CUDA Fortran is essentially Fortran 90 with a few extensions that allow users to leverage the power of GPUs in their computations. Many books, articles, and other documents have been written to aid in the development of efficient CUDA C applications (e.g., Sanders and Kandrot, 2011; Kirk and Hwu, 2012; Wilt, 2013). Because it is newer, CUDA Fortran has relatively fewer aids for code development. Much of the material for writing efficient CUDA C translates easily to CUDA Fortran, since the underlying architecture is the same, but there is still a need for material that addresses how to write efficient code in CUDA Fortran. There are a couple of reasons for this. First, though CUDA C and CUDA Fortran are similar, there are some differences that will affect how code is written. This is not surprising, since CPU code written in C and Fortran will typically take on a different character as projects grow. Also, there are some features in CUDA C that are not present in CUDA Fortran, such as certain aspects of textures. Conversely, there are some features in CUDA Fortran, such as the device variable attribute used to denote data that resides on the GPU, that are not present in CUDA C.
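As a minimal illustration of the device variable attribute mentioned above, data residing in GPU memory can be declared with a simple attribute and moved between host and device by assignment. This sketch uses array names of our own choosing, not examples from the book:

```fortran
! Sketch: the "device" attribute marks data that resides in GPU memory.
program deviceAttr
  use cudafor
  implicit none
  integer, parameter :: n = 1024
  real :: a(n)             ! host array
  real, device :: a_d(n)   ! device array, resides in GPU memory

  a = 1.0
  a_d = a        ! host-to-device transfer expressed as assignment
  a = 2.0
  a = a_d        ! device-to-host transfer restores the original values
  print *, sum(a)
end program deviceAttr
```

In CUDA C, by contrast, such transfers are written explicitly with cudaMemcpy() calls on raw pointers.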
This book is written for those who want to use parallel computation as a tool in getting other work done rather than as an end in itself. The aim is to give the reader a basic set of skills necessary for them to write reasonably optimized CUDA Fortran code that takes advantage of the NVIDIA® computing hardware. The reason for taking this approach rather than attempting to teach how to extract every last ounce of performance from the hardware is the assumption that those using CUDA Fortran do so as a means rather than an end. Such users typically value clear and maintainable code that is simple to write and performs reasonably well across many generations of CUDA-enabled hardware and CUDA Fortran software.
But where is the line drawn in terms of the effort-performance tradeoff? In the end it is up to the developer to decide how much effort to put into optimizing code. In making this decision, we need to know what type of payoff we can expect when eliminating various bottlenecks and what effort is involved in doing so. One goal of this book is to help the reader develop an intuition needed to make such a return-on-investment assessment. To achieve this end, we discuss bottlenecks encountered in writing common algorithms in science and engineering applications in CUDA Fortran. Multiple workarounds are presented when possible, along with the performance impact of each optimization effort.
1.2 Parallel computation
Before jumping into writing CUDA Fortran code, we should say a few words about where CUDA fits in with other types of parallel programming models. Familiarity with and an understanding of other parallel programming models is not a prerequisite for this book, but for readers who do have some parallel programming experience, this section might be helpful in categorizing CUDA.
We have already mentioned that CUDA is a hybrid computing model, where both the CPU and GPU are used in an application. This is advantageous for development because sections of an existing CPU code can be ported to the GPU incrementally. It is possible to overlap computation on the CPU with computation on the GPU, so this is one aspect of parallelism.
A far greater degree of parallelism occurs within the GPU itself. Subroutines that run on the GPU are executed by many threads in parallel. Although all threads execute the same code, these threads typically operate on different data. This data parallelism is a fine-grained parallelism, where it is most efficient to have adjacent threads operate on adjacent data, such as elements of an array. This model of parallelism is very different from a model like Message Passing Interface, commonly known as MPI, which is a coarse-grained model. In MPI, data are typically divided into large segments or partitions, and each MPI process performs calculations on an entire data partition.
A few characteristics of the CUDA programming model are very different from CPU-based parallel programming models. One difference is that there is very little overhead associated with creating GPU threads. In addition to fast thread creation, context switches, where threads change from active to inactive and vice versa, are very fast for GPU threads compared to CPU threads. The reason context switching is essentially instantaneous on the GPU is that the GPU does not have to store state, as the CPU does when switching threads between being active and inactive. As a result of this fast context switching, it is advantageous to heavily oversubscribe GPU cores—that is, have many more resident threads than GPU cores so that memory latencies can be hidden. It is not uncommon to have the number of resident threads on a GPU an order of magnitude larger than the number of cores on the GPU. In the CUDA programming model, we essentially write a serial code that is executed by many GPU threads in parallel. Each thread executing this code has a means of identifying itself in order to operate on different data, but the code that CUDA threads execute is very similar to what we would write for serial CPU code. On the other hand, the code of many parallel CPU programming models differs greatly from serial CPU code. We will revisit each of
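The ideas above can be sketched in a short, hypothetical CUDA Fortran example: the kernel body reads like serial Fortran, and each thread uses the predefined variables blockIdx, blockDim, and threadIdx to identify itself and operate on its own array element. The kernel and variable names here are illustrative, not from the book:

```fortran
! Sketch: a data-parallel kernel in which each thread handles one element.
module kernels
contains
  attributes(global) subroutine increment(a, b)
    implicit none
    real :: a(*)
    real, value :: b
    integer :: i
    ! Each thread computes its unique global index (1-based in Fortran)
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    a(i) = a(i) + b
  end subroutine increment
end module kernels

program main
  use cudafor
  use kernels
  implicit none
  integer, parameter :: n = 256
  real :: a(n)
  real, device :: a_d(n)

  a = 3.0
  a_d = a
  ! Execution configuration: 4 blocks of 64 threads, 4*64 = n in total
  call increment<<<4, 64>>>(a_d, 1.0)
  a = a_d
  print *, all(a == 4.0)
end program main
```

Aside from the attributes(global) prefix, the thread-index arithmetic, and the <<<...>>> execution configuration at the call site, this is ordinary Fortran 90.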