CUDA Fortran for Scientists and Engineers: Best Practices for Efficient CUDA Fortran Programming
Ebook, 486 pages


About this ebook

CUDA Fortran for Scientists and Engineers shows how high-performance application developers can leverage the power of GPUs using Fortran, the familiar language of scientific computing and supercomputer performance benchmarking. The authors presume no prior parallel computing experience, and cover the basics along with best practices for efficient GPU computing using CUDA Fortran.

To help you add CUDA Fortran to existing Fortran codes, the book explains how to understand the target GPU architecture, identify computationally intensive parts of the code, and modify the code to manage the data and parallelism and optimize performance. All of this is done in Fortran, without having to rewrite in another language. Each concept is illustrated with actual examples so you can immediately evaluate the performance of your code in comparison.

  • Leverage the power of GPU computing with PGI’s CUDA Fortran compiler
  • Gain insights from members of the CUDA Fortran language development team
  • Includes multi-GPU programming in CUDA Fortran, covering both peer-to-peer and message passing interface (MPI) approaches
  • Includes full source code for all the examples and several case studies
  • Download source code and slides from the book's companion website
Language: English
Release date: Sep 11, 2013
ISBN: 9780124169722
Author

Gregory Ruetsch

Greg Ruetsch is a Senior Applied Engineer at NVIDIA, where he works on CUDA Fortran and performance optimization of HPC codes. He holds a Bachelor’s degree in mechanical and aerospace engineering from Rutgers University and a Ph.D. in applied mathematics from Brown University. Prior to joining NVIDIA, he held research positions at Stanford University’s Center for Turbulence Research and Sun Microsystems Laboratories.

    Book preview


    CUDA Fortran for Scientists and Engineers

    Best Practices for Efficient CUDA Fortran Programming

    First Edition

    Massimiliano Fatica

    Gregory Ruetsch

    NVIDIA Corporation, Santa Clara, CA

    Table of Contents

    Cover image

    Title page

    Copyright

    Dedication

    Acknowledgments

    Preface

    Companion Site

    Part I: CUDA Fortran Programming

    Chapter 1. Introduction

    Abstract

    1.1 A brief history of GPU computing

    1.2 Parallel computation

    1.3 Basic concepts

    1.4 Determining CUDA hardware features and limits

    1.5 Error handling

    1.6 Compiling CUDA Fortran code

    Chapter 2. Performance Measurement and Metrics

    Abstract

    2.1 Measuring kernel execution time

    2.2 Instruction, bandwidth, and latency bound kernels

    2.3 Memory bandwidth

    Chapter 3. Optimization

    Abstract

    3.1 Transfers between host and device

    3.2 Device memory

    3.3 On-chip memory

    3.4 Memory optimization example: matrix transpose

    3.5 Execution configuration

    3.6 Instruction optimization

    3.7 Kernel loop directives

    Chapter 4. Multi-GPU Programming

    Abstract

    4.1 CUDA multi-GPU features

    4.2 Multi-GPU Programming with MPI

    Part II: Case Studies

    Chapter 5. Monte Carlo Method

    Abstract

    5.1 CURAND

    5.2 Computing π with CUF kernels

    5.3 Computing π with reduction kernels

    5.4 Accuracy of summation

    5.5 Option pricing

    Chapter 6. Finite Difference Method

    Abstract

    6.1 Nine-Point 1D finite difference stencil

    6.2 2D Laplace equation

    Chapter 7. Applications of Fast Fourier Transform

    Abstract

    7.1 CUFFT

    7.2 Spectral derivatives

    7.3 Convolution

    7.4 Poisson Solver

    Part III: Appendices

    Appendix A. Tesla Specifications

    Appendix B. System and Environment Management

    B.1 Environment variables

    B.2 nvidia-smi System Management Interface

    Appendix C. Calling CUDA C from CUDA Fortran

    C.1 Calling CUDA C libraries

    C.2 Calling User-Written CUDA C Code

    Appendix D. Source Code

    D.1 Texture memory

    D.2 Matrix transpose

    D.3 Thread- and instruction-level parallelism

    D.4 Multi-GPU programming

    D.5 Finite difference code

    D.6 Spectral Poisson Solver

    References

    Index

    Copyright

    Acquiring Editor: Todd Green

    Development Editor: Lindsay Lawrence

    Project Manager: Punithavathy Govindaradjane

    Designer: Matthew Limbert

    Morgan Kaufmann is an imprint of Elsevier

    225 Wyman Street, Waltham, MA 02451, USA

    Copyright © 2014 Gregory Ruetsch/NVIDIA Corporation and Massimiliano Fatica/NVIDIA Corporation. Published by Elsevier Inc. All rights reserved.

    No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher’s permissions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions.

    This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).

    Notices

    Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods, professional practices, or medical treatment may become necessary.

    Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information, methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility.

    To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.

    Library of Congress Cataloging-in-Publication Data

    Ruetsch, Gregory.

     CUDA Fortran for scientists and engineers : best practices for efficient CUDA Fortran programming / Gregory Ruetsch, Massimiliano Fatica.

       pages cm

     Includes bibliographical references and index.

     ISBN 978-0-12-416970-8 (alk. paper)

     1. FORTRAN (Computer program language) I. Fatica, Massimiliano. II. Title. III. Title: Best practices for efficient CUDA Fortran programming.

     QA76.73.F25R833 2013

     005.13’1--dc23

            2013022226

    British Library Cataloguing-in-Publication Data

    A catalogue record for this book is available from the British Library

    ISBN: 978-0-12-416970-8

    Printed and bound in the United States of America

    14 15 16 17 18 10 9 8 7 6 5 4 3 2 1

    For information on all MK publications visit our website at www.mkp.com

    Dedication

    To Fortran programmers, who know a good thing when they see it.

    Acknowledgments

    Writing this book has been an enjoyable and rewarding experience for us, largely due to the interactions with the people who helped shape the book into the form you have before you. There are many people who have helped with this book, both directly and indirectly, and at the risk of leaving someone out we would like to thank the following people for their assistance.

    Of course, a book on CUDA Fortran would not be possible without CUDA Fortran itself, and we would like to thank The Portland Group (PGI), especially Brent Leback and Michael Wolfe, for literally giving us something to write about. Working with PGI on CUDA Fortran has been a delightful experience.

    The authors often reflect on how computations used in their theses, which required many, many hours on large-vector machines of the day, can now run on an NVIDIA graphics processing unit (GPU) in less time than it takes to get a cup of coffee. We would like to thank those at NVIDIA who helped enable this technological breakthrough. We would like to thank past and present members of the CUDA software team, especially Philip Cuadra, Mark Hairgrove, Stephen Jones, Tim Murray, and Joel Sherpelz for answering the many questions we asked them.

    Much of the material in this book grew out of collaborative efforts in performance-tuning applications. We would like to thank our collaborators in such efforts, including Norbert Juffa, Patrick Legresley, Paulius Micikevicius, and Everett Phillips.

    Many people reviewed the manuscript for this book at various stages in its development, and we would like to thank Roberto Gomperts, Mark Harris, Norbert Juffa, Brent Leback, and Everett Phillips for their comments and suggestions.

    We would like to thank Ian Buck for allowing us to spend time at work on this endeavor, and we would like to thank our families for their understanding while we also worked at home.

    Finally, we would like to thank all of our teachers. They enabled us to write this book, and we hope in some way that by doing so, we have continued the chain of helping others.

    Preface

    This document is intended for scientists and engineers who develop or maintain computer simulations and applications in Fortran and who would like to harness the parallel processing power of graphics processing units (GPUs) to accelerate their code. The goal here is to provide the reader with the fundamentals of GPU programming using CUDA Fortran as well as some typical examples, without having the task of developing CUDA Fortran code become an end in itself.

    The CUDA architecture was developed by NVIDIA to allow use of the GPU for general-purpose computing without requiring the programmer to have a background in graphics. There are many ways to access the CUDA architecture from a programmer’s perspective, including through C/C++ from CUDA C or through Fortran using The Portland Group’s (PGI’s) CUDA Fortran. This document pertains to the latter approach. PGI’s CUDA Fortran should be distinguished from the PGI Accelerator and OpenACC Fortran interfaces to the CUDA architecture, which are directive-based approaches to using the GPU. CUDA Fortran is simply the Fortran analog to CUDA C.

    The reader of this book should be familiar with Fortran 90 concepts, such as modules, derived types, and array operations. For those familiar with earlier versions of Fortran but looking to upgrade to a more recent version, there are several excellent books that cover this material (e.g., Metcalf, 2011). Some features introduced in Fortran 2003 are used in this book, but these concepts are explained in detail. Although this book does assume some familiarity with Fortran 90, no experience with parallel programming (on the GPU or otherwise) is required. Part of the appeal of parallel programming on GPUs using CUDA is that the programming model is simple and novices can get parallel code up and running very quickly.

    Often one comes to CUDA Fortran with the goal of porting existing, sometimes rather lengthy, Fortran code to code that leverages the GPU. Because CUDA is a hybrid programming model, where both GPU and CPU are utilized, CPU code can be incrementally ported to the GPU. CUDA Fortran is also used by those porting applications to GPUs mainly using the directive-based OpenACC approach, but who want to improve the performance of a few critical sections of code by hand-coding CUDA Fortran. Both OpenACC and CUDA Fortran can coexist in the same code.

    This book is divided into two main parts. The first part is a tutorial on CUDA Fortran programming, from the basics of writing CUDA Fortran code to some tips on optimization. The second part is a collection of case studies that demonstrate how the principles in the first part are applied to real-world examples.

    This book makes use of the PGI 13.x compilers, which can be obtained from http://pgroup.com. Although the examples can be compiled and run on any supported operating system in a variety of development environments, the examples included here are compiled from the command line as one would do under Linux or Mac OS X.

    Companion Site

    Supplementary materials for readers can be downloaded from Elsevier: http://store.elsevier.com/product.jsp?isbn=9780124169708.

    Part I: CUDA Fortran Programming

    Outline

    Chapter 1 Introduction

    Chapter 2 Performance Measurement and Metrics

    Chapter 3 Optimization

    Chapter 4 Multi-GPU Programming

    Chapter 1

    Introduction

    Abstract

    After a short discussion of the history of parallel computation on graphics processing units, or GPUs, this chapter goes through a sequence of simple examples that illustrate the fundamental aspects of computation on GPUs using CUDA Fortran. The hybrid nature of CUDA Fortran programming is illustrated, which contains both host code that is run on the CPU and device code that is executed on the GPU. Ways to determine hardware features and capabilities from within CUDA Fortran code are presented, as are error handling, compilation of CUDA Fortran code, and system management.

    Keywords

    Data parallelism; Hybrid computation; Host and device code; Kernel; Execution configuration; Compute capability; Error handling; Compilation; Device management

    1.1 A brief history of GPU computing

    Parallel computing has been around in one form or another for many decades. In the early stages it was generally confined to practitioners who had access to large and expensive machines. Today, things are very different. Almost all consumer desktop and laptop computers have central processing units, or CPUs, with multiple cores. Even most processors in cell phones and tablets have multiple cores. The principal reason for the nearly ubiquitous presence of multiple cores in CPUs is the inability of CPU manufacturers to increase performance in single-core designs by boosting the clock speed. As a result, since about 2005 CPU designs have scaled out to multiple cores rather than scaled up to higher clock rates. Although CPUs are available with a few to tens of cores, this amount of parallelism pales in comparison to the number of cores in a graphics processing unit (GPU). For example, the NVIDIA Tesla® K20X contains 2688 cores. GPUs were highly parallel architectures from their beginning, in the mid-1990s, since graphics processing is an inherently parallel task.
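
    Core counts such as these vary from device to device, and, as Section 1.4 describes, hardware characteristics can be queried at runtime from CUDA Fortran. The following minimal sketch (our own illustration, not one of the book's examples) uses cudaGetDeviceProperties from the cudafor module to report a few properties of device 0:

        program device_query_sketch
          use cudafor
          implicit none
          type (cudaDeviceProp) :: prop
          integer :: istat
          ! query the properties of device 0
          istat = cudaGetDeviceProperties(prop, 0)
          write(*,"(' Device name: ', a)") trim(prop%name)
          write(*,"(' Multiprocessors: ', i0)") prop%multiProcessorCount
          write(*,"(' Compute capability: ', i0, '.', i0)") prop%major, prop%minor
        end program device_query_sketch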

    The use of GPUs for general-purpose computing, often referred to as GPGPU, was initially a challenging endeavor. One had to program to the graphics application programming interface (API), which proved to be very restrictive in the types of algorithms that could be mapped to the GPU. Even when such a mapping was possible, the programming required to make this happen was difficult and not intuitive for scientists and engineers outside the computer graphics vocation. As such, adoption of the GPU for scientific and engineering computations was slow.

    Things changed for GPU computing with the advent of NVIDIA’s CUDA® architecture in 2007. The CUDA architecture included both hardware components on NVIDIA’s GPU and a software programming environment that eliminated the barriers to adoption that plagued GPGPU. Since CUDA’s first appearance in 2007, its adoption has been tremendous, to the point where, in November 2010, three of the top five supercomputers in the Top 500 list used GPUs. In the November 2012 Top 500 list, the fastest computer in the world was also GPU-powered. One of the reasons for this very fast adoption of CUDA is that the programming model was very simple. CUDA C, the first interface to the CUDA architecture, is essentially C with a few extensions that can offload portions of an algorithm to run on the GPU. It is a hybrid approach where both CPU and GPU are used, so porting computations to the GPU can be performed incrementally.

    In late 2009, a joint effort between The Portland Group® (PGI®) and NVIDIA led to the CUDA Fortran compiler. Just as CUDA C is C with extensions, CUDA Fortran is essentially Fortran 90 with a few extensions that allow users to leverage the power of GPUs in their computations. Many books, articles, and other documents have been written to aid in the development of efficient CUDA C applications (e.g., Sanders and Kandrot, 2011; Kirk and Hwu, 2012; Wilt, 2013). Because it is newer, CUDA Fortran has relatively fewer aids for code development. Much of the material for writing efficient CUDA C translates easily to CUDA Fortran, since the underlying architecture is the same, but there is still a need for material that addresses how to write efficient code in CUDA Fortran. There are a couple of reasons for this. First, though CUDA C and CUDA Fortran are similar, there are some differences that will affect how code is written. This is not surprising, since CPU code written in C and Fortran will typically take on a different character as projects grow. Also, there are some features in CUDA C that are not present in CUDA Fortran, such as certain aspects of textures. Conversely, there are some features in CUDA Fortran, such as the device variable attribute used to denote data that resides on the GPU, that are not present in CUDA C.
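
    As a brief illustration of the device variable attribute mentioned above, the following minimal sketch (our own, not one of the book's examples; the array names a and a_d are placeholders) declares an array that resides in GPU memory and moves data between host and device with simple assignment statements:

        program device_attribute_sketch
          use cudafor
          implicit none
          integer, parameter :: n = 1024
          real :: a(n)            ! host array
          real, device :: a_d(n)  ! device array, resides in GPU memory
          a = 1.0
          a_d = a                 ! assignment performs the host-to-device transfer
          a = 0.0
          a = a_d                 ! assignment performs the device-to-host transfer
          print *, 'max value after round trip: ', maxval(a)
        end program device_attribute_sketch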

    This book is written for those who want to use parallel computation as a tool in getting other work done rather than as an end in itself. The aim is to give the reader a basic set of skills necessary for them to write reasonably optimized CUDA Fortran code that takes advantage of the NVIDIA® computing hardware. The reason for taking this approach rather than attempting to teach how to extract every last ounce of performance from the hardware is the assumption that those using CUDA Fortran do so as a means rather than an end. Such users typically value clear and maintainable code that is simple to write and performs reasonably well across many generations of CUDA-enabled hardware and CUDA Fortran software.

    But where is the line drawn in terms of the effort-performance tradeoff? In the end it is up to the developer to decide how much effort to put into optimizing code. In making this decision, we need to know what type of payoff we can expect when eliminating various bottlenecks and what effort is involved in doing so. One goal of this book is to help the reader develop an intuition needed to make such a return-on-investment assessment. To achieve this end, we discuss bottlenecks encountered in writing common algorithms in science and engineering applications in CUDA Fortran. Multiple workarounds are presented when possible, along with the performance impact of each optimization effort.

    1.2 Parallel computation

    Before jumping into writing CUDA Fortran code, we should say a few words about where CUDA fits in with other types of parallel programming models. Familiarity with and an understanding of other parallel programming models is not a prerequisite for this book, but for readers who do have some parallel programming experience, this section might be helpful in categorizing CUDA.

    We have already mentioned that CUDA is a hybrid computing model, where both the CPU and GPU are used in an application. This is advantageous for development because sections of an existing CPU code can be ported to the GPU incrementally. It is possible to overlap computation on the CPU with computation on the GPU, so this is one aspect of parallelism.

    A far greater degree of parallelism occurs within the GPU itself. Subroutines that run on the GPU are executed by many threads in parallel. Although all threads execute the same code, these threads typically operate on different data. This data parallelism is a fine-grained parallelism, where it is most efficient to have adjacent threads operate on adjacent data, such as elements of an array. This model of parallelism is very different from a model like Message Passing Interface, commonly known as MPI, which is a coarse-grained model. In MPI, data are typically divided into large segments or partitions, and each MPI process performs calculations on an entire data partition.

    A few characteristics of the CUDA programming model are very different from CPU-based parallel programming models. One difference is that there is very little overhead associated with creating GPU threads. In addition to fast thread creation, context switches, where threads change from active to inactive and vice versa, are very fast for GPU threads compared to CPU threads. The reason context switching is essentially instantaneous on the GPU is that the GPU does not have to store state, as the CPU does when switching threads between being active and inactive. As a result of this fast context switching, it is advantageous to heavily oversubscribe GPU cores—that is, have many more resident threads than GPU cores so that memory latencies can be hidden. It is not uncommon to have the number of resident threads on a GPU an order of magnitude larger than the number of cores on the GPU. In the CUDA programming model, we essentially write a serial code that is executed by many GPU threads in parallel. Each thread executing this code has a means of identifying itself in order to operate on different data, but the code that CUDA threads execute is very similar to what we would write for serial CPU code. On the other hand, the code of many parallel CPU programming models differs greatly from serial CPU code. We will revisit each of
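
    To make thread self-identification concrete, here is a minimal sketch in the spirit of the introductory examples to come (our own illustration; the kernel name increment and the module name m are placeholders). Each thread forms a global index from its block and thread IDs and updates a single array element, and the host launches the kernel with an execution configuration specified in triple chevrons:

        module m
        contains
          attributes(global) subroutine increment(a, b)
            implicit none
            integer, intent(inout) :: a(:)
            integer, value :: b
            integer :: i
            ! each thread computes a unique global index from its block and thread IDs
            i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
            if (i <= size(a)) a(i) = a(i) + b
          end subroutine increment
        end module m

        program thread_id_sketch
          use cudafor
          use m
          implicit none
          integer, parameter :: n = 1024, tPB = 256
          integer :: a(n)
          integer, device :: a_d(n)
          a = 1
          a_d = a
          ! launch enough blocks of tPB threads each to cover all n elements
          call increment<<<(n + tPB - 1)/tPB, tPB>>>(a_d, 3)
          a = a_d
          if (all(a == 4)) print *, 'Test passed'
        end program thread_id_sketch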
