大卫·B. 柯克(David B. Kirk) 美国国家工程院院士,NVIDIA Fellow,曾任NVIDIA公司首席科学家。他领导了NVIDIA图形技术的开发,并且是CUDA技术的创始人之一。2002年,他荣获ACM SIGGRAPH计算机图形成就奖,以表彰其在把高性能计算机图形系统推向大众市场方面做出的杰出贡献。他拥有加州理工学院计算机科学博士学位。
胡文美(Wen-mei W. Hwu) 美国伊利诺伊大学厄巴纳-香槟分校电气与计算机工程系AMD Jerry Sanders讲席教授,并行计算研究中心首席科学家,领导IMPACT团队和CUDA卓越中心的研究工作。他在编译器设计、计算机体系结构、微体系结构和并行计算方面做出了卓越贡献,是IEEE Fellow、ACM Fellow,荣获了包括ACM SigArch Maurice Wilkes Award在内的众多奖项。他还是MulticoreWare公司的联合创始人兼CTO。他拥有加州大学伯克利分校计算机科学博士学位。
目錄:
Preface Acknowledgements
CHAPTER.1 Introduction.................................................................................1
1.1 Heterogeneous Parallel Computing................................................2
1.2 Architecture of a Modern GPU.......................................................6
1.3 Why More Speed or Parallelism?...................................................8
1.4 Speeding Up Real Applications....................................................10
1.5 Challenges in Parallel Programming ............................................12
1.6 Parallel Programming Languages and Models.............................12
1.7 Overarching Goals........................................................................14
1.8 Organization of the Book..............................................................15
References ............................................................................................18
CHAPTER.2 Data Parallel Computing.......................................................19
2.1 Data Parallelism............................................................................20
2.2 CUDA C Program Structure.........................................................22
2.3 A Vector Addition Kernel .............................................................25
2.4 Device Global Memory and Data Transfer...................................27
2.5 Kernel Functions and Threading...................................................32
2.6 Kernel Launch...............................................................................37
2.7 Summary.......................................................................................38
Function Declarations...................................................................38
Kernel Launch...............................................................................38
Built-in Predefined Variables .....................................................39
Run-time API................................................................................39
2.8 Exercises.......................................................................................39
References ............................................................................................41
CHAPTER.3 Scalable Parallel Execution................................................43
3.1 CUDA Thread Organization.........................................................43
3.2 Mapping Threads to Multidimensional Data................................47
3.3 Image Blur: A More Complex Kernel ..........................................54
3.4 Synchronization and Transparent Scalability ...............................58
3.5 Resource Assignment....................................................................60
3.6 Querying Device Properties..........................................................61
3.7 Thread Scheduling and Latency Tolerance...................................64
3.8 Summary.......................................................................................67
3.9 Exercises.......................................................................................67
CHAPTER.4 Memory and Data Locality ...................................................71
4.1 Importance of Memory Access Efficiency....................................72
4.2 Matrix Multiplication....................................................................73
4.3 CUDA Memory Types..................................................................77
4.4 Tiling for Reduced Memory Traffic..............................................84
4.5 A Tiled Matrix Multiplication Kernel...........................................90
4.6 Boundary Checks..........................................................................94
4.7 Memory as a Limiting Factor to Parallelism................................97
4.8 Summary.......................................................................................99
4.9 Exercises...........................................