Contents
A white paper covering the most common issues encountered when developing CUDA applications for NVIDIA GPUs.
- 1. Overview
- 2. Preface
- 3. Heterogeneous Computing
- 4. Application Profiling
- 5. Parallelizing Your Application
- 6. Getting Started
- 7. Getting the Right Answer
- 8. Optimizing CUDA Applications
- 9. Performance Metrics
- 10. Memory Optimizations
- 11. Execution Configuration Optimizations
- 12. Instruction Optimization
  - 12.1. Arithmetic Instructions
    - 12.1.1. Throughput of Native Arithmetic Instructions
    - 12.1.2. Control Flow Instructions
    - 12.1.3. Synchronization Instruction
    - 12.1.4. Division Modulo Operations
    - 12.1.5. Loop Counters Signed vs. Unsigned
    - 12.1.6. Reciprocal Square Root
    - 12.1.7. Other Arithmetic Instructions
    - 12.1.8. Exponentiation With Small Fractional Arguments
    - 12.1.9. Math Libraries
    - 12.1.10. Precision-related Compiler Flags
  - 12.2. Memory Instructions
- 13. Control Flow
- 14. Deploying CUDA Applications
- 15. Understanding the Programming Environment
- 16. CUDA Compatibility Developer’s Guide
- 17. Preparing for Deployment
- 18. Deployment Infrastructure Tools
- 19. Recommendations and Best Practices
- 20. nvcc Compiler Switches
- 21. Notices