CUDA C++ Programming Model Fundamentals

Kernels

CUDA C++ extends the standard C++ language by allowing the definition of kernels. A kernel is a function that, when called, is executed N times in parallel by N different CUDA threads, as opposed to a regular C++ function which executes only once. Kernels are defined using the __global__ declaration specifier and are invoked using a specific execution configuration syntax <<<...>>>. This syntax specifies the number of CUDA threads to be used for that specific kernel call. Each thread executing the kernel is assigned a unique Thread ID, accessible via built-in variables within the kernel.

The following code example illustrates how to add two vectors, in1 and in2, each of size N, and store the result in out using the built-in threadIdx variable:

// Kernel definition
__global__ void vectorAdd(const float* in1, const float* in2, float* out)
{
    int idx = threadIdx.x;
    out[idx] = in1[idx] + in2[idx];
}

int main() {
    // ...
    // Kernel invocation with N threads
    vectorAdd<<<1, N>>>(in1, in2, out);
    // ...
}

In this example, each of the N threads executing vectorAdd performs a single pair-wise addition.

Thread Hierarchy

To facilitate complex data processing, threadIdx is a 3-component vector. This allows threads to be indexed using one, two, or three dimensions, forming a thread block. This structure provides a natural mapping for computations involving elements in domains such as vectors (1D), matrices (2D), or volumes (3D).

The relationship between the thread index and the Thread ID varies by dimension. For a 1D block, they are identical. For a 2D block of size (Dx, Dy), the Thread ID of a thread at index (x, y) is (x + y * Dx). For a 3D block of size (Dx, Dy, Dz), the Thread ID of a thread at index (x, y, z) is (x + y * Dx + z * Dx * Dy).
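
For example, inside a kernel the linear Thread ID can be reconstructed from the built-in index and dimension variables. Here is a minimal sketch (the kernel name is illustrative; blockDim, described in more detail below, holds the block dimensions Dx, Dy, Dz):

// Computing the linear Thread ID within a 3D block
__global__ void threadIdExample()
{
    int tid = threadIdx.x
            + threadIdx.y * blockDim.x
            + threadIdx.z * blockDim.x * blockDim.y;
    // tid is unique within the block, in the range [0, blockDim.x * blockDim.y * blockDim.z)
}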

The following code demonstrates adding two matrices, A and B, both of size N x N, and storing the result in C:

// Kernel definition
__global__ void matrixAdd(float A[N][N], float B[N][N], float C[N][N])
{
    int row = threadIdx.x;
    int col = threadIdx.y;
    C[row][col] = A[row][col] + B[row][col];
}

int main() {
    // ...
    // Kernel invocation with one block of N * N threads
    dim3 threadsPerBlock(N, N);
    matrixAdd<<<1, threadsPerBlock>>>(A, B, C);
    // ...
}

Because all threads within a block must reside on the same Streaming Multiprocessor (SM) and share that multiprocessor's limited memory resources, the number of threads per block is limited. Current GPUs allow a maximum of 1024 threads per block. However, a kernel can be executed by multiple blocks of the same shape, so the total number of threads equals the number of threads per block multiplied by the number of blocks.
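
This limit need not be hard-coded; it can be queried from the runtime. A minimal sketch (device 0 is assumed):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int maxThreads = 0;
    // Query the maximum number of threads per block supported by device 0
    cudaDeviceGetAttribute(&maxThreads, cudaDevAttrMaxThreadsPerBlock, 0);
    printf("Max threads per block: %d\n", maxThreads);
    return 0;
}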

Blocks are organized into a grid of thread blocks, which can be 1D, 2D, or 3D. The number of blocks in a grid is typically determined by the size of the data being processed. The <<<...>>> syntax accepts int or dim3 types for both blocks per grid and threads per block.

Each block within a grid has a unique index accessible via the built-in blockIdx variable, and the dimension of the block is accessible via blockDim. The following example extends matrixAdd to handle multiple blocks:

// Kernel definition
__global__ void matrixAdd(float A[N][N], float B[N][N], float C[N][N])
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    int col = blockIdx.y * blockDim.y + threadIdx.y;

    if (row < N && col < N)
        C[row][col] = A[row][col] + B[row][col];
}

int main() {
    // ...
    // Kernel invocation
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks((N + threadsPerBlock.x - 1) / threadsPerBlock.x, 
                    (N + threadsPerBlock.y - 1) / threadsPerBlock.y);
    matrixAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
    // ...
}

A block size of 16x16 (256 threads) is a common choice. The grid is created with enough blocks to cover every matrix element, rounding the block count up when N is not a multiple of the block size. The boundary check in the kernel then prevents threads in partially filled blocks from accessing elements outside the matrix.

Thread blocks must be able to execute independently: it must be possible to run them in any order, in parallel or in series. This requirement allows the system to schedule blocks across any number of multiprocessors in any order, enabling code that scales automatically with the number of multiprocessors.

Threads within a block can cooperate by sharing data through shared memory and synchronizing execution. The __syncthreads() intrinsic creates a barrier where all threads in the block must wait before any proceed. Efficient cooperation relies on shared memory being low-latency (like L1 cache) and __syncthreads() being lightweight.
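
As a minimal sketch of this cooperation (the kernel name and the fixed 256-element tile are illustrative; launch with a single block of 256 threads), each thread stages one element into shared memory, waits at the barrier, and then reads an element written by a different thread:

// Reverse 256 elements in place using shared memory and __syncthreads()
__global__ void reverseInBlock(int* data)
{
    __shared__ int tile[256];            // shared memory visible to the whole block

    int i = threadIdx.x;
    tile[i] = data[i];                   // each thread stages one element
    __syncthreads();                     // wait until every element is staged

    data[i] = tile[blockDim.x - 1 - i];  // read a value written by another thread
}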

Thread Block Clusters

With the introduction of NVIDIA Compute Capability 9.0, the CUDA programming model added an optional hierarchy level known as Thread Block Clusters. Just as threads in a block are guaranteed to be co-scheduled on a Streaming Multiprocessor (SM), blocks in a cluster are guaranteed to be co-scheduled on a GPU Processing Cluster (GPC).

Clusters are organized in 1D, 2D, or 3D. The maximum portable cluster size is 8 blocks, though this may vary depending on specific GPU hardware or MIG configurations. The exact limits for specific architectures can be queried using the cudaOccupancyMaxPotentialClusterSize API.

Clusters can be enabled using a compile-time kernel attribute __cluster_dims__(X, Y, Z) or the runtime launch API cudaLaunchKernelEx.

Compile-time Cluster Example:

// Kernel definition with fixed cluster size
__global__ void __cluster_dims__(2, 1, 1) clusterKernel(float* input, float* output)
{
    // Kernel code
}

int main() {
    float *input, *output;
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y);

    // The grid dimension must be a multiple of the cluster size.
    clusterKernel<<<numBlocks, threadsPerBlock>>>(input, output);
}

Runtime Cluster Example:

// Kernel definition without compile-time attribute
__global__ void clusterKernel(float* input, float* output) {}

int main() {
    float *input, *output;
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y);

    cudaLaunchConfig_t config = {0};
    config.gridDim = numBlocks;
    config.blockDim = threadsPerBlock;

    cudaLaunchAttribute attribute[1];
    attribute[0].id = cudaLaunchAttributeClusterDimension;
    attribute[0].val.clusterDim.x = 2;
    attribute[0].val.clusterDim.y = 1;
    attribute[0].val.clusterDim.z = 1;
    config.attrs = attribute;
    config.numAttrs = 1;

    cudaLaunchKernelEx(&config, clusterKernel, input, output);
}

Clusters enable blocks to access Distributed Shared Memory, allowing reading, writing, and atomic operations on the shared memory of other blocks within the same cluster.
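
As a rough sketch of how this looks in code (using the cooperative groups cluster API on compute capability 9.0; the kernel and variable names are illustrative), each block writes its rank into its own shared memory and then reads the value stored by the next block in the cluster:

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Each block reads from the shared memory of its neighbor in the cluster
__global__ void __cluster_dims__(2, 1, 1) dsmemKernel(int* out)
{
    __shared__ int local;
    cg::cluster_group cluster = cg::this_cluster();

    if (threadIdx.x == 0)
        local = cluster.block_rank();                     // write to this block's shared memory
    cluster.sync();                                       // make all writes visible cluster-wide

    unsigned int peer = (cluster.block_rank() + 1) % cluster.num_blocks();
    int* remote = cluster.map_shared_rank(&local, peer);  // pointer into the peer block's shared memory

    if (threadIdx.x == 0)
        out[cluster.block_rank()] = *remote;
    cluster.sync();                                       // keep shared memory alive until peers finish reading
}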

Memory Hierarchy

CUDA threads can access data from several distinct memory spaces during execution. Each thread possesses private local memory. Each thread block has access to shared memory visible to all threads within that block, which has the same lifespan as the block. For clusters, blocks can access the shared memory of other blocks in the cluster. All threads can access global memory.

Additionally, there are two read-only memory spaces accessible by all threads: constant memory and texture memory. These spaces are optimized for different memory use cases. Global, constant, and texture memory spaces are persistent across kernel launches within the same application.
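
In CUDA C++ these spaces map to distinct declarations. A minimal sketch (variable names are illustrative; texture memory, which is accessed through dedicated APIs, is omitted):

__device__   float globalValue;          // global memory, visible to every thread
__constant__ float constantTable[64];    // read-only constant memory, set from the host

__global__ void memorySpaces(float* globalBuffer)
{
    __shared__ float blockTile[128];     // shared memory, one copy per thread block
    float perThread = 0.0f;              // automatic variable, private to each thread
    // globalBuffer points to global memory allocated by the host
    // ...
}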

Heterogeneous Programming

The CUDA programming model assumes a heterogeneous environment where the host (CPU) and the device (GPU) operate as separate processors. The host manages the execution, while the device handles the parallel kernels.

Furthermore, the model assumes the host and device maintain separate memory spaces (Host Memory and Device Memory). Consequently, the program must explicitly manage the global, constant, and texture memory visible to the kernel. This involves allocating and freeing device memory and transferring data between the host and the device via the CUDA runtime.
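
A minimal sketch of this explicit management with the CUDA runtime (the buffer size and names are illustrative):

#include <cstdlib>
#include <cuda_runtime.h>

int main() {
    const int N = 1 << 20;
    size_t bytes = N * sizeof(float);

    float* h_data = (float*)malloc(bytes);       // host memory
    float* d_data = nullptr;                     // device memory
    cudaMalloc(&d_data, bytes);

    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);   // host -> device
    // ... launch kernels that operate on d_data ...
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);   // device -> host

    cudaFree(d_data);
    free(h_data);
    return 0;
}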

Unified Memory provides a managed memory space that bridges the host and device, creating a single coherent memory image accessible by all CPUs and GPUs in the system. This simplifies programming by removing the need for explicit data mirroring.
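
A minimal sketch of the same pattern with Unified Memory, where a single pointer allocated with cudaMallocManaged is valid on both host and device:

#include <cuda_runtime.h>

int main() {
    const int N = 1 << 20;
    float* data = nullptr;
    cudaMallocManaged(&data, N * sizeof(float));   // one allocation, visible to host and device

    for (int i = 0; i < N; ++i)
        data[i] = 1.0f;                            // the host writes directly, no explicit copy

    // ... launch kernels that read or write data ...
    cudaDeviceSynchronize();                       // wait for the device before the host reads again

    cudaFree(data);
    return 0;
}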

Asynchronous SIMT Programming Model

Starting with the NVIDIA Ampere architecture, the CUDA programming model introduces an asynchronous programming model to accelerate memory operations. This model defines the behavior of asynchronous operations initiated by CUDA threads.

Asynchronous Operations

An asynchronous operation is initiated by a CUDA thread but executes as if by a separate thread. Synchronization between CUDA threads and these operations is achieved using synchronization objects like cuda::barrier or cuda::pipeline.

These objects operate within specific thread scopes, which define the set of threads that can synchronize with the operation:

  • thread_scope_thread: Only the initiating thread synchronizes.
  • thread_scope_block: All or any threads in the initiating thread's block synchronize.
  • thread_scope_device: All or any threads in the initiating GPU device synchronize.
  • thread_scope_system: All or any threads in the system (CPU or GPU) synchronize.
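
A minimal sketch of an asynchronous copy into shared memory, synchronized with a block-scoped cuda::barrier (the kernel and buffer names are illustrative; hardware acceleration of the copy requires compute capability 8.0 or higher):

#include <cooperative_groups.h>
#include <cuda/barrier>

__global__ void asyncCopyKernel(const int* global_in)
{
    __shared__ int tile[128];
    __shared__ cuda::barrier<cuda::thread_scope_block> bar;

    auto block = cooperative_groups::this_thread_block();
    if (block.thread_rank() == 0)
        init(&bar, block.size());          // initialize the barrier once per block
    block.sync();

    // Initiate an asynchronous copy; completion is tracked by the barrier
    cuda::memcpy_async(block, tile, global_in, sizeof(tile), bar);

    bar.arrive_and_wait();                 // wait until the copy has completed
    // tile[] is now safe to read by every thread in the block
}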

Compute Capability

A device's compute capability is represented by a version number (e.g., 9.0) that identifies the supported hardware features and instructions. The version consists of a major revision number X and a minor revision number Y.

Devices with the same major revision number share the same core architecture. For example, a major revision of 9 corresponds to the NVIDIA Hopper architecture, 8 to Ampere, 7 to Volta, and so on. The minor revision number indicates incremental improvements within that architecture.

It is important not to confuse the compute capability (hardware version) with the CUDA software platform version (e.g., CUDA 12.0). Developers use the CUDA platform to create applications that can run across multiple generations of GPU architectures.
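
The compute capability of the installed device can be queried at runtime. A minimal sketch (device 0 is assumed):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0
    printf("Compute capability: %d.%d\n", prop.major, prop.minor);
    return 0;
}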
