CUDA

Max reduction

#include <cfloat>   // FLT_MAX

// Device code
__global__ void VecMax(const float* A, float* B, int N)
{
    extern __shared__ float cache[];

    int i = blockDim.x * blockIdx.x + threadIdx.x; // global index of this thread, used to read elements of the input array A
    int cacheIndex = threadIdx.x;                  // each thread's own slot in the shared-memory cache, where it stores its partial maximum

    float temp = -FLT_MAX;  // per-thread register holding this thread's running maximum
    while (i < N) {
        if (A[i] > temp)
            temp = A[i];
        i += blockDim.x * gridDim.x;   // grid-stride loop: each thread visits every (blockDim.x * gridDim.x)-th element, so threads never overlap and together they cover all of A
    }
   
    cache[cacheIndex] = temp;   // set the cache value 

    __syncthreads(); // Synchronizes all threads within the block, ensuring all partial maximums are written to cache before proceeding.

    // perform parallel reduction, threadsPerBlock must be 2^m

    int ib = blockDim.x / 2;
    while (ib != 0) {
      if(cacheIndex < ib && cache[cacheIndex + ib] > cache[cacheIndex])
        cache[cacheIndex] = cache[cacheIndex + ib]; 

      __syncthreads();

      ib /= 2;
    }  // Performs parallel reduction within the block. In each iteration, threads compare and update values in cache, halving the comparison range ib until it becomes 0. This effectively finds the maximum value within cache.
    
    if(cacheIndex == 0)
      B[blockIdx.x] = cache[0]; // The first thread in each block (cacheIndex == 0) writes the maximum value found in this block to the output array B.
}
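Below is a minimal host-side sketch (not from the original; the function name reduceMax, the launch configuration, and the pointer names d_A and d_B are assumptions) showing one way to launch VecMax. The third launch parameter supplies the dynamic shared memory backing the extern __shared__ cache, and a short second pass over B finishes the reduction on the host:

// Host-side sketch: d_A already holds the N input values in device global memory.
float reduceMax(const float* d_A, int N)
{
    const int threadsPerBlock = 256;                              // must be a power of two (2^m)
    const int blocksPerGrid   = 64;
    const size_t sharedBytes  = threadsPerBlock * sizeof(float);  // bytes for the extern __shared__ cache

    float* d_B = nullptr;
    cudaMalloc(&d_B, blocksPerGrid * sizeof(float));

    VecMax<<<blocksPerGrid, threadsPerBlock, sharedBytes>>>(d_A, d_B, N);

    // B now holds one partial maximum per block; reduce those few values on the host.
    float h_B[64];
    cudaMemcpy(h_B, d_B, blocksPerGrid * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_B);

    float result = -FLT_MAX;
    for (int b = 0; b < blocksPerGrid; ++b)
        if (h_B[b] > result)
            result = h_B[b];
    return result;
}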

Ensuring Full Coverage: It ensures that all parts of the array are checked as quickly as possible by spreading the threads out to cover the entire array.

Load Balancing: It balances the load among all available threads, giving each thread approximately the same amount of work, assuming the array size N is much larger than the total number of threads.

Efficiency: It helps manage memory access patterns more efficiently. Sequential accesses by consecutive threads can leverage memory coalescing, improving access efficiency to global memory. In this kernel, however, the focus is on letting each thread work independently on separate parts of the array without needing synchronization until the reduction.

Registers are per-thread, extremely fast, but limited in number. They are ideal for thread-specific data that is frequently accessed.

Shared memory is shared among threads in the same block; it is faster than global memory but slower than registers. It requires explicit management and is suitable for inter-thread communication and data sharing within a block.

Cache automatically stores copies of frequently accessed global memory data to reduce access times. It operates at the hardware level and does not require explicit management by the programmer, but its effectiveness depends on access patterns.

Data like the array A is stored in the global memory of the GPU when it’s passed from the host (CPU) to the device (GPU). Global memory is accessible by all threads across all blocks on the GPU but has higher latency compared to registers and shared memory.

Memory Transfer: Before the kernel execution, A is transferred from the host memory to the device’s global memory using CUDA memory copy operations (e.g., cudaMemcpy).
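A hypothetical sketch of that step (the names h_A and d_A are assumptions, not from the original):

// Allocate global memory on the device and copy the input array A into it.
float* d_A = nullptr;
cudaMalloc(&d_A, N * sizeof(float));
cudaMemcpy(d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice);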

Global Memory Access: During kernel execution, each thread accesses elements of A directly from global memory unless cached values are available in L1/L2 cache, which can reduce access times.

while (i < N) {
    if (A[i] > temp)
        temp = A[i];
    i += blockDim.x * gridDim.x;  
}

This loop iterates over elements of A assigned to a specific thread. The way i is updated ensures that each thread processes a different subset of A, spaced apart by blockDim.x * gridDim.x. Here’s why and how this works:

Initial Index (i): Calculated as blockDim.x * blockIdx.x + threadIdx.x, this assigns a unique starting index to each thread across all blocks.

Index Update (i += blockDim.x * gridDim.x):

This update skips indices by blockDim.x * gridDim.x for each iteration. It’s designed to distribute the workload evenly across all threads in the grid. This pattern ensures that if you have a large array and a grid of threads, each thread takes on the job of checking different segments of the array, rather than all threads working on contiguous segments and potentially leaving parts of the array unchecked until later. The jump (blockDim.x * gridDim.x) corresponds to the total number of threads in the grid. This way, each thread “jumps” over the elements that other threads are responsible for, ensuring that the entire array is covered without overlap.

Synchronization Point: __syncthreads(); acts as a barrier at which all threads in the block must wait before any can proceed. It ensures that all threads within the block reach this point in the execution before any thread can continue beyond it.

Use Cases: It is commonly used to ensure that shared memory accesses are safe (for example, when threads write data to shared memory that other threads will read), or to coordinate between different phases of an algorithm within a block.

Parallel Writing to cache: Yes, the step cache[cacheIndex] = temp; happens in parallel for all threads within the same block, with each thread using a different cacheIndex. This ensures that each thread's partial maximum is stored independently, without conflicts.

Need for Synchronization: The synchronization with __syncthreads(); is necessary because, after storing their partial maximums, the threads work together to reduce these values to a single maximum for the block. This reduction reads from cache, so it is crucial that all writes to cache have completed first.

In summary, __syncthreads(); ensures that all threads in a block are synchronized at certain points, making it safe to proceed with operations that depend on the results of prior parallel work by all threads in the block. This is essential for correct and efficient parallel algorithms on the GPU.

In CUDA, when you launch a kernel with a specific block size (defined by blockDim.x for a 1D block), each thread within that block can perform operations independently. The cache array in shared memory is used here to store intermediate results or specific data for each thread in the block.

Global memory

Global memory has the highest access latency. Coalesced access patterns should be used where possible to maximize throughput. Use of global memory is inevitable for large data, but its access should be minimized or optimized (e.g., through coalescing, minimizing divergent accesses, and maximizing memory throughput).
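As a rough illustration of the coalescing point (both kernels below are hypothetical, not part of the original code): consecutive threads reading consecutive addresses let the hardware combine a warp's loads into a few transactions, while strided access spreads the same warp over many memory segments.

// Coalesced: thread k of a warp reads element k, so the warp's 32 loads fall into
// one or two contiguous memory segments.
__global__ void coalescedCopy(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// Strided: consecutive threads read addresses 'stride' elements apart, so the same
// warp touches many separate segments and throughput drops.
__global__ void stridedCopy(const float* in, float* out, int n, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n)
        out[i] = in[i];
}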

Shared memory

Shared memory access patterns should be optimized to avoid bank conflicts, which can serialize access and reduce throughput.
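A classic, hypothetical illustration of avoiding bank conflicts (this tiled-transpose kernel is a sketch, assuming a square width-by-width matrix whose dimension is a multiple of 32 and a 32x32 thread block): padding the shared-memory tile by one column makes the column-wise reads fall into different banks.

#define TILE 32

__global__ void transposeTile(const float* in, float* out, int width)
{
    // Without the +1 padding, a warp reading tile[threadIdx.x][...] down a column
    // would hit the same bank 32 times and serialize.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];

    __syncthreads();   // all writes to the tile must finish before the transposed reads

    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    out[y * width + x] = tile[threadIdx.x][threadIdx.y];
}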

Cache

The main difference between shared memory and the L1 cache is that the contents of shared memory are managed by your code explicitly, whereas the L1 cache is managed automatically. Shared memory is also a better way to exchange data between threads in a block with predictable timing. My rule of thumb is: unpredictable reads and writes => prefer L1.

There might be some additional latency when using L1 due to the need to translate the global address you are accessing to a cache location.

Improve float array summation precision and stability?

Use a wider representation: accumulate in double (64-bit) instead of float (32-bit).

Use double instead of float, and sort the data from smallest to largest magnitude (absolute value) before summing, so that small values are not absorbed by an already-large partial sum.
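A minimal host-side sketch of both ideas (the function name stableSum and the use of std::sort are assumptions, not from the original):

#include <algorithm>
#include <cmath>
#include <vector>

double stableSum(std::vector<float> v)
{
    // Sum the smallest-magnitude values first.
    std::sort(v.begin(), v.end(),
              [](float a, float b) { return std::fabs(a) < std::fabs(b); });

    double sum = 0.0;          // 64-bit accumulator keeps more significant digits
    for (float x : v)
        sum += x;
    return sum;
}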

Floating Point Representation:

In IEEE-754, a floating point number is represented using three parts: a sign bit for the number's sign, an exponent to indicate the power of 2, and a mantissa that holds the significant digits of the number. Positive numbers have a sign bit of 0.

High-Set-Bit and Low-Set-Bit:

For any given number, the "high-set-bit" is the most significant bit (MSB) that is not zero. The "low-set-bit" is the least significant bit (LSB) that is not zero.

Parallel Min and Max Reductions:

In an array of numbers, identify the highest position of any high-set-bit and the lowest position of any low-set-bit across all numbers. This can be done in parallel, enhancing efficiency.
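A rough sketch of what those bit positions could mean for normal, non-zero floats (the helper names below are hypothetical): the high-set-bit position of x is its binary exponent, ilogb(x), and the low-set-bit position is that exponent minus the position of the last set mantissa bit. The maximum of highSetBit and the minimum of lowSetBit over an array can then be found with exactly the same reduction pattern as VecMax above.

#include <cmath>
#include <cstdint>
#include <cstring>

// Position (power of two) of the most significant set bit of a normal, non-zero float.
int highSetBit(float x)
{
    return std::ilogb(x);                    // e.g. 6.0f = 1.1b * 2^2  ->  2
}

// Position (power of two) of the least significant set bit of a normal, non-zero float.
int lowSetBit(float x)
{
    uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    uint32_t mant = (bits & 0x7FFFFFu) | 0x800000u;   // 23 fraction bits plus the implicit leading 1
    int trailing = 0;
    while ((mant & 1u) == 0u) { mant >>= 1; ++trailing; }
    return std::ilogb(x) - (23 - trailing);  // e.g. 6.0f = 110b  ->  1
}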
