Basic analogies¶
CUDA thread ↔ OpenCL work-item
CUDA thread block / CTA ↔ OpenCL work-group
CUDA warp ↔ OpenCL sub-group
CUDA SM (Streaming Multiprocessor) ↔ OpenCL compute unit (CU)
CUDA shared memory ↔ OpenCL local memory
CUDA grid ↔ OpenCL NDRange
Tensor Cores¶
A Tensor Core is a specialized matrix-math unit inside the SM, analogous to an ALU. It very efficiently performs:
Matrix multiply / fused multiply-accumulate on small tiles
Common datatypes: FP16, BF16, TF32, INT8, etc. (depends on GPU generation)
Think of a Tensor Core as a specialized ALU: a single instruction operates on a whole tile at once. The downside is that Tensor Cores are not general-purpose like ALUs, so they mostly support matrix multiply and accumulate ops.
When an MMA (matrix multiply-accumulate) operation is encountered in a kernel:
Threads (warps) are running on an SM.
The SM’s scheduler issues a tensor-math instruction.
The Tensor Cores perform the matrix operation.
Results go back to registers / shared memory under the SM’s control.
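These steps map directly onto CUDA's `nvcuda::wmma` API. A minimal sketch, assuming a single warp computing one 16×16×16 FP16 tile with an FP32 accumulator (the layouts and leading dimensions of 16 are illustrative):

```cuda
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp cooperatively owns all fragments; the SM's scheduler issues
// mma_sync as a tensor-core instruction.
__global__ void wmma_tile(const half *a, const half *b, float *c) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);      // zero the accumulator registers
    wmma::load_matrix_sync(a_frag, a, 16);  // load tiles into per-thread registers
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // tensor-core matrix op
    wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);  // results back
}
```

Each thread in the warp holds only a slice of each fragment; the hardware decides the exact register-to-tile mapping.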
Asynchronous Copying¶
In the case of asynchronous copy, which newer NVIDIA GPUs provide, the kernel asks the hardware to start copying a rectangular block of a matrix from global memory into on-chip shared memory. The kernel can keep doing other work while that copy is in flight, then later synchronizes (waits) before using the data.
“Asynchronous” means:
the instruction initiates the copy and returns immediately,
the data may not be available in shared memory yet,
you later do a wait/barrier before reading that shared memory.
This enables overlap (pipelining):
while you compute on tile k already in shared memory,
the hardware is copying tile k+1 into another shared-memory buffer.
This is exactly the classic double-buffering / software pipeline pattern.
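A minimal sketch of this double-buffering pattern with the cooperative-groups `memcpy_async` API. The tile size, the trivial per-thread "compute," and the assumptions `blockDim.x == TILE` and input length `num_tiles * TILE` are all illustrative:

```cuda
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
namespace cg = cooperative_groups;

constexpr int TILE = 256;  // illustrative tile size, in floats

__global__ void pipelined_sum(const float *in, float *out, int num_tiles) {
    __shared__ float buf[2][TILE];     // two shared-memory buffers
    auto block = cg::this_thread_block();

    // Prime the pipeline: start the async copy of tile 0.
    cg::memcpy_async(block, buf[0], in, sizeof(float) * TILE);

    float acc = 0.0f;
    for (int k = 0; k < num_tiles; ++k) {
        if (k + 1 < num_tiles) {
            // Kick off tile k+1 into the other buffer, then wait for all
            // copies except the newest one: tile k is ready, k+1 in flight.
            cg::memcpy_async(block, buf[(k + 1) & 1],
                             in + (size_t)(k + 1) * TILE,
                             sizeof(float) * TILE);
            cg::wait_prior<1>(block);
        } else {
            cg::wait(block);           // last tile: wait for everything
        }
        acc += buf[k & 1][threadIdx.x]; // "compute" on tile k
        block.sync();                   // don't overwrite until all have read
    }
    out[threadIdx.x] = acc;
}
```

On pre-Ampere GPUs the same API still works, but the copy is staged through registers rather than the dedicated async-copy hardware, so the overlap benefit is smaller.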
New features (mostly on the SM90 and SM100 architectures)¶
Thread-Block Clusters: a team of work-groups that are guaranteed to run close together and can coordinate. A small number of blocks can cooperate more like a "super work-group." Within a cluster, a block can read, write, and perform atomics on another block's shared memory; NVIDIA calls this Distributed Shared Memory.
The Tensor Memory Accelerator (TMA) is an async copy engine that can move tensor-shaped data between global ↔ shared memory, and even between the shared-memory regions of different SMs in the same cluster. Hence the main benefit of clusters is sharing each other's local memory.
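A minimal Distributed Shared Memory sketch (SM90+). The cluster size of 2 and the exchange pattern are illustrative; clusters can also be configured at launch via `cudaLaunchKernelEx` instead of the `__cluster_dims__` attribute:

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Two blocks in a cluster read each other's shared memory.
__global__ void __cluster_dims__(2, 1, 1) dsm_swap(int *out) {
    __shared__ int tile[128];
    cg::cluster_group cluster = cg::this_cluster();

    tile[threadIdx.x] = blockIdx.x * 1000 + threadIdx.x; // fill our own smem
    cluster.sync();           // every block in the cluster has written

    // map_shared_rank returns a pointer into the peer block's shared memory.
    unsigned peer = cluster.block_rank() ^ 1;
    int *peer_tile = cluster.map_shared_rank(tile, peer);
    out[blockIdx.x * blockDim.x + threadIdx.x] = peer_tile[threadIdx.x];

    cluster.sync();           // keep peer smem alive until all reads finish
}
```

The final `cluster.sync()` matters: without it, one block could exit and have its shared memory reclaimed while its peer is still reading.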
CTA pairs (Blackwell/SM100): two work-groups inside a cluster that are paired so tightly the hardware treats them like one bigger matrix-multiply unit for certain instructions. A pair is always exactly 2 CTAs, but they can coordinate on memory as well as compute. The two CTAs are scheduled on the same or adjacent SMs and can cooperate more deeply, specifically for tensor-core work (MMAs). They can:
execute tensor-core instructions cooperatively, and
access the extra Tensor Memory (TMEM) of both CTAs in the pair.
Barriers exist across CTAs and pairs for synchronization.
Multicast¶
Suppose multiple (4 in our case) different work-groups have to load (or send) the same data.
Without multicast: each work-group issues its own global→shared copy of that tile. That’s 4 separate requests and extra traffic.
With multicast: one "multicast" copy operation fetches the tile once and writes it into the shared memory of multiple CTAs in a cluster (hardware-managed fan-out). This is explicitly called TMA multicast in Hopper/Blackwell cluster GEMM discussions.
Advantage:
Both memory bandwidth and copy instructions are saved.
Requirements:
This multicast behavior is tied to Thread-Block Clusters, whose blocks are scheduled near each other and can access each other's shared memory.
When work-items in the same SIMD execution group (warp/sub-group) access consecutive global addresses, the hardware combines those per-lane loads/stores into fewer, larger memory transactions. This is memory coalescing, and it is different from multicast: coalescing merges one warp's accesses into fewer transactions, while multicast reduces redundant global-memory traffic by distributing a single fetch to multiple CTAs/work-groups, which must be part of the same cluster.
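To illustrate coalescing, a sketch of two access patterns (the exact transaction sizes depend on the GPU and cache configuration):

```cuda
// Coalesced: lanes 0..31 of a warp touch consecutive floats, so the
// hardware can service the warp with one (or a few) wide transactions.
__global__ void coalesced_copy(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: lanes touch addresses `stride` floats apart, so the warp's
// accesses can splinter into up to 32 separate transactions.
__global__ void strided_copy(const float *in, float *out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[(size_t)i * stride];
}
```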
Nvidia terms:¶
Prologue: load the next tiles of A and B from global memory
Mainloop: compute MMAs on the tensor cores
Epilogue: store results back to global memory (optionally with extra math)
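The three phases are visible in a classic tiled-GEMM skeleton. This is a structural sketch only: a scalar FMA stands in for the tensor-core MMA, and it assumes M, N, K are multiples of TILE with a TILE×TILE thread block:

```cuda
constexpr int TILE = 16;

__global__ void gemm_tiles(const float *A, const float *B, float *C,
                           int M, int N, int K) {
    __shared__ float a_s[TILE][TILE], b_s[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;                      // accumulator lives in registers

    for (int k0 = 0; k0 < K; k0 += TILE) {
        // Prologue (per iteration): load the next tiles of A and B
        a_s[threadIdx.y][threadIdx.x] = A[row * K + (k0 + threadIdx.x)];
        b_s[threadIdx.y][threadIdx.x] = B[(k0 + threadIdx.y) * N + col];
        __syncthreads();
        // Mainloop body: multiply-accumulate on the shared-memory tiles
        for (int k = 0; k < TILE; ++k)
            acc += a_s[threadIdx.y][k] * b_s[k][threadIdx.x];
        __syncthreads();
    }
    // Epilogue: store results back to global (extra math could go here)
    C[row * N + col] = acc;
}
```

In a production kernel the prologue uses async copies (and on Hopper, TMA) so the next tiles stream in while the mainloop computes, exactly the double-buffering pattern described earlier.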
SM90 vs SM100¶
SM90 keeps accumulators in per-thread registers; SM100 keeps accumulators in a CTA-owned TMEM tile. SM90 does not have TMEM. On SM100, inputs cannot live in TMEM; only the intermediate results of tensor operations can be kept there, and TMEM has more restrictions than local memory. The benefit of TMEM is fast storage for intermediate compute results, which frees up registers and allows a larger block size / work-group size. TMEM is for tensor cores (not general purpose) and can be used by one CTA or one CTA pair.