GPU Microarchitecture


CUDA PTX: Learning to Read NVIDIA's Virtual ISA

TL;DR

PTX is not the real hardware ISA. It is NVIDIA’s virtual instruction set that sits between CUDA C++ and SASS.
PTX is the best layer for learning how the compiler thinks about types, addresses, predicates, and memory spaces.
SASS is where architecture-specific details appear: actual opcodes, scheduling metadata, scoreboard behavior, and pipeline usage.
If you can read PTX, you can usually answer: what computation is happening, what memory space it touches, and why the compiler generated a certain structure.
If you want to optimize the last 20%, you eventually need to correlate PTX with SASS and profiler data.

CPU Baseline: Why GPUs Need a Virtual ISA Layer

On CPUs, most people think in terms of:


CUDA SASS: Learning to Read NVIDIA's Native GPU ISA

TL;DR

SASS is the real instruction stream executed by NVIDIA GPUs.
PTX is not the final hardware ISA. It is a virtual ISA that ptxas lowers into architecture-specific SASS.
The main things to learn first are: opcodes, registers, predicates, loads/stores, special registers, and modifiers.
SASS is where you see performance-critical details that source code hides: final opcode selection, register usage, spills, and memory instructions.
If PTX tells you the compiler’s intent, SASS tells you what the GPU will actually issue.

Why Learn SASS at All?

If you only write CUDA C++, it is tempting to stop at source code and trust the compiler. That works until performance becomes mysterious.
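One way to start comparing the two layers is to compile a tiny kernel and dump both its PTX and its SASS. The sketch below is illustrative only: the kernel `saxpy`, the helper `saxpy_matches_reference`, and the file name `saxpy.cu` are assumptions made for this example, and `sm_80` is just a sample target that should be replaced with the architecture of your own GPU. The instruction names mentioned in the comments are what one would typically expect to see, not a guaranteed listing.

```cuda
// saxpy.cu: hypothetical file name used only for this sketch.
//
// Dump the virtual ISA (PTX):
//   nvcc -arch=sm_80 -ptx saxpy.cu -o saxpy.ptx
// Dump the native ISA (SASS) for the same target:
//   nvcc -arch=sm_80 -cubin saxpy.cu -o saxpy.cubin
//   cuobjdump -sass saxpy.cubin
// sm_80 is an assumption; use the architecture of your own GPU.
#include <cuda_runtime.h>
#include <cmath>
#include <vector>

// In the PTX you would typically expect reads of %ctaid.x, %ntid.x and
// %tid.x, a setp plus predicated branch for the bounds check, ld.global and
// st.global for the memory accesses, and an fma for the arithmetic.
// The SASS for the same kernel typically shows opcodes such as S2R, ISETP,
// LDG, FFMA and STG. Exact output varies with compiler version and target.
__global__ void saxpy(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // special registers
    if (i < n) {                                    // becomes a predicate
        y[i] = a * x[i] + y[i];                     // load, FMA, store
    }
}

// Hypothetical host helper: runs the kernel and compares against a CPU
// reference so the file can also be built and executed as a sanity check.
bool saxpy_matches_reference(int n = 1024, float a = 2.0f) {
    std::vector<float> hx(n), hy(n), expected(n);
    for (int i = 0; i < n; ++i) {
        hx[i] = static_cast<float>(i);
        hy[i] = 1.0f;
        expected[i] = a * hx[i] + hy[i];
    }

    const size_t bytes = static_cast<size_t>(n) * sizeof(float);
    float* dx = nullptr;
    float* dy = nullptr;
    if (cudaMalloc(&dx, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&dy, bytes) != cudaSuccess) { cudaFree(dx); return false; }

    bool ok = cudaMemcpy(dx, hx.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess
           && cudaMemcpy(dy, hy.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;

    if (ok) {
        const int threads = 256;
        const int blocks = (n + threads - 1) / threads;
        saxpy<<<blocks, threads>>>(n, a, dx, dy);
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(hy.data(), dy, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }

    cudaFree(dx);
    cudaFree(dy);

    // Compare element by element with a small tolerance.
    for (int i = 0; ok && i < n; ++i) {
        if (std::fabs(hy[i] - expected[i]) > 1e-5f) ok = false;
    }
    return ok;
}
```

Reading the two dumps side by side for a kernel this small makes it easier to see which PTX instruction each SASS opcode came from before moving on to real code. The host helper can be called from your own `main` if you also want to run the kernel.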


CUDA Instruction Encoding: Why SASS Carries Scheduling Metadata

TL;DR

CPU out-of-order cores resolve dependencies dynamically in hardware every cycle.
NVIDIA GPUs push much of the per-instruction dependency timing into compiler-generated metadata.
In disassembly tools, this often appears as control information such as wait masks, barrier indices, stall counts, yield hints, and reuse flags.
The exact bit layout is architecture-dependent, but the mental model is stable: the compiler pre-annotates hazards, and the warp scheduler executes cheaply.

Why This Topic Matters

When a kernel underperforms, we usually look at occupancy, memory coalescing, or instruction mix. But one layer lower, instruction encoding itself influences how efficiently the SM can issue instructions from each warp.


CUDA Register Mapping: From PTX to SASS

Introduction

Register allocation is one of the most critical aspects of GPU programming. On CPUs, the out-of-order execution engine hides inefficiencies through register renaming, dynamically managing hundreds of physical registers behind 16 visible ones. GPUs work differently: what the compiler assigns is what actually runs, with no dynamic safety net.
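Because the compiler's assignment is final, it helps to check what it actually assigned. The following is a minimal sketch under stated assumptions: the kernel `scale_add`, the struct `KernelResources`, the helper `query_kernel_resources`, and the file name `regs.cu` are all hypothetical names introduced for illustration, and `sm_80` is only an example target. It uses the CUDA runtime call `cudaFuncGetAttributes` to read back the per-thread register count and local memory size chosen at compile time.

```cuda
// regs.cu: hypothetical file name used only for this sketch.
//
// A compile-time report of registers and spills can be requested with:
//   nvcc -arch=sm_80 -Xptxas -v -c regs.cu
// sm_80 is an assumption; use the architecture of your own GPU.
#include <cuda_runtime.h>
#include <cstddef>

// __launch_bounds__ tells the compiler the maximum block size, which it
// can take into account when deciding how many registers to use.
__global__ void __launch_bounds__(256)
scale_add(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        y[i] = a * x[i] + y[i];
    }
}

// Hypothetical result type for this example.
struct KernelResources {
    bool ok;                  // false if the runtime query failed
    int regs_per_thread;      // registers assigned to each thread
    std::size_t local_bytes;  // per-thread local memory; a nonzero value
                              // often points to spills or local arrays
};

// Reads back what was fixed at compile time for scale_add.
KernelResources query_kernel_resources() {
    cudaFuncAttributes attr{};
    if (cudaFuncGetAttributes(&attr, scale_add) != cudaSuccess) {
        return {false, -1, 0};
    }
    return {true, attr.numRegs, attr.localSizeBytes};
}
```

Calling `query_kernel_resources()` from your own `main` and printing the fields gives the same numbers the hardware will use at launch, since nothing renames or reallocates registers at run time.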