CUDA Register Mapping: From PTX to SASS

Introduction

Register allocation is one of the most critical aspects of GPU programming. On CPUs, the hardware’s “out-of-order” execution engine hides inefficiencies through register renaming, dynamically managing hundreds of physical registers behind 16 visible ones. GPUs work differently: what the compiler assigns is what actually runs, with no dynamic safety net.

In this post, I’ll explain how CUDA registers flow from PTX (a virtual, portable ISA) → SASS (the actual hardware ISA), and why this two-layer mapping fundamentally changes how you should think about register usage. Understanding register mapping is essential for grasping CUDA occupancy, latency hiding, and performance optimization — themes we’ll explore in depth in future posts on PTX-to-SASS compilation.

CPU Registers as a Baseline

To understand GPU registers, it helps to first see how CPUs handle them, then contrast the two.

Architectural vs. Physical Registers

Modern CPUs expose a small set of architectural registers to software:

x86-64 Register Type	Count	Purpose
General-Purpose Registers (RAX, RBX, RCX, etc.)	16	Integer / pointer operations
Vector Registers (YMM, ZMM)	32	SIMD operations (AVX-512)
Special Registers (RIP, RSP, etc.)	Various	Program counter, stack pointer

But under the hood, a modern CPU core has far more physical registers, typically a couple hundred or more integer registers per core (for example, 224 on AMD Zen 4). These hidden registers let out-of-order execution break false dependencies and extract instruction-level parallelism (ILP).

Register Renaming: Why Hardware Needs More Than It Shows

Consider this code sequence:

ADD R1, R2, R3    ; R1 = R2 + R3
SUB R1, R4, R5    ; R1 = R4 - R5

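Both instructions write R1, so a machine without renaming could not let the SUB write its result until the ADD was done with R1. This is a false (write-after-write) dependency: the SUB never consumes the ADD's result, the two merely share a register name. CPU cores don't execute in strict program order; they use register renaming to decouple architectural names (R1) from physical storage: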

After renaming (internal CPU logic):
ADD P1, P2, P3    ; Physical P1 ← P2 + P3  (R1 → P1)
SUB P7, P4, P5    ; Physical P7 ← P4 - P5  (R1 → P7)

Now the CPU scheduler sees that SUB only depends on P4 and P5—not on P1—so both can execute in parallel on independent execution units. The ROB (Reorder Buffer) retires them in order later, updating architectural R1 correctly.

Key insight: Renaming is dynamic (happens at runtime) and transparent to software. Assembly code never mentions P1, P2, etc.; it always uses the 16 visible names (RAX–R15).
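
To make the mechanism concrete, here is a minimal sketch of a rename table in C++. It is a toy model, not how any particular CPU implements renaming: the names `RenameTable`, `rename_op`, and `rename_add_sub` are hypothetical, the tag numbers it hands out are arbitrary (they will not match P1 and P7 above), and it never recycles freed tags.

```cpp
// Toy rename table: 16 architectural registers, each mapped to a physical tag.
// Hypothetical model for illustration; real renamers also recycle freed tags.
struct RenameTable {
    int map[16];    // architectural index -> current physical tag
    int next_free;  // next unused physical tag

    constexpr RenameTable() : map{}, next_free(16) {
        for (int i = 0; i < 16; ++i) map[i] = i;  // tags 0..15 hold the initial state
    }

    // Physical tag that currently holds an architectural source register.
    constexpr int read(int arch) const { return map[arch]; }

    // Every write gets a fresh tag, which removes false (WAW/WAR) dependencies.
    constexpr int write(int arch) {
        map[arch] = next_free++;
        return map[arch];
    }
};

struct RenamedOp {
    int dst, src1, src2;  // physical tags
};

// Rename "dst = src1 op src2"; sources are looked up before dst is remapped.
constexpr RenamedOp rename_op(RenameTable& t, int dst, int src1, int src2) {
    int p1 = t.read(src1);
    int p2 = t.read(src2);
    int pd = t.write(dst);
    return RenamedOp{pd, p1, p2};
}

struct AddSubResult {
    RenamedOp add, sub;
};

// The ADD/SUB pair from the text: both write architectural R1.
constexpr AddSubResult rename_add_sub() {
    RenameTable t{};
    RenamedOp add = rename_op(t, 1, 2, 3);  // ADD R1, R2, R3
    RenamedOp sub = rename_op(t, 1, 4, 5);  // SUB R1, R4, R5
    return AddSubResult{add, sub};
}
```

In this model the ADD's destination becomes tag 16 and the SUB's becomes tag 17, while the SUB's sources stay on tags 4 and 5. The two instructions no longer share any physical register, which is the property that lets a scheduler issue them in parallel.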

How Partial Writes Complicate Things

x86-64 registers have an aliasing structure:

RAX (64-bit full)
├── EAX (32-bit lower half)
    ├── AX (16-bit lower quarter)
        ├── AH (bits 8–15)
        └── AL (bits 0–7)

If you write only the lower 16 bits (AX), the CPU must internally merge the new value with the untouched upper bits:

MOV AX, 0x1234        ; Write only lower 16 bits
ADD EAX, 1            ; Now read/write lower 32 bits (depends on old EAX!)

Internally, the CPU performs:

new_EAX = (old_EAX & 0xFFFF0000) | new_AX

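This creates an artificial dependency: the MOV overwrites only 16 bits, so its result has to be merged with whatever instruction last wrote RAX, and the ADD in turn has to wait for that merged value. Depending on the microarchitecture, the cost shows up as a dependency on the old register value or as an extra merge operation. Either way it limits out-of-order execution and reduces ILP, a penalty classically called the partial register stall.

Modern CPUs mitigate this with partial-register tracking, and 32-bit writes avoid the problem entirely because they zero the upper 32 bits of the 64-bit register. It remains a pitfall for 8- and 16-bit writes: prefer 32- or 64-bit operations when possible.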
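
The merge can be written out as plain C++ to show where the extra input comes from. This is only a sketch of the data flow; `write_ax` and `write_eax` are hypothetical helper names, not anything the hardware or a compiler exposes.

```cpp
#include <cstdint>

// Writing AX keeps bits 16..31 of the old EAX, so the result depends on old_eax.
constexpr std::uint32_t write_ax(std::uint32_t old_eax, std::uint16_t new_ax) {
    return (old_eax & 0xFFFF0000u) | new_ax;
}

// A full 32-bit write replaces everything; the old value is not an input.
constexpr std::uint32_t write_eax(std::uint32_t /*old_eax*/, std::uint32_t new_eax) {
    return new_eax;
}

static_assert(write_ax(0xDEADBEEFu, 0x1234) == 0xDEAD1234u, "upper half survives the write");
```

`write_ax` genuinely reads `old_eax`, whereas `write_eax` ignores it. That extra input is the artificial dependency described above.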

GPU Registers — A Fundamentally Different Model

Key Differences

GPU register allocation is simpler but more rigid:

Aspect	CPU (x86-64)	GPU (NVIDIA CUDA)
Architectural registers	16 GPRs	Up to 255 per thread (count chosen by the compiler)
Physical registers	A couple hundred hidden (integer)	32K–64K per SM (64K on recent GPUs), shared across all resident threads
Renaming	Dynamic, runtime	None; fixed at compile time
Register size	Mixed (8, 16, 32, 64 bit)	Fixed 32-bit
Aliasing	Yes (RAX → EAX → AX → AL)	No aliasing whatsoever
Per-thread allocation	N/A (one thread per core)	Fixed by compiler at kernel compile time
Spill consequence	Stack access, usually served from L1 cache	Local-memory loads and stores (cached, but backed by off-chip device memory)

GPU Register Hierarchies: Two ISA Levels

Unlike CPUs (which expose one ISA layer), CUDA has two:

PTX (Parallel Thread Execution): A virtual, portable ISA with an effectively unlimited supply of typed virtual registers (%r1, %f1, %rd1, ...). Compilers target PTX.
SASS (Streaming Assembly): The actual hardware ISA with a finite set of physical registers per thread (R0–R254 on current GPUs, plus the zero register RZ).

PTX: The Virtual Layer

When you write CUDA C++ and compile it with nvcc, the compiler generates PTX as an intermediate representation:

// PTX (virtual ISA)
ld.global.f32   %f1, [%rd1];      // Load a[i] from global memory
ld.global.f32   %f2, [%rd2];      // Load b[i] from global memory
add.f32         %f3, %f1, %f2;    // Add them
st.global.f32   [%rd3], %f3;      // Store result to global memory

PTX uses virtual registers (%f1, %f2, etc.) without a practical limit. This makes PTX portable: the same PTX can be compiled, ahead of time or by the driver's JIT, for any NVIDIA GPU whose compute capability is at least the one the PTX targets.

SASS: The Hardware Layer

When ptxas (the PTX assembler) compiles PTX to SASS for a specific GPU (e.g., Ampere, Hopper), it performs real register allocation:

// SASS (Ampere hardware ISA, simplified)
LDG.E R0, [R2.64] ;      // Load a[i] into R0 (64-bit address in the pair R2:R3)
LDG.E R5, [R4.64] ;      // Load b[i] into R5 (address in R4:R5)
FADD R0, R0, R5 ;        // R0 = a[i] + b[i]
STG.E [R6.64], R0 ;      // Store result from R0 (address in R6:R7)

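Now registers are real, finite resources: at most 255 per thread (R0–R254) on current GPUs, and usually far fewer are actually allocated. ptxas decides which virtual PTX registers map to which physical SASS registers, reusing a register once the value it held is dead.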

[PLACEHOLDER: Insert diagram showing PTX virtual registers (unlimited) → SASS physical registers (finite per SM)]

Relative Slot Assignment

Here’s the crucial insight: when ptxas emits SASS code with, say, R0, R2, R4, these are relative slot numbers within each thread’s register window, not absolute physical addresses.

Consider an SM with 64K registers running 2048 threads, where each thread gets 32 registers:

SM Register File: 65,536 32-bit registers (64K)
├── Thread 0:    R0–R31  (32 registers, absolute: 0–31)
├── Thread 1:    R0–R31  (relative), absolute: 32–63
├── Thread 2:    R0–R31  (relative), absolute: 64–95
├── ...
└── Thread 2047: R0–R31  (relative), absolute: 65,504–65,535

Each thread sees registers R0–R31. Conceptually, the hardware maps thread N’s relative register R2 to:

Physical address = base_register_for_thread_N + 2

The base is fixed when the thread’s warp is assigned to the SM, so the translation adds no per-instruction overhead. Real hardware allocates registers per warp and spreads them across banks, so the flat layout above is a simplification. The compiler only needs to know “this thread needs 32 registers total”; the hardware handles the addressing.
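
The mapping in the diagram above can be sketched as a one-line helper. It assumes the flat, per-thread layout used in this section, which is a conceptual simplification of the real register file. The function name `absolute_register` is hypothetical.

```cpp
#include <cstdint>

// Simplified model from the text: the register file is a flat array and each
// thread owns one contiguous window of regs_per_thread entries.
constexpr std::uint32_t absolute_register(std::uint32_t thread_index,
                                          std::uint32_t regs_per_thread,
                                          std::uint32_t relative_reg) {
    return thread_index * regs_per_thread + relative_reg;
}
```

With 32 registers per thread, `absolute_register(1, 32, 0)` is 32 and `absolute_register(2047, 32, 31)` is 65,535, matching the first and last rows of the diagram.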

GPU Register Properties: Simplicity by Design

Uniform 32-bit Size

All GPU registers are exactly 32 bits. No aliasing:

// PTX-style notation with simplified register names (recent SASS spells this IADD3)
add.s32 r1, r2, r3    // Add two 32-bit integers

This always:

Reads full 32-bit r2
Reads full 32-bit r3
Computes result
Writes full 32-bit r1

No merge logic. No partial-register penalties. No dependency ambiguity.

Lower-Precision Operations

For operations narrower than 32 bits, the full register is still used, but only the relevant bits participate:

add.s16 r1, r2, r3    ; Add two 16-bit integers (stored in 32-bit registers)

What the upper 16 bits hold afterwards depends on the instruction, so don’t rely on them unless the value has been explicitly sign- or zero-extended.

64-bit Values

64-bit operations use pairs of registers:

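add.s64 rd3, rd1, rd2    // rd3 = rd1 + rd2 (64-bit add, PTX-style notation)
// After register allocation each 64-bit value occupies an aligned pair of 32-bit
// registers, e.g. rd1 -> R2:R3 with R2 = lower 32 bits and R3 = upper 32 bits.
// The add itself becomes a 32-bit add followed by a 32-bit add-with-carry.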
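
A 64-bit add on register pairs can be emulated on the host to show how the two 32-bit halves interact. This is an illustrative sketch, not generated code. `RegPair`, `split`, `join`, and `add64` are hypothetical names. Because two's-complement addition produces the same bit pattern for signed and unsigned operands, the sketch uses unsigned types throughout.

```cpp
#include <cstdint>

// A 64-bit value held as two 32-bit registers, low half first.
struct RegPair {
    std::uint32_t lo;
    std::uint32_t hi;
};

constexpr RegPair split(std::uint64_t v) {
    return RegPair{static_cast<std::uint32_t>(v),
                   static_cast<std::uint32_t>(v >> 32)};
}

constexpr std::uint64_t join(RegPair p) {
    return (static_cast<std::uint64_t>(p.hi) << 32) | p.lo;
}

// 64-bit add built from a 32-bit add and a 32-bit add-with-carry.
constexpr RegPair add64(RegPair a, RegPair b) {
    std::uint32_t lo = a.lo + b.lo;               // wraps modulo 2^32
    std::uint32_t carry = (lo < a.lo) ? 1u : 0u;  // a wrapped sum means carry out
    std::uint32_t hi = a.hi + b.hi + carry;
    return RegPair{lo, hi};
}
```

For example, adding 1 to 0x00000001FFFFFFFF overflows the low register and carries into the high one, giving 0x0000000200000000.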

Data Packing

For sub-32-bit types (e.g., int8), multiple values can pack into one 32-bit register:

dp4a.s32.s32 r0, r1, r2, r3
// Dot product of two packed 4×int8 vectors, plus a 32-bit accumulator
// r0 = r3 + (r1[0]*r2[0] + r1[1]*r2[1] + r1[2]*r2[2] + r1[3]*r2[3])
// where r1 = [int8_0, int8_1, int8_2, int8_3] packed into one 32-bit register

This packing is powerful for efficiency but requires careful kernel design.
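
To see what the packed dot product computes, here is a host-side emulation of the signed variant. It is a sketch for checking results by hand, not a replacement for the hardware instruction. The helpers `lane_s8`, `pack_s8`, and `dp4a_s8` are hypothetical names. Placing lane 0 in the least significant byte is an assumption of this sketch, and the accumulation is assumed not to overflow a 32-bit signed integer.

```cpp
#include <cstdint>

// Extract signed byte i (0 = least significant) from a packed 32-bit word.
constexpr std::int32_t lane_s8(std::uint32_t packed, int i) {
    std::int32_t byte = static_cast<std::int32_t>((packed >> (8 * i)) & 0xFFu);
    return byte >= 128 ? byte - 256 : byte;  // sign-extend from 8 bits
}

// Pack four signed bytes into one 32-bit word, lane 0 in the low byte.
constexpr std::uint32_t pack_s8(int b0, int b1, int b2, int b3) {
    return (static_cast<std::uint32_t>(b0) & 0xFFu) |
           ((static_cast<std::uint32_t>(b1) & 0xFFu) << 8) |
           ((static_cast<std::uint32_t>(b2) & 0xFFu) << 16) |
           ((static_cast<std::uint32_t>(b3) & 0xFFu) << 24);
}

// Emulates the signed packed dot product: c + sum over i of a[i] * b[i].
constexpr std::int32_t dp4a_s8(std::uint32_t a, std::uint32_t b, std::int32_t c) {
    std::int32_t acc = c;
    for (int i = 0; i < 4; ++i) {
        acc += lane_s8(a, i) * lane_s8(b, i);
    }
    return acc;
}
```

As a check, `dp4a_s8(pack_s8(1, 2, 3, 4), pack_s8(5, 6, 7, 8), 10)` evaluates to 5 + 12 + 21 + 32 + 10 = 80. Eight int8 operands and one accumulator fit in three 32-bit registers.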

Register Pressure and Occupancy — The GPU Tradeoff

This is where GPU register allocation becomes a nuanced optimization problem.

The Core Tradeoff: Registers vs. Occupancy

When you write a kernel, the compiler (nvcc + ptxas) determines:

“This kernel needs N registers per thread.”

Say N = 32 registers/thread. On an Ampere A100 SM with 64K (65,536) 32-bit registers:

Max threads per SM = 65,536 / 32 = 2,048 threads
Max warps per SM = 65,536 / (32 threads × 32 registers) = 64 warps

That is exactly the A100’s hardware limit of 2,048 resident threads per SM, so at 32 registers/thread the register file is not the bottleneck. But if another kernel needs N = 64 registers/thread:

Max threads per SM = 65,536 / 64 = 1,024 threads
Max warps per SM = 65,536 / (32 threads × 64 registers) = 32 warps

Same hardware, but double the registers per thread → half the resident threads.
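
The division above can be wrapped in a small helper for experimenting with other register counts. This is a rough sketch that looks only at the register file. The register-file size and the hardware warp cap are parameters. The values 65,536 and 64 used in the note below are assumptions matching an A100-class SM. Allocation granularity, shared memory, and block-size limits are ignored. The function names are hypothetical.

```cpp
#include <cstdint>

constexpr std::uint32_t kWarpSize = 32;

// Upper bound on resident warps imposed by the register file alone, clamped
// to the hardware limit on resident warps. regs_per_thread must be nonzero.
constexpr std::uint32_t max_warps_by_registers(std::uint32_t regs_per_sm,
                                               std::uint32_t regs_per_thread,
                                               std::uint32_t hw_max_warps) {
    std::uint32_t by_regs = regs_per_sm / (regs_per_thread * kWarpSize);
    return by_regs < hw_max_warps ? by_regs : hw_max_warps;
}

// Same bound expressed in threads.
constexpr std::uint32_t max_threads_by_registers(std::uint32_t regs_per_sm,
                                                 std::uint32_t regs_per_thread,
                                                 std::uint32_t hw_max_warps) {
    return max_warps_by_registers(regs_per_sm, regs_per_thread, hw_max_warps) * kWarpSize;
}
```

With 65,536 registers and a 64-warp cap, 32 registers/thread yields 64 warps, 64 yields 32, 128 yields 16, and 255 yields 8. Dropping below 32 registers/thread gains nothing in this model because the warp cap takes over.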

Occupancy: Why It Matters

GPU latency hiding depends on having enough active warps to hide memory stalls. When a warp stalls (waiting for a load), the scheduler switches to another ready warp. If register pressure forces occupancy down, fewer warps are available, stalls aren’t hidden as well, and performance collapses.

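Registers/Thread	Threads/SM	Warps/SM (A100)	Latency Hiding	Expected Perf (illustrative)
32	2,048	64	Excellent	Baseline
64	1,024	32	Good	~80%
128	512	16	Poor	~40%
255 (max)	256	8	Very Poor	~10–20%

The thread and warp counts assume 65,536 registers per SM. The performance column is illustrative rather than measured, but it shows the principle: more registers per thread = fewer resident threads = less latency hiding, which for a latency-bound kernel can mean a 5–10× slowdown. The effect is not universal: kernels with plenty of instruction-level parallelism can tolerate lower occupancy, a point made in Volkov’s talk listed in the references.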

Register Spilling on GPU

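Unlike CPUs (where spills go to the stack and usually hit the L1 cache), GPU spills go to local memory, which despite its name lives in off-chip device memory and is only cached on chip. A spill that hits in cache is tolerable; one that goes all the way to DRAM is expensive:

A CPU L1 cache hit: ~4 cycles latency
A GPU local memory access served from off-chip DRAM: ~200–400 cycles latency

ptxas spills only when a kernel needs more registers than it is allowed, either because of the 255-register hardware limit or because a cap such as -maxrregcount or __launch_bounds__ was imposed. When it happens:

Extra load and store instructions are added around every spilled value
Those accesses compete with the kernel’s own traffic for cache and memory bandwidth
Performance can drop sharply, especially when the spills sit inside a hot loop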

Measuring Register Usage

Use nvcc flags to query register usage:

nvcc --ptx -o my_kernel.ptx my_kernel.cu
ptxas -v -arch=sm_80 my_kernel.ptx   # Prints registers, spills, and memory usage

Or directly:

nvcc --resource-usage my_kernel.cu

Output example (abridged and illustrative; the exact fields vary by CUDA version):

ptxas info    : 32 bytes gmem
ptxas info    : Compiling entry function '_Z11my_kernelPf' for 'sm_80'
ptxas info    : Function properties for _Z11my_kernelPf
    96 bytes stack frame, 96 bytes spill stores, 96 bytes spill loads
ptxas info    : Used 32 registers, ...

The “Used 32 registers” line tells you how many registers each thread uses. High register counts or any nonzero spill bytes are warning signs.

Practical Implications for Kernel Optimization

Write Register-Efficient Kernels

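Keep live ranges short: a register is needed for every value that is live at the same time, not for every named variable. With optimization enabled, the two snippets below normally compile to the same instructions, so collapsing temporaries by hand is not what saves registers.

// Four named temporaries, but each one dies as soon as the next is computed
float a = x + y;
float b = a * z;
float c = b - w;
result = c / 2;

// Same computation as one expression; same register demand after optimization
result = ((x + y) * z - w) / 2;

What does raise register pressure is keeping many values alive at once: aggressive loop unrolling, large per-thread arrays, and values computed early but consumed late.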

Use -maxrregcount to force the compiler to reduce register usage:
```
nvcc -maxrregcount=64 my_kernel.cu
```
ptxas will then keep every kernel in the file within 64 registers/thread, spilling to local memory if it has to. Use this tactically when you suspect register pressure is hurting occupancy, and re-check the spill counts afterwards.

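Check PTX and SASS: Always inspect what the compiler generated:

nvcc -ptx my_kernel.cu                                     # Writes my_kernel.ptx (the PTX layer)
nvcc -cubin -arch=sm_80 -o my_kernel.cubin my_kernel.cu    # Compile for one specific GPU
cuobjdump -sass my_kernel.cubin                            # Disassemble the cubin to SASS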

Profile occupancy: Use NVIDIA’s profiling tools to compare achieved occupancy with the theoretical maximum. Nsight Compute reports both per kernel; Nsight Systems gives the surrounding timeline.
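
The first tip in the list above turns on how many values are live at once, and that quantity can be computed directly from live ranges. The sketch below is a deliberately simple model and not what ptxas does internally. `LiveRange`, `max_live`, and the two example arrays are hypothetical, and instruction positions are just integers.

```cpp
#include <cstddef>

// Half-open live range [def, last_use): the value is produced at instruction
// `def` and is no longer needed from instruction `last_use` onward.
struct LiveRange {
    int def;
    int last_use;
};

// Register pressure = the largest number of ranges that overlap at any point.
// Simple O(n^2) scan: pressure can only peak where some range starts.
template <std::size_t N>
constexpr int max_live(const LiveRange (&ranges)[N]) {
    int peak = 0;
    for (std::size_t i = 0; i < N; ++i) {
        int live = 0;
        for (std::size_t j = 0; j < N; ++j) {
            if (ranges[j].def <= ranges[i].def && ranges[i].def < ranges[j].last_use) {
                ++live;
            }
        }
        if (live > peak) peak = live;
    }
    return peak;
}

// A chain where each temporary dies as soon as the next one is produced.
constexpr LiveRange kChained[] = {{0, 1}, {1, 2}, {2, 3}, {3, 4}};
// Four values computed up front and all consumed at the end.
constexpr LiveRange kHoisted[] = {{0, 4}, {1, 4}, {2, 4}, {3, 4}};
```

Both arrays describe four values, yet `max_live(kChained)` is 1 and `max_live(kHoisted)` is 4. The count of simultaneously live values drives register demand, not the count of variables.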

Interaction with Thread Block Size

Occupancy also depends on thread block size. For example:

Block size 128 threads, 64 regs/thread = 8,192 registers per block
Block size 256 threads, 64 regs/thread = 16,384 registers per block

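Registers are reserved for a whole block at once, so a block becomes resident only if all of its threads fit in what is left of the register file. Large blocks therefore fill the SM in coarse steps, and a block size that does not divide the available registers evenly leaves part of the file idle. In those cases several smaller blocks can keep more threads resident than one large block.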
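
The block-granularity effect is easy to check numerically. The helper below is a sketch that considers registers only. The 65,536-register file used in the note is an assumption, the function names are hypothetical, and shared memory, warp-level allocation granularity, and per-SM block and thread limits are ignored.

```cpp
#include <cstdint>

// How many whole blocks fit in the register file. A block is resident only
// if all of its threads fit, so the result is rounded down.
constexpr std::uint32_t blocks_per_sm_by_registers(std::uint32_t regs_per_sm,
                                                   std::uint32_t regs_per_thread,
                                                   std::uint32_t threads_per_block) {
    return regs_per_sm / (regs_per_thread * threads_per_block);
}

// Resident threads that result from that many whole blocks.
constexpr std::uint32_t resident_threads(std::uint32_t regs_per_sm,
                                         std::uint32_t regs_per_thread,
                                         std::uint32_t threads_per_block) {
    return blocks_per_sm_by_registers(regs_per_sm, regs_per_thread, threads_per_block) *
           threads_per_block;
}
```

At 64 registers/thread and 65,536 registers per SM, blocks of 128 and 256 threads both reach 1,024 resident threads (8 and 4 blocks respectively). A 768-thread block fits only once, which leaves a quarter of the register file unused.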

Compiler Optimization Flags

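-maxrregcount=N caps the registers per thread for every kernel in the compilation unit, trading possible spills for higher occupancy. -Xptxas -v prints per-kernel register, spill, and memory usage (nvcc --resource-usage reports much the same information). Use the second to measure, and the first only when the measurement shows that registers are what limits occupancy.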

Key Takeaways

Concept	CPU	GPU
Register allocation	Dynamic renaming hides complexity	Fixed compile-time allocation
Partial register ops	Penalty if mixed widths	No penalty; always full 32-bit
ISA levels	One (x86 asm)	Two (PTX → SASS)
Spill cost	Stack, usually an L1 hit (cheap)	Local memory: cached, but backed by off-chip memory (up to 200–400 cycles)
High register usage	Slightly faster per-thread	Far fewer active threads → lower occupancy → fewer latency-hiding opportunities → severe perf hit

The GPU register model is simpler than the CPU’s but demands discipline: once a kernel’s register count is what limits occupancy, every additional register reduces the number of threads that can be resident at the same time, and with it the GPU’s ability to hide latency. Understanding this tradeoff is fundamental to writing efficient CUDA code.

References

NVIDIA CUDA C++ Programming Guide: Occupancy
PTX ISA Documentation
NVIDIA Nsight Compute Documentation
Volkov, V. (2010). “Better Performance at Lower Occupancy”. GTC presentation on GPU register allocation and occupancy optimization.

Next Post: [PTX-to-SASS Compilation Pipeline] (Coming Soon)