CUDA SASS: Learning to Read NVIDIA's Native GPU ISA

Table of Contents

TL;DR

  • SASS is the real instruction stream executed by NVIDIA GPUs.
  • PTX is not the final hardware ISA. It is a virtual ISA that ptxas lowers into architecture-specific SASS.
  • The main things to learn first are: opcodes, registers, predicates, loads/stores, special registers, and modifiers.
  • SASS is where you see performance-critical details that source code hides: final opcode selection, register usage, spills, and memory instructions.
  • If PTX tells you the compiler’s intent, SASS tells you what the GPU will actually issue.

Why Learn SASS at All?

If you only write CUDA C++, it is tempting to stop at source code and trust the compiler. That works until performance becomes mysterious.

Typical examples:

  • you expected a fused multiply-add, but got separate multiply and add instructions,
  • your kernel suddenly spills after a tiny source change,
  • Nsight Compute reports long scoreboard stalls,
  • or a simple loop turns into much more code than you expected.

At that point, PTX helps, but SASS is the final truth.

SASS is the layer where you can answer questions like:

  • Which instruction did the compiler finally choose?
  • Did a value stay in registers or spill to local memory?
  • Did a load become scalar or vectorized?
  • Is the kernel mostly arithmetic, address generation, or memory traffic?

For a beginner, the goal is not to reverse engineer every bit of every instruction. The goal is much simpler:

learn to look at a SASS dump and understand what the warp is doing.

[PLACEHOLDER: Insert hero diagram showing CUDA C++ -> PTX -> ptxas -> SASS -> SM execution]


What SASS Is — and What It Is Not

SASS is

  • the native instruction set executed by NVIDIA GPUs,
  • architecture-specific,
  • the output of ptxas and binary code generation,
  • and the best place to inspect final instruction selection.

SASS is not

  • a stable portable ISA like PTX,
  • officially documented at the same level of completeness as PTX,
  • or a one-to-one textual reflection of CUDA source.

That distinction matters.

For example, if PTX shows:

add.f32 %f1, %f2, %f3;

the final SASS might be something like:

FADD R1, R2, R3;

But the final form can also include architecture-specific modifiers, different register allocation, different operand sources, or even a different instruction pattern if the compiler found a better lowering.

PTX tells you intent. SASS tells you implementation.


Anatomy of a SASS Instruction

A simple SASS line usually looks like this:

FADD.FTZ.RN R1, R2, R3;

Break it down:

  • FADD -> operation
  • .FTZ -> flush denormals to zero
  • .RN -> round to nearest
  • R1 -> destination register
  • R2, R3 -> source registers

The general pattern is:

OPCODE.MOD1.MOD2 DEST, SRC1, SRC2, ...;

For beginners, every instruction has three questions:

  1. What operation is this?
  2. What register or memory location does it read/write?
  3. Do the modifiers change precision, caching, width, or execution conditions?

That is enough to start reading real kernels.


PTX -> SASS Lowering: Same Intent, Different Surface

One of the best ways to learn SASS is to compare it with PTX.

Opcode selection

PTX is strongly typed, so opcode choice in SASS often reflects the data type being lowered:

PTXTypical SASS
add.s32IADD
add.f32FADD
mad.lo.s32IMAD
fma.rn.f32FFMA

Example:

// PTX
add.s32 %r1, %r2, %r3;
add.f32 %f1, %f2, %f3;
// SASS
IADD R1, R2, R3;
FADD R4, R5, R6;

This is the first big lesson: hardware instructions are less “typed-looking” than PTX, but the operation still implies how operands are interpreted.

Registers are untyped at the hardware level

SASS general registers are usually shown as plain names like:

R0, R1, R2

Unlike PTX, the type is not encoded in the register name.

The same register can hold bits that one instruction interprets as integer, while another interprets as floating point:

IADD R1, R2, R3;
FADD R1, R2, R3;

The register itself did not become “int” or “float.” The instruction defines how the bits are used.

That is very different from PTX, where %r and %f visually suggest different typed register classes.

Special Registers and S2R

Threads need identity information before they can do anything useful.

At the source level, you write things like:

  • threadIdx.x
  • blockIdx.x
  • blockDim.x
  • laneId-style queries

At the SASS level, those values are read from special registers.

Representative examples:

Special registerCUDA meaning
SR_TID_XthreadIdx.x
SR_CTAID_XblockIdx.x
SR_NTID_XblockDim.x
SR_LANEIDlane ID inside a warp
SR_CLOCK / SR_CLOCK64cycle counter

The instruction you will commonly see is:

S2R R0, SR_TID_X;
S2R R1, SR_CTAID_X;

Read that as:

  • move thread ID into R0,
  • move block ID into R1.

These instructions often show up near the top of the kernel because the compiler first needs to establish the thread’s global index before it can load or store anything.

Why S2R matters for performance reading

  • it tells you where the thread indexing logic starts,
  • it reminds you that indexing values are not “free,”
  • and it helps you separate kernel prologue from kernel body.

When you are new to SASS, recognizing S2R is one of the easiest ways to orient yourself in the first dozen instructions.


Common SASS Instructions

If you are new to SASS, learn these first.

Arithmetic

IADD R1, R2, R3;
IMAD R4, R5, R6, R7;
FADD R8, R9, R10;
FFMA R11, R12, R13, R14;
  • IADD -> integer add
  • IMAD -> integer multiply-add, often used for address math
  • FADD -> floating-point add
  • FFMA -> floating-point fused multiply-add

For performance work, FFMA is one of the most important opcodes to recognize because dense numeric kernels often want as many fused multiply-adds as possible.

Data Movement

// PTX
mov.u32 %r1, %r2;
ld.global.u32 %r3, [%rd1];
st.global.u32 [%rd2], %r3;
// SASS
MOV R1, R2;
LDG.E R3, [R4];
STG.E [R6], R3;
  • MOV -> register move or immediate materialization
  • LDG -> load from global memory
  • STG -> store to global memory

Common memory opcodes

InstructionMeaningTypical memory space
LDGload globaldevice memory
STGstore globaldevice memory
LDSload sharedshared memory
STSstore sharedshared memory
LDCload constantconstant memory
LDLload localthread-local memory, often spills
STLstore localthread-local memory, often spills

If you only memorize one performance heuristic from this table, make it this one:

seeing LDL or STL in a hot path often means register spilling.

Why IMAD Shows Up Everywhere

Beginners often expect IMAD to appear only in integer-heavy code. In practice, it appears constantly because address computation is integer math.

For example:

addr = base + index * sizeof(float)

often lowers into a multiply-add style instruction sequence. A representative SASS idiom is:

IMAD.WIDE R2, R3, 4, R5;

Conceptually, this means:

R2 = R5 + R3 * 4

with wider address precision than a plain 32-bit integer add.

So if you open a kernel and keep seeing IMAD, do not assume the algorithm is doing “integer compute.” Very often, the compiler is just building addresses.


Instruction Modifiers: Small Suffixes, Big Meaning

In SASS, modifiers often carry the details that PTX made explicit with types or extra syntax.

Floating-point behavior

FADD.FTZ.RN R1, R2, R3;
  • .FTZ -> flush denormals to zero
  • .RN -> round to nearest

Width and addressing

IMAD.WIDE R2, R3, 4, R5;
  • .WIDE usually means the result or address path is wider than a normal 32-bit arithmetic result

Caching behavior

Representative examples:

LDG.E  R1, [R2];
LDG.CG R1, [R2];
LDG.CS R1, [R2];
  • .E -> normal cached global load form commonly seen in disassembly
  • .CG -> cache behavior biased toward global caching policy, often associated with L2-oriented behavior
  • .CS -> streaming-style hint, typically used when reuse is expected to be low

The exact cache implications can vary across architectures, so treat these as important hints, not as universal fixed rules.

Predicate modifiers

@P0 IADD R1, R2, R3;
@!P0 IADD R1, R2, R3;

These do not change the arithmetic operation itself. They change whether the instruction is active for a thread.

Operand reuse hints

You may see forms like:

FADD R1, R2.reuse, R3;

This is a low-level performance hint tied to how the hardware reuses recently accessed operands.

For a beginner, the safe takeaway is simple:

if you see .reuse, the compiler is trying to reduce register-file pressure or improve operand delivery efficiency.

Do not worry about hand-tuning it yet. Just recognize it.


Predicates: GPU “If” Logic Without Always Taking a Branch

Predication is central to GPU control flow.

The important idea is:

  • predicate values are computed per thread,
  • the warp still issues instructions in lockstep,
  • and an instruction can be conditionally enabled or masked based on a predicate.

Setting a predicate

// PTX
setp.gt.s32 %p1, %r1, %r2;
// SASS
ISETP.GT.AND P0, PT, R1, R2, PT;

You can read this as:

set P0 if R1 > R2

Executing under a predicate

@P0 IADD R3, R3, 1;
@!P0 MOV R3, R5;
  • @P0 -> execute only when predicate is true
  • @!P0 -> execute only when predicate is false

That makes predication look branchless, but there is an important nuance.

When a predicate is false, the instruction is not magically absent from the instruction stream. The warp still fetches and issues it, but the inactive lanes are masked.

So predication can reduce branch overhead for small control-flow regions, but it does not automatically solve memory divergence or cache inefficiency.

SEL: a compact selection idiom

// PTX
selp.s32 %r1, %r2, %r3, %p1;
// Representative SASS form
SEL R1, R2, R3, P0;

This is conceptually a hardware mux:

  • if predicate is true -> choose first value,
  • else -> choose second value.

It is a useful pattern because it often replaces multiple predicated instructions with one instruction.


Branching and Reconvergence

Sometimes the compiler uses real branches instead of pure predication.

At the PTX level, a simple conditional branch can look like this:

@%p1 bra LABEL;

In SASS, the branch itself is simple:

@P0 BRA LABEL;

But on GPUs, divergence matters because not every lane in a warp may take the same path.

That is why you may see reconvergence-related control instructions in real disassembly. The exact mechanism and visible mnemonics vary by architecture, but the core idea is stable:

the hardware needs a way to split and later rejoin warp control flow.

For a beginner, do not obsess over every branch-management instruction on day one. First identify:

  • where the branch condition is formed,
  • which instructions are guarded by it,
  • and where the paths come back together.

Memory Access: The Part That Usually Dominates

Many CUDA kernels are limited less by arithmetic and more by moving data.

That becomes obvious in SASS.

A simple example

// PTX
ld.global.u32 %r1, [%rd1];
ld.shared.u32 %r2, [%r3];
// SASS
LDG.E R1, [R2];
LDS R3, [R4];

Common access forms you will see

FormMeaning
c[bank][offset]constant memory access
[R+offset]address from a general register plus offset
[R] with LDG/STGglobal or generic memory access
[R] with LDS/STSshared memory access

Two beginner-friendly rules matter here:

  1. Address generation is real work. A lot of instructions in simple kernels are just computing addresses.
  2. Memory instructions are often the real bottleneck. A kernel can have very little actual math compared with all the loads, stores, and pointer arithmetic around it.

That is why reading SASS often changes how you think about a kernel. A “simple add kernel” may spend most of its instruction budget on indexing and memory traffic.


A Minimal “How to Read a Kernel” Example

Suppose you are looking at a vector add kernel. A simplified SASS reading strategy is:

  1. find the S2R instructions,
  2. identify the global index computation,
  3. find the bounds check,
  4. locate the two LDG instructions,
  5. locate the FADD,
  6. locate the final STG.

That is the whole story:

  • who am I? -> S2R
  • which element do I own? -> address math
  • am I in bounds? -> predicate/branch
  • load inputs -> LDG
  • compute -> FADD
  • write result -> STG

This is how you should train your eye. Not by reading every instruction equally, but by finding the semantic skeleton first.

[PLACEHOLDER: Insert annotated SASS screenshot with arrows labeling prologue, bounds check, loads, compute, and store]


What to Look for in Profiling Tools

Reading SASS is much more useful when you connect it to profiler evidence.

Nsight Compute

Use Nsight Compute when you want to answer:

  • is the kernel memory-bound or compute-bound?
  • did register pressure reduce occupancy?
  • are there spills?
  • are long scoreboard stalls tied to specific loads?

Good things to inspect:

  • Source / SASS correlation -> map source, PTX, and SASS together
  • Registers per thread -> helps explain occupancy limits
  • Local memory usage -> often confirms spills when you also see LDL / STL
  • Warp stall reasons -> especially long scoreboard, memory dependency, execution dependency
  • Memory workload analysis -> tells you whether the LDG/STG heavy region is the real bottleneck

Practical reading pattern

If Nsight Compute says:

  • high long scoreboard stalls,
  • low arithmetic throughput,
  • and low L1/L2 efficiency,

then your SASS investigation should start with loads, address math, and data reuse, not with the arithmetic opcodes.

Nsight Systems

Use Nsight Systems for a higher-level view:

  • are kernels too small?
  • is launch overhead dominating?
  • is stream overlap missing?
  • is host-device synchronization breaking concurrency?

Nsight Systems will not replace SASS reading, but it tells you whether the kernel is even the right thing to optimize first.

[PLACEHOLDER: Insert Nsight Compute screenshot showing source/PTX/SASS correlation and stall metrics]


Final Takeaway

SASS can look intimidating because it is dense, architecture-specific, and not explained with the same level of official documentation as PTX.

But the beginner version is manageable.

You do not need to start with scheduler internals, control codes, or binary encodings.

Start with this smaller goal:

recognize the thread index setup, the predicate logic, the loads, the math, and the store.

Once you can do that reliably, later topics become much easier:

  • register pressure,
  • instruction modifiers in depth,
  • memory coalescing as seen in SASS,
  • operand reuse,
  • and eventually scheduling and scoreboarding.

SASS stops being “mysterious machine noise” once you realize that most kernels repeat the same skeleton over and over.

The only thing that changes is how much work the compiler wraps around the core computation.

Tags

Related Posts

Cuda Microarchitecture Gpu

CUDA PTX: Learning to Read NVIDIA's Virtual ISA

TL;DR PTX is not the real hardware ISA. It is NVIDIA’s virtual instruction set that sits between CUDA C++ and SASS. PTX is the best layer for learning how the compiler thinks about types, addresses, predicates, and memory spaces. SASS is where architecture-specific details appear: actual opcodes, scheduling metadata, scoreboard behavior, and pipeline usage. If you can read PTX, you can usually answer: what computation is happening, what memory space it touches, and why the compiler generated a certain structure. If you want to optimize the last 20%, you eventually need to correlate PTX with SASS and profiler data. CPU Baseline: Why GPUs Need a Virtual ISA Layer On CPUs, most people think in terms of:

Cuda Microarchitecture

CUDA Register Mapping: From PTX to SASS

Introduction Register allocation is one of the most critical aspects of GPU programming. On CPUs, the hardware’s “out-of-order” execution engine hides inefficiencies through register renaming, dynamically managing hundreds of physical registers behind 16 visible ones. GPUs work differently: what the compiler assigns is what actually runs, with no dynamic safety net.