CUDA SASS: Learning to Read NVIDIA's Native GPU ISA
Table of Contents
TL;DR
- SASS is the real instruction stream executed by NVIDIA GPUs.
- PTX is not the final hardware ISA. It is a virtual ISA that
ptxaslowers into architecture-specific SASS. - The main things to learn first are: opcodes, registers, predicates, loads/stores, special registers, and modifiers.
- SASS is where you see performance-critical details that source code hides: final opcode selection, register usage, spills, and memory instructions.
- If PTX tells you the compiler’s intent, SASS tells you what the GPU will actually issue.
Why Learn SASS at All?
If you only write CUDA C++, it is tempting to stop at source code and trust the compiler. That works until performance becomes mysterious.
Typical examples:
- you expected a fused multiply-add, but got separate multiply and add instructions,
- your kernel suddenly spills after a tiny source change,
- Nsight Compute reports long scoreboard stalls,
- or a simple loop turns into much more code than you expected.
At that point, PTX helps, but SASS is the final truth.
SASS is the layer where you can answer questions like:
- Which instruction did the compiler finally choose?
- Did a value stay in registers or spill to local memory?
- Did a load become scalar or vectorized?
- Is the kernel mostly arithmetic, address generation, or memory traffic?
For a beginner, the goal is not to reverse engineer every bit of every instruction. The goal is much simpler:
learn to look at a SASS dump and understand what the warp is doing.
[PLACEHOLDER: Insert hero diagram showing CUDA C++ -> PTX -> ptxas -> SASS -> SM execution]
What SASS Is — and What It Is Not
SASS is
- the native instruction set executed by NVIDIA GPUs,
- architecture-specific,
- the output of
ptxasand binary code generation, - and the best place to inspect final instruction selection.
SASS is not
- a stable portable ISA like PTX,
- officially documented at the same level of completeness as PTX,
- or a one-to-one textual reflection of CUDA source.
That distinction matters.
For example, if PTX shows:
add.f32 %f1, %f2, %f3;
the final SASS might be something like:
FADD R1, R2, R3;
But the final form can also include architecture-specific modifiers, different register allocation, different operand sources, or even a different instruction pattern if the compiler found a better lowering.
PTX tells you intent. SASS tells you implementation.
Anatomy of a SASS Instruction
A simple SASS line usually looks like this:
FADD.FTZ.RN R1, R2, R3;
Break it down:
FADD-> operation.FTZ-> flush denormals to zero.RN-> round to nearestR1-> destination registerR2,R3-> source registers
The general pattern is:
OPCODE.MOD1.MOD2 DEST, SRC1, SRC2, ...;
For beginners, every instruction has three questions:
- What operation is this?
- What register or memory location does it read/write?
- Do the modifiers change precision, caching, width, or execution conditions?
That is enough to start reading real kernels.
PTX -> SASS Lowering: Same Intent, Different Surface
One of the best ways to learn SASS is to compare it with PTX.
Opcode selection
PTX is strongly typed, so opcode choice in SASS often reflects the data type being lowered:
| PTX | Typical SASS |
|---|---|
add.s32 | IADD |
add.f32 | FADD |
mad.lo.s32 | IMAD |
fma.rn.f32 | FFMA |
Example:
// PTX
add.s32 %r1, %r2, %r3;
add.f32 %f1, %f2, %f3;
// SASS
IADD R1, R2, R3;
FADD R4, R5, R6;
This is the first big lesson: hardware instructions are less “typed-looking” than PTX, but the operation still implies how operands are interpreted.
Registers are untyped at the hardware level
SASS general registers are usually shown as plain names like:
R0, R1, R2
Unlike PTX, the type is not encoded in the register name.
The same register can hold bits that one instruction interprets as integer, while another interprets as floating point:
IADD R1, R2, R3;
FADD R1, R2, R3;
The register itself did not become “int” or “float.” The instruction defines how the bits are used.
That is very different from PTX, where %r and %f visually suggest different typed register classes.
Special Registers and S2R
Threads need identity information before they can do anything useful.
At the source level, you write things like:
threadIdx.xblockIdx.xblockDim.xlaneId-style queries
At the SASS level, those values are read from special registers.
Representative examples:
| Special register | CUDA meaning |
|---|---|
SR_TID_X | threadIdx.x |
SR_CTAID_X | blockIdx.x |
SR_NTID_X | blockDim.x |
SR_LANEID | lane ID inside a warp |
SR_CLOCK / SR_CLOCK64 | cycle counter |
The instruction you will commonly see is:
S2R R0, SR_TID_X;
S2R R1, SR_CTAID_X;
Read that as:
- move thread ID into
R0, - move block ID into
R1.
These instructions often show up near the top of the kernel because the compiler first needs to establish the thread’s global index before it can load or store anything.
Why S2R matters for performance reading
- it tells you where the thread indexing logic starts,
- it reminds you that indexing values are not “free,”
- and it helps you separate kernel prologue from kernel body.
When you are new to SASS, recognizing S2R is one of the easiest ways to orient yourself in the first dozen instructions.
Common SASS Instructions
If you are new to SASS, learn these first.
Arithmetic
IADD R1, R2, R3;
IMAD R4, R5, R6, R7;
FADD R8, R9, R10;
FFMA R11, R12, R13, R14;
IADD-> integer addIMAD-> integer multiply-add, often used for address mathFADD-> floating-point addFFMA-> floating-point fused multiply-add
For performance work, FFMA is one of the most important opcodes to recognize because dense numeric kernels often want as many fused multiply-adds as possible.
Data Movement
// PTX
mov.u32 %r1, %r2;
ld.global.u32 %r3, [%rd1];
st.global.u32 [%rd2], %r3;
// SASS
MOV R1, R2;
LDG.E R3, [R4];
STG.E [R6], R3;
MOV-> register move or immediate materializationLDG-> load from global memorySTG-> store to global memory
Common memory opcodes
| Instruction | Meaning | Typical memory space |
|---|---|---|
LDG | load global | device memory |
STG | store global | device memory |
LDS | load shared | shared memory |
STS | store shared | shared memory |
LDC | load constant | constant memory |
LDL | load local | thread-local memory, often spills |
STL | store local | thread-local memory, often spills |
If you only memorize one performance heuristic from this table, make it this one:
seeing
LDLorSTLin a hot path often means register spilling.
Why IMAD Shows Up Everywhere
Beginners often expect IMAD to appear only in integer-heavy code. In practice, it appears constantly because address computation is integer math.
For example:
addr = base + index * sizeof(float)
often lowers into a multiply-add style instruction sequence. A representative SASS idiom is:
IMAD.WIDE R2, R3, 4, R5;
Conceptually, this means:
R2 = R5 + R3 * 4
with wider address precision than a plain 32-bit integer add.
So if you open a kernel and keep seeing IMAD, do not assume the algorithm is doing “integer compute.” Very often, the compiler is just building addresses.
Instruction Modifiers: Small Suffixes, Big Meaning
In SASS, modifiers often carry the details that PTX made explicit with types or extra syntax.
Floating-point behavior
FADD.FTZ.RN R1, R2, R3;
.FTZ-> flush denormals to zero.RN-> round to nearest
Width and addressing
IMAD.WIDE R2, R3, 4, R5;
.WIDEusually means the result or address path is wider than a normal 32-bit arithmetic result
Caching behavior
Representative examples:
LDG.E R1, [R2];
LDG.CG R1, [R2];
LDG.CS R1, [R2];
.E-> normal cached global load form commonly seen in disassembly.CG-> cache behavior biased toward global caching policy, often associated with L2-oriented behavior.CS-> streaming-style hint, typically used when reuse is expected to be low
The exact cache implications can vary across architectures, so treat these as important hints, not as universal fixed rules.
Predicate modifiers
@P0 IADD R1, R2, R3;
@!P0 IADD R1, R2, R3;
These do not change the arithmetic operation itself. They change whether the instruction is active for a thread.
Operand reuse hints
You may see forms like:
FADD R1, R2.reuse, R3;
This is a low-level performance hint tied to how the hardware reuses recently accessed operands.
For a beginner, the safe takeaway is simple:
if you see
.reuse, the compiler is trying to reduce register-file pressure or improve operand delivery efficiency.
Do not worry about hand-tuning it yet. Just recognize it.
Predicates: GPU “If” Logic Without Always Taking a Branch
Predication is central to GPU control flow.
The important idea is:
- predicate values are computed per thread,
- the warp still issues instructions in lockstep,
- and an instruction can be conditionally enabled or masked based on a predicate.
Setting a predicate
// PTX
setp.gt.s32 %p1, %r1, %r2;
// SASS
ISETP.GT.AND P0, PT, R1, R2, PT;
You can read this as:
set
P0ifR1 > R2
Executing under a predicate
@P0 IADD R3, R3, 1;
@!P0 MOV R3, R5;
@P0-> execute only when predicate is true@!P0-> execute only when predicate is false
That makes predication look branchless, but there is an important nuance.
When a predicate is false, the instruction is not magically absent from the instruction stream. The warp still fetches and issues it, but the inactive lanes are masked.
So predication can reduce branch overhead for small control-flow regions, but it does not automatically solve memory divergence or cache inefficiency.
SEL: a compact selection idiom
// PTX
selp.s32 %r1, %r2, %r3, %p1;
// Representative SASS form
SEL R1, R2, R3, P0;
This is conceptually a hardware mux:
- if predicate is true -> choose first value,
- else -> choose second value.
It is a useful pattern because it often replaces multiple predicated instructions with one instruction.
Branching and Reconvergence
Sometimes the compiler uses real branches instead of pure predication.
At the PTX level, a simple conditional branch can look like this:
@%p1 bra LABEL;
In SASS, the branch itself is simple:
@P0 BRA LABEL;
But on GPUs, divergence matters because not every lane in a warp may take the same path.
That is why you may see reconvergence-related control instructions in real disassembly. The exact mechanism and visible mnemonics vary by architecture, but the core idea is stable:
the hardware needs a way to split and later rejoin warp control flow.
For a beginner, do not obsess over every branch-management instruction on day one. First identify:
- where the branch condition is formed,
- which instructions are guarded by it,
- and where the paths come back together.
Memory Access: The Part That Usually Dominates
Many CUDA kernels are limited less by arithmetic and more by moving data.
That becomes obvious in SASS.
A simple example
// PTX
ld.global.u32 %r1, [%rd1];
ld.shared.u32 %r2, [%r3];
// SASS
LDG.E R1, [R2];
LDS R3, [R4];
Common access forms you will see
| Form | Meaning |
|---|---|
c[bank][offset] | constant memory access |
[R+offset] | address from a general register plus offset |
[R] with LDG/STG | global or generic memory access |
[R] with LDS/STS | shared memory access |
Two beginner-friendly rules matter here:
- Address generation is real work. A lot of instructions in simple kernels are just computing addresses.
- Memory instructions are often the real bottleneck. A kernel can have very little actual math compared with all the loads, stores, and pointer arithmetic around it.
That is why reading SASS often changes how you think about a kernel. A “simple add kernel” may spend most of its instruction budget on indexing and memory traffic.
A Minimal “How to Read a Kernel” Example
Suppose you are looking at a vector add kernel. A simplified SASS reading strategy is:
- find the
S2Rinstructions, - identify the global index computation,
- find the bounds check,
- locate the two
LDGinstructions, - locate the
FADD, - locate the final
STG.
That is the whole story:
- who am I? ->
S2R - which element do I own? -> address math
- am I in bounds? -> predicate/branch
- load inputs ->
LDG - compute ->
FADD - write result ->
STG
This is how you should train your eye. Not by reading every instruction equally, but by finding the semantic skeleton first.
[PLACEHOLDER: Insert annotated SASS screenshot with arrows labeling prologue, bounds check, loads, compute, and store]
What to Look for in Profiling Tools
Reading SASS is much more useful when you connect it to profiler evidence.
Nsight Compute
Use Nsight Compute when you want to answer:
- is the kernel memory-bound or compute-bound?
- did register pressure reduce occupancy?
- are there spills?
- are long scoreboard stalls tied to specific loads?
Good things to inspect:
- Source / SASS correlation -> map source, PTX, and SASS together
- Registers per thread -> helps explain occupancy limits
- Local memory usage -> often confirms spills when you also see
LDL/STL - Warp stall reasons -> especially long scoreboard, memory dependency, execution dependency
- Memory workload analysis -> tells you whether the
LDG/STGheavy region is the real bottleneck
Practical reading pattern
If Nsight Compute says:
- high long scoreboard stalls,
- low arithmetic throughput,
- and low L1/L2 efficiency,
then your SASS investigation should start with loads, address math, and data reuse, not with the arithmetic opcodes.
Nsight Systems
Use Nsight Systems for a higher-level view:
- are kernels too small?
- is launch overhead dominating?
- is stream overlap missing?
- is host-device synchronization breaking concurrency?
Nsight Systems will not replace SASS reading, but it tells you whether the kernel is even the right thing to optimize first.
[PLACEHOLDER: Insert Nsight Compute screenshot showing source/PTX/SASS correlation and stall metrics]
Final Takeaway
SASS can look intimidating because it is dense, architecture-specific, and not explained with the same level of official documentation as PTX.
But the beginner version is manageable.
You do not need to start with scheduler internals, control codes, or binary encodings.
Start with this smaller goal:
recognize the thread index setup, the predicate logic, the loads, the math, and the store.
Once you can do that reliably, later topics become much easier:
- register pressure,
- instruction modifiers in depth,
- memory coalescing as seen in SASS,
- operand reuse,
- and eventually scheduling and scoreboarding.
SASS stops being “mysterious machine noise” once you realize that most kernels repeat the same skeleton over and over.
The only thing that changes is how much work the compiler wraps around the core computation.