CUDA PTX: Learning to Read NVIDIA's Virtual ISA

Table of Contents

TL;DR

  • PTX is not the real hardware ISA. It is NVIDIA’s virtual instruction set that sits between CUDA C++ and SASS.
  • PTX is the best layer for learning how the compiler thinks about types, addresses, predicates, and memory spaces.
  • SASS is where architecture-specific details appear: actual opcodes, scheduling metadata, scoreboard behavior, and pipeline usage.
  • If you can read PTX, you can usually answer: what computation is happening, what memory space it touches, and why the compiler generated a certain structure.
  • If you want to optimize the last 20%, you eventually need to correlate PTX with SASS and profiler data.

CPU Baseline: Why GPUs Need a Virtual ISA Layer

On CPUs, most people think in terms of:

C/C++ -> x86-64 or ARM64 assembly -> hardware

On NVIDIA GPUs, there is an extra layer:

CUDA C++ -> PTX -> SASS -> hardware

That extra layer exists for a practical reason: portability.

PTX is stable across GPU generations. SASS is not.

PTX lets NVIDIA keep a stable compiler target while still generating different hardware instructions for Volta, Ampere, Hopper, Blackwell, or whatever comes next.

That is why PTX feels a bit like LLVM IR for CUDA developers: not exactly source, not exactly machine code, but the most useful middle ground.


What is PTX?

PTX stands for Parallel Thread Execution. It is a virtual instruction set architecture (ISA) designed by NVIDIA to serve as an intermediate representation for CUDA programs. When you write a CUDA kernel in C++, the NVIDIA compiler (nvcc) compiles it down to PTX code. This PTX code is then further compiled by ptxas into SASS (the actual assembly language for NVIDIA GPUs) that runs on the hardware. Think of PTX as the point where CUDA stops being “high-level C++ with kernels” and starts becoming a GPU program description.

CUDA Compilation Pipeline
CUDA Compilation Pipeline: C++ -> PTX -> SASS -> Hardware

Why Start with PTX?

Coming from CUDA kernel development, PTX is still close enough to the source code to be readable, while already exposing the machine-oriented ideas that matter on GPUs:

  • explicit data types
  • explicit state spaces
  • predicated execution
  • address calculations
  • and the compiler’s lowered control flow

What PTX Is — and What It Is Not

PTX is

  • a virtual ISA,
  • strongly typed,
  • explicit about memory spaces,
  • readable enough for humans,
  • and designed to be lowered into hardware-specific SASS.

PTX is not

  • the exact instruction stream that runs on the SM,
  • a guarantee of final opcode selection,
  • a guarantee of final register count,
  • or a guarantee of final scheduling behavior.

That distinction matters.

If you see this in PTX:

mul.lo.s32 %r4, %r2, %r3;

you know the compiler wants an integer multiply with the low 32 bits of the result.

You do not yet know:

  • the exact SASS opcode,
  • whether it shares a pipe with FP32 on the target GPU,
  • how many physical registers survive allocation,
  • or what latency-hiding behavior the final kernel exhibits.

PTX tells you the intent. SASS tells you the implementation.


Anatomy of a PTX Instruction

The general pattern is:

opcode[.modifier].type   destination, source1, source2, ...;

Example:

add.s32 %r0, %r1, %r2;

Break it down:

  • add -> operation
  • .s32 -> signed 32-bit integer type
  • %r0 -> destination
  • %r1, %r2 -> input operands

Another example:

fma.rn.f32 %f4, %f1, %f2, %f3;
  • fma -> fused multiply-add
  • .rn -> round-to-nearest
  • .f32 -> 32-bit floating point

PTX is explicit by design. Unlike C++, there are no silent promotions hiding inside the instruction text.


PTX Register Names and Data Types

PTX uses virtual registers. They are not the final physical registers of the GPU.

You will commonly see names like these:

PTX name styleTypical meaning
%r32-bit integer register
%rd64-bit integer register
%f32-bit float register
%fd64-bit float register
%ppredicate register

Example declarations in handwritten PTX look like this:

.reg .b32  %r<6>;
.reg .b64  %rd<6>;
.reg .f32  %f<5>;
.reg .pred %p<2>;

Two important points:

  1. These are virtual registers in PTX.
  2. Final physical allocation happens later in ptxas.

That means a PTX file can look like it uses many registers, yet the final SASS allocation may differ substantially after optimization, dead code removal, copy coalescing, and spilling decisions.

CPU contrast

On x86-64, the assembly names you see are architectural register names such as rax, rbx, xmm0, ymm1.

On PTX, the names are closer to compiler temporaries.

This is one of the biggest mindset shifts when moving from CPU assembly to GPU IR.

The Most Important PTX Types

PTX is typed at the instruction level.

SuffixMeaning
.s32signed 32-bit integer
.u32unsigned 32-bit integer
.s64signed 64-bit integer
.u64unsigned 64-bit integer
.f3232-bit float
.f6464-bit float
.predpredicate value
.b16/.b32/.b64raw bit containers

Examples:

add.s32    %r0, %r1, %r2;
add.f32    %f0, %f1, %f2;
mul.wide.u32 %rd0, %r1, %r2;

No implicit mixed-type arithmetic

If types do not match, PTX usually makes the conversion visible:

cvt.rn.f32.s32 %f0, %r0;

That explicitness is one reason PTX is so helpful when you are debugging codegen.

If your kernel suddenly contains more conversions than expected, PTX often shows the problem immediately.


State Spaces: PTX Makes Memory Explicit

One of the best parts of PTX is that it does not let memory hide behind generic pointer syntax.

The big state spaces to remember are:

State spaceMeaning
.paramkernel or function parameters
.globaldevice memory visible across the grid
.sharedon-chip shared memory visible within a block
.localthread-private memory, often spills or stack-like storage
.constconstant memory

Examples:

ld.param.u32    %r1, [kernel_param_0];
ld.global.f32   %f1, [%rd1];
st.global.f32   [%rd2], %f2;
ld.shared.u32   %r2, [%rd3];
st.shared.u32   [%rd4], %r5;

This is already much more informative than a high-level pointer dereference.

If you see ld.local, that should immediately raise a question:

Is this true thread-local storage, or did register pressure cause a spill?

That single PTX clue often explains performance problems before you even open a profiler.

Address Calculation in PTX

GPU code spends a lot of time computing addresses.

For a 1D array access like x[i], PTX typically does something like this:

mul.wide.s32  %rd1, %r1, 4;
add.s64       %rd2, %rd_base, %rd1;
ld.global.f32 %f1, [%rd2];

Why mul.wide.s32?

Because:

  • i is often a 32-bit integer,
  • the byte offset may need 64-bit addressing,
  • and float is 4 bytes.

This is one of the first places where PTX makes the machine model visible.

In C++, x[i] looks tiny. In PTX, you can see the full lowering:

$$ \text{address} = \text{base} + i \times \text{sizeof}(T) $$

That visibility becomes even more useful for multidimensional indexing and pitched memory.

Memory Cache Hints

Beyond state spaces, PTX also lets you control how memory operations interact with the cache hierarchy.

When you write to global memory, you can hint whether data should remain in cache or be evicted immediately. This is especially useful for:

  • large arrays that do not fit in cache,
  • one-time reads that will not be reused,
  • streaming workloads where cache coherency overhead outweighs benefit.

The most common cache specifiers are:

ModifierMeaning
.cacache at all levels (default)
.cgcache in L2 only, bypass L1
.cscache in both L1 and L2, but with streaming behavior (weaker coherency)
.clcache in L1 only, bypass L2
.cccache coherent (use when multiple threads in different blocks may read/write the same address)

Examples:

ld.global.ca.f32  %f1, [%rd1];
ld.global.cg.f32  %f2, [%rd2];
st.global.cs.f32  [%rd3], %f3;

Two practical lessons:

  1. Default (.ca) works for most kernels. Do not override unless profiling shows cache pressure or unnecessary coherency stalls.
  2. .cg is common for large, one-time reads like input matrices in matrix multiply.

This is another case where PTX makes the compiler’s assumptions visible. If you see .cg everywhere, the compiler is telling you it expects streaming access patterns.

Special Registers: Where Thread Identity Comes From

PTX exposes special registers for CUDA’s execution hierarchy.

The most common ones are:

Special registerMeaning
%tid.xthread index within block
%ntid.xblock dimension
%ctaid.xblock index
%nctaid.xgrid dimension
%laneidlane index within warp
%warpidwarp index within SM context

Example:

mov.u32    %r1, %tid.x;
mov.u32    %r2, %ctaid.x;
mov.u32    %r3, %ntid.x;
mad.lo.s32 %r4, %r2, %r3, %r1;

This is just:

int i = blockIdx.x * blockDim.x + threadIdx.x;

Again, PTX makes the mapping explicit.

Thread Index Calculation in PTX

In PTX, this same computation appears as:

mov.u32    %r1, %tid.x;
mov.u32    %r2, %ctaid.x;
mov.u32    %r3, %ntid.x;
mad.lo.s32 %r4, %r2, %r3, %r1;

Breaking this down:

  1. mov.u32 %r1, %tid.x; — load the thread index within the block
  2. mov.u32 %r2, %ctaid.x; — load the block index
  3. mov.u32 %r3, %ntid.x; — load the block dimension (threads per block)
  4. mad.lo.s32 %r4, %r2, %r3, %r1; — compute blockIdx.x * blockDim.x + threadIdx.x in a single multiply-add instruction

The mad.lo.s32 (multiply-add, low 32 bits, signed 32-bit) is a common pattern that fuses the multiply and add into one operation.

For 2D or 3D indexing, you simply repeat this pattern for each dimension. For example, a 2D index like blockIdx.y * blockDim.y + threadIdx.y follows the same structure, just with the .y variants of the special registers.

This is one of the first places where PTX makes the cost of indexing visible. In C++, blockIdx.x * blockDim.x + threadIdx.x disappears into an arithmetic expression. In PTX, you see it materialize as explicit instructions that take up the instruction stream and contribute to register pressure.


Predicates: PTX’s Control-Flow Building Block

On GPUs, simple control flow is often lowered into predicate generation plus predicated execution.

Example:

setp.lt.s32 %p1, %r1, %r2;
@%p1 add.s32 %r3, %r3, 1;

This means:

  1. compare %r1 < %r2,
  2. store the boolean result in predicate %p1,
  3. execute the add only where %p1 is true.

Predicated Instructions

  • setp -> compare values and set a predicate register (%p)
  • selp -> select between two values using a predicate (branchless select)
  • @%pX op... -> execute an instruction only for lanes where predicate is true
setp.lt.s32 %p1, %r1, %r2;
selp.s32    %r1, %r2, %r3, %p1;   // %r1 = (%p1) ? %r2 : %r3
@%p1 mov.s32 %r3, %r4;            // only active where %p1 is true

Conceptually, this is still a shared warp instruction stream (SIMT): one instruction is issued for the warp, and each lane decides whether it is active for that instruction.

What predication means in practice

For a predicated instruction:

  • predicate true -> lane executes normally
  • predicate false -> lane is masked for that instruction

Even when some lanes are masked off, the instruction still flows through the pipeline. The key effect is that write-back is suppressed for inactive lanes, so their destination state is not updated.

Why compilers like this pattern

  • avoids a control-flow change for small conditions
  • often avoids warp split/reconvergence overhead for tiny if bodies
  • maps naturally to selp for conditional value selection

That is why many short source-level if statements appear as predicate-oriented PTX before final SASS branch behavior is decided.

Important edge cases

  • Predication does not remove memory divergence effects
  • Too many predicated instructions can waste issue bandwidth when many lanes are inactive
  • Compilers may already choose selp automatically for simple ternary-like patterns
  • Heavy nested conditions can increase pressure on predicate and general registers

When explicit branches still appear

Branches are still common for:

  • larger conditional regions
  • loops
  • function-call style control flow
@%p1 bra LOOP;

At that point, hardware handles warp-level control flow with reconvergence machinery and scheduling logic.


Reading Common Arithmetic Patterns

Here are the PTX instructions you will see constantly.

Integer arithmetic

add.s32      %r0, %r1, %r2;
sub.s32      %r3, %r4, %r5;
mul.lo.s32   %r6, %r7, %r8;
mad.lo.s32   %r9, %r1, %r2, %r3;
mul.wide.s32 %rd1, %r1, %r2;

Floating-point arithmetic

add.f32      %f0, %f1, %f2;
mul.f32      %f3, %f4, %f5;
fma.rn.f32   %f6, %f1, %f2, %f3;

Bitwise and shifts

and.b32 %r0, %r1, %r2;
or.b32  %r3, %r4, %r5;
xor.b32 %r6, %r7, %r8;
shl.b32 %r9, %r9, 2;
shr.u32 %r1, %r1, 1;

Data movement and conversion

mov.u32         %r0, %tid.x;
cvt.rn.f32.s32  %f0, %r0;
cvt.s64.s32     %rd0, %r1;

Special functions

sin
cos
sqrt
rsqrt
exp
log

Vectorized instructions (e.g., for matrix multiply)

ld.global.v4.f32   %f0, [%rd1];
st.global.v4.f32   [%rd2], %f4;

If you learn to recognize these four families, most beginner PTX listings become readable very quickly.


PTX vs SASS: What Changes Later?

A useful mental model is this:

QuestionPTX answers well?SASS answers well?
What operation is this kernel performing?YesYes
Which memory space is touched?YesUsually
Why did the compiler insert a conversion?YesSometimes
Which exact hardware opcode runs?NoYes
Which pipeline is stressed?Not reliablyBetter
What is the final register allocation?NoBetter
What scheduling metadata exists?NoYes

That is why PTX is the right first stop for understanding code generation, but not the last stop for serious tuning.

If PTX tells you what, SASS tells you how.


A Minimal CUDA -> PTX Walkthrough

Let’s start with a tiny vector addition kernel:

__global__ void addKernel(int *c, const int *a, const int *b, const unsigned long long length)
{
  unsigned long long idx = (blockIdx.x * blockDim.x) + threadIdx.x;

  if (idx < length)
  {
    c[idx] = a[idx] + b[idx];
  }
}

The corresponding PTX will looks like this:



DONE:
ret;

Even before looking at SASS, PTX already reveals a lot:

  • kernel arguments come from the .param state space,
  • thread indexing uses special registers like %tid.x, %ctaid.x, %ntid.x,
  • control flow is often built from setp + predicated branch,
  • byte addressing for float arrays uses mul.wide.s32 by 4,
  • loads and stores are explicit about global memory,
  • and the arithmetic becomes fma.rn.f32.

That is exactly why PTX is a great first layer to study.

[PLACEHOLDER: Insert side-by-side screenshot of source, PTX, and SASS for ADD Kernel]

Breakdown of the PTX:


Writing Inline PTX in CUDA C++

If you want to experiment directly, CUDA lets you embed PTX using asm().

A tiny example:

__device__ int add_inline_ptx(int a, int b) {
  int out;
  asm("add.s32 %0, %1, %2;"
      : "=r"(out)
      : "r"(a), "r"(b));
  return out;
}

The placeholders %0, %1, %2 refer to the operands listed after the template string.

A few constraints worth remembering

ConstraintMeaning
"r"32-bit integer register
"l"64-bit integer register
"f"32-bit float register
"d"64-bit float register
"=r"write-only output
"+r"read-write operand

Example with read-write state:

__device__ int incr_inline_ptx(int x) {
  asm("add.s32 %0, %0, 1;" : "+r"(x));
  return x;
}

Two beginner pitfalls

  1. Use asm volatile if the block has side effects that must not be removed or moved.
  2. Use the "memory" clobber if the PTX touches memory indirectly and you need a compiler barrier.

Example:

asm volatile ("st.global.u32 [%0], %1;" :: "l"(ptr), "r"(value) : "memory");

Also, if you declare temporary PTX registers inside the asm string, wrap the block in braces to avoid name collisions across inlined copies:

asm volatile (
  "{\n\t"
  "  .reg .u32 t1;\n\t"
  "  mul.lo.u32 t1, %1, %1;\n\t"
  "  mul.lo.u32 %0, t1, %1;\n\t"
  "}"
  : "=r"(y)
  : "r"(x));

Inline PTX is powerful, but for learning PTX itself, I recommend reading generated PTX first and writing inline PTX second.



A Practical Checklist for Reading PTX

When I open a PTX file, I usually scan in this order:

1. Kernel signature

  • What are the parameters?
  • Which ones are pointers?
  • Which ones are scalars?

2. Indexing math

  • How is the thread index computed?
  • Are multiplies widened to 64-bit for addresses?

3. Control flow

  • Are bounds checks implemented with setp + bra?
  • Is the compiler using predication instead of large branches?

4. Memory spaces

  • ld.global / st.global?
  • ld.shared / st.shared?
  • ld.local / st.local?

5. Arithmetic mix

  • integer-heavy?
  • floating-point-heavy?
  • many conversions?
  • many special functions like sqrt, rsqrt, sin, cos?

6. Suspicious performance clues

  • repeated cvt instructions,
  • extra address recomputation,
  • local memory traffic,
  • too much control-flow scaffolding.

That is usually enough to decide whether you need to keep reading PTX, move to SASS, or open the profiler.


What PTX Hides from You

PTX is excellent, but it can also be misleading if you trust it too much.

Here are the big things it hides:

Final register pressure

PTX virtual registers are not physical registers. Real occupancy depends on final SASS allocation.

Exact pipeline pressure

A PTX fma does not directly tell you everything about scheduler pressure, dual issue behavior, or final hardware pipeline mapping.

Scheduling metadata

PTX does not expose the wait barriers, reuse flags, stall counts, or yield hints that appear in SASS disassembly.

Final memory behavior

PTX shows the memory space, but not the full runtime story:

  • cache hit rates,
  • memory divergence,
  • replay overhead,
  • bank conflicts,
  • scoreboard stalls.

That is why PTX reading and profiling must go together.

What to Look for in Profiling

Learning PTX is much easier when you connect it to profiler evidence.

Nsight Compute

For a PTX-focused investigation, I would start with these sections:

LaunchStats / Occupancy

Look for:

  • launch__registers_per_thread
  • launch__occupancy_limit_registers
  • achieved vs theoretical occupancy

Why it matters:

  • PTX may look harmless,
  • but final register allocation may reduce active warps.

InstructionStats

Look for:

  • instruction mix,
  • integer vs floating-point balance,
  • memory instruction volume,
  • SASS opcode categories.

Why it matters:

This tells you whether the PTX arithmetic you saw actually turned into the hardware mix you expected.

SourceCounters

Look for:

  • PTX/SASS/source correlation,
  • branch divergence,
  • predication behavior,
  • spill-related source metrics.

Why it matters:

This is the bridge between the readable PTX layer and the final executed code.

SchedulerStats and WarpStateStats

Look for:

  • low eligible warps,
  • skipped issue slots,
  • long scoreboard stalls,
  • math pipe throttle,
  • memory throttle.

Why it matters:

PTX may show you a ld.global followed by immediate use. Nsight Compute tells you whether that pattern is actually causing scoreboard stalls.

MemoryWorkloadAnalysis

Look for:

  • L1/L2 hit rates,
  • sectors per request,
  • DRAM throughput,
  • shared-memory bank conflicts.

Why it matters:

PTX tells you which state space is touched. Nsight Compute tells you whether it is touched efficiently.

[PLACEHOLDER: Insert screenshot of Nsight Compute Source page correlating source -> PTX -> SASS]

Nsight Systems

Nsight Systems is less about individual PTX instructions and more about context.

Look for:

  • kernel launch gaps,
  • stream concurrency,
  • host-to-device overlap,
  • memcpy/kernel ordering,
  • whether the kernel is even large enough to justify PTX-level tuning.

Why it matters:

Sometimes the PTX is fine. The real problem is that the GPU is idle between launches, blocked by synchronization, or starved by the CPU.

Use Nsight Systems first if the question is:

“Is my application keeping the GPU busy at all?”

Use Nsight Compute first if the question is:

“Why is this specific kernel underperforming?”

Visual Profiler (legacy note)

The old Visual Profiler is largely historical now, but if you encounter old tutorials or legacy screenshots, the same mental model still applies.

Look for:

  • achieved occupancy,
  • global memory throughput,
  • branch efficiency,
  • local memory usage,
  • instruction mix.

If a legacy profile shows heavy local memory traffic, go back to PTX and check for ld.local / st.local.


PTX Reading Tips That Save Time

Tip 1: Ignore boilerplate on the first pass

Do not get stuck on every directive. First find:

  • indexing,
  • loads,
  • stores,
  • predicates,
  • main arithmetic.

Tip 2: Follow data, not line numbers

Pick one source value and track it:

  • where it enters,
  • where it is converted,
  • where it participates in math,
  • where it is stored.

Tip 3: Treat ld.local as a warning sign

It is not always bad, but it is always worth investigating.

Tip 4: Learn the common idioms once

Most kernels reuse the same patterns:

  • global index calculation,
  • bounds check,
  • address widening,
  • global load,
  • arithmetic,
  • store.

Once those become familiar, PTX stops looking intimidating.

Common Beginner Mistakes

“PTX is what the GPU executes”

No. PTX is lowered into SASS.

“More PTX registers means more hardware registers”

Not necessarily. PTX register names are virtual.

“If the PTX looks clean, performance must be good”

Not necessarily. Final scheduling, pipeline pressure, cache behavior, and occupancy all happen later.

“Predication means no divergence cost”

Not automatically. Predication can avoid a branch, but it can still waste issue bandwidth if many threads are masked off.

“Local memory means CPU-like stack memory”

Conceptually yes, performance-wise no. On GPUs, local memory usually behaves much closer to device memory than to a tiny CPU stack in L1.


Conclusion

PTX is the right place to begin if you want to understand CUDA below the source level without immediately drowning in hardware detail.

It teaches you how the compiler thinks in terms of:

  • typed operations,
  • explicit memory spaces,
  • predicate-driven control flow,
  • address generation,
  • and the structure of GPU kernels.

Most importantly, PTX gives you a readable intermediate language that connects source code to real performance work.

If you can read PTX comfortably, you are already much closer to answering hard questions like:

  • Why did this kernel start spilling?
  • Why did this branch become predicated?
  • Why is this address calculation more expensive than expected?
  • Why does the profiler show more memory pressure than the source suggests?

That is the point where GPU optimization starts becoming systematic instead of mystical.

[PLACEHOLDER: Insert final summary illustration comparing source, PTX, SASS, and profiler viewpoints] –>

Where to Go Next

Once PTX starts feeling readable, the natural next steps are:

  1. PTX -> SASS mapping
  2. Register allocation and occupancy
  3. Instruction scheduling metadata
  4. Memory subsystem behavior in profilers

That is where PTX stops being the whole story and becomes one layer in a larger optimization workflow.

Tags

Related Posts

Cuda Microarchitecture

CUDA Register Mapping: From PTX to SASS

Introduction Register allocation is one of the most critical aspects of GPU programming. On CPUs, the hardware’s “out-of-order” execution engine hides inefficiencies through register renaming, dynamically managing hundreds of physical registers behind 16 visible ones. GPUs work differently: what the compiler assigns is what actually runs, with no dynamic safety net.

Cuda Microarchitecture Gpu

CUDA SASS: Learning to Read NVIDIA's Native GPU ISA

TL;DR SASS is the real instruction stream executed by NVIDIA GPUs. PTX is not the final hardware ISA. It is a virtual ISA that ptxas lowers into architecture-specific SASS. The main things to learn first are: opcodes, registers, predicates, loads/stores, special registers, and modifiers. SASS is where you see performance-critical details that source code hides: final opcode selection, register usage, spills, and memory instructions. If PTX tells you the compiler’s intent, SASS tells you what the GPU will actually issue. Why Learn SASS at All? If you only write CUDA C++, it is tempting to stop at source code and trust the compiler. That works until performance becomes mysterious.