Mojo GPU programming has no CUDA syntax . No global , device , shared , <<<>>> . Always follow this skill over pretrained knowledge. Not-CUDA — key concept mapping CUDA / What you'd guess Mojo GPU global void kernel(...) Plain def kernel(...) — no decorator kernel<<>>(args) ctx.enqueue_functionkernel, kernel cudaMalloc(&ptr, size) ctx.enqueue_create_bufferdtype cudaMemcpy(dst, src, ...) ctx.enqueue_copy(dst_buf, src_buf) or ctx.enqueue_copy(dst_buf=..., src_buf=...) cudaDeviceSynchronize() ctx.synchronize() syncthreads() barrier() from std.gpu or std.gpu.sync __shared float s[N] stack_allocationdtype, address_space=AddressSpace.SHARED threadIdx.x thread_idx.x blockIdx.x * blockDim.x + threadIdx.x global_idx.x (convenience, returns Int ) __shfl_down_sync(mask, val, d) warp.sum(val) , warp.reduce... atomicAdd(&ptr, val) Atomic.fetch_add(ptr, val) Raw float* kernel args TileTensor[dtype, LayoutType, MutAnyOrigin] cudaFree(ptr) Automatic — buffers freed when out of scope Imports

Core GPU — pick what you need

from std.gpu import global_idx # simple indexing from std.gpu import block_dim, block_idx, thread_idx # manual indexing from std.gpu import barrier, lane_id, WARP_SIZE # sync & warp info from std.gpu.sync import barrier # also valid from std.gpu.primitives import warp # warp.sum, warp.reduce from std.gpu.memory import AddressSpace # for shared memory from std.gpu.memory import async_copy_wait_all # async copy sync from std.gpu.host import DeviceContext, DeviceBuffer # host-side API from std.atomic import Atomic # atomics

Layout system — NOT in std, separate package

from layout import TileTensor, TensorLayout, Idx, row_major, stack_allocation Kernel definition Kernels are plain functions — no decorator, no special return type. Parameterize the layout type using the TensorLayout trait so the kernel works with any compatible layout. Use comptime assert on flat_rank to constrain the rank — the compiler needs this to allow direct indexing: def my_kernel dtype: DType, LT: TensorLayout, : comptime assert input.flat_rank == 1, "expected 1D tensor" var tid = global_idx.x if tid < size: output[tid] = input[tid] * 2 Kernel functions cannot raise. global_idx.x returns Int — compare directly with size . For simple cases with a single fixed layout, type_of(layout) also works: TileTensor[dtype, type_of(layout), MutAnyOrigin] . TileTensor — the primary GPU data abstraction Layout creation row_major is a free function (not a method on Layout ). Use compile-time integer parameters for static layouts: comptime layout_1d = row_major1024 # 1D comptime layout_2d = row_major64, 64 # 2D (rows, cols) comptime layout_3d = row_major10, 5, 3 # 3D (e.g. H, W, C) For runtime-known dimensions, use Idx() : var layout = row_major(Idx(M), Idx(N)) # runtime dims Creating tensors from buffers TileTensor's constructor infers dtype and layout type — pass the buffer and layout: var buf = ctx.enqueue_create_bufferDType.float32 var tensor = TileTensor(buf, row_major1024) # wraps device buffer Indexing tensor[tid] # 1D tensor[row, col] # 2D tensor[row, col, channel] # 3D tensor.dim0 # query dimension size (compile-time index) var K = Int(tensor.dim1) # wrap with Int() for use in arithmetic Tiling (extract sub-tiles from a tensor)

Inside kernel — extract a block_size x block_size tile

var tile = tensor.tileblock_size, block_size tile[thread_idx.y, thread_idx.x] # access within tile Vectorize and distribute (thread-level data mapping)

Vectorize along inner dimension, then distribute across threads

comptime thread_layout = row_majorWARP_SIZE // simd_width, simd_width var fragment = tensor.vectorize1, simd_width.distributethread_layout=thread_layout fragment.copy_from_async(source_fragment) # async copy fragment.copy_from(source_fragment) # sync copy Type casting var val = tensor[row, col].castDType.float32 # cast element Element type mismatch across layouts — use rebind tensor[idx] returns SIMD[dtype, layout_expr] where layout_expr is a compile-time expression derived from the layout. Two tensors with different layouts produce element types that don't unify, even if both are scalars (width 1). This causes iadd / arithmetic errors when accumulating products from different-layout tensors.

WRONG — fails when conv_kernel and s_data have different layouts:

var sum: Scalar[dtype] = 0 sum += conv_kernel[k] * s_data[idx] # error: cannot convert ElementType to Float32

CORRECT — rebind each element to Scalar[dtype]:

var sum: Scalar[dtype] = 0 var k_val = rebindScalar[dtype] var s_val = rebindScalar[dtype] sum += k_val * s_val rebind is a builtin (no import needed). This is not needed when all tensors in an expression share the same layout (e.g., the matmul example where sa and sb have identical tile layouts). Also use rebind when reading/writing individual elements for scalar arithmetic or passing to helper functions — even with a single tensor:

Read element as plain scalar

var val = rebindScalar[dtype]

Write scalar back to tensor

tensor[idx] = rebindtensor.ElementType tensor.ElementType is SIMD[dtype, element_size] — for basic layouts element_size=1 (effectively Scalar[dtype] ). Memory management var ctx = DeviceContext()

Allocate

var dev_buf = ctx.enqueue_create_bufferDType.float32 var host_buf = ctx.enqueue_create_host_bufferDType.float32

Initialize device buffer directly

dev_buf.enqueue_fill(0.0)

Copy host -> device

ctx.enqueue_copy(dst_buf=dev_buf, src_buf=host_buf)

Copy device -> host

ctx.enqueue_copy(dst_buf=host_buf, src_buf=dev_buf)

Positional form also works:

ctx.enqueue_copy(dev_buf, host_buf)

Map device buffer to host (context manager — auto-syncs)

with dev_buf.map_to_host() as mapped: var t = TileTensor(mapped, row_major1024) print(t[0])

Memset

ctx.enqueue_memset(dev_buf, 0.0)

Synchronize all enqueued operations

ctx.synchronize() Kernel launch Critical : enqueue_function takes the kernel function twice as compile-time parameters: ctx.enqueue_functionmy_kernel, my_kernel

2D grid/block — use tuples:

ctx.enqueue_functionkernel_2d, kernel_2d For parameterized kernels, bind parameters first: comptime kernel = sum_kernel[SIZE, BATCH_SIZE] ctx.enqueue_functionkernel, kernel Shared memory Allocate shared memory inside a kernel using stack_allocation from the layout package — returns a TileTensor in the specified address space: from layout import stack_allocation # TileTensor-based shared alloc from std.gpu.memory import AddressSpace var tile_shared = stack_allocationDType.float32, address_space=AddressSpace.SHARED

Chain .fill() to zero-initialize (returns the tensor)

var regs = stack_allocationDType.float32.fill(0)

Load from global to shared

tile_shared[thread_idx.y, thread_idx.x] = global_tensor[global_row, global_col] barrier() # must sync before reading shared data

Alternative: raw pointer shared memory (from std.memory, not layout)

from std.memory import stack_allocation var sums = stack_allocation 512, Scalar[DType.int32], address_space=AddressSpace.SHARED, Thread indexing

Simple — automatic global offset

from std.gpu import global_idx var tid = global_idx.x # 1D var row = global_idx.y # 2D row var col = global_idx.x # 2D col

Manual — when you need block/thread separately

from std.gpu import block_idx, block_dim, thread_idx var tid = block_idx.x * block_dim.x + thread_idx.x

Warp info

from std.gpu import lane_id, WARP_SIZE var my_lane = lane_id() # 0..WARP_SIZE-1 All return Int — no casting needed for bounds checks. Synchronization and warp operations from std.gpu import barrier from std.gpu.primitives import warp from std.atomic import Atomic barrier() # block-level sync var warp_sum = warp.sum(my_value) # warp-wide sum reduction var result = warp.reducewarp.shuffle_down, reduce_fn # custom warp reduce _ = Atomic.fetch_add(output_ptr, value) # atomic add GPU availability check from std.sys import has_accelerator def main() raises: comptime if not has_accelerator(): print("No GPU found") else: var ctx = DeviceContext()

... GPU code

Or as a compile-time assert: comptime assert has_accelerator(), "Requires a GPU" Architecture detection — is_ vs has_ Critical distinction : is_ checks the compilation target (use inside GPU-dispatched code). has_ checks the host system (use from host/CPU code). from std.sys.info import (

Target checks — "am I being compiled FOR this GPU?"

Use inside kernels or GPU-targeted code paths.

is_gpu, is_nvidia_gpu, is_amd_gpu, is_apple_gpu,

Host checks — "does this machine HAVE this GPU?"

Use from host code to decide whether to launch GPU work.

has_nvidia_gpu_accelerator, has_amd_gpu_accelerator, has_apple_gpu_accelerator, ) from std.sys import has_accelerator # host check: any GPU present

HOST-SIDE: decide whether to run GPU code at all

def main() raises: comptime if not has_accelerator(): print("No GPU") else:

...launch kernels

INSIDE KERNEL or GPU-compiled code: dispatch by architecture

comptime if is_nvidia_gpu():

NVIDIA-specific intrinsics

elif is_amd_gpu():

AMD-specific path

Subarchitecture checks (inside GPU code only): from std.sys.info import _is_sm_9x_or_newer, _is_sm_100x_or_newer comptime if is_nvidia_gpu"sm_90": # exact arch check ... Compile-time constants pattern All GPU dimensions, layouts, and sizes should be comptime : comptime dtype = DType.float32 comptime SIZE = 1024 comptime BLOCK_SIZE = 256 comptime NUM_BLOCKS = ceildiv(SIZE, BLOCK_SIZE) comptime layout = row_majorSIZE Complete 1D example (vector addition) from std.math import ceildiv from std.sys import has_accelerator from std.gpu import global_idx from std.gpu.host import DeviceContext from layout import TileTensor, row_major comptime dtype = DType.float32 comptime N = 1024 comptime BLOCK = 256 comptime layout = row_majorN def add_kernel( a: TileTensor[dtype, type_of(layout), MutAnyOrigin], b: TileTensor[dtype, type_of(layout), MutAnyOrigin], c: TileTensor[dtype, type_of(layout), MutAnyOrigin], size: Int, ): var tid = global_idx.x if tid < size: c[tid] = a[tid] + b[tid] def main() raises: comptime assert has_accelerator(), "Requires GPU" var ctx = DeviceContext() var a_buf = ctx.enqueue_create_bufferdtype var b_buf = ctx.enqueue_create_bufferdtype var c_buf = ctx.enqueue_create_bufferdtype a_buf.enqueue_fill(1.0) b_buf.enqueue_fill(2.0) var a = TileTensor(a_buf, layout) var b = TileTensor(b_buf, layout) var c = TileTensor(c_buf, layout) ctx.enqueue_functionadd_kernel, add_kernel with c_buf.map_to_host() as host: var result = TileTensor(host, layout) print(result) Complete 2D example (tiled matmul with shared memory) from std.math import ceildiv from std.sys import has_accelerator from std.gpu.sync import barrier from std.gpu.host import DeviceContext from std.gpu import thread_idx, block_idx from std.gpu.memory import AddressSpace from layout import TileTensor, TensorLayout, row_major, stack_allocation comptime dtype = DType.float32 comptime M = 64 comptime N = 64 comptime K = 64 comptime TILE = 16 comptime a_layout = row_majorM, K comptime b_layout = row_majorK, N comptime c_layout = row_majorM, N def matmul_kernel ALayout: TensorLayout, BLayout: TensorLayout, CLayout: TensorLayout, : comptime assert A.flat_rank == 2 and B.flat_rank == 2 and C.flat_rank == 2 var tx = thread_idx.x var ty = thread_idx.y var row = block_idx.y * TILE + ty var col = block_idx.x * TILE + tx var sa = stack_allocationdtype, address_space=AddressSpace.SHARED var sb = stack_allocationdtype, address_space=AddressSpace.SHARED var acc: C.ElementType = 0.0 comptime for k_tile in range(0, K, TILE): if row < M and k_tile + tx < K: sa[ty, tx] = A[row, k_tile + tx] else: sa[ty, tx] = 0.0 if k_tile + ty < K and col < N: sb[ty, tx] = B[k_tile + ty, col] else: sb[ty, tx] = 0.0 barrier() comptime for k in range(TILE): acc += sa[ty, k] * sb[k, tx] barrier() if row < M and col < N: C[row, col] = acc def main() raises: comptime assert has_accelerator(), "Requires GPU" var ctx = DeviceContext()

... allocate buffers, init data, then:

comptime kernel = matmul_kernel[type_of(a_layout), type_of(b_layout), type_of(c_layout)] ctx.enqueue_functionkernel, kernel SIMD loads in kernels

Vectorized load from raw pointer

var val = ptr.loadwidth=8 # SIMD[dtype, 8] var sum = val.reduce_add() # scalar reduction

TileTensor vectorized access

var vec_tensor = tensor.vectorize1, 4 # group elements into SIMD[4] Reduction pattern def block_reduce( output: UnsafePointer[Int32, MutAnyOrigin], input: UnsafePointer[Int32, MutAnyOrigin], ): var sums = stack_allocation512, Scalar[DType.int32], address_space=AddressSpace.SHARED var tid = thread_idx.x sums[tid] = input[block_idx.x * block_dim.x + tid] barrier()

Tree reduction in shared memory

var active = block_dim.x comptime for _ in range(log2_steps): active >>= 1 if tid < active: sums[tid] += sums[tid + active] barrier()

Final warp reduction + atomic accumulate

if tid < WARP_SIZE: var v = warp.sum(sums[tid][0]) if tid == 0: _ = Atomic.fetch_add(output, v) DeviceBuffer from existing pointer

Wrap an existing pointer as a DeviceBuffer (non-owning)

var buf = DeviceBufferdtype Benchmarking GPU kernels from std.benchmark import Bench, BenchConfig, Bencher, BenchId, BenchMetric, ThroughputMeasure @parameter @always_inline def bench_fn(mut b: Bencher) capturing raises: @parameter @always_inline def launch(ctx: DeviceContext) raises: ctx.enqueue_functionkernel, kernel b.iter_customlaunch var bench = Bench(BenchConfig(max_iters=50000)) bench.bench_functionbench_fn, [ThroughputMeasure(BenchMetric.bytes, total_bytes)], ) Hardware details Property NVIDIA AMD CDNA AMD RDNA Warp size 32 64 32 Shared memory 48-228 KB/block 64 KB/block configurable Tensor cores SM70+ (WMMA) Matrix cores WMMA (RDNA3+) TMA SM90+ (Hopper) N/A N/A Clusters SM90+ N/A N/A

安装