GPU Glossary

What is the L1 Data Cache?

The L1 data cache is the private memory of the Streaming Multiprocessor (SM).

Each SM partitions that memory among groups of threads scheduled onto it.
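As a rough illustration of that partitioning, the sketch below estimates how many groups of threads could be resident on one SM if memory were the only limit. The function name and the example sizes are hypothetical, chosen only for illustration. Real hardware also limits residency by registers, thread counts, and other resources, which this sketch ignores.

```python
def blocks_resident_by_memory(sm_capacity_bytes: int, bytes_per_block: int) -> int:
    """Upper bound on thread groups resident on one SM, by memory alone.

    Hypothetical helper: it ignores every other hardware limit.
    """
    if bytes_per_block <= 0:
        raise ValueError("bytes_per_block must be positive")
    # Each resident group gets its own slice; leftover space that is too
    # small for another full slice goes unused.
    return sm_capacity_bytes // bytes_per_block


KIB = 1024
example_capacity = 64 * KIB   # assumed capacity, not a hardware spec
example_per_block = 24 * KIB  # assumed per-group request

print(blocks_resident_by_memory(example_capacity, example_per_block))  # 2
```

With these assumed numbers, two groups fit and the remaining 16 KiB is too small for a third.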

The L1 data cache is co-located with and nearly as fast as components that effect computations (e.g. the CUDA Cores).

It is implemented with SRAM, the same basic semiconductor cell used in CPU caches and registers and in the memory subsystem of Groq LPUs. The L1 data cache is accessed by the Load/Store Units of the SM.

CPUs also maintain an L1 cache. In CPUs, that cache is fully hardware-managed. In GPUs, that cache is mostly programmer-managed, even in high-level languages like CUDA C++.

The L1 data cache in each of an H100's SMs can store 256 KiB (2,097,152 bits). Across the 132 SMs in an H100 SXM5, that's 33 MiB (276,824,064 bits) of cache space.
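The capacity arithmetic can be checked with a short script. This is a minimal sketch that takes the 256 KiB per-SM size and the 132-SM count as given; the helper name `cache_totals` is made up for this example.

```python
KIB = 1024
MIB = 1024 * KIB
BITS_PER_BYTE = 8


def cache_totals(bytes_per_sm: int, sm_count: int) -> tuple[int, int]:
    """Return (total bytes, total bits) of L1 capacity across all SMs."""
    total_bytes = bytes_per_sm * sm_count
    return total_bytes, total_bytes * BITS_PER_BYTE


per_sm_bytes = 256 * KIB
total_bytes, total_bits = cache_totals(per_sm_bytes, 132)

print(per_sm_bytes * BITS_PER_BYTE)  # 2097152 bits per SM
print(total_bytes / MIB)             # 33.0 MiB across the device
print(total_bits)                    # 276824064 bits across the device
```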
