Shared Memory
Shared memory is the level of the memory hierarchy corresponding to the thread block level of the thread group hierarchy in the CUDA programming model. It is generally much smaller but much faster (in both throughput and latency) than global memory.
A fairly typical kernel therefore looks something like this (sketched in code after the list):
- load data from global memory into shared memory
- perform a number of arithmetic operations on that data via the CUDA Cores and Tensor Cores
- optionally, synchronize threads within a thread block by means of barriers while performing those operations
- write data back into global memory, optionally preventing races across thread blocks by means of atomics
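A minimal sketch of that pattern (using only the CUDA Cores, no Tensor Cores): a hypothetical `blockSum` kernel stages a tile of the input in shared memory, reduces it cooperatively with barrier synchronization, and atomically accumulates each block's partial sum into global memory. The kernel name, launch configuration, and reduction strategy are illustrative, not taken from any particular library.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: sums an array by staging values through shared memory.
__global__ void blockSum(const float* in, float* out, int n) {
    extern __shared__ float tile[];          // shared memory, one slot per thread

    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // 1. Load data from global memory into shared memory.
    tile[threadIdx.x] = (idx < n) ? in[idx] : 0.0f;
    __syncthreads();                         // barrier: tile is fully populated

    // 2./3. Arithmetic on the shared data, synchronizing the block at each step.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride) {
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        }
        __syncthreads();
    }

    // 4. Write back to global memory, using an atomic to avoid races across blocks.
    if (threadIdx.x == 0) {
        atomicAdd(out, tile[0]);
    }
}

int main() {
    const int n = 1 << 20, threads = 256, blocks = (n + threads - 1) / threads;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;
    *out = 0.0f;

    // Third launch parameter sizes the dynamic shared memory tile per block.
    blockSum<<<blocks, threads, threads * sizeof(float)>>>(in, out, n);
    cudaDeviceSynchronize();
    printf("sum = %f\n", *out);              // expect 1048576.0

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```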
Shared memory is stored in the L1 data cache of the GPU's Streaming Multiprocessor (SM).
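Because the two occupy the same on-chip SRAM, the CUDA runtime accepts a per-kernel hint for how to split that capacity between L1 cache and shared memory. A brief sketch, assuming a hypothetical `sharedHeavyKernel` that benefits from a large shared memory carveout:

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel that wants most of the SRAM as shared memory.
__global__ void sharedHeavyKernel() { /* ... */ }

int main() {
    // Non-binding hint: prefer ~75% of the L1/shared memory capacity
    // as shared memory when launching this kernel.
    cudaFuncSetAttribute(sharedHeavyKernel,
                         cudaFuncAttributePreferredSharedMemoryCarveout,
                         75);
    return 0;
}
```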