GPU Glossary

What is Little's Law?

Little's Law establishes the amount of concurrency required to fully hide latency with throughput.

concurrency (ops) = latency (s) * throughput (ops/s)

Little's Law is described as "the most important of the fundamental laws" of analysis in the classic quantitative systems textbook by Lazowska and others.

Little's Law determines how many instructions must be "in flight" for GPUs to hide latency by having warp schedulers switch between warps (a.k.a. fine-grained thread-level parallelism, akin to simultaneous multi-threading in CPUs).

If a GPU has a peak throughput of 1 instruction per cycle and a memory access latency of 400 cycles, then 400 concurrent memory operations are needed across all active warps in a program. If the throughput goes up to 10 instructions per cycle, then the program needs 4000 concurrent memory operations to properly take advantage of the increase. For more detail, see the article on latency hiding.
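The arithmetic above can be sketched directly. This is a minimal illustration of the formula using the example figures from this article, not the specs of any particular GPU:

```python
def required_concurrency(latency_cycles: float, throughput_ops_per_cycle: float) -> float:
    """Little's Law: operations that must be in flight to fully hide latency."""
    return latency_cycles * throughput_ops_per_cycle

# 400-cycle memory latency at a peak of 1 instruction per cycle:
print(required_concurrency(400, 1))   # 400.0 concurrent memory operations

# 10x the throughput demands 10x the concurrency at the same latency:
print(required_concurrency(400, 10))  # 4000.0 concurrent memory operations
```

Note that the required concurrency scales linearly with both factors, which is why faster hardware often needs *more* parallelism from a program, not less.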

For a non-trivial application of Little's Law, consider the following observation, from Section 4.3 of Vasily Volkov's PhD thesis on latency hiding: the number of warps required to hide pure memory access latency is not much higher than that required to hide pure arithmetic latency (30 vs 24, in his experiment). Intuitively, the longer latency of memory accesses would seem to require more concurrency. But the concurrency is determined not just by latency but also by throughput. And because memory bandwidth is so much lower than arithmetic bandwidth, the required concurrency turns out to be roughly the same — a useful form of balance for a latency-hiding-oriented system that will mix arithmetic and memory operations.
