Compute Shader Spaces

I often find myself struggling with compute shaders, and most of the difficulty comes down to terminology. As always, my way of overcoming this confusion is to make rules for it. This post collects my personal rules for thinking about compute shaders in D3D; hopefully you will find them useful.

Naming Convention

Multi-dimensional (integer) indices are suffixed with id. The flattened version of an index is suffixed with index.

// 1D
uint threadIndex; // one-dimensional thread index
// 2D (nD)
uint2 threadId; // two-dimensional thread index
uint threadIndex = threadId.y * threadCountX + threadId.x; // flattened version of threadId
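
To make the id/index convention concrete, here is a minimal CPU-side sketch in C++ of flattening a 2D id and recovering it again. The names UInt2, flatten and unflatten are made up for illustration (UInt2 stands in for HLSL's uint2), and row-major order is assumed, as in the snippet above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for HLSL's uint2.
struct UInt2 { uint32_t x, y; };

// Flatten a 2D id into a row-major index: index = id.y * countX + id.x.
constexpr uint32_t flatten(UInt2 id, uint32_t countX) {
    return id.y * countX + id.x;
}

// Inverse mapping: recover the 2D id from a flat index.
constexpr UInt2 unflatten(uint32_t index, uint32_t countX) {
    return UInt2{ index % countX, index / countX };
}

// Thread (3, 2) in rows of 8 threads lands at 2 * 8 + 3 = 19.
static_assert(flatten(UInt2{3, 2}, 8) == 19, "row-major flattening");
static_assert(unflatten(19, 8).x == 3 && unflatten(19, 8).y == 2, "round trip");
```

Only the count in x is needed to flatten a 2D id; the count in y only bounds the valid range.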

There are three spaces to think about when working with compute shaders:

  1. Global space
  2. Group space
  3. Thread space

Each space is a partition of the one above it: global space is divided into groups, and each group is divided into threads. When working in a 2D domain, it helps to map global space to the whole image, group space to sub-rectangles (image tiles), and thread space to individual pixels.

D3D uses semi-weird (to me) names for its compute shader system-value semantics. Naming the variables as follows helps me think more clearly:

// thread space
uint2 tid;    // SV_GroupThreadID
uint2 gsize;  // Thread count per group in each dimension (known in the shader because it is specified in [numthreads(...)])
uint  tindex; // SV_GroupIndex, tindex = tid.y * gsize.x + tid.x

// group space
uint2 gid;    // SV_GroupID
uint2 gcount; // Group count in each dimension (tile count, passed in via constant buffer or hardcoded)
uint  gindex; // gindex = gid.y * gcount.x + gid.x

// global space
uint2 id;     // SV_DispatchThreadID, id =  gid * gsize + tid
uint2 count;  // Thread (pixel) count, count = gcount * gsize
uint  index;  // index = id.y * count.x + id.x
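
The relations between the three spaces can be checked on the CPU. The following is a rough C++ sketch, not shader code: ThreadCoords and makeCoords are hypothetical names, and the example assumes a 32x24 image split into 4x3 groups of 8x8 threads.

```cpp
#include <cassert>
#include <cstdint>

struct UInt2 { uint32_t x, y; };  // hypothetical stand-in for HLSL's uint2

// Everything a thread knows about its position, using the names above.
struct ThreadCoords {
    UInt2 tid;  uint32_t tindex;  // thread space
    UInt2 gid;  uint32_t gindex;  // group space
    UInt2 id;   uint32_t index;   // global space
};

// Derive all coordinates from the group id and the thread id within the group.
constexpr ThreadCoords makeCoords(UInt2 gid, UInt2 tid, UInt2 gcount, UInt2 gsize) {
    return ThreadCoords{
        tid, tid.y * gsize.x + tid.x,                                // tindex
        gid, gid.y * gcount.x + gid.x,                               // gindex
        UInt2{ gid.x * gsize.x + tid.x, gid.y * gsize.y + tid.y },   // id = gid * gsize + tid
        (gid.y * gsize.y + tid.y) * (gcount.x * gsize.x)             // index = id.y * count.x
            + (gid.x * gsize.x + tid.x)                              //         + id.x
    };
}

// Group (2, 1), thread (3, 5), with 4x3 groups of 8x8 threads:
// id = (2*8 + 3, 1*8 + 5) = (19, 13), index = 13 * 32 + 19 = 435.
static_assert(makeCoords(UInt2{2, 1}, UInt2{3, 5}, UInt2{4, 3}, UInt2{8, 8}).index == 435,
              "global index of pixel (19, 13) in a 32-wide image");
```

The last thread of the last group, gid = (3, 2) and tid = (7, 7), maps to index 767, which is 32 * 24 - 1, so the three spaces tile the image exactly.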

Compute Shader Dispatch

When dispatching compute threads, it helps to think in group space:

  1. The CPU dispatches groups.
// Total number of groups = gcount.x * gcount.y * gcount.z
gcount = subdivideWorkIntoGroups(); // this maps the work (input) space to group space
context->Dispatch(gcount.x, gcount.y, gcount.z); // in D3D11, Dispatch is a method of ID3D11DeviceContext
  2. The compute shader works per group and needs to declare the number of threads per group.
// Total number of threads per group = gsize.x * gsize.y * gsize.z
[numthreads(GSIZE_X, GSIZE_Y, GSIZE_Z)]           // gsize; GSIZE_* are placeholder compile-time constants
void cs_main(uint3 id     : SV_DispatchThreadID,  // global space
             uint3 gid    : SV_GroupID,           // group space
             uint3 tid    : SV_GroupThreadID,     // thread space
             uint  tindex : SV_GroupIndex)        // thread space
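
The post does not show what subdivideWorkIntoGroups() does, so the following is only one plausible sketch in C++: divide the work size by the group size in each dimension and round up. The names UInt3, divideRoundUp and groupCountFor are hypothetical, and the sketch assumes work + gsize - 1 does not overflow 32 bits.

```cpp
#include <cassert>
#include <cstdint>

struct UInt3 { uint32_t x, y, z; };  // hypothetical stand-in for HLSL's uint3

// Groups needed to cover `work` items with `gsize` threads per group, rounded up.
constexpr uint32_t divideRoundUp(uint32_t work, uint32_t gsize) {
    return (work + gsize - 1) / gsize;
}

// Map the work (input) space to group space, one dimension at a time.
constexpr UInt3 groupCountFor(UInt3 work, UInt3 gsize) {
    return UInt3{ divideRoundUp(work.x, gsize.x),
                  divideRoundUp(work.y, gsize.y),
                  divideRoundUp(work.z, gsize.z) };
}

// A 1920x1080 image with 16x16x1 groups needs 120 x 68 x 1 groups.
static_assert(groupCountFor(UInt3{1920, 1080, 1}, UInt3{16, 16, 1}).x == 120, "exact fit in x");
static_assert(groupCountFor(UInt3{1920, 1080, 1}, UInt3{16, 16, 1}).y == 68, "rounded up in y");
```

Because 68 * 16 = 1088 is larger than 1080, the last row of groups contains threads that fall outside the image; with this scheme the shader has to compare id against the real work size and return early for those threads.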

Group size is usually tied to groupshared memory capacity (16 KB per group on cs_4_x, 32 KB per group on cs_5_0). Usually, we want each thread to load its own data into groupshared memory, so:

max number of threads = groupshared capacity / sizeof(per_thread_data).
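
As a worked example of this budget, here is a small C++ sketch using the 16 KB figure. PerThreadData is a made-up payload (four floats, 16 bytes) and maxThreadsForBudget is a hypothetical helper, not part of any API.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical per-thread payload: one float4-style colour, 16 bytes.
struct PerThreadData { float r, g, b, a; };

constexpr std::size_t kGroupsharedBytes = 16 * 1024;  // the 16 KB budget from the text

// max number of threads = groupshared capacity / sizeof(per_thread_data)
constexpr std::size_t maxThreadsForBudget(std::size_t budgetBytes, std::size_t bytesPerThread) {
    return budgetBytes / bytesPerThread;
}

static_assert(sizeof(PerThreadData) == 16, "payload is assumed to be 16 bytes");
static_assert(maxThreadsForBudget(kGroupsharedBytes, sizeof(PerThreadData)) == 1024,
              "16 KB / 16 bytes = 1024 threads");
```

The result is only an upper bound from the memory side; the group size still has to respect the per-group thread limits listed at the end of this post.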

If the problem has a natural grouping (cluster) that is too big to fit into groupshared memory, the next best thing is to map it to the wavefront size (64 on AMD). You will then need a mapping from the application's cluster space to group space.


Some useful limits when designing a compute shader:

groupsharedCapacity = 16KB; // cs_4_x; cs_5_0 allows 32KB


// D3D 11.x hardware
gsizeMax = uint3(1024, 1024, 64); // D3D11_CS_THREAD_GROUP_MAX_X{_Y}{_Z}
assert(gsize.x * gsize.y * gsize.z <= 1024); // D3D11_CS_THREAD_GROUP_MAX_THREADS_PER_GROUP

// D3D 10.x hardware
gsizeMax = uint3(768, 768, 1); // D3D11_CS_4_X_THREAD_GROUP_MAX_X{_Y}; Z is limited to 1, as is the dispatch Z (D3D11_CS_4_X_DISPATCH_MAX_THREAD_GROUPS_IN_Z_DIMENSION)
assert(gsize.x * gsize.y * gsize.z <= 768); // D3D11_CS_4_X_THREAD_GROUP_MAX_THREADS_PER_GROUP
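
These limits can be folded into a small validity check. The C++ sketch below hardcodes the numbers quoted above rather than pulling the D3D11_CS_* constants from the SDK headers; GroupLimits, kLimitsD3D11, kLimitsD3D10 and isValidGroupSize are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

struct UInt3 { uint32_t x, y, z; };  // hypothetical stand-in for HLSL's uint3

// Per-dimension maximum and total thread budget for one thread group.
struct GroupLimits {
    UInt3    gsizeMax;
    uint32_t maxThreadsPerGroup;
};

constexpr GroupLimits kLimitsD3D11 = { {1024, 1024, 64}, 1024 };  // D3D 11.x hardware
constexpr GroupLimits kLimitsD3D10 = { {768, 768, 1}, 768 };      // D3D 10.x hardware

// A group size is valid if every dimension is in range and the total fits the budget.
constexpr bool isValidGroupSize(UInt3 gsize, GroupLimits limits) {
    return gsize.x >= 1 && gsize.y >= 1 && gsize.z >= 1
        && gsize.x <= limits.gsizeMax.x
        && gsize.y <= limits.gsizeMax.y
        && gsize.z <= limits.gsizeMax.z
        && static_cast<uint64_t>(gsize.x) * gsize.y * gsize.z <= limits.maxThreadsPerGroup;
}

// 32x32x1 = 1024 threads: fine on 11.x hardware, too many for 10.x hardware.
static_assert(isValidGroupSize(UInt3{32, 32, 1}, kLimitsD3D11), "fits the 1024-thread budget");
static_assert(!isValidGroupSize(UInt3{32, 32, 1}, kLimitsD3D10), "exceeds the 768-thread budget");
```

Note that the per-dimension maximum and the total budget are separate constraints: 1024x1x1 and 32x32x1 both pass on 11.x hardware, while 1024x2x1 fails even though each dimension is in range.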