Compute Shader Spaces
I often find myself struggling with compute shaders, and the difficulty is mostly about terminology. As always, my way of overcoming this confusion is to make rules for it. This post collects my personal rules for thinking about compute shaders in D3D; hopefully you will find them useful.
Naming Convention
Multi-dimensional (integer) indices are suffixed with id. The flattened versions of those indices are suffixed with index.
// 1D
uint threadIndex; // one-dimensional thread index
// 2D (nD)
uint2 threadId; // two-dimensional thread index
threadIndex = threadId.y * threadCountX + threadId.x; // flattened version of threadId
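The id/index convention can be sketched on the CPU side as a pair of helpers. This is a minimal C++ illustration, not part of the original post: the `Uint2` struct and the `flatten`/`unflatten` names are hypothetical, and row-major order (x varies fastest) is assumed, matching the formula above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for HLSL's uint2.
struct Uint2 { uint32_t x, y; };

// Flatten a 2D id into a 1D index (row-major: x varies fastest).
inline uint32_t flatten(Uint2 id, uint32_t countX) {
    assert(id.x < countX);
    return id.y * countX + id.x;
}

// Recover the 2D id from a flat index.
inline Uint2 unflatten(uint32_t index, uint32_t countX) {
    assert(countX > 0);
    return Uint2{ index % countX, index / countX };
}
```

For example, with 8 threads per row, the id (3, 2) flattens to index 2 * 8 + 3 = 19, and unflattening 19 gives (3, 2) back.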
There are three spaces to think about when working with compute shaders:
- Global space
- Group space
- Thread space
Each space is just a partition of the next larger space. When working in a 2D domain, it helps to map global space to the whole image, group space to the sub-rectangles (image tiles), and thread space to an individual pixel within a tile.
D3D uses a semi-weird (to me) naming scheme for compute shader system-value semantics. Naming the variables as follows helps me think more clearly:
// thread space
uint2 tid; // SV_GroupThreadID
uint2 gsize; // Thread counts per group (available in shader since this is specified in [numthreads()])
uint tindex; // SV_GroupIndex, tindex = tid.y * gsize.x + tid.x
// group space
uint2 gid; // SV_GroupID
uint2 gcount; // Group counts (tile count in each dimension, passed from constant buffer/hardcoded)
uint gindex; // gindex = gid.y * gcount.x + gid.x
// global space
uint2 id; // SV_DispatchThreadID, id = gid * gsize + tid
uint2 count; // Thread (pixel) count, count = gcount * gsize
uint index; // index = id.y * count.x + id.x
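To check the relations between the three spaces, here is a small CPU-side sketch in C++ that derives every value in the listing above from just `gid`, `tid`, `gsize` and `gcount`. The `Uint2` and `ThreadCoords` types and the `makeCoords` function are hypothetical names introduced for illustration; in a real shader the hardware supplies these values through the system-value semantics.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for HLSL's uint2.
struct Uint2 { uint32_t x, y; };

// All coordinates of one thread, using the naming from the listing above.
struct ThreadCoords {
    Uint2 tid;  uint32_t tindex;  // thread space
    Uint2 gid;  uint32_t gindex;  // group space
    Uint2 id;   uint32_t index;   // global space
};

inline ThreadCoords makeCoords(Uint2 gid, Uint2 tid, Uint2 gsize, Uint2 gcount) {
    assert(tid.x < gsize.x && tid.y < gsize.y);
    assert(gid.x < gcount.x && gid.y < gcount.y);

    ThreadCoords c{};
    c.tid = tid;
    c.tindex = tid.y * gsize.x + tid.x;
    c.gid = gid;
    c.gindex = gid.y * gcount.x + gid.x;
    // id = gid * gsize + tid, component-wise
    c.id = Uint2{ gid.x * gsize.x + tid.x, gid.y * gsize.y + tid.y };
    // count = gcount * gsize, component-wise
    const Uint2 count{ gcount.x * gsize.x, gcount.y * gsize.y };
    c.index = c.id.y * count.x + c.id.x;
    return c;
}
```

As a worked case, take 8x8 threads per group and 4x3 groups (a 32x24 image). Thread (3, 5) of group (2, 1) has tindex 43, gindex 6, global id (19, 13) and global index 13 * 32 + 19 = 435.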
Compute Shader Dispatch
When dispatching compute threads, it helps to think in group space:
- The CPU dispatches groups.
// Total number of groups = gcount.x * gcount.y * gcount.z
gcount = subdivideWorkIntoGroups(); // this maps the work (input) space to group space
context->Dispatch(gcount.x, gcount.y, gcount.z); // ID3D11DeviceContext::Dispatch
- The compute shader runs per group and must declare the number of threads per group.
// Total number of threads per group = gsize.x * gsize.y * gsize.z
// gsize = uint3(GROUP_SIZE_X, GROUP_SIZE_Y, GROUP_SIZE_Z)
[numthreads(GROUP_SIZE_X, GROUP_SIZE_Y, GROUP_SIZE_Z)]
void cs_main(uint3 id : SV_DispatchThreadID, // global space
uint3 gid : SV_GroupID, // group space
uint3 tid : SV_GroupThreadID, // thread space
uint tindex : SV_GroupIndex) // thread space
{
}
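The post leaves `subdivideWorkIntoGroups()` unspecified. One common way to do it for a one-thread-per-pixel image pass is ceiling division, sketched below in C++. The `Uint3` struct and the `divRoundUp` and `subdivideImageIntoGroups` names are hypothetical, and the one-thread-per-pixel mapping is an assumption; other workloads will map differently.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for HLSL's uint3.
struct Uint3 { uint32_t x, y, z; };

// Ceiling division: the smallest n such that n * divisor >= value.
// Assumes value + divisor - 1 does not overflow, which holds for image sizes.
inline uint32_t divRoundUp(uint32_t value, uint32_t divisor) {
    assert(divisor > 0);
    return (value + divisor - 1) / divisor;
}

// Map a width x height image to group space, one thread per pixel.
// gsize must match the values given to [numthreads()] in the shader.
inline Uint3 subdivideImageIntoGroups(uint32_t width, uint32_t height, Uint3 gsize) {
    return Uint3{ divRoundUp(width, gsize.x), divRoundUp(height, gsize.y), 1 };
}
```

A 1920x1080 image with 8x8 groups gives gcount = (240, 135, 1). When the image size is not a multiple of the group size, the last row or column of groups contains threads that fall outside the image, so the shader should compare `id` against the image size and return early for those threads.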
Group size is usually correlated with groupshared memory size (16 KB per group for cs_4_x, 32 KB for cs_5_0). Usually, we want each thread to load data into groupshared memory, so with a 16 KB budget:
max number of threads = 16KB / sizeof(per_thread_data).
If the problem has a natural grouping (cluster) that's too big to fit into groupshared memory, the next best thing is to map it to the wavefront size (for example, 64 on AMD GCN hardware). You will need a mapping from the application's cluster space to group space.
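The formula above can be sketched as a small C++ helper. The `maxThreadsForSharedMemory` name and its parameters are hypothetical; the helper also clamps to a per-group thread limit, since the groupshared budget is not the only constraint on group size.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Largest thread count such that every thread gets perThreadBytes of
// groupshared memory, clamped to the API's threads-per-group limit.
inline uint32_t maxThreadsForSharedMemory(size_t capacityBytes,
                                          size_t perThreadBytes,
                                          uint32_t threadLimit) {
    assert(perThreadBytes > 0);
    const size_t byMemory = capacityBytes / perThreadBytes;
    return byMemory < threadLimit ? static_cast<uint32_t>(byMemory) : threadLimit;
}
```

With a 16 KB budget and 64 bytes of data per thread, this gives 16384 / 64 = 256 threads. With 16 bytes per thread (one float4), the memory budget would allow 1024 threads, so the thread limit becomes the binding constraint instead.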
Limits
Some useful limits when designing a compute shader:
groupsharedCapacity = 16KB; // cs_4_x; cs_5_0 allows 32KB
gcountMax = uint3(64K - 1); // D3D11_CS_DISPATCH_MAX_THREAD_GROUPS_PER_DIMENSION
// D3D 11.x hardware
gsizeMax = uint3(1024, 1024, 64); // D3D11_CS_THREAD_GROUP_MAX_X{_Y}{_Z}
assert(gsize.x * gsize.y * gsize.z <= 1024); // D3D11_CS_THREAD_GROUP_MAX_THREADS_PER_GROUP
// D3D 10.x hardware
gsizeMax = uint3(768, 768, 1); // D3D11_CS_4_X_THREAD_GROUP_MAX_X{_Y}; Z of numthreads must be 1 (Dispatch Z is also limited to 1, D3D11_CS_4_X_DISPATCH_MAX_THREAD_GROUPS_IN_Z_DIMENSION)
assert(gsize.x * gsize.y * gsize.z <= 768); // D3D11_CS_4_X_THREAD_GROUP_MAX_THREADS_PER_GROUP
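These limits can be turned into a CPU-side sanity check. The sketch below is C++ with hypothetical names (`Uint3`, `GroupLimits`, `kCs5Limits`, `kCs4Limits`, `isValidGroupSize`); the numbers are the ones listed above. Note that the per-dimension maximums and the total thread count are separate constraints: a group size must satisfy both.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for HLSL's uint3.
struct Uint3 { uint32_t x, y, z; };

struct GroupLimits {
    Uint3 gsizeMax;               // per-dimension maximum for [numthreads()]
    uint32_t threadsPerGroupMax;  // maximum of gsize.x * gsize.y * gsize.z
};

constexpr GroupLimits kCs5Limits{ {1024, 1024, 64}, 1024 };  // D3D 11.x hardware
constexpr GroupLimits kCs4Limits{ {768, 768, 1}, 768 };      // D3D 10.x hardware

inline bool isValidGroupSize(Uint3 gsize, const GroupLimits& limits) {
    assert(limits.threadsPerGroupMax > 0);
    if (gsize.x == 0 || gsize.y == 0 || gsize.z == 0) return false;
    if (gsize.x > limits.gsizeMax.x ||
        gsize.y > limits.gsizeMax.y ||
        gsize.z > limits.gsizeMax.z) return false;
    // 64-bit product so the multiplication cannot overflow.
    const uint64_t total = uint64_t(gsize.x) * gsize.y * gsize.z;
    return total <= limits.threadsPerGroupMax;
}
```

For instance, 32x32x1 (1024 threads) passes on D3D 11.x hardware but fails on D3D 10.x hardware, where 32x24x1 (768 threads) is the most a group of that width can hold.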
Links
- Compute Shader Overview - https://docs.microsoft.com/en-us/windows/win32/direct3d11/direct3d-11-advanced-stages-compute-shader