More exclusive details on the PlayStation 4 hUMA implementation and memory enhancements

Tags: AMD, hUMA, Sony

As promised in our previous article, here is new information about the memory system enhancements in the PlayStation 4.

Bypass Bits

– If many of these sorts of compute shaders are being run simultaneously, there is “cross talk” in that one compute dispatch may force an invalidate or a premature flush of another dispatch’s SC memory

– As a result of this (and other factors), it may be optimal to bypass either the L1, or the L2, or both

Bypassing all caches for the accesses to the shared CPU-GPU memory (effectively making the data UC rather than SC) will remove the need for the invalidates and writebacks of L1 and L2
At the same time, there will be more – perhaps much more – traffic to and from system memory

– It is possible to change the V# and T# definitions on a dispatch-by-dispatch basis when exploring these issues and tuning the application

– However, in order to allow for a more stable and debuggable programming approach:

Two override bits have been added to the draw call and dispatch controls
The L1 bypass bit specifies that operations on GC and SC memory bypass the L1 and go directly to L2
The L2 bypass bit specifies that operations on SC memory bypass the L2, using the new “Onion+” bus
This allows the application programmer to use the same shader code and V#/T# definitions, and then run the shaders with several different cache flush strategies. No recompilation or reconfiguration is required
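
As a rough illustration of why two override bits are convenient, the sketch below enumerates the four cache strategies they allow for one unchanged shader. The names `CacheOverride`, `L1_BYPASS`, `L2_BYPASS`, `describe` and `all_strategies` are hypothetical and do not come from any PlayStation 4 SDK; only the behavior of each bit is taken from the description above.

```python
from enum import Flag
from itertools import product


class CacheOverride(Flag):
    """Hypothetical per-draw/dispatch override bits (names are illustrative)."""
    NONE = 0
    L1_BYPASS = 1  # GC and SC operations skip the L1 and go directly to L2
    L2_BYPASS = 2  # SC operations skip the L2 and use the Onion+ bus


def describe(override: CacheOverride) -> str:
    """Summarize the path memory operations take under the given bits."""
    l1 = "bypass L1" if CacheOverride.L1_BYPASS in override else "use L1"
    if CacheOverride.L2_BYPASS in override:
        l2 = "SC bypasses L2 via Onion+"
    else:
        l2 = "SC uses L2"
    return f"{l1}, {l2}"


def all_strategies() -> list:
    """All four bit combinations that can be tried without recompiling."""
    return [
        (CacheOverride.L1_BYPASS if l1 else CacheOverride.NONE)
        | (CacheOverride.L2_BYPASS if l2 else CacheOverride.NONE)
        for l1, l2 in product((False, True), repeat=2)
    ]


if __name__ == "__main__":
    # Same shader and V#/T# definitions, four cache behaviors to profile.
    for strategy in all_strategies():
        print(describe(strategy))
```

In a tuning session one would presumably time the same dispatch under each of the four settings and keep the fastest, which is the workflow the override bits are meant to enable.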

Four Memory Buffer Usage Examples

1) Simple Rendering

– Vertex shader and pixel shader only; the pixel shader does not perform direct memory accesses

– Vertex buffers (RO)

– Textures (RO)

– Color and depth buffers are written using dedicated hardware mechanisms, not memory buffers

2) Raycast

– In order to compute visibility (“can the enemy see the player”) or sound effect volume (“is there a direct path from audio source to player”), sets of 64 rays are compared against large triangle databases

– Triangle databases (RO)

– Input rays (SC)

– Output collisions (SC)

– The raycast probably doesn’t use much SC data and could potentially entirely bypass L2

3) Procedural Geometry (e.g. water surface)

– The CPU maintains a high level state of the water (ripples, splashes coming from interactions with game objects). The GPU generates the detailed water mesh, which is used only for rendering

– Input: water state as maintained by CPU (SC)

– Output: detailed water surface (GC)

4) Chained compute shaders

– Compute shaders write semaphores for the CP to read, enabling other compute dispatches (and perhaps draw calls) to run. They also add packets to compute pipe queues (perhaps packets that kick off more compute dispatches)

– Various buffers (RO, PV, GC, SC)

– Semaphores (UC)

– Compute pipe queue (UC)

– NOTE that the CP does not have access to the GPU L2, so semaphores and queue contents must either be assigned the SC memory type (visible to the CP only after an L2 writeback) or the UC memory type (which bypasses the L2)

– Using UC can allow for greater flexibility, e.g. a compute dispatch can have several stages that send and receive semaphores. Using SC requires the dispatch to terminate before the semaphore is visible externally
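
The trade-off between SC and UC for semaphores can be captured in a very small model. This is only a sketch of the rule stated in the note above; the function name `visible_to_cp` and the string labels are assumptions made for illustration, and real hardware behavior involves more states than a single boolean.

```python
def visible_to_cp(mem_type: str, l2_written_back: bool) -> bool:
    """Return True if a GPU write to this memory type can be seen by the CP.

    Simplified model covering only the two memory types the article
    suggests for semaphores and compute pipe queue contents.
    """
    if mem_type == "UC":
        # UC bypasses the L2, so the write is visible without a writeback.
        return True
    if mem_type == "SC":
        # SC data stays in the L2 until a writeback pushes it out.
        return l2_written_back
    raise ValueError(f"unsupported memory type for CP-visible data: {mem_type}")


if __name__ == "__main__":
    # Mid-dispatch: no L2 writeback has happened yet.
    print("UC semaphore visible mid-dispatch:", visible_to_cp("UC", False))
    print("SC semaphore visible mid-dispatch:", visible_to_cp("SC", False))
    # After the dispatch terminates and the L2 is written back.
    print("SC semaphore visible after writeback:", visible_to_cp("SC", True))
```

This is why a multi-stage dispatch that signals between its stages would lean towards UC: with SC the signal only becomes visible once the writeback has happened.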

Strategies for Scalar Loads

– In addition to the “gather read” and “scatter write” loads into VGPRs (Vector GPRs), the R10xx core also supports scalar reads and writes into SGPRs (Scalar GPRs)

Typically, scalar reads are used to load T#, V#, and S# structures, as well as any other data that applies to the wavefront as a whole (as opposed to the vector reads that load data on a thread-by-thread basis)

– These read operations use the L2, but instead of the L1 they use a different cache called the “K-cache”. There is one 16 KB K-cache for every three CUs

The K-cache must be invalidated when there is the possibility that it may contain “stale” data, e.g. a later draw call or dispatch uses the same location in the T# (etc) ring buffer as an earlier call
K-cache invalidation itself takes only 1 cycle, but it dumps all cached data, so the effective cost is high
The most straightforward way of reducing the invalidation count is to use larger ring buffers for the scalar input data to the draw calls and dispatches
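
To see how ring buffer size affects the invalidation count, here is a minimal simulation. It rests on assumptions that are not in the original material: each draw call or dispatch consumes exactly one ring slot, and one K-cache invalidate is issued whenever a slot is about to be reused since the last invalidate. The function name `count_kcache_invalidates` is hypothetical.

```python
def count_kcache_invalidates(num_calls: int, ring_slots: int) -> int:
    """Count K-cache invalidates for a run of draw calls/dispatches.

    Assumed model: call i writes its scalar inputs (T#, V#, ...) to ring
    slot i % ring_slots. If that slot was already used since the last
    invalidate, the K-cache may hold stale data and must be invalidated.
    """
    if ring_slots <= 0:
        raise ValueError("ring_slots must be positive")

    invalidates = 0
    used_since_invalidate = set()
    for call in range(num_calls):
        slot = call % ring_slots
        if slot in used_since_invalidate:
            # An invalidate dumps everything, so tracking starts afresh.
            invalidates += 1
            used_since_invalidate.clear()
        used_since_invalidate.add(slot)
    return invalidates


if __name__ == "__main__":
    for slots in (10, 100, 1000):
        n = count_kcache_invalidates(1000, slots)
        print(f"{slots:5d} slots -> {n} invalidates per 1000 calls")
```

Under this model the number of invalidates falls roughly in proportion to the ring size, which matches the advice to use larger ring buffers.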

Performance

– Performance of the L2 cache operations is much better on Liverpool than on R10xx

– The L2 invalidate typically takes 300-350 cycles

All in-flight memory transactions must settle before the invalidate can be completed
A small overhead (about 75 cycles) is required to locate and invalidate the lines
This results in the direct cost listed above. There is also an indirect cost, in that invalidated SC data must potentially be reloaded

– The cost of an L2 writeback depends on the amount of data that must be written back to system memory

The Onion bus can support 10 GB/sec, which means 12.5 bytes/cycle (0.2 lines/cycle)
If we attribute 160 GB/sec of the Garlic bus to the GPU, the bus can support 200 bytes/cycle (3.125 lines/cycle)

– If there is only a little SC dirty data present in the L2, the writeback is fairly fast

4K bytes worth of dirty Onion SC lines will take perhaps 500 cycles (Onion bottleneck PLUS small overhead to locate lines PLUS latency to system memory)
20K bytes worth of dirty Garlic SC lines will take about the same time

– Worst case L2 writeback cost is basically the Onion or Garlic cost of writing 512 KB (about 40,000 cycles and 3,000 cycles respectively)
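
The bandwidth figures above can be reproduced with a few lines of arithmetic. Two constants in this sketch are assumptions inferred from the quoted numbers rather than stated in the text: an 800 MHz clock (10 GB/sec divided by 12.5 bytes/cycle) and 64-byte cache lines (12.5 bytes/cycle is roughly 0.2 lines/cycle). The function names are hypothetical, and the model covers only the bus transfer time, not the line-locating overhead or memory latency.

```python
CLOCK_HZ = 800e6              # assumption: implied by 10 GB/s == 12.5 B/cycle
LINE_BYTES = 64               # assumption: implied by 12.5 B/cycle ~= 0.2 lines
ONION_BYTES_PER_SEC = 10e9    # from the article
GARLIC_BYTES_PER_SEC = 160e9  # share attributed to the GPU in the article


def bytes_per_cycle(bytes_per_sec: float) -> float:
    """Bus throughput expressed in bytes per clock cycle."""
    return bytes_per_sec / CLOCK_HZ


def lines_per_cycle(bytes_per_sec: float) -> float:
    """Bus throughput expressed in cache lines per clock cycle."""
    return bytes_per_cycle(bytes_per_sec) / LINE_BYTES


def transfer_cycles(dirty_bytes: int, bytes_per_sec: float) -> float:
    """Cycles needed just to push dirty_bytes over the bus."""
    return dirty_bytes / bytes_per_cycle(bytes_per_sec)


if __name__ == "__main__":
    buses = (("Onion", ONION_BYTES_PER_SEC), ("Garlic", GARLIC_BYTES_PER_SEC))
    for name, bw in buses:
        print(f"{name}: {bytes_per_cycle(bw)} bytes/cycle, "
              f"{lines_per_cycle(bw):.3f} lines/cycle, "
              f"512 KB writeback ~ {transfer_cycles(512 * 1024, bw):.0f} cycles")
```

Under these assumptions a full 512 KB writeback comes out near 41,900 cycles on Onion and about 2,600 on Garlic, consistent with the rounded 40,000 and 3,000 above. For 4 KB on Onion the transfer alone is about 328 cycles; the remainder of the "perhaps 500 cycles" would be the overhead and latency terms, which are not quantified here.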

Additional Optimizations

– There are further optimizations in the L1 and L2 caches

– The L2 cache has dirty state tracking

If the L2 has performed no reads from SC memory since the last invalidate, it will ignore any requests to invalidate
If the L2 has performed no writes to SC memory since the last writeback, it will ignore any requests to perform a writeback
This will help performance in the situation where multiple pipes are requesting invalidates and writebacks, e.g. several compute pipes are separately dispatching compute shaders that use SC memory

– The L1 cache can be invalidated “once per CU”
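
The L2 dirty state tracking described above can be modeled with two flags. This is a toy sketch of the stated rules only, not of the actual hardware; the class `L2DirtyTracker` and its method names are invented for illustration.

```python
class L2DirtyTracker:
    """Toy model of the L2 ignoring redundant invalidates and writebacks."""

    def __init__(self) -> None:
        self.sc_read_since_invalidate = False
        self.sc_written_since_writeback = False
        self.invalidates_performed = 0
        self.writebacks_performed = 0

    def read_sc(self) -> None:
        """Record that the L2 has read from SC memory."""
        self.sc_read_since_invalidate = True

    def write_sc(self) -> None:
        """Record that the L2 holds newly written SC data."""
        self.sc_written_since_writeback = True

    def request_invalidate(self) -> bool:
        """Perform an invalidate only if SC data was read since the last one."""
        if not self.sc_read_since_invalidate:
            return False  # request ignored, no cost paid
        self.invalidates_performed += 1
        self.sc_read_since_invalidate = False
        return True

    def request_writeback(self) -> bool:
        """Perform a writeback only if SC data was written since the last one."""
        if not self.sc_written_since_writeback:
            return False  # request ignored, no cost paid
        self.writebacks_performed += 1
        self.sc_written_since_writeback = False
        return True


if __name__ == "__main__":
    l2 = L2DirtyTracker()
    l2.read_sc()
    # Three compute pipes each ask for an invalidate; only the first is real.
    results = [l2.request_invalidate() for _ in range(3)]
    print(results, l2.invalidates_performed)
```

With several compute pipes issuing requests back to back, only the first request after actual SC traffic does any work in this model, which is the saving the tracking is meant to provide.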