More exclusive details on the PlayStation 4 hUMA implementation and memory enhancements


As we promised in our previous article, we present new information about the enhancements in the memory system on PlayStation 4.

 

Bypass Bits

– If many of these sorts of compute shaders are being run simultaneously, there is “cross talk”, in that one compute dispatch may force an invalidate or a premature flush of another dispatch’s SC memory

– As a result of this (and other factors), it may be optimal to bypass either the L1, or the L2, or both

  • Bypassing all caches for accesses to the shared CPU-GPU memory (effectively making the data UC rather than SC) will remove the need for the invalidates and writebacks of L1 and L2
  • At the same time, there will be more – perhaps much more – traffic to and from system memory

– It is possible to change the V# and T# definitions on a dispatch by dispatch basis when exploring these issues and tuning the application

– However, in order to allow for a more stable and debuggable programming approach:

  • Two override bits have been added to the draw call and dispatch controls
  • The L1 bypass bit specifies that operations on GC and SC memory bypass the L1 and go directly to L2
  • The L2 bypass bit specifies that operations on SC memory bypass the L2, using the new “Onion+” bus
  • This allows the application programmer to use the same shader code and V#/T# definitions, and then run the shaders with several different cache flush strategies, as sketched below. No recompilation or reconfiguration is required
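
A minimal host-side sketch of how these per-dispatch overrides might be exercised when tuning. The flag names, `DispatchParams` struct, and `dispatch()` call are hypothetical illustrations, not the actual SDK interface:

```cpp
#include <cstdint>

// Hypothetical override bits mirroring the two bits described above.
enum CacheBypassFlags : uint32_t {
    kBypassNone = 0,
    kBypassL1   = 1u << 0,  // GC/SC operations skip the L1, go directly to L2
    kBypassL2   = 1u << 1,  // SC operations skip the L2, using the Onion+ bus
};

struct DispatchParams {
    uint32_t groupsX, groupsY, groupsZ;
    uint32_t cacheBypass;   // per-dispatch override, no shader recompile needed
};

// Same shader code and V#/T# definitions; only the flush strategy varies.
void tuneFlushStrategy(/* CommandBuffer& cb, ComputeShader& cs */) {
    const uint32_t strategies[] = {
        kBypassNone,            // rely on explicit L1/L2 invalidates + writebacks
        kBypassL1,              // avoid per-dispatch L1 maintenance
        kBypassL1 | kBypassL2,  // effectively UC: no cache maintenance,
                                // but more traffic to system memory
    };
    for (uint32_t flags : strategies) {
        DispatchParams p{64, 1, 1, flags};
        // dispatch(cb, cs, p);  // profile each variant and keep the fastest
        (void)p;
    }
}
```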

 

Four Memory Buffer Usage Examples

1)      Simple Rendering

– Vertex shader and pixel shader only; the pixel shader performs no direct memory accesses

– Vertex buffers (RO)

– Textures (RO)

– Color and depth buffers are written using dedicated hardware mechanisms, not memory buffers

2)      Raycast

– In order to compute visibility (“can the enemy see the player”) or sound effect volume (“is there a direct path from audio source to player”), sets of 64 rays are compared against large triangle databases

– Triangle databases (RO)

– Input rays (SC)

– Output collisions (SC)

– The raycast probably doesn’t use much SC data and could potentially bypass the L2 entirely (sketched below)
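
A sketch of the raycast setup under these memory types; the structs and the `dispatchCompute()` placeholder are assumptions for illustration:

```cpp
#include <cstdint>

struct Ray { float origin[3]; float dir[3]; };
struct Hit { float t; uint32_t triangleIndex; };

// Placeholder for a dispatch kick that accepts per-dispatch bypass bits.
static void dispatchCompute(uint32_t threads, uint32_t bypassFlags,
                            const Ray* in, Hit* out, const void* triDb) {
    (void)threads; (void)bypassFlags; (void)in; (void)out; (void)triDb;
}

void raycastBatch(const Ray* raysSC,      // input rays        (SC)
                  Hit* hitsSC,            // output collisions (SC)
                  const void* triDbRO) {  // triangle database (RO)
    // The SC footprint is tiny (64 rays in, 64 hits out), so bypassing the
    // L2 (and L1) for SC accesses avoids paying an L2 invalidate/writeback
    // around every dispatch, at the price of uncached Onion+ traffic.
    const uint32_t kBypassL1 = 1u << 0, kBypassL2 = 1u << 1;
    dispatchCompute(64, kBypassL1 | kBypassL2, raysSC, hitsSC, triDbRO);
}
```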

3)      Procedural Geometry (e.g. water surface)

– The CPU maintains a high-level state of the water (ripples, splashes coming from interactions with game objects). The GPU generates the detailed water mesh, which is used only for rendering (see the sketch after this example)

– Input: water state as maintained by CPU (SC)

– Output: detailed water surface (GC)
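
A per-frame sketch of the data flow in this example; the struct contents and function names are placeholders:

```cpp
// Hypothetical per-frame flow: CPU-owned simulation state in, GPU-only mesh out.
struct WaterState { /* ripples, splashes: high-level simulation data */ };
struct WaterMesh  { /* detailed vertices generated on the GPU */ };

void waterFrame(WaterState* stateSC, WaterMesh* meshGC) {
    // 1. CPU updates the high-level state. The buffer is SC, so the CPU's
    //    writes become visible to the GPU coherently (no manual flush).
    // simulateWater(stateSC);

    // 2. GPU expands the state into a detailed mesh. The mesh is GC: it is
    //    produced and consumed by the GPU alone (rendering only), so it can
    //    stay behind the GPU caches and never needs to reach the CPU.
    // dispatchWaterMesher(stateSC, meshGC);
    // drawWater(meshGC);
    (void)stateSC; (void)meshGC;
}
```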

4)      Chained compute shaders

– Compute shaders write semaphores for the CP to read, enabling other compute dispatches (and perhaps draw calls) to run. They also add packets to compute pipe queues (perhaps packets that kick off more compute dispatches)

– Various buffers (RO, PV, GC, SC)

– Semaphores (UC)

– Compute pipe queue (UC)

– NOTE that the CP does not have access to the GPU L2, so semaphores and queue contents must either be assigned the SC memory type (visible to the CP only after an L2 writeback) or the UC memory type (which bypasses the L2)

– Using UC can allow for greater flexibility, e.g. a compute dispatch can have several stages that send and receive semaphores (see the sketch below). Using SC requires the dispatch to terminate before the semaphore is visible externally
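
A sketch of the UC-semaphore pattern; the enqueue helpers are hypothetical stand-ins for the real command-buffer packets:

```cpp
#include <cstdint>

// A semaphore placed in UC memory: a compute shader's write bypasses the GPU
// L2, so the CP (which cannot snoop the L2) sees the new value immediately --
// the dispatch does not have to terminate, and no L2 writeback is needed.
struct UcSemaphore { volatile uint32_t value; };

const uint32_t kStage1Done = 1;

void chainDispatches(UcSemaphore* sem /* allocated as UC */) {
    // Stage 1: a dispatch that stores kStage1Done to sem->value partway
    // through its work, then continues computing.
    // enqueueDispatch(stage1Shader, sem);

    // CP-side wait packet: blocks the next dispatch until the semaphore is
    // set. With SC instead of UC, stage 1 would have to fully terminate and
    // an L2 writeback occur before this wait could ever pass.
    // enqueueWaitSemaphore(&sem->value, kStage1Done);
    // enqueueDispatch(stage2Shader, sem);
    (void)sem;
}
```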

Strategies for Scalar Loads

– In addition to the “gather read” and “scatter write” operations on VGPRs (Vector GPRs), the R10xx core also supports scalar reads and writes through SGPRs (Scalar GPRs)

  • Typically, scalar reads are used to load T#, V#, and S# structures, as well as any other data that applies to the wavefront as a whole (as opposed to the vector reads that load data on a thread-by-thread basis)

– These read operations use the L2, but instead of the L1 they use a different cache called the “K-cache”. There is one 16 KB K-cache for every three CUs

  • The K-cache must be invalidated when there is the possibility that it may contain “stale” data, e.g. a later draw call or dispatch uses the same location in the T# (etc) ring buffer as an earlier call
  • K-cache invalidation takes only one cycle, but it discards all cached data, so the indirect cost of reloading can be high
  • The most straightforward way of reducing the invalidation count is to use larger ring buffers for the scalar input data to the draw calls and dispatches, as sketched below
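
A sketch of the ring-buffer idea; the allocator below is an illustrative assumption, not SDK code:

```cpp
#include <cstdint>
#include <cstddef>

// Scalar inputs (T#/V#/S# structures) for draws/dispatches are sub-allocated
// from a ring. A larger ring means a given slot is reused much later, so the
// "same location, stale K-cache data" hazard -- and the invalidate it forces --
// occurs far less often.
struct ScalarRing {
    uint8_t* base = nullptr;
    size_t   size = 0;      // bigger ring => fewer forced K-cache invalidates
    size_t   head = 0;

    // Returns storage for one call's scalar data; *wrapped signals that old
    // slots are being reused and a K-cache invalidate must be issued.
    void* alloc(size_t bytes, bool* wrapped) {
        *wrapped = (head + bytes > size);
        if (*wrapped) head = 0;
        void* p = base + head;
        head += bytes;
        return p;
    }
};
```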

 

Performance

– Performance of the L2 cache operations is much better on Liverpool than on R10xx

– The L2 invalidate typically takes 300-350 cycles

  • All in-flight memory transactions must settle before the invalidate can be completed
  • A small overhead (about 75 cycles) is required to locate and invalidate the lines
  • This results in the direct cost listed above. There is also an indirect cost, in that invalidated SC data must potentially be reloaded

– The cost of an L2 writeback depends on the amount of data that must be written back to system memory

  • The Onion bus can support 10 GB/sec, which at the GPU clock rate means 12.5 bytes/cycle (0.2 lines/cycle)
  • If we attribute 160 GB/sec of the Garlic bus to the GPU, the bus can support 200 bytes/cycle (3.125 lines/cycle)

– If there is only a little SC dirty data present in the L2, the writeback is fairly fast

  • 4 KB worth of dirty Onion SC lines will take perhaps 500 cycles (Onion bottleneck, plus a small overhead to locate the lines, plus latency to system memory)
  • 20 KB worth of dirty Garlic SC lines will take about the same time

– Worst case, the L2 writeback cost is basically the Onion or Garlic cost of writing out all 512 KB (about 40,000 cycles and 3,000 cycles respectively); the arithmetic is worked below
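
A worked check of these figures, assuming an 800 MHz GPU clock and 64-byte cache lines (assumptions consistent with the bytes-per-cycle numbers quoted above):

```cpp
#include <cstdio>

int main() {
    const double clockHz   = 800e6;   // assumed GPU clock
    const double lineBytes = 64.0;    // assumed cache line size

    const double onionBps  = 10e9;    // Onion:  10 GB/sec
    const double garlicBps = 160e9;   // Garlic: 160 GB/sec attributed to the GPU

    printf("Onion:  %.1f bytes/cycle (%.2f lines/cycle)\n",
           onionBps / clockHz, onionBps / clockHz / lineBytes);   // 12.5, 0.20
    printf("Garlic: %.0f bytes/cycle (%.3f lines/cycle)\n",
           garlicBps / clockHz, garlicBps / clockHz / lineBytes); // 200, 3.125

    // Worst case: the entire 512 KB L2 is dirty SC data.
    const double l2Bytes = 512.0 * 1024.0;
    printf("512 KB writeback: ~%.0f cycles (Onion), ~%.0f cycles (Garlic)\n",
           l2Bytes / (onionBps / clockHz),      // ~41,000 cycles
           l2Bytes / (garlicBps / clockHz));    // ~2,600 cycles
    return 0;
}
```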

 

Additional Optimizations

– There are further optimizations in the L1 and L2 caches

– The L2 cache has dirty state tracking

  • If the L2 has performed no reads from SC memory since the last invalidate, it will ignore any requests to invalidate
  • If the L2 has performed no writes to SC memory since the last writeback, it will ignore any requests to perform a writeback
  • This will help performance in situations where multiple pipes are requesting invalidates and writebacks, e.g. several compute pipes separately dispatching compute shaders that use SC memory (modeled in the sketch below)
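
An illustrative model of this tracking logic (a behavioral sketch, not the actual hardware):

```cpp
// Tracks whether the L2 has touched SC memory since the last maintenance op,
// so redundant requests from other pipes degrade into cheap no-ops.
struct L2ScTracking {
    bool readScSinceInvalidate = false;
    bool wroteScSinceWriteback = false;

    void onScRead()  { readScSinceInvalidate = true; }
    void onScWrite() { wroteScSinceWriteback = true; }

    // Returns true only when the expensive operation actually runs.
    bool requestInvalidate() {
        if (!readScSinceInvalidate) return false;  // nothing stale: ignore
        readScSinceInvalidate = false;
        return true;                               // pay the ~300-350 cycles
    }
    bool requestWriteback() {
        if (!wroteScSinceWriteback) return false;  // nothing dirty: ignore
        wroteScSinceWriteback = false;
        return true;                               // cost scales with dirty data
    }
};
```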

– The L1 cache can be invalidated “once per CU”

  • A dispatch may send multiple wavefronts to a single CU
  • Using this option, the invalidate of GC/SC occurs only on the first wavefront of the dispatch


  • Joel

    Full hUMA support and hUMA-like features are not the same. Neither console officially supports hUMA; they only contain features of it. The consoles do it differently.

    • Ellie

      PS4 has hUMA according to AMD. Xbox One cannot have hUMA because of the eSRAM.

      • rasmithuk

        I think it depends on your definition of hUMA.

        The additions Sony have made do seem to give them the benefits of the hUMA system (easy sharing of the same data between CPU and GPU without requiring copies); however, if you look at AMD’s slides on hUMA, one of their ‘Key Features’ is full cache coherency between CPU and GPU.

        From what this and the last post suggest, the PS4’s caches are only coherent in a single direction (GPU to CPU), which requires the additional tagging and (possibly) manual flushes of the caches on the GPU.

        So while the end result, with additional software instructions, is very similar, it’s not the transparent system that AMD describe in their hUMA literature.

        Also, since they’re still relying on partial cache flushes to enforce coherency in some cases, and using the completely uncached access via Onion+ in others, it doesn’t look like it will have the same level of performance as a full hUMA implementation.

        Don’t get me wrong, it’s a good solution, far better than existing PC CPU/GPU interop, but it’s not quite hUMA as AMD talk about it.

        On a technical level, unless someone here wants to correct me, I can’t see any reason why the eSRAM addition would prevent the Xbox One from being hUMA enabled if it would be without it.

        If the eSRAM is mapped into the normal address space (which would make sense as both the CPU and GPU can access it) then the invalidation system that works on main memory would extend to it.

        The only difference would be that while in the case of a normal hUMA system you get uniform access speeds over the entire memory there would be a small part where the speeds are much higher.

        • Dantonir

          The original article on the Durango memory architecture states that the eSRAM is only accessible by the GPU and its co-processors (most notably the Data Move Engines). If the eSRAM is accessible by the CPU, it would only be via the GPU’s MMU; it would certainly not be addressable through the CPU’s main bus. One of AMD’s requirements for hUMA is that “CPU and GPU process can dynamically allocate memory from the entire memory space”.

          While this could be a technical reason that the Xbox is not classed as being hUMA capable, it doesn’t mean it is any less capable than it would be without the eSRAM.

          It would be very interesting to know what Microsoft’s solution to the coherency problem is. They appear to have started with largely the same components that Sony had for the PS4, so they will likely have had the same problems regarding a hUMA implementation (eSRAM notwithstanding).

          We know that GPGPU was a major focus for Sony with their APU and their modifications like Onion+, the additional tags and extra ACEs all support asynchronous compute. It would be interesting to know how much Microsoft’s engineers have been thinking about it.

          • rasmithuk

            Ah, seems like I need to go back and read the early slides again.
            If it is only accessible via the GPU bus then that would preclude that RAM from being used under hUMA.

            Hopefully the Hot Chips slides will reveal more when they get released. The videos of the presentations last year appeared on YouTube in late December, but I’m not sure if the slides were available before this.

            I’m wondering if the eSRAM is supposed to be used as a tile cache. During all of the tiled resources demos Microsoft presented they kept mentioning that the cache they were using was only 16MB, which seems rather small for almost any modern GPU. While most people have probably seen the Mars demos there’s an interesting shadow map one in the Build 2013 presentation (http://channel9.msdn.com/Events/Build/2013/4-063 starts around 22 minutes in).
            Obviously tiled resources aren’t useful everywhere but it could be a good explanation for why the eSRAM was created.

            As for Microsoft thinking about GPGPU, I think their work on C++ AMP would suggest they’ve been thinking about this for quite a while, so I’d be surprised if there wasn’t at least some focus on it in the APU’s design.

      • Kaiser X

        Xbox One has a coherent connection between the eSRAM and the CPU cache, according to the black arrow in this pic:

        http://8pic.ir/images/55977444397976624597.jpg

  • Oldgen

    In a nutshell: all that means that PS4 and Xbox 360 are the first gen of HSA. There will be a second, third, gen of that.

    • Newgen

      No, Xbox360 has nothing to do with this.

      • Joel

        obvi he meant xbox one

    • TKSKM14

      Yes. There will be upgrades to HSA… But only for PCs through next gen graphics cards and APUs from AMD. The HSA on the PS4 will remain the same throughout its life. Upgrading the HSA on a PS4 means upgrading the PS4’s entire hardware, which is not financially feasible for Sony… And the result of constant hardware upgrades on the console will make it similar to the PC gaming market. You have to spend another 400 dollars to get the PS4 version 2.0. Another 400 on the PS4 3.2 version and so on.

      By the end of next year pc gamers will probably be getting a version of hUMA that’s several years ahead of the hUMA in the PS4.