PlayStation 4 includes hUMA technology

There has been a lot of controversy about this matter in recent days, but we will try to clarify that the PlayStation 4 supports hUMA technology, or at least implements a first revision of it. We have to remember that AMD has not released products with hUMA technology yet, so it is difficult to compare against anything on the market. Besides, the final specifications have not been settled yet, so the PS4 implementation may differ a bit from finished hUMA implementations.

But first of all, what is hUMA? hUMA is the acronym for heterogeneous Uniform Memory Access. With hUMA, the two processors no longer distinguish between the CPU and GPU memory areas: both share a single virtual address space. The picture and the short sketch below may explain the concept in an easy way:


If you want to learn more about this tech, this article explains how hUMA works.

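To make the idea concrete, here is a minimal C++ sketch of the difference. Everything in the `gpu` namespace below is invented purely for illustration (it is not a real PS4, AMD, or SDK API); the stub bodies just stand in for driver and hardware work.

```cpp
// Hypothetical sketch -- the gpu:: functions are invented for illustration.
#include <cstdint>
#include <cstdlib>
#include <cstring>

namespace gpu {
// Stand-ins for a classic discrete-memory style API.
void* alloc_device(std::size_t bytes) { return std::malloc(bytes); }
void  copy_to_device(void* dst, const void* src, std::size_t n) { std::memcpy(dst, src, n); }
void  run_kernel(const void* /*data*/, std::size_t /*n*/) { /* GPU work happens here */ }
}

// Classic split-memory model: the GPU has its own address space,
// so CPU data must be copied over before the GPU can touch it.
void classic_path(const std::uint8_t* cpu_data, std::size_t n) {
    void* gpu_copy = gpu::alloc_device(n);
    gpu::copy_to_device(gpu_copy, cpu_data, n);
    gpu::run_kernel(gpu_copy, n);
    std::free(gpu_copy);
}

// hUMA-style model: the page holding cpu_data has the same virtual
// address on both processors, so the GPU consumes it with no copy.
void huma_path(const std::uint8_t* cpu_data, std::size_t n) {
    gpu::run_kernel(cpu_data, n);
}
```
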
PS4 has enhancements in the memory architecture that no other “retail” product has, as Mark Cerny has pointed out in several interviews. We will try to show the new parts of the PS4 components in the following pages.

We will use our diagram of the PS4 memory architecture to explain how it works.


Mapping of memory in Liverpool

–   Addresses are 40 bits. This size allows pages of memory mapped on both CPU and GPU to have the same virtual address

–   Pages of memory are freely set up by the application

–   Pages of memory do not need to be both mapped on CPU and GPU

  • If only the CPU will use a page, the GPU does not need to have it mapped
  • If only the GPU will use a page, it will access it via Garlic


–   If both the CPU and GPU will access the memory page, a determination needs to be made whether the GPU should access it via Onion or Garlic (see the sketch after this list)

  • If the GPU needs very high bandwidth, the page should be accessed via Garlic; the CPU will need to access it as uncached memory
  • If the CPU needs frequent access to the page, it should be mapped as cached memory on the CPU; the GPU will need to access it via Onion.

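As an illustration of the decision rules above, here is a small C++ sketch. All the type and function names are invented for this example; the real PS4 SDK interfaces are not public and may look nothing like this.

```cpp
// Hypothetical sketch of the page-mapping decision described above.
// None of these types or functions are real PS4 SDK names.
enum class Bus { None, Onion, Garlic };

struct PageMapping {
    bool cpu_mapped;   // page visible to the CPU?
    bool cpu_cached;   // if CPU-mapped, cached or uncached?
    Bus  gpu_bus;      // how the GPU reaches the page, if at all
};

PageMapping choose_mapping(bool cpu_uses, bool gpu_uses, bool gpu_needs_bandwidth) {
    PageMapping m{false, false, Bus::None};
    if (cpu_uses && !gpu_uses) {
        m.cpu_mapped = true;  m.cpu_cached = true;       // CPU-only page
    } else if (gpu_uses && !cpu_uses) {
        m.gpu_bus = Bus::Garlic;                         // GPU-only: full bandwidth
    } else if (cpu_uses && gpu_uses) {
        if (gpu_needs_bandwidth) {
            m.cpu_mapped = true;  m.cpu_cached = false;  // CPU sees it uncached
            m.gpu_bus = Bus::Garlic;
        } else {
            m.cpu_mapped = true;  m.cpu_cached = true;   // CPU caches it
            m.gpu_bus = Bus::Onion;                      // Onion snoops the CPU caches
        }
    }
    return m;
}
```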

Five Types of Buffers

–   System memory buffers that the GPU uses are tagged as one of five memory types

–   The first three types below have very limited CPU access; primary access is by the GPU

–   Read Only (RO)

  • A “RO” buffer is memory that is read by the CUs but never written by them, e.g. a texture or a vertex table
  • Access to RO buffers can never cause L1 caches to lose coherency with each other, as it is write operations that cause coherency problems.

–   Private (PV)

  • A “PV” buffer is private memory read from and written to by a single threadgroup, e.g. a scratch buffer.
  • Access to PV buffers can never cause L1 caches to lose coherency, because it is writes to shared memory areas that cause the problems


–   GPU coherent (GC)

  • A “GC” buffer is memory read from and written to by the CUs as a result of draw calls or dispatches, e.g. outputs from vertex shaders that are later read by geometry shaders. Depth buffers and render targets are not GC memory, as they are not written by the CUs but by dedicated hardware in the DBs and CBs.
  • As writes are permitted to GC buffers, access to them can cause L1 caches to lose coherency with each other


–   The last two types are accessible by both CPU and GPU

–   System coherent (SC)

  • An “SC” buffer is memory read from and written to by both CPU and GPU, e.g. structures the CPU writes and the GPU reads, or structures used for CPU-GPU communication
  • SC buffers present the largest coherency issues. Not only can L1 caches lose coherency with each other, but both L1 and L2 can lose coherency with system memory and the CPU caches.


–   Uncached (UC)

  • A “UC” buffer is memory that is read from and written to by both CPU and GPU, just as SC buffers are
  • UC buffers are never cached in the GPU L1 or L2, so they present no coherency issues
  • UC accesses use the new Onion+ bus, a limited bandwidth bus similar to the Onion bus
  • UC accesses may have significant inefficiencies due to repeated reads of the same line, or incremental updates of lines


–   The first three types (RO, PV, GC) may also be accessed by the CPU, but care must be taken. For example, when copying a texture to a new location (sketched after this list):

  • The CPU can write the texture data in an uncached fashion, then manually flush the GPU caches. The GPU can then subsequently access the texture as RO memory through Garlic at high speed
  • Two dangers are avoided here. As the CPU wrote the texture data using uncached writes, no data remains in the CPU caches and the GPU is free to use Garlic rather than Onion. As the CPU flushed the GPU caches after the texture setup, there is no possibility of stale data in the GPU L1 and L2.

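A minimal sketch of this texture-copy recipe, assuming invented helper names (`write_uncached`, `flush_gpu_caches`, and `upload_texture` are ours, not SDK calls):

```cpp
// Hypothetical sketch of the texture-copy recipe above.
#include <cstddef>
#include <cstring>

void write_uncached(void* dst, const void* src, std::size_t n) {
    // Stand-in for CPU uncached/write-combined stores (e.g. non-temporal moves).
    std::memcpy(dst, src, n);
}
void flush_gpu_caches() { /* stand-in for a manual GPU L1/L2 flush command */ }

void upload_texture(void* garlic_dst, const void* texels, std::size_t n) {
    // 1. CPU writes the texels uncached: nothing lingers in the CPU caches,
    //    so the GPU may later read through Garlic (which does not snoop).
    write_uncached(garlic_dst, texels, n);
    // 2. Manually flush the GPU caches: no stale lines for this address
    //    can remain in the GPU L1/L2.
    flush_gpu_caches();
    // 3. The GPU can now treat the buffer as RO memory via Garlic at full speed.
}
```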

Tracking of Type in Memory Accesses

–   Memory accesses are made via V# and T# definitions that contain the base address and other parameters of the buffer or texture

–   Three bits have been added to V# and T# to specify the memory type

–   An extra bit has been added to the L1 tags

  • It is set if the line was loaded from either GC or SC memory (as opposed to RO or PV memory)
  • A new type of packet-based L1 invalidate has been added that only invalidates the GC and SC lines
  • A simple strategy is for application code to use this invalidate before any draw call or dispatch that accesses GC or SC buffers


–   An extra bit has been added to the L2 tags

  • It indicates if the line was loaded from SC memory
  • A new L2 invalidate of just the SC lines has been added
  • A new L2 writeback of just the SC lines has been added. Both of these operations are packet-based.
  • A simple strategy is for application code to use the L2 invalidate before any draw call or dispatch that uses SC buffers, and use the L2 writeback after any draw call or dispatch that uses SC buffers
  • The combination of these features allows for efficient acquisition and release of buffers by draw calls and dispatches (see the sketch below)

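The following C++ sketch summarizes this acquire/release strategy. The enum values and every function name are our own invention for illustration; only the five memory types, the 3-bit tagging, and the selective invalidate/writeback operations come from the material above.

```cpp
// Hypothetical sketch; encodings and function names are invented.
#include <cstdint>

// The five memory types, as a 3-bit field like the one added to V#/T#.
enum class MemType : std::uint8_t { RO, PV, GC, SC, UC };

void l1_invalidate_gc_sc() { /* stand-in: packet-based L1 invalidate of GC+SC lines */ }
void l2_invalidate_sc()    { /* stand-in: packet-based L2 invalidate of SC lines */ }
void l2_writeback_sc()     { /* stand-in: packet-based L2 writeback of SC lines */ }

// "Acquire" before any draw call or dispatch touching GC or SC buffers...
void acquire_shared_buffers() {
    l1_invalidate_gc_sc();   // drop possibly stale GC/SC lines in L1
    l2_invalidate_sc();      // drop possibly stale SC lines in L2
}
// ...and "release" afterwards so the CPU can observe the results.
void release_shared_buffers() {
    l2_writeback_sc();       // push dirty SC lines back to system memory
}
```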

Simple Example:

–   Let’s take the case where most of the GPU is being used for graphics (vertex shaders, pixel shaders and so on)

–   Additionally, let’s say that we have an asynchronous compute dispatch that uses a buffer of SC memory for:

  • Dispatch inputs, which are created by the CPU and read by the GPU
  • Dispatch outputs, which are created by the GPU and read by the CPU


–   The GPU can:

1)      Acquire the SC buffer by performing an L1 invalidate (GC and SC) and an L2 invalidate (SC lines only). This eliminates the possibility of stale data in the caches. Any SC address encountered will properly go off-chip (to either system memory or the CPU caches) to fetch the data.

2)      Run the compute shader

3)      Release the SC buffer by performing an L2 writeback (SC lines only). This writes all dirty bytes back to system memory where the CPU can see them (the sketch after this example puts the three steps together)

–   The graphics processing is much less impacted by this strategy

  • On the R10xx, the complete L2 was flushed, so any data in use by the graphics shaders (e.g. the current textures) would need to be reloaded
  • On Liverpool, that RO data stays in place – as does PV and GC data

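Putting the three steps together, a sketch of this flow (reusing the invented `acquire_shared_buffers`/`release_shared_buffers` helpers from the earlier sketch; `ScBuffer` and `dispatch_compute` are likewise hypothetical):

```cpp
// Hypothetical sketch of the asynchronous-compute flow above.
struct ScBuffer { void* inputs; void* outputs; };

void acquire_shared_buffers();   // defined in the previous sketch
void release_shared_buffers();   // defined in the previous sketch
void dispatch_compute(ScBuffer& /*buf*/) { /* stand-in for the compute dispatch */ }

void run_async_compute(ScBuffer& buf) {
    // 1) L1 invalidate (GC+SC) + L2 invalidate (SC): any SC address
    //    encountered now fetches fresh data from off-chip.
    acquire_shared_buffers();
    // 2) Run the compute shader.
    dispatch_compute(buf);
    // 3) L2 writeback (SC only): dirty results become visible to the CPU,
    //    while the RO/PV/GC lines used by graphics work stay in place.
    release_shared_buffers();
}
```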

This technical information can be a bit overwhelming and confusing, so we will disclose more information and usage examples of this architecture in a new article this week.

  • Mark C.

    Reading in the Tub!

    • #yoloswaggins

      Hi gaf

  • MattS71

    Awesome, rock on PS4

  • Joel

    Titanfall and Dead Rising 3 say hello

    • Jroo

      Hello, games I don’t care about. How you doin’?

      • Joel

        this we all know is a lie.

        • anon

          Don’t give a rat’s ass about DR3.

          Have PC for Titanfall.


          • frackiniscool

            Yeah, DR2 was garbage as hell, so a better-graphics version of garbage is still garbage. And I got a sick PC for Titanfall, sooo yup

    • Menda

      And we say goodbye, they stink because they have no tub.

    • รєภรђเͽ

      Now GTFO

  • rasmithuk

    It seems that at yesterday’s Hot Chips conference, during a presentation on the Microsoft Xbox One, they stated that it has a similar system to hUMA.

    • nitro_feen

      “Oh yeah, we have the same thing, but different.” Sure you do, MS. If they said the Xbox One had gold in it, people would believe them.

      • Kreten

        Shut da hell up, fanboy. Who da hell do you believe? VGLeaks guesses and gets some info from Microsoft, so you’re saying you believe VGLeaks but not the source? Stupid ass

        • Maynard_VGL

          No need for that language, Kreten.

        • Laro

          STFU you fanboi

      • Dantonir

        Didn’t Cerny describe adding the Onion+ bus and the volatile flags for simultaneous graphics and compute as part of the modifications made to the PS4?


        If that is true then those features might not necessarily be present on the XBox APU. We know it has the Garlic and Onion buses present in Llano and Trinity but AMD did not consider that enough to be hUMA (most likely lacking bidirectional coherence between the CPU and GPU caches).

        • rasmithuk

          From what’s been leaked from the presentation so far (since the conference’s proceedings are currently only available to people that attended), it would seem Microsoft did their own modifications to the memory system, so it’s possible that it allows similar features to hUMA without AMD wanting to call it that.

          One of the features of the custom memory controller is listed as ‘Memory coherency between CPU cores and GPU’, so that feature is there.

          • Dantonir

            I’m just trying to square the current leaks and remarks.

            From the Hot Chips slides it appears that the XBox APU is uni-directional cache coherent. The GPU can snoop the CPU cache but the CPU access to the GPU memory is limited to page table synchronisation. This means that the GPU caches need to be flushed to maintain coherence, causing a GPU stall.

            The PS4 appears to have the same basic structure but the addition of the Onion+ bus bypasses the GPU caches for some accesses, thus avoiding coherency issues, and the tags allow the GPU caches to be flushed without stalling any parallel graphics operations.

            It’s still not perfectly transparent hUMA, not sure what to expect from Kaveri, but it is usable for parallel compute and graphics operations.

          • rasmithuk

            Just checking that I’m understanding this correctly. Access via Onion always enforces cache coherence. Access via Onion+, in the case of GPU writes, bypasses the cache, but the entries in the CPU caches are tagged as modified so any future access from the CPU is forced to re-read.
            If that’s correct, I can see how that would be a benefit to compute on the GPU.

          • Dantonir

            My understanding from what I’ve read is that access via Onion is always CPU cache coherent but is not automatically GPU cache coherent and requires the L1 and L2 caches to be flushed. Otherwise they could contain stale data that does not reflect changes made by the CPU.

            The tags help by marking the kind of allocation for each cache line, so the GPU can selectively flush system coherent cache lines (which could be stale) without impacting GPU-only cache lines (which cannot be). The flush is still required (I don’t know if it’s automatic or must be done manually), but it doesn’t stall the GPU completely, and rendering continues.

            Onion+ works by bypassing the L1 and L2 caches in the GPU completely. It is less efficient because the GPU cannot take advantage of cached values when reading and writing but it does not need to enforce coherency.

            Without these two features any time the GPU needs to read or write a cache line from a location that is shared it will require a complete flush of the GPU caches, making simultaneous rendering and asynchronous compute tasks much less efficient.

          • rasmithuk

            Right. So the short version is that Onion+ and the new tagging options are there to provide basically the same uniform memory access as a full hUMA system would give, without requiring full bi-directional cache coherency between the CPU/GPU.
            I’m guessing that the majority of the benefits come from the tagging, since the biggest benefit of hUMA seems to be the ability to do large amounts of shared data modifications on the GPU. The ability to partially flush only the shared memory changes while leaving all the GPU-private cache lines intact seems the biggest performance benefit here.
            Since uncached writing via Onion+ misses the GPU L1/L2 caches, it’s probably not the way you want to do the majority of the changes, but I’m guessing it works really well for synchronization between the CPU and GPU as well as for short message passing.

            I don’t think I’d describe the PS4’s current system, as far as we know it, to be hUMA, but it’s close enough to give them the majority of the benefits; it’ll just require a bit of software intervention to get similar results.

            It’s going to be interesting to see what the described memory coherency on the Xbox One’s APU actually consists of but I’m guessing the full slides won’t be out for a couple of months.

            In an odd way, I think the most exciting part of the console launch is going to be the teardowns of the systems, both in terms of assembly and the CPU design. I’ll keep my fingers crossed that the guys at Chipworks take some nice pictures of the dies.

            Thanks for the sensible discussion; it makes a nice change to have one on the topic of consoles without it turning into some name-calling exercise.

    • john mitas

      XB1 has MMUs for the CPU and GPU that do lots of the heavy lifting, as well as the eSRAM. That is the main difference from the PS4, which has neither (MMUs or eSRAM).

  • Elmer

    Does the Xbone support this too? Also, can VGLeaks do a proper rundown on what was shown yesterday at Hot Chips? The CPU seems to be much better than the PS4’s: the PS4 has a <20 GB/s bus, while the Xbone’s is 30 GB/s coherent.

    • vgleakscom

      Both CPUs are the same (model and configuration). There are no differences there. Between 20 and 30 GB/s (theoretical) there is not much difference. You can check our Xbox One memory example to see “real” numbers.

      • nitro_feen

        No, they are not the same. No one knows that for sure. They are based off the same GPU, but both are customized, and not in the same ways. Also, they are running at different frequencies.

      • Kreten

        Peak and theoretical mean different things. PS4 peak is 20 GB/s (maximum); X1 maximum is 30 GB/s. This is due to them actually doing stuff to the bandwidth vs. Sony using off-the-shelf stuff. Also, 176 GB/s GDDR5 is a maximum too, as is 68 GB/s DDR3 + 204 GB/s eSRAM.

  • Elmer

    VGLeaks, please also explain the massive confusion over how Microsoft is combining two types of RAM at once, for example: eSRAM theoretical maximum 204 GB/s + DDR3 68 GB/s = 272 GB/s. Combining two types of RAM is pure PR bull, right? Why is Microsoft advertising that?

  • john mitas

    What, you have to do all this manually? No MMU to do the heavy lifting… This sux! Manual memory coherency… haha

  • Tragge

    The problem with hUMA on the PS4 is that it’s a prototype. It will be obsolete compared to the official hUMA available for AMD’s graphics cards. It’s going to have issues and problems that have yet to be tweaked or perfected.

    The PS4 is going to suffer from these rough edges two years into its life cycle, when the devs that develop for the PC first and then scale down to consoles find out that hUMA on AMD’s latest graphics cards works better than on the consoles. Then they will have to tone down graphical quality (reduce lighting quality, model polygons) to compensate for the console hUMA’s weaknesses.

    I don’t see next-gen console tech. I see experimental technology that will be obsolete at the launch of the next-gen consoles.
