World Exclusive: Xbox One (Durango) GPU detailed


 

General Purpose Registers

Each SIMD contains 256 vector general purpose registers (VGPRs) and 512 scalar general purpose registers (SGPRs). Both types of GPR hold 32-bit data: an SGPR contains a single 32-bit value shared across all threads of a vector, while a VGPR represents an array of 32-bit values, one per thread. Each thread can see only its own entry within a VGPR.

GPRs record intermediate results between instructions of the shader. To each newly created vector, the GPU assigns a range of VGPRs and a range of SGPRs—as many as needed by the shader up to a limit of 256 VGPRs and 104 SGPRs. Some GPRs are consumed implicitly by the system—for instance, to hold literal constants, index inputs, barycentric coordinates, or metadata for debugging.

The number of available GPRs can limit the ability of a SIMD to hide latency by switching to other vectors. If all the GPRs of a SIMD are already assigned, no new vector can begin executing; and if, on top of that, all active vectors stall, the SIMD goes idle until one of the stalls ends.
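
For a rough sense of how this plays out, here is a minimal sketch of how many vectors a SIMD can keep in flight for a given shader, using the per-SIMD pools of 256 VGPRs and 512 SGPRs described above. The cap of 10 resident vectors per SIMD is an assumption typical of GCN-class hardware, not a figure stated in this article.

```python
# Minimal sketch: how many vectors (wavefronts) one SIMD can keep resident,
# given the per-SIMD register pools described above. The cap of 10 resident
# vectors per SIMD is an assumption (typical of GCN-class hardware), not a
# figure from this article.

SIMD_VGPRS  = 256   # vector GPRs per SIMD
SIMD_SGPRS  = 512   # scalar GPRs per SIMD
MAX_VECTORS = 10    # assumed hardware cap on resident vectors per SIMD

def resident_vectors(vgprs_per_vector: int, sgprs_per_vector: int) -> int:
    """How many vectors fit on one SIMD for a shader with the given GPR usage."""
    by_vgpr = SIMD_VGPRS // max(vgprs_per_vector, 1)
    by_sgpr = SIMD_SGPRS // max(sgprs_per_vector, 1)
    return min(by_vgpr, by_sgpr, MAX_VECTORS)

# A shader using 24 VGPRs and 16 SGPRs leaves plenty of vectors to hide latency...
print(resident_vectors(24, 16))   # -> 10
# ...while one using 128 VGPRs allows only 2 vectors in flight, so if both of
# them stall, the SIMD goes idle.
print(resident_vectors(128, 16))  # -> 2
```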

Like most modern GPUs, the Durango GPU uses a unified shader architecture (USA), which means that the same SCs are used interchangeably for all stages of the shader pipeline: vertex, hull, domain, geometry, pixel, and compute. On Durango, GPR usage is also unified; there is no longer any fixed allocation of GPRs to vertex or pixel shading as on Xbox 360.

Constants

The Durango GPU has no dedicated registers for shader constants. When a shader references a constant buffer, the compiler decides how those accesses are implemented: it can have constants preloaded into GPRs, fetch them from memory using scalar instructions, or cache them in the LSM.

A shader constant may be either global (constant over the whole draw call) or indexed (immutable, but varying by thread). Indexed constants must be fetched using vector instructions, and are correspondingly more expensive than global constants. This cost is somewhat analogous to the constant waterfalling penalty from Xbox 360, although the mechanism is different.
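
As a toy illustration of the cost difference, the sketch below classifies a single constant read for a vector of threads: if every thread reads the same slot, one scalar fetch (or an SGPR preload) suffices; if the slot varies per thread, the value has to arrive through a vector fetch. The 64-thread vector width and the relative costs are assumptions for illustration only.

```python
# Toy model (illustration only): whether one constant read can use the cheap
# scalar path or must use a vector fetch. The 64-thread vector width and the
# relative costs are assumptions, not figures from this article.

VECTOR_WIDTH      = 64   # assumed threads per vector
SCALAR_FETCH_COST = 1    # assumed relative cost of a scalar fetch
VECTOR_FETCH_COST = 4    # assumed relative cost of a vector fetch

def constant_read_cost(slot_indices):
    """Classify one constant-buffer read by the per-thread slot indices."""
    if len(set(slot_indices)) == 1:
        # Global constant: every thread reads the same slot, so one scalar
        # fetch can be shared (or the value preloaded into an SGPR).
        return ("scalar", SCALAR_FETCH_COST)
    # Indexed constant: threads read different slots, so the value must land
    # in a VGPR via a vector fetch, the more expensive path.
    return ("vector", VECTOR_FETCH_COST)

print(constant_read_cost([7] * VECTOR_WIDTH))          # ('scalar', 1)
print(constant_read_cost(list(range(VECTOR_WIDTH))))   # ('vector', 4)
```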

Branches

Branch instructions are executed by the scheduler and have the same ideal cost as computation instructions. Just as they do on CPUs, however, branches may incur pipeline stalls while awaiting the result of the instruction which determines the branch direction. Not-taken branches introduce subsequent pipeline bubbles. Taken branches require a read from the instruction cache, which incurs an additional delay. All these potential costs are moot as long as there are enough active vectors to hide the stalls.

Branching is inherently problematic on a SIMD architecture where many threads execute in lockstep, and agreement about the branch direction is not guaranteed. The HLSL compiler can implement branch logic in one of several ways:

  • Predication – Both paths are executed; calculations that should not happen for a particular thread are masked out (sketched in code after this list).
  • Predicated jump – If all threads decide the branch in the same way, only the correct path is executed; otherwise, both paths are executed.
  • Skip – Both paths are followed, but instructions that are executed by no threads are skipped over at a faster rate.
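
A minimal sketch of the predication case, using an 8-thread vector and arbitrary arithmetic purely for illustration: both sides of the branch are evaluated for every thread, and a per-thread mask decides which result each thread keeps.

```python
# Minimal sketch of predication on a SIMD vector: both branch paths run for
# every thread, and a per-thread mask selects which result is kept. The
# 8-thread vector and the arithmetic are illustrative only.

threads   = [0, 1, 2, 3, 4, 5, 6, 7]        # per-thread input values
condition = [x % 2 == 0 for x in threads]   # per-thread branch decision

# Both the "if" path and the "else" path are computed for every thread...
if_result   = [x * 10 for x in threads]
else_result = [x + 100 for x in threads]

# ...and the mask picks the value each thread actually writes back.
result = [a if c else b for c, a, b in zip(condition, if_result, else_result)]

print(result)  # [0, 101, 20, 103, 40, 105, 60, 107]
```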

 

Interpolation

The Durango GPU has no fixed function interpolation units. Instead, a dedicated GPU component routes vertex shader output data to the LSM of whichever SC (or SCs) ends up running the pixel shader. This routing mechanism allows pixels to be shaded by a different SC than the one that shaded the associated vertices.

Before pixel shader startup, the GPU automatically populates two registers with interpolation metadata:

  • One SGPR is a bitfield that contains:
    • a pointer to the area of the LSM where vertex shader output was stored
    • a description of which pixels in the current vector came from which vertices
  • Two VGPRs contain barycentric coordinates for each pixel (the third barycentric coordinate is implicit)

 

It is the responsibility of the shader compiler to generate microcode prologues that perform the actual interpolation calculations. The SCs have special purpose multiply-add instructions that read some of their inputs directly from the LSM. A single float interpolation across a triangle can be accomplished by using two of these instructions.
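
A minimal sketch of what those two multiply-add instructions compute for one float attribute. It assumes the vertex shader output is held as a base value plus two edge deltas and uses the two barycentric coordinates delivered in VGPRs (the third coordinate is implicit as 1 - i - j); the exact LSM layout is an assumption here.

```python
# Minimal sketch: interpolating one float attribute with two multiply-adds.
# Storing the attribute as a base value plus two edge deltas is an assumption
# about the LSM layout, for illustration only.

def interpolate(p0: float, p1: float, p2: float, i: float, j: float) -> float:
    """Interpolate an attribute defined at the three vertices of a triangle."""
    p10 = p1 - p0          # delta along one edge (would be read from the LSM)
    p20 = p2 - p0          # delta along the other edge (also from the LSM)
    tmp = p10 * i + p0     # first multiply-add, barycentric i from a VGPR
    return p20 * j + tmp   # second multiply-add, barycentric j from a VGPR

# The third barycentric coordinate never appears explicitly: it is 1 - i - j.
print(interpolate(1.0, 3.0, 5.0, 0.25, 0.5))  # 1 + 0.25*2 + 0.5*4 = 3.5
```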

This approach to interpolation has the advantage that there is no cost for unused interpolants; the instructions can simply be omitted or branched over. Conversely, there is no benefit from packing interpolants into float4s. Nevertheless, for short shaders, interpolation can still significantly impact the overall computation load.

Output

Pixel shading output goes through the DB and CB before being written to the depth/stencil and color render targets. Logically, these buffers represent screenspace arrays, with one value per sample. Physically, implementation of these buffers is much more complex, and involves a number of optimizations in hardware.

Both depth and color are stored in compressed formats. The purpose of compression is to save bandwidth, not memory, and, in fact, compressed render targets actually require slightly more memory than their uncompressed analogues. Compressed render targets provide for certain types of fast-path rendering. A clear operation, for example, is much faster in the presence of compression, because the GPU does not need to explicitly write the clear value to every sample. Similarly, for relatively large triangles, MSAA rendering to a compressed color buffer can run at nearly the same rate as non-MSAA rendering.
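
The sketch below is a purely conceptual model (not the actual hardware format) of why a clear is so cheap on a compressed target: per-tile metadata can mark a whole tile as equal to the clear color, so no per-sample writes are needed until the tile is actually rendered to.

```python
# Conceptual model only -- not the real hardware metadata format. It shows why
# a clear is cheap for a compressed render target: per-tile state can say
# "this whole tile equals the clear color" without touching every sample.

TILE_SAMPLES = 64  # assumed samples per tile, for illustration

class CompressedTarget:
    def __init__(self, tiles: int):
        self.clear_color  = 0
        self.tile_cleared = [False] * tiles                      # per-tile metadata
        self.samples      = [[0] * TILE_SAMPLES for _ in range(tiles)]

    def fast_clear(self, color: int) -> None:
        # Touches only the metadata: one flag per tile instead of every sample.
        self.clear_color  = color
        self.tile_cleared = [True] * len(self.tile_cleared)

    def read(self, tile: int, sample: int) -> int:
        if self.tile_cleared[tile]:
            return self.clear_color          # served from metadata alone
        return self.samples[tile][sample]

    def write(self, tile: int, sample: int, color: int) -> None:
        if self.tile_cleared[tile]:
            # The first real write "decompresses" the tile to per-sample data.
            self.samples[tile] = [self.clear_color] * TILE_SAMPLES
            self.tile_cleared[tile] = False
        self.samples[tile][sample] = color

rt = CompressedTarget(tiles=4)
rt.fast_clear(0x202020)
print(hex(rt.read(2, 17)))   # 0x202020, no per-sample write ever happened
rt.write(2, 17, 0xFF0000)
print(hex(rt.read(2, 17)))   # 0xff0000, and tile 2 is now uncompressed
```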

For performance reasons, it is important to keep depth and color data compressed as much as possible. Some examples of operations which can destroy compression are:

  • Rendering highly tessellated geometry
  • Heavy use of alpha-to-mask (sometimes called alpha-to-coverage)
  • Writing to depth or stencil from a pixel shader
  • Running the pixel shader per-sample (using the SV_SampleIndex semantic)
  • Sourcing the depth or color buffer as a texture in-place and then resuming use as a render target

 

Both the DB and the CB have substantial caches on die, and all depth and color operations are performed locally in the caches. Access to these caches is faster than access to ESRAM. For this reason, the peak GPU pixel rate can be larger than what raw memory throughput would indicate. The caches are not large enough, however, to fit entire render targets. Therefore, rendering that is localized to a particular area of the screen is more efficient than scattered rendering.

Fill

The GPU contains four physical instances of both the CB and the DB. Each is capable of handling one quad per clock cycle for a total throughput of 16 pixels per clock cycle, or 12.8 Gpixel/sec. The CB is optimized for 64-bit-per-pixel types, so there is no local performance advantage in using smaller color formats, although there may still be a substantial bandwidth savings.

Because alpha-blending requires both a read and a write, it potentially consumes twice the bandwidth of opaque rendering, and for some color formats, it also runs at half rate computationally. Likewise, because depth testing involves a read from the depth buffer, and depth update involves a write to the depth buffer, enabling either state can reduce overall performance.
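
The arithmetic behind these figures is worked out below. The 800 MHz GPU clock is an assumption implied by the numbers quoted above (16 pixels per clock reaching 12.8 Gpixel/sec), not a value stated in this section.

```python
# Back-of-envelope arithmetic for the fill-rate and blending claims above.
# The 800 MHz GPU clock is an assumption implied by the quoted figures.

GPU_CLOCK_HZ    = 800e6   # assumed GPU clock
CB_DB_COPIES    = 4       # physical instances of the CB and the DB
PIXELS_PER_QUAD = 4       # a quad is a 2x2 block of pixels

pixels_per_clock = CB_DB_COPIES * PIXELS_PER_QUAD
print(pixels_per_clock)                        # 16 pixels per clock
print(pixels_per_clock * GPU_CLOCK_HZ / 1e9)   # 12.8 Gpixel/sec

# Bandwidth at a 64-bit color format: opaque rendering writes each pixel once,
# while alpha blending must also read the destination, doubling the traffic.
BYTES_PER_PIXEL = 8
opaque_gbps = pixels_per_clock * GPU_CLOCK_HZ * BYTES_PER_PIXEL / 1e9
blend_gbps  = 2 * opaque_gbps
print(opaque_gbps, blend_gbps)                 # 102.4 GB/s vs 204.8 GB/s
```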

Depth and Stencil

The depth block occurs near the end of the logical rendering pipeline, after the pixel shader. In the GPU implementation, however, the DB and the CB can interact with rendering both before and after pixel shading, and the pipeline supports several types of optimized early decision pathways. Durango implements both hierarchical Z (Hi-Z) and early Z (and the same for stencil). Using careful driver and hardware logic, certain depth and color operations can be moved before the pixel shader, and in some cases, part or all of the cost of shading and rasterization can be avoided.

Depth and stencil are stored and handled separately by the hardware, even though syntactically they are treated as a unit. A read of depth/stencil is really two distinct operations, as is a write to depth/stencil. The driver implements the mixed format DXGI_FORMAT_D24_UNORM_S8_UINT by using two separate allocations: a 32-bit depth surface (with 8 bits of padding per sample) and an 8-bit stencil surface.
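
A quick footprint calculation for that split allocation, assuming a 1920x1080 target with one sample per pixel and ignoring any surface alignment beyond the 8 spare bits per depth sample:

```python
# Footprint of the split D24S8 allocation described above, for an assumed
# 1920x1080 target with one sample per pixel (surface alignment is ignored).

WIDTH, HEIGHT, SAMPLES = 1920, 1080, 1
pixels = WIDTH * HEIGHT * SAMPLES

depth_bytes   = pixels * 4   # 32 bits/sample: 24 bits of depth + 8 bits padding
stencil_bytes = pixels * 1   # 8 bits/sample in its own allocation

print(depth_bytes / 2**20)                    # ~7.91 MiB
print(stencil_bytes / 2**20)                  # ~1.98 MiB
print((depth_bytes + stencil_bytes) / 2**20)  # ~9.89 MiB total, versus 7.91 MiB
                                              # for a packed 32-bit D24S8 surface
```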

Antialiasing

The Durango GPU supports 2x, 4x, and 8x MSAA levels. It also implements a modified type of MSAA known as compressed AA. Compressed AA decouples two notions of sample:

  • Coverage sample – One of several screenspace positions generated by rasterization of one pixel
  • Surface sample – One of several entries representing a single pixel in a color or depth/stencil surface

 

Traditionally, coverage samples and surface samples match up one to one. In standard 4xMSAA, for example, a triangle may cover from zero to four samples of any given pixel, and a depth and a color are recorded for each covered sample.

Under compressed AA, there can be more coverage samples than surface samples. In other words, a triangle may still cover several screenspace locations per pixel, but the GPU does not allocate enough render target space to store a unique depth and color for each location. Hardware logic determines how to combine data from multiple coverage samples. In areas of the screen with extensive subpixel detail, this data reduction process is lossy, but the errors are generally unobjectionable. Compressed AA combines most of the quality benefits of high MSAA levels with the relaxed space requirements of lower MSAA levels.
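
The sketch below is a conceptual model of that coverage-sample/surface-sample split; the sample counts (8 coverage, 4 surface), the mapping metadata, and the weighted resolve are assumptions made for illustration, not the hardware's actual behavior.

```python
# Conceptual sketch of compressed AA for a single pixel: 8 coverage samples but
# only 4 stored surface samples. The counts, the mapping metadata, and the
# resolve weighting are illustrative assumptions, not the hardware's behavior.

COVERAGE_SAMPLES = 8
SURFACE_SAMPLES  = 4

# Stored surface samples for one pixel: the background plus one triangle color.
surface_colors = [0.0, 1.0, 0.0, 0.0]

# Per-pixel metadata: which surface sample each coverage sample refers to.
# A triangle covered 5 of the 8 coverage locations (they map to slot 1);
# the remaining 3 still point at the background in slot 0.
coverage_map = [1, 1, 1, 1, 1, 0, 0, 0]

def resolve(colors, cmap):
    """Average stored colors, weighted by how many coverage samples map to each."""
    weights = [cmap.count(slot) for slot in range(SURFACE_SAMPLES)]
    return sum(c * w for c, w in zip(colors, weights)) / len(cmap)

print(resolve(surface_colors, coverage_map))  # 0.625: the triangle covers 5/8 of the pixel
```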

  • Anonymus

    LOOOOOOOOOOL

  • sony_f^^k_sega

    rrodbox 1.2 confirmed

  • Damon Tarklin

    Where did they get this 1.2 tflops from? My understanding was that the GPU had to be a 6870 OR 6950 (because of the DVI ports in the back) and not the Radeon 6670 as first reported.

  • wint3rmute

    How can it trump it when it isn’t out yet? Rumors pointed to ps4 having a 4gb/2gb configuration like a pc for memory and they came out and surprised us. MS might do the same.

    The theories that should get thrown under the bus are the ones based on rumors.

  • wint3rmute

    Its really a shame we can’t have a discussion about the tech instead of all you fucktards blathering nonsense about console wars.

  • Nicholas Gatewood

    The Wii U is an utter failure on a technical level. If we’re just looking at specs here, the Wii U isn’t much more impressive than the 360 – the next Xbox and PS4 absolutely demolish its performance. I really wish Nintendo fans would understand how horribly Nintendo’s been performing lately and ask them to improve, it kills me to see what they’ve become.

  • Nicholas Gatewood

    Um, no. Not even remotely accurate. The Wii U isn’t any more technically impressive than the 360, and the next-gen Xbox should blow it out of the water in every single regard. There’s just no contest, the Wii U only would’ve been impressive in 2007.

  • Nicholas Gatewood

    You have no idea what you’re talking about at all. In real-world conditions the PS4’s specs blow the next Xbox’s out of the water, no contest. In this case the PS4 is even the superior platform in regards to its hardware simplicity. If the next Xbox DID have a slightly better GPU than the PS4, the PS4 would still compare favorably against it because of its superior RAM solution.

    The next Xbox will likely launch at $350 while the PS4 launches at $400, but the PS4 ends up being around twice as powerful and a much better platform for developers. Just stating the facts here, to go against all the people who seem to lack any knowledge regarding hardware or game optimization.

  • Nicholas Gatewood

    Iwata and Fils-Aime are liars, don’t just mindlessly believe what they have to say about hardware. There’s a massive, MASSIVE difference in hardware capability between the true next-gen systems and the Wii U, claiming otherwise because someone who would profit from lying said so is silly.

  • Vince and always Vince

    Each shared core is 4 normal CUs. The diagram is correct; your interpretation is wrong. Look up SMX (NVIDIA) or CU array (AMD) on Wikipedia.

  • Vince and always Vince

    That means something like 3072 ALUs. Do the math.

  • Joseph

    Guys, here are the PS4 and Xbox 720 comparisons I know so far based on what I know

    CPU:
    – Both are 8-core
    – Xbox is x64, PS4 is x86
    – Xbox has 8GB of DRAM/ESRAM, PS4 has 8GB of GDDR5 RAM (GDDR5 is faster)
    – Xbox goes at 170 GB/s (combined from DRAM/ESRAM), PS4 goes at 176 GB/s

    GPU:
    – Xbox is 1.23 TFLOPS, PS4 is 1.84 TFLOPS
    – Xbox has 12 or 14 compute units, PS4 has 18 compute units
    – Xbox has 12 shader cores, PS4 has about 14(estimated)

    So overall, PS4 is winning, but Xbox could prevail in software and graphics again. And besides, all these specs are rumors for the Xbox. We all thought PS4 would be 4GB, but it turned out to be 8GB, so the Xbox might change as well.

  • zybraisacock

    Production will have already started.

  • oh man.. ur theory is better than post. 😀

  • Fernando Almeida

    Are you some kind of prophet?

    • SlutMagnet

      Holy shit.