World Exclusive: Xbox One (Durango) GPU detailed


 

Virtual Addressing

All GPU memory accesses on Xbox One-Durango use virtual addresses, and therefore pass through a translation table before being resolved to physical addresses. This layer of indirection solves the problem of resource memory fragmentation in hardware—a single resource can now occupy several noncontiguous pages of physical memory without penalty.

Virtual addresses can target pages in main RAM or ESRAM, or can be unmapped. Shader reads and writes to unmapped pages return well-defined results, including optional error codes, rather than crashing the GPU. This facility is important for support of tiled resources, which are only partially resident in physical memory.
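The translation step can be pictured with a small toy model. This is only a sketch: the page size, the pool names, and the way an error is reported (`ok=False`) are assumptions made for illustration and do not come from the Durango documentation.

```python
# Toy model of GPU virtual-to-physical translation with unmapped pages.
# PAGE_SIZE, the pool names, and the error convention are assumptions
# made for illustration; they are not taken from the Durango documentation.
PAGE_SIZE = 64 * 1024  # assumed page size in bytes

MAIN_RAM = "main_ram"
ESRAM = "esram"

# Virtual page number -> (pool, physical page number). Physical pages may be
# noncontiguous and may live in different pools; missing keys are unmapped.
page_table = {
    0: (MAIN_RAM, 7),
    1: (ESRAM, 2),
    # virtual page 2 is intentionally left unmapped
    3: (MAIN_RAM, 1),
}

# Sparse physical memory: (pool, physical byte address) -> byte value.
physical = {}


def translate(vaddr):
    """Return (pool, physical address), or None if the page is unmapped."""
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    entry = page_table.get(vpn)
    if entry is None:
        return None
    pool, ppn = entry
    return pool, ppn * PAGE_SIZE + offset


def shader_read(vaddr):
    """Return (value, ok). Unmapped reads yield 0 and ok=False, not a fault."""
    target = translate(vaddr)
    if target is None:
        return 0, False
    return physical.get(target, 0), True


def shader_write(vaddr, value):
    """Write one byte. Writes to unmapped pages are dropped and return False."""
    target = translate(vaddr)
    if target is None:
        return False
    physical[target] = value
    return True
```

In this model a resource spanning virtual pages 0 to 3 is backed by scattered physical pages in two pools, and touching the unmapped page 2 produces a defined result instead of an error that halts execution.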

ESRAM

Xbox One-Durango has no video memory (VRAM) in the traditional sense, but the GPU does contain 32 MB of fast embedded SRAM (ESRAM). ESRAM on Xbox One-Durango is free from many of the restrictions that affect EDRAM on Xbox 360. Durango supports the following scenarios:

  • Texturing from ESRAM
  • Rendering to surfaces in main RAM
  • Read back from render targets without performing a resolve (in certain cases)

 

The difference in throughput between ESRAM and main RAM is moderate: 102.4 GB/sec versus 68 GB/sec. The advantages of ESRAM are lower latency and lack of contention from other memory clients—for instance the CPU, I/O, and display output. Low latency is particularly important for sustaining peak performance of the color blocks (CBs) and depth blocks (DBs).
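To put the quoted figures in perspective, the following sketch works through the arithmetic. The 1920x1080, 4-bytes-per-pixel surface is an assumed example rather than something stated in the documentation, 1 GB is taken as 10^9 bytes, and the transfer times are ideal peak-rate values that ignore latency and contention.

```python
# Figures quoted in the text.
ESRAM_BYTES = 32 * 1024 * 1024
ESRAM_GB_PER_SEC = 102.4
MAIN_RAM_GB_PER_SEC = 68.0

# Assumed example surface: 1920x1080 with 4 bytes per pixel.
WIDTH, HEIGHT, BYTES_PER_PIXEL = 1920, 1080, 4

surface_bytes = WIDTH * HEIGHT * BYTES_PER_PIXEL
surfaces_in_esram = ESRAM_BYTES // surface_bytes
throughput_ratio = ESRAM_GB_PER_SEC / MAIN_RAM_GB_PER_SEC


def transfer_time_us(num_bytes, gb_per_sec):
    """Ideal time in microseconds to move num_bytes at peak throughput."""
    return num_bytes / (gb_per_sec * 1e9) * 1e6


print(f"surface size: {surface_bytes} bytes")
print(f"whole surfaces that fit in ESRAM: {surfaces_in_esram}")
print(f"ESRAM / main RAM throughput ratio: {throughput_ratio:.2f}")
print(f"ESRAM transfer: {transfer_time_us(surface_bytes, ESRAM_GB_PER_SEC):.1f} us")
print(f"main RAM transfer: {transfer_time_us(surface_bytes, MAIN_RAM_GB_PER_SEC):.1f} us")
```

The throughput ratio comes out at roughly 1.5, which is why the text calls the difference moderate and points to latency and contention as the more important advantages.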

Local Shared Memory and Global Shared Memory

Each shader core (SC) of the Durango GPU contains a 64-KB buffer of local shared memory (LSM). The LSM supplies scratch space for compute shader threadgroups. The LSM is also used implicitly for various purposes. The shader compiler can choose to allocate temporary arrays there, spill data from registers, or cache data that arrives from external memory. The LSM facilitates passing data from one pipeline stage to another (interpolants, patch control points, tessellation factors, stream out, etc.). In some cases, this usage implies that successive pipeline stages are restricted to run on the same SC.

The GPU also contains a single 64-KB buffer of global shared memory (GSM). The GSM contains temporary data referenced by an entire draw call. It is also used implicitly to enforce synchronization barriers, and to properly order accesses to Direct3D 11 append and consume buffers. The GSM is capable of acting as a destination for shader export, so the driver can choose to locate small render targets there for efficiency.

Cache

Durango has a two-stage caching system, depicted below.

[Figure: Durango cache hierarchy]

L2 Cache

The GPU contains four separate 8-way L2 caches of 128 KB, each composed of 2048 64-byte cache lines. Each L2 cache owns a certain subset of address space. Texture tiling patterns are chosen to ensure all four caches are equally utilized. The L2 generally acts as a write-back cache—when the GPU modifies data in a cache line, the modifications are not written back to main memory until the cache line is evicted. The L2 cache mediates virtually all memory access across the entire chip, and supplies a variety of types of data, including shader code, constants, textures, vertices, etc., coming either from main RAM or from ESRAM. Shader atomic operations are implemented in the L2 cache.

L1 Cache

Each shader core has a local 64-way L1 cache of 16 KB, composed of 256 64-byte cache lines. The L1 generally acts as a write-through cache—when the SC modifies data in the cache, the modifications are pushed back to L2 without waiting until the cache line is evicted. The L1 cache is used exclusively for data read and written by shaders and is dedicated to coalescing memory requests over the lifetime of a single vector. Even this limited sort of caching is important, since memory accesses tend to be very spatially coherent, both within one thread and across neighboring threads.
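The cache sizes quoted for the L2 and L1 follow directly from the line counts, and the associativity gives the number of sets. The sketch below uses the standard set-associative relationship (sets = lines / ways); the helper name `cache_geometry` is made up for this example.

```python
def cache_geometry(num_lines, line_bytes, ways):
    """Return (total size in bytes, number of sets) for a set-associative cache."""
    size_bytes = num_lines * line_bytes
    num_sets = num_lines // ways
    return size_bytes, num_sets


# One L2 cache: 2048 lines of 64 bytes, 8-way.
l2_size, l2_sets = cache_geometry(num_lines=2048, line_bytes=64, ways=8)

# One L1 cache: 256 lines of 64 bytes, 64-way.
l1_size, l1_sets = cache_geometry(num_lines=256, line_bytes=64, ways=64)

# Four separate L2 caches on the chip.
total_l2_size = 4 * l2_size

print(f"L2: {l2_size // 1024} KB per cache, {l2_sets} sets, "
      f"{total_l2_size // 1024} KB across four caches")
print(f"L1: {l1_size // 1024} KB per shader core, {l1_sets} sets")
```

With only four sets of 64 ways each, the L1 is close to fully associative, which fits its role of coalescing nearby requests rather than holding a large working set.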

The L1 cache guarantees consistent ordering per thread: A write followed by a read from the same address, for example, will give the updated value. The L1 cache does not, however, ensure consistency across threads or across vectors. Such requirements must be enforced explicitly—using barriers in the shader for example. Data is not shared between L1 caches or between SCs except via write-back to the L2 cache.
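A toy model can make the difference between the two write policies, and the resulting visibility rules, more concrete. This is a simplified sketch, not a description of the hardware: caches are modeled as dictionaries of single values, and the `invalidate` method is a hypothetical stand-in for whatever a barrier does to make another SC's writes visible.

```python
class L2Cache:
    """Write-back: modified lines reach memory only when evicted."""

    def __init__(self, memory):
        self.memory = memory  # backing store (main RAM or ESRAM)
        self.lines = {}       # address -> cached value
        self.dirty = set()    # addresses modified since they were loaded

    def read(self, addr):
        if addr not in self.lines:
            self.lines[addr] = self.memory.get(addr, 0)
        return self.lines[addr]

    def write(self, addr, value):
        self.lines[addr] = value
        self.dirty.add(addr)

    def evict(self, addr):
        if addr in self.dirty:
            self.memory[addr] = self.lines[addr]
            self.dirty.discard(addr)
        self.lines.pop(addr, None)


class L1Cache:
    """Write-through: every write is pushed to L2 immediately."""

    def __init__(self, l2):
        self.l2 = l2
        self.lines = {}

    def read(self, addr):
        if addr not in self.lines:
            self.lines[addr] = self.l2.read(addr)
        return self.lines[addr]

    def write(self, addr, value):
        self.lines[addr] = value
        self.l2.write(addr, value)

    def invalidate(self):
        # Hypothetical stand-in for the effect of an explicit barrier.
        self.lines.clear()


memory = {0x40: 1}
l2 = L2Cache(memory)
sc_a, sc_b = L1Cache(l2), L1Cache(l2)

first_b = sc_b.read(0x40)      # SC B caches the old value
sc_a.write(0x40, 2)            # SC A writes through to L2
own_read = sc_a.read(0x40)     # SC A sees its own write
stale_b = sc_b.read(0x40)      # SC B may still see the old value
memory_before = memory[0x40]   # L2 has not written back yet
sc_b.invalidate()
fresh_b = sc_b.read(0x40)      # now served from L2
l2.evict(0x40)
memory_after = memory[0x40]    # write-back happened on eviction
```

In this model the writing SC always observes its own update, while the other SC needs an explicit step before it picks up the new value from L2.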

Unlike some earlier GPUs (including the Xbox 360 GPU), Durango leaves texture and buffer data in native compressed form in the L2 and L1 caches. Compressed data implies a longer fetch pipeline—every L1 cache must now have decoder hardware in it that repeats the same calculation each time the same data is fetched. On the other hand, by keeping data compressed longer, the GPU limits cache footprint and intermediate bandwidth. Following the same principle, sRGB textures are left in gamma space in the cache, and, therefore, have the same footprint as linear textures.

To see how this policy affects cache efficiency, consider an sRGB BC1 texture—perhaps the most commonly encountered texture type in games. BC1 is a 4-bit per texel format; on Durango, this texture occupies 4 bits per texel in the L1 cache. On Xbox 360, the same texture is decompressed and gamma corrected before it reaches the cache, and therefore occupies 8 bytes per texel, or 16 times the Durango footprint. For this reason, the Durango L1 cache behaves like a much larger cache when compared against previous architectures.
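The footprint comparison can be checked with a few lines of arithmetic. The per-texel sizes are the ones quoted above; applying both of them to a 16-KB cache is a hypothetical like-for-like comparison, not a statement about the actual size of the Xbox 360 texture cache.

```python
L1_BYTES = 16 * 1024

# Per-texel footprints quoted in the text.
DURANGO_BC1_BITS_PER_TEXEL = 4           # kept compressed in the cache
X360_DECOMPRESSED_BYTES_PER_TEXEL = 8    # decompressed and gamma corrected

x360_bits_per_texel = X360_DECOMPRESSED_BYTES_PER_TEXEL * 8
footprint_ratio = x360_bits_per_texel // DURANGO_BC1_BITS_PER_TEXEL

# Texels that fit in a 16-KB cache under each policy (hypothetical comparison).
texels_compressed = L1_BYTES * 8 // DURANGO_BC1_BITS_PER_TEXEL
texels_decompressed = L1_BYTES // X360_DECOMPRESSED_BYTES_PER_TEXEL

print(f"footprint ratio: {footprint_ratio}x")
print(f"texels in 16 KB, compressed: {texels_compressed}")
print(f"texels in 16 KB, decompressed: {texels_decompressed}")
```

Under these numbers the same 16 KB holds 32,768 compressed BC1 texels against 2,048 decompressed ones.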

Just as SCs can hide fetch latency by switching to other vectors, L1 texture caches on Durango are capable of hiding L2 cache latency by continuing to process fetch instructions after a miss. In other words, when a cache miss is followed by one or more cache hits, the hits can be satisfied during the stall for the miss.

Fetch

Durango supports two types of fetch operation—image fetches and buffer fetches. Image fetches correspond to the Sample method in high-level shader language (HLSL) and require both a texture register and a sampler register. Features such as filtering, wrapping, mipmapping, gamma correction, and block compression require image fetches. Buffer fetches correspond to the Load method in HLSL and require only a texture register, without a sampler register. Examples of buffer fetches are:

  • Vertex fetches
  • Direct3D 10-style gather4 operations (which fetch a single unfiltered channel from 4 texels, rather than multiple filtered channels from a single texel)
  • Fetches from formats that are natively unfilterable, such as integer formats

 

Image fetches and buffer fetches have different performance characteristics. Image fetches are generally bound by the speed of the texture pipeline and operate at a peak rate of four texels per clock. Buffer fetches are generally bound by the write bandwidth into the destination registers and operate at a peak rate of 16 GPRs per clock. In the typical case of an 8-bit four-channel texture, these two rates are identical. In other cases, such as a 32-bit one-channel texture, buffer fetch can be up to four times faster.
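The relationship between the two peak rates can be sketched as follows, assuming that each channel of a fetched texel is written to one destination GPR. That assumption is consistent with the two examples given above, but it is an interpretation rather than a documented rule, and the function names are made up for this example.

```python
IMAGE_TEXELS_PER_CLOCK = 4   # peak image fetch rate from the text
BUFFER_GPRS_PER_CLOCK = 16   # peak buffer fetch rate from the text


def image_texels_per_clock(channels):
    """Image fetches are bound by the texture pipeline, not channel count."""
    return float(IMAGE_TEXELS_PER_CLOCK)


def buffer_texels_per_clock(channels):
    """Buffer fetches are bound by GPR writes: assume one GPR per channel."""
    return BUFFER_GPRS_PER_CLOCK / channels


def buffer_speedup(channels):
    """How much faster a buffer fetch is than an image fetch at peak."""
    return buffer_texels_per_clock(channels) / image_texels_per_clock(channels)


for channels in (4, 2, 1):
    print(f"{channels}-channel texture: buffer fetch is "
          f"{buffer_speedup(channels):.0f}x the image fetch rate")
```

A four-channel texture gives a ratio of 1 (the rates are identical), and a one-channel texture gives the factor of four mentioned in the text.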

Many factors can reduce effective fetch rate. For instance, trilinear filtering, anisotropic filtering, and fetches from volume maps all translate internally to iterations over multiple bilinear fetches. Bilinear filtering of data formats wider than 32 bits per texel also operates at a reduced rate. Floating-point formats that have more than three channels operate at half rate. Use of per-pixel gradients causes fetches to operate at quarter rate.

By contrast, fetches from sRGB textures are full rate. Gamma conversion internally uses a modified 7e4 floating-point representation. This format is large enough to be bitwise exact according to the DirectX 10 spec, yet still small enough to fit through a single filtering pipe.

The Durango GPU supports all standard Direct3D 11 DXGI formats, as well as some custom formats.

Compute

Each of the 12 Durango SCs has its own L1 cache, LSM (local shared memory), scheduler, and four SIMD units.

[Figure: Durango shader core. Each O represents a single thread of the currently executing shader.]

SIMD

Each of the four SIMDs in the shader core is a vector processor in the sense of operating on vectors of threads. A SIMD executes a vector instruction on 64 threads at once in lockstep. Per thread, however, the SIMDs are scalar processors, in the sense of using float operands rather than float4 operands. Because the instruction set is scalar in this sense, shaders no longer waste processing power when they operate on fewer than four components at a time. Analysis of Xbox 360 shaders suggests that of the five available lanes (a float4 operation, co-issued with a float operation), only three are used on average.
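The utilization argument reduces to simple ratios. The sketch below uses the five-lane and three-lane figures from the paragraph above; the `float4_unit_utilization` helper is a made-up illustration of how a float4 unit idles lanes when a shader touches fewer than four components.

```python
# Figures quoted in the text for Xbox 360 shaders.
X360_LANES = 5              # a float4 operation co-issued with a float operation
X360_LANES_USED_AVG = 3     # average number of lanes doing useful work

x360_utilization = X360_LANES_USED_AVG / X360_LANES
x360_idle_fraction = 1 - x360_utilization


def float4_unit_utilization(components):
    """Fraction of a four-lane vector unit used by an n-component operation."""
    return components / 4


def scalar_instruction_count(components):
    """A per-thread scalar ISA issues one instruction per component, none idle."""
    return components


print(f"Xbox 360 average lane utilization: {x360_utilization:.0%}")
print(f"float3 operation on a float4 unit: {float4_unit_utilization(3):.0%} of lanes used")
print(f"float3 operation on a scalar ISA: {scalar_instruction_count(3)} fully used instructions")
```

On the quoted average, 40 percent of the Xbox 360 lanes are idle, which is the waste the per-thread scalar instruction set avoids.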

The SIMD instruction set is extensive, and supports 32-bit and 64-bit integer and float data types. Operations on wider data types occupy multiple processor pipes, and therefore run at slower rates: for example, 64-bit adds run at one-eighth rate, and 64-bit multiplies at one-sixteenth rate. Transcendental operations, such as square root, reciprocal, exponential, logarithm, sine, and cosine, are non-pipelined and run at quarter rate. Transcendental operations should be used sparingly on Durango because they are more expensive relative to arithmetic operations than they are on Xbox 360.
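The fractional rates translate into cycle counts once a full-rate baseline is fixed. The sketch below takes the four-cycle throughput that the Scheduler section quotes for most instructions as that baseline; treating "one-eighth rate" as "eight times the baseline cycles" is an interpretation, not a documented formula.

```python
from fractions import Fraction

FULL_RATE_CYCLES = 4  # four-cycle throughput for most instructions

# Rates relative to full rate, as quoted in the text.
RATES = {
    "full rate": Fraction(1),
    "transcendental": Fraction(1, 4),
    "64-bit add": Fraction(1, 8),
    "64-bit multiply": Fraction(1, 16),
}


def cycles_per_vector_instruction(rate):
    """Cycles one SIMD spends issuing the instruction for a 64-thread vector."""
    return int(FULL_RATE_CYCLES / rate)


cycles = {name: cycles_per_vector_instruction(rate) for name, rate in RATES.items()}

for name, count in cycles.items():
    print(f"{name}: {count} cycles per vector instruction")
```

Under this reading, a single 64-bit multiply occupies a SIMD for as long as sixteen full-rate instructions.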

Scheduler

The scheduler of the SC is responsible for loading shader code from memory and controlling execution of the four SIMDs. In addition to managing the SIMDs, the scheduler also executes certain types of instructions on its own. These instructions come from a separate scalar instruction set; they perform an operation per vector rather than an operation per thread. A scalar instruction might be employed, for example, to add two shader constants. In microcode, scalar instructions have names beginning with s_, while vector instructions have names beginning with v_.
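The distinction between per-vector and per-thread work can be illustrated with a small model. The function names `s_add_example` and `v_add_example` only borrow the s_ and v_ prefixes; they are hypothetical and do not correspond to actual microcode opcodes.

```python
THREADS_PER_VECTOR = 64

operation_count = {"scalar": 0, "vector": 0}


def s_add_example(a, b):
    """Scalar-style add: one operation for the whole vector of threads."""
    operation_count["scalar"] += 1
    return a + b


def v_add_example(a, b):
    """Vector-style add: one operation per thread, each with its own operands."""
    assert len(a) == len(b) == THREADS_PER_VECTOR
    operation_count["vector"] += THREADS_PER_VECTOR
    return [x + y for x, y in zip(a, b)]


# Two shader constants are the same for every thread, so one add suffices.
constant_sum = s_add_example(0.25, 0.5)

# Per-thread values differ, so each thread needs its own add.
per_thread = [float(i) for i in range(THREADS_PER_VECTOR)]
broadcast = [constant_sum] * THREADS_PER_VECTOR
result = v_add_example(per_thread, broadcast)
```

Adding the two constants once in the scalar path avoids repeating the same addition in all 64 threads.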

The scheduler tracks dependencies within a vector, keeping track of when the next instruction is safe to run. In addition, the scheduler handles dynamic branch logic and loops.

On each clock cycle, the scheduler considers one of the four SIMDs, iterating over them in a round-robin fashion. Most instructions have a four-cycle throughput, so each SIMD only needs attention once every four clocks. A SIMD can have up to 10 vectors in flight at any time. The scheduler selects one or more of these 10 candidate vectors to execute an instruction. The scheduler can simultaneously issue multiple instructions of different types (for instance, a vector operation, a scalar operation, a global memory operation, a local memory operation, and a branch operation), but each operation must act on a different vector.
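The numbers in this section combine into a few derived quantities. The sketch below assumes that the 64 threads of a vector are spread evenly over the four cycles of an instruction, which would imply 16 threads processed per SIMD per clock, and that the round-robin order starts at SIMD 0. Both are assumptions for illustration.

```python
NUM_SCS = 12
SIMDS_PER_SC = 4
VECTORS_PER_SIMD = 10
THREADS_PER_VECTOR = 64
CYCLES_PER_INSTRUCTION = 4


def simd_considered(clock):
    """Round-robin: which SIMD the scheduler looks at on a given clock."""
    return clock % SIMDS_PER_SC


schedule = [simd_considered(clock) for clock in range(8)]

# Threads processed per SIMD per clock, assuming an even spread over four cycles.
threads_per_simd_per_clock = THREADS_PER_VECTOR // CYCLES_PER_INSTRUCTION

# Maximum threads in flight, from 10 vectors of 64 threads per SIMD.
threads_in_flight_per_sc = SIMDS_PER_SC * VECTORS_PER_SIMD * THREADS_PER_VECTOR
threads_in_flight_gpu = NUM_SCS * threads_in_flight_per_sc

print(f"SIMD considered on clocks 0-7: {schedule}")
print(f"threads per SIMD per clock: {threads_per_simd_per_clock}")
print(f"threads in flight per SC: {threads_in_flight_per_sc}")
print(f"threads in flight across the GPU: {threads_in_flight_gpu}")
```

Each SIMD comes up once every four clocks, matching the four-cycle instruction throughput, and the ten vectors per SIMD give the scheduler alternatives to pick from when one vector is waiting on memory.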

 

