One of the few Xbox One (Durango) components that has yet to be unveiled is the sound block. This article describes this important part of the system.
The Xbox One (Durango) audio architecture seeks a balance between the successes and tradeoffs of previous-generation platforms while anticipating the increasing technical needs of next-generation implementations. It provides hardware-accelerated pathways for the most common aspects of audio rendering (compression, mixing, filtering, and so on) on a large number of concurrent voices. The architecture also provides a shared resource model for software processing, allowing each title to choose which custom signal manipulation to apply and how much CPU time to spend on it.
Audio Architectural Overview
In addition to general CPU power (which can be used for decoding, synthesis, rendering, and so on), Durango provides several hardware components dedicated to audio processing. The audio hardware components can address the entire unified memory space. The hardware-accelerated audio components are SHAPE, the ACP, the ASP, and the AVP.
The SHAPE (Scalable Hardware Audio Processing Engine) block comprises the majority of audio functionality, although the other processors also contribute significant features.
SHAPE (Scalable Hardware Audio Processing Engine)
The core hardware dedicated to audio processing is SHAPE. It is designed to perform many of the basic operations commonly required on a per-voice basis. This hardware allows a developer to reduce CPU impact—even for high polyphony and complex-signal routings—and still provide the flexibility of SHAPE/CPU data interchange if a title chooses to perform custom digital signal processing, analysis, or software synthesis.
SHAPE operates on blocks of 128 samples, where each sample supports 24-bit integer resolution (or 32-bit float when used by the CPU). At 48 kHz, this represents a 2.67 ms audio frame, providing increased timing resolution and decreased latency compared with the Xbox 360's 256-sample block size. SHAPE offers six fixed-function blocks focused on common audio tasks:
1. XMA Decoder: Concurrent decodes of 512 XMA format voices. XMA is a perceptual codec developed for Xbox 360 offering user-tunable quality and typically providing between 6:1 and 14:1 compression.
2. SRC: A high-quality dedicated polyphase sample rate conversion block allowing for high performance and high-quality frequency resampling of 512 mono channels of audio data (whether for format conversion, Doppler effect, or pitch variation).
3. Mix Buffers: Dedicated accumulators for 128 in-place mix channels without needing to access memory, and with additional channels available virtually. These mix buffers also provide coarse metering and clipping detection for debugging and monitoring.
4. FLT/VOL: A module providing both volume scaling and a state variable filter implementation for more than 2,500 voices/mixes, analogous to the software-exposed XAudio2 per-voice filter available on Windows and Xbox 360. The filter can provide low pass, high pass, band pass, or notch filtering, and exposes Q and cutoff/center frequency parameters. It is used most commonly for distance and occlusion modeling.
5. EQ/CMP: A module providing up to 512 channels of 3-band equalization and dynamic range compression. The EQ consists of three serially cascaded biquad filters. The compressor has a hard-knee response, and supports both side chain and expander functionality.
6. DMA: SHAPE has dedicated DMA hardware for transferring audio data to and from the unified memory space. This enables scenarios that include transfer without a sample-rate converter, transferring final mix channels, and CPU-based processing in the middle of a SHAPE-based audio graph.
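The frame duration quoted for SHAPE's 128-sample blocks follows from simple arithmetic. The short sketch below reproduces it and compares it with the Xbox 360's 256-sample block; the function names are illustrative and not part of any Durango API.

```python
def frame_duration_ms(block_samples: int, sample_rate_hz: int) -> float:
    """Duration of one processing block, in milliseconds."""
    return 1000.0 * block_samples / sample_rate_hz


def frames_per_second(block_samples: int, sample_rate_hz: int) -> float:
    """How many processing blocks occur per second of audio."""
    return sample_rate_hz / block_samples


if __name__ == "__main__":
    # Block sizes are taken from the text; 48 kHz is the stated default rate.
    for name, block in (("SHAPE", 128), ("Xbox 360", 256)):
        print(f"{name}: {frame_duration_ms(block, 48_000):.2f} ms per frame, "
              f"{frames_per_second(block, 48_000):.1f} frames per second")
```

Halving the block size halves the frame duration, which is where the finer timing resolution and lower latency come from.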
Playback of a typical audio graph is expected to use each of these processors extensively.
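To make the FLT/VOL description more concrete, here is a minimal software sketch of a state variable filter that yields low pass, high pass, band pass, and notch outputs from one structure, controlled by cutoff frequency and Q. It uses the textbook Chamberlin form, which is an assumption: the source does not say which topology SHAPE or XAudio2 actually implements, and the class and parameter names are hypothetical.

```python
import math


class StateVariableFilter:
    """Chamberlin-style state variable filter (illustrative, not SHAPE's design)."""

    MODES = ("lowpass", "highpass", "bandpass", "notch")

    def __init__(self, cutoff_hz: float, q: float, sample_rate_hz: int = 48_000):
        # Frequency coefficient; this form is only stable for cutoffs
        # well below the sample rate (roughly under sample_rate / 6).
        self.f = 2.0 * math.sin(math.pi * cutoff_hz / sample_rate_hz)
        self.damping = 1.0 / q
        self.low = 0.0   # low-pass state
        self.band = 0.0  # band-pass state

    def process(self, samples, mode: str = "lowpass"):
        """Filter a block of samples and return the selected output."""
        index = self.MODES.index(mode)
        out = []
        for x in samples:
            self.low += self.f * self.band
            high = x - self.low - self.damping * self.band
            self.band += self.f * high
            notch = high + self.low
            out.append((self.low, high, self.band, notch)[index])
        return out
```

For distance or occlusion modeling, a title would typically lower the low-pass cutoff as a source moves away or becomes obstructed; the filter state carries across calls, so successive 128-sample blocks can be processed one after another.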
ACP (Audio Control Processor)
The ACP provides state management and scheduling of all other audio hardware components on the North Bridge. As a result, the CPU does not need to be involved in intra-frame processing, which avoids the synchronization overhead and latency that such involvement might introduce.
ASP (Audio Scalar Processor)
The ASP supports scalar float and vector integer operations. Voice chat codecs (both those that manage wireless communication between a voice chat headset and the console, and those that are used to compress/decompress voice data for networked voice communication) are provided in hardware. Additionally, this processor supports xWMA format decompression in hardware; on Xbox 360, xWMA was solely a CPU-side decode option.
AVP (Audio Vector Processor)
The AVP supports vector float operations and is designed primarily for MEC (multichannel echo cancellation) and other noise reduction for the next-generation Kinect audio input. It supports both speech recognition and chat or arbitrary audio input. MEC and other noise reduction processing yield a more intelligible stream of the player’s spoken audio, even from a far-talk microphone that is typically positioned closer to the output speakers than to the player.
Audio and Durango Hardware
Durango’s audio output pipeline eliminates the DAC (digital-to-analog converter) found in previous generation consoles. All audio is output strictly in the digital realm either through HDMI 1.4a or as S/PDIF optical output. HDMI 1.4a allows for high-fidelity linear 7.1-channel PCM to be transmitted from the console; titles default to an output sampling rate of 48 kHz and a bit depth of 24 bits. Durango is also designed to support up to four simultaneous stereo headset outputs, each of which can represent unique multichannel mixes that are downmixed as required by the output format (for instance, a headset or the S/PDIF output).
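As a rough sense of scale, the raw payload rate of the default output format can be worked out directly from the figures above. The sketch below counts only the PCM sample data and ignores any HDMI or S/PDIF framing overhead; the function name is illustrative.

```python
def pcm_bit_rate(channels: int, sample_rate_hz: int, bits_per_sample: int) -> int:
    """Raw PCM payload rate in bits per second (no transport overhead)."""
    return channels * sample_rate_hz * bits_per_sample


if __name__ == "__main__":
    # 7.1 means 8 discrete channels; 48 kHz and 24 bits are the stated defaults.
    rate = pcm_bit_rate(channels=8, sample_rate_hz=48_000, bits_per_sample=24)
    print(f"7.1 PCM at 48 kHz / 24-bit: {rate / 1_000_000:.3f} Mbit/s")
```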
Durango accepts audio input from a variety of sources: the next-generation Kinect microphone array, voice chat headsets, other audio input peripherals, and storage media (whether HDD, flash, or cloud storage). Audio also can be algorithmically generated through CPU-based computation and manipulated in real time on a CPU, through the aforementioned SHAPE hardware components, or both.
Compression Formats
Durango offers hardware decompression support for both XMA2 and xWMA, both of which provide significant storage, bandwidth, and memory reductions over uncompressed PCM. XAudio2 also offers software support for ADPCM (Adaptive Differential Pulse Code Modulation). Although the computation for the ADPCM format is low overhead, as a non-perceptual codec ADPCM can exhibit noticeable artifacts at lower sampling rates.
| | Compression (approximate) | | | |
|---|---|---|---|---|
| PCM | None | Yes | Yes | Arbitrary |
| ADPCM | 3.5-4:1 | (Software) | Block aligned | |
| XMA2 | 6-14:1 | Yes (512 Hardware) | Yes (320 Hardware) | 128 sample-aligned |
| xWMA | 20-40:1 | Yes (Hardware + Software support) | Yes (Software only) | End to end only, may gap |
Additional audio formats—for instance, MP3 or OGG—for game assets can be provided through title or middleware software codecs running on a CPU.
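The approximate compression ratios listed above translate into storage estimates with a little arithmetic. The sketch below assumes a 16-bit stereo PCM source at 48 kHz as the uncompressed baseline, which is an assumption made for illustration and is not stated in the source; the ratios are the approximate ranges from the table, and real results depend on content and encoder settings.

```python
# Approximate (low, high) compression ratios taken from the table above.
RATIOS = {
    "PCM": (1.0, 1.0),
    "ADPCM": (3.5, 4.0),
    "XMA2": (6.0, 14.0),
    "xWMA": (20.0, 40.0),
}


def pcm_bytes(seconds: float, sample_rate_hz: int = 48_000,
              channels: int = 2, bytes_per_sample: int = 2) -> int:
    """Size of uncompressed PCM; the defaults are illustrative assumptions."""
    return int(seconds * sample_rate_hz * channels * bytes_per_sample)


def compressed_range_bytes(fmt: str, seconds: float):
    """Return (smallest, largest) estimated size in bytes for a format."""
    raw = pcm_bytes(seconds)
    low_ratio, high_ratio = RATIOS[fmt]
    return raw / high_ratio, raw / low_ratio


if __name__ == "__main__":
    for fmt in RATIOS:
        smallest, largest = compressed_range_bytes(fmt, 60)
        print(f"{fmt:6s} one minute: {smallest / 1e6:.2f} to {largest / 1e6:.2f} MB")
```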
Audio and the Durango App Model
While in the foreground, an application has full access to the SHAPE hardware. When that application is pushed to the background—pinned, picture-in-picture, or other scenarios—it relinquishes hardware control. By default, its hardware state is suspended, and resumes when the title returns to the foreground. This also is true for Exclusive Resource Applications [ERAs] where the software graph is suspended.
A title may optionally choose to tear down its audio graph and reconstruct it upon resume. Some titles, particularly Shared Resource Applications [SRAs] that play background music such as streaming radio, may choose to have some aspects of audio continue to play even while paused. For these scenarios, titles should closely evaluate whether to attempt a seamless transition from hardware to software rendering, or to always play audio intended for background playback via a software-only pipeline. This has implications for compression formats and CPU costs. XMA-compressed assets, for example, require the use of SHAPE hardware, and thus will not be decodable for a background application.
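The planning decision described above can be sketched as a small lookup: given an asset's format and whether the application is in the foreground, which decode path is available? Everything in this sketch (the enum, the sets, and the function name) is hypothetical and only encodes what this article states: SHAPE is available only in the foreground, XMA2 requires SHAPE, and xWMA has both hardware and software decode support.

```python
from enum import Enum


class AppState(Enum):
    FOREGROUND = "foreground"
    BACKGROUND = "background"


# Formats the article describes as hardware-decodable.
HARDWARE_DECODE = {"XMA2", "xWMA"}
# Formats with a CPU-side path; PCM needs no decompression at all.
SOFTWARE_DECODE = {"PCM", "ADPCM", "xWMA"}


def decode_path(fmt: str, state: AppState):
    """Return "hardware", "software", or None if the asset cannot be decoded."""
    if state is AppState.FOREGROUND and fmt in HARDWARE_DECODE:
        return "hardware"
    if fmt in SOFTWARE_DECODE:
        return "software"
    return None  # e.g. XMA2 while the app is in the background


if __name__ == "__main__":
    for fmt in ("PCM", "ADPCM", "XMA2", "xWMA"):
        for state in AppState:
            print(f"{fmt:6s} {state.value:10s} -> {decode_path(fmt, state)}")
```

A title that wants background music to keep playing could use a check like this at authoring time to flag assets whose only decode path disappears when the app leaves the foreground.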
The XAudio2 audio engine does provide software pathways for many functions if a title chooses to allocate CPU resources. Where practical, these functions mimic hardware capabilities, but some compute intensive processing is either unavailable or is differently implemented in software. Titles transitioning from hardware processing to software processing based on an app’s state may want to consider these differences when planning their audio pipelines.
| | Durango Hardware Capability | | |
|---|---|---|---|
| Sample Rate Conversion | (SRC) polyphase | XAudio2 linear interpolation | No |
| Parametric EQ | (EQ/CMP) 3-band EQ | 3-band EQ, simple one-band, or custom DSP | Yes |
| Compressor/Limiter | (EQ/CMP) Hard-knee, side chain, and expander capabilities | Hard-knee, side chain, and expander capabilities | Yes |
| Filtering | (FLT/VOL) State variable filter | XAudio2 state variable filter, single-pole LPF, or custom DSP | Yes |
| Mixing | (Mix Buffers) Includes clip detection, metering | Software mixing; custom DSP for clip detection or metering | Yes (for mixing) |
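The sample rate conversion row is where the hardware and software paths differ most: the hardware SRC is polyphase, while the software path uses linear interpolation. As a rough illustration of the latter, here is a minimal linear-interpolation resampler. It is a generic textbook sketch, not XAudio2's actual implementation, and the function name and ratio convention are assumptions.

```python
def resample_linear(samples, ratio: float):
    """Resample by linear interpolation.

    ratio = input_rate / output_rate, so ratio > 1 produces fewer output
    samples (pitch up) and ratio < 1 produces more (pitch down).
    """
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    if not samples:
        return []
    out = []
    last = len(samples) - 1
    n = 0
    while True:
        pos = n * ratio          # fractional read position in the input
        if pos > last:
            break
        i = int(pos)
        frac = pos - i
        nxt = samples[i + 1] if i < last else samples[i]
        # Blend the two neighbouring input samples.
        out.append(samples[i] + frac * (nxt - samples[i]))
        n += 1
    return out
```

Linear interpolation is cheap but applies no band-limiting, so it can introduce aliasing and high-frequency loss that a polyphase filter is designed to avoid; this is one reason a title moving between hardware and software rendering may hear a difference.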
Durango Audio Libraries
Durango supports two audio rendering APIs for typical game use along with a variant of the Windows 8 Media Foundation API for playback of user music:
1. XAudio2, a game-focused audio library already available on Xbox 360 and Windows operating systems (Windows XP to Windows 8), is generally recommended for most title development.
2. WASAPI (Windows Audio Session API) can be used for any custom, exclusively software-implemented pipeline. WASAPI provides audio endpoint functionality only. Decompression, sample-rate conversion, mixing, and digital-signal processing, as well as interactions with Durango’s audio hardware components, must be implemented by the client. WASAPI is most typically used by audio middleware solutions.
The Microsoft Cross-Platform Audio Creation Tool (XACT) and DirectSound are not supported in the Durango environment. Titles that previously used these technologies should consider the solutions identified above, or use approved Durango audio middleware options.