Thursday 17 February 2011

Firefox 4 video decoder architecture

To assist others coming up to speed on the architecture of the video decoder, I've put together a diagram of Firefox 4's video playback engine. We rewrote our video architecture for Firefox 4 in order to give us better control over the complete stack.

Click on the image for a larger diagram.

The key classes in our architecture are:
  • nsHTMLMediaElement - This manages the JavaScript/HTML accessible HTMLMediaElement interface, and implements the resource selection, load, and preload logic.
  • nsBuiltinDecoder - Manages a snapshot of the underlying decoder's state that is safe to read on the main thread. The decoders run on non-main threads, and we don't want to block the main thread to dig into the decoders when JS queries playback state, so we maintain a snapshot of the playback engine's state in this class. This inherits from nsMediaDecoder. You can also implement playback support for a new format by inheriting from and implementing nsMediaDecoder directly. nsWaveDecoder is currently implemented this way, but we're in the process of reimplementing that as a subclass of nsBuiltinDecoderReader.
  • nsBuiltinDecoderStateMachine - Manages the decode, state machine, and audio-push threads, as well as the frame queueing, A/V sync, and buffering logic. This ensures that all the HTML5 events get dispatched at the appropriate time, and that behaviour is consistent and sane across different media types. Demuxing is handled abstractly by subclasses of nsBuiltinDecoderReader. This way all media types can share as much playback logic as possible, reducing our maintenance overhead.
  • nsOggReader/nsWebMReader - Demuxing and codec-specific functionality is implemented by subclassing nsBuiltinDecoderReader. This reduces the amount of work required to implement and maintain support for new codecs. When a new codec is implemented as an nsBuiltinDecoderReader subclass, support for HTML events, buffering, and playback logic does not need to be reimplemented, since it already exists in nsBuiltinDecoderStateMachine. This makes a new nsBuiltinDecoderReader subclass the easiest way to add support for a new codec.
  • nsAudioStream - Our cross-platform audio API wrapper. It is based on libsydneyaudio, which operates on a push model rather than the more commonly used callback-based model, and this has brought in a whole raft of headaches. Matthew Gregan is in the process of rewriting our audio layer to use a saner callback-based model. We also provide a cross-process nsAudioStreamRemote, which proxies audio commands to an audio stream in another process. This is required on mobile.
  • ImageContainer - When it comes time to present a video frame, nsBuiltinDecoderStateMachine sets it as the "current image" of the video element's ImageContainer object. This then propagates through the Layers/2D scene rendering system, and it eventually gets rendered on the screen. The Layers compositing runs on the main thread, and ImageContainer provides a thread-safe wrapper. The images contained in the ImageContainer can be in OpenGL/D3D surfaces, so we can take advantage of hardware accelerated scaling, rendering, and YCbCr to RGB conversion.
  • nsVideoFrame - This resides in layout, and manages the dimensions/reflow of the video, as well as its poster image.
  • nsMediaStream - Our network code runs on the main thread, but the underlying libraries we use for media decoding (libvpx, libtheora, etc) assume synchronous reads. We can't afford to do blocking reads on the main thread, so we cache the downloaded media data in the nsMediaCache, and provide a thread-safe synchronous wrapper for reading it in the nsMediaStream class. We use Necko for our networking, so we can take advantage of all the existing security and load-group functionality it implements.
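
The split between the shared state machine and the format-specific readers described above can be sketched roughly as follows. This is a minimal illustration only, not the real Gecko code: DecoderReader, DecoderStateMachine, FakeReader, VideoFrame, and every method name here are hypothetical stand-ins, and the real classes also deal with threading, audio, and events.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <memory>
#include <utility>

// Hypothetical types for illustration; not the real Gecko declarations.
struct VideoFrame {
  int64_t startUs;  // presentation start time in microseconds
};

// Format-specific part: demuxing and decoding. One subclass per format.
class DecoderReader {
 public:
  virtual ~DecoderReader() = default;
  // Writes the next decoded frame to `out`; returns false at end of stream.
  virtual bool DecodeNextFrame(VideoFrame& out) = 0;
};

// Format-independent part: frame queueing, shared by every reader.
class DecoderStateMachine {
 public:
  explicit DecoderStateMachine(std::unique_ptr<DecoderReader> reader)
      : mReader(std::move(reader)) {
    assert(mReader && "a reader is required");
  }

  // Decodes until `target` frames are queued; returns false if the
  // stream ended before the target was reached.
  bool FillQueue(std::size_t target) {
    while (mQueue.size() < target) {
      VideoFrame frame{};
      if (!mReader->DecodeNextFrame(frame)) {
        return false;
      }
      mQueue.push_back(frame);
    }
    return true;
  }

  std::size_t QueuedFrames() const { return mQueue.size(); }

 private:
  std::unique_ptr<DecoderReader> mReader;
  std::deque<VideoFrame> mQueue;
};

// A stand-in "new format": produces `count` frames spaced 40 ms apart.
class FakeReader : public DecoderReader {
 public:
  explicit FakeReader(int count) : mRemaining(count) {}

  bool DecodeNextFrame(VideoFrame& out) override {
    if (mRemaining == 0) {
      return false;
    }
    --mRemaining;
    out.startUs = mNextStartUs;
    mNextStartUs += 40000;
    return true;
  }

 private:
  int mRemaining;
  int64_t mNextStartUs = 0;
};
```

In this sketch, supporting another format means writing only another DecoderReader subclass; the queueing logic in DecoderStateMachine is reused unchanged.
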
The advantages of controlling the entire playback engine are many. We can easily control frame dropping, memory allocation, the threading model, what, when, and how we decode, and we can integrate more tightly with our network stack.
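
As one illustration of what controlling frame dropping can look like when the engine owns its frame queue, here is a minimal sketch. It assumes a queue of frames sorted by presentation time and a playback clock in microseconds; QueuedFrame and DropLateFrames are hypothetical names, and this is not the policy Firefox actually implements.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>

// Hypothetical frame record; times are in microseconds.
struct QueuedFrame {
  int64_t startUs;
  int64_t endUs;
};

// Discards frames at the front of the queue whose display interval has
// already ended at `clockUs`. Returns the number of frames dropped.
// Assumes the queue is ordered by presentation time.
std::size_t DropLateFrames(std::deque<QueuedFrame>& queue, int64_t clockUs) {
  assert(clockUs >= 0 && "the playback clock should not be negative");
  std::size_t dropped = 0;
  while (!queue.empty() && queue.front().endUs <= clockUs) {
    queue.pop_front();
    ++dropped;
  }
  return dropped;
}
```

A caller would run something like this before presenting each frame, so that playback skips ahead to the frame matching the current clock instead of falling further behind.
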


physicow said...

My primary laptop is a netbook, and so doesn't have the horsepower to decode video without the help of hardware. To overcome this, I have a Broadcom Crystal HD card, which has initial support in Linux.

What I don't see, however, is how Firefox would use the hardware to decode the video at an acceptable rate. Do you have any insight into this?

Manoj Mehta said...

What has switching to this new architecture improved over FF3.6? Performance? Memory Management? Provided hardware acceleration?

Chris Pearce said...

@physicow: Demuxing and decoding is handled by the nsBuiltinDecoderReader subclasses. So if you wanted demuxing and decoding to be hardware accelerated, you'd need to change the appropriate reader (and probably the underlying libraries) to take advantage of your hardware.

@Manoj: The new architecture now uses the layers system for rendering, so "we can take advantage of hardware accelerated scaling, rendering, and YCbCr to RGB conversion." This improves performance significantly.

Because we now have finer control over decoding, we can degrade performance more gracefully on low-powered hardware. This also makes it possible to implement the @buffered attribute, which would have been much harder if we didn't have such control over the decoding pipeline.

physicow said...

Cool, thanks!

Manoj Mehta said...