H.264 Video Coding - Overview
H.264, also known as MPEG-4 Part 10 or AVC (Advanced Video Coding), is a standard for video compression. The final drafting work on the first version of the standard was completed in May 2003.
H.264/MPEG-4 AVC is a block-oriented motion-compensation-based codec standard developed by the ITU-T Video Coding Experts Group (VCEG) together with the ISO/IEC Moving Picture Experts Group (MPEG). It was the product of a partnership effort known as the Joint Video Team (JVT). The ITU-T H.264 standard and the ISO/IEC MPEG-4 AVC standard (formally, ISO/IEC 14496-10 - MPEG-4 Part 10, Advanced Video Coding) are jointly maintained so that they have identical technical content. H.264 is used in such applications as players for Blu-ray Discs, videos from YouTube and the iTunes Store, web software such as the Adobe Flash Player and Microsoft Silverlight, broadcast services for DVB and SBTVD, direct-broadcast satellite television services, cable television services, and real-time videoconferencing.
The block diagram of the H.264 codec is shown in Figure 1. The Encoder [Figure 1(a)] includes two dataflow paths: a "forward" path (left to right, shown in blue) and a "reconstruction" path (right to left, shown in magenta). The dataflow path in the Decoder [Figure 1(b)] is shown from right to left to illustrate the similarities between Encoder and Decoder.
Figure 1: Block diagram of H.264 Coding
Encoder: Forward Path
An input frame Fn is presented for encoding. The frame is processed in units of a macroblock (corresponding to 16x16 pixels in the original image). Each macroblock is encoded in intra or inter mode. In either case, a prediction macroblock P is formed based on a reconstructed frame.
Intra coding provides access points to the coded sequence where decoding can begin correctly. It uses various spatial prediction modes to reduce spatial redundancy in the source signal within a single picture. In intra mode, P is formed from samples in the current frame Fn that have previously been encoded, decoded and reconstructed (uF'n in the Figures; note that the unfiltered samples are used to form P).
Inter coding (predictive or bi-predictive) is more efficient: each block of sample values is predicted from one or more reference frames using motion vectors. In the Figures, the reference frame is shown as the previously encoded frame F'n-1; however, the prediction for each macroblock may be formed from one or two past or future frames (in time order) that have already been encoded and reconstructed.
The prediction P is subtracted from the current macroblock to produce a residual or difference macroblock Dn. The residual is transformed with a block transform, which removes spatial correlation within the block, and then quantized to give X, a set of quantized transform coefficients. These coefficients are re-ordered and encoded using an entropy code such as context-adaptive variable-length coding (CAVLC) or context-adaptive binary arithmetic coding (CABAC). The entropy-coded coefficients, together with the side information required to decode the macroblock (such as the macroblock prediction mode, the quantizer step size and the motion vector information describing how the macroblock was motion-compensated), form the compressed bitstream.
This is passed to a Network Abstraction Layer (NAL) for transmission or storage.
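The residual, transform, quantization and re-ordering steps of the forward path can be sketched for a single 4x4 block as follows. This is a simplified illustration, not the normative procedure. The function names are hypothetical. The quantizer is a plain uniform divide by a step size, whereas the standard folds transform scaling into an integer quantizer. Prediction and entropy coding are left out. The matrix CF is the 4x4 integer core transform commonly given for H.264, and ZIGZAG_4X4 is the usual frame-mode zig-zag order.

```python
# Simplified forward path for one 4x4 block:
# residual -> integer transform -> uniform quantization -> zig-zag re-ordering.

# 4x4 forward core transform matrix (an integer approximation of the DCT).
CF = [
    [1, 1, 1, 1],
    [2, 1, -1, -2],
    [1, -1, -1, 1],
    [1, -2, 2, -1],
]

# Zig-zag scan order for a 4x4 block, as (row, column) pairs.
ZIGZAG_4X4 = [
    (0, 0), (0, 1), (1, 0), (2, 0),
    (1, 1), (0, 2), (0, 3), (1, 2),
    (2, 1), (3, 0), (3, 1), (2, 2),
    (1, 3), (2, 3), (3, 2), (3, 3),
]


def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(row) for row in zip(*m)]


def residual(current, prediction):
    """Dn = current block minus prediction P, sample by sample."""
    return [[c - p for c, p in zip(cur_row, pred_row)]
            for cur_row, pred_row in zip(current, prediction)]


def forward_transform(block):
    """Apply the core transform: CF * block * CF^T."""
    return matmul(matmul(CF, block), transpose(CF))


def quantize(coeffs, step):
    """Uniform quantizer, truncating toward zero (an assumed simplification)."""
    return [[int(c / step) for c in row] for row in coeffs]


def zigzag(block):
    """Re-order a 4x4 block into a 16-element list, low frequencies first."""
    return [block[r][c] for r, c in ZIGZAG_4X4]


def encode_block(current, prediction, step):
    """Return the re-ordered quantized coefficients X for one 4x4 block."""
    return zigzag(quantize(forward_transform(residual(current, prediction)), step))


if __name__ == "__main__":
    current = [[10] * 4 for _ in range(4)]     # flat block of sample value 10
    prediction = [[7] * 4 for _ in range(4)]   # flat prediction of value 7
    print(encode_block(current, prediction, step=8))
```

With a flat residual, all of the energy lands in the first (DC) coefficient and the remaining fifteen values are zero. Long runs of zeros like this are what the entropy coder exploits after re-ordering.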
Encoder: Reconstruction Path
Figure 2: Typical Structure of an H.264/AVC Encoder