H.264 Video Codec - Inter Prediction
Inter prediction reduces temporal redundancy with the help of motion estimation and compensation. In H.264, the current picture can be partitioned into macroblocks or sub-macroblocks. In addition to the intra macroblock coding types, various predictive or motion-compensated coding types are allowed in P slices. Each P-type macroblock is partitioned into fixed-size blocks used for motion description. Partitionings with luma block sizes of 16 × 16, 16 × 8, 8 × 16, and 8 × 8 samples are supported by the syntax. When the macroblock is partitioned into four so-called sub-macroblocks, each of size 8 × 8 luma samples, one additional syntax element is transmitted for each 8 × 8 sub-macroblock. This syntax element specifies whether the corresponding sub-macroblock is coded using motion-compensated prediction with luma block sizes of 8 × 8, 8 × 4, 4 × 8, or 4 × 4 samples; a macroblock of 16 × 16 luma samples can therefore be partitioned into block sizes down to 4 × 4. Figure 1 illustrates the partitioning. The prediction signal for each predictive-coded M × N luma block is obtained by displacing a corresponding area of a previously decoded reference picture, where the displacement is specified by a translational motion vector and a picture reference index. Thus, if the macroblock is coded using four 8 × 8 sub-macroblocks, and each sub-macroblock is coded using four 4 × 4 luma blocks, a maximum of 16 motion vectors may be transmitted for a single P-slice macroblock.
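The relationship between the chosen partitioning and the number of transmitted motion vectors can be sketched as follows. This is a minimal illustration, not H.264 syntax; the mode names and function are invented for clarity.

```python
# Number of luma blocks (hence motion vectors) implied by each macroblock
# partitioning. Mode names are illustrative, not actual H.264 syntax elements.
MB_PARTITIONS = {          # (width, height) of each luma block in samples
    "16x16": (16, 16),
    "16x8":  (16, 8),
    "8x16":  (8, 16),
    "8x8":   (8, 8),       # each 8x8 sub-macroblock may split further
}
SUB_MB_PARTITIONS = {
    "8x8": (8, 8),
    "8x4": (8, 4),
    "4x8": (4, 8),
    "4x4": (4, 4),
}

def motion_vectors_per_mb(mb_mode, sub_modes=None):
    """Count the motion vectors transmitted for one P macroblock."""
    w, h = MB_PARTITIONS[mb_mode]
    n_parts = (16 // w) * (16 // h)
    if mb_mode != "8x8":
        return n_parts
    # In 8x8 mode, each of the four sub-macroblocks carries its own split.
    total = 0
    for m in (sub_modes or ["8x8"] * 4):
        sw, sh = SUB_MB_PARTITIONS[m]
        total += (8 // sw) * (8 // sh)
    return total

print(motion_vectors_per_mb("16x8"))              # 2
print(motion_vectors_per_mb("8x8", ["4x4"] * 4))  # 16 (the maximum)
```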
Figure 1: Partitioning of a Macroblock and Sub-macroblock for Inter Prediction
The smaller block sizes require a larger number of bits to signal the motion vectors and the partition type, but the motion-compensated residual data can be reduced. Therefore, the choice of partition size depends on the input video characteristics. In general, a large partition size is appropriate for homogeneous areas of the frame, and a small partition size may be beneficial for detailed areas. In earlier standards such as MPEG-4 or H.263, only block sizes of 16 × 16 and 8 × 8 are supported. The H.264 inter prediction process can form segmentations for motion representation as small as 4 × 4 luma samples, using motion vectors with an accuracy of one quarter of the luma sample distance.
A displacement vector is estimated and transmitted for each block; it refers to the corresponding position of the block's image signal in an already transmitted reference image. In former MPEG standards this reference image is the most recent preceding image. In H.264/AVC it is possible to refer to several preceding images. This technique is denoted as motion-compensated prediction with multiple reference frames. For multi-frame motion-compensated prediction, the encoder stores decoded reference pictures in a multi-picture buffer. The decoder replicates the multi-picture buffer of the encoder according to the reference picture buffering type and memory management control operations (MMCO) specified in the bitstream. Unless the size of the multi-picture buffer is set to one picture, the index at which the reference picture is located inside the multi-picture buffer has to be signaled. For this purpose, an additional picture reference index parameter is transmitted together with the motion vector of each 16 × 16, 16 × 8, or 8 × 16 macroblock partition or 8 × 8 sub-macroblock.
The inter prediction process for a sample block can also involve selecting the reference pictures from a number of previously decoded pictures stored by the codec. Reference pictures for motion compensation are stored in the picture buffer. With respect to the current picture, pictures before and after it in display order are stored in the picture buffer, classified as 'short-term' and 'long-term' reference pictures. Long-term reference pictures are introduced to extend the motion search range by using multiple decoded pictures instead of just the one most recent short-term picture. Memory management is required to mark stored pictures as 'unused' and to decide which pictures to delete from the buffer so that the available memory is used efficiently.
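The buffer behaviour described above can be modelled with a toy sketch: short-term pictures live in a sliding window, a picture can be promoted to long-term status so it survives the window, and the position in the combined list is what the reference index would signal. The class and its parameters are illustrative assumptions, not the normative H.264 process.

```python
from collections import deque

class ReferenceBuffer:
    """Toy model of an H.264-style decoded picture buffer (illustrative only).

    Short-term pictures are kept in a sliding window; a picture can be
    marked long-term so it survives until explicitly removed, loosely
    mimicking the MMCO 'mark long-term' / 'mark unused' operations.
    """
    def __init__(self, max_short_term=4):
        self.short_term = deque()   # most recently decoded first
        self.long_term = {}         # long_term_idx -> picture id
        self.max_short_term = max_short_term

    def store(self, pic_id):
        self.short_term.appendleft(pic_id)
        if len(self.short_term) > self.max_short_term:
            self.short_term.pop()   # sliding-window removal of the oldest

    def mark_long_term(self, pic_id, lt_idx):
        self.short_term.remove(pic_id)
        self.long_term[lt_idx] = pic_id

    def reference_list(self):
        # Short-term refs (newest first) followed by long-term refs:
        # the position in this list plays the role of the reference index.
        return list(self.short_term) + [self.long_term[k]
                                        for k in sorted(self.long_term)]

buf = ReferenceBuffer(max_short_term=3)
for pic in range(5):
    buf.store(pic)
buf.mark_long_term(2, lt_idx=0)     # keep picture 2 beyond the window
print(buf.reference_list())         # [4, 3, 2]
```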
Sub-pixel Motion Vectors:
The motion vector precision is at the granularity of one quarter of the distance between luma samples. If the motion vector points to an integer-sample position, the prediction signal is formed by the corresponding samples of the reference picture; otherwise, the prediction signal is obtained by interpolation between integer-sample positions. Sub-pel motion compensation can provide significantly better compression performance than integer-pel compensation, at the expense of increased complexity, and quarter-pel accuracy outperforms half-pel accuracy. Sub-pel accuracy is particularly beneficial at high bitrates and high video resolutions. In the luma component, the sub-pel samples at half-pel positions are generated first, interpolated from neighboring integer-pel samples using a one-dimensional 6-tap FIR filter with weights (1, -5, 20, 20, -5, 1)/32 applied horizontally and/or vertically; this filter was designed to reduce the aliasing components that deteriorate interpolation and motion-compensated prediction. Once all the half-pel samples are available, each quarter-pel sample is produced using bilinear interpolation (horizontally, vertically, or diagonally) between neighboring half- or integer-pel samples. For 4:2:0 video source sampling, 1/8-pel samples are required in the chroma components (corresponding to 1/4-pel accuracy in the luma); these are obtained by linear interpolation between integer-pel chroma samples. Sub-pel motion vectors are encoded differentially with respect to predicted values formed from nearby encoded motion vectors. After interpolation, block-based motion compensation is applied. As noted, however, a variety of block sizes can be considered, and a motion estimation scheme that optimizes the trade-off between the number of bits necessary to represent the video and the fidelity of the result is desirable.
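The luma interpolation above can be sketched in one dimension. The 6-tap weights are those of the standard; the function names, the scalar (non-2D) formulation, and the assumption that the caller provides enough border samples are simplifications for illustration.

```python
# Half-pel luma interpolation with the H.264 6-tap filter
# (1, -5, 20, 20, -5, 1) / 32, shown here in one dimension only.
def half_pel(samples, i):
    """Interpolate the half-pel value between samples[i] and samples[i+1].

    `samples` must provide three integer-pel neighbours on each side
    (i >= 2 and i + 3 < len(samples)); the real codec guarantees this by
    padding the picture borders.
    """
    acc = (samples[i - 2] - 5 * samples[i - 1]
           + 20 * samples[i] + 20 * samples[i + 1]
           - 5 * samples[i + 2] + samples[i + 3])
    return min(255, max(0, (acc + 16) >> 5))   # round, then clip to 8 bits

def quarter_pel(samples, i):
    """Quarter-pel value between samples[i] and the adjacent half-pel
    position, obtained by bilinear averaging as in the standard."""
    h = half_pel(samples, i)
    return (samples[i] + h + 1) >> 1

row = [10, 20, 30, 40, 50, 60, 70, 80]
print(half_pel(row, 3))      # 45: on a linear ramp the filter is exact
print(quarter_pel(row, 3))   # 43: bilinear average of 40 and 45, rounded up
```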
Figure 2: Example of Integer and Sub-Pel Prediction
In addition to the macroblock modes described above, a P-slice macroblock can also be coded in the so-called skip mode. If a macroblock has motion characteristics that allow its motion to be effectively predicted from the motion of neighboring macroblocks, and it contains no non-zero quantized transform coefficients, then it is flagged as skipped. For this mode, neither a quantized prediction error signal nor a motion vector or reference index parameter is transmitted. The reconstructed signal is computed in a manner similar to the prediction of a macroblock with partition size 16 × 16 and a fixed reference picture index equal to 0. In contrast to previous video coding standards, the motion vector used for reconstructing a skipped macroblock is inferred from the motion properties of neighboring macroblocks rather than being inferred as zero (i.e., no motion).
In addition to the use of motion compensation and reference picture selection for prediction of the current picture content, weighted prediction can be used in P slices. When weighted prediction is used, customized weights can be applied as a scaling and offset to the motion-compensated prediction value prior to its use as a predictor for the current picture samples. Weighted prediction can be especially effective for such phenomena as "fade-in" and "fade-out" scenes.
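The scale-and-offset operation of weighted prediction can be shown per sample. H.264 signals a weight, a log2 weight denominator, and an offset per reference picture; the function below is a hedged sketch of that arithmetic, with illustrative parameter names and a fade example chosen for clarity.

```python
# Weighted prediction: scale and offset the motion-compensated predictor,
# which is useful for fades. Parameter names here are illustrative.
def weighted_pred(pred, w, log2_denom, offset):
    """Scale/offset one predicted sample and clip to the 8-bit range."""
    rounding = 1 << (log2_denom - 1) if log2_denom > 0 else 0
    val = ((pred * w + rounding) >> log2_denom) + offset
    return min(255, max(0, val))

# A fade-out: the reference is roughly twice as bright as the current
# frame, so a weight of 32 with denominator 2^6 halves the predictor.
print(weighted_pred(200, w=32, log2_denom=6, offset=0))   # 100
```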
Motion Vector Prediction:
After the temporal prediction, the steps of transform, quantization, scanning, and entropy coding of the residual data (the original minus the predicted pixel values) are conceptually the same as those for I slices. The motion vectors and reference picture indexes representing the estimated motion are also compressed, because encoding a motion vector for each partition can take a significant number of bits, especially if small partition sizes are chosen. Motion vectors for neighbouring partitions are often highly correlated, so each motion vector is predicted from vectors of nearby, previously coded partitions. A predicted vector, MVp, is formed based on previously calculated motion vectors, and MVD, the difference between the current vector and the predicted vector, is encoded and transmitted. The method of forming the prediction MVp depends on the motion compensation partition size and on the availability of nearby vectors. The "basic" predictor is the median of the motion vectors of the macroblock partitions or sub-partitions immediately above, diagonally above and to the right, and immediately to the left of the current partition or sub-partition. The predictor is modified if (a) 16 × 8 or 8 × 16 partitions are chosen and/or (b) some of the neighbouring partitions are not available as predictors. If the current macroblock is skipped (not transmitted), a predicted vector is generated as if the macroblock were coded in 16 × 16 partition mode. At the decoder, the predicted vector MVp is formed in the same way and added to the decoded vector difference MVD. In the case of a skipped macroblock, there is no decoded vector difference, so a motion-compensated macroblock is produced directly according to MVp.
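The "basic" median predictor and the MVD encode/decode round trip can be sketched directly. This assumes all three neighbours are available (the special cases the text mentions are omitted), and the function names are illustrative.

```python
def median_mv_predictor(mv_a, mv_b, mv_c):
    """Component-wise median of the left (A), above (B), and above-right (C)
    neighbouring motion vectors: the 'basic' H.264 predictor. Assumes all
    three neighbours are available."""
    def median3(x, y, z):
        return sorted((x, y, z))[1]
    return (median3(mv_a[0], mv_b[0], mv_c[0]),
            median3(mv_a[1], mv_b[1], mv_c[1]))

def encode_mv(mv, mv_a, mv_b, mv_c):
    """Return the motion vector difference (MVD) actually transmitted."""
    px, py = median_mv_predictor(mv_a, mv_b, mv_c)
    return (mv[0] - px, mv[1] - py)

def decode_mv(mvd, mv_a, mv_b, mv_c):
    """Reconstruct the motion vector by adding MVD to the predictor."""
    px, py = median_mv_predictor(mv_a, mv_b, mv_c)
    return (mvd[0] + px, mvd[1] + py)

# Neighbours moving consistently to the right: the MVD is small,
# which is exactly what makes it cheap to entropy-code.
a, b, c = (4, 0), (5, 1), (4, -1)
mvd = encode_mv((5, 0), a, b, c)
print(mvd)                                  # (1, 0): predictor is (4, 0)
assert decode_mv(mvd, a, b, c) == (5, 0)    # lossless round trip
```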
Figure 3: Multi-frame Motion Compensation. In addition to the motion vector, a picture reference parameter is also transmitted.