Version: v 4.1 96/06/02
frame | Field predicted |
    1. A low-cost encoder which possesses only frame motion estimation may use dct_type to decorrelate the prediction error of a prediction which is inherently field by characteristic.
    2. An intelligent encoder realizes that it is more bit efficient to signal frame prediction with field dct_type for the prediction error than it is to signal a field prediction.
field | Field predicted | A typical scenario. A field prediction tends to form a field-correlated prediction error.
frame | Frame predicted | A typical scenario. A frame prediction tends to form a frame-correlated prediction error.
field | Frame predicted | Makes little sense. If the encoder went through the trouble of finding a field prediction in the first place, why select frame organization for the prediction error?

Prediction modes now include field, frame, Dual Prime, and 16x8 MC. The combinations for Main Profile and Simple Profile are shown below.

Frame pictures

motion_type | motion vectors per MB | fundamental prediction block size (after half-pel) | interpretation
Frame | 1 | 16x16 | Same as MPEG-1, with possibly different treatment of prediction error via dct_type.
Field | 2 | 16x8 | Two independently coded predictions are made: one for the 8 lines which correspond to the top field, another for the 8 bottom field lines.
Dual Prime | 1 | 16x8 | Two independently coded predictions are made: one for the 8 lines which correspond to the top field, another for the 8 bottom field lines. Uses averaging of two 16x8 prediction blocks from fields of opposite parity to form a prediction for the top and bottom 8 lines. A second vector is derived from the first vector coded in the bitstream.
Field pictures

motion_type | motion vectors per MB | fundamental prediction block size (after half-pel) | interpretation
Field | 1 | 16x16 | Same as MPEG-1, with possibly different treatment of prediction error via dct_type.
16x8 | 2 | 16x8 | Two independently coded predictions are made: one for the 8 lines which correspond to the top field, another for the 8 bottom field lines.
Dual Prime | 1 | 16x16 | A single prediction is constructed from the average of two 16x16 predictions taken from fields of opposite parity.

concealment motion vectors can be transmitted in the headers of intra macroblocks to help error recovery. When the macroblock data that the concealment motion vectors are intended for becomes corrupt, these vectors can be used to specify a concealment 16x16 area to be extracted from the previous picture. These vectors do not affect the normal decoding process, except for motion vector predictions.

Additional chroma_format for 4:2:2 and 4:4:4 pictures. Like MPEG-1, Main Profile syntax is strictly limited to the 4:2:0 format. However, the 4:2:2 format is the basis of the 4:2:2 Profile (aka Studio Profile). In 4:2:2 mode, all syntax essentially remains the same except where matters of block count are concerned. A coded_block_pattern extension was added to handle signaling of the extra two prediction error blocks. The 4:4:4 format is currently undefined in any Profile.

chroma_format | multiplex order within Macroblock | Application
4:2:0 (6 blocks) | YYYYCbCr | mainstream television, consumer entertainment
4:2:2 (8 blocks) | YYYYCbCrCbCr | studio production environments, professional editing equipment, distribution and servers
4:4:4 (12 blocks) | YYYYCbCrCbCrCbCrCbCr | computer graphics

Non-linear macroblock quantization was introduced in MPEG-2 to increase the precision of quantization at high bit rates, while increasing the dynamic range for low bit rate use where a larger step size is needed.
The quantization_scale_code may be selected between a linear (MPEG-1 style) or non-linear scale on a picture (frame or field) basis. The new non-linear range corresponds to a dynamic range of 0.5 to 56 with respect to the linear (MPEG-1 style) range of 1 to 31.

Block:

alternate scan introduced a new run-length entropy scanning pattern that is generally more efficient for the statistics of interlaced video signals. Zig-zag scan is the appropriate choice for progressive pictures.

intra_dc_precision: the MPEG-1 DC value is always quantized to a precision of 8 bits. MPEG-2 introduced 9, 10, and 11 bit precision, set on a picture basis, to increase the accuracy of the DC component, which by its very nature has the most significant contribution towards picture quality. This is particularly useful at high bit rates to reduce posterization. Main and Simple Profiles are limited to 8, 9, or 10 bits of precision. The 4:2:2 High Profile, which is geared towards higher bitrate applications (up to 50 Mbits/sec), permits all values (up to 11 bits).

separate quantization matrices for Y and C: luminance (Y) and chrominance (Cb,Cr) share a common intra and non-intra DCT coefficient quantization 8x8 matrix in MPEG-1 and MPEG-2 Main and Simple Profiles. The 4:2:2 Profile permits separate quantization matrices to be downloaded for the luminance and chrominance blocks. Cb and Cr still share a common matrix.

intra_vlc_format: one of two tables may now be selected at the picture layer for variable length codes (VLCs) of AC run-length symbols in Intra blocks. The first table is identical to that specified for MPEG-1 (dct_coef_next). The newer second table is more suited to the statistics of Intra coded blocks, especially in I-frames. The best illustration of the difference between Table 0 and Table 1 is the length of the symbol which represents End of Block (EOB). In Table 0, EOB is 2 bits. In Table 1, it is 4 bits.
The implication is that the EOB symbol is 2^-n probable within the block (n being the EOB code length in bits), or from an alternative perspective, there are an average of 3 to 4 non-zero AC coefficients in Non-intra blocks, and 9 to 16 coefficients in Intra blocks. The VLC tree of Table 1 was intended to be a subset of Table 0, to aid hardware implementations. Both tables have 113 VLC entries (or events).

escape: When no entry in the VLC table exists for an AC Run-Level symbol, an escape code can be used to represent the symbol. Since there are only 63 positions within an 8x8 block following the first coefficient, and the dynamic range of the quantized DCT coefficients is [-2048,+2047], there are (63*2047), or 128,961 possible combinations of Run and Level (the sign bit of the Level follows the VLC). Only the 113 most common Run-Level symbols are represented in Table 0 or Table 1. The length of the escape symbol (which is always 6 bits) plus the Run and Level values in MPEG-1 could be 20 or 28 bits in length. The 20 bit escape describes levels in the range [-127,+127]. The 28 bit double escape has a range of [-255,+255]. MPEG-2 increased the span to the full dynamic range of quantized IDCT coefficients, [-2047,+2047], and simplified the escape mechanism with a single representation for this event. The total length of the MPEG-2 escape codeword is 24 bits (6 bit VLC followed by a 6-bit Run value and a 12 bit Level value). It was an assumption by MPEG-1 designers that no quantized DCT coefficient would need greater representation than 10 bits [-255,+255]. Note: the MPEG-2 escape mechanism does not permit the value -2048 to be represented.

mismatch control: The arithmetic results of all stages are defined exactly by the normative MPEG decoding process, with the single exception of the Inverse Discrete Cosine Transform (IDCT). This stage can be implemented with a wide variety of IDCT implementations. Some are more suited for software, others for programmable hardware, and others still for hardwired hardware designs.
The IDCT reference formula in the MPEG specification would, if directly implemented, consume at least 1024 multiply and 1024 addition operations for every block. A wide variety of fast algorithms exist which can reduce the count to less than 200 multiplies and 500 adds per block by exploiting the innate symmetry of the cosine basis functions. A typical fast IDCT algorithm would be dwarfed by the cost of the other decoder stages combined. Each fast IDCT algorithm has different quantization error statistics (a fingerprint), although the differences are subtle when the precision of the arithmetic is, for example, at least 16 bits for the transform coefficients and 24 bits for intermediate dot product values. Therefore, MPEG cannot standardize a single fast IDCT algorithm. The accuracy can be defined only statistically. The IEEE 1180 recommendation (December 1990) defines the error tolerance between an ideal direct-matrix floating point implementation (a direct implementation of the MPEG reference formula) and the test IDCT.

Mismatch control attempts to reduce the drift between different IDCT algorithms by eliminating bit patterns which statistically have the greatest contribution towards mismatches between the variety of methods. The reconstructions of two decoders will begin to diverge over time since their respective IDCT designs will reconstruct occasional, slightly different 8x8 blocks. MPEG-1's mismatch control method is known canonically as Oddification, since it forces all quantized DCT coefficients to odd values. It is a slight improvement over its predecessor in H.261. MPEG-2 adopted a different method called, again canonically, LSB Toggling, further reducing the likelihood of mismatch. Toggling affects only the Least Significant Bit (LSB) of the 63rd AC DCT coefficient (the highest frequency in the DCT matrix).
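As a rough illustration of the two mismatch control methods just described, the following Python sketch applies LSB Toggling to an 8x8 block and oddification to a single coefficient value. The function names and the list-of-lists block layout are assumptions made for this example. The toggling condition (toggle only when the sum of all 64 coefficients is even) and the oddification rule (move an even, non-zero value one step toward zero) are stated here as assumptions about the exact arithmetic, not as a quotation of the normative text.

```python
def mpeg2_lsb_toggle(block):
    """Return a copy of an 8x8 block of integer DCT coefficients with
    MPEG-2 style mismatch control applied.

    Assumption: the LSB of the last coefficient (row 7, column 7) is
    toggled only when the sum of all 64 coefficients is even.
    """
    out = [row[:] for row in block]
    total = sum(sum(row) for row in out)
    if total % 2 == 0:
        out[7][7] ^= 1  # flip the least significant bit
    return out


def mpeg1_oddify(value):
    """Force a single coefficient value to be odd (zero is left alone).

    Assumption: even values are moved one step toward zero.
    """
    if value != 0 and value % 2 == 0:
        return value - 1 if value > 0 else value + 1
    return value


if __name__ == "__main__":
    zero_block = [[0] * 8 for _ in range(8)]
    toggled = mpeg2_lsb_toggle(zero_block)
    print("last coefficient after toggling:", toggled[7][7])
    print("oddified values:", [mpeg1_oddify(v) for v in (4, -4, 5, 0)])
```

Note that toggling touches at most one coefficient per block, whereas oddification may alter every even-valued coefficient.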
Another significant difference between MPEG-1 and MPEG-2 mismatch control is that, in MPEG-1, oddification is performed on the quantized DCT coefficients, whereas in MPEG-2, toggling is performed on the DCT coefficients after inverse quantization. MPEG-1's mismatch control method favors programmable implementation since a block of DCT coefficients when quantized.

Sample:

The two chrominance pictures (Cb, Cr) possess only half the resolution in both the horizontal and vertical direction as the luminance picture (Y). This is the definition of the 4:2:0 chroma format. Most television displays require that at least the vertical chrominance resolution matches the luminance (4:2:2 chroma format). Computer displays may further demand that the horizontal resolution also be equivalent (4:4:4 chroma format). There are a variety of filtering methods for interpolating the chrominance samples to match the sample density of luminance. However, the official location or center of the lower resolution chrominance sample should influence the filter design (relative tap weights), otherwise the chrominance plane can appear to be shifted by a fractional sample in the wrong direction. The subsampled MPEG-1 chroma position has a center exactly half way between the four nearest neighboring luminance samples. To be consistent with the subsampled chrominance positions of 4:2:2 television signals, MPEG-2 moved the center of the chrominance samples to be co-located horizontally with the luminance samples.

Misc.:

copyright_id extension can identify whether a sequence or subset of frames within the sequence is copyrighted, and provides a unique 64-bit copyright_id_number registered with the ISO/IEC.

Syntax can now signal frame sizes as large as 16383 x 16383. Since MPEG-1 employed a meager 12 bits to describe horizontal_size and vertical_size, the range was limited to 4095 x 4095. However, MPEG's Levels prescribe important interoperability points for practical decoders.
Constrained Parameters MPEG-1 and MPEG-2 Low Level limit the sample rate to 352x240x30 Hz. MPEG-2's Main Level defines the limit at 720x480x30 Hz. Of course, this is simply the restriction on the product of horizontal_size, vertical_size, and frame_rate. The Level also places separate restrictions on each of these three variables.

Reflecting the more television oriented manner of MPEG-2, the optional sequence_display_extension() header can specify the chromaticity of the source video signal as it was prior to representation by MPEG syntax. This information includes: whether the original video_format was composite or component, the opto-electronic transfer_characteristics, and RGB->YCbCr matrix_coefficients. The picture_display_extension() provides more localized source composite video characteristics on a frame-by-frame basis (not field-by-field), with the syntax elements: field_sequence, sub_carrier_phase, and burst_amplitude. This information can be used by the display's post-processing stage to reproduce a more refined display sequence.

Optional pan & scan syntax was introduced which tells a decoder on a frame-by-frame basis how to, for example, window a 4:3 image within the wider 16:9 aspect ratio of the coded frame. The vertical pan offset can be specified to within 1/16th pixel accuracy.

[Figure: mpeg2pan.gif]

How does MPEG syntax facilitate parallelism?

For MPEG-1, slices may consist of an arbitrary number of macroblocks. They can be independently decoded once the picture header side information is known. For parallelism below the slice level, the coded bitstream must first be mapped into fixed-length elements. Further, since macroblocks have coding dependencies on previous macroblocks within the same slice, the data hierarchy must be pre-processed down to the layer of DC DCT coefficients. After this, blocks may be independently inverse transformed and quantized, temporally predicted, and reconstructed to buffer memory.
Parallelism is usually more of a concern for encoders. In many encoders today, block matching (motion estimation) and some rate control stages (such as activity and/or complexity measures) are processed for macroblocks independently. Finally, with the exception that all macroblock rows in Main Profile MPEG-2 bitstreams must contain at least one slice, an encoder has the freedom to choose the slice structure.

What is the MPEG color space and sample precision?

MPEG strictly specifies the YCbCr color space, not YUV or YIQ or YPbPr or YDrDb or any of the many other fine varieties of color difference spaces. Regardless of any bitstream parameters, MPEG-1 and MPEG-2 Video Main Profile specify the 4:2:0 chroma_format, where the color difference channels (Cb, Cr) have half the "resolution" or sample grid density in both the horizontal and vertical direction with respect to luminance. MPEG-2 High Profile includes an option for 4:2:2 chroma_format, as does the MPEG 4:2:2 Profile (a.k.a. Studio Profile) naturally. Applications for the 4:2:2 format can be found in professional broadcasting, editing, and contribution-quality distribution environments.

The drawback of the 4:2:2 format is simply that it increases the size of the macroblock from six 8x8 blocks (4:2:0) to eight, while increasing the frame buffer size and decoding bandwidth by the same amount (33 %). This increase places the buffering memories well past the magic 16-Mbit limit for semiconductor DRAM devices, assuming the pictures are stored with a maximum of 414,720 pixels (720 pixels/line x 576 lines/frame). The maximum allowable pixel resolution could be reduced by 1/3 to compensate (e.g. 544 x 576). However, if hardware decoders operate on a macroblock basis in the pipeline, on-chip static memories (SRAM) will increase by 1/3. The benefits offered by 1/3 more pixels generally outweigh full vertical chrominance resolution.
Other arguments favoring 4:2:0 over 4:2:2 include:

Vertical decimation increases compression efficiency by reducing syntax overhead posed in an 8 block (4:2:2) macroblock structure.

You're compressing the hell out of the video signal, so what possible difference can the 0:0:2 chrominance high-pass make?

Is 4:2:0 the same as 4:1:1?

No, no, definitely no. The following table illustrates the nuances between the different chroma formats for a frame with pixel dimensions of 720 pixels/line x 480 lines/frame.

CCIR 601 (60 Hz) image

chroma    Y             Y             Cb, Cr        Cb, Cr        horizontal     vertical
format    pixels/line   lines/frame   pixels/line   lines/frame   subsampling    subsampling
4:4:4     720           480           720           480           none           none
4:2:2     720           480           360           480           2:1            none
4:2:0     720           480           360           240           2:1            2:1
4:1:1     720           480           180           480           4:1            none
4:1:0     720           480           180           120           4:1            4:1

3:2:2, 3:1:1, and 3:1:0 are less common variations, but have been documented. As shocking as it may seem, the 4:1:0 ratio was used by Intel's DVI for several years. The 130 microsecond gap between successive 4:2:0 lines in progressive frames, and the 260 microsecond gap in interlaced frames, can introduce some difficult vertical frequencies, but most can be alleviated through pre-processing.

What is the sample precision of MPEG? How many colors can MPEG represent?

By definition, MPEG samples have no more and no less than 8 bits of uniform sample precision (256 quantization levels). For luminance (which is unsigned) data, black corresponds to level 0, white is level 255. However, in CCIR Recommendation 601 chromaticity, luminance (Y) levels 0 through 15 and 236 through 255 are reserved for blanking signal excursions. MPEG currently has no such clipped excursion restrictions, although a decoder might take care to ensure active samples do not exceed these limits. With three color components per pixel, the total combination is roughly 16.8 million colors (i.e. 24 bits).
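The chroma format table and the color count above are simple arithmetic, which the following Python sketch reproduces. The dictionary name, the function name, and the (horizontal, vertical) tuple layout are assumptions made for illustration; the subsampling factors themselves are taken from the table.

```python
# (horizontal, vertical) chroma subsampling factors, taken from the table above
SUBSAMPLING = {
    "4:4:4": (1, 1),
    "4:2:2": (2, 1),
    "4:2:0": (2, 2),
    "4:1:1": (4, 1),
    "4:1:0": (4, 4),
}


def chroma_dimensions(chroma_format, luma_width, luma_height):
    """Return (pixels/line, lines/frame) of each chroma plane (Cb or Cr)."""
    horizontal, vertical = SUBSAMPLING[chroma_format]
    return luma_width // horizontal, luma_height // vertical


def color_count(bits_per_sample=8, components=3):
    """Number of distinct sample combinations per pixel."""
    return 2 ** (bits_per_sample * components)


if __name__ == "__main__":
    for name in SUBSAMPLING:
        print(name, chroma_dimensions(name, 720, 480))
    print("colors:", color_count())
```

Comparing the 4:2:0 and 4:1:1 rows shows why the two formats are not the same: both carry the same number of chroma samples per frame, but they distribute them differently between the horizontal and vertical directions.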
How are the subsampled chroma samples sited?

It is moderately important to properly co-site chroma samples, otherwise a sort of chroma shifting effect (exhibited as a halo) may result when the reconstructed video is displayed. In MPEG-1 video, the chroma samples are exactly centered between the 4 luminance samples (Fig. 1). To maintain compatibility with the CCIR 601 horizontal chroma locations and simplify implementation (eliminate the need for a phase shift), MPEG-2 chroma samples are arranged as per Fig. 2.

    Y   Y   Y   Y        Y   Y   Y   Y        YC  Y   YC  Y
      C       C          C       C
    Y   Y   Y   Y        Y   Y   Y   Y        YC  Y   YC  Y

    Y   Y   Y   Y        Y   Y   Y   Y        YC  Y   YC  Y
      C       C          C       C
    Y   Y   Y   Y        Y   Y   Y   Y        YC  Y   YC  Y

    Fig.1 MPEG-1         Fig.2 MPEG-2         Fig.3 MPEG-2 and
    4:2:0 organization   4:2:0 organization   CCIR Rec. 601
                                              4:2:2 organization

How do you tell an MPEG-1 bitstream from an MPEG-2 bitstream?

A. All MPEG-2 bitstreams must contain specific extension headers that immediately follow MPEG-1 headers. At the highest layer, for example, the MPEG-1 style sequence_header() is followed by sequence_extension(). Some extension headers are specific to MPEG-2 profiles. For example, sequence_scalable_extension() is not allowed in Main Profile bitstreams. A simple program need only scan the coded bitstream for byte-aligned start codes to determine whether the stream is MPEG-1 or MPEG-2.

What are start codes?

These 32-bit byte-aligned codes provide a mechanism for cheaply searching coded bitstreams for the commencement of various layers of video without having to actually parse variable-length codes or perform any decoder arithmetic. Start codes also provide a mechanism for resynchronization in the presence of bit errors. A start code may be preceded by an arbitrary number of zero bytes. The zero bytes can be used to guarantee that a start code occurs within a certain location, or by rate control to increase the bitrate of a coded bitstream.

Coded block pattern: (CBP, not to be confused with Constrained Parameters!)
When the frame prediction is particularly good, the displaced frame difference (DFD, or temporal macroblock prediction error) tends to be small, often with entire block energy being reduced to zero after quantization. This usually happens only at low bit rates. Coded block patterns prevent the need for transmitting EOB symbols in those zero coded blocks. Coded block patterns are transmitted in the macroblock header only if the macroblock_type flag indicates so.

Why is the DC value always divided by 8?

Clarification point: The DC value of Intra coded blocks is quantized by a constant stepsize of 8 only in MPEG-1, reducing the 11-bit dynamic range of the IDCT DC coefficient to 8 bits of accuracy. MPEG-2 allows for DC precision of 8, 9, 10, or 11 bits. The quantization stepsize is fixed for the duration of the picture, set by the intra_dc_precision flag in the picture_extension_header().

Why is there a special VLC for DCT_coefficient_first?

Since the coded_block_pattern in NON-INTRA macroblocks signals every possible combination of all-zero valued and non-zero blocks, the dct_coef_first mechanism assigns a different meaning to the VLC codeword (run = 0, level = +/- 1) that would otherwise represent EOB (10) as the first coefficient in the zig-zag ordered Run-Level token list.

What’s the deal with End of Block?

It saves unnecessary run-length codes. At optimal bitrates, there tend to be few AC coefficients, concentrated in the early stages of the zig-zag vector. In MPEG-1, the 2-bit length of EOB implies that there is an average of only 3 or 4 non-zero AC coefficients per block. In MPEG-2 Intra (I) pictures, with a 4-bit EOB code in Table 1, this estimate is between 9 and 16 coefficients. Since EOB is required for all coded blocks, its absence can signal that a syntax error has occurred in the bitstream.
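The link between the EOB code length and the coefficient estimates above can be made concrete with a toy model. The Python sketch below assumes that an n-bit codeword corresponds to a symbol probability of 2^-n and that each symbol in a block is EOB independently with that probability (a geometric model). Both the model and the function name are assumptions made for illustration, not something taken from the MPEG text.

```python
def implied_coefficients_per_block(eob_bits):
    """Expected number of Run-Level symbols before EOB under a toy model.

    Assumptions: P(EOB) = 2**-eob_bits for every symbol, independently,
    so the symbol count including EOB is geometric with mean 1/P(EOB).
    """
    p_eob = 2.0 ** -eob_bits
    return 1.0 / p_eob - 1.0


if __name__ == "__main__":
    print("Table 0 (2-bit EOB):", implied_coefficients_per_block(2))
    print("Table 1 (4-bit EOB):", implied_coefficients_per_block(4))
```

Under these assumptions the model yields 3 coefficients for a 2-bit EOB and 15 for a 4-bit EOB, which falls inside the ranges quoted above.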
What’s this “Macroblock stuffing,” dammit?

A genuine pain for VLSI implementations, macroblock stuffing was included in MPEG-1 to maintain smoother, constant bitrate control for encoders. However, with normalized complexity/activity measures and buffer management performed a priori (before coding of the macroblock, for example) and local monitoring of coded data buffer levels now a common operation in encoders (e.g. the MPEG-2 encoder Test Model), the need for such localized bitrate smoothing evaporated. Stuffing can be achieved through slice start code padding if required. A good rule of thumb is: if you often find yourself wishing for stuffing more than once per slice, you probably don't have a very good rate control algorithm. Nonetheless, to avoid any temptation, macroblock stuffing is now illegal in MPEG-2. (A general syntax restriction brought to you by the Implementation Studies Subgroup!)

What’s the deal with slice_vertical_position and macroblock_address_increment?

The absolute position of the first macroblock within a slice is known by the combination of slice_vertical_position and the macroblock_address_increment. Therefore, the proper place of a lost slice found in a highly corrupt bitstream can be located exactly within the picture. These two syntax elements are also the only known means of detecting slice gaps: areas of the picture which are not represented with any information (including skipped macroblocks). A slice gap occurs when the current macroblock address of the first macroblock in a slice is greater than the previous macroblock address by more than 1 macroblock unit. A slice overlap occurs when the current macroblock address is less than or equal to the previous macroblock's address. The previous macroblock in both instances is the last known macroblock within the previous slice.
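The gap and overlap rules just given can be sketched in a few lines of Python. The function names, the string return values, and the picture width in macroblocks (mb_width) are hypothetical choices for this example. The address formula assumes that slice_vertical_position counts macroblock rows from 1 and that macroblock_address_increment is relative to the macroblock just before the start of that row; treat it as an assumption rather than a restatement of the specification.

```python
def first_macroblock_address(slice_vertical_position, macroblock_address_increment, mb_width):
    """Absolute raster address of the first macroblock in a slice.

    Assumption: rows are numbered from 1, and the increment is counted
    from the macroblock immediately preceding the start of the row.
    """
    row = slice_vertical_position - 1
    return row * mb_width + macroblock_address_increment - 1


def classify_slice_start(previous_last_address, first_address):
    """Classify a new slice against the last known macroblock of the previous slice."""
    delta = first_address - previous_last_address
    if delta > 1:
        return "gap"        # some macroblocks are not represented at all
    if delta <= 0:
        return "overlap"    # address is less than or equal to the previous one
    return "contiguous"


if __name__ == "__main__":
    mb_width = 45  # e.g. a 720-pixel-wide picture
    addr = first_macroblock_address(2, 1, mb_width)
    print("first address of slice:", addr)
    print(classify_slice_start(44, addr))
    print(classify_slice_start(40, addr))
    print(classify_slice_start(45, addr))
```

Because the classification depends only on two absolute addresses, a decoder can apply it even when the slices in between were lost to bitstream errors.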
Because of the semantic interpretation of slice gaps and overlaps, and because of the syntactic restrictions for slice_vertical_position and macroblock_address_increment, it is not syntactically possible for a skipped macroblock to be represented in the first and last positions of a slice. In the past, some (bad) encoders would attempt to signal a run of skipped macroblocks to the end of the slice. These evil skipped macroblocks should be interpreted by a compliant decoder as a gap, not as a string of skipped macroblocks.

What is meant by modified Huffman VLC tables?

The VLC tables in MPEG are not Huffman tables in the true sense of Huffman coding, but are more like the tables used in Group 3 fax. They are entropy constrained, that is, non-downloadable and optimized for a limited range of bit rates (sweet spots). A better way to say this is that the tables are optimized for a range of ratios of bit rate to sample rate (e.g. 0.25 bits/pixel to 1.0 bits/pixel). With the exception of a few codewords, the larger tables were carried over from the H.261 standard drafted in the year 1990. This includes the AC run-level symbols, coded_block_pattern, and macroblock_address_increment. MPEG-2 added an "Intra table," also called "Table 1". Note that the dct_coefficient tables assume positive/negative coefficient PMF symmetry.

How does MPEG handle 3:2 pulldown?

MPEG-1 video decoders had to decide for themselves when to perform 3:2 pulldown if it was not indicated in the presentation time stamps (PTS) of the Systems layer bitstream. MPEG-2 provides two flags (repeat_first_field and top_field_first) which explicitly describe whether a frame or field is to be repeated. In progressive sequences, frames can be repeated 2 or 3 times. Simple and Main Profiles are limited to repeated fields only. It is a general syntactic restriction that repeat_first_field can only be signaled (value == 1) in a frame structured picture.
It makes little sense to repeat field pictures in an interlaced video signal, since the whole process of 3:2 pulldown conversion was meant to convert progressive film sequences to the display frame rate of interlaced television. In the most common scenario, a film sequence will contain 24 frames every second. The frame_rate element in the sequence header will indicate 30 frames/sec, however. On average, every other coded frame will signal a repeat field (repeat_first_field == 1) to pad the frame rate from 24 Hz to 30 Hz:

(24 coded frames/sec) * (2 fields/coded frame) * (5 display fields/4 coded fields) = 60 display fields/sec = 30 display frames/sec

After all this standardization, what’s left for research?

A. Despite the fact that a comprehensive worldwide standard now exists for digital video, many areas remain wide open for research: advanced encoding and pre-processing, motion estimation, macroblock decision models, rate control and buffer management in editing environments, implementation complexity reduction, etc. Many areas have yet to be solved (and discovered).

Are some encoders better than others?

A. Definitely. For example, the motion estimation search range of an encoder has great influence over final picture quality. At a certain point a very large range can actually become detrimental (it may encourage large differential motion vectors). Practical ranges are usually between +/- 15 and +/- 32. As the range doubles, for instance, the search area quadruples (the classic relationship between an increase in linear dimension and the increase in area). Rate control marks a second tell-tale area where some encoders perform significantly better than others. And finally, the degree of "pre-processing" (now a popular buzzword in the business) signals that the encoder belongs to an elite marketing class.

Is the encoder standardized?

A. The encoder rests just outside the normative scope of the standard, as long as the bitstreams it produces are compliant.
The decoder, however, is almost deterministic: a given bitstream should reconstruct to a unique set of pictures. However, since the IDCT function is the ONLY non-normative stage in the decoder, an occasional error of a Least Significant Bit per prediction iteration is permitted. The designer is free to choose among many IDCT algorithms and implementations. The IEEE 1180 test referenced in Annex A of the MPEG-1 (ISO/IEC 11172-2) and MPEG-2 (ISO/IEC 13818-2) Video specifications spells out the statistical mismatch tolerance between the Reference IDCT, which is a separable 8x1 "Direct Matrix" DCT implemented with 64-bit floating point accuracy, and the IDCT you are testing for compliance.

What is the TM (Test Model)? What is the TM rate control and adaptive quantization technique?

A. The Test Model (MPEG-2) and Simulation Model (MPEG-1) were not, by any stretch of the imagination, meant to epitomize state-of-the-art encoding quality. They were, however, designed to exercise the syntax, verify proposals, and test the relative compression performance of proposals in a timely manner that could be duplicated by co-experimenters. Without simplicity, there would no doubt have been endless debates over model interpretation. Regardless of all else, more advanced techniques would probably trespass into proprietary territory.

The final test model for MPEG-2 is TM version 5b, a.k.a. TM version 6, produced in March 1993 (the time when the MPEG-2 video syntax was frozen). The final MPEG-1 simulation model is version 3 (SM-3). The MPEG-2 TM rate control method offers a dramatic improvement over the SM method. TM adds more accurate estimation of macroblock complexity through use of limited a priori information. Macroblock quantization adjustments are computed on a macroblock basis, instead of once per macroblock row (which in the SM-3 case consisted of an entire slice).

How does the TM work?
Rate control and adaptive quantization are divided into three steps:

Step One: Target Bit Allocation

In Complexity Estimation, the global complexity measures assign relative weights to each picture type (I, P, B). These weights (Xi, Xp, Xb) are reflected by the typical coded frame size of I, P, and B pictures (see typical frame size discussion). I pictures are usually assigned the largest weight since they have the greatest stability factor in an image sequence and contain the most new information in a sequence. B pictures are assigned the smallest weight since B picture energy does not propagate into other pictures, and B pictures are usually more highly correlated with neighboring P and I pictures than P pictures are.

The bit target for a frame is based on the frame type, the remaining number of bits left in the Group of Pictures (GOP) allocation, and the immediate statistical history of previously coded pictures (sort of a moving average global rate control, if you will).

Step Two: Rate Control via Buffer Monitoring

Rate control attempts to adjust bit allocation if there is a significant difference between the target bits (anticipated bits) and actual coded bits for a block of data. If the virtual buffer begins to overflow, the macroblock quantization step size is increased, resulting in a smaller yield of coded bits in subsequent macroblocks. Likewise, if underflow begins, the step size is decreased. The Test Model assumes that the target picture has a spatially uniform distribution of bits. This is a safe approximation since spatial activity and perceived quantization noise are almost inversely proportional. Of course, the user is free to design a custom distribution, perhaps targeting more bits in areas that contain more complex yet highly perceptible data such as text.

Step Three: Adaptive Quantization

The final step modulates the macroblock quantization step size obtained in Step 2 by a local activity measure.
The activity measure itself is normalized against the most recently coded picture of the same type (I, P, or B). The activity for a macroblock is chosen as the minimum among the four 8x8 block luminance variances. Choosing the minimum block is part of the concept that a macroblock is no better than the block of highest visible distortion (weakest link in the chain).

Decision: [deferred to later date]

Can motion vectors be used to determine object velocity?

Motion vector information cannot be reliably used as a means of determining object velocity unless the encoder model specifically set out to do so. First, encoder models that optimize picture quality generate vectors that typically minimize prediction error and, consequently, the vectors often do not represent true object translation from picture to picture. Standards converters that resample one frame rate to another (as in NTSC to PAL) use different methods (motion vector field estimation, edge detection, et al.) that are not concerned with Rate-Distortion theory. Second, motion vectors are not transmitted for all macroblocks anyway.

Is it possible to code interlaced video with MPEG-1 syntax?

A. Two methods can be applied to interlaced video that maintain syntactic compatibility with MPEG-1 (which was originally designed for progressive frames only). In the field concatenation method, the encoder model can carefully construct predictions and prediction errors that realize good compression but maintain field integrity (distinction between adjacent fields of opposite parity). Some pre-processing techniques can also be applied to the interlaced source video that would, e.g., lessen sharp vertical frequencies. This technique is not terribly efficient, of course. On the other hand, if the original source was progressive (e.g. film), then it is more trivial to convert the interlaced source to a progressive format before encoding.
(MPEG-2 would then only offer slightly superior performance through such MPEG-2 enhancements as greater DC coefficient precision, non-linear mquant, intra VLC, etc.) Reconstructed frames are usually re-interlaced in the Display process following the decoding stages.

The second syntactically compatible method codes fields as separate pictures. Rumors have spread that this approach does not work nearly as well as the "pretend it's really a frame" method.

Can MPEG be used to code still frames?

Yes. MPEG Intra pictures are similar to baseline sequential JPEG pictures. There are, of course, advantages and disadvantages to using MPEG over JPEG to represent still pictures.

Disadvantages:
1. MPEG has only one color space (YCbCr).
2. In MPEG-1 and MPEG-2 Main Profile, luma and chroma share quantization and VLC tables (4:2:0 chroma_format).
3. MPEG-1 is syntactically limited to 4k x 4k images, and MPEG-2 to 16k x 16k.

Advantages:
1. MPEG possesses adaptive quantization which permits better rate control and spatial masking.
2. With its limited still image syntax, MPEG averts any temptation to use unnecessary, expensive, and academic encoding methods that have little impact on the overall picture quality (you know who you are).
3. Philips' CD-I spec. has a requirement for an MPEG still frame mode, with double SIF image resolution. This is technically feasible mostly thanks to the fact that only one picture buffer is needed to decode a still image instead of the 2.5 to 3 buffers needed for IPB sequences.

Why was the 8x8 DCT size chosen?

A. Experiments showed that little compaction gain could be achieved with larger transform sizes, especially in light of the increased implementation complexity. A fast DCT algorithm will require roughly double the number of arithmetic operations per sample when the linear transform point size is doubled. Naturally, the best compaction efficiency has been demonstrated using locally adaptive block sizes (e.g. 16x16, 16x8, 8x8, 8x4, and 4x4) [See Gary Sullivan and Rich Baker "Efficient Quadtree Coding of Images and Video," ICASSP 91, pp 2661-2664.]. Inevitably, adaptive block transformation sizes introduce additional side information overhead while forcing the decoder to implement programmable or hardwired recursive DCT algorithms. If the DCT size becomes too large, then more edges (local discontinuities) and the like become absorbed into the transform block, resulting in wider propagation of Gibbs (ringing) and other unpleasant phenomena. Finally, with larger transform sizes, the DC term is even more critically sensitive to quantization noise.

Why was the 16x16 prediction size chosen?

The 16x16 area corresponds to the Least Common Multiple (LCM) of 8x8 blocks, given the normative 4:2:0 chroma ratio. Starting with medium size images, the 16x16 area provides a good balance between side information overhead and complexity on the one hand and motion compensated prediction accuracy on the other. In gist, experiments showed that 16x16 was a good trade-off between complexity and coding efficiency.

What do B-pictures buy you?

A. Since bi-directional macroblock predictions are an average of two macroblock areas, noise is reduced at low bit rates (like a 3-D filter, if you will). At nominal MPEG-1 video rates (352 x 240 x 30, 1.15 Mbit/sec), it is said that B-frames improve SNR by as much as 2 dB. (A 0.5 dB gain is usually considered worthwhile in MPEG.) However, at higher bit rates, B-frames become less useful since they inherently do not contribute to the progressive refinement of an image sequence (i.e. they are not used as prediction by subsequent coded frames). Regardless, B-frames are still politically controversial.

B pictures are interpolative in two ways:
1. predictions in the bi-directional macroblocks are an average from block areas of two pictures
2.
B pictures "fill in" like a digital spackle the immediate 3-D video signal without contributing to the overall signal quality beyond that immediate point in time. In other words, a B picture, regardless of its internal make-up of macroblock types, has a life limited only to itself. As mentioned before, B picture energy does not propagate into other frames. In a sense, bits spent on B pictures are wasted.

Why do some people hate B-frames?

A. Computational complexity, bandwidth, end-to-end delay, and picture buffer size are the four B-frame Pet Peeves. Computational complexity in the decoder is increased since some macroblock modes require averaging between two block predictions (macroblock_motion_forward==1 && macroblock_motion_backward==1). Worst case, memory bandwidth is increased by an extra 15.2 MByte/s (assuming 4:2:0 chroma_format at Main Level, not including any half pel or page-mode overhead) for this extra directional prediction. To really rub it in, an extra picture buffer is needed to store the future reference picture (backwards prediction frame). Finally, an extra picture delay is introduced in the decoder since the frame used for backwards prediction needs to be transmitted to the decoder and reconstructed before the intermediate B-pictures in display order can be decoded.

Cable television has been particularly averse to B-frames since, for CCIR 601 rate video, the extra picture buffer pushes the decoder DRAM memory requirements past the magic 8-Mbit (1 Mbyte) threshold into the evil realm of 16 Mbits (2 Mbyte), although 8 Mbits is fine for a 352 x 480 B picture sequence. However, cable often forgets that DRAM does not come in convenient high-volume (low cost) 8-Mbit packages as do the friendly 4-Mbit and 16-Mbit packages. In a few years, the cost difference between 16 Mbit and 8 Mbit will become insignificant compared to the bandwidth savings gained through higher compression.
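The bandwidth and DRAM figures quoted above can be checked with a little arithmetic. The sketch below is only an illustration under stated assumptions: it takes CCIR 601 rate video to mean 704 x 480 luminance samples at 30 Hz, 8 bits per sample and 4:2:0 chroma (1.5 bytes per pixel), and it counts whole picture buffers only, ignoring the coded data buffer and any other decoder overhead. The function names are made up for this example.

```python
BYTES_PER_PIXEL_420 = 1.5     # 1 luma byte + 2 chroma bytes shared by 4 pixels
BITS_PER_MBIT = 1024 * 1024   # DRAM capacity is quoted in binary megabits


def frame_store_bits(width, height):
    """Size in bits of one reconstructed 4:2:0 picture at 8 bits per sample."""
    return int(width * height * BYTES_PER_PIXEL_420 * 8)


def extra_prediction_bytes_per_sec(width, height, pictures_per_sec):
    """Bytes per second read to fetch one additional prediction for every pixel."""
    return width * height * BYTES_PER_PIXEL_420 * pictures_per_sec


def buffers_fit(width, height, num_buffers, dram_mbit):
    """True if num_buffers picture buffers alone fit in dram_mbit megabits of DRAM."""
    return num_buffers * frame_store_bits(width, height) <= dram_mbit * BITS_PER_MBIT


rate = extra_prediction_bytes_per_sec(704, 480, 30)
print(f"extra prediction bandwidth: {rate / 1e6:.1f} MByte/s")
print("704x480, 3 buffers, 8 Mbit :", buffers_fit(704, 480, 3, 8))
print("704x480, 3 buffers, 16 Mbit:", buffers_fit(704, 480, 3, 16))
print("352x480, 3 buffers, 8 Mbit :", buffers_fit(352, 480, 3, 8))
```

Under these assumptions the extra prediction read rate comes out at about 15.2 MByte/s, three 704 x 480 picture buffers overflow 8 Mbit but fit in 16 Mbit, and three 352 x 480 buffers fit in 8 Mbit, which agrees with the figures in the text.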
For the time being, some cable boxes will start with 8 Mbit and allow future drop-in upgrades to the full 16 Mbit.

How are interlaced and progressive pictures indicated in MPEG?

The following tree may help illustrate the possible layers of progressive and interlaced coding modes:

                 MPEG-2 sequence
                 /             \
        progressive           interlaced
         sequence              sequence
                              /         \
                   Field picture       Frame picture
                                        /          \
                                       /            \
                  Frame or field prediction      Frame MB prediction only
                        /        \
                  Field dct     Frame dct

What does it mean to be compliant with MPEG?

There are two areas of conformance/compliance in MPEG:
1. Compliant bitstreams
2. Compliant decoders

Technically speaking, video bitstreams consisting entirely of I-frames are syntactically compliant with the MPEG specification. The I-frame sequence simply utilizes a rather limited subset of the full syntax. Compliant bitstreams must obey the range limits (e.g. motion vector ranges, bit rates, frame rates, buffer sizes) and permitted syntax elements in the bitstream (e.g. chroma_format, B-pictures, etc.).

Decoders, however, must be able to decode all combinations of legal bitstreams. For example, a decoder which is incapable of decoding P or B frames is definitely not a Main Profile or Constrained Parameters decoder! Likewise, full arithmetic precision must be obeyed before any decoder can be called "MPEG compliant." The IDCT, inverse quantizer, and motion compensated predictor must meet the accuracy requirements defined in the MPEG document. Real-time conformance is more complicated to measure than arithmetic precision, but it is reasonable to expect that decoders that skip frames on reasonable bitstreams are not likely to be considered compliant.

What are Profiles and Levels?

A. MPEG-2 Video Main Profile and Main Level is analogous to MPEG-1's CPB, with sampling limits at CCIR 601 parameters (720x480x30 Hz or 720x576x25 Hz). "Profiles" limit syntax (i.e. algorithms), whereas "Levels" limit coding parameters (sample rates, frame dimensions, coded bitrates, etc.).
Together, Video Main Profile and Main Level (abbreviated as MP@ML) normalize complexity within feasible limits of 1994 VLSI technology (0.5 micron), yet still meet the needs of the majority of applications. MP@ML is the conformance point for most cable and satellite TV systems.

[insert a description of each Profiles and Levels here]

Can MPEG-1 encode higher sample rates than 352 x 240 x 30 Hz?

A. Yes. The MPEG-1 syntax permits sampling dimensions as high as 4095 x 4095 x 60 frames per second. The MPEG most people think of as "MPEG-1" is really a kind of subset known as Constrained Parameters bitstream (CPB).

What are Constrained Parameters Bitstreams?

MPEG-1 CPB are a limited set of sampling and bitrate parameters designed to normalize decoder computational complexity, buffer size, and memory bandwidth while still addressing the widest possible range of applications. The parameter limits were intentionally designed to permit decoder implementations integrated with 4 Megabits (512 Kbytes) of DRAM.

Bitstream Parameter    Limit
pixels/line            704
lines/frame            480 or 576
pixels/frame           101,376 pixels
pixels/second          2,534,400
frames/sec             30 Hz
bit rate               1.86 Mbit/sec
buffer size            40 Kbytes

The sampling limits of CPB are bounded at the ever popular SIF rate: 396 macroblocks (101,376 pixels) per picture if the picture rate is less than or equal to 25 Hz, and 330 macroblocks (84,480 pixels) per picture if the picture rate is 30 Hz. The MPEG nomenclature loosely defines a pixel or "pel" as a unit vector containing a complete luminance sample and one fractional (0.25 in 4:2:0 format) sample from each of the two chrominance (Cb and Cr) channels. Thus, the corresponding bandwidth figure can be computed as: 352 samples/line x 240 lines/picture x 30 pictures/sec x 1.5 samples/pixel, or 3.8 Ms/s (million samples/sec) including chroma, but not including blanking intervals.
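The sampling limits and the bandwidth figure above lend themselves to a quick check. The following is a minimal sketch, not a conformance test: it uses only the limits quoted in this FAQ (396 macroblocks per picture, 2,534,400 pixels/sec, 30 Hz), expressed in 16x16 macroblocks, and it ignores the bit rate, buffer size, and line length limits. The function names are hypothetical.

```python
import math

# Limits as quoted in this FAQ (assumed here, not re-checked against the standard).
MAX_MACROBLOCKS_PER_PICTURE = 101_376 // 256    # 396 macroblocks
MAX_MACROBLOCKS_PER_SECOND = 2_534_400 // 256   # 9900 macroblocks/sec
MAX_PICTURE_RATE_HZ = 30


def macroblocks(width, height):
    """Number of 16x16 macroblocks needed to cover a width x height picture."""
    return math.ceil(width / 16) * math.ceil(height / 16)


def within_cpb_sampling(width, height, rate_hz):
    """Check only the picture size and sample rate limits of CPB."""
    mbs = macroblocks(width, height)
    return (mbs <= MAX_MACROBLOCKS_PER_PICTURE
            and mbs * rate_hz <= MAX_MACROBLOCKS_PER_SECOND
            and rate_hz <= MAX_PICTURE_RATE_HZ)


def samples_per_second(width, height, rate_hz, samples_per_pixel=1.5):
    """Luma plus 4:2:0 chroma samples per second, excluding blanking."""
    return width * height * rate_hz * samples_per_pixel


print(macroblocks(352, 240), within_cpb_sampling(352, 240, 30))
print(macroblocks(352, 288), within_cpb_sampling(352, 288, 25))
print(within_cpb_sampling(352, 288, 30))
print(samples_per_second(352, 240, 30) / 1e6, "Ms/s")
```

With these limits, 352 x 240 at 30 Hz (330 macroblocks) and 352 x 288 at 25 Hz (396 macroblocks) both sit exactly at 9900 macroblocks/sec, while 352 x 288 at 30 Hz exceeds it, matching the two-case rule stated above.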
Since most decoders are capable of sustaining VLC decoding at a faster rate than 1.8 Mbit/sec, the coded video bitrate has become the most often waived parameter of CPB. An encoder which intelligently employs the syntax tools should achieve SIF quality saturation at about 2 Mbit/sec, whereas an encoder producing streams containing only I (Intra) pictures might require as much as 8 Mbit/sec to achieve the same video quality.

Why is Constrained Parameters so important?

A. It is an optimum point that allows (just barely) cost effective VLSI implementations in 1992 technology (0.8 microns). It also implies a nominal guarantee of interoperability for decoders and a reasonable class of performance for encoders. Since CPB is the most popular canonical MPEG-1 conformance point, MPEG devices which are not capable of at least meeting SIF rates are usually not considered to be true MPEG by industry. Picture buffers (i.e. "frame stores") and coded data buffering requirements for MPEG-1 CPB fit just snugly into 4 Mbit of memory (DRAM).

Who uses constrained parameters bitstreams?

A. Principal CPB applications are Compact Disc video (White Book or CD-I) and desktop video. Set-top TV decoders fall into a higher sampling rate category known as "CCIR 601" or "Broadcast rate," which as a rule of thumb, has sampling dimensions and bandwidth 4 times that of SIF (Constrained Parameter sample rate limit).

Are there ways of circumventing constrained parameters bitstreams for SIF class applications and decoders?

A. Yes, some. Remember that CPB limits pictures by macroblock count (or pixels/frame). 416 x 240 x 24 Hz sampling rates are still within these constraints. Deviating from 352 samples/line could throw off many decoder implementations which possess limited horizontal sample rate conversion abilities. Some decoders do in fact include a few rate conversion modes, with a filter usually implemented via binary taps (shifts and adds).
Likewise, the target sample rates are usually limited to ratios (e.g. 640, 540, 480 pixels/line, etc.). Future MPEG decoders will likely include on-chip arbitrary sample rate converters, perhaps capable of operating in the vertical direction (although there is little need of this in applications using standard TV monitors where line count is constant, with the possible exception of windowing in cable box graphical user interfaces).

Also, many CD videos are letterboxed at the 16:9 aspect ratio. The actual coded and display sampling dimensions are 384 x 216 (note 384/216 = 16/9). These programs are typically movies coded at the more manageable 24 frames/sec.

Are there any other conformance points like CPB for MPEG-1?

A. Undocumented ones, yes. A second generation of decoder chips emerged on the market about 1 year after the first wave of SIF-class decoders. Both LSI Logic and SGS-Thomson introduced CCIR 601 class MPEG-1 video decoders to fill in the gap between canonical MPEG-1 (SIF) and the emergence of Main Profile at Main Level (CCIR 601) MPEG-2 decoders. Under non-disclosure agreement, C-Cube had the CL-950, although since Q2'94 the CL-9100 has been the full MPEG-2 successor in production. MPEG-1 decoders in the CCIR 601 class, or Main Level, were all too often called MPEG-1.5 or MPEG-1++ decoders. For the first year of operation, the Direct Broadcasting Satellite service in the United States (Hughes DirecTV and Hubbard's USSB) called only upon MPEG-1 syntax to represent interlaced video before switching to full MPEG-2 syntax.

What frame rates are permitted in MPEG?

A limited set is available for the choosing in MPEG-1 and the currently defined set of Profiles and Levels of MPEG-2, although "tricks" could be played with Systems-layer Time Stamps to convey non-standard picture rates.
The set is: 23.976 Hz (3-2 pulldown NTSC), 24 Hz (Film), 25 Hz (PAL/SECAM or 625/50 video), 29.97 Hz (NTSC), 30 Hz (drop-frame NTSC or component 525/60), 50 Hz (double-rate PAL), 59.94 Hz (double-rate NTSC), and 60 Hz (double-rate, drop-frame NTSC/component 525/60 video). Only 23.976, 24, 25, 29.97, and 30 Hz are within the conformance space of Constrained Parameter Bitstreams and Main Level.

What areas can be improved upon to create a better syntax than MPEG?

Several improvements can be made to the MPEG syntax while remaining within the framework of block based coding. As implementation technology improves with time, the ratio of computation to sample rate can be increased for the same implementation cost. With each evolutionary stage in the shrinking of the semiconductor lithography process (line width), more complex coding methods become economically realizable. Some of the well-known or well-anticipated areas for improvement are described below:

Intra coding: For intra pictures, subband methods such as wavelets combined with improved quantization and entropy coders could gain as much as 2-4 dB over MPEG Intra pictures. The problem becomes more complex when considering the coding of Intra Macroblocks in mixed pictures, such as P or B, since the extent of a subband must, in the simplest of schemes, be limited to the dimensions of a macroblock.

Prediction error coding: One of the strongest gripes against MPEG is the use of the DCT for decorrelation of prediction error blocks. One explanation is that the DCT is suited for the statistical correlation of intra signals, but less suited for the statistics of prediction error (Non-intra) signals. One common proposal is to replace the DCT with a Vector Quantizer. Prediction error (Non-intra) blocks typically contain far fewer bits than intra blocks.
(The bits that comprise a Non-intra block can be thought of as having been previously distributed over previous blocks in previous pictures in the form of coefficients and side information...)

Finer coding unit granularity: The size of the transform block could be made smaller, larger, or both (a myriad of different sizes). Likewise, the size of the motion compensation block can be made larger or smaller. The cost is more complex semantics (more decoder complexity) and the overhead bits to select the block size. Instead of sharing the same side information, the blocks within the macroblock could be assigned their own motion vectors, macroblock quantization scale factors, etc.

Many advanced techniques were investigated by MPEG during the formative stages of the specification, but were eventually eliminated for falling below a threshold set for coding gain vs. implementation complexity. Often, proposals presented a significant departure from the mainstream algorithms under consideration. Each bit added to the syntax, or rule added to the semantics, represents several gates to a silicon implementation, or from a software perspective, an extra table, if-then or case statement at multiple points in the decoding program.

What are the similarities and differences between MPEG and H.263?

During its formative stages, H.263 was known as "H.26P" or "H.26X". It is an ITU-T standard for low-bitrate video and audio teleconferencing. It is designed to be more efficient (by at least 2 dB) than H.261 for bit rates below 64 kbit/sec (ISDN B channel). The primary target bit rate, approximately 27,000 bits/sec, is the payload rate of the V.34 (a.k.a. "V.Fast" or "V.Last") modem standard. In a typical scenario, 20 kbit/sec would be allocated for the video portion, and 6.5 kbit/sec for the speech portion. Since the H.261 syntax was defined in 1990, techniques and implementation power have naturally improved.
H.263 collects many of the advanced methods proposed during MPEG's formative stages into a syntax which shares a common basis more with MPEG-1 video than with H.261. The detailed differences and similarities are summarized below:

Sample rate, precision, and color space: H.263 pictures are transmitted with QCIF dimensions. MPEG and JPEG allow nearly any picture size to be described in the headers. A fixed picture size promotes interoperability by forcing all implementors to operate at a common rate, rather than by allowing implementors to get away with whatever lowest sample rate the consumer can be tricked into buying. Another reason for a fixed sample rate is that, unlike MPEG which is generic, H.263 is geared towards a specific application (teleconferencing). Other MPEG applications such as CD Video and Cable TV define their own fixed parameters. Chromaticity is again YCbCr, with a 4:2:0 macroblock structure and 8 bits of uniform sample precision.

[details deferred]

How would you describe MPEG to the Data Compression expert?

A. MPEG video is a block-based coding scheme.

How does MPEG video really compare to TV, VHS, laserdisc?

A. VHS picture quality can be achieved for film source video at about 1 million bits per second (with careful application of proprietary encoding methods). Objective comparison of MPEG to VHS is complex. The luminance response curve of VHS places -3 dB (50% response, the common definition of bandlimit) at around analog 2 MHz (digital equivalent to 200 samples/line). VHS chroma is considerably less dense in the horizontal direction than MPEG's 4:2:0 signal (compare 80 samples/line equivalent to 176!). From a sampling density perspective, VHS is superior only in the vertical direction (480 luminance lines compared to 240). When other analog factors are taken into account, such as interfield crosstalk and the TV monitor Kell factor, the perceptual vertical advantage becomes much less than 2:1.
VHS is also prone to such inconveniences as timing errors (an annoyance addressed by time base correctors), whereas digital video is fully discretized. Duplication processes for pre-recorded VHS tapes at high speeds (5 to 15 times real time playback speed) introduce additional handicaps. In gist, MPEG-1 at its nominal parameters can match VHS's sexy low-pass-filtered look, but for critical sequences, is probably overall inferior to a well mastered, well duplicated VHS tape.

With careful coding schemes, broadcast NTSC quality can be approximated at about 3 Mbit/sec, and PAL quality at about 4 Mbit/sec for film source video. Of course, sports sequences with complex spatial-temporal activity should be treated with higher bit rates, in the neighborhood of 5 and 6 Mbit/sec.

Laserdisc is perhaps the most difficult medium to make comparisons with. First, the video signal encoded onto a laserdisc is composite, which lends the signal to the familiar set of artifacts (reduced color accuracy of YIQ, moiré patterns, crosstalk, etc.). The medium's bandlimited signal is often defined by laserdisc player manufacturers and mainstream publications as capable of rendering up to 425 TVL (or frequencies with Nyquist at 567 samples/line). An equivalent component digital representation would therefore have sampling dimensions of 567 x 480 x 30 Hz. The carrier-to-noise ratio of a laserdisc video signal is typically better than 48 dB. Timing accuracy is excellent, certainly better than VHS. Yet some of the clean characteristics of laserdisc can be simulated with MPEG-1 signals as low as 1.15 Mbit/sec (SIF rates), especially for those areas of medium detail (low spatial activity) in the presence of uniform motion (affine motion vector fields). The appearance of laserdisc or Super VHS quality can therefore be obtained for many video sequences with low bit rates, but for the more general class of image sequences, a bit rate ranging from 3 to 6 Mbit/sec is necessary.
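The step from 425 TVL to 567 samples/line in the laserdisc comparison is a one-line conversion. The sketch below assumes the usual convention that TV lines of horizontal resolution are counted per picture height, so a 4:3 picture carries 4/3 as many samples across its full width; the helper name is made up for this example.

```python
from fractions import Fraction


def tvl_to_samples_per_line(tvl, aspect=Fraction(4, 3)):
    """Convert TV lines (per picture height) to samples across the full picture width."""
    return round(tvl * aspect)


laserdisc = tvl_to_samples_per_line(425)
print(laserdisc, "samples/line for 425 TVL")
print(f"SIF (352 samples/line) covers {352 / laserdisc:.0%} of that width resolution")
```

Under that assumption 425 TVL corresponds to 567 samples/line, the figure used above for the equivalent 567 x 480 x 30 Hz component representation.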
What are the typical coded sizes for the MPEG frames?

Typical sizes in bits for the three different picture types:

Level                          I         P         B        Average
30 Hz SIF @ 1.15 Mbit/sec      150,000   50,000    20,000   38,000
30 Hz CCIR 601 @ 4 Mbit/sec    400,000   200,000   80,000   130,000

Note: the above example is taken from a standard test sequence coded by the Test Model method, with an I frame distance of 15 (N = 15), and a P frame distance of 3 (M = 3). Of course, among differing source material, scene changes, and use of advanced encoder models these numbers can be significantly different.

At what bitrates is MPEG-2 video optimal?

The Test subgroup has defined a few example "Sweet spot" sampling dimensions and bit rates for MPEG-2:

Dimensions                     Coded rate   Application
352x480x24 Hz (progressive)    2 Mbit/sec   Equivalent to VHS quality. Intended for film
                                            source video. Half horizontal 601 (HHR).
                                            Looks almost broadcast NTSC quality.
544x480x30 Hz (interlaced)     4 Mbit/sec   PAL broadcast quality (nearly full capture of
                                            5.4 MHz luminance signal). 544 samples matches
                                            the width of a 4:3 picture windowed within
                                            720 sample/line 16:9 aspect ratio via pan&scan.
704x480x30 Hz (interlaced)     6 Mbit/sec   Full CCIR 601 sampling dimensions.

These numbers may be too ambitious. Bit rates of 3, 6, and 8 Mbit/sec respectively provide transparent quality for the above application examples when generated by a reasonably sophisticated encoder.

Why does film perform so well with MPEG?

1. The frame rate is 24 Hz (instead of 30 Hz), which is a savings of some 20%.
2. Film source video is inherently progressive. Hence no fussy interlaced spectral frequencies.