Second generation audio to video synchronization
contact site map home


Abstract

This paper presents a tutorial on second generation audio to video synchronization error correction for television systems. Some of the more common sources of errors are described; the problems which the errors create and solutions to the error causes are outlined.

Viewer perception problems

The most obvious result of audio to video mismatch is visible "lip sync" errors. This problem certainly can and does happen in today's systems, with the frequency of occurrence becoming a significant concern to advertisers and station management. The mistiming of audio and video will always cause a subconscious degradation of the program's entertainment quality as perceived by the home viewer when the audio is advanced with respect to the video. The cause of this effect is believed to be the unnatural sound relationship which the television program presents. In our natural environment we are used to hearing audio slightly delayed with respect to video due to the slower speed of propagation of sound waves as compared to light. For example, we are used to hearing a racquet striking after we see the ball hit and hearing a commercial actor after we see them talking. In today's television systems however, it is the video which is delayed thus causing the sound to arrive at the viewer's ears before the corresponding visual sensation.

Viewing a television program with advanced audio is unnatural for the viewer, and is believed to cause subconscious stress. Psychological tests at Stanford University (1) demonstrate that viewers who watch television commercials having advanced audio "evaluate people on television more negatively (e.g. less interesting, more unpleasant, less influential, more agitated, less successful)" than the same commercials which were played with the audio in sync with the video. It was also discovered that this effect takes place with relatively small audio advances where the mere existence of an audio problem was detected by very few average viewers.

In addition to the negative perception of the commercials in the presence of advanced audio, there was also evidence this caused the test subject's memory of the negative aspects of the commercial to be remembered longer than normal. The worst possible scenario takes place, the viewer perceives the advanced audio commercial in a bad light, and also remembers it longer than a commercial which is properly presented. Obviously, such problems can cause a great deal of concern for television advertisers.

CCD camera generated vision delays

Audio to video synchronization errors are becoming more troublesome as television technology progresses. The wide use of cameras having CCD sensors is aggravating this synchronization problem. All CCD sensors have an inherent visual delay mechanism. Depending on the sensor type, the visual delay may be several fields for newer camera types. In particular, the liberal use of digital frame store based image processing in newer cameras is creating previously unknown vision delays of several fields, with a four field delay not being uncommon.

Variable temporal resolution in the CCD

It would be worthwhile to mention the effect that variable shutter speeds has on temporally sampling the image. At maximum exposure, that is a 1 frame shutter speed, the image is integrated over the entire frame, tending to blur any motion in the image and making it difficult for the viewer to distinguish precisely such events as lip movement. This blurring was normal with tube based cameras which were continuously exposed to light.

With a fast shutter speeds of CCDs, the image is integrated over a relatively short time, for example 100As for a 1/10,000 second exposure. In television systems, the frame rate (assuming a frame rate CCD exposure) is equivalent to the sampling rate in sampling theory. The exposure time is equivalent to aperture time. The ratio of exposure time to frame rate is the aperture ratio. It is known from sampling theory that the aperture ratio effect on frequency response, which in this case is the ability to accurately convey motion. For short exposures, the ability to convey motion to the viewer increases dramatically. The shorter exposure time gives brighter and less blurred moving edges which result in the viewer's improved ability to perceive motion. The CCD camera induced improved motion perception aggravates the corresponding increased image delay time, and makes any audio to image mismatch easier for the viewer to consciously or subconsciously detect.

Video processing delays

Video signals are often passed through a special effects generators, color correctors, noise reducers, frame synchronizers and a variety of other editing and image processing functions. As memory costs continue to decline, these devices increase in complexity, and many incorporate frame based processing functions which add delays which are switched in and out. Unlike the past where video delays slowly drifted due to differing sync generator phases, the video delay in many of today's systems take instant jumps of one or more frames, as editors and other operators select different processing modes. This situation is especially true of many current noise reduction and color correction products where extra frames of delay are added for each additional selected function. This instant change of delay length poses special challenges for the corresponding audio synchronizer which must keep up with these instant large changes in video delay.

Setting performance standards

Several standards committees have set standards or guidelines for audio to video synchronization errors. The Radiocommunication Study Groups of The International Telecommunication Union states (2):

"Given the operating practices employed in the United States and the requirement that a single picture and sound service may reach the consumer in different forms and via different paths, the list of preferred points should be as noted above and the tolerances required at each of the points should be the same (+1lfield, -2 fields) with the understanding that these tolerances are absolute, are not accumulative, and apply to the overall system".

The International Telecommunication Union in the Draft New Recommendation [DOC. 11/59] (3) reports that errors of and greater than +20 and -40 ms are detectable and errors of +40 and -160 ms are "subjectively annoying" (+ numbers indicate sound advanced with respect to video). The draft recommendation states:

A tighter tolerance on the range of values in the studio and production paths would be required to allow this (partitioning of tolerances]. The situation might look something line this:

+20 ms -40 ms Overall tolerance

+10 ms -30 ms Production/presentation

+10 ms -10 ms Distribution/transmission

+2 ms -2 ms Per codec

EIA/TIA-250-C standards call for a +25 to -40 ms specification end to end for transmission facilities. Given the inherent video delays in CCD cameras, very little additional delay can be tolerated in the rest of the system.

Measeuring the videp delay

Clearly, television facilities need to be designed with audio synchronization in mind. It is impractical to remove the offending video delays, so the only remaining solution is to ensure that the program audio receives the same delay as the associated video.

Part of the solution is to measure the video delay at each significant delaying device so that a corresponding audio delay can be inserted at that point. Several video synchronizer manufacturers have a digital delay output (DDO) which provide a current video delay value signal for use by a companion audio synchronizer. Additionally, video delay detectors are available for devices which do not provide DDO signals. The audio synchronizer receives the DDO signal and automatically delays the audio signal by a corresponding amount.

Delay detectors for video devices without DDOs operate by storing a given input video frame and comparing all output frames to the stored frame. By counting the number of frames which pass until the previously input frame is output, the video delay is obtained. These devices are easy to add to an existing system, requiring only that input and output video be looped through their inputs. They provide a DDO signal which may be utilized by a companion audio synchronizer to make appropriate corrections.

The second generation audio synchronizers

It should be noted that all currently viable solutions to the audio to video synchronization problem utilize adjustable audio delays at some point in the system to delay the audio to match the delayed video. The adjustable audio delay remains a key element in system designs, and second generation synchronizers are challenged with the problem of making adjustments to the delay length which are imperceptible to the viewer.

As video delay values take jumps of one or more frames, the audio delay is required to take on the new, greatly different delay value without disrupting the audio. old style audio delays often operated by dropping or repeating audio samples, and relied on slowly changing video delays to operate properly. The occasional sample manipulation usually went unnoticed by the home viewer. When faced with instant delay jumps of a frame or more, these old devices required several seconds or even minutes to attain new delay values, with the sample manipulation creating noticeable distortion the whole time. Consequently, the audio would be both out of sync and noticeably degraded for the duration of the time to make the change. In systems where large jumps in delay are frequently made, this is unacceptable performance.

In over to overcome the problems inherent with sample manipulation, and more importantly to preserve the integrity of AES/EBU digital audio, it is necessary to have 1:1 correspondence between input and output samples in the audio synchronizer.

The audio delay memory must store every audio sample which is taken by the A-D, or received on the digital input, and read every stored audio sample once and only once. In order to accomplish this task, the memory must have completely decoupled and asynchronous reading and writing, so that the reading rate can be faster or slower than the storing rate. By varying the reading rate with respect to the storing rate the delay time can be controlled, by causing the reading to catch up with the storing (to decrease delay) or to lag behind the storing (to increase the delay). In digital systems, this must be performed with the obviously inconsistent requirement of maintaining the clock rate at the correct frequency.

Varying the reading rate with respect to the storing rate creates an annoying pitch change artifact, and requires re-clocking audio to maintain the proper output clock rate for digital audio.

In theory, to make the pitch change resulting from the memory read rate change indistinguishable to the viewer, it is necessary to limit the differential rate between memory storing and reading to keep the associated audio pitch change very small. Unfortunately, if the differential rate between memory storing and reading is small, the amount of time required to change delay settings is correspondingly large.

It would be possible to modulate the relative reading rate in response to the audio signal content since larger ratios may be tolerated if no high frequency audio is present, or if there are periods of silence. Modulating the rate with the audio content does not provide a consistent significant improvement however, and frequently is of no advantage for any program material having a musical background.

In order to minimize perceptible pitch shifts during delay changes, to facilitate rapid large delay changes and to maintain proper clock frequencies for correction of AES/EBU digital audio, it is necessary that the audio delay incorporate a pitch correction circuit. With pitch correction, it is possible to make rapid delay changes and maintain proper output clock frequency with the pitch correction circuit removing corresponding audio pitch artifacts so they are unnoticed by the viewer.

One commercial product which incorporates pitch correction is the AD3100 manufactured by Pixel Instruments Corp. of Los Gatos, CA. This device has selectable analog and AES/EBU digital inputs and simultaneous analog and digital outputs. It receives a DDO signal from a video instrument and adjusts the reading rate of the internal memory to increase or decrease the delay while at the same time providing digital signal processing pitch correction to maintain both proper pitch and output sample rate. In this device, multiple frame delay changes can be made in a matter of milliseconds without introducing artifacts or losing proper synchronization.








(1) Dr. Byron Reeves & Dave Voelker, research report Effects of Audio-Video Asynchrony on viewer's Memory, Evaluation of Content and Detection Ability (1993)
(2) International Telecommunication Union Document 10OC/32-E, 11A/43-E, 11C/40E, CMTT-C/18-E 5 October 1993
(3) International Telecommunication Union Document 11A/47-E, 13 October 1993
(4) NAB Engineering Handbook, Television signal Transmission Standards (Washington, D.C.: National Association of Broadcasters), 621,