Audio to video synchronization errors, sources, measurement and correction


Prepared by:
J. Carl Cooper
Chris Smith
Mirko Vojnovic

Abstract

This paper provides an overview of audio to video synchronization errors and their correction. Some of the more common sources of errors are described; the problems which the errors create and solutions to their measurement and correction are outlined.

Audio to video synchronization errors in video systems are usually quite subtle, and are often caused by the buildup of timing errors from several locations, any one of which will often go unnoticed.

Television system audio delays

In the typical television system the video and audio are carried on separate paths, leading to differential processing delays. The audio processing path is typically fairly straightforward with respect to the separate delays it generates, often being composed only of the distance from the sound source to the microphone, a preamp and a couple of relatively simple (with respect to delays) mixing boards.

The microphone distance from the sound source may add a significant delay component to the audio, for example as in a sporting event or outdoor environment. For a sporting event such as a tennis match, the delay might be in the order of 50 milliseconds or 1.5 NTSC frames. By contrast, for typical microphone placement on an announcer's lapel, the sound delay in reaching the microphone might be around 1 millisecond.
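The microphone-distance figures above follow directly from the speed of sound. The short Python sketch below shows the arithmetic; the speed of sound (343 m/s, dry air at about 20 °C) and the example distances are assumptions chosen for illustration and do not come from the original measurements.

```python
SPEED_OF_SOUND_M_PER_S = 343.0      # assumed: dry air at about 20 degrees C
NTSC_FRAME_RATE_HZ = 30000 / 1001   # about 29.97 frames per second


def acoustic_delay_ms(distance_m: float) -> float:
    """Time for sound to travel distance_m, in milliseconds."""
    return distance_m / SPEED_OF_SOUND_M_PER_S * 1000.0


def ms_to_ntsc_frames(delay_ms: float) -> float:
    """Express a delay in milliseconds as a number of NTSC frames."""
    return delay_ms / 1000.0 * NTSC_FRAME_RATE_HZ


# Hypothetical distances: a courtside microphone and a lapel microphone.
for label, distance_m in [("courtside microphone", 17.0), ("lapel microphone", 0.35)]:
    delay = acoustic_delay_ms(distance_m)
    print(f"{label}: {delay:.1f} ms = {ms_to_ntsc_frames(delay):.2f} NTSC frames")
```

A distance of roughly 17 m gives a delay close to 50 ms, or about 1.5 NTSC frames, while a lapel microphone at about 0.35 m gives roughly 1 ms.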

Fortunately, in many instances the viewer does not normally perceive the delay resulting from microphone placement as a problem. When the viewer is present in the same environment which is being televised, he receives the same or worse delayed sound with respect to the viewing of the image which creates the sound. A spectator at a sporting event will consciously or subconsciously perceive the sound as being slightly delayed with respect to the vision; this is a natural occurrence which the television viewer will subconsciously relate to his personal experience. Unfortunately, the addition of even small amounts of additional delay will cause the normal delayed sound sensation to become an annoying over-delayed sensation.

The rest of the television sound channel, depending on the complexity of the television system, will likely add delays on the order of one or two milliseconds per processing device. The total additional delay rarely goes over a few tens of milliseconds.

Vision delays

Figure 1 (below) shows a typical television system from production studio to home receiver.



The diagram indicates the various delays for each of the several typical components of the video and audio signal paths, with the cumulative delay to that point shown in parentheses. It is important to recognize that even for a simple system such as the one shown, audio advances of up to 11 frames can occur. It is also important to notice that much of the mismatch actually occurs after the television signal leaves the production environment, and thus cannot be seen by the production engineering staff.

Virtually all cameras sold and in use today utilize CCD sensors. The visual portion of the television program which originates in front of the camera lens arrives at the light sensor in only a few nanoseconds at even the most extreme distances which can be associated with any sound. At the CCD sensor, the image light is integrated for some period of time, depending on the exposure setting. As indicated in Figure 2 (below), for short exposures with shuttered CCDs, the time delay for converting the light energy into video signals will create an image delay of almost one frame.



For maximum (1 frame) exposures, the corresponding image delay will be averaged over almost one frame. The improved temporal resolution of CCD cameras at short exposures improves the viewer's ability to accurately perceive motion, especially motion related to audio.

In addition to the visual delay associated with the CCD integrating the image, another delay occurs when the image is being shifted out of the CCD. Figure 3 (below) shows a simplified CCD sensor and shift register arrangement.


After vertical blanking, the charges which are accumulated in the sensor are transferred to the storage register where the charges they contain are clocked out line by line. Even though the entire image is taken in a "snapshot" in the CCD array, each pixel of the image receives a different delay due to the shifting mechanism. This shift out delay will be partially compensated by the subsequent scanning in the television receiver, if the receiver is of a scanning type. If the receiver is an LCD or other matrix type, the compensation will not necessarily be present.

Variable temporal resolution in the CCD

It is worthwhile to mention the effect that variable shutter speeds have on temporal sampling of the image. At maximum exposure, that is a 1 frame shutter speed, the image is integrated over the entire frame. This long exposure blurs the image motion, making it difficult for the viewer to accurately perceive the motion. With a fast shutter speed, the image is integrated over a relatively short time, for example 100 µs for a 1/10,000 second exposure. The short exposure makes it much easier for the viewer to perceive motion because the video system has a greater temporal resolution, that is, the least amount of blurring. The increased temporal resolution also improves the viewer's ability to perceive the motion related to audio and video timing.

In addition, one may recall from the psycho-perceptual nature of television images that the response time of the viewer's eye is reduced for brighter objects. In other words, the ability to perceive motion is increased for bright objects. This is the same perceptual phenomenon which causes bright areas of television displays to flicker when the darker portions do not, simply because the viewer's eye's response time is quicker for brighter objects. Assuming proper exposure levels, the edges of less blurred objects appear instantaneously brighter because during the shorter exposure time the light from any point or edge on an object is spread over relatively fewer CCD sensor diodes. The brighter and less blurred moving edges result in the viewer's improved ability to perceive higher temporal resolution.

The reduced image smear on the CCD and increased viewer temporal perception at short exposures aggravate the correspondingly increased image delay time. Compared to tube cameras, CCD cameras make any audio to image mismatch easier for the viewer to consciously or subconsciously detect.

Video processing delays

In many production environments the video may also pass through several other delaying devices such as frame synchronizers, color correctors, noise reducers and a variety of nonlinear editing and image processing functions. Memory costs have declined allowing these devices to increase in complexity, including by use of frame based processing functions which add delays. As shown in Figure 4 (below), video frame synchronizers are a common source of synchronization errors.



The correct audio and video sync is destroyed by the variable delay of the video synchronizer. Currently, 8 field frame synchronizers are in common use for high end applications. The video delays in a system may be continuously variable, such as shown for frame synchronizers, or may be switched in and out as the operator selects different modes of operation. The problem of variability of delay is especially true of many current noise reduction and color correction products where extra frames of delay are added for each additional selected function. The diagram of Figure 1 shows how the use of cascaded switchers and video synchronizers can bring total system delay errors to over 10 frames.

Home receivers

While there are a wide variety of home receivers available, the ones which are of largest concern are large screen types which often contain internal digital processing of the video. Due to the introduction of extremely low cost A-D converters many receiver manufacturers have found that it is possible to incorporate a considerable amount of field and frame based digital signal processing into their designs. These circuits can introduce yet another frame of image delay. At this time there are very few consumer TVs which provide any sort of compensating audio delay.

As the cost of digital image processing circuitry continues to decrease, more and more frame based circuits will be found in consumer TVs. If the past is any indication, many products which are now found exclusively in production houses and TV stations will begin to appear as features of high end consumer products. All of these additional features may very well add additional frames of delay to the video signal before the viewer sees it.

Viewer perception problems


The most obvious result of audio to video mismatch is visible "lip sync" errors. The frequency of these visible errors used to be relatively small but seems to be increasing significantly, and the need for correction is apparent. Additionally, when there is a large but only slightly discernible error, viewers are often not precisely aware that the problem exists.

Unfortunately, just because the viewer does not identify the problem, does not mean he is not affected by the problem. When the audio is advanced with respect to the video, the mistiming will cause a subconscious degradation of the program's entertainment quality as perceived by the viewer. The cause of this effect is believed to be the unnatural sound relationship which the television program presents. In our natural environment we are used to hearing audio slightly delayed with respect to video due to the slower speed of propagation of sound waves as compared to light. We are used to hearing the sound of a hammer after we see it hit, hearing a racquet striking after we see the ball hit and hearing a commercial actor after we see them talking. In television systems it is most often the video which is delayed, thus causing the sound to arrive at the viewer's ears before the visual sensation to which it corresponds.

Viewing a television program with advanced audio is very unnatural for the viewer, and therefore believed to cause subconscious stress to that viewer. It has been demonstrated in psychological tests at Stanford University (1) that viewers who watch television commercials having an audio advance "evaluate people on television more negatively (e.g. less interesting, more unpleasant, less influential, more agitated, less successful)" than the same commercials which were played with the audio in sync with the video. It was also discovered that this effect takes place with relatively small audio advances of 2.5 fields, where the existence of an audio problem could only be detected by very few average viewers.

It was also found in the Stanford tests that even when specifically asked, several of the test subjects who were negatively affected were completely unable to detect the delayed video. These test subjects were completely unaware that there were any audio-to-video synchronization problems at all, much less that their enjoyment of the program was being affected.

Watching a program where we hear action before we see it, or hearing commercial actors before we see them talking can be subconsciously annoying or stressful. Audio sync errors can cause the viewer to perceive a program or commercial less favorably than if the timing were correct, thus preventing the viewer from receiving the entertainment and messages intended by the advertisers. At current advertising rates, this becomes a potentially serious financial problem when the advertisers realize that they are not getting what they pay for due to audio sync errors.

Setting performance standards

Several standards committees in various countries have addressed the problem and have set standards or guidelines for audio to video synchronization errors. For example, the Radiocommunication Study Groups of the International Telecommunication Union state (2):

"Given the operating practices employed in the United States and the requirement that a single picture and sound service may reach the consumer in different forms and via different paths, the list of preferred points should be as noted above and the tolerances required at each of the points should be the same (+1 field, -2 fields) with the understanding that these tolerances are absolute, are not accumulative, and apply to the overall system".

The International Telecommunication Union, in the Draft New Recommendation [DOC. 11/59] (3), reports that in tests in Australia errors of +20 and -40 ms or greater were "detectable" and errors of +40 and -160 ms were "subjectively annoying" (positive numbers indicate sound advanced with respect to video). While not deciding on a recommended specification, the draft recommendation did state:

A tighter tolerance on the range of values in the 10 studio and production paths would be required to allow this [partitioning of tolerances]. The situation might look something like this:
+20 ms … -40 ms Overall tolerance
+10 ms … -30 ms Production/presentation
+10 ms … -10 ms Distribution/transmission
+2 ms … -2 ms per codec

The ITU draft also reports that ABC, NBC and CBS have specified relative timing limits of +16 to -33 ms. The EIA/TIA-250-C standard (4) calls for a +25 to -40 ms specification end to end for transmission facilities.
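As an illustration of how these figures compare, the following Python sketch checks a measured error against the three sets of limits quoted above and converts a value given in NTSC fields to milliseconds. The table keys and function names are hypothetical, and the sign convention follows the ITU draft (positive means sound advanced with respect to video).

```python
NTSC_FIELD_RATE_HZ = 60000 / 1001   # about 59.94 fields per second

# (audio advance limit, audio delay limit) in ms, as quoted in the text above.
TOLERANCES_MS = {
    "ITU draft, detectability": (20.0, -40.0),
    "ABC/NBC/CBS relative timing": (16.0, -33.0),
    "EIA/TIA-250-C end to end": (25.0, -40.0),
}


def fields_to_ms(fields: float, field_rate_hz: float = NTSC_FIELD_RATE_HZ) -> float:
    """Convert a duration in video fields to milliseconds."""
    return fields * 1000.0 / field_rate_hz


def within_tolerance(error_ms: float, limits: tuple) -> bool:
    """True if error_ms lies between the delay limit and the advance limit."""
    advance_limit, delay_limit = limits
    return delay_limit <= error_ms <= advance_limit


# The 2.5 field audio advance used in the Stanford tests, as an example.
error_ms = fields_to_ms(2.5)
for name, limits in TOLERANCES_MS.items():
    verdict = "within" if within_tolerance(error_ms, limits) else "outside"
    print(f"{error_ms:+.1f} ms is {verdict} {name} limits {limits}")
```

An audio advance of 2.5 NTSC fields is a little under 42 ms, which falls outside the advance limit of all three sets of figures.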

What is to be done

Clearly, television facilities need to be designed with the audio to video synchronization problem in mind. Since it is impractical to remove all of the offending video delay mechanisms, the only remaining solution is to ensure that the program audio signal receives the same delay as the associated video signal.

Since the delay path of video signals is constantly changing as shown in Figure 1, it is unreasonable to expect that a single compensating audio delay located at the transmitter or elsewhere will solve the problem. The most practical solution to the problem is to measure the video delay at every delaying device and correct the corresponding audio at that point with an audio synchronizer (variable audio delay) such as shown in Figure 5 (below).



Recently designed video synchronizers recognize this problem and have a delay control signal, known as a digital delay output (DDO) which provides a current delay value signal for use by a companion audio synchronizer. The audio synchronizer receives the DDO signal and automatically delays the audio signal by a corresponding amount thus keeping audio and video synchronized even though the total delay is changing.

Video delay measurement

Unfortunately, many other video delay devices such as noise reducers, color correctors, etc. do not have a DDO and until recently there has been no practical way to couple a companion audio delay to these devices.

Measurement of video delays is a fairly complex problem if you want to build a delay detector which can operate with a wide range of video equipment. By measuring the relative delay of input vertical sync and output vertical sync, it is possible to determine delays of up to one frame period. Many video devices, however, can have delays which range from zero to exactly one frame, or multiple frames. It is therefore difficult to determine if a delay is one frame, two frames or no frames when the input and output sync are near coincidence. Similarly, it is difficult to determine if the delay is ¼, 1¼, 2¼, etc. or to distinguish between any multiple frame delay ambiguities.
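The ambiguity can be illustrated with a few lines of Python. The sketch below models the sync comparison as the true delay taken modulo one frame period, which is an assumption about an idealized detector and not a description of any particular product; the function names are hypothetical.

```python
FRAME_MS = 1001 / 30   # NTSC frame period, about 33.37 ms


def sync_phase_ms(true_delay_ms: float, frame_ms: float = FRAME_MS) -> float:
    """Offset seen between input and output vertical sync: the delay modulo one frame."""
    return true_delay_ms % frame_ms


def candidate_delays_ms(phase_ms: float, max_frames: int = 3, frame_ms: float = FRAME_MS) -> list:
    """Every delay up to max_frames whole frames that would give the same sync phase."""
    return [phase_ms + n * frame_ms for n in range(max_frames + 1)]


# Delays of 1/4, 1 1/4 and 2 1/4 frames all produce the same sync phase.
for frames in (0.25, 1.25, 2.25):
    print(f"{frames} frames -> phase {sync_phase_ms(frames * FRAME_MS):.3f} ms")

print([round(d, 2) for d in candidate_delays_ms(sync_phase_ms(1.25 * FRAME_MS))])
```

The sync phase alone therefore yields only a list of candidate delays spaced one frame apart; some other information is needed to select the correct one.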

It would be possible to resolve multiple frame ambiguities by inserting a special code or signal in the video, for example in the vertical blanking interval. By checking the video to determine when the code goes into and later comes out of the video device, the delay can be measured. Unfortunately, this system will not operate with the many older devices which store only active video. For these older devices, the special code or signal would have to be inserted in active video, which is generally unacceptable.

One manufacturer sells a device which can measure the relative audio to video delay by gating a test tone into the audio channel at the same time a specific video pattern is gated into the video path. At the end of the transmission path the relative delay of the two is measured. This system however cannot be used when the video or audio path is active, and since many video paths are constantly changing there is no guarantee that any measured value will remain correct after a few minutes, let alone throughout an entire program.

PIXEL INSTRUMENTS’ approach to video delay measurements


LipTracker™ is a highly sophisticated computer application program that utilizes cutting edge technology in machine vision and machine hearing. It is a non-invasive measurement tool for lip sync analysis which operates in the same way as a human observer by listening to the audio and looking at the video to measure the lip sync error. LipTracker™ recognizes and compares selected sounds in the audio stream with the mouth shapes that create them in the video stream. Statistical analysis of these sounds and mouth shapes (called Mutual Events or MuEvs) produces a direct measurement of the lip sync error. This unique approach of analyzing real time video and audio does not require the insertion of cues, watermarks or codes into the program material. It connects to a system only as a node, and it is not within the main audio-video path. Therefore, LipTracker™ can be used for in-service testing at any point in the transmission path.

DD2100 – This delay detector takes the unique approach of comparing frames of active video which are input to the device with frames of video which come out of the device. Key samples of input and output active video frames are taken and correlated in a high speed 20 bit DSP circuit. By correlating input and output frame samples, the relative delay is measured.

Figure 6 (below) shows a conceptual block diagram of the Pixel Instruments DD2100 delay detector.



A given input frame is stored in memory and from that time onward all output frames are compared to the stored frame until an output frame which matches the stored frame is found. By counting the time which passes from the storing of the input frame until the matching output frame is detected, the coarse delay is obtained. In addition, the phase of input and output vertical sync is used to determine the fine delay. By adding the fine and coarse delays, an overall delay measurement is achieved to an accuracy of 102 µs.
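The frame matching idea can be sketched in Python as follows. This is a simplified illustration and not the DD2100 algorithm: the key-sample selection, the mean absolute difference used as the match score, the threshold and all function names are assumptions, and the sketch presumes that picture content changes from frame to frame.

```python
import random


def frame_signature(frame: list, step: int = 4) -> list:
    """Take key samples from a frame (here simply every step-th pixel value)."""
    return frame[::step]


def mismatch(sig_a: list, sig_b: list) -> float:
    """Mean absolute difference between two signatures; 0 means identical."""
    return sum(abs(a - b) for a, b in zip(sig_a, sig_b)) / len(sig_a)


def coarse_delay_frames(input_frames, output_frames, ref_index=0, threshold=1.0):
    """Count frames from storing input frame ref_index until a matching output frame."""
    stored = frame_signature(input_frames[ref_index])
    for k in range(ref_index, len(output_frames)):
        if mismatch(stored, frame_signature(output_frames[k])) <= threshold:
            return k - ref_index
    return None   # no match found within the frames supplied


def total_delay_ms(coarse_frames: int, fine_ms: float, frame_ms: float = 1001 / 30) -> float:
    """Add the coarse (whole frame) delay and the fine (sync phase) delay."""
    return coarse_frames * frame_ms + fine_ms


# Simulated device under test: delays random test frames by exactly 3 frames.
random.seed(1)
input_frames = [[random.randrange(256) for _ in range(64)] for _ in range(10)]
black = [0] * 64
output_frames = [black] * 3 + input_frames[:-3]

coarse = coarse_delay_frames(input_frames, output_frames)
print(coarse, round(total_delay_ms(coarse, fine_ms=8.0), 2))
```

Here the fine delay of 8.0 ms is an arbitrary example value standing in for the measured sync phase.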

As shown in Figure 7 (below), the DD2100 delay detector is very easy to add to an existing system, requiring only that input and output video be looped through the inputs.



A front panel LCD display provides delay and update information as well as alarm messages in the event of problems. No modification to the video signal or the video device is required. The DD2100 detects delays up to 8.99 fields with 525 or 625 line signals and provides a DDO signal which is fed to the companion AD2100 or AD3100 audio synchronizer to make the appropriate audio corrections.

The variable audio delay

All of the currently available solutions to the audio to video synchronization problem use adjustable audio delays at some point in the system to delay the audio to match the delayed video. While the methods of determining the needed delay vary, the adjustable audio delay remains a key element. Several problems arise with these audio delays, centering mainly around the problem of making adjustments to the delay length which are imperceptible to the viewer.

It should be remembered that all video synchronizers operate in a fashion which sooner or later either repeats or deletes a field or frame of video. This is known as a pointer crossing, and occurs when the video memory either fills or empties. At the pointer crossing, the video delay value takes a memory sized jump (usually a frame or two) which requires the audio delay to take on this new, greatly different value. Unfortunately, there are no audio frames which can be repeated or deleted, since the audio signal is continuous. Therefore, a continuously variable delay change is needed. Past variable audio delays operated to adjust the delay by jumping memory addresses. The corresponding pops and clicks were then masked either by limiting the jumps to periods of relative silence (5) or by using two delay lines, jumping the length of the one not currently supplying audio, and fading from the delay line in use to the newly updated delay line (6). These variable delay methods still generated a considerable number of undesirable artifacts for normal program audio.

The memory technology needed to obtain a continuously variable delay and a continuous audio signal is not trivial. As previously described, it is not possible to change the delay by simply jumping to a new memory read or write address as is done in the video synchronizer. To do so would cause audio samples to be lost or repeated, creating a corresponding pop or click in the audio. These unwanted artifacts are impossible to hide.
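A small Python experiment shows why an address jump is audible. It delays a steady tone, changes the delay abruptly at one sample, and reports the largest sample-to-sample step in the output. The tone (48 samples per cycle, for example 1 kHz sampled at 48 kHz) and the delay values are assumptions chosen for illustration.

```python
import math

SAMPLES_PER_CYCLE = 48   # assumed: e.g. a 1 kHz tone sampled at 48 kHz


def tone(k: int) -> float:
    """Sample k of a unit-amplitude sine wave."""
    return math.sin(2 * math.pi * k / SAMPLES_PER_CYCLE)


def delayed_output(n_samples: int, delay_before: int, delay_after: int, jump_at: int) -> list:
    """Read the tone through a delay that jumps from delay_before to delay_after."""
    out = []
    for n in range(n_samples):
        delay = delay_before if n < jump_at else delay_after
        out.append(tone(n - delay))
    return out


def largest_step(signal: list) -> tuple:
    """Return (index, size) of the largest difference between adjacent samples."""
    steps = [abs(b - a) for a, b in zip(signal, signal[1:])]
    worst = max(range(len(steps)), key=steps.__getitem__)
    return worst + 1, steps[worst]


smooth = delayed_output(1000, 100, 100, 500)   # constant delay
jumped = delayed_output(1000, 100, 124, 500)   # delay jumps by 24 samples

print("constant delay:", largest_step(smooth))
print("jumped delay:  ", largest_step(jumped))
```

With a constant delay the largest step is about 0.13 of full amplitude, while the jumped version shows a step of nearly 1.8 at the sample where the address changes, which is heard as a click.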

The audio delay technology which is most successful in obtaining quality performance uses a memory which stores every audio sample which is taken by the A-D and reads every audio sample which is stored once and only once. In order to accomplish this task, the memory must have completely decoupled and asynchronous reading and writing functions, so that the reading rate can be faster or slower than the storing rate. By varying the reading rate with respect to the storing rate the delay time can be controlled, by causing the reading to catch up with the storing (to decrease delay) or to lag behind the storing (to increase the delay). This method works very well for making slowly changing delay adjustments.
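The decoupled read and write behavior can be modeled in software as a delay line whose read pointer advances at a slightly different rate from the write pointer. The Python sketch below is only a model of the principle under stated assumptions: it uses a fractional read position with linear interpolation in place of a separately clocked hardware read, stores samples in an unbounded list where a real device would use a ring buffer, and all names are hypothetical.

```python
class VariableDelayLine:
    """Delay line whose read rate can differ from its write rate."""

    def __init__(self, initial_delay_samples: float):
        self.samples = []                               # every written sample is kept
        self.read_pos = -float(initial_delay_samples)   # fractional read position

    def write(self, sample: float) -> None:
        """Store one input sample (one call per input sample period)."""
        self.samples.append(sample)

    def read(self, rate: float = 1.0) -> float:
        """Return one output sample, then advance the read position by rate."""
        pos = self.read_pos
        self.read_pos += rate
        if pos < 0 or not self.samples:
            return 0.0                                  # nothing stored that far back yet
        i = int(pos)
        if i + 1 >= len(self.samples):
            return self.samples[-1]                     # reading has caught up with writing
        frac = pos - i
        return self.samples[i] * (1.0 - frac) + self.samples[i + 1] * frac

    @property
    def delay_samples(self) -> float:
        """Current delay: distance between the write and read positions."""
        return len(self.samples) - self.read_pos


# Read 0.5% faster than writing: the delay shrinks by 0.005 samples per sample.
line = VariableDelayLine(initial_delay_samples=100)
for n in range(1000):
    line.write(float(n))
    line.read(rate=1.005)
print(round(line.delay_samples, 3))   # about 95 samples after 1000 samples
```

Reading faster than writing shortens the delay smoothly with no sample discontinuity, and a rate below 1.0 lengthens it in the same way.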

Unfortunately, if the reading is consistently faster or slower than the storing, a pitch change will occur in the audio, similar to when an audio tape is played off speed. To minimize the pitch change and make it unnoticeable to the viewer, it is necessary to limit the differential rate between memory writing and reading to keep the associated audio pitch change very small. Typically, limiting these changes to around 0.5% limits any pitch changes to amounts which the majority of viewers do not notice.

Unfortunately, with a small differential, the amount of time required to change large delay settings is correspondingly large, and the audio can be out of sync for several seconds, or even minutes after a pointer crossing.
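The trade-off can be put in numbers with a short Python sketch. It assumes the read rate differs from the write rate by a fixed fraction for the whole adjustment; the function names are hypothetical, and the pitch shift is expressed in cents (hundredths of a semitone) using the standard definition.

```python
import math

NTSC_FRAME_MS = 1001 / 30   # about 33.37 ms


def slew_time_s(delay_change_ms: float, rate_differential: float) -> float:
    """Seconds needed to change the delay by delay_change_ms at a fixed rate differential."""
    return abs(delay_change_ms) / 1000.0 / rate_differential


def pitch_shift_cents(rate_differential: float) -> float:
    """Pitch change caused by reading (1 + rate_differential) times faster than writing."""
    return 1200.0 * math.log2(1.0 + rate_differential)


# A 4 frame pointer crossing corrected at a 0.5% differential.
change_ms = 4 * NTSC_FRAME_MS
print(f"{change_ms:.1f} ms change takes {slew_time_s(change_ms, 0.005):.1f} s")
print(f"pitch shift during the change: {pitch_shift_cents(0.005):.1f} cents")
```

At a 0.5% differential, a 4 frame change of about 133 ms takes roughly 27 seconds to complete, during which the pitch is shifted by a little under 9 cents.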

It is possible to improve the rate of delay adjustment by changing the differential rate in response to the audio signal content, since larger ratios may be tolerated if no high frequency audio is present, or if there are periods of silence. Modulating the rate with the audio content does not provide a consistent significant improvement, and frequently is of no advantage if the program material has a musical background.

In order both to minimize perceptible pitch shifts during normal small delay changes and to allow rapid delay change after pointer crossings, it is necessary that the audio synchronizer incorporate a pitch correction circuit. With pitch correction capability, it is possible to make rapid delay changes, with the pitch correction circuit reducing the corresponding audio pitch artifacts to a level where they go unnoticed by the viewer. One audio synchronizer that incorporates pitch correction is the Pixel Instruments AD3100, which provides a high performance pitch shifting capability in addition to the audio synchronizing functions. The AD3100 can make a one minute delay change in 10 seconds with no perceptible pitch artifacts being generated for normal program audio with background music. In typical applications, a 4 frame pointer crossing change can be achieved in under ½ second.

In addition to achieving fast delay corrections for use in conjunction with pointer crossings of video synchronizers, the AD3100 can also be used on line to compensate for instant delay changes in video production systems. For example the AD3100 can be tied to a production switcher to automatically correct for the change in video delay when various video effects are switched in and out of the path. It can be used to compensate for color corrector and noise reducer delays, or for the constantly changing delay of MPEG compression and decompression channels. It can also be tied to the tally system to compensate for different camera to microphone distances, thus maintaining the psychological effect of receiving delayed sound at longer camera distances from the source.

Summary

Audio to video synchronization has been proven to affect audience perception of the quality of programming and thus is a critical performance parameter for television facilities. System complexities make it impractical to totally prevent such problems and create a considerable number of associated technical challenges. The difficult problems of measuring video delays and correcting audio signals to keep proper synchronization are made easier by the use of video delay detectors and pitch correcting audio synchronizers.










(1) Dr. Byron Reeves & Dave Voelker, research report, Effects of Audio-Video Asynchrony on Viewer's Memory, Evaluation of Content and Detection Ability (1993)
(2) International Telecommunication Union Document 10C/32-E, 11A/43-E, 11C/40E, CMTT-C/18-E 5 October 1993
(3) International Telecommunication Union Document 11A/47-E, 13 October 1993
(4) NAB Engineering Handbook, Television Signal Transmission Standards (Washington, D.C.: National Association of Broadcasters), 621
(5) U.S. Patent 4,618,890
(6) U.S. Patent 4,644,400