The objective of audio-visual separation is to separate the different sounds emitted by the corresponding objects, while audio-visual localization mainly focuses on localizing a sound within a visual context. As shown in Fig. 2, we classify this task by the identity of the sound source: speakers (Fig. 2(a)) and objects (Fig. 2(b)). The former concentrates on a person's speech and can be used, e.g., in television programs to enhance the target speaker's voice, while the latter is a more general and challenging task that separates the sounds of arbitrary objects rather than speakers only. In this section, we provide an overview of these two tasks, examining the motivations, network architectures, advantages, and disadvantages, as shown in Tables 1 and 2.
Figure 2. Illustration of audio-visual separation and localization task. Paths 1 and 2 denote separation and localization tasks, respectively.
| Category | Method | Ideas & strengths | Weaknesses |
|---|---|---|---|
| Speaker separation | Gabbay et al. | Predict the speaker's voice based on faces in video, used as a filter | Can only be used in controlled environments |
| | Afouras et al. | Generate a soft mask for filtering in the wild | − |
| | Lu et al. | Distinguish the correspondence between speech and human lip movements | Two speakers only; hardly applicable with background noise |
| | Ephrat et al. | Predict a complex spectrogram mask for each speaker; trained once, applicable to any speaker | The model is complicated and lacks explanation |
| | Gu et al. | Uses all information about speakers; robust | Complex network; plenty of preparation |
| | Zhu and Rahtu | Strong capacity of sub-networks; single image | Small scope of application |
| | Morrone et al. | Use landmarks to generate time-frequency masks | Additional landmark detection required |
| Separate and localize objects' sounds | Gao et al. | Disentangle audio frequencies related to visual objects | Separates audio only |
| | Senocak et al. | Focus on the primary area by using attention | Localizes the sound source only |
| | Tian et al. | Joint modeling of auditory and visual modalities | Localizes the sound source only |
| | Pu et al. | Use low rank to extract the sparsely correlated components | Not for in-the-wild environments |
| | Zhao et al. | Mix and separate a given audio; no traditional supervision | Motion information is not considered |
| | Zhao et al. | Introduce motion trajectory and curriculum learning | Only suitable for synchronized video and audio input |
| | Sharma et al. | State-of-the-art detection for unconstrained entertainment-media videos | Requires additional audio-visual detection; localizes the sound source only |
| | Sun et al. | 3D space; low computational complexity | − |
| | Rouditchenko et al. | Separation and localization using only one modality as input | Does not fully utilize temporal information |
| | Parekh et al. | Weakly supervised learning via multiple-instance learning | Only a bounding box proposal on the image |
Table 1. Summary of recent audio-visual separation and localization approaches
| Category | Method | Dataset | Result |
|---|---|---|---|
| Speaker separation | Gabbay et al. | GRID and TCD TIMIT | SAR: 9.49 (on GRID) |
| | Afouras et al. | LRS2 and VoxCeleb2 | − |
| | Lu et al. | WSJ0 and GRID | SAR: 10.11 (on GRID) |
| | Morrone et al. | GRID and TCD TIMIT | PESQ: 2.45 (on TCD TIMIT) |
| Separate and localize | Gao et al. | AudioSet | SDR: 2.53 |
| | Senocak et al. | Based on Flickr-SoundNet | − |
| | Tian et al. | Subset of AudioSet | Prediction accuracy: 0.727 |
| | Sharma et al. | Movies | Recall: 0.5129 |
Table 2. A quantitative study on audio-visual separation and localization
The speaker separation task, also known as the cocktail party problem, is challenging. It aims to isolate a single speech signal in a noisy scene. Some studies tried to solve the problem of audio separation with only the audio modality and achieved exciting results[18, 19]. More advanced approaches[9, 11] utilized visual information to aid the speaker separation task and significantly surpassed single-modality methods. Early attempts leveraged mutual information to learn the joint distribution between audio and video[20, 21]. Subsequently, several methods focused on analyzing videos containing salient motion signals and the corresponding audio events (e.g., a mouth starting to move or a hand on a piano suddenly accelerating)[22, 23].
Gabbay et al. proposed isolating the voice of a specific speaker and eliminating other sounds in an audio-visual manner. Instead of directly extracting the target speaker's voice from the noisy sound, which may bias the training model, the researchers first fed the video frames into a video-to-speech model and predicted the speaker's voice from the facial movements captured in the video. Afterwards, the predicted voice was used to filter the mixture of sounds, as shown in Fig. 3. Although Gabbay et al. improved the quality of the separated voice by adding the visual modality, their approach was only applicable in controlled environments.
In contrast to previous approaches that require training a separate model for each speaker of interest (speaker-dependent models), recent research focuses on obtaining intelligible speech in unconstrained environments. Afouras et al. proposed a deep audio-visual speech enhancement network to separate the voice of the speaker in a given lip region by predicting both the magnitude and phase of the target signal. They treated the spectrograms as temporal signals rather than images. Additionally, instead of directly predicting clean signal magnitudes, they also tried to generate a more effective soft mask for filtering. Ephrat et al. proposed a speaker-independent model that was trained only once and was then applicable to any speaker. This approach even outperformed state-of-the-art speaker-dependent audio-visual speech separation methods. The model consists of multiple visual streams and one audio stream, concatenating the features from the different streams into a joint audio-visual representation. This feature is further processed by a bidirectional long short-term memory (LSTM) network and three fully connected layers. An elaborate spectrogram mask is then learned for each speaker and multiplied by the noisy input; finally, the result is converted back to waveforms to obtain an isolated speech signal for each speaker. Lu et al. designed a network similar to that of Ephrat et al. The difference is that Lu et al. enforced an audio-visual matching network to distinguish the correspondence between speech and human lip movements, thereby obtaining clear speech.
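To make the mask-and-filter recipe concrete, the sketch below shows a minimal separation head of this kind in PyTorch. It is an illustration only, with placeholder dimensions and module names rather than the exact architecture of Ephrat et al.; in particular, a plain elementwise product stands in for the full complex-mask multiplication.

```python
import torch
import torch.nn as nn

class MaskSeparationHead(nn.Module):
    """Hypothetical mask-based separation head: a BLSTM over fused
    audio-visual features, followed by three fully connected layers,
    predicts one spectrogram mask per speaker."""
    def __init__(self, feat_dim=512, freq_bins=257, num_speakers=2):
        super().__init__()
        self.blstm = nn.LSTM(feat_dim, 200, bidirectional=True,
                             batch_first=True)
        self.fc = nn.Sequential(
            nn.Linear(400, 600), nn.ReLU(),
            nn.Linear(600, 600), nn.ReLU(),
            # 2 channels approximate the real/imaginary parts of a complex mask
            nn.Linear(600, num_speakers * freq_bins * 2),
        )
        self.num_speakers, self.freq_bins = num_speakers, freq_bins

    def forward(self, av_features, noisy_spec):
        # av_features: (batch, time, feat_dim) joint audio-visual features
        # noisy_spec:  (batch, time, freq_bins, 2) STFT of the noisy mixture
        h, _ = self.blstm(av_features)
        masks = self.fc(h).view(h.size(0), h.size(1),
                                self.num_speakers, self.freq_bins, 2)
        # each speaker's spectrogram = predicted mask applied to the mixture
        # (simplified elementwise product; complex multiplication omitted)
        return masks * noisy_spec.unsqueeze(2)
```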
Instead of directly utilizing video as a condition, Morrone et al. further introduced facial landmarks as a fine-grained feature to generate time-frequency masks that filter the mixed-speech spectrogram.
While the speaker separation task matches a specific lip movement against a noisy environment, humans focus more on objects when dealing with sound separation and localization. Here it is difficult to find a clear correspondence between the audio and visual modalities, because the sound priors of different objects are hard to model.
The early attempts to solve this localization problem can be traced back to 2000 and a study that synchronized low-level features of sounds and videos. Fisher et al. later proposed a nonparametric approach to learn a joint distribution of visual and audio signals and then project both of them into a learned subspace. Furthermore, several acoustic-based methods[28, 29] required specific devices for surveillance and instrument engineering, such as microphone arrays used to capture the differences in the arrival of sounds.
To learn audio source separation from large-scale in-the-wild videos containing multiple audio sources per video, Gao et al. suggested learning an audio-visual localization model from unlabeled videos and then exploiting the visual context for audio source separation. Their approach relied on a multi-instance multi-label learning framework to disentangle the audio frequencies related to individual visual objects, even without observing or hearing them in isolation. The framework was fed a bag of audio basis vectors for each video and produced a bag-level prediction of the objects present in the audio.
Information about the target speaker, such as lip movement, tone, and spatial location, is helpful for sound separation tasks. Therefore, Gu et al. exploited all of this information, obtaining semantic information for each modality via a fusion method based on factorized attention. Similarly, Zhu and Rahtu attempted to add extra information (appearance) by introducing an appearance attention module to separate the different semantic representations.
Instead of only separating audio, can machines localize the sound source merely by observing sound and visual scene pairs, as a human can? There is evidence in both physiology and psychology that the localization of acoustic signals is strongly influenced by the synchronicity of their visual signals. Past efforts in this domain were limited by the need for specific devices or additional features. Izadinia et al. proposed utilizing the velocity and acceleration of moving objects as visual features to assign sounds to them. Zunino et al. presented a new hybrid device for sound and optical imaging that was primarily suitable for automatic monitoring.
As the number of unlabeled videos on the Internet has increased dramatically, recent methods mainly focus on unsupervised learning. Additionally, modeling the audio and visual modalities simultaneously tends to outperform independent modeling. Senocak et al. learned to localize sound sources by merely watching and listening to videos. The model mainly consisted of three networks: a sound network, a visual network, and an attention network trained via an unsupervised distance-ratio loss.
Attention mechanisms cause the model to focus on the primary area and provide prior knowledge in a semi-supervised setting. As a result, the network can be converted into a unified one that learns better from data without additional annotations. To enable cross-modality localization, Tian et al. proposed capturing the semantics of sound-emitting objects via learned attention and leveraging temporal alignment to discover the correlations between the two modalities.
To make full use of media content in multiple modalities, Sharma et al. proposed a novel network consisting of 3D convolutional neural networks (3D CNNs) and bidirectional long short-term memory networks (BiLSTMs) to fuse the complementary information of the two modalities in a weaker-than-full-supervision fashion.
Sound source separation and localization can be strongly associated with each other by assigning one modality's information to another. Therefore, several researchers attempted to perform localization and separation simultaneously. Pu et al. used a low-rank and sparse framework to model the background and extracted components with sparse correlations between the audio and visual modalities. However, this method had a major limitation: it could only be applied to videos with a few sound-generating objects. Therefore, Zhao et al. introduced a system called PixelPlayer that used a two-stream network and presented a mix-and-separate framework to train the entire network. In this framework, the audio signals from two different videos were added to produce a mixed signal used as input. The input was fed into the network, which was trained to separate the audio source signals conditioned on the corresponding video frames; the two separated sound signals were treated as outputs. The system thus learned to separate individual sources without traditional supervision.
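A minimal sketch of this self-supervised recipe is given below; `stft` stands for any spectrogram transform, and the function names are illustrative rather than PixelPlayer's actual interface.

```python
import torch

def mix_and_separate_batch(audio_a, audio_b, stft):
    """Build one mix-and-separate training example: sum the audio
    tracks of two unrelated videos to get a synthetic mixture, and
    keep the original tracks as free ground-truth separation targets."""
    mixture = audio_a + audio_b                 # synthetic mixed signal
    input_spec = stft(mixture)                  # network input
    targets = (stft(audio_a), stft(audio_b))    # self-supervised labels
    return input_spec, targets
```

The separation network then predicts, conditioned on each video's frames, the component of `input_spec` belonging to that video and is scored against `targets`, so no human annotation is required.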
Instead of merely relying on image semantics and ignoring the temporal motion information in the video, Zhao et al. subsequently proposed an end-to-end network called deep dense trajectory to learn the motion information for audio-visual sound separation. Furthermore, due to the lack of training samples, directly separating sound for a single class of instruments tends to lead to overfitting. Therefore, Zhao et al. further proposed a curriculum strategy, starting by separating sounds from different instruments and proceeding to sounds from the same instrument. This gradual approach gave the network a good starting point to converge better on the separation and localization tasks.
The methods of previous studies[23, 39, 40] could only be applied to videos with synchronized audio. Hence, Rouditchenko et al. tried to perform localization and separation tasks using only video frames or sound by disentangling concepts learned by neural networks. The researchers proposed an approach to produce sparse activations that could correspond to semantic categories in the input using the sigmoid activation function during the training stage and softmax activation during the fine-tuning stage. Afterwards, the researchers assigned these semantic categories to intermediate network feature channels using labels available in the training dataset. In other words, given a video frame or a sound, the approach used the category-to-feature-channel correspondence to select a specific type of source or object for separation or localization. Aiming to introduce weak labels to improve performance, Parekh et al. designed an approach based on multiple-instance learning, a well-known strategy for weakly supervised learning.
Inspired by the human auditory system, which accepts information selectively, Sun et al. proposed a metamaterial-based single-microphone listening system (MSLS) to localize and separate a fixed sound signal in 3D space. The core of the system is a metamaterial enclosure consisting of multiple second-order acoustic filters that determine the frequency response in different directions.
Speaker voice separation has achieved great progress in various specific fields in the past decades, especially in the audio-only modality. Introducing the visual modality has improved performance and broadened application scenarios. Because there is an explicit pattern between voice and video (for example, the lip movement of the target speaker is highly related to the voice), recent efforts tend to leverage this pattern in sound separation tasks. However, it is hard to capture an explicit pattern between audio and visual modalities in more general tasks, such as object sound separation and localization. Therefore, researchers have introduced effective strategies (such as sparse correlation, temporal motion information, and multiple-instance learning) and more powerful networks for this task.
In this section, we introduce several studies that explored the global semantic relation between audio and visual modalities. We name this branch of research audio-visual correspondence learning; it consists of 1) the audio-visual matching task and 2) the audio-visual speech recognition task. We summarize the advantages and disadvantages in Tables 3 and 4.
| Category | Method | Ideas & strengths | Weaknesses |
|---|---|---|---|
| Voice-face matching | Nagrani et al. | The method is novel and incorporates dynamic information | As the sample size increases, the accuracy decreases excessively |
| | Wen et al. | The correlation between modalities is utilized | Dataset acquisition is difficult |
| | Wang et al. | Can deal with multiple samples; can change the size of the input | Static images only |
| | Hoover et al. | Easy to implement; robust and efficient | Cannot handle large-scale data |
| | Zheng et al. | Adversarial learning and metric learning are leveraged to explore better feature representations | No high-level semantic information is taken into account |
| Audio-visual retrieval | Hong et al. | Preserves modality-specific characteristics; soft intra-modality structure loss | Complex network |
| | Sanguineti et al. | Acoustic images contain more information; simple and efficient | Complex three-branch network; lack of details in some places |
| | Takashima et al. | Uses CCA instead of distance | Unclear details |
| | Surís et al. | Metric learning; uses fewer parameters | Static images |
| | Zeng et al. | Considers mismatched pairs; exploits negative examples | Complex network |
| | Chen et al. | Deals with remote sensing data; low memory and fast retrieval | Lack of remote sensing data |
| | Arsha et al. | Curriculum learning; low data cost | Low accuracy for multiple samples |
| Audio-visual speech recognition | Petridis et al. | Simultaneously obtains features and classification | Lack of audio information |
| | Wand et al. | LSTM; word-level | − |
| | Chung et al. | Audio and visual information | The dataset is not guaranteed to be clean |
| | Zhang et al. | Novel FBP; state-of-the-art | The experimental part is too simple |
| | Shillingford et al. | Sentence-level | No audio information |
| | Zhou et al. | Anti-noise; simple and effective | Insufficient innovation |
| | Tao et al. | Novel idea | Insufficient contributions |
| | Makino et al. | Large audio-visual dataset | Low practical value |
| | Trigeorgis et al. | Audio information; the algorithm is robust | Noise is not considered |
| | Afouras et al. | Studies noise in audio | − |
Table 3. Summary of audio-visual correspondence learning
| Category | Method | Dataset | Result |
|---|---|---|---|
| Voice-face matching | Nagrani et al. | VGGFace and VoxCeleb | V-F: 0.81 |
| | Wen et al. | VGGFace and VoxCeleb | V-F: 0.84 |
| | Hoover et al. | LibriVox | Accuracy: 0.71 |
| Audio-visual retrieval | Surís et al. | Subset of YouTube-8M | Audio-video recall: 0.631 |
| | Nagrani et al. | VoxCeleb | − |
| Audio-visual speech recognition | Chung et al. | LRS | − |
| | Trigeorgis et al. | RECOLA | MSE: 0.684 |
| | Afouras et al. | LRS2-BBC | − |
Table 4. A quantitative study on correspondence learning
Biometric authentication, ranging from facial recognition to fingerprint and iris authentication, is a popular topic that has been researched for many years; however, evidence shows that such systems can be attacked maliciously. To detect such attacks, recent studies have particularly focused on speech anti-spoofing measures.
Sriskandaraja et al. proposed a network based on a Siamese architecture to evaluate the similarities between pairs of speech samples. Białobrzeski et al. presented a two-stream network, where the first network was a Bayesian neural network assumed to be overfitting, and the second network was a CNN used to improve generalization. Gomez-Alanis et al. further incorporated LightCNN and a gated recurrent unit (GRU) as a robust feature extractor to represent speech signals in utterance-level analysis to improve performance.
We note that cross-modality matching is a special form of such authentication that has recently been extensively studied. It attempts to learn the similarity between pairs. We divide this matching task into fine-grained voice-face matching and coarse-grained audio-image retrieval.
Given facial images of different identities and the corresponding audio sequences, voice-face matching aims to identify the face that an audio clip belongs to (the V2F task) or vice versa (the F2V task), as shown in Fig. 4. The key point is finding the embedding between the audio and visual modalities. Nagrani et al. proposed three networks to address the audio-visual matching problem: a static network, a dynamic network, and an N-way network. The static and dynamic networks could only handle the problem for a specific number of images and audio tracks; the difference was that the dynamic network added temporal information, such as optical flow or a 3D convolution[50, 51], to each image. Based on the static network, Nagrani et al. increased the number of samples to form an N-way network able to solve the N:1 identification problem.
Figure 4. Demonstration of audio-to-image retrieval (The blue arrows) and image-to-audio retrieval (The green arrows).
However, the correlation between the two modalities was not fully utilized in the above method. Therefore, Wen et al. proposed disjoint mapping networks (DIMNets) to make full use of covariates (e.g., gender and nationality)[53, 54] to bridge the relation between voice and face information. The intuitive assumption was that, for a given voice and face pair, the more covariates shared between the two modalities, the higher the probability of a match. The main drawback of this framework was that a large number of covariates led to high data costs. Therefore, Hoover et al. suggested a low-cost but robust approach based on detecting and clustering audio clips and facial images. For the audio stream, the researchers applied a neural network model to detect speech for clustering and subsequently assigned a frame cluster to the given audio cluster according to the majority principle. Doing so required only a small amount of data for pretraining.
To further enhance the robustness of the network, Chung et al. proposed an improved two-stream training method that increased the number of negative samples to improve the error tolerance of the network. The cross-modality matching task, which is essentially a classification task, allows for wide-ranging applications of the triplet loss. However, the triplet loss is fragile in the case of multiple samples. To overcome this defect, Wang et al. proposed a novel loss function that expands the triplet loss to multiple samples, along with a new elastic network (called EmNet) based on a two-stream architecture that tolerates a variable number of inputs to increase the flexibility of the network. Most recently, Zheng et al. proposed a novel adversarial-metric learning model that generates a modality-independent representation for each individual in each modality via adversarial learning, while learning a robust similarity measure for cross-modality matching via metric learning.
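As a point of reference, the triplet objective that most of these matching models build on can be sketched as follows; this is the generic formulation, not any one paper's exact loss, and the margin value is an arbitrary placeholder.

```python
import torch
import torch.nn.functional as F

def voice_face_triplet_loss(voice_emb, face_pos, face_neg, margin=0.6):
    """Pull the matching identity's face embedding toward the voice
    embedding and push a mismatched identity's face at least `margin`
    farther away. All inputs: (batch, embedding_dim)."""
    d_pos = F.pairwise_distance(voice_emb, face_pos)  # same identity
    d_neg = F.pairwise_distance(voice_emb, face_neg)  # different identity
    return F.relu(d_pos - d_neg + margin).mean()
```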
The cross-modality retrieval task aims to discover the relationship between different modalities: given a sample in the source modality, the model retrieves the corresponding sample with the same identity in the target modality. Taking audio-image retrieval as an example, the aim is to return a relevant piano sound given a picture of a girl playing the piano. Compared with the previously considered voice-face matching, this task is more coarse-grained.
Unlike other retrieval tasks, such as text-image[59-61] or sound-text retrieval, the audio-visual retrieval task mainly focuses on subspace learning. Surís et al. proposed a new joint embedding model that mapped the two modalities into a joint embedding space and then directly calculated the Euclidean distance between them. They also leveraged cosine similarity to ensure that the two modalities in the same space were as close as possible while not overlapping. Note that the designed architecture had a large number of parameters due to its many fully connected layers.
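In schematic form, retrieval in such a joint space reduces to projecting both modalities and ranking by similarity, as in the hedged sketch below; `audio_proj` and `image_proj` stand for the assumed modality-specific projection networks, not the architecture of any specific paper.

```python
import torch
import torch.nn.functional as F

def retrieve(audio_query, image_gallery, audio_proj, image_proj):
    """Rank gallery images for one audio query in a shared embedding
    space. audio_query: (1, da); image_gallery: (n, dv)."""
    q = F.normalize(audio_proj(audio_query), dim=-1)    # (1, d) joint space
    g = F.normalize(image_proj(image_gallery), dim=-1)  # (n, d) joint space
    scores = (g @ q.t()).squeeze(1)                     # cosine similarities
    return scores.argsort(descending=True)              # best match first
```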
Hong et al. proposed a joint embedding model that relied on pre-trained networks and used CNNs to replace fully connected layers to reduce the number of parameters to some extent. The video and music were fed to the pre-trained network and then aggregated, followed by a two-stream network trained via the inter-modal ranking loss. In addition, to preserve modality-specific characteristics, the researchers proposed a novel soft intra-modal structure loss. However, the resulting network was very complex and difficult to apply in practice. To solve this problem, Nagrani et al. proposed a cross-modality self-supervised method to learn the embedding of audio and visual information from a video and significantly reduced the complexity of the network. For sample selection, Nagrani et al. designed a novel curriculum learning schedule to further improve performance. In addition, the resulting joint embedding could be efficiently and effectively applied in practical applications.
Unlike the above works, which only considered matching pairs, Zeng et al. further focused on mismatched pairs and proposed a novel deep triplet neural network with cluster-based canonical correlation analysis in a two-stream architecture. Rather than designing a model based on a two-stream structure, Sanguineti et al. introduced an extra modality, acoustic images, which contain abundant information. They aligned the three modalities in time and space and took advantage of this correlation to learn more powerful audio-visual representations via knowledge distillation. Departing from approaches focused on faces and audio, Chen et al. proposed deep image-voice retrieval (DIVR) to deal with remote sensing images. During training, they followed the idea of the triplet loss and additionally minimized the distance between hash-like codes and hash codes to reduce the quantization error.
Music-emotion retrieval is an interesting topic within the audio-image retrieval task. Takashima et al. proposed a deep canonical correlation analysis (DeepCCA) model that maximizes the correlation between the two modalities in a projection space via CCA rather than by computing distances.
The recognition of the content of a given speech clip (for example, predicting the emotion based on the given speech) has been studied for many years, yet despite great achievements, researchers are still aiming for satisfactory performance in challenging scenarios. Due to the correlation between audio and vision, combining these two modalities tends to offer more prior information. For example, one can predict the scene where the conversation took place, which provides a strong prior for speech recognition, as shown in Fig. 5.
Earlier efforts on audio-visual fusion models usually consisted of two steps: 1) extracting features from the image and audio signals and 2) combining the features for joint classification[71-73]. Later, taking advantage of deep learning, feature extraction was replaced with a neural network encoder[74-76]. Several recent studies have shown a tendency to use an end-to-end approach to visual speech recognition. These studies can be mainly divided into two groups: they either leverage fully connected layers and LSTMs to extract features and model the temporal information[77, 78] or use a 3D convolutional layer followed by a combination of CNNs and LSTMs[79, 80]. However, LSTM-based models are normally computationally expensive. To this end, Makino et al. proposed a large-scale system based on a recurrent neural network transducer (RNN-T) architecture to evaluate the performance of the RNN model.
Instead of a two-step strategy, Petridis et al. introduced a new audio-visual model that simultaneously extracts features directly from pixels and classifies speech, followed by a bidirectional LSTM module to fuse the audio and visual information. Similarly, Wand et al. presented a word-level lip-reading system using LSTM; however, this work only conducted experiments on a lab-controlled dataset. In contrast to previous methods, Assael et al. proposed an end-to-end LipNet model based on sentence-level sequence prediction, which consisted of spatio-temporal convolutions and a recurrent network, trained via the connectionist temporal classification (CTC) loss. Experiments showed that lip-reading outperformed the two-step strategy.
However, the limited information in the visual modality may lead to a performance bottleneck. To combine both audio and visual information for various scenes, especially in noisy conditions, Trigeorgis et al. introduced an end-to-end model to obtain a context-aware feature from the raw temporal representation.
Chung et al. presented a Watch, Listen, Attend, and Spell (WLAS) network to study the influence of audio on the recognition task. The model took advantage of a dual attention mechanism and could operate on a single modality or on both combined. To speed up training and avoid overfitting, the researchers also used a curriculum learning strategy. To analyze an in-the-wild dataset, Nussbaum-Thom et al. proposed another model based on residual networks and a bidirectional GRU. However, they did not take the ubiquitous noise in audio into account. To solve this problem, Afouras et al. proposed a model for the speech recognition task and compared two common sequence prediction types: connectionist temporal classification (CTC) and sequence-to-sequence (seq2seq) methods. In their experiments, they observed that the seq2seq model performed better in terms of word error rate (WER) when provided with silent videos only. For audio-only or audio-visual tasks, the two methods behaved similarly. In a noisy environment, the performance of the seq2seq model was worse than that of the corresponding CTC model, suggesting that the CTC model could better handle background noise.
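For readers unfamiliar with the CTC side of this comparison, the snippet below shows how the CTC objective is typically wired up in PyTorch; all tensor sizes are placeholders, and the random tensors merely stand in for real network outputs and transcripts.

```python
import torch
import torch.nn as nn

T, N, C = 50, 4, 30           # time steps, batch size, characters (blank = 0)
log_probs = torch.randn(T, N, C).log_softmax(2)    # frame-level outputs
targets = torch.randint(1, C, (N, 20))             # character transcripts
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.randint(10, 21, (N,), dtype=torch.long)

# CTC marginalizes over all monotonic alignments between the frame-level
# outputs and the shorter target transcript, so no alignment labels are needed.
loss = nn.CTCLoss(blank=0)(log_probs, targets, input_lengths, target_lengths)
```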
Recent works have introduced attention mechanisms to highlight significant information contained in the audio or visual representations. Zhang et al. proposed factorized bilinear pooling (FBP) to learn the features of the respective modalities via an embedded attention mechanism and then integrate the complex associations between audio and video for the audio-video emotion recognition task. Zhou et al. focused on the features of the respective modalities through a multimodal attention mechanism that exploits the importance of both modalities to obtain a fused representation. Compared with previous works, which focused on the features of each modality, Tao et al. paid more attention to the network itself and proposed a cross-modal discriminative network called VFNet to establish the relationship between audio and face via a cosine loss.
Representation learning between modalities is crucial in audio-visual correspondence learning. One can add supplementary information (e.g., mutual information or temporal information) or adjust the structure of the network, such as using RNNs and LSTMs, enriching the model structure, or preprocessing the input, to obtain better representations.
The previously introduced retrieval task shows that a trained model is able to find the most similar audio or visual counterpart. While humans can imagine the scenes corresponding to sounds and vice versa, researchers have tried for many years to endow machines with this kind of imagination. Following the invention and advances of generative adversarial networks (GANs, generative models based on an adversarial strategy), image and video generation has emerged as a popular research topic. It involves several subtasks, including generating images or video from a latent space, cross-modality generation[92, 93], etc. These applications are also relevant to other tasks, e.g., domain adaptation[94, 95]. Due to the difference between the audio and visual modalities, however, the potential correlation between them is difficult for machines to discover. Generating sound from a visual signal, or vice versa, therefore becomes a challenging task.
In this section, we mainly review the recent development of audio and visual generation, i.e., generating audio from visual signals or vice versa. Visual signals here mainly refer to images, motion dynamics, and videos. Section 4.1 mainly focuses on recovering speech from video of the lip area (Fig. 6(a)) or generating sounds that may occur in a given scene (Fig. 6(b)). In contrast, Section 4.2 examines generating images from a given audio (Fig. 7(a)), body motion generation (Fig. 7(b)), and talking face generation (Fig. 7(c)). Brief summaries of the advantages and disadvantages are shown in Tables 5-7.
| Category | Method | Dataset | Result |
|---|---|---|---|
| Lip sequence to speech | Le Cornu et al. | GRID | − |
| | Ephrat et al. | GRID and TCD TIMIT | PESQ: 1.922 (on GRID S4) |
| | Le Cornu et al. | GRID | − |
| General video to audio | Davis et al. | Videos they collected | SSNR: 28.7 |
| | Owens et al. | Videos they collected | ERR: 0.21 |
| | Zhou et al. | VEGAS | Flow at category level: 0.603 |
Table 6. A quantitative study on video-to-audio generation
| Category | Method | Ideas & strengths | Weaknesses |
|---|---|---|---|
| Lip sequence to speech | Le Cornu et al. | Reconstruct intelligible speech only from visual speech features | Applied to limited scenarios |
| | Ephrat et al. | Compute optical flow between frames | Applied to limited scenarios |
| | Le Cornu et al. | Reconstruct speech using a classification approach combined with feature-level temporal information | Cannot be applied to real-time conversational speech |
| General video to audio | Davis et al. | Recover real-world audio by capturing vibrations of objects | Requires a specific device; can only be applied to soft objects |
| | Owens et al. | Use LSTM to capture the relation between material and motion | For a lab-controlled environment only |
| | Zhou et al. | Leverage a hierarchical RNN to generate in-the-wild sounds | Monophonic audio only |
| | Morgado et al. | Localize and separate sounds to generate spatial audio from 360° video | Expensive 360° videos are required |
| | Zhou et al. | A unified model to generate stereophonic audio from mono data | − |
Table 5. Summary of recent approaches to video-to-audio generation
| Category | Method | Ideas & strengths | Weaknesses |
|---|---|---|---|
| Audio to image | Wan et al. | Combined many existing techniques to form a GAN | Relatively low quality |
| | Qiu and Kataoka | Generated images related to music | Relatively low quality |
| | Chen et al. | Generated both audio-to-visual and visual-to-audio results | The models were independent |
| | Wen et al. | Explored the relationship between the two modalities | − |
| | Hao et al. | Proposed a cross-modality cyclic GAN | Generated images only |
| | Li et al. | A teacher-student model for speech-to-image generation | − |
| | Wang et al. | Relation information is leveraged | − |
| Audio to motions | Alemi et al. | Generated dance movements from music via real-time GrooveNet | − |
| | Lee et al. | Generated a choreography system via an autoregressive network | − |
| | Shlizerman et al. | Applied a target-delay LSTM to predict body keypoints | Constrained to the given dataset |
| | Tang et al. | Developed a music-oriented dance choreography synthesis method | − |
| | Yalta et al. | Produced weak labels from motion directions for motion-music alignment | − |
| Talking face | Kumar et al. and Supasorn et al. | Generated keypoints by a time-delayed LSTM | Need retraining for different identities |
| | Jamaludin et al. | Developed an encoder-decoder CNN model suitable for more identities | − |
| | Jalalifar et al. | Combined RNN and GAN and applied keypoints | For a lab-controlled environment only |
| | Vougioukas et al. | Applied a temporal GAN for more temporal consistency | − |
| | Chen et al. | Applied optical flow | Generated lips only |
| | Eskimez et al. | 3D talking face landmarks; new training method | Takes a long time to train the model |
| | Eskimez et al. | Emotion discriminative loss | Heavy burden on the network; needs lots of time |
| | Zhou et al. | Disentangled information | Lacked realism |
| | Zhu et al. | Asymmetric mutual information estimation to capture modality coherence | Suffered from the zoom-in-and-out condition |
| | Chen et al. | Dynamic pixelwise loss | Required multistage training |
| | Wiles et al. | Self-supervised model for multimodality driving | Relatively low quality |
Table 7. Summary of recent studies of audio-to-visual generation
Many methods have been explored to extract audio information from visual information, including predicting sounds from visually observed vibrations and generating audio via a video signal. We divide the visual-to-audio generation tasks into two categories: generating speech from lip video and synthesizing sounds from general videos without scene limitations.
There is a natural relationship between speech and lips. Apart from understanding the speech content by observing lips (lip-reading), several studies have tried to reconstruct speech by observing lips. Le Cornu et al. attempted to predict the spectral envelope from visual features, combined it with artificial excitation signals, and synthesized audio signals in a speech production model. Ephrat and Peleg proposed an end-to-end CNN-based model that generates audio features for each silent video frame based on its adjacent frames; the waveform is then reconstructed from the learned features to produce intelligible speech.
Using temporal information to improve speech reconstruction has been extensively explored. Ephrat et al. proposed leveraging the optical flow to capture the temporal motion at the same time. Le Cornu et al. leveraged recurrent neural networks to incorporate temporal information into the prediction.
When a sound hits the surfaces of some small objects, the latter will vibrate slightly. Therefore, Davis et al. utilized this specific feature to recover the sound from vibrations observed passively by a high-speed camera. Note that it should be easy for suitable objects to vibrate, which is the case for a glass of water, a pot of plants, or a box of napkins. We argue that this work is similar to the previously introduced speech reconstruction studies[96-99] since all of them use the relation between visual and sound context. In speech reconstruction, the visual part concentrates more on lip movement, while in this work, it focuses on small vibrations.
Owens et al. observed that when different materials were hit or scratched, they emitted a variety of sounds. Thus, the researchers introduced a model that learned to synthesize sound from a video in which objects made of different materials were hit with a drumstick at different angles and velocities. The researchers demonstrated that their model could not only identify different sounds originating from different materials but also learn the pattern of interaction with objects (different actions applied to objects result in different sounds). The model leveraged an RNN to extract sound features from video frames and subsequently generated waveforms through an instance-based synthesis process.
Although Owens et al. could generate sound from various materials, the approach they proposed still could not be applied to real-life applications since the network was trained by videos shot in a lab environment under strict constraints. To improve the result and generate sounds from in-the-wild videos, Zhou et al. designed an end-to-end model. It was structured as a video encoder and a sound generator to learn the mapping from video frames to sounds. Afterwards, the network leveraged a hierarchical RNN for sound generation. Specifically, the authors trained a model to directly predict raw audio signals (waveform samples) from input videos. They demonstrated that this model could learn the correlation between sound and visual input for various scenes and object interactions.
The previous efforts we have mentioned focused on monophonic audio generation, while Morgado et al. attempted to convert the monophonic audio recorded by a 360° video camera into spatial audio. Such audio spatialization requires addressing two primary issues: source separation and localization. Therefore, the researchers designed a model to separate the sound sources from the mixed input audio and then localize them in the video. Another multimodal model was used to guide the separation and localization, since the audio and video are complementary. To generate stereophonic audio from mono data, Zhou et al. proposed a sep-stereo framework that integrates stereo generation and source separation into a unified framework.
In this section, we provide a detailed review of audio-to-visual generation. We first introduce audio-to-image generation, which is easier than video generation since it does not require temporal consistency between the generated images.
To generate images of better quality, Wan et al. put forward a conditional GAN model that combined the spectral norm, an auxiliary classifier, and a projection discriminator. The model could output images of different scales according to the volume of the sound, even for the same sound. Instead of generating real-world scenes of the sound that had occurred, Qiu and Kataoka suggested imagining content from music. They proposed a model that extracts features by feeding the music and images into two networks, learns the correlation between those features, and finally generates images from the learned correlation.
Several studies have focused on audio-visual mutual generation. Chen et al. were the first to attempt to solve this cross-modality generation problem using conditional GANs. The researchers defined a sound-to-image (S2I) network and an image-to-sound (I2S) network that generated images and sounds, respectively. Instead of separating S2I and I2S generation, Hao et al. combined the respective networks into one network by considering a cross-modality cyclic generative adversarial network (CMCGAN) for the cross-modality visual-audio mutual generation task. Following the principle of cyclic consistency, CMCGAN consisted of four subnetworks: audio-to-visual, visual-to-audio, audio-to-audio, and visual-to-visual.
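For intuition, the cyclic-consistency constraint behind such a design can be written in its generic CycleGAN-style form (shown here as a schematic objective, not necessarily CMCGAN's exact loss), where $G_{a\to v}$ and $G_{v\to a}$ denote the audio-to-visual and visual-to-audio generators:

$$\mathcal{L}_{\rm cyc} = \mathbb{E}_{a}\big[\lVert G_{v\to a}(G_{a\to v}(a)) - a\rVert_1\big] + \mathbb{E}_{v}\big[\lVert G_{a\to v}(G_{v\to a}(v)) - v\rVert_1\big].$$

Each round trip through the two generators must reproduce its input, which couples the two directions of generation into one trainable network.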
Most recently, some studies have tried to generate images conditioned on a speech description. Li et al. proposed a speech encoder to learn speech embedding features; it is trained together with a pre-trained image encoder using a teacher-student learning strategy to obtain better generalization. Wang et al. leveraged a speech embedding network to learn speech embeddings under the supervision of the corresponding visual information from images; a relation-supervised densely-stacked generative model then synthesizes images conditioned on the learned embeddings. Furthermore, some studies have tried to reconstruct facial images from speech clips. Duarte et al. synthesized facial images containing expressions and poses through a GAN model and further enhanced the generation quality by searching for the optimal input audio length. To better learn normalized faces from speech, Oh et al. explored a reconstructive model: they trained an audio encoder to align its feature space with that of a pre-trained face encoder and decoder.
Different from the above methods, Wen et al. proposed an unsupervised approach to reconstruct a face from audio. Specifically, they proposed a novel GAN-based framework that reconstructs a face from an audio vector captured by a voice embedding network; the generated face and its identity are judged by a discriminator and a classifier, respectively.
Instead of directly generating videos, numerous studies have tried to animate avatars using motions. The motion synthesis methods leveraged multiple techniques, such as dimensionality reduction[113, 114], hidden Markov models, Gaussian processes, and neural networks[117-119].
Alemi et al. proposed GrooveNet, a real-time model based on conditional restricted Boltzmann machines and recurrent neural networks, to generate dance movements from music. Lee et al. utilized an autoregressive encoder-decoder network to build a choreography system from music. Shlizerman et al. further introduced a model that used a target-delay LSTM to predict body landmarks, which in turn served as agents to generate body dynamics. The key idea was to create an animation from audio that resembles the actions of a pianist or a violinist; in summary, the entire process generates a video of an artist's performance corresponding to the input audio.
Although previous methods could generate body motion dynamics, the intrinsic beat information of the music has not been used. Tang et al. proposed a music-oriented dance choreography synthesis method that extracted a relation between acoustic and motion features via an LSTM-autoencoder model. Moreover, to achieve better performance, the researchers improved their model with a masking method and temporal indexes. Providing weak supervision, Yalta et al. explored producing weak labels from motion direction for motion-music alignment. The authors generated long dance sequences via a conditional autoconfigured deep RNN that was fed by an audio spectrum.
Exploring audio-to-video generation, many researchers have shown great interest in synthesizing people's faces from speech or music. This has many applications, such as animating movies, teleconferencing, talking agents, and enhancing speech comprehension while preserving privacy. Earlier studies of talking face generation mainly synthesized a specific identity from the dataset based on arbitrary speech audio. Kumar et al. attempted to generate key points synced to the audio by utilizing a time-delayed LSTM and then generated the video frames conditioned on those key points with another network. Furthermore, Supasorn et al. proposed a teeth proxy to improve the visual quality of the teeth during generation.
Subsequently, Jamaludin et al. attempted to use an encoder-decoder CNN model to learn the correspondences between raw audio and videos. Combining a recurrent neural network (RNN) and a GAN, Jalalifar et al. produced a sequence of realistic faces synchronized with the input audio using two networks: an LSTM network that creates lip landmarks from the audio input, and a conditional GAN (cGAN) that generates the resulting faces conditioned on a given set of lip landmarks. Instead of applying a cGAN, Vougioukas et al. proposed using a temporal GAN to improve the synthesis quality. However, the above methods were only applicable to synthesizing talking faces with identities limited to those in a dataset.
The synthesis of talking faces of arbitrary identities has recently drawn significant attention. Chen et al. considered correlations among speech and lip movements while generating multiple lip images. The researchers used the optical flow to better express the information between the frames. The fed optical flow represented not only the information of the current shape but also the previous temporal information.
A frontal face photo usually has both identity and speech information. Assuming this, Zhou et al. used an adversarial learning method to disentangle different types of information of one image during generation. The disentangled representation had a convenient property that both audio and video could serve as the source of speech information for the generation process. As a result, it was possible to not only output the features but also express them more explicitly while applying the resulting network.
Most recently, to discover the high-level correlation between audio and video, Zhu et al. proposed an asymmetric mutual information estimator to capture the coherence between modalities. Chen et al. applied landmark and motion attention to generate talking faces and further proposed a dynamic pixel-wise loss for temporal consistency. Facial generation is not limited to specific audio or visual modalities, since the crucial point is whether there is a mutual pattern between the different modalities. Wiles et al. put forward a self-supervised framework called X2Face to learn the embedded features and generate target facial motions. It could produce videos from any input as long as the embedded features were learned.
Different from the above works, Eskimez et al. proposed a supervised system (fed with a speech utterance, a face image, an emotion label, and noise) to generate talking faces, focusing on emotion to improve the authenticity of the results. As intermediate information in talking face generation, producing landmarks from audio has attracted increasing attention in recent years. Eskimez et al. proposed generating 3D talking face landmarks from audio in a noisy environment. They exploited active shape model (ASM) coefficients of the face landmarks to smooth the video frames and introduced speech enhancement to cope with background noise.
Audio-visual generation is an important yet challenging task in these fields. The challenge mainly derives from the large gap between the audio and visual modalities. To narrow this gap, some scholars introduced extra information into their models, including landmarks, keypoints, mutual information, and optical flow. More common approaches change the network structure, building on powerful GANs or other generative models such as the cross-modality cyclic generative adversarial network, GrooveNet, and the conditional GAN. Another effective approach is preprocessing the model's input, for example, aligning the feature spaces or animating avatars from motions.
Representation learning aims to discover the pattern representation from data automatically. It is motivated by the fact that the choice of data representation usually greatly impacts the performance of machine learning. However, real-world data such as images, videos, and audio are not amenable to defining specific features algorithmically. Additionally, the quality of data representation usually determines the success of machine learning algorithms. Bengio et al. assumed the reason for this to be that different representations could better explain the laws underlying data, and the recent enthusiasm for AI has motivated the design of more powerful representation learning algorithms to achieve these priors.
In this section, we will review a series of audio-visual learning methods ranging from single-modality to dual-modality representation learning[16, 17, 139-141]. The basic pipeline of such studies is shown in Fig. 8, and the strengths and weaknesses are shown in Tables 8 and 9.
| Type | Method | Ideas & strengths | Weaknesses |
|---|---|---|---|
| Single modality | Aytar et al. | Student-teacher training procedure with natural video synchronization | Only learned the audio representation |
| Dual modalities | Leidal et al. | Regularized the amount of information encoded in the semantic embedding | Focused on spoken utterances and handwritten digits |
| | Arandjelovic et al.[16, 139] | Proposed the AVC task | Considered only audio and video correspondence |
| | Korbar et al. | Proposed the AVTS task with curriculum learning | The sound source has to feature in the video |
| | Parekh et al. | Use video labels for weakly supervised learning | Leverage the prior knowledge of event classification |
| | Hu et al. | Disentangle each modality into a set of distinct components | Require a predefined number of clusters |
Table 8. Summary of recent audio-visual representation learning studies
| Type | Method | Dataset | Result |
|---|---|---|---|
| Single modality | Aytar et al. | DCASE, ESC-50 and ESC-10 | Classification accuracy: 0.88 (on DCASE) |
| Dual modalities | Leidal et al. | TIDIGITs and MNIST | − |
| | Arandjelovic et al. | Flickr-SoundNet and Kinetics | Accuracy: 0.74 (on Kinetics) |
| | Korbar et al. | Kinetics and AudioSet | Accuracy: 0.78 (on Kinetics) |
| | Parekh et al. | Subset of AudioSet | Recall: 0.694 |
Table 9. A quantitative study on audio-visual representation learning studies
Naturally, to determine whether audio and video are related to each other, researchers focus on determining whether they come from, or are synchronized within, the same video. Aytar et al. exploited the natural synchronization between video and sound to learn an acoustic representation of a video. They proposed a student-teacher training procedure that used unlabeled video as a bridge to transfer discriminative knowledge from a sophisticated visual recognition model to the sound modality. Although the proposed approach managed to learn audio-modality representations in an unsupervised manner, discovering audio and video representations simultaneously remained unsolved.
The modality information in corresponding audio and images tends to be noisy, while we only require the semantic content rather than the exact visual content. Leidal et al. explored unsupervised learning of a semantic embedding space that requires related audio and images to have close distributions. They proposed a model that maps an input to vectors of the mean and the logarithm of variance of a diagonal Gaussian distribution, from which sample semantic embeddings are drawn.
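Concretely, drawing an embedding from such a predicted diagonal Gaussian is usually implemented with the reparameterization trick, as in this minimal sketch (an assumption about the implementation, not code from the paper):

```python
import torch

def sample_semantic_embedding(mu, log_var):
    """Draw a semantic embedding from N(mu, diag(exp(log_var))) while
    keeping the sampling step differentiable w.r.t. mu and log_var."""
    std = (0.5 * log_var).exp()          # log-variance -> standard deviation
    return mu + std * torch.randn_like(std)
```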
To learn the semantic information of audio and video by simply watching and listening to a large number of unlabeled videos, Arandjelovic et al. introduced the audio-visual correspondence (AVC) learning task for training two (visual and audio) networks from scratch, as shown in Fig. 9(a). In this task, corresponding audio and visual pairs (positive samples) are obtained from the same video, while mismatched (negative) pairs are extracted from different videos. To solve this task, Arandjelovic and Zisserman proposed an L3-Net that detects whether the semantics in the visual and audio fields are consistent. Although trained without additional supervision, this model could learn representations of both modalities effectively.
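A minimal training step for the AVC task might look as follows; `vision_net`, `audio_net`, and `fusion_head` are assumed encoder and classifier modules, not the actual L3-Net, and negatives are formed by shuffling audio within the batch.

```python
import torch
import torch.nn.functional as F

def avc_step(frames, audio, vision_net, audio_net, fusion_head):
    """One AVC step: label 1 for (frame, audio) pairs from the same
    video, label 0 for pairs built from different videos."""
    B = frames.size(0)
    # mismatched pairs via a batch permutation (ignoring the rare case
    # where an index maps to itself, which a real loader would resample)
    neg_audio = audio[torch.randperm(B)]
    v = vision_net(frames)                        # (B, d) visual embeddings
    a_pos, a_neg = audio_net(audio), audio_net(neg_audio)
    logits = torch.cat([fusion_head(v, a_pos),    # should predict 1
                        fusion_head(v, a_neg)])   # should predict 0
    labels = torch.cat([torch.ones(B), torch.zeros(B)])
    return F.binary_cross_entropy_with_logits(logits.squeeze(-1), labels)
```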
Exploring the proposed AVC task further, Arandjelovic and Zisserman went on to propose AVE-Net, designed to find the visual area most similar to the current audio clip. Owens and Efros adopted a similar model but used a 3D convolutional network for the videos instead, which could capture the motion information needed for sound localization.
In contrast to previous AVC task-based solutions, Korbar et al. introduced another proxy task called audio-visual time synchronization (AVTS), which further considers whether a given audio sample and video clip are synchronized. In previous AVC tasks, negative samples were audio and visual samples taken from different videos. In AVTS, however, the model is trained with harder negatives: unsynchronized audio and visual segments sampled from the same video, forcing the model to learn the relevant temporal features. In this way, not only is semantic correspondence enforced between the video and the audio, but, more importantly, synchronization between them is also achieved. The researchers applied a curriculum learning strategy to this task and divided the samples into four categories: positives (corresponding audio-video pairs), easy negatives (audio and video clips originating from different videos), difficult negatives (audio and video clips originating from the same video without overlap), and super-difficult negatives (audio and video clips that partly overlap), as shown in Fig. 9(b).
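The sketch below illustrates how the four AVTS sample categories differ only in where the audio clip is cut. Here `video` is assumed to expose `frames(start, length)` and `audio(start, length)` helpers and a `duration` attribute, all illustrative names, and the offsets are simplified (a real implementation must verify the no-overlap and boundary conditions).

```python
import random

def sample_avts_pair(video, dataset, category, clip_len=1.0):
    """Return (video clip, audio clip, label) for one AVTS category."""
    t = random.uniform(0, video.duration - clip_len)
    clip = video.frames(t, clip_len)
    if category == "positive":               # synchronized audio
        return clip, video.audio(t, clip_len), 1
    if category == "easy_negative":          # audio from a different video
        other = random.choice(dataset)
        return clip, other.audio(0, clip_len), 0
    if category == "hard_negative":          # same video, no overlap
        s = (t + 2 * clip_len) % (video.duration - clip_len)
        return clip, video.audio(s, clip_len), 0
    # "super_hard_negative": same video, partial overlap with the clip
    return clip, video.audio(t + 0.5 * clip_len, clip_len), 0
```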
The above studies rely on two latent assumptions: 1) the sound source is present in the video, and 2) only one sound source is expected. However, these assumptions limit the applicability of such approaches to real-life videos. Therefore, Parekh et al. leveraged class-agnostic proposals from video frames, together with the audio, to model the problem as a multiple-instance learning task, so the classification and localization problems could be solved simultaneously (see the aggregation sketch below). The researchers focused on localizing salient audio and visual components using event classes in a weakly supervised manner; this framework was able to handle the difficult case of asynchronous audio-visual events. To leverage more detailed relations between the modalities, Hu et al. recommended a deep co-clustering model that extracts a set of distinct components from each modality. The model continually learns the correspondence between such representations of the different modalities and further introduces K-means clustering to distinguish concrete objects or sounds.
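In multiple-instance learning, proposal-level scores are typically aggregated into a video-level prediction by pooling (max pooling in this hedged sketch, though other aggregators are used in practice), so only video-level labels are needed while the argmax localizes the responsible proposal.

```python
import torch

def mil_video_score(proposal_scores):
    """Aggregate per-proposal class scores into a video-level score.
    proposal_scores: (num_proposals, num_classes)."""
    scores, best = proposal_scores.max(dim=0)  # video-level score per class
    return scores, best                        # `best` localizes the proposal
```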
Representation learning between the audio and visual modalities is an emerging topic in deep learning. For single-modality representation learning, existing efforts usually train an audio network to correlate with visual outputs, where the visual network is pre-trained with fixed parameters and acts as a teacher. To learn audio and visual representations simultaneously, some efforts use the natural audio-visual correspondence in videos. However, this weak constraint cannot force models to produce precise information. Therefore, later efforts added more constraints, such as class-agnostic proposals, correspondence labels, and negative versus positive samples. Moreover, some works exploit even more precise constraints, for example, synchronization labels, harder negative samples, and asynchronous audio-visual events. With these constraints, the models achieve better performance.
Many audio-visual datasets, ranging from speech-related to event-related data, have been collected and released. We divide them into two categories: audio-visual speech datasets that record human faces with the corresponding speech, and audio-visual event datasets consisting of musical instrument videos and real-event videos. In this section, we summarize the information of recent audio-visual datasets (Table 10 and Fig. 10).
| Category | Dataset | Environment | Classes | Length* | Year |
|---|---|---|---|---|---|
| Speech | GRID | Lab | 34 | 33000 | 2006 |
| | Lombard Grid | Lab | 54 | 54000 | 2018 |
| | TCD TIMIT | Lab | 62 | − | 2015 |
| | VidTIMIT | Lab | 43 | − | 2009 |
| | RAVDESS | Lab | 24 | − | 2018 |
| | SEWA | Lab | 180 | − | 2017 |
| | OuluVS | Lab | 20 | 1000 | 2009 |
| | OuluVS2 | Lab | 52 | 3640 | 2016 |
| | MEAD | Lab | 60 | 281400 | 2020 |
| | VoxCeleb | Wild | 1251 | 154516 | 2017 |
| | VoxCeleb2 | Wild | 6112 | 1128246 | 2018 |
| | LRW | Wild | ~1000 | 500000 | 2016 |
| | LRS | Wild | ~1000 | 118116 | 2017 |
| | LRS3 | Wild | ~1000 | 74564 | 2017 |
| | AVA-ActiveSpeaker | Wild | − | 90341 | 2019 |
| Music | C4S | Lab | − | 4.5 | 2017 |
| | ENST-Drums | Lab | − | 3.75 | 2006 |
| | URMP | Lab | − | 1.3 | 2019 |
| Real event | YouTube-8M | Wild | 3862 | 350000 | 2016 |
| | AudioSet | Wild | 632 | 4971 | 2016 |
| | Kinetics-400 | Wild | 400 | 850* | 2018 |
| | Kinetics-600 | Wild | 600 | 1400* | 2018 |
| | Kinetics-700 | Wild | 700 | 1806* | 2018 |
Table 10. Summary of audio-visual datasets. The speech datasets can be used for all of the speech-related tasks mentioned above. Note that the length of a speech dataset denotes the number of video clips, while for music or real-event datasets, the length represents the total number of hours.
Constructing datasets containing audio-visual corpora is crucial to understanding audio-visual speech. These datasets are collected either in lab-controlled environments, where volunteers read prepared phrases or sentences, or in in-the-wild environments such as TV interviews or talks.
Lab-controlled speech datasets are captured in specific environments, where volunteers are required to read given phrases or sentences. Some datasets only contain videos of speakers uttering the given sentences; these include GRID, TCD TIMIT, and VidTIMIT. Such datasets can be used for lip reading, talking face generation, and speech reconstruction. Development of more advanced datasets has continued: e.g., Livingstone et al. offered the RAVDESS dataset, which contains emotional speech and songs whose items are also rated according to emotional validity, intensity, and authenticity.
Some datasets such as Lombard Grid and OuluVS[149, 150] focus on multiview videos. In addition, a dataset named SEWA offers rich annotations, including answers to a questionnaire, facial landmarks, LLD (low-level descriptors) features, hand gestures, head gestures, transcript, valence, arousal, liking or disliking, template behaviors, episodes of agreement or disagreement, and episodes of mimicry. MEAD is a large-scale, high-quality emotional audio-visual dataset that contains 60 actors and actresses talking with eight different emotions at three different intensity levels. This large-scale emotional dataset can be applied to many fields, such as conditional generation, cross-modal understanding, and expression recognition.
The above datasets were collected in lab environments; as a result, models trained on them are difficult to apply in real-world scenarios. Thus, researchers have collected real-world videos from TV interviews, talks, and movies and released several real-world datasets, including LRW, LRW variants[84, 153, 154], VoxCeleb and its variants[155, 156], AVA-ActiveSpeaker, and AVSpeech. The LRW dataset consists of 500 sentences, while its variants contain 1000 sentences[84, 154], all spoken by hundreds of different speakers. VoxCeleb and its variants contain over 100000 utterances from 1251 celebrities and over a million utterances from 6112 identities.
AVA-ActiveSpeaker and AVSpeech datasets contain even more videos. The AVA-ActiveSpeaker dataset consists of 3.65 million human-labeled video frames (approximately 38.5 h). The AVSpeech dataset contains approximately 4700 h of video segments from a total of 290000 YouTube videos spanning a wide variety of people, languages, and face poses. The details are reported in Table 10.
Another audio-visual dataset category consists of music or real-world event videos. These datasets are different from the aforementioned audio-visual speech datasets in not being limited to facial videos.
Most music-related datasets were constructed in a lab environment. For example, ENST-Drums merely contains drum videos of three professional drummers specializing in different music genres. The C4S dataset consists of 54 videos of 9 distinct clarinetists, each performing three different classical music pieces twice (4.5 h in total).
The URMP dataset contains a number of multi-instrument musical pieces. However, these videos were recorded separately and then combined. To simplify the use of the URMP dataset, Chen et al. further proposed the Sub-URMP dataset that contains multiple video frames and audio files extracted from the URMP dataset.
More and more real-world audio-visual event datasets, consisting of numerous videos uploaded to the Internet, have recently been released. These datasets often comprise hundreds or thousands of event classes and the corresponding videos. Representative datasets include the following.
Kinetics-400, Kinetics-600 and Kinetics-700 contain 400, 600 and 700 human action classes with at least 400, 600 and 700 video clips for each action, respectively. Each clip lasts approximately 10 s and is taken from a distinct YouTube video. The actions cover a broad range of classes, including human-object interactions such as playing instruments, as well as human-human interactions such as shaking hands. The AVA-Actions dataset densely annotated 80 atomic visual actions in 43015 min of movie clips, where actions were localized in space and time, resulting in 1.58 M action labels with multiple labels corresponding to a certain person.
AudioSet, a more general dataset, consists of an expanding ontology of 632 audio event classes and a collection of 2084320 human-labeled 10-second sound clips. The clips were extracted from YouTube videos and cover a wide range of human and animal sounds, musical instruments and genres, and common everyday environmental sounds. YouTube-8M is a large-scale labeled video dataset that consists of millions of YouTube video IDs with high-quality machine-generated annotations from a diverse vocabulary of 3800+ visual entities.
Deep Audio-visual Learning: A Survey
- Received: 2020-12-04
- Accepted: 2021-03-15
- Published Online: 2021-04-15
- Keywords: deep audio-visual learning / audio-visual separation and localization / correspondence learning / generative models / representation learning
Abstract: Audio-visual learning, aimed at exploiting the relationship between audio and visual modalities, has drawn considerable attention since deep learning started to be used successfully. Researchers tend to leverage these two modalities to improve the performance of previously considered single-modality tasks or address new challenging problems. In this paper, we provide a comprehensive survey of recent audio-visual learning development. We divide the current audio-visual learning tasks into four different subfields: audio-visual separation and localization, audio-visual correspondence learning, audio-visual generation, and audio-visual representation learning. State-of-the-art methods, as well as the remaining challenges of each subfield, are further discussed. Finally, we summarize the commonly used datasets and challenges.
Citation: H. Zhu, M. D. Luo, R. Wang, A. H. Zheng, R. He. Deep audio-visual learning: A survey. International Journal of Automation and Computing. http://doi.org/10.1007/s11633-021-1293-0