Summary: | Automated Audio Captioning (AAC) is the task of generating descriptive captions from an input audio clip, while Language-Based Audio Retrieval (LBAR) is the task of retrieving the most relevant audio clip for an input text query. AAC requires a model that can not only comprehend the acoustic events occurring within an audio clip but also translate that information into natural language. For LBAR, the model must understand the context and meaning of both the audio events and the query caption, so that it can retrieve relevant audio clips for user-specified queries. Both tasks are difficult because audio data is often noisy and the same sound event can sound different depending on its source and environment. To overcome these challenges, we propose three self-supervised techniques that enhance the cross-modal relationship between text and audio representations. In the first study, we propose Reconstruction Latent Space Similarity Regularization (RLSSR) for AAC, an additional trainable module in the model architecture. This module is trained in a self-supervised manner and does not require any additional annotations. The idea is inspired by computer vision tasks in which the model reconstructs the original image. Instead of reconstructing the original audio, a small component reconstructs the audio embeddings from the text embeddings using a method that increases the similarity between the two. This feedback process serves as a form of regularization and improves the overall quality of the generated captions. We also analyze the design of the audio encoder and find that using a transformer encoder is beneficial for AAC. The combination of both methods allows us to surpass state-of-the-art results (0.242 SPIDEr score) by a significant margin on the Clotho dataset across several metrics and benchmarks.
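The reconstruction-based regularization described above can be illustrated with a minimal sketch. It assumes a linear reconstruction module and a cosine-similarity penalty; the abstract does not specify either, so the function names, the linear map, and the `reg_weight` value are all hypothetical choices made for illustration only.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def reconstruct_audio(text_emb, weights):
    """Hypothetical reconstruction module: a linear map (one row of
    `weights` per audio dimension) from a text embedding to the
    audio embedding space."""
    return [sum(w * t for w, t in zip(row, text_emb)) for row in weights]


def similarity_regularizer(audio_emb, text_emb, weights):
    """Penalty that shrinks as the reconstructed embedding aligns with
    the real audio embedding (assumed form: 1 - cosine similarity)."""
    reconstructed = reconstruct_audio(text_emb, weights)
    return 1.0 - cosine_similarity(audio_emb, reconstructed)


def total_loss(caption_loss, audio_emb, text_emb, weights, reg_weight=0.1):
    """Captioning loss plus the weighted regularization term.
    `reg_weight` is an assumed hyperparameter."""
    return caption_loss + reg_weight * similarity_regularizer(
        audio_emb, text_emb, weights
    )


identity = [[1.0, 0.0], [0.0, 1.0]]
print(similarity_regularizer([1.0, 0.0], [2.0, 0.0], identity))  # 0.0
print(similarity_regularizer([1.0, 0.0], [0.0, 3.0], identity))  # 1.0
```

Because the penalty needs only the embeddings the model already produces, no extra annotations are required, which matches the self-supervised framing above.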
In the second study, we tackle the new Language-Based Audio Retrieval challenge presented in DCASE 2022. Firstly, we introduce an easy-to-use and scalable architecture, Converging Tied Layers. This architecture uses shared transformer layers to align the audio and text representations in the same subspace. The approach requires minimal training and allows the use of many publicly available models without fine-tuning. Secondly, we demonstrate that by using this architecture together with a self-supervised contrastive loss, the model exceeds the performance of the baseline model. Lastly, our approach has a low memory requirement for training, since the pre-trained models are used as is. Our evaluation shows that our approach beats the baseline scores by 0.08 (267%) in R@1 and 0.13 in mAP10 on the Clotho dataset. In the third study, we present a new algorithm named Epochal Difficult Captions to aid in the training of models for AAC. The algorithm adjusts the target captions according to a predetermined curriculum, with a difficulty level determined by the current training epoch. The algorithm is efficient, self-supervised, can be incorporated into any model architecture, and does not noticeably increase training time. It improves on the keyword estimation method used in earlier work to train the AAC encoder. We evaluated our approach on two different models in three settings and found that using Epochal Difficult Captions consistently improves performance, by as much as 0.013 SPIDEr score on the Clotho dataset. In addition to the above work, we present two further studies, one on word sense disambiguation via transfer learning and one on audio tagging. The former study uses BERT to reframe the word sense disambiguation problem as a relevance ranking problem, improving the F1 score by 2.6% on the SE15 dataset.
The latter method for audio tagging uses label manipulation to convert strong labels to weak labels, mitigating the model’s tendency to predict inactive frames. This approach outperforms the DCASE 2022 baseline by 45.5% on the real validation set in both aspects of the PSDS metric. Both methods complement audio captioning and retrieval, which depend on good cross-modal audio and text representations.
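As an illustration of the strong-to-weak label conversion mentioned above, the sketch below uses the standard definition of a weak label: a clip-level class is marked present if it is active in at least one frame. The exact label manipulation used in the study is not specified here, so the function name and the frame-by-class input layout are assumptions.

```python
def strong_to_weak(frame_labels):
    """Collapse frame-level (strong) labels into clip-level (weak) labels.

    `frame_labels` is a list of frames, each a list of 0/1 flags with one
    entry per class. A class is present in the clip if it is active in
    any frame.
    """
    if not frame_labels:
        return []
    num_classes = len(frame_labels[0])
    return [
        int(any(frame[c] for frame in frame_labels))
        for c in range(num_classes)
    ]


# Three frames, three classes: only the second class is ever active.
strong = [
    [0, 1, 0],
    [0, 1, 0],
    [0, 0, 0],
]
print(strong_to_weak(strong))  # [0, 1, 0]
```

The weak label discards onset and offset information, so a model trained on it is no longer rewarded for predicting silence frame by frame.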
|