AudioCaps
21 papers with code • 0 benchmarks • 0 datasets
Most implemented papers
AudioLDM: Text-to-Audio Generation with Latent Diffusion Models
By learning latent representations of audio signals and their compositions without modeling the cross-modal relationship, AudioLDM achieves advantages in both generation quality and computational efficiency.
ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities
In this work, we explore a scalable way for building a general representation model toward unlimited modalities.
Audio Retrieval with Natural Language Queries
We consider the task of retrieving audio using free-form natural language queries.
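At its core, text-based audio retrieval ranks audio clips by the similarity of their embeddings to a query embedding in a shared audio-text space. A minimal sketch with cosine similarity, using placeholder vectors in place of embeddings from a real joint encoder (the embedding dimension and function name here are illustrative, not from any of the listed papers):

```python
import numpy as np

def rank_audio_by_query(query_emb, audio_embs):
    """Rank audio clips by cosine similarity to a text query embedding.

    query_emb:  (d,) text embedding
    audio_embs: (n, d) matrix of audio embeddings
    Returns indices of audio clips, best match first, plus the scores.
    """
    # L2-normalize so dot products equal cosine similarities
    q = query_emb / np.linalg.norm(query_emb)
    a = audio_embs / np.linalg.norm(audio_embs, axis=1, keepdims=True)
    scores = a @ q                      # one similarity per audio clip
    order = np.argsort(-scores)        # descending similarity
    return order, scores
```

In a real system the embeddings would come from trained audio and text encoders; the ranking step itself is exactly this nearest-neighbor search.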
Audio Captioning Transformer
In this paper, we propose an Audio Captioning Transformer (ACT), which is a full Transformer network based on an encoder-decoder architecture and is totally convolution-free.
Can Audio Captions Be Evaluated with Image Caption Metrics?
Current metrics are found to correlate poorly with human annotations on these datasets.
Automated Audio Captioning by Fine-Tuning BART with AudioSet Tags
Automated audio captioning is the multimodal task of describing environmental audio recordings with fluent natural language.
Audio Retrieval with Natural Language Queries: A Benchmark Study
Additionally, we introduce the SoundDescs benchmark, which consists of paired audio and natural language descriptions for a diverse collection of sounds that are complementary to those found in AudioCaps and Clotho.
Separate What You Describe: Language-Queried Audio Source Separation
In this paper, we introduce the task of language-queried audio source separation (LASS), which aims to separate a target source from an audio mixture based on a natural language query of the target source (e.g., "a man tells a joke followed by people laughing").
On Metric Learning for Audio-Text Cross-Modal Retrieval
We present an extensive evaluation of popular metric learning objectives on the AudioCaps and Clotho datasets.
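A common family of metric learning objectives for audio-text retrieval is the symmetric contrastive (InfoNCE-style) loss, where matched audio-caption pairs in a batch are positives and all other pairings are negatives. A minimal NumPy sketch of that idea, not a reproduction of any specific objective evaluated in the paper (temperature value and function name are illustrative):

```python
import numpy as np

def symmetric_infonce(audio_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of audio-text pairs.

    Row i of audio_emb and row i of text_emb form a positive pair;
    every other pairing in the batch serves as a negative.
    """
    # Normalize so the similarity matrix holds cosine similarities
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = (a @ t.T) / temperature     # (B, B); diagonal = positives
    idx = np.arange(len(a))

    def cross_entropy(lg):
        # Numerically stable log-softmax over each row
        lg = lg - lg.max(axis=1, keepdims=True)
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[idx, idx].mean()

    # Average the audio-to-text and text-to-audio directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

With well-aligned embeddings the diagonal dominates each row of the similarity matrix and the loss approaches zero; misaligned pairs drive it up, which is what the retrieval training signal exploits.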
Audio Retrieval with WavText5K and CLAP Training
In this work, we propose a new collection of web audio-text pairs and a new framework for retrieval.