SpotSound: Enhancing Large Audio-Language Models with Fine-Grained Temporal Grounding


Luoyi Sun1,2
Xiao Zhou3
Zeqian Li3
Ya Zhang2,3
Yanfeng Wang2,3
Weidi Xie2,3

1Zhejiang University
2Shanghai AI Lab
3Shanghai Jiao Tong University

Under Review



Code [GitHub]

Paper [arXiv]

Cite [BibTeX]

Benchmark [HuggingFace]


Qualitative examples and performance comparison

Large Audio-Language Models (ALMs) struggle with temporal grounding, the task of pinpointing exactly when an event occurs within long-form audio. We introduce SpotSound, an audio-language model designed for precise temporal grounding. It incorporates a novel training objective that suppresses hallucinated timestamps, and it accurately grounds short events obscured by dense background sounds.



Abstract

Large Audio-Language Models (ALMs) have recently demonstrated remarkable capabilities in holistic audio understanding, yet they remain unreliable for temporal grounding, i.e., the task of pinpointing exactly when an event occurs within long-form audio. This limitation stems from two factors: training data dominated by clip-level supervision lacking precise timestamps, and benchmarks that fail to simulate real-world scenarios where short events are obscured by dense background sounds. In this paper, we introduce SpotSound, an audio-language model designed for grounding audio events. SpotSound incorporates a novel training objective, specifically designed to suppress hallucinated timestamps for events absent from the input. Additionally, we present SpotSound-Bench, a challenging temporal grounding benchmark where target events occupy less than 10% of each clip, creating a rigorous ‘needle-in-a-haystack’ evaluation. Experiments demonstrate that SpotSound achieves state-of-the-art results on temporal grounding benchmarks while maintaining robust performance across general downstream audio-language tasks.




Methodology: Timestamp-Interleaved Sequence

Timestamp-interleaved sequence illustration

To establish a precise alignment between audio features and their temporal positions, we explicitly encode time by inserting textual timestamp tokens before the corresponding audio tokens at a fixed granularity (e.g., 1 second). We harness the retrieval capabilities of ALMs to read out the inserted timestamp tokens rather than decoding dense positional encodings.
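The interleaving described above can be sketched as follows. This is an illustrative reconstruction, not the paper's exact implementation: the timestamp token format (`<|0s|>`) and the tokens-per-second rate are assumptions.

```python
# Sketch of timestamp-interleaved sequence construction (illustrative;
# the token format and rates below are assumptions, not SpotSound's exact scheme).

def interleave_timestamps(audio_tokens, tokens_per_second, granularity_s=1):
    """Insert a textual timestamp token before each span of audio tokens
    at a fixed granularity, so the model can read time off the sequence."""
    sequence = []
    step = tokens_per_second * granularity_s
    for i in range(0, len(audio_tokens), step):
        t = i // tokens_per_second          # start time of this span (seconds)
        sequence.append(f"<|{t}s|>")        # hypothetical timestamp token
        sequence.extend(audio_tokens[i:i + step])
    return sequence

# Usage: 4 seconds of audio at 2 tokens/s, 1-second granularity
tokens = [f"a{i}" for i in range(8)]
print(interleave_timestamps(tokens, tokens_per_second=2))
# → ['<|0s|>', 'a0', 'a1', '<|1s|>', 'a2', 'a3', '<|2s|>', 'a4', 'a5', '<|3s|>', 'a6', 'a7']
```

Because the timestamps are ordinary text tokens, the model can localize an event by retrieving the timestamp preceding the relevant audio span rather than decoding dense positional encodings.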




Dataset Generation Pipeline

Dataset generation pipeline overview

We construct a temporally-aware audio-language dataset comprising 77.6k samples. To enrich supervision with denser linguistic cues, we employ LLMs (e.g., DeepSeek-v3 and Qwen2-Audio) to generate fine-grained captions for foreground audio. We then randomly mix these foreground sounds with background ambiance, preserving the insertion timestamp as the ground-truth for robust temporal grounding training.
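The mixing step can be sketched as below. This is a minimal illustration under assumed details (sample rate, no gain balancing): a foreground event is inserted into a longer background clip at a random offset, and that offset plus the event's duration becomes the ground-truth window.

```python
import numpy as np

# Sketch of foreground-background mixing with ground-truth timestamps
# (illustrative; sample rate and the absence of loudness normalization
# are assumptions, not SpotSound's exact pipeline).

def mix_with_background(foreground, background, sr=16000, rng=None):
    """Insert a foreground event into a longer background clip at a random
    offset; return the mixed waveform and the (start, end) window in seconds."""
    rng = rng or np.random.default_rng()
    assert len(background) >= len(foreground)
    start = int(rng.integers(0, len(background) - len(foreground) + 1))
    mixed = background.copy()
    mixed[start:start + len(foreground)] += foreground
    t0 = start / sr
    t1 = (start + len(foreground)) / sr
    return mixed, (t0, t1)

fg = np.ones(16000)            # 1 s "event" at 16 kHz
bg = np.zeros(10 * 16000)      # 10 s of background
mix, (t0, t1) = mix_with_background(fg, bg, rng=np.random.default_rng(0))
assert abs((t1 - t0) - 1.0) < 1e-9   # window length equals event length
```

Recording the insertion offset directly, rather than re-annotating the mix, is what makes the resulting timestamps exact by construction.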




SpotSound-Bench: A 'Needle-in-a-Haystack' Evaluation

Existing benchmarks often feature high ratios of target-window duration to full audio clip duration. We introduce SpotSound-Bench, featuring short acoustic events embedded within long, unstructured recordings. With an average clip length of 53.4s and target events averaging 4.5s (a temporal density of 8.4%), this benchmark creates a large search space dominated by background content, demanding high temporal precision and robustness against hallucinations.




Experimental Results: Audio Temporal Grounding

We evaluate SpotSound (built upon Qwen2-Audio and Audio Flamingo 3) against task-specific methods and recent large audio-language models. SpotSound achieves state-of-the-art results across multiple benchmarks, including Clotho-Moment, UnAV-100 subset, AudioGrounding, and our proposed SpotSound-Bench. Notably, SpotSound-A surpasses previous methods in mIoU by +20.4% on SpotSound-Bench and +27.0% on the UnAV-100 subset.

Audio temporal grounding results table
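The mIoU numbers above follow the standard temporal intersection-over-union between predicted and ground-truth windows, averaged over samples; a minimal sketch of that metric:

```python
# Standard temporal IoU and mean IoU over (start, end) windows in seconds.

def temporal_iou(pred, gt):
    """IoU between a predicted and a ground-truth (start, end) window."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def mean_iou(preds, gts):
    """Average IoU over paired predictions and ground truths."""
    return sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(preds)

print(temporal_iou((2.0, 6.0), (4.0, 8.0)))  # → 0.3333...
```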




Hallucination Mitigation for Negative Samples

Non-existent Event Example

A recurring failure mode in previous ALMs is the tendency to predict temporal windows regardless of whether the queried event is actually present. By restructuring training instances into a discriminative quadruplet format (including negative queries), SpotSound reliably determines event presence. SpotSound-A achieves up to 93.4% accuracy on positive samples and 87.9% on negative samples in AudioGrounding, demonstrating strong robustness against hallucinations.
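A sketch of what such a discriminative training instance might look like; the field names and the refusal string are illustrative assumptions, not the paper's exact format:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical quadruplet format: (audio, query, presence label, window).
# Field names and target strings are assumptions for illustration.

@dataclass
class GroundingSample:
    audio_id: str
    query: str                              # natural-language event description
    present: bool                           # does the queried event occur at all?
    window: Optional[Tuple[float, float]]   # (start, end) in seconds; None if absent

def target_text(sample: GroundingSample) -> str:
    """Supervision target: a time window for positives, an explicit refusal
    for negatives, so the model learns not to hallucinate timestamps."""
    if sample.present:
        t0, t1 = sample.window
        return f"{t0:.1f}s - {t1:.1f}s"
    return "The event does not occur in the audio."

pos = GroundingSample("clip_01", "dog barking", True, (12.0, 14.5))
neg = GroundingSample("clip_01", "glass breaking", False, None)
print(target_text(pos))  # → 12.0s - 14.5s
print(target_text(neg))  # → The event does not occur in the audio.
```

Pairing each clip with queries it does not contain gives the model explicit supervision for saying "no", which is exactly the behavior the presence/absence accuracies above measure.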




Two-stage Joint Assessment

Two-stage joint assessment results

We evaluate model performance jointly across two stages, determining whether a sound event is present and predicting its time window, and summarize both with the F1-score. While existing large ALMs underperform due to their susceptibility to hallucinating non-existent events and their lack of audio temporal grounding capability, our proposed SpotSound models consistently maintain highly competitive performance across all benchmarks.
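One plausible way to operationalize such a joint F1-score is sketched below; the exact scoring rule (a prediction counts as a true positive only if it asserts presence and overlaps the ground truth above an IoU threshold) is an assumption for illustration.

```python
# Sketch of a two-stage joint F1: presence detection + localization quality.
# The counting rule below is an assumed formulation, not the paper's exact one.

def temporal_iou(pred, gt):
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def joint_f1(samples, iou_thr=0.5):
    """samples: list of (predicted window or None, ground-truth window or None)."""
    tp = fp = fn = 0
    for pred, gt in samples:
        if gt is None:
            if pred is not None:
                fp += 1                     # hallucinated a non-existent event
        elif pred is None:
            fn += 1                         # missed a real event
        elif temporal_iou(pred, gt) >= iou_thr:
            tp += 1                         # present and well localized
        else:
            fp += 1
            fn += 1                         # present, but localized too loosely
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

print(joint_f1([((1.0, 3.0), (1.0, 3.0)),    # correct localization
                (None, None),                # correct rejection
                ((0.0, 1.0), (5.0, 6.0))]))  # → 0.5
```

Under this rule, a model that hallucinates windows for absent events pays a direct precision penalty, which is why hallucination-prone ALMs score poorly on the joint metric.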




Sound Event Detection

Sound event detection results

To further assess the generalization capability of SpotSound, we evaluate on standard Sound Event Detection (SED) benchmarks, specifically TUT Sound Events 2017 and DESED. Despite the complexity of these datasets, SpotSound achieves the best results, demonstrating broad generalization and robust temporal grounding accuracy even on standard SED tasks.




Qualitative Results

We present qualitative examples comparing SpotSound's predictions with ground truth annotations and baseline models. SpotSound demonstrates precise temporal localization for short, transient sounds embedded in complex backgrounds. Furthermore, when queried with events that do not exist in the audio clip, SpotSound correctly identifies their absence, avoiding the hallucination issues prevalent in previous Large Audio-Language Models.

Qualitative results showing temporal grounding



Acknowledgements

Based on a template by Phillip Isola and Richard Zhang.