SpotSound: Enhancing Large Audio-Language Models with Fine-Grained Temporal Grounding


Luoyi Sun1,2
Xiao Zhou3
Zeqian Li3
Ya Zhang2,3
Yanfeng Wang2,3
Weidi Xie2,3

1Zhejiang University
2Shanghai AI Lab
3Shanghai Jiao Tong University

Under Review



Code [GitHub]

Paper [arXiv]

Cite [BibTeX]

Benchmark [HuggingFace]


Qualitative examples and performance comparison

Large Audio-Language Models (ALMs) struggle with temporal grounding, the task of pinpointing exactly when an event occurs within long-form audio. We introduce SpotSound, an audio-language model designed for precise temporal grounding. It incorporates a novel training objective that suppresses hallucinated timestamps, and it accurately grounds short events obscured by dense background sounds.



Abstract

Large Audio-Language Models (ALMs) have recently demonstrated remarkable capabilities in holistic audio understanding, yet they remain unreliable for temporal grounding, i.e., the task of pinpointing exactly when an event occurs within long-form audio. This limitation stems from two factors: training data dominated by clip-level supervision lacking precise timestamps, and benchmarks that fail to simulate real-world scenarios where short events are obscured by dense background sounds. In this paper, we introduce SpotSound, an audio-language model designed for grounding audio events. SpotSound incorporates a novel training objective, specifically designed to suppress hallucinated timestamps for events absent from the input. Additionally, we present SpotSound-Bench, a challenging temporal grounding benchmark where target events occupy less than 10% of each clip, creating a rigorous ‘needle-in-a-haystack’ evaluation. Experiments demonstrate that SpotSound achieves state-of-the-art results on temporal grounding benchmarks while maintaining robust performance across general downstream audio-language tasks.




Methodology: Timestamp-Interleaved Sequence

Timestamp-interleaved sequence illustration

To establish a precise alignment between audio features and their temporal positions, we explicitly encode time by inserting textual timestamp tokens before the corresponding audio tokens at a fixed granularity (e.g., 1 second). We harness the retrieval capabilities of ALMs to read out the inserted timestamp tokens rather than decoding dense positional encodings.
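The interleaving described above can be sketched as follows. This is an illustrative reconstruction, not the paper's exact implementation: the timestamp token format (`<|0s|>`) and the tokens-per-second rate are assumptions.

```python
# Sketch of timestamp-interleaved sequence construction (illustrative;
# the token format and rates below are assumptions, not SpotSound's exact scheme).

def interleave_timestamps(audio_tokens, tokens_per_second, granularity_s=1):
    """Insert a textual timestamp token before each span of audio tokens
    at a fixed granularity, so the model can read time off the sequence."""
    sequence = []
    step = tokens_per_second * granularity_s
    for i in range(0, len(audio_tokens), step):
        t = i // tokens_per_second          # start time of this span (seconds)
        sequence.append(f"<|{t}s|>")        # hypothetical timestamp token
        sequence.extend(audio_tokens[i:i + step])
    return sequence

# Usage: 4 seconds of audio at 2 tokens/s, 1-second granularity
tokens = [f"a{i}" for i in range(8)]
print(interleave_timestamps(tokens, tokens_per_second=2))
# → ['<|0s|>', 'a0', 'a1', '<|1s|>', 'a2', 'a3', '<|2s|>', 'a4', 'a5', '<|3s|>', 'a6', 'a7']
```

Because the timestamps are ordinary text tokens, the model can localize an event by retrieving the timestamp preceding the relevant audio span rather than decoding dense positional encodings.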




Dataset Generation Pipeline

Dataset generation pipeline overview

We construct a temporally-aware audio-language dataset comprising 77.6k samples. To enrich supervision with denser linguistic cues, we employ LLMs (e.g., DeepSeek-v3 and Qwen2-Audio) to generate fine-grained captions for foreground audio. We then randomly mix these foreground sounds with background ambiance, preserving the insertion timestamp as the ground-truth for robust temporal grounding training.
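The mixing step can be sketched as below. This is a minimal illustration under assumed details (sample rate, no gain balancing): a foreground event is inserted into a longer background clip at a random offset, and that offset plus the event's duration becomes the ground-truth window.

```python
import numpy as np

# Sketch of foreground-background mixing with ground-truth timestamps
# (illustrative; sample rate and the absence of loudness normalization
# are assumptions, not SpotSound's exact pipeline).

def mix_with_background(foreground, background, sr=16000, rng=None):
    """Insert a foreground event into a longer background clip at a random
    offset; return the mixed waveform and the (start, end) window in seconds."""
    rng = rng or np.random.default_rng()
    assert len(background) >= len(foreground)
    start = int(rng.integers(0, len(background) - len(foreground) + 1))
    mixed = background.copy()
    mixed[start:start + len(foreground)] += foreground
    t0 = start / sr
    t1 = (start + len(foreground)) / sr
    return mixed, (t0, t1)

fg = np.ones(16000)            # 1 s "event" at 16 kHz
bg = np.zeros(10 * 16000)      # 10 s of background
mix, (t0, t1) = mix_with_background(fg, bg, rng=np.random.default_rng(0))
assert abs((t1 - t0) - 1.0) < 1e-9   # window length equals event length
```

Recording the insertion offset directly, rather than re-annotating the mix, is what makes the resulting timestamps exact by construction.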




SpotSound-Bench: A 'Needle-in-a-Haystack' Evaluation

Existing benchmarks often feature high ratios of target-window duration to full audio clip duration. We introduce SpotSound-Bench, featuring short acoustic events embedded within long, unstructured recordings. With an average clip length of 53.4s and target events averaging 4.5s (a temporal density of 8.4%), this benchmark creates a large search space dominated by background content, demanding high temporal precision and robustness against hallucinations.




Experimental Results: Audio Temporal Grounding

We evaluate SpotSound (built upon Qwen2-Audio and Audio Flamingo 3) against task-specific methods and recent large audio-language models. SpotSound achieves state-of-the-art results across multiple benchmarks, including Clotho-Moment, UnAV-100 subset, AudioGrounding, and our proposed SpotSound-Bench. Notably, SpotSound-A surpasses previous methods in mIoU by +20.4% on SpotSound-Bench and +27.0% on the UnAV-100 subset.

Audio temporal grounding results table
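The mIoU numbers above follow the standard temporal intersection-over-union between predicted and ground-truth windows, averaged over samples; a minimal sketch of that metric:

```python
# Standard temporal IoU and mean IoU over (start, end) windows in seconds.

def temporal_iou(pred, gt):
    """IoU between a predicted and a ground-truth (start, end) window."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def mean_iou(preds, gts):
    """Average IoU over paired predictions and ground truths."""
    return sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(preds)

print(temporal_iou((2.0, 6.0), (4.0, 8.0)))  # → 0.3333...
```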




Hallucination Mitigation for Negative Samples

Non-existent Event Example

A recurring failure mode in previous ALMs is the tendency to predict temporal windows regardless of whether the queried event is actually present. By restructuring training instances into a discriminative quadruplet format (including negative queries), SpotSound reliably determines event presence. SpotSound-A achieves up to 93.4% accuracy on positive samples and 87.9% on negative samples in AudioGrounding, demonstrating strong robustness against hallucinations.
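A sketch of what such a discriminative training instance might look like; the field names and the refusal string are illustrative assumptions, not the paper's exact format:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical quadruplet format: (audio, query, presence label, window).
# Field names and target strings are assumptions for illustration.

@dataclass
class GroundingSample:
    audio_id: str
    query: str                              # natural-language event description
    present: bool                           # does the queried event occur at all?
    window: Optional[Tuple[float, float]]   # (start, end) in seconds; None if absent

def target_text(sample: GroundingSample) -> str:
    """Supervision target: a time window for positives, an explicit refusal
    for negatives, so the model learns not to hallucinate timestamps."""
    if sample.present:
        t0, t1 = sample.window
        return f"{t0:.1f}s - {t1:.1f}s"
    return "The event does not occur in the audio."

pos = GroundingSample("clip_01", "dog barking", True, (12.0, 14.5))
neg = GroundingSample("clip_01", "glass breaking", False, None)
print(target_text(pos))  # → 12.0s - 14.5s
print(target_text(neg))  # → The event does not occur in the audio.
```

Pairing each clip with queries it does not contain gives the model explicit supervision for saying "no", which is exactly the behavior the presence/absence accuracies above measure.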




Two-stage Joint Assessment

Two-stage joint assessment results

We evaluate model performance jointly across two stages, determining whether a sound event is present and predicting its time window, and summarize both with the F1-score. While existing large ALMs underperform due to their susceptibility to hallucinating non-existent events and their lack of audio temporal grounding capability, our proposed SpotSound models consistently maintain highly competitive performance across all benchmarks.
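One plausible way to operationalize such a joint F1-score is sketched below; the exact scoring rule (a prediction counts as a true positive only if it asserts presence and overlaps the ground truth above an IoU threshold) is an assumption for illustration.

```python
# Sketch of a two-stage joint F1: presence detection + localization quality.
# The counting rule below is an assumed formulation, not the paper's exact one.

def temporal_iou(pred, gt):
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def joint_f1(samples, iou_thr=0.5):
    """samples: list of (predicted window or None, ground-truth window or None)."""
    tp = fp = fn = 0
    for pred, gt in samples:
        if gt is None:
            if pred is not None:
                fp += 1                     # hallucinated a non-existent event
        elif pred is None:
            fn += 1                         # missed a real event
        elif temporal_iou(pred, gt) >= iou_thr:
            tp += 1                         # present and well localized
        else:
            fp += 1
            fn += 1                         # present, but localized too loosely
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

print(joint_f1([((1.0, 3.0), (1.0, 3.0)),    # correct localization
                (None, None),                # correct rejection
                ((0.0, 1.0), (5.0, 6.0))]))  # → 0.5
```

Under this rule, a model that hallucinates windows for absent events pays a direct precision penalty, which is why hallucination-prone ALMs score poorly on the joint metric.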




Sound Event Detection

Sound event detection results

To further assess the generalization capability of SpotSound, we evaluate on standard Sound Event Detection (SED) benchmarks, specifically TUT Sound Events 2017 and DESED. Despite the complexity of these datasets, SpotSound achieves the best results, demonstrating broad generalization and robust temporal grounding accuracy even on standard SED tasks.




Qualitative Results

We present qualitative examples comparing SpotSound's predictions with ground truth annotations and baseline models. SpotSound demonstrates precise temporal localization for short, transient sounds embedded in complex backgrounds. Furthermore, when queried with events that do not exist in the audio clip, SpotSound correctly identifies their absence, avoiding the hallucination issues prevalent in previous Large Audio-Language Models.

Qualitative results showing temporal grounding



Acknowledgements

Based on a template by Phillip Isola and Richard Zhang.