arxiv:2601.03227

The Sonar Moment: Benchmarking Audio-Language Models in Audio Geo-Localization

Published on Jan 6 · Submitted by Rising0321 on Jan 7
Abstract

AI-generated summary: AGL1K, an audio geo-localization benchmark, is introduced to advance audio-language models' geospatial reasoning capabilities through curated audio clips and evaluation across multiple models.

Geo-localization aims to infer the geographic origin of a given signal. In computer vision, geo-localization has served as a demanding benchmark for compositional reasoning and is relevant to public safety. In contrast, progress on audio geo-localization has been constrained by the lack of high-quality audio-location pairs. To address this gap, we introduce AGL1K, the first audio geo-localization benchmark for audio language models (ALMs), spanning 72 countries and territories. To extract reliably localizable samples from a crowd-sourced platform, we propose the Audio Localizability metric, which quantifies the informativeness of each recording and yields 1,444 curated audio clips. Evaluations of 16 ALMs show that audio geo-localization capability has begun to emerge in these models. We find that closed-source models substantially outperform open-source models, and that linguistic cues often dominate as a scaffold for prediction. We further analyze ALMs' reasoning traces, regional bias, error causes, and the interpretability of the localizability metric. Overall, AGL1K establishes a benchmark for audio geo-localization and may help advance ALMs toward stronger geospatial reasoning.

Community

Paper author · Paper submitter

We found the sonar moment in audio language models. We propose the task of audio geo-localization, and, remarkably, Gemini 3 Pro achieves a distance error of less than 55 km on 25% of the samples.
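For context on how a figure like this is typically computed, here is a minimal sketch (not the authors' evaluation code; the great-circle distance convention and the helper names are my assumptions): it scores each prediction by haversine distance to the ground-truth coordinates and reports the fraction of samples with an error below 55 km.

```python
# Illustrative only; not the paper's evaluation code.
# Assumes predictions and ground truth are (latitude, longitude) pairs in degrees.
import math

def haversine_km(pred, truth):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1 = map(math.radians, pred)
    lat2, lon2 = map(math.radians, truth)
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))  # 6371 km is the mean Earth radius

def fraction_within(preds, truths, threshold_km=55.0):
    """Fraction of samples whose distance error falls below threshold_km."""
    errors = [haversine_km(p, t) for p, t in zip(preds, truths)]
    return sum(e < threshold_km for e in errors) / len(errors)

# Hypothetical usage: two predicted locations vs. their ground truth.
preds = [(48.86, 2.35), (35.68, 139.69)]   # model outputs (lat, lon)
truths = [(48.85, 2.29), (34.69, 135.50)]  # recording locations (lat, lon)
print(fraction_within(preds, truths))      # -> 0.5 (only the first error is under 55 km)
```

On this reading, the 25% figure corresponds roughly to the 25th percentile of the per-sample distance-error distribution.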


Here are a few concerns I had after reading the paper.
First, the motivation feels a bit thin: it’s not obvious how often we truly need to infer location from audio alone in real-world settings, especially when many plausible use cases would typically rely on richer signals (video/images, timestamps, device metadata, or surrounding context).
Second, the dataset construction may introduce strong sampling bias. Since the benchmark is built from user-uploaded clips on Aporee, it likely over-represents travel/landmark-style recordings and soundscapes with strong linguistic cues, rather than a distribution that resembles everyday environments.
Third, the scale is quite small for a “global” claim (about 1.4K clips across 72 countries/regions, with clear geographic imbalance). With this size, it’s hard to conclude models have a generally reliable audio geo-localization capability; the results could mostly reflect success on a limited set of highly localizable or otherwise “representative” locations.
Finally, since the audio originates from a public online source, it’s plausible that parts of this corpus (or close variants) were already present in some models’ pretraining data. If so, strong performance might reflect memorization or retrieval of seen content rather than genuine audio-based reasoning.

