CLaMP 3: Universal Music Information Retrieval Across Unaligned Modalities and Unseen Languages


1 Central Conservatory of Music

2 Hong Kong University of Science and Technology

3 New York University Shanghai

4 Mohamed bin Zayed University of Artificial Intelligence

5 Korea Advanced Institute of Science and Technology

6 Tsinghua University

Abstract

CLaMP 3 is a unified framework developed to address challenges of cross-modal and cross-lingual generalization in music information retrieval. Using contrastive learning, it aligns all major music modalities—including sheet music, performance signals, and audio recordings—with multilingual text in a shared representation space, enabling retrieval across unaligned modalities with text as a bridge. It features a multilingual text encoder adaptable to unseen languages, exhibiting strong cross-lingual generalization. Leveraging retrieval-augmented generation, we curated M4-RAG, a web-scale dataset consisting of 2.31 million music-text pairs. This dataset is enriched with detailed metadata that represents a wide array of global musical traditions. To advance future research, we release WikiMT-X, a benchmark comprising 1,000 triplets of sheet music, audio, and richly varied text descriptions. Experiments show that CLaMP 3 achieves state-of-the-art performance on multiple MIR tasks, significantly surpassing previous strong baselines and demonstrating excellent generalization in multimodal and multilingual music contexts.

Introduction

A key aspect of Music Information Retrieval (MIR) is connecting music with language, enabling music to be retrieved through text descriptions. However, existing MIR systems lack robust multimodal and multilingual support, limiting their ability to capture music's full complexity.

We developed CLaMP 3, a universal MIR framework trained on metadata in 27 languages and music from 194 countries, covering all major musical modalities. Trained with contrastive learning, it generalizes to 100 languages and aligns musical modalities that were never directly paired during training, using text as a bridge.
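To make the idea of contrastive alignment concrete, here is a minimal sketch of a symmetric InfoNCE-style loss over a batch of paired text and music embeddings. The page only states that contrastive learning is used; the symmetric form, the temperature value of 0.07, and the toy 2-dimensional embeddings are assumptions for illustration, not the authors' training code.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def log_softmax(row):
    # Numerically stable log-softmax over one row of logits.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def symmetric_contrastive_loss(text_embs, music_embs, temperature=0.07):
    """text_embs[i] and music_embs[i] are assumed to describe the same item."""
    t = [l2_normalize(v) for v in text_embs]
    m = [l2_normalize(v) for v in music_embs]
    n = len(t)
    # Pairwise cosine similarities, scaled by the (assumed) temperature.
    logits = [[dot(t[i], m[j]) / temperature for j in range(n)] for i in range(n)]
    # Text-to-music direction: the matching pair sits on the diagonal.
    t2m = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Music-to-text direction: same computation on the transposed matrix.
    logits_t = [list(col) for col in zip(*logits)]
    m2t = -sum(log_softmax(logits_t[i])[i] for i in range(n)) / n
    return 0.5 * (t2m + m2t)


# Toy embeddings (hypothetical): correctly paired vs. deliberately swapped.
texts = [[1.0, 0.0], [0.0, 1.0]]
aligned = symmetric_contrastive_loss(texts, [[1.0, 0.1], [0.1, 1.0]])
shuffled = symmetric_contrastive_loss(texts, [[0.1, 1.0], [1.0, 0.1]])
```

The loss is small when each text is closest to its own music item and large when the pairs are swapped, which is the pressure that pulls matching items together in the shared space.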

All retrieved music on this page is from WikiMT-X, a 1,000-piece benchmark, searched via the CLaMP 3 demo on Hugging Face Spaces. This page uses the CLaMP 3 model with the CLaMP 3 SAAS weights (optimal for audio). For each query, we showcase the top-1 retrieval result. Click on the images to play the retrieved music videos on YouTube.

English Text-to-Music Retrieval

Multilingual Text-to-Music Retrieval


CLaMP 3 maps text to music across languages, even those without training data, showing strong cross-lingual generalization.
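Once text and music live in the same space, text-to-music retrieval reduces to a nearest-neighbor search. The sketch below ranks a small library by cosine similarity to a query embedding and returns the top-k matches. The embeddings are hand-written toy vectors and the names `retrieve`, `library`, and `piece_a` to `piece_c` are hypothetical; in practice the vectors would come from the model's text and music encoders.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve(query_emb, library, k=1):
    """Return the k (name, score) pairs most similar to the query."""
    scored = [(name, cosine(query_emb, emb)) for name, emb in library.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy music embeddings (hypothetical values, not model outputs).
library = {
    "piece_a": [0.9, 0.1, 0.0],
    "piece_b": [0.1, 0.8, 0.3],
    "piece_c": [0.0, 0.2, 0.9],
}

# Toy embedding of a text query in any language.
query = [0.2, 0.9, 0.2]
top1 = retrieve(query, library, k=1)
```

Because the query language only affects how the text embedding is produced, the same search code serves every language the text encoder can handle.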

Image-to-Music Retrieval


CLaMP 3 is also trained on scene-depiction captions, so it learns which visual scenes suit which music.
An image captioning model such as BLIP can therefore turn an image into a caption, which CLaMP 3 then uses to retrieve music matching the scene.
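The image-to-music path described above can be sketched as a two-stage pipeline: caption the image, then use the caption as a text query. In the sketch below the captioner and the text encoder are passed in as plain callables so the example stays self-contained; `toy_captioner`, `toy_text_encoder`, and the music index are hypothetical stand-ins, not BLIP or CLaMP 3 themselves.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def image_to_music(image, caption_image, embed_text, music_index):
    """Caption the image, embed the caption, return (caption, best match).

    Assumes embeddings are comparable by dot product (e.g. L2-normalized).
    """
    caption = caption_image(image)
    query = embed_text(caption)
    best = max(music_index, key=lambda name: dot(query, music_index[name]))
    return caption, best


# Hypothetical stand-in for an image captioning model.
def toy_captioner(image_path):
    captions = {
        "beach.jpg": "a sunny beach party",
        "window.jpg": "rain on a window at night",
    }
    return captions[image_path]


# Hypothetical stand-in for a text encoder: counts two keywords.
KEYWORDS = ("rain", "party")


def toy_text_encoder(text):
    words = text.lower().split()
    return [float(words.count(k)) for k in KEYWORDS]


# Toy music embeddings in the same 2-dimensional space.
music_index = {"calm_piano": [1.0, 0.0], "dance_track": [0.0, 1.0]}

caption, match = image_to_music(
    "window.jpg", toy_captioner, toy_text_encoder, music_index
)
```

Keeping the captioner as a separate, swappable component mirrors the description above: the music model never sees pixels, only the caption text.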

Cross-Modal Music Retrieval


CLaMP 3 enables cross-modal music retrieval, allowing queries like sheet music to find similar audio pieces.
It often finds semantically related works and sometimes different musical representations of the same piece.
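One way to quantify how often a cross-modal query finds the same piece in another modality is mean reciprocal rank (MRR) over paired embeddings. The sketch below assumes that `sheet[i]` and `audio[i]` represent the same piece, as in a benchmark of aligned triplets; the choice of MRR, the function name, and the toy vectors are assumptions for illustration.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_reciprocal_rank(query_embs, target_embs):
    """query_embs[i] and target_embs[i] are assumed to be the same piece."""
    total = 0.0
    for i, q in enumerate(query_embs):
        scores = [cosine(q, t) for t in target_embs]
        # Rank of the correct target: 1 plus the number of strictly better scores.
        rank = 1 + sum(1 for s in scores if s > scores[i])
        total += 1.0 / rank
    return total / len(query_embs)


# Toy sheet-music and audio embeddings (hypothetical values).
sheet = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
audio = [[0.9, 0.1], [0.2, 0.8], [0.9, 0.0]]

mrr = mean_reciprocal_rank(sheet, audio)
```

In this toy data the three queries rank their true counterpart second, first, and third, so the MRR is (1/2 + 1 + 1/3) / 3 = 11/18.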

Zero-Shot Music Classification


CLaMP 3 performs zero-shot music classification by computing semantic similarity between queries and class prototypes.
This allows flexible, scalable classification without labeled training data.
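A minimal sketch of this procedure, assuming each class prototype is an embedding of a short text label and that similarities are turned into probabilities with a temperature-scaled softmax. The labels, the vectors, and the temperature of 0.1 are hypothetical choices for illustration.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_classify(query_emb, prototypes, temperature=0.1):
    """Return a probability per class from similarities to class prototypes."""
    sims = {label: cosine(query_emb, emb) for label, emb in prototypes.items()}
    # Softmax over scaled similarities, shifted by the max for stability.
    top = max(sims.values())
    exps = {label: math.exp((s - top) / temperature) for label, s in sims.items()}
    z = sum(exps.values())
    return {label: e / z for label, e in exps.items()}


# Toy prototypes (hypothetical): embeddings of class-name prompts.
prototypes = {
    "lullaby": [0.9, 0.1, 0.1],
    "march": [0.1, 0.9, 0.2],
    "dance": [0.1, 0.2, 0.9],
}

# Toy embedding of the piece to classify.
query = [0.8, 0.2, 0.1]
probs = zero_shot_classify(query, prototypes)
predicted = max(probs, key=probs.get)
```

Adding a new class only requires embedding one more label, which is why no labeled training data is needed.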

Brahms' Lullaby

Zero-Shot music classification on Brahms' Lullaby

Music Semantic Similarity Evaluation


Here, we use both the prompt and SUNO v4-generated music as references to measure the semantic similarity of music generated by other systems. Using CLaMP 3, we calculate similarity scores for each model's output against these references. Results show that the semantic similarities measured by CLaMP 3 align closely with human perception.

Model        Prompt Sim↑   SUNO v4 Sim↑
SUNO v3.5    0.1822        0.8346
SkyMusic     0.2216        0.6269
YuE          0.1821        0.5006
Udio         0.1744        0.4883
MiniMax      0.1177        0.4388

Note: MiniMax did not generate the full piece because of a 1-minute time constraint.
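The two scores per system in the table above can be read as similarities between the embedding of a system's output and the embedding of each reference. The sketch below assumes cosine similarity in the shared space and uses hand-written toy vectors; `score_system`, `system_a`, and `system_b` are hypothetical names, and none of the values correspond to the table.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def score_system(output_emb, prompt_emb, reference_audio_emb):
    """Similarity of one generated piece to the text prompt and the audio reference."""
    return {
        "prompt_sim": cosine(output_emb, prompt_emb),
        "reference_sim": cosine(output_emb, reference_audio_emb),
    }


# Toy embeddings (hypothetical): the prompt text and the reference audio.
prompt_emb = [0.6, 0.8, 0.0]
reference_audio_emb = [0.5, 0.7, 0.5]

# Toy embeddings of two systems' generated audio.
outputs = {
    "system_a": [0.5, 0.8, 0.4],
    "system_b": [0.9, 0.1, 0.1],
}

scores = {
    name: score_system(emb, prompt_emb, reference_audio_emb)
    for name, emb in outputs.items()
}
# Order systems by how close they are to the audio reference.
ranking = sorted(scores, key=lambda n: scores[n]["reference_sim"], reverse=True)
```

Scoring against both a text reference and an audio reference gives two views of the same output: how well it follows the prompt, and how close it is to a known good rendition.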



Audio Reference: SUNO v4

🎵 Music Generation Prompt

          [Genre] synthesizer vocal uplifting electronic pop male positive

          [verse]
          I am Canadian, Canadian
          I-I-I am Canadian, Canadian
          My, my, my, my land of the free
          These are all my pals, we are a team
          Hey, hey, hey, no matter what you say
          This is who I am and who I aim to be
          (I aim to be)
          
          
          [verse]
          Everybody came here seekin' to be glad
          From Boston to Dallas, land of joy we've had
          Everybody came here seekin' to be glad
          From Boston to Dallas, land of joy we've had
          Everybody, everybody, everybody
          (Oh, oh, oh)
          
          
          [chorus]
          I am American, glorious and true
          I am American, land of hope and crew
          Am-Am-Am-Am-Am-American, brave and new
          I am American, with dreams to pursue
          
          
          
          [verse]
          You-ou-ou, you hold the might
          Fight for what you believe, we won't give up the fight
          There's no way we'll turn back the light
          Towards the future dear, it's yours and mine
          
          
          [verse]
          Everybody came here longing for a dream
          From Miami to Chicago, land of hope it seems
          Everybody came here longing for a dream
          From Miami to Chicago, land of hope it seems
          
          
          [chorus]
          Each one, each one, each one
          (Hey, hey, hey)
          I am American, American, American
          I am American, American, Green, yellow, and brown
          Am-Am-Am-Am-Am-American, American, American
          I am American, American, just like you now
          
          
          
          [bridge]
          
          I am Canadian, Canadian
          I-I-I am Canadian, Canadian
          I am Canadian, Canadian
          I-I-I am Canadian, Canadian, I
          
          
          [chorus]
          I am Canadian, Canadian, Canadian
          I am Canadian, Canadian, Red and white too
          I am Canadian, Canadian, Canadian
          I am Canadian, Canadian, just like you do
          
          
          
          [chorus]
          Oh-Oh-Oh-Oh-Oh-American, American, American
          I am American, American, Proud, brave, and true
          I am American, American, American
          I am American, American, as great as you do
          
          
          [end]
        

t-SNE Visualizations on WikiMT-X


Finally, we use t-SNE on WikiMT-X to visualize how CLaMP 3 organizes its learned music representations.
These visualizations reveal the distribution of musical modalities, languages, and semantic categories within the shared space.

BibTeX

@misc{wu2025clamp3universalmusic,
        title={CLaMP 3: Universal Music Information Retrieval Across Unaligned Modalities and Unseen Languages}, 
        author={Shangda Wu and Zhancheng Guo and Ruibin Yuan and Junyan Jiang and Seungheon Doh and Gus Xia and Juhan Nam and Xiaobing Li and Feng Yu and Maosong Sun},
        year={2025},
        eprint={2502.10362},
        archivePrefix={arXiv},
        primaryClass={cs.SD},
        url={https://arxiv.org/abs/2502.10362}, 
  }