CLaMP 3: Universal Music Information Retrieval Across Unaligned Modalities and Unseen Languages

Shangda Wu¹, Zhancheng Guo¹, Ruibin Yuan², Junyan Jiang^{3, 4}, Seungheon Doh⁵,
Gus Xia^{3, 4}, Juhan Nam⁵, Xiaobing Li¹, Feng Yu¹, Maosong Sun^{1, 6}

¹Central Conservatory of Music
²Hong Kong University of Science and Technology
³New York University Shanghai
⁴Mohamed bin Zayed University of Artificial Intelligence
⁵Korea Advanced Institute of Science and Technology
⁶Tsinghua University


Abstract

CLaMP 3 is a unified framework developed to address challenges of cross-modal and cross-lingual generalization in music information retrieval. Using contrastive learning, it aligns all major music modalities—including sheet music, performance signals, and audio recordings—with multilingual text in a shared representation space, enabling retrieval across unaligned modalities with text as a bridge. It features a multilingual text encoder adaptable to unseen languages, exhibiting strong cross-lingual generalization. Leveraging retrieval-augmented generation, we curated M4-RAG, a web-scale dataset consisting of 2.31 million music-text pairs. This dataset is enriched with detailed metadata that represents a wide array of global musical traditions. To advance future research, we release WikiMT-X, a benchmark comprising 1,000 triplets of sheet music, audio, and richly varied text descriptions. Experiments show that CLaMP 3 achieves state-of-the-art performance on multiple MIR tasks, significantly surpassing previous strong baselines and demonstrating excellent generalization in multimodal and multilingual music contexts.

Figure 1: CLaMP 3 uses contrastive learning to align features across modalities. Sheet music and performance signals are segmented into units (bars or MIDI messages) and processed by the symbolic music encoder, while audio is segmented into 5-second clips and processed through the audio feature extractor and audio music encoder. Both symbolic and audio representations are aligned with text representations from the multilingual text encoder.
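The audio segmentation step in the caption can be illustrated with a minimal sketch. The function name `split_into_clips`, the toy sample rate, and the decision to keep a shorter final clip are assumptions made for illustration; the page does not say how the last partial clip is handled.

```python
def split_into_clips(samples, sample_rate, clip_seconds=5.0):
    """Split a 1-D sequence of audio samples into consecutive fixed-length clips.

    Assumption: the final clip is kept even when it is shorter than
    clip_seconds; a real pipeline might pad or drop it instead.
    """
    clip_len = int(round(sample_rate * clip_seconds))
    if clip_len < 1:
        raise ValueError("sample_rate * clip_seconds must be at least one sample")
    return [samples[i:i + clip_len] for i in range(0, len(samples), clip_len)]


# Toy input: 12 seconds of silence at a hypothetical 100 Hz sample rate.
clips = split_into_clips([0.0] * 1200, sample_rate=100)
clip_lengths = [len(c) for c in clips]  # two full 5-second clips, then 2 seconds
```

Each clip would then pass through the audio feature extractor and the audio music encoder, as the caption describes.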

Introduction

A key aspect of Music Information Retrieval (MIR) is connecting music with language, enabling music to be retrieved through text descriptions. However, existing MIR systems lack robust multimodal and multilingual support, limiting their ability to capture music's full complexity.

We developed CLaMP 3, a universal MIR framework trained on metadata in 27 languages and music from 194 countries, covering all major musical modalities. Trained with contrastive learning, it generalizes to 100 languages and aligns musical modalities that were never paired with each other during training, using text as a bridge.
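To make the contrastive objective concrete, the sketch below implements a symmetric InfoNCE-style loss over a batch of paired music and text embeddings. This is a common formulation of contrastive learning, not necessarily the exact loss used to train CLaMP 3; the function names and the temperature value of 0.07 are assumptions.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def symmetric_contrastive_loss(music_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss for a batch of paired embeddings.

    Row i of music_embs is paired with row i of text_embs; every other row
    in the batch acts as a negative. The temperature is an assumed value.
    """
    m = [l2_normalize(v) for v in music_embs]
    t = [l2_normalize(v) for v in text_embs]
    # logits[i][j]: temperature-scaled cosine similarity of music i and text j.
    logits = [[sum(a * b for a, b in zip(mi, tj)) / temperature for tj in t]
              for mi in m]

    def cross_entropy_on_diagonal(rows):
        total = 0.0
        for i, row in enumerate(rows):
            mx = max(row)  # subtract the max for numerical stability
            log_z = mx + math.log(sum(math.exp(x - mx) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    music_to_text = cross_entropy_on_diagonal(logits)
    text_to_music = cross_entropy_on_diagonal([list(c) for c in zip(*logits)])
    return 0.5 * (music_to_text + text_to_music)


# Toy batch: correctly paired embeddings versus deliberately swapped ones.
music = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = symmetric_contrastive_loss(music, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = symmetric_contrastive_loss(music, [[0.0, 1.0], [1.0, 0.0]])
```

Minimizing such a loss pulls each music item toward its own text and away from the other texts in the batch. Because every modality is pulled toward the same text space, items from modalities that were never paired directly can still end up close together.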

All retrieved music on this page is from WikiMT-X, a 1,000-piece benchmark, searched via the CLaMP 3 demo on Hugging Face Spaces. This page uses the CLaMP 3 model with the weights CLaMP 3 SAAS (Optimal for Audio). For each query, we showcase the top-1 retrieval result. Click on the images to play the retrieved music videos on YouTube.
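As a rough illustration of what a top-1 result means, the sketch below picks the catalog entry whose embedding has the highest cosine similarity to a query embedding. The piece identifiers and 3-dimensional vectors are invented toy values; in practice the embeddings would come from the CLaMP 3 text and music encoders, whose API is not shown here.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top1(query_emb, catalog):
    """Return (identifier, score) for the catalog entry most similar to the query."""
    scored = ((name, cosine(query_emb, emb)) for name, emb in catalog.items())
    return max(scored, key=lambda pair: pair[1])


# Hypothetical catalog of precomputed music embeddings (toy values).
catalog = {
    "piece_a": [0.9, 0.1, 0.0],
    "piece_b": [0.0, 0.2, 0.9],
}
query_embedding = [1.0, 0.0, 0.1]  # stands in for an encoded text query
best_name, best_score = top1(query_embedding, catalog)
```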

English Text-to-Music Retrieval

Title: I Saw Her Standing There
Query: classic rock, British, 1960s, upbeat

Title: Frenesi
Query: A Latin jazz piece with rhythmic percussion and brass

Title: Oh! Look At Me Now
Query: big band, major key, swing, brass-heavy, syncopation, baritone vocal

Title: When You Were Sweet Sixteen
Query: Heartfelt and nostalgic, with a bittersweet, melancholic feel

Multilingual Text-to-Music Retrieval

CLaMP 3 maps text to music across languages, including languages absent from its training data, showing strong cross-lingual generalization.

Title: Canon In D Major
Query (Spanish, seen during training): Melodía instrumental en re mayor con progresión armónica repetitiva y fluida
Translation: Instrumental melody in D major with a repetitive and fluid harmonic progression

Title: Wearing Of The Green
Query (Chinese, seen during training): D大调四四拍的爱尔兰舞曲
Translation: An Irish dance tune in D major and 4/4 time

Title: Holy God, We Praise Thy Name
Query (Greek, unseen during training): Ιερή μουσική με πνευματική ατμόσφαιρα
Translation: Sacred music with a spiritual atmosphere

Title: If I Could Be With You (One Hour Tonight)
Query (Amharic, unseen during training): የፍቅር ሙዚቃ ሞቅ እና ስሜታማ ከሆነ ነገር ግን ድንቅ እና አስደሳች ቃላት ያካትታል
Translation: A love song that is warm and emotional, with beautiful and heartfelt lyrics

Image-to-Music Retrieval

CLaMP 3 is also trained on scene-depiction captions, so it learns which visual scenes suit which music.
An image captioning model such as BLIP can therefore turn an image into a text query that CLaMP 3 uses to retrieve music matching the scene.
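The two-stage pipeline described above (caption the image, then retrieve with the caption) can be sketched as a small composition of functions. Everything here is hypothetical scaffolding: `toy_caption` stands in for a captioning model such as BLIP, `toy_embed` stands in for the CLaMP 3 text encoder, and the index holds invented vectors.

```python
def image_to_music(image, caption_fn, embed_text_fn, music_index, similarity_fn):
    """Caption an image, embed the caption, and return the best-matching track."""
    caption = caption_fn(image)
    query = embed_text_fn(caption)
    best = max(music_index, key=lambda name: similarity_fn(query, music_index[name]))
    return caption, best


# --- Toy stand-ins so the sketch runs end to end (all hypothetical) ---
def toy_caption(image):
    # A real system would run an image captioning model here.
    return "a pirate ship sailing in the ocean at sunset"


VOCAB = ["pirate", "christmas", "bride"]


def toy_embed(text):
    # Keyword-presence vector; a real system would use a learned text encoder.
    words = text.lower().split()
    return [1.0 if word in words else 0.0 for word in VOCAB]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


toy_index = {"track_pirate": [1.0, 0.0, 0.0], "track_christmas": [0.0, 1.0, 0.0]}
caption, track = image_to_music(object(), toy_caption, toy_embed, toy_index, dot)
```

Passing the captioner and encoder in as arguments keeps the retrieval step independent of any particular captioning model.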

Title: Have Yourself A Merry Little Christmas
BLIP-generated Caption: a christmas tree in the harbor at night

Title: Pomp and Circumstance
BLIP-generated Caption: a group of graduates

Title: He's a Pirate
BLIP-generated Caption: a pirate ship sailing in the ocean at sunset

Title: Wedding March
BLIP-generated Caption: a bride and groom are standing in front of the altar

Cross-Modal Music Retrieval

CLaMP 3 enables cross-modal music retrieval, so a query in one modality, such as sheet music, can find similar pieces in another, such as audio.
It often retrieves semantically related works, and sometimes a different musical representation of the same piece.
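A ranked list such as the top three audio matches for a sheet-music query can be produced by scoring every audio embedding against the query embedding in the shared space. The sketch below uses `heapq.nlargest` for the ranking; the identifiers and 2-dimensional vectors are toy values, and the assumption is that both modalities have already been encoded into the same space.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def rank_audio(sheet_emb, audio_embs, k=3):
    """Return the k audio items most similar to a sheet-music embedding."""
    scored = ((cosine(sheet_emb, emb), name) for name, emb in audio_embs.items())
    return [(name, score) for score, name in heapq.nlargest(k, scored)]


# Hypothetical audio embeddings and a hypothetical sheet-music query embedding.
audio_embs = {
    "audio_a": [1.0, 0.0],
    "audio_b": [0.8, 0.6],
    "audio_c": [0.0, 1.0],
    "audio_d": [-1.0, 0.0],
}
ranking = rank_audio([1.0, 0.1], audio_embs, k=3)
ranked_names = [name for name, _ in ranking]
```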

Query: Hallelujah Chorus (lead sheet)

Top similar audio: Hallelujah Chorus

Second similar audio: Battle Cry of Freedom

Third similar audio: Funiculi, Funicula

Zero-Shot Music Classification

CLaMP 3 performs zero-shot music classification by computing semantic similarity between queries and class prototypes.
This allows flexible, scalable classification without labeled training data.
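A minimal sketch of prototype-based zero-shot classification follows. It assumes each class prototype is an embedding of a class name or description in the shared space, and it turns cosine similarities into probabilities with a softmax; the temperature and the toy vectors are illustrative assumptions, not values from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def zero_shot_classify(music_emb, prototypes, temperature=0.1):
    """Return a {label: probability} dict from similarities to class prototypes."""
    labels = list(prototypes)
    logits = [cosine(music_emb, prototypes[label]) / temperature for label in labels]
    mx = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - mx) for x in logits]
    total = sum(exps)
    return {label: e / total for label, e in zip(labels, exps)}


# Hypothetical prototypes (e.g. encoded class descriptions) and a music embedding.
prototypes = {"lullaby": [1.0, 0.0], "march": [0.0, 1.0]}
probs = zero_shot_classify([0.9, 0.2], prototypes)
predicted = max(probs, key=probs.get)
```

Adding a class only requires adding a prototype, which is why no labeled training data is needed.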

Zero-shot music classification on Brahms' Lullaby

Music Semantic Similarity Evaluation

Here, we use both the generation prompt and music generated by SUNO v4 as references for measuring the semantic similarity of music generated by other systems. Using CLaMP 3, we calculate a similarity score between each model's output and each reference. The similarities measured by CLaMP 3 align closely with human perception.

Model	Prompt Sim ↑	SUNO v4 Sim ↑
SUNO v3.5	0.1822	0.8346
SkyMusic	0.2216	0.6269
YuE	0.1821	0.5006
Udio	0.1744	0.4883
MiniMax	0.1177	0.4388

Note: MiniMax's output is incomplete because of a 1-minute time constraint.
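The two scores in the table can be thought of as similarities between a generated track's embedding and each reference embedding. The sketch below assumes cosine similarity and assumes that a track-level embedding is obtained by averaging per-clip embeddings; both are illustrative choices, and the vectors are toy values rather than real model outputs.

```python
import math


def mean_pool(clip_embs):
    """Average a list of equal-length clip embeddings into one track embedding."""
    n = len(clip_embs)
    return [sum(column) / n for column in zip(*clip_embs)]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def similarity_scores(generated_clips, prompt_emb, reference_clips):
    """Score a generated track against a text prompt and a reference track."""
    generated = mean_pool(generated_clips)
    reference = mean_pool(reference_clips)
    return {
        "prompt_sim": cosine(generated, prompt_emb),
        "reference_sim": cosine(generated, reference),
    }


# Toy embeddings standing in for encoder outputs (all hypothetical).
scores = similarity_scores(
    generated_clips=[[1.0, 0.0], [0.8, 0.2]],
    prompt_emb=[0.0, 1.0],
    reference_clips=[[1.0, 0.0]],
)
```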

Audio Reference: SUNO v4

🎵 Music Generation Prompt

          [Genre] synthesizer vocal uplifting electronic pop male positive

          [verse]
          I am Canadian, Canadian
          I-I-I am Canadian, Canadian
          My, my, my, my land of the free
          These are all my pals, we are a team
          Hey, hey, hey, no matter what you say
          This is who I am and who I aim to be
          (I aim to be)
          
          
          [verse]
          Everybody came here seekin' to be glad
          From Boston to Dallas, land of joy we've had
          Everybody came here seekin' to be glad
          From Boston to Dallas, land of joy we've had
          Everybody, everybody, everybody
          (Oh, oh, oh)
          
          
          [chorus]
          I am American, glorious and true
          I am American, land of hope and crew
          Am-Am-Am-Am-Am-American, brave and new
          I am American, with dreams to pursue
          
          
          
          [verse]
          You-ou-ou, you hold the might
          Fight for what you believe, we won't give up the fight
          There's no way we'll turn back the light
          Towards the future dear, it's yours and mine
          
          
          [verse]
          Everybody came here longing for a dream
          From Miami to Chicago, land of hope it seems
          Everybody came here longing for a dream
          From Miami to Chicago, land of hope it seems
          
          
          [chorus]
          Each one, each one, each one
          (Hey, hey, hey)
          I am American, American, American
          I am American, American, Green, yellow, and brown
          Am-Am-Am-Am-Am-American, American, American
          I am American, American, just like you now
          
          
          
          [bridge]
          
          I am Canadian, Canadian
          I-I-I am Canadian, Canadian
          I am Canadian, Canadian
          I-I-I am Canadian, Canadian, I
          
          
          [chorus]
          I am Canadian, Canadian, Canadian
          I am Canadian, Canadian, Red and white too
          I am Canadian, Canadian, Canadian
          I am Canadian, Canadian, just like you do
          
          
          
          [chorus]
          Oh-Oh-Oh-Oh-Oh-American, American, American
          I am American, American, Proud, brave, and true
          I am American, American, American
          I am American, American, as great as you do
          
          
          [end]

t-SNE Visualizations on WikiMT-X

Finally, we use t-SNE on WikiMT-X to visualize how CLaMP 3 organizes its learned music representations.
These visualizations reveal the distribution of musical modalities, languages, and semantic categories within the shared space.

BibTeX

@misc{wu2025clamp3universalmusic,
        title={CLaMP 3: Universal Music Information Retrieval Across Unaligned Modalities and Unseen Languages}, 
        author={Shangda Wu and Zhancheng Guo and Ruibin Yuan and Junyan Jiang and Seungheon Doh and Gus Xia and Juhan Nam and Xiaobing Li and Feng Yu and Maosong Sun},
        year={2025},
        eprint={2502.10362},
        archivePrefix={arXiv},
        primaryClass={cs.SD},
        url={https://arxiv.org/abs/2502.10362}, 
  }