CLaMP 3 is a unified framework developed to address challenges of cross-modal and cross-lingual generalization in music information retrieval. Using contrastive learning, it aligns all major music modalities—including sheet music, performance signals, and audio recordings—with multilingual text in a shared representation space, enabling retrieval across unaligned modalities with text as a bridge. It features a multilingual text encoder adaptable to unseen languages, exhibiting strong cross-lingual generalization. Leveraging retrieval-augmented generation, we curated M4-RAG, a web-scale dataset consisting of 2.31 million music-text pairs. This dataset is enriched with detailed metadata that represents a wide array of global musical traditions. To advance future research, we release WikiMT-X, a benchmark comprising 1,000 triplets of sheet music, audio, and richly varied text descriptions. Experiments show that CLaMP 3 achieves state-of-the-art performance on multiple MIR tasks, significantly surpassing previous strong baselines and demonstrating excellent generalization in multimodal and multilingual music contexts.
A key aspect of Music Information Retrieval (MIR) is connecting music with language, enabling music to be retrieved through text descriptions. However, existing MIR systems lack robust multimodal and multilingual support, limiting their ability to capture music's full complexity.
We developed CLaMP 3, a universal MIR framework trained on metadata in 27 languages and music from 194 countries, covering all major musical modalities. Trained with contrastive learning, it generalized to 100 languages and aligned musical modalities without supervision.
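The page states only that CLaMP 3 is trained with contrastive learning to place music and text in a shared space. As a rough illustration of that idea, the sketch below computes a symmetric, CLIP-style contrastive (InfoNCE) loss over a small batch of paired embeddings. The loss form, the `temperature` value, and all function names are assumptions made for illustration; they are not taken from the CLaMP 3 code.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    peak = max(row)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_sum for x in row]


def symmetric_contrastive_loss(music_embs, text_embs, temperature=0.07):
    """Average of music-to-text and text-to-music cross-entropy.

    Row i of music_embs is assumed to be paired with row i of text_embs.
    The temperature of 0.07 is an illustrative default, not a CLaMP 3 setting.
    """
    music = [l2_normalize(v) for v in music_embs]
    text = [l2_normalize(v) for v in text_embs]
    n = len(music)

    # Temperature-scaled cosine similarity between every music/text pair.
    logits = [
        [sum(a * b for a, b in zip(m, t)) / temperature for t in text]
        for m in music
    ]

    # Music-to-text: each row should put its mass on the diagonal entry.
    music_to_text = -sum(log_softmax(logits[i])[i] for i in range(n)) / n

    # Text-to-music: same objective over the columns.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    text_to_music = -sum(log_softmax(columns[j])[j] for j in range(n)) / n

    return 0.5 * (music_to_text + text_to_music)
```

Minimizing a loss of this kind pulls matching music and text embeddings together and pushes mismatched pairs apart, which is what makes nearest-neighbor retrieval in the shared space meaningful.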
All retrieved music on this page comes from WikiMT-X, a 1,000-piece benchmark, and was searched via the CLaMP 3 demo on Hugging Face Spaces. This page uses the CLaMP 3 model with the CLaMP 3 SAAS weights (Optimal for Audio). For each query, we showcase the top-1 retrieval result. Click on the images to play the retrieved music videos on YouTube.
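Top-1 retrieval in a shared embedding space can be sketched as a nearest-neighbor search. The example below assumes the query and every candidate piece have already been encoded into vectors, and that candidates are ranked by cosine similarity; the function names, the index layout, and the toy vectors are hypothetical placeholders, not output of the CLaMP 3 model or the demo's actual implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve_top_k(query_emb, index, k=1):
    """Return the k (title, score) pairs most similar to the query.

    `index` maps a piece title to its precomputed embedding.
    """
    scored = [
        (title, cosine_similarity(query_emb, emb))
        for title, emb in index.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy data: placeholder vectors standing in for real embeddings.
toy_index = {
    "piece_a": [0.9, 0.1, 0.0],
    "piece_b": [0.1, 0.9, 0.2],
}
toy_query = [0.8, 0.2, 0.1]

top_title, top_score = retrieve_top_k(toy_query, toy_index, k=1)[0]
print(top_title, round(top_score, 3))
```

Because text and every music modality share one space, the same search would apply whether the query embedding comes from a text description, sheet music, or audio.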
CLaMP 3 maps text to music across languages, including languages absent from its training data, showing strong cross-lingual generalization.
Title: Canon In D Major
Query (Spanish, seen during training): Melodía instrumental en re mayor con progresión armónica repetitiva y fluida
Translation: Instrumental melody in D major with a repetitive and fluid harmonic progression
Title: Wearing Of The Green
Query (Chinese, seen during training): D大调四四拍的爱尔兰舞曲
Translation: An Irish dance tune in D major and 4/4 time
Additionally, CLaMP 3 is trained on scene depiction captions, so it learns visual semantics that describe the contexts a piece of music suits.
Building on this, an image captioning model such as BLIP can turn an image into a caption, which CLaMP 3 then uses to retrieve music that matches the scene.
CLaMP 3 enables cross-modal music retrieval, allowing queries like sheet music to find similar audio pieces.
It often finds semantically related works and sometimes different musical representations of the same piece.
Here, we use both the prompt and SUNO v4-generated music as references to measure the semantic similarity of music generated by other systems. Using CLaMP 3, we calculate similarity scores between each model's output and these references. The results show that the semantic similarities measured by CLaMP 3 closely align with human perception.
Model | Prompt Sim↑ | SUNO v4 Sim↑ |
---|---|---|
SUNO v3.5 | 0.1822 | 0.8346 |
SkyMusic | 0.2216 | 0.6269 |
YuE | 0.1821 | 0.5006 |
Udio | 0.1744 | 0.4883 |
MiniMax | 0.1177 | 0.4388 |
Note: MiniMax could not generate the full song because of a 1-minute time limit.
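The two similarity columns can be thought of as one score per model against each reference. The sketch below assumes that every generated track, the text prompt, and the reference track have already been embedded in the same space, and that the score is cosine similarity; both the metric and the names used here are assumptions for illustration, and the vectors are toy placeholders rather than real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def score_against_references(output_embs, prompt_emb, reference_emb):
    """Return (model, prompt_sim, reference_sim) rows, best reference match first.

    `output_embs` maps a model name to the embedding of its generated track.
    """
    rows = [
        (
            model_name,
            cosine_similarity(emb, prompt_emb),
            cosine_similarity(emb, reference_emb),
        )
        for model_name, emb in output_embs.items()
    ]
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows


# Toy data: placeholder vectors, not real model outputs.
toy_prompt = [1.0, 0.0]
toy_reference = [0.0, 1.0]
toy_outputs = {"model_x": [0.0, 2.0], "model_y": [1.0, 1.0]}

rows = score_against_references(toy_outputs, toy_prompt, toy_reference)
for model_name, prompt_sim, reference_sim in rows:
    print(f"{model_name} | {prompt_sim:.4f} | {reference_sim:.4f}")
```

Sorting by the reference similarity mirrors the ordering of the table above, where rows run from the highest to the lowest SUNO v4 similarity.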
[Genre] synthesizer vocal uplifting electronic pop male positive [verse] I am Canadian, Canadian I-I-I am Canadian, Canadian My, my, my, my land of the free These are all my pals, we are a team Hey, hey, hey, no matter what you say This is who I am and who I aim to be (I aim to be) [verse] Everybody came here seekin' to be glad From Boston to Dallas, land of joy we've had Everybody came here seekin' to be glad From Boston to Dallas, land of joy we've had Everybody, everybody, everybody (Oh, oh, oh) [chorus] I am American, glorious and true I am American, land of hope and crew Am-Am-Am-Am-Am-American, brave and new I am American, with dreams to pursue [verse] You-ou-ou, you hold the might Fight for what you believe, we won't give up the fight There's no way we'll turn back the light Towards the future dear, it's yours and mine [verse] Everybody came here longing for a dream From Miami to Chicago, land of hope it seems Everybody came here longing for a dream From Miami to Chicago, land of hope it seems [chorus] Each one, each one, each one (Hey, hey, hey) I am American, American, American I am American, American, Green, yellow, and brown Am-Am-Am-Am-Am-American, American, American I am American, American, just like you now [bridge] I am Canadian, Canadian I-I-I am Canadian, Canadian I am Canadian, Canadian I-I-I am Canadian, Canadian, I [chorus] I am Canadian, Canadian, Canadian I am Canadian, Canadian, Red and white too I am Canadian, Canadian, Canadian I am Canadian, Canadian, just like you do [chorus] Oh-Oh-Oh-Oh-Oh-American, American, American I am American, American, Proud, brave, and true I am American, American, American I am American, American, as great as you do [end]
@misc{wu2025clamp3universalmusic,
title={CLaMP 3: Universal Music Information Retrieval Across Unaligned Modalities and Unseen Languages},
author={Shangda Wu and Zhancheng Guo and Ruibin Yuan and Junyan Jiang and Seungheon Doh and Gus Xia and Juhan Nam and Xiaobing Li and Feng Yu and Maosong Sun},
year={2025},
eprint={2502.10362},
archivePrefix={arXiv},
primaryClass={cs.SD},
url={https://arxiv.org/abs/2502.10362},
}