Music Source Separation in the Waveform Domain
Abstract
Source separation for music is the task of isolating contributions, or stems, from different instruments
recorded individually and arranged together to form a song. Such components include voice, bass, drums, and any other accompaniment.
In contrast to many audio synthesis tasks, where the best performance is achieved by models that directly generate the waveform, the state of the art in music source separation is to compute masks on the magnitude spectrum.
In this paper, we compare two waveform domain architectures. We first adapt Conv-Tasnet, initially developed for speech source separation,
to the task of music source separation. While Conv-Tasnet beats many existing spectrogram-domain methods, it suffers
from significant artifacts, as shown by human evaluations. We propose instead Demucs, a novel waveform-to-waveform model
with a U-Net structure and a bidirectional LSTM.
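The abstract only names the architecture's ingredients. As a rough illustration of the shape such a model takes, here is a toy PyTorch sketch of a waveform U-Net with a bidirectional LSTM bottleneck; the class name, depth, channel counts, and kernel sizes are all invented for illustration and do not match the actual Demucs implementation.

```python
import torch
import torch.nn as nn

class MiniDemucs(nn.Module):
    """Toy waveform U-Net with a BiLSTM bottleneck.

    Hyper-parameters here are illustrative only, not the real
    Demucs configuration.
    """
    def __init__(self, sources=4, channels=16, depth=3):
        super().__init__()
        self.encoder = nn.ModuleList()
        self.decoder = nn.ModuleList()
        in_ch, ch = 1, channels
        for _ in range(depth):
            # Strided 1D convolutions progressively downsample the waveform.
            self.encoder.append(nn.Sequential(
                nn.Conv1d(in_ch, ch, kernel_size=8, stride=4), nn.ReLU()))
            in_ch, ch = ch, ch * 2
        # Bidirectional LSTM over the downsampled sequence, projected
        # back to the encoder's channel count.
        self.lstm = nn.LSTM(in_ch, in_ch, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * in_ch, in_ch)
        for i in range(depth):
            out_ch = sources if i == depth - 1 else in_ch // 2
            act = nn.Identity() if i == depth - 1 else nn.ReLU()
            # Transposed convolutions upsample back to waveform rate.
            self.decoder.append(nn.Sequential(
                nn.ConvTranspose1d(in_ch, out_ch, kernel_size=8, stride=4),
                act))
            in_ch = out_ch

    def forward(self, mix):
        skips, x = [], mix
        for enc in self.encoder:
            x = enc(x)
            skips.append(x)
        x = self.proj(self.lstm(x.permute(0, 2, 1))[0]).permute(0, 2, 1)
        for dec in self.decoder:
            skip = skips.pop()
            x = dec(x + skip[..., :x.shape[-1]])  # U-Net skip connection
        return x[..., :mix.shape[-1]]  # one waveform channel per source

mix = torch.randn(1, 1, 1024)  # (batch, channels, samples)
out = MiniDemucs()(mix)        # (1, sources, <= 1024 samples)
```

The key design point the abstract alludes to: the model maps waveform to waveform directly, with skip connections carrying fine temporal detail past the LSTM bottleneck, instead of predicting a mask on a magnitude spectrogram.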
Experiments on the MusDB dataset show that, with proper data augmentation, Demucs beats all
existing state-of-the-art architectures, including Conv-Tasnet, with 6.3 SDR on average (and up to 6.8 with 150 extra training songs, even surpassing the IRM oracle for the bass source).
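The SDR (signal-to-distortion ratio) quoted above summarizes separation quality in decibels. The benchmark itself relies on the full BSS Eval metrics; a simplified variant that captures the basic idea, the ratio of reference energy to residual energy, can be sketched as follows (function name and epsilon are illustrative):

```python
import numpy as np

def sdr(reference, estimate, eps=1e-9):
    """Simplified signal-to-distortion ratio in dB.

    The MusDB benchmark uses the full BSS Eval metrics; this keeps
    only the ratio of reference energy to residual-error energy.
    """
    num = np.sum(reference ** 2)
    den = np.sum((reference - estimate) ** 2)
    return 10.0 * np.log10((num + eps) / (den + eps))

ref = np.ones(1000)
halved_sdr = sdr(ref, 0.5 * ref)  # halving the amplitude costs ~6 dB
```

Higher is better: a perfect estimate drives the residual energy toward zero and the SDR toward infinity, so a gain from 6.3 to 6.8 dB on this logarithmic scale is a substantial improvement.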
Using recent developments in model quantization, Demucs can be compressed down to 120 MB
without any loss of accuracy.
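The abstract does not specify the quantization scheme, but the underlying idea of trading float32 weights for low-bit integer codes plus a scale can be sketched with a naive uniform quantizer (not the method actually used by the paper):

```python
import numpy as np

def quantize_uniform(w, bits=8):
    # Naive per-tensor uniform quantization: store int8 codes plus one
    # float scale (4x smaller than float32, before any further coding).
    scale = np.abs(w).max() / (2 ** (bits - 1) - 1)
    codes = np.round(w / scale).astype(np.int8)
    return codes, scale

def dequantize(codes, scale):
    # Reconstruct approximate float weights from the integer codes.
    return codes.astype(np.float32) * scale

w = np.random.randn(10000).astype(np.float32)  # stand-in for a weight tensor
codes, scale = quantize_uniform(w)
max_err = np.max(np.abs(w - dequantize(codes, scale)))  # bounded by scale / 2
```

Per-element rounding error is bounded by half the quantization step, which is why accuracy can survive aggressive compression when the step is small relative to the weight distribution.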
We also provide human evaluations showing that Demucs benefits from a large advantage
in terms of the naturalness of the audio. However, it suffers from some bleeding,
especially between the vocals and the other sources.