ByteDance AI Research Introduces StemGen: A New Music Generation Deep Learning Model

2023-12-19


Music generation is the task of creating music with deep learning models that learn the patterns and structures of existing music; commonly used architectures include RNNs, LSTMs, and Transformers. This article explores a novel approach that uses a non-autoregressive transformer model to generate music audio conditioned on musical context. Rather than relying on abstract conditioning signals such as text descriptions, the approach focuses on listening and responding to existing audio. The article also surveys recent advances in the field and describes the improvements the researchers make to the model architecture.

Researchers from ByteDance's SAMI team propose a non-autoregressive transformer model that can listen and respond to musical context, built on the publicly available audio encoder checkpoint released with the MusicGen model. They evaluate the model with standard metrics and music information retrieval descriptor methods, including Fréchet Audio Distance (FAD) and MIRDD, and demonstrate its audio quality and musical alignment through these objective metrics as well as subjective Mean Opinion Score (MOS) testing.
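FAD compares the statistics of embeddings extracted from generated audio with those from reference audio. Below is a minimal sketch of the standard Fréchet distance computation; the embedding model and the exact evaluation pipeline used by the authors are not specified in this summary, so the inputs here are placeholders.

```python
# Minimal sketch of Frechet Audio Distance (FAD) between two sets of audio
# embeddings (e.g., produced by a pretrained audio classifier). The embedding
# arrays are placeholders; the paper's exact embedding model is not given here.
import numpy as np
from scipy.linalg import sqrtm

def frechet_audio_distance(real_emb: np.ndarray, gen_emb: np.ndarray) -> float:
    """real_emb, gen_emb: arrays of shape (num_clips, embedding_dim)."""
    mu_r, mu_g = real_emb.mean(axis=0), gen_emb.mean(axis=0)
    cov_r = np.cov(real_emb, rowvar=False)
    cov_g = np.cov(gen_emb, rowvar=False)
    # Matrix square root of the product of the two covariance matrices.
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))

# Example with random placeholder embeddings:
# fad = frechet_audio_distance(np.random.randn(500, 128), np.random.randn(500, 128))
```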

This study summarizes the latest advancements in end-to-end music audio generation, drawing inspiration from techniques used in image and language processing. It highlights the alignment issues in music composition and criticizes traditional abstract-condition-based methods. It proposes a new training method that uses a non-autoregressive transformer model capable of responding to musical context. It utilizes two conditioning sources and defines the problem as conditional generation. The model is evaluated using objective metrics, music information retrieval descriptors, and auditory tests.
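As a rough illustration of how "listen and respond" can be framed as conditional generation: one plausible data setup (an assumption for illustration, not a detail confirmed by this summary) is to mix some stems of a multi-track recording into an audio context and train the model to generate a held-out stem given that context plus a second conditioning source.

```python
# Illustrative sketch only: building a (context, target) training pair for
# context-conditioned stem generation. The dataset layout, mixing scheme, and
# the nature of the second conditioning source are assumptions, not details
# taken from the paper.
import numpy as np

def make_training_pair(stems: list[np.ndarray], target_idx: int):
    """Given time-aligned mono stems from one recording, build a
    (context, target) pair: the context is a mix of the other stems and
    the target is the stem the model should learn to generate."""
    target = stems[target_idx]
    others = [s for i, s in enumerate(stems) if i != target_idx]
    context = np.sum(others, axis=0) if others else np.zeros_like(target)
    return context, target

# Both signals would then be tokenised by the audio encoder, and training
# would maximise p(target_tokens | context_tokens, second_condition).
```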

The method generates music with a non-autoregressive transformer operating on discrete tokens produced by an audio encoding model that incorporates a residual vector quantizer. Multiple audio channels are combined into a single sequence element by concatenating their embeddings. Training uses a masking procedure, and token sampling uses classifier-free guidance to improve alignment with the musical context. Performance is evaluated with FAD and MIRDD, with generated output samples compared against real stems across several metrics.
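The sketch below illustrates, in simplified form, the two mechanisms mentioned above: masked-token training and classifier-free guidance at sampling time. The token shapes, mask-token id, masking schedule, and guidance formula are illustrative assumptions, and details such as the residual-vector-quantizer levels and embedding concatenation are omitted.

```python
# Hedged sketch of masked-token training and classifier-free-guidance sampling
# over discrete audio tokens. All constants and the model interface here are
# assumptions for illustration, not the paper's implementation.
import torch
import torch.nn.functional as F

MASK_ID = 1024          # assumed id of the special [MASK] token
VOCAB = 1025            # assumed codebook size plus the mask token

def masked_training_step(model, tokens, context):
    """tokens: (batch, seq) target-stem token ids; context: conditioning tensor."""
    # Mask a random fraction of positions in each example.
    mask = torch.rand_like(tokens, dtype=torch.float) < torch.rand(tokens.shape[0], 1)
    inputs = tokens.masked_fill(mask, MASK_ID)
    logits = model(inputs, context)                  # (batch, seq, VOCAB)
    # Train the model to recover the original tokens at masked positions.
    return F.cross_entropy(logits[mask], tokens[mask])

@torch.no_grad()
def guided_logits(model, inputs, context, null_context, scale=3.0):
    """Classifier-free guidance: push the conditional prediction away from the
    unconditional one by a guidance scale before sampling tokens."""
    cond = model(inputs, context)
    uncond = model(inputs, null_context)             # conditioning dropped/nulled
    return uncond + scale * (cond - uncond)
```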

The evaluation uses standard metrics and music information retrieval descriptor methods, including FAD and MIRDD. Comparison with real stems shows that the model reaches audio quality comparable to state-of-the-art text-conditioned models while exhibiting strong musical coherence with the context. MOS testing with musically trained participants further confirms that the model produces plausible musical results, and MIRDD, which measures how well the distributions of generated and real stems align, provides a quantitative view of musical consistency and alignment.

In summary, the main contributions of this study are as follows:

  • Proposing a new training approach for generative models capable of responding to musical context.
  • Introducing two novel improvements to the non-autoregressive transformer architecture: multi-source classifier-free guidance and a causally biased iterative decoding procedure (see the sketch after this list).
  • Training the model on open-source and proprietary datasets, achieving state-of-the-art audio quality.
  • Validating the audio quality of the model using standard metrics and music information retrieval descriptors.
  • Verifying the model's ability to generate realistic musical results through MOS testing.
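To make the second contribution more concrete, below is a hedged sketch of what a causally biased iterative decoding procedure can look like: a MaskGIT-style decoder that commits the highest-confidence tokens each round, with a positional bias so earlier time steps tend to be revealed first. The bias strength, unmasking schedule, and confidence scoring are assumptions for illustration and are not taken from the paper.

```python
# Illustrative sketch of iterative decoding with a causal bias. The model
# interface, mask id, and scoring scheme are assumptions, not StemGen's code.
import torch

@torch.no_grad()
def causally_biased_iterative_decode(model, context, seq_len, steps=8,
                                     mask_id=1024, causal_weight=1.0):
    # Start from a fully masked sequence and reveal it over `steps` rounds.
    tokens = torch.full((1, seq_len), mask_id, dtype=torch.long)
    for step in range(steps):
        still_masked = tokens[0] == mask_id
        if not still_masked.any():
            break
        logits = model(tokens, context)                   # (1, seq_len, vocab)
        probs = logits.softmax(dim=-1)
        sampled = torch.multinomial(probs[0], num_samples=1).squeeze(-1)
        conf = probs[0, torch.arange(seq_len), sampled]   # confidence per position
        # Causal bias: a score term that decays with position, so earlier
        # time steps tend to be committed before later ones.
        score = conf + causal_weight * torch.linspace(1.0, 0.0, seq_len)
        score = score.masked_fill(~still_masked, float("-inf"))
        # Commit the best-scoring masked positions; everything left is
        # committed on the final round.
        k = max(1, int(still_masked.sum().item() / (steps - step)))
        keep = score.topk(k).indices
        tokens[0, keep] = sampled[keep]
    return tokens
```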