Audio Diffusion: The Secrets of Music Creation

2024-01-23

Background

The term "AI-generated" has become ubiquitous in the music industry, but what does it actually mean? As the buzzword has spread, it has been thrown around loosely: whether AI was used to simulate effects, automatically mix or master tracks, separate audio sources, or enhance sound, as long as the final audio has been touched by AI at some point, the entire work gets the label. However, the majority of music being released today is still primarily created by humans (yes, even including the genius of "Heart On My Sleeve").

Although "AI-generated" has become a clickbait cliché, the term properly applies when a computer truly creates new sounds, a practice referred to as "generative audio".

Generative audio can include the creation of sound effects, melodies, vocals, or even entire songs. There are two main methods for achieving generative audio: MIDI generation and audio waveform generation. MIDI (Musical Instrument Digital Interface) generation has much lower computational costs and can provide high-quality output because the generated MIDI data can be used to produce sound through existing virtual instruments. This concept is similar to producers composing MIDI on a piano roll and playing it back through VST plugins like Serum.
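To make the MIDI path concrete, here is a minimal sketch using the pretty_midi library, with random choices from a C-minor scale standing in for a trained model (the scale, note lengths, and file name are all illustrative assumptions). The point is that the output is a list of note events, not audio; it only becomes sound once a virtual instrument renders it.

```python
import random
import pretty_midi

# Stand-in "model": random choices from a C-minor scale. A real system would
# predict pitches, timings, and velocities with a trained network instead.
scale = [60, 62, 63, 65, 67, 68, 70, 72]          # MIDI note numbers

pm = pretty_midi.PrettyMIDI()
piano = pretty_midi.Instrument(program=0)          # the sound comes from whatever renders this later

time = 0.0
for _ in range(16):
    pitch = random.choice(scale)
    piano.notes.append(pretty_midi.Note(velocity=90, pitch=pitch,
                                        start=time, end=time + 0.5))
    time += 0.5

pm.instruments.append(piano)
pm.write("melody.mid")                             # note events only; no audio has been generated
```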

While this is appealing, it is only partial generation: the AI is not actually generating audio, much as a producer writing MIDI is not synthesizing the instrument sounds themselves. The algorithm's creative range is also limited by whichever virtual instruments it can play its MIDI through. Even so, products built on this approach, such as AIVA and Lemonaide's Seeds, can produce quite impressive output.

Audio waveform generation is a much more complex task, as it is an end-to-end system that does not rely on any external techniques. In other words, it can generate sound from scratch. This process aligns more closely with the true definition of "AI-generated" audio.

Audio waveform generation can be accomplished through various methods, with different outcomes. It can produce individual samples, as with Audialab's ED 2 and Humanize or Tiny Audio Diffusion, or complete songs, as with AudioLM, Moûsai, Riffusion, MusicGen, and Stable Audio. Many of these cutting-edge models use some form of diffusion to generate sound, a concept you may recognize from Stable Diffusion and other globally popular image generation models; the same generation method applies equally well to audio. But what does that mean exactly?

What is Diffusion?

Background

In the context of AI, diffusion refers to the process of adding noise to or removing noise from a signal (similar to static on an old television). Forward diffusion adds noise to the signal (noising), while reverse diffusion removes it (denoising). Conceptually, diffusion models start from white noise and progressively denoise it until the audio resembles something recognizable, such as a sample or a song. This denoising process is the secret behind the creative power of many generative audio models.
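As a rough sketch of the forward (noising) direction, the snippet below adds a small amount of Gaussian noise at each step to a sine tone standing in for "audio" (the sample rate, step count, and noise level are arbitrary choices for illustration). The reverse direction, removing a little noise per step, is what a diffusion model has to learn.

```python
import numpy as np

sr = 16000
t = np.linspace(0, 1, sr, endpoint=False)
clean = 0.5 * np.sin(2 * np.pi * 440 * t)     # a 440 Hz tone standing in for "audio"

noisy = clean.copy()
for step in range(50):
    noisy += 0.05 * np.random.randn(sr)       # forward diffusion: a little more noise each step

# After enough steps, `noisy` is effectively white noise. Reverse diffusion is
# the learned inverse: start from static and strip a little noise away per step
# until something recognizable, like a sample or a song, emerges.
```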


This process was initially developed for images. Watching noise gradually resolve into an image (e.g., a small dog sitting next to a tennis ball) gives a clearer intuition for how these models work.

With a conceptual understanding in place, let's take a look at the key components of the architecture of an audio diffusion model.

U-Net Architecture, Compression, and Reconstruction

The core of an audio diffusion model is the U-Net. U-Net was originally developed for medical image segmentation and is named after its U-shaped appearance. Due to its powerful ability to capture both local and global features in data, U-Net has been adapted for audio generation. The original U-Net is a two-dimensional convolutional neural network (CNN) used for images, but it can also be adjusted to one-dimensional convolutions to handle audio waveform data. Below is a visual representation of the original U-Net architecture for images.
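To see what "adjusting to one-dimensional convolutions" looks like in practice, here is a small PyTorch sketch (tensor sizes and channel counts chosen purely for illustration) contrasting a 2-D convolution over an image with a 1-D convolution over a raw waveform:

```python
import torch
import torch.nn as nn

image = torch.randn(1, 3, 256, 256)        # (batch, channels, height, width)
waveform = torch.randn(1, 1, 32768)        # (batch, channels, samples), ~2 s at 16 kHz

conv2d = nn.Conv2d(3, 16, kernel_size=3, padding=1)   # image-style U-Net layer
conv1d = nn.Conv1d(1, 16, kernel_size=9, padding=4)   # waveform-style equivalent

print(conv2d(image).shape)     # torch.Size([1, 16, 256, 256])
print(conv1d(waveform).shape)  # torch.Size([1, 16, 32768])
```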

Similar to a Variational Autoencoder (VAE), U-Net consists of an encoder (the left side of the U) and a decoder (the right side of the U), connected by a bottleneck (the bottom of the U). Unlike a VAE, however, U-Net carries skip connections (shown as horizontal gray arrows) that link the encoder to the decoder, and these are crucial for producing high-resolution output. The encoder is responsible for capturing the features or characteristics of the input audio signal, while the decoder is responsible for reconstructing the signal.

To help visualize, imagine audio data entering from the top left of the U, traveling along the red and blue arrows through the encoder to reach the bottleneck at the bottom, and then returning through the decoder along the blue and green arrows to the top right of the U. At each layer of the encoder, the input audio signal is further compressed until it reaches a highly condensed representation of the sound at the bottom of the U (the bottleneck). The decoder then receives the compressed signal and effectively reverses the process to reconstruct the signal. Each layer (blue rectangle) that the data passes through has a series of adjustable weights, which can be thought of as millions of small knobs that adjust the compression/reconstruction process. With layers of varying compression levels, the model learns a range of features from the data, from large-scale features like melody and rhythm to fine-grained details like high-frequency timbral characteristics.
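Putting those pieces together, the sketch below is a deliberately tiny 1-D U-Net in PyTorch: an encoder that compresses the waveform, a bottleneck, a decoder that reconstructs it, and a skip connection carrying fine detail across the U. It is only a sketch under simplifying assumptions (real audio diffusion U-Nets are far deeper and also condition on a timestep), but the data flow mirrors the description above.

```python
import torch
import torch.nn as nn

class TinyUNet1D(nn.Module):
    """A minimal 1-D U-Net: one downsampling level and one skip connection."""

    def __init__(self, channels=1, width=32):
        super().__init__()
        # Encoder: extract features, then halve the time resolution (compression).
        self.enc1 = nn.Sequential(nn.Conv1d(channels, width, 9, padding=4), nn.SiLU())
        self.enc2 = nn.Sequential(nn.Conv1d(width, width * 2, 9, stride=2, padding=4), nn.SiLU())
        # Bottleneck: the most condensed representation of the signal.
        self.mid = nn.Sequential(nn.Conv1d(width * 2, width * 2, 9, padding=4), nn.SiLU())
        # Decoder: upsample and fuse with the matching encoder output (skip connection).
        self.up = nn.ConvTranspose1d(width * 2, width, 4, stride=2, padding=1)
        self.dec = nn.Sequential(nn.Conv1d(width * 2, width, 9, padding=4), nn.SiLU(),
                                 nn.Conv1d(width, channels, 9, padding=4))

    def forward(self, x):
        s1 = self.enc1(x)                 # full-resolution features
        s2 = self.enc2(s1)                # compressed features
        h = self.mid(s2)                  # bottleneck
        h = self.up(h)                    # back to full resolution
        h = torch.cat([h, s1], dim=1)     # skip connection from the encoder
        return self.dec(h)                # reconstructed waveform

x = torch.randn(1, 1, 16384)              # one second of 16 kHz mono audio
print(TinyUNet1D()(x).shape)              # torch.Size([1, 1, 16384])
```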

Think of the entire system as creating an MP3 audio file and then listening to that MP3 on a playback device. At its core, an MP3 is a compressed version of an audio signal. Imagine the encoder's job is to create a new type of compressed audio format, similar to MP3, that compresses the audio signal as much as possible without losing fidelity. The decoder's job is then to reconstruct that compressed format into high-fidelity audio that can be played back through headphones from an iPhone (or any other playback device). The bottleneck can be seen as the newly created MP3-like format itself. U-Net represents the compression and reconstruction process, not the audio data itself. The model can then be trained with the goal of accurately compressing and reconstructing a wide variety of audio signals.

All of this is well and good, but we haven't generated anything yet. We've just built a method for compressing and reconstructing audio signals. However, this is the fundamental process needed to begin generating new audio, with just a few adjustments.

Noising and Denoising

Let's revisit the concepts of noising and denoising mentioned earlier. Suppose, hypothetically, we have some magical model that can be taught to denoise white noise into recognizable audio, perhaps a beautiful concerto. One key requirement for this magical model is that it must be able to reconstruct an input audio signal with high fidelity. Fortunately, the U-Net architecture is designed for exactly this purpose. The next challenge is to modify the U-Net to perform the denoising process.

Somewhat counterintuitively, to teach a model to denoise audio signals, we must first teach it how to add noise to them. Once the model has learned how noise is added, reversing that operation to denoise a signal follows naturally.

Recall from the previous section how U-Net learns to compress and reconstruct audio signals. The denoising process follows almost the same recipe, except that the U-Net no longer aims to reconstruct the exact input signal; instead, it aims to reconstruct the input signal with a small amount of noise added. Metaphorically, this is like running the sequence of dog images above in reverse.

The process of adding noise to the signal must be probabilistic, i.e., defined by a probability distribution, so that it follows a pattern the model can learn to predict. The model is first shown an audio signal and then asked to predict the same signal with a small amount of Gaussian noise added. Gaussian noise is the most common choice, but it is not strictly required; any noise defined by a probability distribution will do. This step of adding a small amount of predictable noise is repeated over many steps until the signal is reduced to essentially pure noise.
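Many implementations phrase the same idea slightly differently: rather than predicting the noisier signal, the network is trained to predict the noise that was added, which carries the same information. Below is a rough DDPM-style sketch of that training loop; the schedule values and the hypothetical `model(x_t, t)` (a U-Net that takes the noisy signal and the step index) are illustrative assumptions, not a specific library's API.

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)             # how much noise each step adds
alphas_bar = torch.cumprod(1.0 - betas, dim=0)    # cumulative signal-retention factor

def noise_signal(x0, t, eps):
    """Jump straight to step t of the forward (noising) process in closed form."""
    a = alphas_bar[t].sqrt().view(-1, 1, 1)
    s = (1.0 - alphas_bar[t]).sqrt().view(-1, 1, 1)
    return a * x0 + s * eps                        # mostly signal early on, mostly noise later

def training_step(model, x0, optimizer):
    """One training step: noise a batch to random depths, predict the added noise."""
    t = torch.randint(0, T, (x0.shape[0],))
    eps = torch.randn_like(x0)
    x_t = noise_signal(x0, t, eps)
    loss = F.mse_loss(model(x_t, t), eps)          # learn what noise was added at step t
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```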


For example, let's take a single snare drum sample. We give this sample to the U-Net and ask it to reconstruct the snare sound with a little noise added, so it sounds slightly less clean. We then give this slightly noisy snare sample back to the model and ask it to reconstruct it with a bit more noise. This process is repeated until the snare is essentially gone and only white noise remains. The model is then taught to do the same for a wide variety of sounds. Once it becomes an expert at predicting how noise is added to an input signal, the process can simply be reversed, removing a little noise at each step; fed white noise, the model generates a snare drum sample this way.
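Reversing the process is then a loop: start from pure white noise and ask the model, step by step, what noise to remove. Here is a sketch of that sampling loop, reusing the `betas` / `alphas_bar` schedule and the hypothetical noise-predicting `model` from the previous snippet:

```python
import torch

@torch.no_grad()
def generate(model, length=16384, steps=1000):
    """Turn white noise into a new sample by removing a little noise at every step."""
    x = torch.randn(1, 1, length)                      # pure static: no snare in sight yet
    alphas = 1.0 - betas
    for t in reversed(range(steps)):
        eps_hat = model(x, torch.tensor([t]))          # model's guess at the remaining noise
        coef = betas[t] / (1.0 - alphas_bar[t]).sqrt()
        x = (x - coef * eps_hat) / alphas[t].sqrt()    # peel one step of noise away
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)  # keep the process stochastic
    return x                                           # ideally: a brand-new snare hit
```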

Due to the probabilistic nature of this process, it yields some incredible abilities, particularly in simulating creativity.

Let's continue with the snare drum example. Imagine the model has been trained on thousands of individual snare samples. You might expect it to take in some white noise and transform it into one of the exact samples it was trained on. But that is not how the model learns. Because of the wide range of sounds it has been exposed to, it learns to create sounds that are roughly similar to the snare samples in its training set but not identical to any of them. This is how entirely new sounds are generated, and why these models can seem to show sparks of creative genius.

To illustrate this, let's use the sketch below.

Imagine that all possible sounds, from guitar strums to dog barks to white noise, can be plotted on a two-dimensional plane, shown as the black rectangle in the image above. Within this space there is a region that represents the sound of a snare drum hit; because snare hits share similar timbre and transient characteristics, they cluster together to some extent. Each blue dot represents a snare drum sample the model was trained on. Each red dot represents the fully noised version of one of those samples, corresponding to an un-noised blue dot.

Essentially, our model has learned to extract points from the "non-circle" area and bring them into the "circle" area. So, if we take a new green point from the "non-circle" area (e.g., random noise) that does not correspond to any of the blue points and ask the model to bring it into the "circle" area, it will bring it to a new location within the "circle" area. This is how the model generates "new" samples that have some similarities to all the other samples within the area but also have some new unknown features.

This concept can be applied to any type of sound, including complete songs. It is an incredible innovation that opens up countless new creative possibilities. However, it is important to understand that these models cannot replicate human creativity beyond the range of their training. As shown in the image above, while our conceptual model can take in any type of sound, it can only generate snare drum samples similar to those it was trained on. The same holds for any audio diffusion model. Training on a diverse dataset is therefore crucial, so that the known region (such as the snare drum area) is broad and varied enough that the model does more than simply replicate its training data.

All of this means that no model can remove the human element from the sounds and music we create. Therefore, when approaching these new technologies, we should view them as tools to enhance the creativity of artists rather than replace them.

Applications of Diffusion Models

These models will not magically create new genres or explore unknown sonic landscapes the way humans can. With this understanding, we should see these generative models not as replacements for human creativity but as tools that can enhance it. Here are several ways to harness this technology creatively:

  • Unleashing creativity through curation: Searching through sample packs for a desired sound is a common part of the production process. These models can effectively act as "infinite sample packs" for artists to curate sounds from, spurring creativity.
  • Sound transfer: Just as diffusion models can transform random noise into recognizable audio, they can also "transfer" one sound into another. For example, if we feed a kick drum sample instead of white noise into the snare drum model from earlier, it will start morphing the kick into a snare sound. In this way, the characteristics of multiple different sounds can be combined into something truly unique (a code sketch after this list illustrates the idea).
  • Sound variability (humanization): When a human plays an instrument live, such as striking the hi-hat on a drum kit, every hit varies slightly. Many virtual instruments try to simulate this variation in different ways, but they often sound artificial and lack character. Audio diffusion allows individual sounds to vary endlessly, adding a humanizing element to audio samples. For example, a drum sequencing program could use audio diffusion to subtly vary each hit's timbre, velocity, attack, and more, making an otherwise robotic pattern sound more human.
  • Sound design adjustments: In the same vein as humanization, this concept can be applied to sound design to make subtle adjustments to a sound. Perhaps you love a recording of a door slamming but want it to have more impact or creak. A diffusion model can make slight changes to the sample, adding new characteristics while preserving most of its original qualities. It adds, removes, or alters the spectral content of the sound at a more fundamental level than an equalizer or filter can.
  • Melody generation: Similar to browsing sample packs, audio diffusion models can also generate melodies, inspiring creative ideas.
  • Stereo effects: There are several different mixing techniques to add stereo width to mono (monophonic) sounds. However, they often introduce unwanted coloration, delay, or phase shifts. Audio diffusion can produce sounds that are almost identical to mono sounds but have enough differences in content to expand the stereo width while avoiding many unnecessary artifacts.
  • Super-resolution: Audio diffusion models can enhance the resolution and quality of audio recordings, making them clearer and more nuanced. This is particularly useful in audio restoration or processing low-quality recordings.
  • Restoration: Diffusion models can be used to fill in missing or damaged parts of audio signals, restoring them to their original or improved state. This is valuable for repairing damaged audio recordings, filling in potentially missing audio parts, or adding transitional effects between audio clips.
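As promised above, here is a rough sketch of the sound-transfer idea, reusing the schedule and the hypothetical noise-predicting model from the earlier snippets. Instead of starting from pure white noise, an existing sample (say, a kick drum) is pushed only partway into the noise and then denoised by a snare-trained model, so the result keeps some of the kick's character while being pulled toward snare territory. A small `strength` gives the subtle variation described under humanization; a large one gives a fuller transfer.

```python
import torch

@torch.no_grad()
def transfer(model, source, strength=0.6):
    """Partially noise `source`, then denoise it with a model trained on other sounds."""
    start_t = int(strength * (len(betas) - 1))         # how deep into the noise to push the source
    eps = torch.randn_like(source)
    x = alphas_bar[start_t].sqrt() * source + (1.0 - alphas_bar[start_t]).sqrt() * eps
    alphas = 1.0 - betas
    for t in reversed(range(start_t + 1)):
        eps_hat = model(x, torch.tensor([t]))
        x = (x - betas[t] / (1.0 - alphas_bar[t]).sqrt() * eps_hat) / alphas[t].sqrt()
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)
    return x                                           # e.g. a kick drum nudged toward a snare
```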

Conclusion

There is no doubt that these new generative AI models are incredible technological advancements, whether people view them positively or negatively. There are still many aspects of diffusion models that can be optimized in terms of speed, diversity, and quality, but we have discussed the fundamental principles that govern the capabilities of these models. This knowledge allows us to gain a deeper understanding of what it truly means for these models to generate "new sounds".

From a broader perspective, people care not only about the music itself but also about the human element in its creation. Ask yourself: if you heard a recording of a lightning-fast, technically demanding guitar solo, would you be deeply impressed? It depends. If it was programmed by a producer with a virtual MIDI instrument, you might not be, and you might even dislike the sound. But if you knew a real guitarist played that solo on a real guitar, and you had even witnessed the performance, you would be blown away by their skill and precision. We are drawn to the dexterity of the playing, the thoughts and emotions behind the lyrics, and the intent behind every decision made in creating a song.

While these incredible advancements have sparked some fears among artists and producers, AI can never remove the human elements from the sounds and music we create. Therefore, we should view these new technologies as tools to enhance the creativity of artists rather than replace them.