DeBERTa: Decoding-Enhanced BERT with Disentangled Attention

2023-11-29

Recently, BERT has become the go-to tool for many natural language processing tasks. Its ability to process and understand text and to build highly informative word embeddings has led to state-of-the-art performance across a wide range of tasks.

BERT is built on the attention mechanism from the Transformer architecture, which remains a key component of most large language models today.

However, new ideas and methods continue to emerge in machine learning. In 2021, a highly innovative technique called "disentangled attention" introduced an enhanced version of the attention mechanism. This concept gave rise to DeBERTa, a model that combines disentangled attention with an enhanced mask decoder. Despite introducing only a pair of new architectural principles, DeBERTa shows significant improvements over other large models on top NLP benchmarks.

In this article, we will refer to the original DeBERTa paper and cover all the necessary details to understand how it works.

1. Disentangled Attention

In the original Transformer block, each token is represented by a single vector that encodes both its content and its position, obtained by element-wise summation of the content and position embeddings. The drawback of this approach is that information may be lost: the model cannot tell whether the word itself or its position contributes more to a given embedding component.

DeBERTa introduces a novel mechanism in which this information is stored in two separate vectors: one for content and one for position. The attention calculation is also modified to explicitly consider the relationship between the content and position of tokens. For example, when the words "research" and "paper" appear near each other, their interdependence is much stronger than when they appear in distant parts of the text. This example shows why it is necessary to model the interaction between content and position.

Introducing disentangled attention requires modifying the calculation of attention scores. Fortunately, this turns out to be straightforward: the cross-attention score between two embeddings, each composed of two vectors, decomposes into the sum of four pairwise products between their sub-vectors:
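In the paper's notation, where Hi is the content vector of token i and Pi its position vector (a relative position embedding), the cross-attention score between tokens i and j decomposes as:

    {Hi, Pi} × {Hj, Pj}ᵀ = Hi × Hjᵀ + Hi × Pjᵀ + Pi × Hjᵀ + Pi × Pjᵀ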

This approach can be generalized to matrix form. The decomposition yields four different types of matrices, each representing a combination of content and position information:

  • Content-Content matrix;
  • Content-Position matrix;
  • Position-Content matrix;
  • Position-Position matrix.

The Position-Position matrix does not store any valuable information: since relative position embeddings are used, it contains no details about word content. This is why this term is discarded in disentangled attention.

For the remaining three terms, the computation of the final output attention matrix is similar to the original Transformer.

Although the computation process may look similar, there are two subtle differences to consider.

In DeBERTa, the multiplication between the query content matrix Qc and the key position matrix Krᵀ, as well as between the key content matrix Kc and the query position matrix Qrᵀ, is denoted by * rather than the normal matrix multiplication symbol x. This is not accidental: these matrix pairs are multiplied in a slightly different way in order to take the relative positions of words into account.

  • According to the normal matrix multiplication rule, if C = A x B, then the element C[i][j] is calculated as the dot product of the i-th row of A and the j-th column of B.
  • In the special case of DeBERTa, if C = A * B, then C[i][j] is calculated as the dot product of the i-th row of A and the δ(i, j)-th column of B, where δ is the relative distance function between i and j, defined as follows:
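    δ(i, j) = 0            if i − j ≤ −k
    δ(i, j) = 2k − 1       if i − j ≥ k
    δ(i, j) = i − j + k    otherwise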

k can be seen as a hyperparameter controlling the maximum possible relative distance between indices i and j. In DeBERTa, k is set to 512. To better understand this formula, let's plot a heatmap visualizing the relative distances between different indices i and j (k = 6).

For example, if k = 6, i = 15, j = 13, then the relative distance δ between i and j is equal to 8. To obtain the content-position score for indices i = 15 and j = 13, in the multiplication of query content Qc and key position Kr matrices, the 15th row of Qc should be multiplied with the 8th column of Krᵀ.

However, for the position-content scores, the algorithm works slightly differently: the relative distance used in the matrix multiplication is not δ(i, j) but δ(j, i). As the authors of the paper explain: "This is because for a given position i, position-content computes the attention weight of key content at j relative to the query position at i, so the relative distance is δ(j, i)".
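To make this indexing concrete, here is a minimal NumPy sketch of the two position-dependent terms. The function and variable names (relative_distance, content_position_scores, position_content_scores, Qc, Kc, Qr, Kr) are illustrative and do not come from the official implementation:

    import numpy as np

    def relative_distance(i, j, k):
        # delta(i, j): relative distance clipped to the range [0, 2k - 1]
        if i - j <= -k:
            return 0
        if i - j >= k:
            return 2 * k - 1
        return i - j + k

    print(relative_distance(15, 13, 6))  # 8, as in the example above

    def content_position_scores(Qc, Kr, k):
        # Qc: (seq_len, d) content queries, Kr: (2k, d) relative position keys.
        # score[i][j] = dot(Qc[i], Kr[delta(i, j)])  -- the "*" multiplication
        seq_len = Qc.shape[0]
        scores = np.zeros((seq_len, seq_len))
        for i in range(seq_len):
            for j in range(seq_len):
                scores[i, j] = Qc[i] @ Kr[relative_distance(i, j, k)]
        return scores

    def position_content_scores(Kc, Qr, k):
        # score[i][j] = dot(Kc[j], Qr[delta(j, i)])  -- note the swapped arguments
        seq_len = Kc.shape[0]
        scores = np.zeros((seq_len, seq_len))
        for i in range(seq_len):
            for j in range(seq_len):
                scores[i, j] = Kc[j] @ Qr[relative_distance(j, i, k)]
        return scores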

Before applying the softmax transformation, the attention scores are divided by a constant √(3d) to ensure more stable training. This scaling factor differs from the one used in the original Transformer (√d). This √3-fold difference is due to the larger magnitude caused by the sum of three matrices in the DeBERTa attention mechanism (instead of a single matrix in Transformer).
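Putting the pieces together, the disentangled attention can be written (in the notation used above) as:

    Ã[i][j] = Qc[i] × Kc[j]ᵀ + Qc[i] × Kr[δ(i, j)]ᵀ + Kc[j] × Qr[δ(j, i)]ᵀ
    output  = softmax(Ã / √(3d)) × Vc

where Vc is the content value matrix and d is the dimension of the hidden states.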

2. Enhanced Mask Decoder

Disentangled attention only considers content and relative position. However, it does not take absolute position information into account, which may actually play an important role in the final predictions. The authors of the DeBERTa paper give a specific example: the sentence "a new store opened beside the new mall", where the words "store" and "mall" are masked for prediction. Although the masked words have similar meanings and local contexts (both are preceded by the adjective "new"), they play different syntactic roles in the sentence, which is not captured by disentangled attention alone. Since a language contains many such cases, incorporating absolute position into the model is crucial.

In BERT, absolute positions are incorporated in the input embeddings. In DeBERTa, they are introduced after all the Transformer layers but before the softmax layer used for masked token prediction. Experiments show that capturing relative positions in all Transformer layers and only then introducing absolute positions improves the model's performance. According to the researchers, doing it the other way around (as BERT does) may hinder the model from learning enough relative positional information.

Architecture

According to the paper, the Enhanced Mask Decoder (EMD) has two inputs:

  • H - the hidden state from the previous Transformer layer.
  • I - any information required for decoding (e.g., hidden state H, absolute position embeddings, or output from the previous EMD layer).

In general, a model can have n EMD blocks. If so, they are constructed according to the following rules:

The output of each EMD layer becomes the input I of the next EMD layer, and the output of the last EMD layer is fed into the language modeling head. In DeBERTa, the number of EMD layers is set to n = 2, and the absolute position embeddings are used as I for the first EMD layer.
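As a rough illustration of these rules, here is a hypothetical sketch of how the EMD stack could be wired. The names emd_layers and lm_head, as well as the exact way each layer combines I and H internally, are simplifications rather than the paper's implementation:

    def enhanced_mask_decoder(H, abs_pos_embeddings, emd_layers, lm_head):
        # H: hidden states from the last Transformer encoder layer.
        # The first EMD layer receives the absolute position embeddings as I;
        # every subsequent layer receives the previous EMD layer's output as I.
        I = abs_pos_embeddings
        for emd_layer in emd_layers:   # n = 2 in DeBERTa
            I = emd_layer(H=H, I=I)
        # The output of the last EMD layer goes to the language modeling head.
        return lm_head(I)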

Another technique commonly used in NLP is weight sharing between different layers to reduce model complexity (e.g., ALBERT). This idea is also implemented in the EMD blocks of DeBERTa.

When I = H and n = 1, the EMD is equivalent to the decoder layer in BERT.

Ablation Experiments

Experiments show that each of the components introduced in DeBERTa (content-position attention, position-content attention, and the enhanced mask decoder) improves performance: removing any of them leads to a drop in metrics.

Scale-Invariant Fine-Tuning

In addition, the authors propose a new adversarial training algorithm called scale-invariant fine-tuning (SiFT) to improve the model's generalization. The idea is to introduce small perturbations into the input sequence, making the model more robust to adversarial examples. In DeBERTa, the perturbation is applied to the normalized input word embeddings. This technique works best for larger fine-tuned DeBERTa models.
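A minimal PyTorch-style sketch of this idea, assuming a single adversarial step and a simple gradient-based perturbation (the actual SiFT procedure in the paper differs in its details):

    import torch
    import torch.nn.functional as F

    def sift_step(word_embeddings, loss_fn, epsilon=1e-3):
        # 1. Normalize the word embeddings so that the size of the perturbation
        #    does not depend on the embedding magnitude ("scale-invariant").
        normed = F.layer_norm(word_embeddings, word_embeddings.shape[-1:])
        # 2. Compute a small perturbation in the direction that increases the loss.
        delta = torch.zeros_like(normed, requires_grad=True)
        loss_fn(normed + delta).backward()
        perturbation = epsilon * delta.grad / (delta.grad.norm(dim=-1, keepdim=True) + 1e-8)
        # 3. The model is then trained on the perturbed, normalized embeddings,
        #    which makes it more robust to such perturbations.
        return normed + perturbation.detach()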

DeBERTa Variants

The DeBERTa paper proposes three models of increasing size: base, large, and 1.5B.

Data

For pretraining, the base and large versions of DeBERTa use a combination of the following datasets:

  • English Wikipedia + BookCorpus (16 GB)
  • OpenWebText (public Reddit content: 38 GB)
  • Stories (31 GB)

After deduplication, the final dataset size is reduced to 78 GB. For DeBERTa 1.5B, the authors used more than double the data (160 GB) and an impressive vocabulary size of 128K.

For comparison, other large models such as RoBERTa, XLNet, and ELECTRA were pretrained on 160 GB of data, yet DeBERTa shows comparable or better performance on a variety of NLP tasks.

In terms of training, DeBERTa was pretrained for one million steps with 2K samples per step.

We have covered the main aspects of the DeBERTa architecture. With disentangled attention and the enhanced mask decoder, DeBERTa has become a highly popular choice in NLP pipelines for many data scientists and a winning factor in many Kaggle competitions. Another remarkable fact about DeBERTa is that it was one of the first NLP models to surpass human performance on the SuperGLUE benchmark. This alone is enough to suggest that DeBERTa will hold a prominent place in the history of NLP for a long time.