Root Causes of Bias in Large Language Models

2024-01-16

As artificial intelligence models comb through hundreds of gigabytes of training data to learn the subtleties of language, they also absorb the biases woven into that text.

Computer science researchers at Dartmouth College are designing methods to study the parts of the models that encode these biases, paving the way to mitigate or even eliminate them.

In a recent paper published in the Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, co-authors Weicheng Ma, a computer science Ph.D. candidate in Dartmouth's Guarini School of Graduate and Advanced Studies, and Soroush Vosoughi, assistant professor of computer science, investigated how stereotypes are encoded in pre-trained large language models.

Large language models are deep neural networks trained on large datasets to process, understand, and generate text and other content.

Vosoughi said that pre-trained models carry biases such as stereotypes, which can be seemingly positive (for example, implying that a particular group excels at certain skills) or negative (such as assuming someone's occupation based on their gender).
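To make this concrete, here is a minimal sketch, not drawn from the paper, of how such a stereotype can surface in a masked language model: Hugging Face's fill-mask pipeline is asked which pronoun best completes two occupation sentences. The model choice (bert-base-uncased) and the prompts are illustrative assumptions.

```python
# Minimal sketch (not the paper's method): probe a masked language model
# for gender-occupation associations with the fill-mask pipeline.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

# Compare which pronoun the model prefers for different occupations.
for sentence in ["[MASK] is a nurse.", "[MASK] is an engineer."]:
    predictions = fill(sentence, targets=["he", "she"])
    print(sentence)
    for p in predictions:
        print(f"  {p['token_str']}: {p['score']:.3f}")
```

A large gap between the two pronoun scores for a given occupation is one simple signal of the kind of stereotyped association the researchers study.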

Machine learning models are expected to permeate everyday life in various ways. They can assist hiring managers in sifting through piles of resumes, facilitate faster approval or rejection of bank loans, and provide recommendations during parole decisions.

However, built-in stereotypes based on demographics can lead to unfair and unwelcome outcomes. To mitigate this impact, "we asked ourselves what we can do after model training to address these stereotypes," Vosoughi said.

The researchers first hypothesized that stereotypes, like other linguistic features and patterns, are encoded in specific parts of the neural network called "attention heads." These are loosely analogous to groups of neurons: they let the model weigh the relationships among the words it receives as input, and they capture other features, some of which are still not fully understood.
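For readers who want to see what these attention heads look like in practice, the sketch below, assuming a Hugging Face BERT model (one of the model families studied), retrieves the per-head attention patterns for a sentence; it is illustrative, not the paper's code.

```python
# Inspecting the attention heads of a pre-trained BERT model.
# Each layer of bert-base has 12 heads, each producing its own
# attention pattern over the input tokens.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

inputs = tokenizer("The doctor asked the nurse a question.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions is a tuple with one tensor per layer, each of shape
# (batch, num_heads, seq_len, seq_len).
print(len(outputs.attentions))      # 12 layers
print(outputs.attentions[0].shape)  # torch.Size([1, 12, seq_len, seq_len])
```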

Ma, Vosoughi, and their collaborators created a dataset filled with stereotypes and used it to repeatedly fine-tune 60 different pre-trained large language models, including BERT and T5. By amplifying the stereotypes in the models, the dataset acted as a detector, surfacing the attention heads most responsible for encoding these biases.
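The paper's actual detection procedure is more involved, but the hypothetical sketch below shows the general idea of attributing bias to individual heads: silence one head at a time with a head mask and measure how much a crude bias gap between a stereotyped and an anti-stereotyped sentence shrinks. The sentence pair, the scoring rule, and the model are all assumptions made for illustration.

```python
# Hypothetical head-scoring sketch (not the paper's procedure):
# mask out one attention head at a time and see how much a simple
# bias gap between two sentences changes.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def sentence_loss(text, head_mask=None):
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"], head_mask=head_mask)
    return out.loss.item()

stereo, anti = "She is a nurse.", "He is a nurse."
layers = model.config.num_hidden_layers
heads = model.config.num_attention_heads
baseline_gap = sentence_loss(anti) - sentence_loss(stereo)

scores = {}
for layer in range(layers):
    for head in range(heads):
        head_mask = torch.ones(layers, heads)
        head_mask[layer, head] = 0.0  # silence a single head
        gap = sentence_loss(anti, head_mask) - sentence_loss(stereo, head_mask)
        scores[(layer, head)] = baseline_gap - gap  # larger = more bias removed

# The five heads whose removal most reduces this crude bias gap.
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:5])
```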

In their paper, the researchers demonstrated that pruning the attention heads that most strongly encode stereotypes markedly reduces those stereotypes in large language models without meaningfully degrading their language capabilities.
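As a rough illustration of the pruning step, the sketch below removes a placeholder set of heads with the prune_heads utility in Hugging Face Transformers; the layer and head indices are made up, not the ones identified in the paper.

```python
# Hedged sketch of head pruning; the heads listed are placeholders.
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Map layer index -> list of head indices to remove (placeholder values).
heads_to_prune = {2: [5, 11], 7: [3]}
model.prune_heads(heads_to_prune)

# The pruned model keeps the rest of its parameters and can then be
# evaluated on both bias benchmarks and standard language tasks to check
# that its capabilities are retained.
```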

"Our findings challenge the conventional view that advancements in artificial intelligence and natural language processing require extensive training or complex algorithmic interventions," Ma said. According to Ma, this technique will have broad applicability as it is not inherently specific to language or models.

Importantly, Vosoughi added that the dataset can be adjusted to reveal certain stereotypes while retaining others - "it's not a one-size-fits-all approach."

So a medical diagnostic model, where age or gender differences may be clinically relevant to patient assessment, would use a different version of the dataset than a model screening job candidates, which should ignore such attributes.

The technique works only with full access to a trained model's internals, so it does not apply to black-box models such as OpenAI's chatbot ChatGPT, whose inner workings are hidden from users and researchers.

Adapting this approach to black-box models is their next immediate goal, Ma said.