A small AI research team from Carnegie Mellon University, Stanford University, Harvard University, and Princeton University in the United States found that excessive pre-training of large language models can make them harder to fine-tune. In their paper, published on the arXiv preprint server, the group compared the effects of different amounts of pre-training on the same large language model.
In recent years, as AI researchers have sought to make their products more capable, many have assumed that the more training a model receives, the better it becomes. In this new study, the research team found evidence that there is a critical point beyond which additional training yields not just diminishing returns but worse results.
The researchers reached this conclusion by comparing two checkpoints of the LLM OLMo-1B: one pre-trained on 2.3 trillion tokens and another on 3 trillion tokens. They then fine-tuned both and compared them on multiple benchmarks, such as ARC and AlpacaEval. The checkpoint trained on more tokens performed worse during testing, by up to 3%.
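In rough outline, this kind of comparison can be reproduced with an off-the-shelf evaluation harness. The sketch below is not the authors' code: it uses EleutherAI's lm-evaluation-harness to score two checkpoints on ARC-Easy, and the checkpoint identifiers are placeholders rather than the exact revisions used in the study.

```python
# Hedged sketch, not the paper's evaluation code: compare two pre-training
# checkpoints of the same model on a benchmark using lm-evaluation-harness
# (pip install lm-eval). Checkpoint ids below are placeholders; the metric
# key ("acc,none") matches lm-eval 0.4.x and may differ in other versions.
import lm_eval

checkpoints = {
    "2.3T tokens": "pretrained=allenai/OLMo-1B",     # placeholder id
    "3.0T tokens": "pretrained=allenai/OLMo-1B-hf",  # placeholder id
}

for label, model_args in checkpoints.items():
    results = lm_eval.simple_evaluate(
        model="hf",              # HuggingFace transformers backend
        model_args=model_args,
        tasks=["arc_easy"],      # one of the benchmarks named above
    )
    acc = results["results"]["arc_easy"]["acc,none"]
    print(f"{label}: ARC-Easy accuracy = {acc:.3f}")
```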
Surprised by their findings, they conducted additional tests, which yielded similar results, indicating that beyond a certain point, more training began to make the models less "intelligent." The research team termed this "catastrophic overtraining" and attributed it to what they described as "progressive sensitivity."
They further suggested that as the number of pre-training tokens increased, the models became more fragile, meaning that fine-tuning, which can be viewed as a form of added noise, began to reverse the gains seen before the tipping point.
To test this explanation, they injected Gaussian noise into the parameters of some models and found that it produced the same kind of performance decline observed earlier. They named the point of no return the "inflection point": beyond it, they suggested, any further training reduces the stability of the model, making it harder to adjust in ways useful for a desired application.
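As a rough illustration of that noise-injection test, and not the authors' experiment, the toy sketch below adds Gaussian noise of increasing scale to the parameters of a small PyTorch network and measures how the evaluation loss degrades. The tiny network and random data are stand-ins for a pretrained LLM and its evaluation set.

```python
# Hedged sketch of parameter-noise sensitivity, not the paper's code.
# A tiny network and random data stand in for a pretrained LLM and its
# evaluation set; we add N(0, sigma^2) noise to every parameter and
# track how the loss grows with sigma.
import copy

import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-ins for a pretrained model and a held-out evaluation batch.
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
x = torch.randn(256, 32)
y = torch.randint(0, 10, (256,))
loss_fn = nn.CrossEntropyLoss()


def eval_loss(m: nn.Module) -> float:
    """Cross-entropy loss of model m on the fixed evaluation batch."""
    with torch.no_grad():
        return loss_fn(m(x), y).item()


def loss_after_noise(m: nn.Module, sigma: float) -> float:
    """Copy m, add N(0, sigma^2) noise to each parameter, re-evaluate."""
    noisy = copy.deepcopy(m)
    with torch.no_grad():
        for p in noisy.parameters():
            p.add_(torch.randn_like(p) * sigma)
    return eval_loss(noisy)


print(f"clean loss: {eval_loss(model):.4f}")
for sigma in (0.01, 0.05, 0.1, 0.2):
    print(f"sigma={sigma:4.2f}  noisy loss: {loss_after_noise(model, sigma):.4f}")
```

In the authors' framing, the steeper this loss-versus-noise curve is for a given checkpoint, the more sensitive that checkpoint is, and their claim is that this sensitivity grows with the number of pre-training tokens.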
The researchers concluded by advising that future developers of LLMs might need to estimate how much training is sufficient — or find alternative methods to allow for additional training beyond the inflection point.