"AI2 Updates OLMo Model with Dolma 1.7 Dataset Integration"

2024-04-18

The Allen Institute for Artificial Intelligence (AI2) has announced an update to its open language model, releasing OLMo 1.7-7B, a model with 7 billion parameters. The update pairs a larger, more diverse version of the Dolma dataset with improvements to the training process.


OLMo was first released in February 2024 and positioned as a truly open, state-of-the-art large language model. The full release spans pre-training data, training code, model weights, and evaluation tools, giving researchers the complete pipeline to study and build on.

From Dolma 1.5 to 1.7

This update extends OLMo 1.7-7B's context length from the original 2048 tokens to 4096 tokens, and improvements to the training process and architecture bring a significant jump in performance. On the data side, AI2 has built Dolma 1.7, a 2.3-trillion-token corpus drawn from many sources, including Dolma CC, RefinedWeb, StarCoder, C4, Stack Exchange, OpenWebMath, Project Gutenberg, and Wikipedia, among others.


Compared with the previous Dolma 1.5, the new version draws on more diverse sources, with the aim of better handling tasks that require specialized knowledge, complex reasoning, and coding. Dolma 1.7 also applies more aggressive deduplication: for each document, the paragraph-level duplicate scores are combined into a length-normalized average, and any document whose average exceeds a threshold α is removed entirely.
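As a rough illustration of how such a document-level rule works (a minimal sketch, not Dolma's actual implementation: the binary per-paragraph duplicate check, the character-based length weighting, and the example threshold are all assumptions), the filter can be expressed as:

    def duplicate_fraction(paragraphs, is_duplicate):
        # Length-normalized average of paragraph-level duplicate scores:
        # the share of the document's characters that fall in paragraphs
        # flagged as duplicates.
        total_chars = sum(len(p) for p in paragraphs)
        if total_chars == 0:
            return 0.0
        dup_chars = sum(len(p) for p in paragraphs if is_duplicate(p))
        return dup_chars / total_chars

    def keep_document(paragraphs, is_duplicate, alpha=0.3):
        # Remove the entire document when its duplicate fraction exceeds alpha
        # (alpha here is an illustrative value, not Dolma's setting).
        return duplicate_fraction(paragraphs, is_duplicate) <= alpha

    # Toy usage: a set of previously seen paragraphs stands in for the
    # corpus-level duplicate detector.
    seen = {"lorem ipsum dolor sit amet"}
    doc = ["lorem ipsum dolor sit amet",
           "a fresh paragraph of unique text that appears nowhere else in the corpus"]
    print(keep_document(doc, lambda p: p in seen))  # True: ~27% duplicated, under the threshold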

Dolma 1.7 also improves quality filtering. A FastText classifier is used to separate high-quality from low-quality text: the high-quality class comes from well-formatted sources that are useful for language model training, such as Wikipedia, small web RSS feeds, and Semantic Scholar, while the low-quality class consists mainly of adult-content and fake-news websites. The classifier was reportedly trained on roughly 25 GB of data.
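A hedged sketch of this kind of filter using the open-source fastText library follows; the training file, labels, hyperparameters, and decision threshold are placeholders rather than the actual Dolma 1.7 classifier or its data:

    import fasttext  # pip install fasttext

    # Hypothetical training file in fastText's supervised format, one example
    # per line, e.g.
    #   __label__hq Paris is the capital and most populous city of France ...
    #   __label__lq CLICK HERE for the hottest celebrity gossip ...
    model = fasttext.train_supervised(
        input="quality_train.txt",  # placeholder path
        wordNgrams=2,               # word bigrams help on short web snippets
        epoch=5,
        lr=0.1,
    )

    # Score a candidate document and keep it only if the classifier is
    # reasonably confident it looks like high-quality text.
    text = "The mitochondrion is an organelle found in most eukaryotic cells."
    labels, probs = model.predict(text, k=1)
    keep = labels[0] == "__label__hq" and probs[0] > 0.5
    print(labels[0], float(probs[0]), keep)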

On the training side, OLMo 1.7 adopts a new two-stage curriculum. In the first stage, the model is trained from scratch to establish a stable base. In the second stage, training continues on a filtered subset of Dolma 1.7 for an additional 500 billion tokens, while the learning rate is gradually decayed to zero.
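One common way to implement that gradual decay is a linear schedule down to zero; the sketch below uses PyTorch's LambdaLR, with a stand-in model, optimizer settings, and step budget that are placeholders rather than OLMo's published hyperparameters:

    import torch
    from torch.optim.lr_scheduler import LambdaLR

    model = torch.nn.Linear(128, 128)      # stand-in for the language model
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

    stage2_steps = 1_000                   # placeholder for the stage-2 budget
    # Anneal the learning rate from its stage-1 value down to 0 over stage 2.
    scheduler = LambdaLR(optimizer,
                         lr_lambda=lambda step: max(0.0, 1.0 - step / stage2_steps))

    for step in range(stage2_steps):
        # ... forward/backward pass on the filtered Dolma 1.7 subset goes here ...
        optimizer.step()
        scheduler.step()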

AI2 reports that with these updates, OLMo 1.7-7B now surpasses Llama 2-7B on MMLU and outperforms Llama 2-13B on GSM8K.

Notably, the updated OLMo model is licensed under Apache 2.0, while Dolma 1.7 is released under ODC-BY. Both are available on the Hugging Face platform for researchers and developers to use.
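For reference, a minimal way to try the model from the Hub with the transformers library is sketched below; the repository ID reflects AI2's naming at the time of release and may require the ai2-olmo package or trust_remote_code, depending on your transformers version:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "allenai/OLMo-1.7-7B"  # check the Hugging Face Hub for the current repo name
    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

    prompt = "Language models are"
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=32, do_sample=False)
    print(tokenizer.decode(output[0], skip_special_tokens=True))

Dolma itself is published as the allenai/dolma dataset on the same platform.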