"AI2 Updates OLMo Model with Dolma 1.7 Dataset Integration"

2024-04-18

The Allen Institute for Artificial Intelligence (AI2) has announced an update to its open language model, releasing OLMo 1.7-7B, a model with 7 billion parameters. The update pairs a larger, more diverse version of the Dolma dataset with improvements to the training process.


OLMo was first released in February 2024 and positioned as a truly open, state-of-the-art large language model. The full release spans pre-training data, training code, model weights, and evaluation tools, giving researchers the complete pipeline to study and build on.

From Dolma 1.5 to 1.7

This update extends OLMo 1.7-7B's context length from the original 2048 tokens to 4096 tokens, and improvements to the training process and architecture bring a significant jump in performance. On the data side, AI2 has built Dolma 1.7, a 2.3-trillion-token corpus drawn from many sources, including Dolma CC, RefinedWeb, StarCoder, C4, Stack Exchange, OpenWebMath, Project Gutenberg, and Wikipedia, among others.


Compared with the previous Dolma 1.5, the new version draws on more diverse sources, with the aim of better handling tasks that require specialized knowledge, complex reasoning, and coding. Dolma 1.7 also applies more aggressive deduplication: for each document, the paragraph-level duplicate scores are combined into a length-normalized average, and any document whose average exceeds a threshold α is removed entirely.
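As a rough illustration of how such a document-level rule works (a minimal sketch, not Dolma's actual implementation: the binary per-paragraph duplicate check, the character-based length weighting, and the example threshold are all assumptions), the filter can be expressed as:

    def duplicate_fraction(paragraphs, is_duplicate):
        # Length-normalized average of paragraph-level duplicate scores:
        # the share of the document's characters that fall in paragraphs
        # flagged as duplicates.
        total_chars = sum(len(p) for p in paragraphs)
        if total_chars == 0:
            return 0.0
        dup_chars = sum(len(p) for p in paragraphs if is_duplicate(p))
        return dup_chars / total_chars

    def keep_document(paragraphs, is_duplicate, alpha=0.3):
        # Remove the entire document when its duplicate fraction exceeds alpha
        # (alpha here is an illustrative value, not Dolma's setting).
        return duplicate_fraction(paragraphs, is_duplicate) <= alpha

    # Toy usage: a set of previously seen paragraphs stands in for the
    # corpus-level duplicate detector.
    seen = {"lorem ipsum dolor sit amet"}
    doc = ["lorem ipsum dolor sit amet",
           "a fresh paragraph of unique text that appears nowhere else in the corpus"]
    print(keep_document(doc, lambda p: p in seen))  # True: ~27% duplicated, under the threshold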

Dolma 1.7 also improves quality filtering. A FastText classifier is used to separate high-quality from low-quality text: the high-quality class comes from well-formatted sources that are useful for language model training, such as Wikipedia, small web RSS feeds, and Semantic Scholar, while the low-quality class consists mainly of adult-content and fake-news websites. The classifier was reportedly trained on roughly 25 GB of data.
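A hedged sketch of this kind of filter using the open-source fastText library follows; the training file, labels, hyperparameters, and decision threshold are placeholders rather than the actual Dolma 1.7 classifier or its data:

    import fasttext  # pip install fasttext

    # Hypothetical training file in fastText's supervised format, one example
    # per line, e.g.
    #   __label__hq Paris is the capital and most populous city of France ...
    #   __label__lq CLICK HERE for the hottest celebrity gossip ...
    model = fasttext.train_supervised(
        input="quality_train.txt",  # placeholder path
        wordNgrams=2,               # word bigrams help on short web snippets
        epoch=5,
        lr=0.1,
    )

    # Score a candidate document and keep it only if the classifier is
    # reasonably confident it looks like high-quality text.
    text = "The mitochondrion is an organelle found in most eukaryotic cells."
    labels, probs = model.predict(text, k=1)
    keep = labels[0] == "__label__hq" and probs[0] > 0.5
    print(labels[0], float(probs[0]), keep)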

On the training side, OLMo 1.7 adopts a new two-stage curriculum. In the first stage, the model is trained from scratch to establish a stable base. In the second stage, training continues on a filtered subset of Dolma 1.7 for an additional 500 billion tokens, while the learning rate is gradually decayed to zero.
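One common way to implement that gradual decay is a linear schedule down to zero; the sketch below uses PyTorch's LambdaLR, with a stand-in model, optimizer settings, and step budget that are placeholders rather than OLMo's published hyperparameters:

    import torch
    from torch.optim.lr_scheduler import LambdaLR

    model = torch.nn.Linear(128, 128)      # stand-in for the language model
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

    stage2_steps = 1_000                   # placeholder for the stage-2 budget
    # Anneal the learning rate from its stage-1 value down to 0 over stage 2.
    scheduler = LambdaLR(optimizer,
                         lr_lambda=lambda step: max(0.0, 1.0 - step / stage2_steps))

    for step in range(stage2_steps):
        # ... forward/backward pass on the filtered Dolma 1.7 subset goes here ...
        optimizer.step()
        scheduler.step()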

AI2 reports that with these updates, OLMo 1.7-7B now surpasses Llama 2-7B on MMLU and outperforms Llama 2-13B on GSM8K.

Notably, the updated OLMo model is licensed under Apache 2.0, while Dolma 1.7 is released under ODC-BY. Both are available on the Hugging Face platform for researchers and developers to use.
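For reference, a minimal way to try the model from the Hub with the transformers library is sketched below; the repository ID reflects AI2's naming at the time of release and may require the ai2-olmo package or trust_remote_code, depending on your transformers version:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "allenai/OLMo-1.7-7B"  # check the Hugging Face Hub for the current repo name
    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

    prompt = "Language models are"
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=32, do_sample=False)
    print(tokenizer.decode(output[0], skip_special_tokens=True))

Dolma itself is published as the allenai/dolma dataset on the same platform.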