Hugging Face Launches Largest Open-Source Synthetic Dataset: Cosmopedia
Hugging Face has released Cosmopedia v0.1, the largest open synthetic dataset to date, consisting of over 30 million samples generated by Mixtral-8x7B-Instruct. It includes content such as textbooks, blog articles, stories, and WikiHow-style articles, totaling roughly 25 billion tokens.
The dataset aims to map the world knowledge contained in web datasets such as RefinedWeb and RedPajama into synthetic text. Each sample records the prompt, the generated content, the seed data source, the token length, the text format (e.g., textbook, blog article), and the target audience. The dataset card documents how the splits were constructed and distributed, giving researchers a clear picture of the dataset's structure and potential applications.
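For readers who want to examine these fields directly, a rough sketch using the datasets library follows. The field names (prompt, text, seed_data, format, audience, token_length) simply mirror the attributes described above and are assumptions, not a guaranteed schema.

```python
from datasets import load_dataset

# Stream a single record from one split to inspect its metadata without
# downloading the full dataset. The field names below are assumed from
# the attributes described in the article and may differ in the release.
ds = load_dataset("HuggingFaceTB/cosmopedia", "stanford",
                  split="train", streaming=True)
sample = next(iter(ds))

for field in ("prompt", "text", "seed_data", "format", "audience", "token_length"):
    print(field, "->", str(sample.get(field))[:80])
```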
Inspired by the Phi-1.5 work, this initial version of Cosmopedia lays the foundation for research on synthetic data. Serving as a broad resource spanning diverse topics, it is intended to be refined and expanded in subsequent iterations.
The dataset is divided into eight splits, each derived from different seed samples. These splits include web_samples_v1 and web_samples_v2, which account for approximately 75% of the dataset and originate from an internal web dataset similar to RefinedWeb.
The Stanford split uses course syllabi scraped from stanford.edu, while the stories split generates narratives seeded with samples from UltraChat and OpenHermes2.5. The WikiHow, OpenStax, KhanAcademy, and AutoMathText splits build their prompts from their respective source materials.
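For orientation, the available splits can be listed programmatically. The sketch below assumes each split is exposed as a separate configuration of a repository with the id HuggingFaceTB/cosmopedia.

```python
from datasets import get_dataset_config_names

# List the dataset's configurations; each split described above is
# expected to appear as its own config (exact names are assumed, e.g.
# "web_samples_v1", "stanford", "stories").
configs = get_dataset_config_names("HuggingFaceTB/cosmopedia")
print(configs)
```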
Individual splits can be loaded directly with the datasets library, as sketched below. For users who want a smaller footprint, a reduced subset called Cosmopedia-100k is also available. In addition, a 1B-parameter model, Cosmo-1B, has been trained on Cosmopedia, demonstrating that the dataset can support end-to-end model training.
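A minimal loading sketch, assuming the repository ids HuggingFaceTB/cosmopedia and HuggingFaceTB/cosmopedia-100k and configurations named after the splits:

```python
from datasets import load_dataset

# Load a single split of the full dataset (config names are assumed to
# match the split names above, e.g. "stanford", "stories", "wikihow").
stanford = load_dataset("HuggingFaceTB/cosmopedia", "stanford", split="train")

# For lighter experimentation, load the smaller subset instead
# (repository id assumed to be "HuggingFaceTB/cosmopedia-100k").
cosmopedia_100k = load_dataset("HuggingFaceTB/cosmopedia-100k", split="train")

print(len(stanford), len(cosmopedia_100k))
```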
The dataset creation process relies on topic clustering of the web samples, iterative prompt refinement, and steps to address contamination. Prompt styles and target audiences are varied to maximize diversity and to sharply reduce duplicate content; a minimal sketch of the clustering idea follows.
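The clustering step can be pictured roughly as follows. This is a sketch using sentence embeddings and MiniBatchKMeans, not the exact tooling used to build Cosmopedia; the embedding model name, sample texts, and cluster count are illustrative choices.

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import MiniBatchKMeans

# Hypothetical web seed texts; in practice these would come from the
# web_samples_v1 / web_samples_v2 splits described above.
web_samples = [
    "An introduction to photosynthesis in plants ...",
    "Quarterly earnings analysis for technology stocks ...",
    "How to train a neural network with gradient descent ...",
    "A beginner's guide to sourdough baking ...",
]

# Embed the samples and group them by topic; cluster labels can then be
# used to steer prompt style and target audience per topic.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = encoder.encode(web_samples)

clusterer = MiniBatchKMeans(n_clusters=2, random_state=0, n_init=3)
labels = clusterer.fit_predict(embeddings)

for text, label in zip(web_samples, labels):
    print(f"cluster {label}: {text[:45]}")
```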