Hugging Face Launches Largest Open-Source Synthetic Dataset: Cosmopedia
Hugging Face has released Cosmopedia v0.1, the largest open synthetic dataset to date, consisting of over 30 million samples generated by Mixtral-8x7B-Instruct. It includes content such as textbooks, blog articles, stories, and WikiHow-style articles, totaling roughly 25 billion tokens.
The dataset aims to map the world knowledge contained in web datasets such as RefinedWeb and RedPajama into synthetic text. Each sample records the prompt, the generated content, the seed data source, the token length, the text format (e.g., textbook, blog article), and the target audience. The dataset card documents how the splits were constructed and distributed, giving researchers a clear picture of the dataset's structure and potential applications.
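For readers who want to examine these fields directly, a rough sketch using the datasets library follows. The field names (prompt, text, seed_data, format, audience, token_length) simply mirror the attributes described above and are assumptions, not a guaranteed schema.

```python
from datasets import load_dataset

# Stream a single record from one split to inspect its metadata without
# downloading the full dataset. The field names below are assumed from
# the attributes described in the article and may differ in the release.
ds = load_dataset("HuggingFaceTB/cosmopedia", "stanford",
                  split="train", streaming=True)
sample = next(iter(ds))

for field in ("prompt", "text", "seed_data", "format", "audience", "token_length"):
    print(field, "->", str(sample.get(field))[:80])
```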
Inspired by the Phi-1.5 work, this initial version of Cosmopedia lays the foundation for research on synthetic data. Serving as a broad resource spanning diverse topics, it is intended to be refined and expanded in subsequent iterations.
The dataset is divided into eight splits, each derived from different seed samples. These splits include web_samples_v1 and web_samples_v2, which account for approximately 75% of the dataset and originate from an internal web dataset similar to RefinedWeb.
The Stanford split uses course syllabi scraped from stanford.edu, while the stories split generates narratives seeded with samples from UltraChat and OpenHermes2.5. The WikiHow, OpenStax, KhanAcademy, and AutoMathText splits build their prompts from their respective source materials.
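For orientation, the available splits can be listed programmatically. The sketch below assumes each split is exposed as a separate configuration of a repository with the id HuggingFaceTB/cosmopedia.

```python
from datasets import get_dataset_config_names

# List the dataset's configurations; each split described above is
# expected to appear as its own config (exact names are assumed, e.g.
# "web_samples_v1", "stanford", "stories").
configs = get_dataset_config_names("HuggingFaceTB/cosmopedia")
print(configs)
```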
Individual splits can be loaded directly with the datasets library, as sketched below. For users who want a smaller footprint, a reduced subset called Cosmopedia-100k is also available. In addition, a 1B-parameter model, Cosmo-1B, has been trained on Cosmopedia, demonstrating that the dataset can support end-to-end model training.
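A minimal loading sketch, assuming the repository ids HuggingFaceTB/cosmopedia and HuggingFaceTB/cosmopedia-100k and configurations named after the splits:

```python
from datasets import load_dataset

# Load a single split of the full dataset (config names are assumed to
# match the split names above, e.g. "stanford", "stories", "wikihow").
stanford = load_dataset("HuggingFaceTB/cosmopedia", "stanford", split="train")

# For lighter experimentation, load the smaller subset instead
# (repository id assumed to be "HuggingFaceTB/cosmopedia-100k").
cosmopedia_100k = load_dataset("HuggingFaceTB/cosmopedia-100k", split="train")

print(len(stanford), len(cosmopedia_100k))
```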
The dataset creation process relies on topic clustering of the web samples, iterative prompt refinement, and steps to address contamination. Prompt styles and target audiences are varied to maximize diversity and to sharply reduce duplicate content; a minimal sketch of the clustering idea follows.
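The clustering step can be pictured roughly as follows. This is a sketch using sentence embeddings and MiniBatchKMeans, not the exact tooling used to build Cosmopedia; the embedding model name, sample texts, and cluster count are illustrative choices.

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import MiniBatchKMeans

# Hypothetical web seed texts; in practice these would come from the
# web_samples_v1 / web_samples_v2 splits described above.
web_samples = [
    "An introduction to photosynthesis in plants ...",
    "Quarterly earnings analysis for technology stocks ...",
    "How to train a neural network with gradient descent ...",
    "A beginner's guide to sourdough baking ...",
]

# Embed the samples and group them by topic; cluster labels can then be
# used to steer prompt style and target audience per topic.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = encoder.encode(web_samples)

clusterer = MiniBatchKMeans(n_clusters=2, random_state=0, n_init=3)
labels = clusterer.fit_predict(embeddings)

for text, label in zip(web_samples, labels):
    print(f"cluster {label}: {text[:45]}")
```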