Hugging Face Acquires XetHub to Support Hosting of Large AI Models

2024-08-09

Hugging Face announced today that it has acquired XetHub, a Seattle-based collaborative development platform founded by former Apple researchers to help machine learning teams work efficiently with large datasets and models. The terms of the deal were not disclosed, but CEO Clem Delangue told Forbes that it is the largest acquisition the company has made to date.

Hugging Face plans to integrate XetHub's technology into its own platform and upgrade the Hub's storage backend, making it easier for developers to host very large models and datasets. "The XetHub team will help us unlock the next five years of growth of HF datasets and models by switching to our own, better version of LFS as the storage backend for the Hub's repositories," wrote Julien Chaumond, the company's CTO, in a blog post.

What does XetHub bring to Hugging Face?

Founded in 2021 by Yucheng Low, Ajit Banerjee, and Rajat Arya, who previously worked on ML infrastructure at Apple, XetHub provides a platform for enterprises to explore, understand, and work with large models and datasets. It offers Git-like version control for terabyte-scale repositories, letting teams track changes, collaborate, and keep their ML workflows reproducible. Over the past three years, XetHub has attracted customers including Tableau and Gather AI by handling the scaling demands created by ever-growing tools, files, and artifacts, improving storage and transfer through techniques such as content-defined chunking, deduplication, instant repository mounting, and file streaming.

With the acquisition, XetHub will no longer operate as a standalone platform; its data and model handling capabilities will be folded into the Hugging Face Hub to provide an optimized storage and versioning backend for sharing models and datasets.

The Hub currently uses Git LFS (Large File Storage) as its storage backend. It was adopted in 2020, but Chaumond said the company has long known the system would eventually fall short as the number and size of large files in the AI ecosystem keep growing. Git LFS was a good starting point, but the company needs an upgrade, and XetHub will provide it.

Today the XetHub platform supports individual files larger than 1TB and repositories well beyond 100TB in total, far exceeding Git LFS, which caps files at 5GB and repositories at around 10GB. This will let the HF Hub host larger datasets, models, and files than is currently possible.

XetHub's other storage and transfer features add to the appeal. Its content-defined chunking and deduplication mean that when a dataset is updated, users upload only the chunks that actually changed (for example, newly added rows) rather than re-uploading the entire file set, which can take a significant amount of time. The same applies to model repositories. A rough sketch of the general technique appears below.

"With the field moving toward trillion-parameter models in the coming months (thanks to Maxime Labonne's new BigLlama-3.1-1T?), we hope this new technology will unlock new scale for the community and for enterprises," Chaumond noted. He added that the two teams will work closely together to ship features that help teams collaborate on and track the evolution of their assets on the HF Hub.
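To make the idea concrete, here is a minimal, self-contained Python sketch of content-defined chunking plus deduplication in general; it is not XetHub's actual implementation, and the names (`chunk`, `DedupStore`) and hash parameters are illustrative. A rolling hash over a sliding window picks chunk boundaries from the content itself, so a small edit only changes the chunks around it, and a content-addressed store skips any chunk it has already seen.

```python
"""
Toy sketch of content-defined chunking (CDC) + deduplication.
Not XetHub's implementation; real systems use tuned rolling hashes
(Rabin/gear) and persistent, networked storage.
"""
import hashlib
import random

WINDOW = 48                      # sliding-window size for the rolling hash
BASE, MOD = 257, (1 << 31) - 1   # polynomial rolling-hash parameters
POW = pow(BASE, WINDOW, MOD)     # BASE**WINDOW mod MOD, to drop old bytes
MASK = (1 << 13) - 1             # boundary when hash & MASK == 0 (~8 KiB avg)
MIN_CHUNK, MAX_CHUNK = 2 * 1024, 64 * 1024


def chunk(data: bytes):
    """Split data at content-defined boundaries chosen by a rolling hash."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = (h * BASE + byte) % MOD
        if i >= WINDOW:
            # remove the byte that just left the sliding window
            h = (h - data[i - WINDOW] * POW) % MOD
        length = i - start + 1
        if (length >= MIN_CHUNK and (h & MASK) == 0) or length >= MAX_CHUNK:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


class DedupStore:
    """Toy content-addressed store: identical chunks are kept only once."""

    def __init__(self):
        self.blobs = {}  # sha256 digest -> chunk bytes

    def upload(self, data: bytes) -> int:
        """Store data, returning how many bytes actually had to be sent."""
        sent = 0
        for c in chunk(data):
            digest = hashlib.sha256(c).hexdigest()
            if digest not in self.blobs:   # unseen chunk -> "upload" it
                self.blobs[digest] = c
                sent += len(c)
        return sent


if __name__ == "__main__":
    random.seed(0)
    v1 = bytes(random.getrandbits(8) for _ in range(500_000))    # version 1
    v2 = v1[:250_000] + b"a few inserted rows\n" + v1[250_000:]  # small edit
    store = DedupStore()
    print("first upload: ", store.upload(v1), "bytes")
    print("second upload:", store.upload(v2), "bytes (only changed chunks)")
```

Because boundaries depend only on the bytes in the sliding window, they re-align shortly after an edit; with fixed-size chunks, by contrast, an insertion would shift every later boundary and force nearly the whole file to be re-uploaded.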
The Hugging Face Hub currently hosts 1.3 million models, 450,000 datasets, and 680,000 Spaces, totaling 12PB of data stored in LFS. It will be interesting to see how those numbers grow once an upgraded storage backend that supports larger models and datasets is in place. A timeline for the integration and the other supporting features has not yet been announced.