Tech Giants Using YouTube Subtitles for AI Training Without Permission

2024-07-17

Apple, Nvidia, and Anthropic have used YouTube subtitles to train AI models, a practice that violates YouTube's policies. According to reports from Proof News and Wired, these companies used a dataset of subtitles taken from many thousands of YouTube videos without permission from the creators.


The investigation found that the three companies used the "YouTube subtitles" dataset, which contains subtitles from 173,536 YouTube videos across 48,000 channels. The videos come from educational channels such as Khan Academy and MIT, news outlets such as The Wall Street Journal, and top creators such as MrBeast and Marques Brownlee.


Prominent YouTubers react to data misuse


Marques Brownlee commented on the issue on X, stating, "Apple collected data for AI from other companies. One of the companies collected a large amount of data from YouTube videos, including mine." While Apple may not have scraped the data itself, Brownlee noted that the problem is likely to persist.


The "YouTube subtitles" dataset was developed by EleutherAI and released in 2020. It contains 5.7GB of data, including subtitles from deleted YouTube videos.


YouTube's terms of service prohibit accessing videos through "automated means." The presence of subtitles from deleted videos compounds the problem and raises questions about privacy and copyright infringement.


Salesforce, also implicated in the investigation, admitted to using the dataset. The company said that the Pile dataset mentioned in the research paper was used to train a model in 2021 for academic and research purposes, and that the dataset is publicly available and released under a license.


The unauthorized use of YouTube content nevertheless remains controversial. In April 2024, YouTube CEO Neal Mohan stated that using YouTube videos, subtitles, or clips for AI training is a "clear violation" of the platform's policies. Yet according to The New York Times, OpenAI used one million hours of YouTube videos to train its GPT-4 model.


AI companies using internet content face legal disputes


Since the release of ChatGPT, AI companies have faced growing scrutiny over the unauthorized use of internet content. Content creators have filed lawsuits against Stability AI and Midjourney, accusing them of scraping copyrighted works without permission. Google, the owner of YouTube, is facing similar class-action lawsuits and has argued that such legal actions threaten the foundation of generative AI.


In an interview with The Wall Street Journal, Mira Murati, the Chief Technology Officer of OpenAI, did not say in detail whether the company used videos from social media platforms to train its Sora video-generation model. Mustafa Suleyman, the CEO of Microsoft AI, has said that content on the open web has been treated as fair use since the 1990s under what he calls a "social contract."