"OpenAI Transcribes Extensive YouTube Videos to Train GPT-4"

2024-04-07

Earlier this week, The Wall Street Journal reported that AI companies are struggling to collect high-quality training data. Today, The New York Times detailed how some companies are dealing with the problem. Unsurprisingly, their methods fall into the gray area of AI copyright law.

According to the report, OpenAI urgently needed training data and developed its Whisper audio transcription model to get it. The company transcribed more than one million hours of YouTube videos to train its state-of-the-art language model, GPT-4. It acknowledges the legal concerns surrounding this approach but considers it fair use. OpenAI President Greg Brockman personally took part in collecting the videos.

OpenAI spokesperson Lindsay Held said the company curates "unique" datasets for each model to "help them understand the world" and to keep its research globally competitive. Held added that the company uses "various sources, including publicly available data and non-public data provided by partners," and is exploring the generation of synthetic data.

The New York Times report says the company exhausted its supply of useful data in 2021 and, having run through its other resources, discussed transcribing YouTube videos, podcasts, and audiobooks. By then, it had already trained its models on data that included computer code from GitHub, chess game databases, and student homework content from Quizlet.

Google spokesperson Matt Bryant said the company had seen "unverified reports" about OpenAI's activities and added, "Our robots.txt file and terms of service prohibit unauthorized scraping or downloading of YouTube content," which echoes Google's terms of use. YouTube CEO Neal Mohan made similar comments this week about the possibility that OpenAI used YouTube to train its Sora video generation model.
Bryant said Google takes "technical and legal measures" to prevent unauthorized use "when we have clear legal or technical grounds." According to sources cited by The New York Times, Google has also collected transcripts from YouTube. Bryant said the company trains its models on a portion of YouTube content, in accordance with its agreements with YouTube creators.

The New York Times report also says that Google's legal department asked its privacy team to adjust policy language to broaden what the company could do with consumer data, such as data in Google Docs and other office tools. The new policy was reportedly scheduled for release on July 1 on purpose, to take advantage of the distraction of the Independence Day weekend.

Meta has also run up against the limited availability of high-quality training data. In recordings heard by The New York Times, its AI team discussed its unauthorized use of copyrighted works while trying to catch up with OpenAI. After "almost browsing all available English books, papers, poems, and news articles on the internet," the company has apparently considered measures such as purchasing book licenses or even acquiring large publishers. Meta is also restricted in how it can use consumer data because of the privacy changes it made after the Cambridge Analytica scandal.

Google, OpenAI, and the broader AI training field are struggling with a scarcity of training data, since models perform better the more data they are trained on. The Wall Street Journal wrote this week that by 2028, companies' demand for data may outpace the rate at which new content is created. Possible solutions mentioned by The Wall Street Journal on Monday include training models on "synthetic" data created by the models themselves, or so-called "curriculum learning," which feeds models high-quality data in a deliberate order in the hope that they can form "smarter connections between concepts" with less information.
However, neither method has been proven yet. The companies' other option is to use whatever data they can find, licensed or not, but judging by the several lawsuits filed in the past year or so, that approach is highly controversial.