OpenAI Claims "Impossible" to Train AI Models Without Copyrighted Material

2024-01-12

The statement appears intended as a proactive defense against the multiple copyright infringement claims OpenAI faces, but those suing the company are unlikely to see it that way.


OpenAI, the company behind the most popular generative AI models on the internet, has taken an interesting stance on the increasing number of copyright infringement claims. In written evidence submitted to the UK Parliament's House of Lords Communications and Digital Committee, OpenAI stated that it is not possible to train tools like ChatGPT without using copyrighted materials.





The Communications and Digital Committee investigates how public policy intersects with the media, digital communications, and creative industries in the UK. Once an investigation is complete, the committee publishes a report on its findings, which may then form the basis for wider policy changes by the UK government. In July 2023, the committee launched an inquiry to "review large language models and analyze what needs to be done in the next 1-3 years to ensure the UK can respond to the opportunities and risks they present." The inquiry inevitably turned its attention to OpenAI's ChatGPT and DALL-E.


In addition to sharing its views on the potential impact of large language models (LLMs) on society in the coming years, OpenAI also took the opportunity to defend its use of copyrighted materials in training ChatGPT. "Because copyright covers almost all types of human expression today—including blog posts, photos, forum posts, software code snippets, and government documents—it is not possible to train leading AI models without using copyrighted materials," the document stated. "Limiting training data to books and paintings from the public domain created over a century ago might make for an interesting experiment, but it would not provide AI systems that meet the needs of today's citizens."




Rather than treating the ubiquity of copyright as a sign that platforms like ChatGPT are not worth the intellectual property violations they may entail, OpenAI appears to be using it as a temporary shield. Multiple plaintiffs have accused OpenAI of training ChatGPT on their copyrighted works. The New York Times has also sued OpenAI for unauthorized reproduction of its content.


The Communications and Digital Committee is not a court. Nevertheless, the outcome of its inquiry could easily influence how government bodies in the UK and other Western countries approach generative AI. OpenAI is aware of this and is using the committee's inquiry to get ahead of copyright concerns as copyright lawsuits pile up in the US.


OpenAI also acknowledges that "there is work to be done to support and empower creators." It is reportedly working on allowing publishers to block GPTBot from crawling their website content and enabling photographers and other artists to exclude their images from future DALL-E training sets.