"OpenAI to Pay Millions in Annual Licensing Fees for News to Train Large Models"

2024-01-05

As news publishers sign agreements that let AI companies train models on their reporting, it is becoming clear that companies like OpenAI are willing to pay for copyrighted material.

According to reports, OpenAI has offered between $1 million and $5 million per year to license the news articles it needs to train its AI models. This is the first glimpse of how much AI companies plan to pay for licensed material. The news follows recent reports that Apple is seeking partnerships with media companies to use their content for AI training and is offering at least $50 million over several years for the data.

These figures appear roughly in line with some earlier, non-AI licensing deals. When Meta launched the Facebook News tab, it reportedly offered up to $3 million per year to license news stories, headlines, and previews. It is unclear, however, how the total payments will compare with some of the larger figures seen elsewhere. For example, Google announced in 2020 that it would invest $1 billion in partnerships with news organizations. Under pressure from a new law, Google also recently agreed to pay Canadian publishers a total of $100 million per year in exchange for linking to their articles.

From what is known about training data, today's large language models are trained primarily on information found on the internet. While some AI developers do not disclose how they obtain training data, information about which datasets or web crawlers they use can usually be found. The price of a training dataset varies with its provider, size, and content. Some datasets, such as those from LAION, are open source and completely free, and are used to train models like Stable Diffusion. AI developers also often run their own web crawlers to gather data from the internet for training. (They must also employ people to review, label, and sometimes clean the training data, which adds significantly to operating costs.)

However, this practice now faces significant challenges. First, some companies have blocked OpenAI's GPTBot crawler from accessing their sites, including The New York Times and Vox Media, the parent company of The Verge. Second, some organizations argue that training on their data constitutes copyright infringement. The New York Times, for one, has sued OpenAI and Microsoft for copyright infringement, claiming that ChatGPT and Microsoft's Copilot can generate output that is almost identical to its work.

Partnerships let AI companies avoid these problems, and such deals have become more common over the past year. Publishers such as Axel Springer (parent company of Politico and Business Insider) and The Associated Press have signed agreements with OpenAI to license their stories for training models like GPT-4 and to develop technologies for news gathering.

OpenAI and Apple are not the only AI developers looking to work with news organizations. According to reports, Google has demonstrated an AI tool called Genesis to executives from The New York Times, The Wall Street Journal, and The Washington Post. The tool is said to be capable of generating news stories from facts supplied to it.