The Double-Edged Sword of AI: Balancing Copyright and Memory

2024-01-04

The legal conflict between The New York Times and OpenAI over AI models and copyright has pushed terms like "memorization" and "plagiarism" to the forefront. One closely related term is "approximate retrieval," and it may be all that OpenAI needs to win the case.


Subbarao Kambhampati, a professor at Arizona State University, discussed on a podcast ChatGPT's inability to reproduce its source information exactly, saying: "It's like an AI toothpaste that contains all human wisdom and knowledge, and whatever you need can easily be squeezed out in a convenient form."


LLMs are like toothpaste


LLM "does not completely reproduce existing information, but includes them in the answers," he added.


The core of approximate retrieval is that LLMs do not follow the traditional database model of exact matching and precision. Instead, they operate like n-gram models, introducing uncertainty into the retrieval process. A prompt is not a key into a structured database but a cue from which the model generates the next token based on context.
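

A minimal Python sketch of this contrast, with a toy vocabulary and made-up probabilities standing in for a real model's learned distribution:

```python
import random

# Exact retrieval: a key either hits a stored record or it doesn't.
database = {"nyt/2023/article-123": "Full article text ..."}
print(database.get("nyt/2023/article-123"))  # exact match or None

# Approximate retrieval: the prompt is only context; the model samples the
# next token from a probability distribution, so the output can land
# anywhere between a paraphrase and a verbatim reproduction.
def next_token(context: str) -> str:
    # Toy stand-in for an LLM's distribution over its vocabulary.
    vocab_probs = {"the": 0.4, "a": 0.3, "court": 0.2, "lawsuit": 0.1}
    tokens, weights = zip(*vocab_probs.items())
    return random.choices(tokens, weights=weights)[0]

context = "The New York Times sued OpenAI over"
for _ in range(5):
    context += " " + next_token(context)
print(context)  # different runs can yield different continuations
```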


Kambhampati explained in a recent LinkedIn post that this distinction becomes crucial for the legal questions surrounding The New York Times lawsuit. LLMs do not promise exact retrieval, blurring the line between flexibility and unpredictability. They occupy a space that is neither purely a database nor a traditional information retrieval (IR) engine, forcing a reexamination of what they are.


Weighing better AI models against proper attribution


The lawsuit revolves around the subtle issue of memorization. Although LLMs cannot guarantee verbatim reproduction, their wide context windows and large network capacity open the door to memorization, raising concerns about unintentional plagiarism. The lawsuit alleges that, when prompted repeatedly, the models can generate sentences identical to published articles.


In an attempt to give LLMs "thinking" capabilities, developers have fine-tuned them on planning problems, which reduces the task to memory-based retrieval and diminishes the models' autonomy. This approach also drives up context length, making the memorization problem worse. Prompting an LLM in a loop raises further concerns about the method's reliability.


The upshot is that commercial LLM creators like OpenAI often tailor their statements to the occasion.


In legal settings, they emphasize the models' inability to perform exact retrieval as a defense against copyright infringement. When promoting LLMs as search applications, they present that same memorization as a feature.


What is the ultimate goal?


In fact, there is no foolproof way to control this dual behavior. Attempting to curb memorization may hurt an LLM's ability to substitute for a search engine, leaving a perplexing dilemma: make AI models better, or care about copyright.


For example, a user on X pointed out: "Especially in news generation, there is a dilemma: if the LLM is too creative, it produces fake news, or at least inaccurate news; if it is not, copyright issues arise. There are problems either way."


Another user pointed out that AI image generators based on diffusion models, such as Midjourney, Stable Diffusion, and DALL-E, face the same situation. They are not designed to generate identical images, yet they ultimately produce very similar outputs. The better these models get, the more closely their outputs match the user's prompt; they are optimized for fidelity, not for avoiding copyrighted material.


The emergence of Retrieval-Augmented Generation (RAG) introduces an external IR component that is combined with the LLM, bringing a more structured approach to information retrieval. The hope is to strike a balance between the spontaneity of LLMs and the orderliness of traditional search, reducing hallucinations in these models.
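

A minimal sketch of the RAG pattern, assuming a toy bag-of-words index in place of a real vector database and omitting the actual LLM call; all names and documents here are illustrative:

```python
import math
from collections import Counter

# Toy embedding: bag-of-words counts (real systems use learned dense vectors).
def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# The "vector database": source documents stored alongside their embeddings.
corpus = [
    "OpenAI faces a copyright lawsuit from The New York Times.",
    "Diffusion models generate images from text prompts.",
]
index = [(doc, embed(doc)) for doc in corpus]

def retrieve(query: str, k: int = 1) -> list[str]:
    q = embed(query)
    ranked = sorted(index, key=lambda pair: cosine(q, pair[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

# Retrieved passages are spliced verbatim into the prompt -- which is why
# RAG can surface exact source text even when the base model alone would not.
query = "What is the lawsuit about?"
prompt = "Context: " + " ".join(retrieve(query)) + "\nQuestion: " + query
print(prompt)
```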


However, Kambhampati explains that this increases the chance that an LLM such as GPT-4 retrieves exact text from sources like The New York Times, which are effectively attached to the model as vector databases. That is precisely what RAG is designed to do, but it undercuts the creators' position on copyright.


"Because of the way n-gram models work, there is never a 100% guarantee that a stored record (whether it's a program or an article from The New York Times) will be retrieved without any modifications. So why did The New York Times sue OpenAI?" Kambhampati asks. The key to the case lies in whether the underlying training dataset actually includes articles from The New York Times, which it apparently does, and whether OpenAI's model truly affects the publication's revenue.


If LLM makers try to suppress memorization, they will inevitably find that the models' ability to pass as search engines, already of questionable authenticity, degrades further, Kambhampati concludes.