Generative AI: The Internet's Consuming Force
Last summer, Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Yarin Gal, Nicolas Papernot, and Ross Anderson published "The Curse of Recursion: Training on Generated Data Makes Models Forget," a paper arguing that artificial intelligence models will, in the near future, poison themselves with their own output. The warning was largely theoretical at the time, but there is growing evidence that the technology really does have this problem.
The problem is known as "model collapse": when AI models are trained on data synthesized by other AI models, they gradually lose the information they originally learned from human-generated data and replace it with degraded approximations of their own output.
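To get an intuition for the mechanism, here is a minimal, hypothetical sketch in Python (not taken from the paper): a very simple model is fitted to data, the next "generation" is trained only on samples drawn from that fitted model, and information about the original distribution erodes over successive generations.

```python
# Toy illustration of recursive training on generated data (illustrative only,
# not the authors' experiment): each generation fits a Gaussian to samples
# produced by the previous generation's fitted model.
import numpy as np

rng = np.random.default_rng(0)

# "Generation 0": real data from a standard normal distribution.
data = rng.normal(loc=0.0, scale=1.0, size=1000)

for generation in range(10):
    # Fit a simple model (mean and standard deviation) to the current data.
    mu, sigma = data.mean(), data.std()
    print(f"gen {generation}: mean={mu:+.3f}, std={sigma:.3f}")
    # The next generation trains only on data sampled from the fitted model,
    # so estimation errors compound and the tails of the distribution are lost.
    data = rng.normal(loc=mu, scale=sigma, size=1000)
```

Run over many generations, the fitted distribution drifts and narrows, a small-scale analogue of a language model forgetting rare knowledge when its training data is increasingly machine-generated.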
Last month, a Twitter user going by Winterbourne posted a screenshot showing that Grok, the large language model chatbot built by Elon Musk's AI company xAI, had apparently parroted an OpenAI response.
When Winterbourne asked Grok to patch malicious software, Grok replied that it couldn't, because the request violated OpenAI's use case policy, a policy that should have nothing to do with an xAI product.
The user wrote in the post, "Grok is essentially just plagiarizing OpenAI's codebase." An xAI engineer, who had previously worked at rival firms OpenAI and Google DeepMind, denied that explanation.
"When we first noticed it, we were shocked," he responded. The employee may not have anticipated this, but the CEO of the company, Musk, certainly did.
The technology has not only intensified competition among tech companies; it has also reignited old rivalries, such as the one between OpenAI and Musk, who was an early backer of the organization.
Feuds aside, AI-generated error messages have also found their way onto online shopping platforms. Users of the e-commerce site Amazon noticed OpenAI error messages appearing in product listings.
Some listings were literally titled "Sorry, but I can't fulfill this request. It violates OpenAI's use policy." After media outlets reported on them, the listings were removed. Still, plenty of similar posts can be found on Threads and LinkedIn.
Many argue that Shumailov and his colleagues' research overlooks a crucial point. Daniel Sack, a managing director and partner at BCG X, Boston Consulting Group's technology build and design division, is one of them.
He wrote on LinkedIn, "Most of the data used to train future models will not just be copies of raw material, but completely novel and unprecedented content."
His position is understandable: technologists rarely find it easy to admit flaws in the products they are building, or helping others build. Silicon Valley has repeatedly hesitated to acknowledge the potential threats posed by its own technology.
Even Sack's own division, BCG X, partners with OpenAI, which suggests that, at least for now, few of the technology's champions are disinterested parties; unresolved ethical questions run through every level. All of this suggests that touting the technology's power to solve humanity's problems should not be the top priority right now.
Generative AI programs rely on an immeasurable amount of data scraped from every corner of the internet, and the internet is already inundated with AI-generated spam. However much VCs and the developers of these models deny it, the problem exists, and it will only get worse as hundreds of millions of people use these tools every day.
Catherine Flick, Professor of Ethics and Games Technology at Staffordshire University, commented on the Grok incident: "It does suggest that these models are not reliable in the long run if they learn from LLM-generated data: if we cannot determine which data was machine-generated, the quality of the outputs will continue to decline."
Most importantly, humans cannot reliably distinguish AI-generated content from human-generated content. Nor can the language models themselves tell whether the AI-generated text they ingest reflects reality, which means future models may accumulate even more errors than today's.