The "Strawberry" Problem: Overcoming AI Limitations

2024-10-14

Large language models like ChatGPT and Claude have been widely adopted around the world, and many people now worry that artificial intelligence might take over their jobs. It is therefore ironic that almost all LLM-based systems perform poorly on a very simple task: counting the number of "r"s in the word "strawberry." They not only miscount the "r"s in "strawberry," but also struggle with similar examples, such as the number of "m"s in "mammal" or the number of "p"s in "hippopotamus." In this article, I will analyze the reasons behind these failures and propose a straightforward workaround.

Large language models are powerful AI systems trained on vast amounts of text, enabling them to understand and generate human-like language. They excel in tasks such as answering questions, translating languages, summarizing content, and even generating creative writing by predicting and constructing coherent responses. Designed to recognize patterns in text, large language models can handle various language-related tasks with impressive accuracy.

Despite these capabilities, their failure to count the "r"s in "strawberry" is a reminder that large language models do not possess human-like "thinking" abilities. They process the information we give them in ways that differ fundamentally from how humans do.

Almost all high-performance large language models today are built on the Transformer architecture. This architecture does not take raw text as input. Instead, it relies on a process called tokenization: the text is split into units called tokens, and each token is mapped to a numeric ID that the model can process. Some tokens represent entire words (e.g., "monkey"), while others represent parts of words (e.g., "mon" and "key"). By breaking everything down into tokens, the model can more easily learn to predict the next token in a sequence.

Large language models do not memorize words letter by letter; instead, they learn how tokens can be combined in different ways, which makes them adept at guessing what comes next. For example, the model might see the word "hippopotamus" as the tokens "hip," "pop," "o," and "tamus," without ever "seeing" that the word is spelled with the letters "h," "i," "p," "p," "o," "p," "o," "t," "a," "m," "u," "s."
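To see this in practice, here is a minimal sketch using the open-source tiktoken library (a tokenizer used by several OpenAI models) to print how a few words are split into tokens. Note that the splits shown above are only illustrative; every model family uses its own tokenizer, so the actual pieces will vary.

```python
# Minimal sketch: inspect how a BPE tokenizer splits words into tokens.
# Requires `pip install tiktoken`; exact splits depend on the tokenizer.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by several OpenAI models

for word in ["monkey", "strawberry", "hippopotamus"]:
    token_ids = enc.encode(word)                      # word -> list of integer IDs
    pieces = [enc.decode([tid]) for tid in token_ids]  # IDs -> the text each one covers
    print(f"{word!r} -> ids {token_ids} -> pieces {pieces}")
```

Whatever the exact output, the model never receives the individual characters, only the IDs of these larger pieces.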

A model architecture that operates directly on individual characters, without tokenization, might avoid this issue, but character-level input produces much longer sequences, which makes it far more computationally expensive with today's Transformer architectures.

Furthermore, let's examine how large language models generate output text: they predict the next token based on the previous input and output tokens. While this approach is well suited to generating context-aware, human-like text, it does not work well for simple tasks like counting letters. When asked how many "r"s are in the word "strawberry," a large language model does not count anything; it predicts an answer that looks plausible given the structure of the input sentence and the patterns in its training data.
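To make this concrete, below is a minimal sketch of greedy next-token generation using the Hugging Face transformers library, with GPT-2 standing in for a much larger production model. The point is that each output token is simply the statistically most likely continuation, not the result of any counting procedure.

```python
# Minimal sketch of greedy next-token generation (GPT-2 as a small stand-in model).
# Requires `pip install torch transformers`.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = 'How many "r"s are in the word "strawberry"? Answer:'
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(8):
        logits = model(input_ids).logits      # scores for every candidate next token
        next_id = logits[0, -1].argmax()      # pick the single most likely token
        input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=-1)

print(tokenizer.decode(input_ids[0]))  # a plausible continuation, not a computed count
```

The loop never inspects the letters of "strawberry"; it only extends the sequence with whichever token is most probable at each step.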

A Simple Solution

Although large language models may lack the ability to "think" or perform logical reasoning, they excel at producing structured text, and computer code is a prime example. If we ask ChatGPT to use Python to count the number of "r"s in "strawberry," it is likely to produce a program that gives the correct answer. So when a task requires counting, arithmetic, or other step-by-step logic, the surrounding software can include prompts that instruct the model to write and run code to handle the query instead of answering it directly.
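As a sketch of what this looks like, the snippet below is the kind of program an LLM can be asked to produce. Once the count is computed by the Python runtime rather than predicted token by token, the answer is exact.

```python
# Counting letters deterministically, the way an LLM can be prompted to do in code.
def count_letter(word: str, letter: str) -> int:
    """Return how many times `letter` occurs in `word` (case-insensitive)."""
    return word.lower().count(letter.lower())

for word, letter in [("strawberry", "r"), ("mammal", "m"), ("hippopotamus", "p")]:
    print(f'"{word}" contains {count_letter(word, letter)} "{letter}"s')
```

In a larger application, the same idea can be built into the system prompt, for example by instructing the model to respond with a short Python snippet whenever the user's question involves counting or arithmetic, and then executing that snippet in a sandbox. This is essentially the pattern behind the code-execution tools in modern chat assistants.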

Conclusion

A simple letter-counting experiment reveals a fundamental limitation of large language models like ChatGPT and Claude. While they demonstrate impressive abilities in generating human-like text, writing code, and answering a wide range of questions, these AI models still cannot "think" like humans. This experiment highlights their essential nature as pattern-matching prediction algorithms rather than "intelligent" entities capable of understanding or reasoning. However, knowing in advance which types of prompts yield better results can help mitigate this issue to some extent. As AI continues to integrate into our lives, recognizing these models' limitations is crucial for using them responsibly and maintaining realistic expectations.