GPT-4 and other artificial intelligence systems can now learn and use human language, but they do so from language input of astronomical proportions - far more than children receive while learning to understand and speak. The best AI systems are trained on text containing trillions of words, while children hear only a few million words per year.
Due to this vast data gap, researchers have been skeptical that recent advances in AI can tell us much about human learning and development. An ideal test of the connection would be to train an AI model not on massive amounts of internet data, but only on the input a single child receives. What could the model then learn?
A group of researchers at New York University conducted exactly this experiment. They trained a multimodal AI system on recordings from a head-mounted camera that captured a single child's experiences, through the child's own eyes and ears, from the age of 6 months to their second birthday. They then investigated whether the model could learn the words and concepts present in the child's everyday experience.
Their findings, published in the journal Science, indicate that the neural network could in fact learn a substantial number of words and concepts from these limited snippets of the child's experience. Although the video captured only about 1% of the child's waking hours, that was enough for genuine word learning.
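As a rough back-of-the-envelope check of that 1% figure, the arithmetic works out as follows (the 12 waking hours per day is our assumption, not a number reported in the study):

```python
# Rough sanity check of the ~1% figure; 12 waking hours/day is an assumed value,
# not one reported in the study.
recording_hours = 61                    # "more than 60 hours" of head-camera footage
months = 25 - 6                         # recording window: 6 to 25 months of age
waking_hours = months * 30.4 * 12       # roughly 6,900 waking hours in that window
print(f"{recording_hours / waking_hours:.1%}")  # ~0.9%, i.e. roughly 1%
```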
"We demonstrate for the first time that a neural network trained on this kind of developmental reality input from a single child can learn to associate words with visual referents," said Wai Keen Vong, a research scientist at the NYU Center for Data Science and the first author of the paper.
"Our findings suggest that recent algorithmic advances, combined with a child's natural experiences, have the potential to reshape our understanding of early language and concept acquisition.
"By using AI models to study the real language learning problems faced by children, we can address classic debates about what conditions are necessary for children to learn words - whether they require specific language biases, innate knowledge, or associative learning abilities to start using words," added Brenden Lake, an assistant professor in the NYU Center for Data Science and the Department of Psychology, and a senior author of the paper. "It seems we can learn more than we typically imagine through learning."
Vong, Lake, and their NYU colleagues Wentao Wang and Emin Orhan analyzed more than 60 hours of footage captured by a lightweight head-mounted camera, recorded weekly from 6 to 25 months of age, documenting the child's learning process in first-person video.
These videos contained about a quarter of a million word instances (i.e., the number of words spoken, many of them repeated), each linked to video frames of what the child was seeing when those words were spoken, and spanned a wide range of everyday activities across development, such as mealtimes, book reading, and play.
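To make that pairing concrete, here is a minimal, hypothetical sketch in Python of how one transcribed utterance might be linked to the frames the child was seeing at that moment; the field names and values are invented for illustration and do not reflect the dataset's actual format:

```python
# Hypothetical record pairing one transcribed utterance of child-directed speech with
# the video frames recorded around the moment it was spoken (illustrative only).
from dataclasses import dataclass

@dataclass
class UtteranceFramePair:
    utterance: str          # transcribed child-directed speech
    frame_paths: list[str]  # frames sampled around the utterance's timestamp
    timestamp_s: float      # when the utterance occurred in the recording

example = UtteranceFramePair(
    utterance="look at the ball",
    frame_paths=["frame_010231.jpg", "frame_010232.jpg"],
    timestamp_s=6139.2,
)
```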
The NYU researchers then trained a multimodal neural network with two separate modules: a vision encoder that took in individual video frames, and a language encoder that took in the transcribed child-directed speech.
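The sketch below, written in PyTorch, illustrates the general shape of such a two-encoder setup: a small vision encoder and a small language encoder project frames and utterances into a shared embedding space, and a contrastive-style loss of the kind commonly used to align two encoders pulls co-occurring frame/utterance pairs together while pushing mismatched pairs apart. It is a simplified illustration under our own assumptions (tiny encoders, toy input sizes), not the authors' actual architecture or training code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisionEncoder(nn.Module):
    """Maps a video frame (3 x 64 x 64 here, for simplicity) to a shared embedding."""
    def __init__(self, embed_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(64, embed_dim)

    def forward(self, frames):                 # frames: (batch, 3, 64, 64)
        feats = self.conv(frames).flatten(1)   # (batch, 64)
        return F.normalize(self.proj(feats), dim=-1)

class LanguageEncoder(nn.Module):
    """Maps a tokenized utterance (word indices, 0 = padding) to the same space."""
    def __init__(self, vocab_size=10_000, embed_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, tokens):                 # tokens: (batch, seq_len)
        mask = (tokens != 0).unsqueeze(-1).float()
        pooled = (self.embed(tokens) * mask).sum(1) / mask.sum(1).clamp(min=1)
        return F.normalize(self.proj(pooled), dim=-1)

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Align matching frame/utterance pairs; push apart mismatched ones."""
    logits = img_emb @ txt_emb.t() / temperature          # (batch, batch) similarities
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

# Smoke test on random data: 4 frames paired with 4 tokenized utterances.
frames = torch.randn(4, 3, 64, 64)
tokens = torch.randint(1, 10_000, (4, 6))
loss = contrastive_loss(VisionEncoder()(frames), LanguageEncoder()(tokens))
print(loss.item())
```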