From baby talk to baby artificial intelligence

We ask a lot of ourselves as babies. Somehow, we must grow from sensory blobs into mobile, rational, attentive communicators in just a few years.

Here you are, a baby without a vocabulary, in a room cluttered with toys and stuffed animals. You pick up a Lincoln Log, and your caretaker tells you, “This is a ‘log’.” Eventually, you come to understand that “log” does not refer strictly to this particular brown plastic cylinder or to brown plastic cylinders in general, but to brown plastic cylinders that embody the characteristics of felled, denuded tree parts, which are also, of course, “logs”.

There has been much research and heated debate around how babies accomplish this. Some scientists have argued that most of our language acquisition can be explained by associative learning, as we relate sounds to sensibilia, much like dogs associate the sound of a bell with food. Others claim that there are features built into the human mind that have shaped the forms of all language and are crucial to our learning. Still others contend that toddlers build their understanding of new words on top of their understanding of other words.

This discourse advanced on a recent Sunday morning, as Tammy Kwan and Brenden Lake delivered blackberries from a bowl into the mouth of their one-year-old daughter, Luna. Luna was dressed in pink leggings and a pink tutu, with a silicone bib around her neck and a soft pink hat on her head. A lightweight GoPro-type camera was attached to the front of the hat.

“Babooga,” she said, pointing a round finger at the berries. Kwan gave her the rest, and Lake looked at the empty bowl, amused. “That’s like US$10,” he said. A light on the camera blinked.

For an hour each week over the past 11 months, Lake, a psychologist at New York University whose research focuses on human and artificial intelligence, has been attaching a camera to Luna and recording things from her point of view as she plays.

His goal is to use the videos to train a language model using the same sensory input that a toddler is exposed to – a LunaBot, so to speak. By doing so, he hopes to create better tools for understanding both AI and ourselves.

“We see this research as finally making that link, between those two areas of study,” Lake said. “You can finally put them in dialogue with each other.”

There are many roadblocks to using AI models to understand the human mind. The two are starkly different, after all. Modern language and multimodal models – such as OpenAI’s GPT-4 and Google’s Gemini – are assembled on neural networks with little built-in structure and have improved mostly as a result of increased computing power and larger training data sets. Meta’s most recent large language model, Llama 3, is trained on more than 10 trillion words; an average five-year-old is exposed to more like 300,000.

Such models can analyse pixels in images but are unable to taste cheese or berries or feel hunger, important kinds of learning experiences for children. Researchers can try their best to turn a child’s full sensory stream into code, but crucial aspects of their phenomenology will inevitably be missed.

“What we’re seeing is only the residue of an active learner,” said Michael Frank, a psychologist at Stanford University who for years has been trying to capture the human experience on camera. His lab is working with more than 25 children around the country, including Luna, to record their experiences at home and in social settings.

Humans are also not mere data receptacles, as neural nets are, but intentional animals. Everything we see, every object we touch, every word we hear couples with the beliefs and desires we have in the moment.

“There is a deep relationship between what you’re trying to learn and the data that come in,” said Linda Smith, a psychologist at Indiana University. “These models just predict. They take whatever is put into them and make the next best step.”

While you might be able to emulate human intentionality by structuring training data – something Smith’s lab has been attempting to do recently – the most competent AI models, and the companies that make them, have long been geared toward efficiently processing more data, not making more sense out of less.

There is also a more conceptual issue, which stems from the fact that the abilities of AI systems can seem quite human, even though they arise in nonhuman ways. Recently, dubious claims of consciousness, general intelligence and sentience have emerged from industry labs at Google and Microsoft after the release of new models.

In March, Claude 3, the newest model from an AI research startup called Anthropic, stirred up debate when, after analysing a random sentence about pizza toppings hidden in a long list of unrelated documents, it expressed the suspicion that it was being tested. Such reports often smell like marketing ploys rather than objective scientific projects, but they highlight our eagerness to attribute scientific meaning to AI.

But human minds are converging with virtual ones in other ways. Tom Griffiths, a cognitive scientist at Princeton University, has suggested that, in describing the limitations of human intelligence, and building models that have similar limitations, we could end up with a better comprehension of ourselves and more interpretable, efficient AI.

“A better understanding of human intelligence helps us better understand and model computers, and we can use these models to understand human intelligence,” Griffiths said. “All of this is very new. We’re exploring the space of possibilities.”

In February, Lake and his collaborators created the first AI model trained on the experiences of a child, using videos captured in Frank’s lab more than a decade ago. The model was published in the journal Science and, based on 60 hours of footage, was able to match different moments with words.

Type in “sand” and the model will recall the moment, 11 years ago, when the boy whose experiences the model was trained on visited the beach with his mother. Type in “car” and the model brings up a first-person video of the boy sitting in his booster seat.

The training videos are old and grainy, and the data are fairly sparse, but the model’s ability to form some kind of conceptual mapping of the world suggests that it might be possible for language to be picked up mostly through association.

“We had one reviewer on the paper who said, ‘Before I read this, I would’ve thought this was impossible’,” said Wai Keen Vong, a researcher at NYU who helped lead the work.

For Lake, and for other investigators like him, these interlocking questions – How humanlike can we make AI? What makes us human? – present the most exciting research on the horizon. To pursue the former question piece by piece, by modelling social interactions, intentions and biases, by collecting comprehensive video footage from a headcam mounted on a one-year-old, is to move closer to answering the latter.

“If the field can get to the place where models are trained on nothing but the data that a single child saw, and they do well on a huge set of tasks, that would be a huge scientific achievement,” Lake said.

In their apartment, Lake and Kwan were gathering Luna and her older brother, Logan, for a birthday party. The children, crowding into the doorway, pulled on their socks and shoes. Lake stopped the recording on Luna’s camera and handed her a pair of fuzzy white mittens with sheep faces on them. “What are those, Luna?” he asked.

“Baa baa,” Luna said.

Kwan said, “There was a time when she didn’t know the word ‘no’, and it was just ‘yes’ to everything.” She addressed Luna: “Kisses, do you want kisses?”

“No,” Luna said.

“Oh,” Lake said, laughing. “I do miss the ‘yes’ phase.” – The New York Times
