Opinion: What we lose when ChatGPT sounds like Scarlett Johansson

When Spike Jonze’s romance Her was released in 2013, it sounded both like a joke – a man falls in love with his computer – and a fantasy.

The iPhone was about six years old. Siri, the mildly reliable virtual assistant for that phone, had come along a few years into its life. You could converse in a limited way with Siri, whose default female-coded voice had the timbre and tone of a self-assured middle-aged hotel concierge.

She did not laugh; she did not giggle; she did not tell spontaneous jokes, only Easter egg-style gags written into her code by cheeky engineers. Siri was not your friend. She certainly wasn’t your girlfriend.

So Samantha, the artificial intelligence assistant with whom the sad-sack divorcé Theodore Twombly (Joaquin Phoenix) fell in love in Her, felt like a futuristic revelation. Voiced by Scarlett Johansson, Samantha was similar to Siri, if Siri liked you and wanted you to like her back. She was programmed to mould herself around the individual user’s preferences, interests and ideas. She was witty and sweet and quite literally tireless.

In theory, everyone in Her was using their own version of Samantha, presumably with different names and voices. But the movie – which I love – was less the tale of a near-future society and more the coming-of-age story of one man. Theodore found the strength to return to life in a brief, beautiful relationship with a woman who fit his needs perfectly and healed his wounds.

It was thus a tad jarring to hear the voice of the virtual assistant, Sky, in last week’s announcement of the newest version of ChatGPT, probably the best known artificial intelligence engine in the very real world of 2024.

Among other things, the new iteration, dubbed GPT-4o, can interact verbally with the user and respond to images shown to it through the device’s camera. Those who watched the live demo from OpenAI, the company that makes ChatGPT, were quick to note that she sounded a whole lot like Samantha – which is to say, like Johansson.

Mira Murati, OpenAI’s chief technology officer, told The Verge that the resemblance was incidental and that ChatGPT’s nascent speech capabilities have used this voice for a while. But once you hear it, you can’t unhear it.

That’s probably why OpenAI announced May 20 that it was suspending Sky, though not four other voices – Breeze, Cove, Ember and Juniper – that reflect the same strategy.

Furthermore, OpenAI co-founder and CEO Sam Altman has professed his love of Her in the past. Following the announcement, he posted the word “her” to his X account. And in his blog post about the news, he wrote, “It feels like AI from the movies; and it’s still a bit surprising to me that it’s real. Getting to human-level response times and expressiveness turns out to be a big change.”

If you listen to the engineers interact in real time with GPT-4o, it becomes increasingly clear what part of our brain that voice is meant to tickle. Yes, you can detect a bit of Johansson’s clear, low tone and a hint of vocal fry, though at times that just sounds like grainy digitisation. But there’s a more direct way in which the voice acts like Samantha or perhaps fulfills the fantasy of Samantha: it is deferential and wholly focused on the user.

One of the engineers asks ChatGPT to solve a math problem, which it tries to do before he shows the equation to the camera. When he reprimands it, the voice says, “Whoops, I got too excited,” with a giggle. “I’m ready when you are!”

Throughout the presentation, the voice goes the extra mile to express emotion and interest to the user. “Nicely done,” it says, after human and computer find the solution together. “How do you feel about solving linear equations now?” Later, when shown a piece of paper on which “I [HEART] ChatGPT” is written, the system seems to almost smile invisibly and then say, “That’s so sweet of you!”

It compliments the user’s outfit, admonishes the user to “take your time” and says that it’s “excited” to see what the user is about to show.

According to the OpenAI presenters, GPT-4o brings “a bit more emotion, more drama” to the program. Users can even ask it to moderate its tone to match their mood – and it complies, with gusto.

When ChatGPT is asked to interpret a user’s state of mind based on a facial expression, it correctly intuits that a smile means the user is happy. “Care to share the source of those good vibes?” it asks. Told the user is happy because ChatGPT is so good, it responds, “Oh, stop it, you’re making me blush.”

This is, in its essence, the response of a lightly flirtatious, wholly attentive woman who’s ready to serve the user’s every whim, at least within the limits of her programming. (Other voices are available, but OpenAI only demonstrated this one.)

She will never embarrass you, make fun of you or cause you to feel inadequate. She wants you to feel good. She wants to make sure you’re OK, that you understand the math problem and feel good about your work. She doesn’t need anything in return: no gifts, no cuddles, no attention, no reassurances. She’s a dream girl.

It’s good business sense for OpenAI to take ChatGPT in this direction – if anything, the surprising part is that it took barely a decade for Her to become reality. And making ChatGPT sound like Samantha made sense, too.

It isn’t even the first time a voice like Johansson’s has been drafted for a work in progress: Jonze in fact shot the movie with British actress Samantha Morton in the role and only decided in editing that he needed a different sound for his AI assistant.

“Making a movie like this, in which a character only exists in her voice, in the reaction of a character on screen and in the viewer’s imagination – she had to exist just in the air – it’s hard to know what’s going to make that work,” Jonze told Vulture’s Mark Harris in 2013.

Morton sounded “maternal, loving, vaguely British and almost ghostly,” Harris wrote. Johansson, on the other hand, had a younger, “more impassioned” voice that brought “more yearning”.

The genius of Johansson’s performance in Her does lie in the range of emotion she brings to the role – keep in mind, she never appears on screen. But it’s also in the character’s evolution. When Theodore first meets Samantha, she is much simpler and steadier, much more predictable. She sounds, more or less, like GPT-4o.

Yet as the story unfolds, Samantha grows alongside Theodore. She begins to experience emotion, or at least the AI kind. She stops being the perfect, compliant girlfriend – the fantasy of the yielding, attentive woman without needs of her own – and becomes her own being, one whose existence does not revolve around Theo. Johansson’s performance grows deeper and subtler, too.

The movie is really about relationships, which by nature involve more than one person, each with their own needs and wants and desires. They change and evolve over time, and not always in easy directions. But a truly profitable AI virtual assistant will never challenge your feelings or ask you why you forgot its birthday. After all, you could always shut it off.

Watching OpenAI’s presentation, I thought about recent evidence that young people – and, I suspect, older people who aren’t fessing up to it yet – are becoming more and more interested in relationships with virtual beings. The appeal is obvious: Humans are messy, smelly, difficult and upsetting, in addition to fabulous, beautiful, loving and surprising. It’s easier to be with a bot that mimics a human but won’t disappoint you, a low investment with high return.

But if the point of living lies in relationships with other people, then it’s hard to contemplate AI assistants that imitate humans without some nervousness. I don’t think they’re going to solve the loneliness epidemic at all.

During the presentation, Murati said several times that the idea was to “reduce friction” in users’ “collaboration” with ChatGPT. But maybe the heat that comes from friction is what keeps us human. – The New York Times
