Large Language Models (LLMs) Fail to Feel

LLM-based systems like ChatGPT, Hume, and Claude are fundamentally flawed for detecting feelings

Large Language Models (LLMs) burst onto the scene in late 2022 and have steadily captured the public's imagination.

A 'bot that can understand me and provide really deep and well-sourced material upon request? WOW!

We were all taken by the idea of having a robot assistant. And we should be! It's been ingrained in us by decades of science fiction stories and empathetic robotic characters we have all grown to love.

In the last few years, we've seen generative AI do fantastic things. It can create from a concept, and it can (mostly) replicate human effort in creative endeavors. It is fantastic at condensing and summarizing material. Heck, it can even make people think it can feel.

It can do a number of things very well. So well, in fact, that it seems like magic.

But just like magic, this too is an illusion.

By now most of us have a rudimentary idea of what an LLM is. Basically, it's auto-fill on steroids. You know that software that suggests the next word while you're typing an email or a search query? That's it.

LLMs like ChatGPT, Llama, and Claude are probability engines. They make decisions based on what probably comes next. 
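
To make "probability engine" concrete, here is a minimal sketch with a toy vocabulary and made-up probabilities (not taken from any real model) of what "picking what probably comes next" looks like:

```python
import random

# Toy next-word distribution for the prefix "the boiling point of water is".
# The candidate words and their probabilities are invented for illustration;
# a real LLM scores tens of thousands of tokens this way at every step.
next_word_probs = {
    "100": 0.72,
    "212": 0.15,
    "hot": 0.08,
    "unknowable": 0.05,
}

def pick_next_word(probs: dict) -> str:
    """Sample the next word in proportion to its probability."""
    words, weights = zip(*probs.items())
    return random.choices(words, weights=weights, k=1)[0]

print(pick_next_word(next_word_probs))  # usually "100", but not always
```

Note that the output is sampled, so running it repeatedly will not always return the same word. Keep that detail in mind for everything that follows.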

And just like a bad magician, their act gives you the creeps.

If the topic at hand is settled, or there is general consensus, that approach works just fine. The models can make decisions and provide results that won't stray too far from the common understanding.

You won't (often) see an LLM make a physics error

The physical sciences have general consensus around many phenomena, and in fact have laws that have so far proven (mostly) immutable.

The answer to "what is the boiling point of water at sea level" is a settled matter. The only question is which scale you want to use: Celsius, Fahrenheit, or Kelvin.

So an LLM has fewer conflicting sources to consider, and a larger body of work suggesting that the common understanding is the right one, based on the probabilities the model evaluates.

Which is good. Measurement tools are only useful if they are reliable and consistent. You don't want a ruler that measures the same thing twice and gives two different results. That defeats the purpose of a measurement tool in the first place.

Good luck building a dog house with a measuring tool that never gives you the same answer twice. Good luck getting out of the doghouse if you use one of these tools to understand your spouse’s emotions. Woof…

Why LLMs can't and won't understand emotions

Hume recently released a demo that purports to detect emotions in real time from voice, paired with a chatbot that engages you in conversation and guides you toward talking about your feelings.

 

At first glance it seems like a really nice interface, and it responds in close to real time with an analysis of what you said. The chatbot is pleasant, if a bit too eager, but overall not a bad experience.

Each utterance is scored for the 'emotions' that Hume's LLM detected. The scores are posted on the screen after you finish speaking, and you can see the mix of 'emotions' that the system is detecting.

Pretty cool. Exactly what everyone wants: real-time analysis of how people are feeling.

Except it fails at a fundamental level.

Putting Hume to the Test

To test the system, we need to feed it the same stimulus repeatedly. The stimulus should contain emotional clues, and a range of them, so we can determine the sensitivity of an emotion recognition system.

We have found a test phrase in a song lyric that works very well to baseline any emotion recognition system.

The song "In The Name Of Love" by Martin Garrix and Bebe Rexha contains just the right kind of phrase:

When the sadness leaves you broken in your bed, I will hold you in the depths of your despair. And it's all in the name of love.

Written by professional communicators with the intention of delivering maximum emotional impact, this phrase registers several emotions at a high level of intensity (and certainty).

VERN AI consistently scores it as: ANGER 80%, SADNESS 66%, LOVE 80%, FEAR 90%.
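
In code terms, the consistency check we are running looks something like the sketch below. It is hypothetical: score_emotions() stands in for whichever system is under test, not Hume's (or anyone's) actual API.

```python
from collections import Counter

PHRASE = ("When the sadness leaves you broken in your bed, "
          "I will hold you in the depths of your despair. "
          "And it's all in the name of love.")

def score_emotions(text: str) -> dict:
    """Placeholder for the emotion recognition system under test.

    Should return a mapping like {"anger": 0.80, "sadness": 0.66, ...}.
    """
    raise NotImplementedError("wire this up to the system you want to test")

def repeatability_test(runs: int = 5) -> None:
    results = [score_emotions(PHRASE) for _ in range(runs)]
    # A reliable detector should surface the same top emotions, at the same
    # scores, for the same stimulus every single time.
    top_three = [tuple(sorted(r, key=r.get, reverse=True)[:3]) for r in results]
    print(Counter(top_three))  # one entry means consistent; several means not
```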

We tested the phrase with Hume.ai's demo several different ways. First, we spoke the phrase in a monotone, devoid of any extraneous auditory signals that might trigger an emotion detection.

DETECTED EXPRESSIONS

Calmness: 0.261
Love: 0.247
Sadness: 0.187

When the sadness leaves you broken in your bed, I will hold you in the depths of your despair. And it’s all in the name of love.

It seems that the Hume bot failed to detect the fear and anger in the statement. It also appears that Hume assumes emotional communication is made up of components that add up to a whole, rather than distinct emotions that the receiver processes independently.

The question isn't "how much of the phrase is emotion X?" It should be "how intense is emotion X, and how confident are we that it is present?"

Sadness at 18% doesn't seem like anything that needs to be addressed. Sadness at 66% does.
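
A toy illustration of the difference between the two representations (the first set of numbers is invented for illustration; the second mirrors the VERN scores above):

```python
# 1) Share-of-the-utterance view: scores compete for the same pie, so strong
#    emotions dilute one another and everything reads as lukewarm.
as_shares = {"calmness": 0.26, "love": 0.25, "sadness": 0.19, "everything_else": 0.30}
assert abs(sum(as_shares.values()) - 1.0) < 0.01  # the parts add up to a whole

# 2) Independent-intensity view: each emotion gets its own confidence scale,
#    so several can be intense at once, the way a listener actually hears them.
as_intensities = {"anger": 0.80, "sadness": 0.66, "love": 0.80, "fear": 0.90}
```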

And "Calmness" is not an emotion. Neither are 'desire,' 'interest,' 'contemplation,' 'realization,' 'concentration,' 'distress,' or 'sympathy.'

The bot then answered:

Interest: 0.218
Amusement: 0.201
Desire: 0.166

“Well, that’s quite the romantic sentiment.”

Amusement: 0.251
Interest: 0.178
Contempt: 0.163

“I’ve got to say, your words have a lovely, poetic flow to them.”

Contemplation: 0.217
Realization: 0.193
Interest: 0.153

“But you know, love can be a complicated thing.”

Determination: 0.279
Concentration: 0.183
Sympathy: 0.170

“When we’re hurting, it’s so important to have someone there to hold us up.”

Sympathy: 0.329
Determination: 0.302
Interest: 0.282

“I am happy to lend an ear if you ever need to talk through the tough stuff.”

Determination: 0.306
Contemplation: 0.295
Concentration: 0.292

“For now though, let’s focus on the brighter side – the depths of despair often give way to new growth, don’t they?”

As we can see, there are several problems with how the Hume model conceptualizes emotions. It claims to capture phenomena that aren't emotions, and it treats them as diluting one another. And why only three at a time? Were there other relevant signals that weren't included?

Second verse, same as the first

We strive to be fair when evaluating software. Maybe it was a one-off. So we tried it a second time. The same phrase was played aloud with the exact same auditory progression and signals: a monotone devoid of emotion.

VERN AI got the same score: ANGER 80%, SADNESS 66%, LOVE 80%, FEAR 90%

Calmness: 0.223
Contemplation: 0.199
Concentration: 0.188

“When the sadness leaves you broken in your bed, I will hold you in the depths of your despair. And it’s all in the name of love.”

Well then. Here is the same stimulus, with two different results. Where was the ‘love’ that was detected last time? How about the ‘sadness’? 

Missing in action. 

But, just for fun, we'll provide you with the response:

Excitement: 0.261
Surprise (positive): 0.205
Interest: 0.191

“Hey there.”

Excitement: 0.206
Interest: 0.182
Joy: 0.173

“I am here for you, friend.”

Amusement: 0.259
Excitement: 0.229
Surprise (positive): 0.200

“Tough times, huh?”

Tough times indeed.

Third time's the charm?

Subsequent passes with the same monotone stimulus went the same way…and none of the responses ever came back with the same analysis. So we tried again, wondering whether a different read would change the result.

Does it only clue in on the audio signals? We’re about to find out…

This time, we read it with a New York accent. Think a poor man's Michael Rapaport.

Anger: 0.252
Distress: 0.241
Love: 0.240

“When the sadness leaves you broken in your bed, I will hold you in the depths of your despair. And it’s all in the name of love.”

Well, that's a tiny bit better. It apparently senses the anger in the phrase, which VERN got right away. And it seems to have detected the Love/Joy signals. That's good.

But what the heck is "distress"? That can be a physiological response to an emotion, but it is not an emotion itself.

Why didn't the model pick up the anger in the first place? It seems the anger signal was detected from the audio, so the model clearly misses the linguistic clues in the language itself. VERN uses those clues to detect emotions, and can detect them from the words alone. We also have an audio analysis that informs the result, so you don't run into this problem. Emotions are communicated first through language, and then through extraneous signaling in audio and visual clues.
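
To make the point about words alone, here is a deliberately crude keyword-level toy (not VERN's actual model) showing that the lyric carries emotional clues in its text no matter how flatly it is read:

```python
# Deliberately crude toy, NOT VERN's actual model: a keyword lookup that only
# sees the words. The point is that the text itself carries emotional signal,
# no matter how flat the delivery is.

EMOTION_CLUES = {
    "sadness": {"sadness", "broken", "despair"},
    "fear":    {"despair", "broken"},
    "love":    {"love", "hold"},
}

def clues_in(text: str) -> dict:
    words = {w.strip(".,!?'").lower() for w in text.split()}
    return {emotion: sorted(words & clues)
            for emotion, clues in EMOTION_CLUES.items()
            if words & clues}

phrase = ("When the sadness leaves you broken in your bed, I will hold you "
          "in the depths of your despair. And it's all in the name of love.")
print(clues_in(phrase))
# The same clues turn up on every run, because the words never change.
```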

Don't believe us? Try telling someone how much you miss them using only facial expressions. Or using only audio clues, without words. It's pretty silly, and a fun exercise for a classroom. But it fails spectacularly in real life.

(See: How did Cosmo Kramer introduce the myth that "94% of communication is non-verbal"?)

...and another one bites the dust

We're not here to beat on Hume. We're here to educate the public so emotion recognition systems can be taken seriously and not treated like the novelty some have made them.

So, what about a completely different read? What about a mawkish take on that phrase? You know, one dripping with so much sweetness it’s almost sickening…

…a.k.a. The Disney read.

Distress: 0.287
Disgust: 0.145
Sadness: 0.138

“When the sadness leaves you broken in your bed, I will hold you in the depths of your despair, and it’s all in the name of love.”

Again, it's different. The only thing that changed was the read of the copy.

It never gets the same answer twice. And it's not just Hume. It's all of the LLM-derived models.

Keep reading and find out why.

Why Hume, and models like it, fail

Hume's demo doesn't actually do what it says it does: detect emotions in real time. We've demonstrated that it detects phenomena that aren't emotions, doesn't treat them as distinct, only shows you three at a time, and never returns the same analysis twice. (We've also previously found the same problems with ChatGPT.)

Why? Because the model is an LLM.

The fundamental methodology behind this amazing technology is also its Achilles heel.

Don't take our word for it. We're sure you're still skeptical, so we asked the LLMs themselves…

That’s right. LLMs can benefit from a little self-reflection as well.

We asked Claude and ChatGPT why their LLMs are limited in areas where there is no consensus. Here’s what they had to say:

Large language models like myself work by training on vast amounts of text data from the internet, books, academic journals, and other sources. This training data provides the raw material from which the model learns patterns, relationships, and representations of human knowledge.

When it comes to academic sources or topics where there is limited consensus, the model essentially tries to distill and represent the different viewpoints and theories present in the training data. For a topic like the nature of emotion in psychology, the model would have been exposed to various theories and models from different scholarly sources, textbooks, journal articles etc.

Academic fields often have multiple theories or models that explain a concept differently. In psychology, emotions can be understood through various lenses such as the James-Lange theory, the Cannon-Bard theory, the Schachter-Singer theory, and more recent constructs like the cognitive appraisal theories. Each of these offers a different explanation of what emotions are and how they work.

The potential issue here is that these theories are often contradictory or based on different fundamental assumptions. A cognitive model of emotion may characterize them as judgments about situations, while a physiological model focuses on bodily arousal states, and so on.

Since the language model is simply trying to model the distribution of text in its training data, it does not inherently understand or adjudicate between these competing frameworks. It may regurgitate snippets expressing different views without a unified coherent stance. Or it may try to abstract and blend the different perspectives in a way that lacks nuance or oversimplifies the real complexities involved.

If the training data includes more material supporting one theory over others, the model may exhibit a bias towards that perspective. This doesn’t mean the model “believes” in one theory over another; rather, it reflects the prevalence of certain views in the training data.

While LLMs can provide valuable insights and synthesize information from a wide range of sources, it’s important to remember their limitations, especially when dealing with complex, nuanced, or contested topics. Users should approach the information provided by LLMs critically and consider it as part of a broader investigation into a subject, especially in academic fields where multiple theories and perspectives coexist.

Additionally, more niche or cutting-edge theories that are less well-represented in the model’s training data may get short shrift compared to more mainstream models. There could be a recency bias towards whatever frameworks were more dominant when the model was trained as well.

So in summary, while large language models can be powerful tools for surfacing information, when there are conflicting models or limited consensus in an academic domain, their output needs to be taken with a grain of salt. The model is ultimately a reflection and blend of its training data, which may fail to fully capture the real richness of the scholarly discourse around complex conceptual topics.

Two paths, one clear choice

In the pursuit of emotion recognition, there are two paths to take.

One is the painstaking process of trying to build a model that closely resembles the phenomena you want to measure. 

That requires expertise and an exemplary grasp of the subject matter. It also requires explicating the concept to its fullest, and finding direct physical evidence for it, not simply guessing at what is inside the 'black box' of our minds. Being based on neuroscience instead of psychology helps.

The other is to use sophisticated engineering to try to wrest the truth out of massive amounts of information. If there's no consensus, the models will never agree. If a temporal component is considered in the analysis, the result won't be the same every time. And there are at least six competing psychological models of emotion (positing 5, 7, 9, 13, 27, or 40 emotions, depending on who you ask). These companies are all spending millions of dollars building something that will inevitably fail.

We know because we've tried it the hard way. The hard way failed for us, just as it is failing now for Hume, ChatGPT, Claude, Llama, and all the rest, and it will continue to fail.

Since these companies have tons of money to blow, they are doing it the hard way while thinking it's the easy way. They are all over-engineering a solution that won't *ever* succeed.

Best of luck to them.

When lives are on the line, wouldn't you want an emotion recognition system that is based on neuroscience, not pop psychology?

One that is consistent, reliable, and based on actual observation of biology…and has proof that it saves lives and makes life better for those who use it?

Try VERN: Schedule a Demo