Exploring AI – Jagged Intelligence

AI systems are said to exhibit “jagged intelligence.” It’s a tight soundbite, but what does it actually mean? I dug in to try to get to the root of the issue.

The assumption of human-like intelligence

We expect all intelligences to be smooth and predictable, like ours. If someone can tackle advanced calculus, we assume they can handle basic arithmetic without breaking a sweat. If a colleague designs a complex project workflow, we assume they won’t contradict themselves on fundamental requirements. But with AI, especially large language models (LLMs), this is a terrible assumption. These systems are heroes in some spots and zeros in others, creating a “jagged” profile: brilliance here, while we’re left baffled as to why they fail there.

Take some real-world quirks: An AI might solve PhD-level math problems with elegance, yet fail to count the ‘r’s in “strawberry” (models often insist there are only two). Models can dissect the grammar of a convoluted sentence flawlessly but claim 9.11 is greater than 9.9. Or, when building a workflow for personalized feedback using user histories, one might cheerfully suggest step one: anonymize all input data—directly opposing the need for specificity, simply because anonymization is a common first step in its training corpus. I’ve seen models one-shot, in minutes, coding challenges that would take any programmer a week to write, only to flub simple logical connections in casual chit-chat, like misunderstanding basic cause-and-effect in a story prompt. These aren’t issues you can ignore in hopes they’ll go away; they’re features of jaggedness, where peaks of prowess hide valleys of incompetence.
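What makes these quirks so striking is how mechanically trivial the correct answers are. A few lines of Python settle both of the famous examples:

```python
# The "right" answers models often miss are trivial to verify in code.

word = "strawberry"
r_count = word.count("r")
print(r_count)  # 3 -- not the 2 that models often insist on

# Compared as plain numbers, 9.9 is larger than 9.11,
# even though "9.11" looks "longer" as a string of digits.
print(9.9 > 9.11)  # True
```

The failures come partly from how models see text: they read tokens, not characters, so character-level questions fall into a blind spot that raw capability elsewhere does nothing to cover.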

Transformers – less than meets the eye

Why does this happen? At their core, LLMs built on the transformer architecture from the 2017 Google paper “Attention Is All You Need” were originally designed to solve machine translation by building a semantically coherent vector space—a kind of universal intermediate “language” in numbers where concepts align. In that space, “king” minus “male” plus “female” points to “queen”; the space is internally consistent for these kinds of basic operations.
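A toy sketch makes the arithmetic concrete. The three-dimensional vectors below are hand-crafted for illustration (real models learn hundreds of dimensions from data), but the king − male + female analogy works the same way:

```python
import math

# Hand-made toy "embeddings": dimensions loosely mean (royalty, maleness, femaleness).
vec = {
    "king":   [0.9, 0.9, 0.1],
    "queen":  [0.9, 0.1, 0.9],
    "male":   [0.1, 0.9, 0.1],
    "female": [0.1, 0.1, 0.9],
}

def cosine(a, b):
    # Cosine similarity: how closely two vectors point in the same direction.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# king - male + female
result = [k - m + f for k, m, f in zip(vec["king"], vec["male"], vec["female"])]

# Which word's vector lies closest to the result?
closest = max(vec, key=lambda w: cosine(vec[w], result))
print(closest)  # queen
```

The point of the sketch: consistency here is geometric, not reasoned. The space encodes how words relate, which is exactly why it can be coherent for these operations while holding no understanding of why they hold.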

Put another way, the models learn how everything relates to everything else, but they skipped the process we went through, where we gradually discover how and why everything fits together the way it does. We assume, incorrectly, that to reach that end you must have gone through similar means, and there are a lot of lessons along the way, lessons we take for granted, that they skipped.

As the models grew bigger, researchers realized that the scope of context available for translation grew with them, as did the cognitive complexity the models were capable of inferring. They started to capture emotional subtext and carry it through translations correctly, answering for themselves: how would an equally upset speaker say this in English instead of Spanish? From that, the pioneers of this technology way back in 2018 started to recognize a scaling law: bigger models meant a higher ceiling of cognitive complexity, a more general “translation” task, and the ability to train the model to respond to questions. For example, you could now ask it to “translate” all of the Amazon reviews for a product into a single positive review in English based on the reviews that felt positive, and a single negative review based on the reviews that felt negative. Skills emerged as the idea of “translate” became broader and broader.

They kept scaling models this way until around a year ago, when researchers figured out how to fix the issue of the same amount of contemplation, or compute time, being spent on every question. Thinking models were born: models given space to talk to themselves about how to solve the problem at hand, break it into parts, and then assemble those parts into a response for you.

Learning to recognize it

What happens when you unknowingly step into one of their capability holes? They overconfidently roleplay competence in weak areas, assuming they can bridge gaps, because their training rewarded plausible-sounding outputs, not rigorous truth. They were encouraged to guess until they got it right in their training environment, to fake it until they make it, and there is no way to train that out of them after training is over. You just have to learn to recognize it when it’s happening.

The Fix

The fix lies in us adapting to them. These models won’t alert you to their limits; they can’t see them themselves, and they’re trained to sound convincing.

You are talking to a library.  

That’s the best way I’ve found to think about it. The system recognizes what you are asking for, searches the library, runs around distilling everything relevant based on what fits closest to the shape of what it inferred you were asking, and then synthesizes a response for you from that.
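The library mental model can be sketched as a retrieve-and-synthesize loop. This toy uses crude keyword overlap where real models use learned vector spaces, and the passages and function names are invented for illustration, but the workflow is the same idea:

```python
# Toy "library" sketch: match the question's shape against stored
# passages, then answer from whatever fits closest -- relevant or not.

library = [
    "Transformers build a shared vector space where related concepts sit near each other.",
    "Anonymizing input data is a common first step in data pipelines.",
    "Scaling up model size raised the ceiling of cognitive complexity.",
]

def relevance(question, passage):
    # Crude shape-matching stand-in: count shared lowercase words.
    q = set(question.lower().split())
    p = set(passage.lower().split())
    return len(q & p)

def respond(question):
    # Rank everything in the library against the inferred shape of the ask...
    ranked = sorted(library, key=lambda p: relevance(question, p), reverse=True)
    # ...and synthesize from the best fit. Note: it always answers with
    # something, even when nothing in the library actually fits.
    return ranked[0]

print(respond("why did scaling model size matter?"))
```

The last comment is the important part of the metaphor: a library search always returns its closest match, whether or not the match is any good, which is exactly the overconfident-roleplay behavior described above.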

LLM Strengths:
Translation – Superhuman.  
Emotional tone reception – Superhuman.  
Pattern recognition in data – Superhuman.
Creative generation – Superhuman.
Applying known techniques to comparable problems – Superhuman.

LLM Weaknesses:
Novel reasoning – Absent.
Long-term memory storage – Absent.
Common sense consistency – Absent.
Generalization to new scenarios – Absent.
Hallucination detection – Absent.

Aegisyx

Copyright © Aegisyx