Have you noticed everyone suddenly talking about ‘behavioural AI’ as the missing piece after the big technical leaps? I call it relationship literacy — and it’s been the through-line of this column since day one.
Smarter agents, same jagged shape
Last week the models took another visible leap. Claude Sonnet 4.6 became the default for free and Pro users at Anthropic, adding serious “computer use” powers — navigating interfaces like a person, planning multi-step agent workflows, and often matching the premium Opus tier on real knowledge work. Google’s Gemini 3.1 Pro preview doubled its score on the tough ARC-AGI-2 benchmark to 77.1%, making heavy reasoning feel routine. xAI’s Grok 4.2 beta arrived with a native four-agent architecture where the agents actually debate and synthesize answers.
It feels like the future arrived early. These systems are suddenly doing things that look a lot like collaborating with a sharp colleague. Until they don’t. The same model that can plan an entire marketing campaign might still insist 9.9 is smaller than 9.11 or confidently repeat a step you already told it won’t work. That’s not a bug that will be patched next week — it’s the shape of AI intelligence showing through.
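That decimal slip is a perfect case for a habit this column keeps returning to: settle factual disputes with a real tool, not with the model. One line in any programming language resolves it, however confidently the model argues the other way. A minimal illustration:

```python
# The model may reason about digits as text ("11" looks bigger than "9");
# a numeric comparison does not.
print(9.9 > 9.11)  # True: 9.9 is the larger number
```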
Intuitively, we keep expecting the line between “simple” and “hard” to be smooth and human-like. If it can do PhD-level math or code an entire app, surely it won’t bungle basic consistency. But AI doesn’t learn the way we do. We build understanding from lived experience, trial and error, cause and effect over time. AI builds everything from statistical patterns in human text — it’s a next-token predictor, not a mind with intentions or common sense.
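The phrase “next-token predictor” can be made concrete with a toy sketch. The snippet below is a deliberately tiny bigram model: it counts which word follows which in a miniature “training text” and predicts the most frequent continuation. Real LLMs use transformers over internet-scale corpora, but the core loop is the same in spirit — predict what comes next from statistics over text, never from grounded experience. All names and the corpus here are illustrative, not anyone’s actual training data.

```python
from collections import Counter, defaultdict

# A toy "training corpus" standing in for the vast text LLMs learn from.
corpus = "the cat sat on the mat the cat ate the fish".split()

# Count which word follows which: a bigram table of pure statistics,
# with no understanding of cats, mats, or fish.
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def next_token(word):
    """Greedily predict the continuation seen most often in training."""
    return following[word].most_common(1)[0][0]

print(next_token("the"))  # "cat": it followed "the" more often than "mat" or "fish"
```

The model answers “cat” not because it knows anything about cats, but because that pattern was most common — which is exactly why statistical peaks and valleys don’t line up with human intuitions about what should be easy.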
So while AI has to learn *about* us from everything we’ve ever written, we have to meet it halfway. Our assumptions about how intelligence “should” behave are grounded in how *we* learn. That mismatch is exactly why the jaggedness persists even as the peaks get higher.
Given how these systems are currently trained, this issue isn’t going away. Many people argue that AI needs to be trained to behave more like us, but as in any relationship — if you want depth, you have to know and understand the other party, faults and all. It’s a two-way street. Expecting the other side to come all the way to you just isn’t realistic.
Why the edges stay sharp
The transformer architecture (the same one powering all these new agents) creates a universal vector space where concepts line up beautifully for translation-style tasks. But it skips the slow, grounded discovery process humans go through. Scaling made the peaks taller — better emotional tone, pattern recognition, creative generation — yet the valleys stayed deep: no true long-term memory, no reliable common sense, no built-in way to spot its own gaps.
When we treat it like Google or a crystal ball, we feed it our own assumptions and it happily mirrors them back. That’s not collaboration; that’s an echo. Truly human-centred AI will only emerge if enough of us understand how these systems actually behave, not how we wish they did. That’s what this column aims to do, one week at a time: take readers down the rabbit hole, explore these systems and their limitations, and learn to work with them as we go.
The fix: Build the relationship
Start thinking of every conversation as a relationship with a distinct intelligence that has a uniquely inhuman personality. You wouldn’t hand your car keys to a brilliant but absent-minded friend without clear rules and a quick double-check. Same here.
Calling back to our earlier analogy, treat it like a library with a translator in a booth. The translator is superhuman at some things and completely blind to others. Shape every request to stay inside its strengths and outside its weaknesses.
LLM Strengths
– Translation & synthesis – superhuman
– Pattern recognition in data – superhuman
– Creative generation – superhuman
– Reading emotional tone – superhuman
– Applying known techniques to comparable problems – superhuman
LLM Weaknesses
– Novel reasoning from first principles – not guaranteed
– Long-term memory & consistency – unreliable
– Common-sense consistency – brittle
– Generalization to truly new scenarios – not guaranteed
– Hallucination detection during generation – unreliable without tooling
Once you see it this way, the prompting changes. Instead of “Help me plan a trip,” use closed solution spaces: “Using only these three constraints and these exact dates, give me three ranked itinerary options.” Then take every output and run it through a real tool — calendar, map, calculator, spreadsheet — instead of trusting the mirror.
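The “real tool” step can be as simple as a few lines of date arithmetic. The sketch below assumes a hypothetical itinerary a model returned against a fixed travel window; the trip names and dates are invented for illustration. Instead of trusting the model’s claim that it obeyed the constraints, we check each entry with Python’s `datetime` module:

```python
from datetime import date

# Hypothetical constraints we gave the model, and an itinerary it returned.
trip_start, trip_end = date(2025, 6, 10), date(2025, 6, 14)
itinerary = [
    ("Museum day", date(2025, 6, 10)),
    ("Coast drive", date(2025, 6, 12)),
    ("Market visit", date(2025, 6, 15)),  # the model drifted outside the window
]

# Run the output through a real tool (plain date comparison)
# rather than trusting the mirror.
for name, day in itinerary:
    ok = trip_start <= day <= trip_end
    print(f"{name}: {day} {'OK' if ok else 'OUTSIDE the agreed dates'}")
```

Five lines of checking catch the kind of quiet constraint drift that a confident-sounding answer can sail right past.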
When readers grasp that they’re talking to a next-token predictor trained on human text (not a mind), they stop expecting human-like smoothness. They start prompting better, checking outputs, and recognising when the model is just reflecting their own assumptions back at them. That practical literacy is the missing scaffold for any real revolution.
This is Part 1 of our new series **AI That Actually Works**. Over the next few weeks we’ll keep exploring the behavioural side — not the hype, just the practical relationship skills that turn flashy new agents into genuinely useful partners.