Why LLMs Feel Randomly Fast or Painfully Slow

If you have ever used the same AI model, same prompt style, same app, and still felt wildly different performance day to day, you are not imagining things.

Most people blame “server load” or say the model is inconsistent.

That is lazy thinking.

What actually decides whether an LLM feels instant or unusable has very little to do with the model itself and a lot to do with what happens before and after your prompt hits the model.

This week I want to break down the real reasons behind that experience.


The illusion of “model performance”

Here is the uncomfortable truth.

When people say:

“This model feels slow”

What they are really reacting to is:

  • How expensive their input was to process
  • Whether the system could reuse past computation
  • How the request was shaped before inference even started

The model is often the least interesting part of the story.

Three things dominate perceived speed:

  1. Tokenization
  2. KV cache
  3. How you structure interactions over time

Miss these, and you will chase the wrong optimizations forever.


Tokenization is the first silent tax

Before a model “thinks”, your text is broken into tokens. 

Not characters. Not words. Tokens.

Two prompts that look similar to a human can have very different token counts. Code blocks, JSON, logs, stack traces, and verbose instructions explode token counts fast. 
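
You can check this on your own prompts. A minimal sketch using the tiktoken library; the strings here are illustrative and the exact tokenizer depends on the model you call.

```python
# Rough sketch using the tiktoken library. cl100k_base is an approximation;
# the exact tokenizer depends on the model you are calling.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

prose = "Summarize the customer's complaint and draft a polite reply."
log_line = '{"level":"ERROR","trace_id":"a1b2c3","msg":"Upstream timeout after 30000 ms"}'

# Structured text (JSON, logs, code) usually costs noticeably more tokens
# per character than plain prose, so "similar length" is not similar cost.
print(len(enc.encode(prose)), "tokens for the prose prompt")
print(len(enc.encode(log_line)), "tokens for the log line")
```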

Why this matters:

  • More tokens mean more computation
  • More computation means more latency
  • More latency makes the model feel slow even before generation begins

Most teams do not even log token counts.

They benchmark models while blindly inflating inputs.

That is not benchmarking. That is guessing.


KV cache is the reason chat feels fast or broken

KV cache is the real hero of conversational AI.

In simple terms:

  • The model caches the attention keys and values for context it has already processed (see the toy sketch below)
  • When you continue a conversation, it reuses that work
  • Reuse equals speed
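
To make that reuse concrete, here is a toy single-head attention step with a growing cache. An illustrative numpy sketch, not how any production inference engine is written.

```python
import numpy as np

d = 64                              # toy hidden size, made up for the example
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.normal(scale=0.02, size=(d, d)) for _ in range(3))

k_cache, v_cache = [], []           # grows by one entry per processed token

def attend_next(x):
    """Process one new token embedding x of shape (d,) using the cache."""
    q = x @ Wq
    k_cache.append(x @ Wk)          # only the new token's key/value are computed...
    v_cache.append(x @ Wv)          # ...everything older is reused as-is
    K, V = np.stack(k_cache), np.stack(v_cache)
    scores = K @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V              # attention output for the new token only

for _ in range(5):                  # five decoding steps, each one cheap
    out = attend_next(rng.normal(size=d))
```

Each new token does a little new work and borrows the rest from the cache. Throw the cache away and every token pays for the full history again.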

When KV cache works:

  • Follow-up questions feel instant
  • Iterative refinement feels smooth

When it breaks:

  • Every message feels like a cold start
  • Latency spikes without obvious reason

Common ways teams accidentally kill KV cache, contrasted in the sketch after this list:

  • Rewriting the full prompt every turn
  • Injecting dynamic system messages repeatedly
  • Resending large static context on each request
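
Here is the difference in practice. A sketch assuming an OpenAI-style chat API where previous work is reused only when the leading tokens of the request are identical across turns; the names and content are illustrative.

```python
from datetime import datetime, timezone

# Large, unchanging instructions. Hypothetical content.
STATIC_SYSTEM = "You are the support assistant for Acme. Follow the policy below.\n..."

def build_cache_friendly(history, user_msg):
    # Static context first and unchanged, new content appended at the end:
    # the shared prefix is identical turn after turn, so it can be reused.
    return ([{"role": "system", "content": STATIC_SYSTEM}]
            + history
            + [{"role": "user", "content": user_msg}])

def build_cache_hostile(history, user_msg):
    # A per-request value at the very top changes the first tokens every turn,
    # so nothing after it matches the previous request and reuse never happens.
    stamp = datetime.now(timezone.utc).isoformat()
    dynamic_system = f"Current time: {stamp}\n{STATIC_SYSTEM}"
    return ([{"role": "system", "content": dynamic_system}]
            + history
            + [{"role": "user", "content": user_msg}])
```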

From the outside, it looks like model inconsistency.

From the inside, it is just bad prompt architecture.

For a deeper dive, see the earlier post “KV Cache: Why Models Become Fast”.

Why this feels random to users

Now combine both effects.

One day:

  • Short prompt
  • Clean conversational flow
  • KV cache intact

Next day:

  • Slightly longer input
  • New system instruction injected
  • Cache invalidated

Same model. Same app. Completely different experience.

Users do not care why. They just say “it feels worse today”.

Engineers panic and start model hopping.

That is backwards.


The builder mistake I see everywhere

Teams obsess over:

  • GPT vs Claude vs open models
  • Benchmarks
  • Leaderboards

But ignore:

  • Input shaping
  • Token discipline
  • Context reuse 

This is like tuning a race car engine while dragging a parachute.

The fastest teams I see optimize inputs first, models second.

They treat tokens like money.

They treat cache like gold.


Practical takeaways you can apply immediately

If you build or integrate LLMs, do this:

  1. Log token counts for every request (see the sketch after this list)
  2. Minimize repeated static context
  3. Structure conversations so cache can survive
  4. Do not resend what the model already knows
  5. Benchmark with real prompts, not toy examples
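
A minimal version of the first takeaway, assuming an OpenAI-style Python SDK client and tiktoken as a rough counter; adapt it to whatever stack you actually run.

```python
import time
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # approximation; real tokenizers vary by model

def logged_chat(client, model, messages):
    """Call an OpenAI-style chat endpoint, logging input size and latency.
    `client` is assumed to be an openai.OpenAI() instance; adapt to your SDK."""
    prompt_text = "".join(m["content"] for m in messages)
    approx_tokens = len(enc.encode(prompt_text))

    start = time.perf_counter()
    response = client.chat.completions.create(model=model, messages=messages)
    elapsed = time.perf_counter() - start

    print(f"model={model} prompt_tokens~{approx_tokens} latency={elapsed:.2f}s")
    return response
```

Even a crude log like this is enough to notice the day someone quietly doubles the system prompt.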

If your AI feels slow, fix your pipeline before blaming the model.


Final thought

LLMs do not feel fast because they are “smart”.

They feel fast because the system around them is disciplined.

Speed is an architectural decision, not a model feature.

InsideTheStack will keep breaking down these hidden layers every week.

Quietly. Precisely. Without hype.