Why LLMs Feel Randomly Fast or Painfully Slow

If you have ever used the same AI model, same prompt style, same app, and still felt wildly different performance day to day, you are not imagining things.

Most people blame “server load” or say the model is inconsistent.

That is lazy thinking.

What actually decides whether an LLM feels instant or unusable has very little to do with the model itself and a lot to do with what happens before and after your prompt hits the model.

This week I want to break down the real reasons behind that experience.


The illusion of “model performance”

Here is the uncomfortable truth.

When people say:

“This model feels slow”

What they are really reacting to is:

  • How expensive their input was to process
  • Whether the system could reuse past computation
  • How the request was shaped before inference even started

The model is often the least interesting part of the story.

Three things dominate perceived speed:

  1. Tokenization
  2. KV cache
  3. How you structure interactions over time

Miss these, and you will chase the wrong optimizations forever.


Tokenization is the first silent tax

Before a model “thinks”, your text is broken into tokens. 

Not characters. Not words. Tokens.

Two prompts that look similar to a human can have very different token counts. Code blocks, JSON, logs, stack traces, and verbose instructions explode token counts fast. 
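
You can check this on your own prompts. A minimal sketch using the tiktoken library; the strings here are illustrative and the exact tokenizer depends on the model you call.

```python
# Rough sketch using the tiktoken library. cl100k_base is an approximation;
# the exact tokenizer depends on the model you are calling.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

prose = "Summarize the customer's complaint and draft a polite reply."
log_line = '{"level":"ERROR","trace_id":"a1b2c3","msg":"Upstream timeout after 30000 ms"}'

# Structured text (JSON, logs, code) usually costs noticeably more tokens
# per character than plain prose, so "similar length" is not similar cost.
print(len(enc.encode(prose)), "tokens for the prose prompt")
print(len(enc.encode(log_line)), "tokens for the log line")
```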

Why this matters:

  • More tokens mean more computation
  • More computation means more latency
  • More latency makes the model feel slow even before generation begins

Most teams do not even log token counts.

They benchmark models while blindly inflating inputs.

That is not benchmarking. That is guessing.


KV cache is the reason chat feels fast or broken

KV cache is the real hero of conversational AI.

In simple terms:

  • The model caches the attention keys and values for context it has already processed (see the toy sketch below)
  • When you continue a conversation, it reuses that work
  • Reuse equals speed
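
To make that reuse concrete, here is a toy single-head attention step with a growing cache. An illustrative numpy sketch, not how any production inference engine is written.

```python
import numpy as np

d = 64                              # toy hidden size, made up for the example
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.normal(scale=0.02, size=(d, d)) for _ in range(3))

k_cache, v_cache = [], []           # grows by one entry per processed token

def attend_next(x):
    """Process one new token embedding x of shape (d,) using the cache."""
    q = x @ Wq
    k_cache.append(x @ Wk)          # only the new token's key/value are computed...
    v_cache.append(x @ Wv)          # ...everything older is reused as-is
    K, V = np.stack(k_cache), np.stack(v_cache)
    scores = K @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V              # attention output for the new token only

for _ in range(5):                  # five decoding steps, each one cheap
    out = attend_next(rng.normal(size=d))
```

Each new token does a little new work and borrows the rest from the cache. Throw the cache away and every token pays for the full history again.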

When KV cache works:

  • Follow-up questions feel instant
  • Iterative refinement feels smooth

When it breaks:

  • Every message feels like a cold start
  • Latency spikes without obvious reason

Common ways teams accidentally kill KV cache, contrasted in the sketch after this list:

  • Rewriting the full prompt every turn
  • Injecting dynamic system messages repeatedly
  • Resending large static context on each request
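
Here is the difference in practice. A sketch assuming an OpenAI-style chat API where previous work is reused only when the leading tokens of the request are identical across turns; the names and content are illustrative.

```python
from datetime import datetime, timezone

# Large, unchanging instructions. Hypothetical content.
STATIC_SYSTEM = "You are the support assistant for Acme. Follow the policy below.\n..."

def build_cache_friendly(history, user_msg):
    # Static context first and unchanged, new content appended at the end:
    # the shared prefix is identical turn after turn, so it can be reused.
    return ([{"role": "system", "content": STATIC_SYSTEM}]
            + history
            + [{"role": "user", "content": user_msg}])

def build_cache_hostile(history, user_msg):
    # A per-request value at the very top changes the first tokens every turn,
    # so nothing after it matches the previous request and reuse never happens.
    stamp = datetime.now(timezone.utc).isoformat()
    dynamic_system = f"Current time: {stamp}\n{STATIC_SYSTEM}"
    return ([{"role": "system", "content": dynamic_system}]
            + history
            + [{"role": "user", "content": user_msg}])
```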

From the outside, it looks like model inconsistency.

From the inside, it is just bad prompt architecture.

For a deeper dive, see the earlier post “KV Cache: Why Models Become Fast”.

Why this feels random to users

Now combine both effects.

One day:

  • Short prompt
  • Clean conversational flow
  • KV cache intact

Next day:

  • Slightly longer input
  • New system instruction injected
  • Cache invalidated

Same model. Same app. Completely different experience.

Users do not care why. They just say “it feels worse today”.

Engineers panic and start model hopping.

That is backwards.


The builder mistake I see everywhere

Teams obsess over:

  • GPT vs Claude vs open models
  • Benchmarks
  • Leaderboards

But ignore:

  • Input shaping
  • Token discipline
  • Context reuse 

This is like tuning a race car engine while dragging a parachute.

The fastest teams I see optimize inputs first, models second.

They treat tokens like money.

They treat cache like gold.


Practical takeaways you can apply immediately

If you build or integrate LLMs, do this:

  1. Log token counts for every request (see the sketch after this list)
  2. Minimize repeated static context
  3. Structure conversations so cache can survive
  4. Do not resend what the model already knows
  5. Benchmark with real prompts, not toy examples
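
A minimal version of the first takeaway, assuming an OpenAI-style Python SDK client and tiktoken as a rough counter; adapt it to whatever stack you actually run.

```python
import time
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # approximation; real tokenizers vary by model

def logged_chat(client, model, messages):
    """Call an OpenAI-style chat endpoint, logging input size and latency.
    `client` is assumed to be an openai.OpenAI() instance; adapt to your SDK."""
    prompt_text = "".join(m["content"] for m in messages)
    approx_tokens = len(enc.encode(prompt_text))

    start = time.perf_counter()
    response = client.chat.completions.create(model=model, messages=messages)
    elapsed = time.perf_counter() - start

    print(f"model={model} prompt_tokens~{approx_tokens} latency={elapsed:.2f}s")
    return response
```

Even a crude log like this is enough to notice the day someone quietly doubles the system prompt.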

If your AI feels slow, fix your pipeline before blaming the model.


Final thought

LLMs do not feel fast because they are “smart”.

They feel fast because the system around them is disciplined.

Speed is an architectural decision, not a model feature.

InsideTheStack will keep breaking down these hidden layers every week.

Quietly. Precisely. Without hype.