Why Are LLMs So Slow? And How We’re Making Them Faster

When it comes to Large Language Models (LLMs), training them is only half the battle. How you serve them is just as critical. The efficiency with which you can generate output, or “inference,” directly impacts your throughput and, ultimately, your costs.
Imagine you need to handle 100 requests per second, but your system can only manage 10. What do you do? The simplest solution is to use 10 servers. But what if a developer improves the software so each server can handle 20 requests per second? You can now get by with just 5 servers, cutting your costs in half.
It’s a little-known fact that the cost of LLM inference is often much greater than the cost of training. While training is a massive, one-time expense, inference is a continuous cost that scales directly with user growth.
For this reason, there’s a huge push to develop techniques that accelerate LLM inference. In this post, we’ll dive into some of the most prominent technologies aimed at speeding up this process.
Before we get to the solutions, let’s briefly explore why LLM inference is so sluggish in the first place. After all, you can’t fix a problem until you understand it.
The Models Are Massive and Computationally Intensive
The first reason is obvious: the models are huge, which means there’s a ton of computation to be done.
Even models used for on-device applications can have a billion to 3 billion parameters. Server-grade models scale into the tens and hundreds of billions. Because neural networks rely on a series of multiplications and additions for every single parameter, the amount of computation scales proportionally with the number of parameters. The larger the model, the more calculations are required, and the longer it takes to get a result.
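To put a rough number on it, here's a back-of-the-envelope sketch using the common rule of thumb that a dense decoder spends roughly two floating-point operations per parameter for each token it generates (the exact figure depends on the architecture, so treat this as an estimate, not a measurement):

```python
# Back-of-the-envelope: a dense decoder does roughly 2 floating-point
# operations (one multiply, one add) per parameter for every token it
# generates -- a rule of thumb, not an exact count.
def flops_per_generated_token(num_params):
    return 2 * num_params

for name, params in [("3B on-device model", 3e9), ("70B server model", 70e9)]:
    print(f"{name}: ~{flops_per_generated_token(params):.1e} FLOPs per token")
```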
The catch is that making a model smaller often compromises its quality. It’s a classic trade-off: model size versus performance.
Self-Attention
The second reason is inherent to the Transformer architecture itself. The Transformer's greatest strength is its Self-Attention mechanism, which allows the model to understand the context of the entire input and is key to generating contextually appropriate responses.
The downside? The computations required for Self-Attention are incredibly demanding.
The computational complexity of Self-Attention scales quadratically with the length of the input. For example, if an input has 10 tokens, the number of operations is proportional to 10^2=100. If the input grows to 100 tokens, the operations balloon to 100^2=10,000. The input size increased by a factor of 10, but the computation increased by a factor of 100.
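A tiny sketch makes the scaling obvious; it just counts how many pairwise scores the attention matrix needs:

```python
# The attention matrix holds one score per (token, token) pair, so the work
# grows with the square of the input length.
for num_tokens in (10, 100, 1000):
    print(f"{num_tokens:>5} tokens -> {num_tokens ** 2:>9,} attention scores")
```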
Autoregressive Nature
The third reason stems from the way modern LLMs operate: they are autoregressive. This means the model’s output becomes part of the new input to generate the next output. Let’s look at an example.
- Input: “What is the capital of the United States?”
- Model generates: “The”
- New input: “What is the capital of the United States? The”
- Model generates: “capital”
- New input: “What is the capital of the United States? The capital”
- Model generates: “is”
- New input: “What is the capital of the United States? The capital is”
- Model generates: “Washington,”
- New input: “What is the capital of the United States? The capital is Washington,”
- Model generates: “D.C.”
- New input: “What is the capital of the United States? The capital is Washington, D.C.”
- Model generates: “end_of_sentence”
- Final Output: “The capital is Washington, D.C.”
As you can see, LLMs repeat this process: they append the newly generated token to the input and feed it back into the model to produce the next token.
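Here's a minimal sketch of that loop in Python. `model.predict_next_token`, `tokenize`, and `detokenize` are hypothetical stand-ins for a real tokenizer and LLM; the shape of the loop is the point:

```python
# Minimal sketch of the autoregressive loop described above.
# `model.predict_next_token`, `tokenize`, and `detokenize` are hypothetical
# stand-ins for a real tokenizer + LLM.
def generate(prompt, model, tokenize, detokenize, eos_id, max_new_tokens=64):
    token_ids = tokenize(prompt)
    for _ in range(max_new_tokens):
        next_id = model.predict_next_token(token_ids)  # one full forward pass per new token
        if next_id == eos_id:
            break
        token_ids.append(next_id)  # the output becomes part of the next input
    return detokenize(token_ids)
```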
This autoregressive approach introduces two major problems for inference speed.
First, because the model generates one token at a time, the total time required is directly proportional to the length of the desired output. This is why the time it takes to get a response from an LLM is far more dependent on the output length than the prompt length.
Second, it’s inefficient due to redundant calculations, especially with Self-Attention. An LLM’s fundamental task is to predict the most likely next token based on a given input. The input goes through several stages (Tokenizer, Embedding, Self-Attention, FFN, etc.), with Self-Attention being the most computationally intensive and time-consuming step.
The problem with the autoregressive approach is that it leads to redundant Self-Attention calculations. Let’s revisit our example:
- Input: “What is the capital of the United States?” -> The model calculates Self-Attention for these 9 tokens. An Attention Matrix of size 9×9 is created.
- Model generates: “The”
- New input: “What is the capital of the United States? The” -> The model calculates Self-Attention for these 10 tokens. A 10×10 Attention Matrix is created.
- Model generates: “capital”
Intuitively, it feels like the model is recalculating a lot of what it already computed in the first pass. Of the 10 tokens in the second pass, 9 are duplicates from the first, so most of the 10×10 attention matrix repeats scores that were already present in the 9×9 matrix. Repeatedly performing the same expensive calculations is a serious bottleneck.
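We can put rough numbers on that redundancy with a small sketch. Assuming a 9-token prompt and 5 generated tokens, it compares how many attention scores get computed when the whole prefix is re-processed at every step versus how many scores are genuinely new at each step:

```python
# Rough count of attention scores with and without reuse, assuming a
# 9-token prompt and 5 generated tokens (as in the example above).
prompt_len, new_tokens = 9, 5

# Without reuse: the full (n x n) matrix is recomputed at every step.
recomputed = sum((prompt_len + i) ** 2 for i in range(1, new_tokens + 1))
# Genuinely new scores at each step: one new row and one new column.
truly_new = sum(2 * (prompt_len + i) - 1 for i in range(1, new_tokens + 1))

print(f"scores computed from scratch: {recomputed}")   # 730
print(f"scores that are actually new: {truly_new}")    # 115
```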
To tackle these problems, the community has developed a number of effective solutions, several of which are now widely adopted. Below, we’ll introduce some of the most popular methods.
Like other posts on this blog, I’ll try to keep the explanations conceptual and avoid complex math or architectures.
Grouped Query Attention (GQA)
To understand GQA, we need to first talk about Multi-Headed Attention, which in turn builds on Self-Attention. Let’s break it down one step at a time.
Self-Attention
At its core, Self-Attention is a way to numerically represent the relationships between each token in an input. Consider the sentence: “Welcome to this beautiful planet Earth.” Which tokens are most related to “Earth”? Intuitively, “beautiful” and “planet” seem highly related, while “Welcome” or “this” seem less so. We could assign a score to this:
- (“Earth”, “beautiful”): 0.3
- (“Earth”, “planet”): 0.4
- (“Earth”, “Welcome”): 0.001
- (“Earth”, “this”): 0.0005
The beauty of Self-Attention is that it assigns a score to the relationship between each token pair, helping the model decide how much to “pay attention” to other tokens when interpreting a specific token.
So, how are these scores calculated? First, each token is assigned three values: Q (Query), K (Key), and V (Value). The final attention score is then derived from these values. The main takeaway is that to compute Self-Attention, the model must first generate these Q, K, and V values.
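If you're curious what that looks like concretely, here's a minimal single-head sketch in NumPy (shapes and weights are made up purely for illustration): each token gets a Q, K, and V vector, the pairwise scores come from Q and K, and those scores are used to mix the V vectors.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Minimal single-head self-attention sketch (no masking, no batching)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v             # one Q, K, V vector per token
    scores = q @ k.T / np.sqrt(k.shape[-1])          # pairwise "relatedness" scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row
    return weights @ v                               # weighted mix of the V vectors

num_tokens, d_model = 6, 16
rng = np.random.default_rng(0)
x = rng.normal(size=(num_tokens, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)  # (6, 16)
```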
Multi-Headed Attention (MHA)
The concept of “relatedness” between tokens is a bit ambiguous. In our first example, we gave a low score to the relationship between “Earth” and “this.” We probably did this because “this” is a determiner, not a noun referring to “Earth.”
However, we could look at it from a different perspective. “Earth” and “planet” are both nouns, so maybe their grammatical relationship should get a higher score. Let’s try scoring again:
- (“Earth”, “beautiful”): 0.001
- (“Earth”, “planet”): 0.4
- (“Earth”, “Welcome”): 0.001
- (“Earth”, “this”): 0.3
In this scoring system, we’ve given high scores to tokens that are grammatically related (nouns) and low scores to others. This highlights a limitation: a single Self-Attention score can’t capture all the nuances of token relationships, which could be semantic, grammatical, or even based on word distance.
This is where Multi-Headed Attention (MHA) comes in. The idea is simple: create multiple sets of Self-Attention, or “heads.” Let’s revisit our example with two heads:
Head 1 (Semantic)
- (“Earth”, “beautiful”): 0.3
- (“Earth”, “planet”): 0.4
- (“Earth”, “Welcome”): 0.001
- (“Earth”, “this”): 0.0005
Head 2 (Grammatical)
- (“Earth”, “beautiful”): 0.001
- (“Earth”, “planet”): 0.4
- (“Earth”, “Welcome”): 0.001
- (“Earth”, “this”): 0.3
Now, with multiple heads, the model can capture different kinds of relationships between tokens. However, this comes at a cost. Each head requires its own set of Q, K, and V values, which dramatically increases the computational load.
GQA
With MHA, we improved quality but exacerbated the already demanding computational requirements of Self-Attention.
Someone then had a brilliant idea: “Instead of creating separate K and V values for every single head, why don’t we share them? Let’s keep the multiple Q’s (Queries) but use a single set of K and V values for all heads. We can call it Multi-Query Attention (MQA)!”
MQA successfully reduced the computational cost, but it came with a noticeable drop in model quality. So, another idea emerged: “Sharing K and V across all heads is too aggressive. What if we group a few heads together and share K and V within each group? We’ll call it Grouped-Query Attention (GQA)!”
Fortunately, GQA strikes a much better balance. It significantly reduces computation compared to MHA without the steep quality drop seen with MQA.
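Here's a shape-level NumPy sketch of the idea, with illustrative sizes: 8 query heads sharing just 2 K/V heads. Setting the number of K/V heads equal to the number of query heads gives you plain MHA, and using a single K/V head gives you MQA, so GQA really is the middle ground between the two.

```python
import numpy as np

def grouped_query_attention(q, k, v):
    """Shape-level sketch: q has more heads than k/v, and each group of query
    heads shares one K/V head. num_kv_heads == num_q_heads is plain MHA;
    num_kv_heads == 1 is MQA."""
    num_q_heads, num_kv_heads = q.shape[0], k.shape[0]
    group_size = num_q_heads // num_kv_heads
    k = np.repeat(k, group_size, axis=0)   # reuse each K head for its whole group
    v = np.repeat(v, group_size, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
seq_len, head_dim = 6, 8
q = rng.normal(size=(8, seq_len, head_dim))  # 8 query heads
k = rng.normal(size=(2, seq_len, head_dim))  # only 2 K/V heads -> 4 query heads per group
v = rng.normal(size=(2, seq_len, head_dim))
print(grouped_query_attention(q, k, v).shape)  # (8, 6, 8)
```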
As you can see in the graph, comparing MHA-XXL and GQA-XXL, the inference time drops from 1.5ms to 0.5ms, while the quality score remains nearly identical (around 47). In contrast, MQA-XXL is faster but its quality is noticeably lower.
GQA was first introduced by Google and popularized by its use in LLaMA 2.
Sliding Window Attention (SWA)
Sliding Window Attention (SWA) is another approach to address the high computational cost of Self-Attention. As mentioned, Self-Attention calculates the relationship score between every token and every other token in the input. Let’s go back to our example: “Welcome to this beautiful planet Earth.”
The token “Welcome” calculates a score for its relationship with:
- (“Welcome”, “Welcome”)
- (“Welcome”, “to”)
- (“Welcome”, “this”)
- (“Welcome”, “beautiful”)
- (“Welcome”, “planet”)
- (“Welcome”, “Earth”)
Since there are 6 tokens in this phrase, a total of 6 x 6 = 36 calculations would be needed for a standard (global) attention mechanism.
Someone had another idea: “What if we only calculate attention scores for tokens that are close to each other? The relationship between a token and a distant token is probably low anyway.” This approach seems promising. The idea is to define a “window” size for each token, and then “slide” this window along the input to calculate attention scores only within that limited scope.
For example, let’s set the window size to 3 (the token itself plus one token to the left and one to the right).
- (“Welcome”, “Welcome”)
- (“Welcome”, “to”)
- (“to”, “Welcome”)
- (“to”, “to”)
- (“to”, “this”)
- …
With this method, each token only needs to calculate 3 attention scores, for a total of 6 x 3 = 18 calculations. (The actual count for the first and last token would be one less, making it 16). This is a huge reduction from the 36 calculations of Global Attention.
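A small sketch of the window mask makes the savings concrete; `sliding_window_mask` is just an illustrative helper that marks which token pairs fall inside the window:

```python
import numpy as np

def sliding_window_mask(seq_len, window):
    """True where a pair of tokens is close enough to be scored.
    `window` counts the token itself plus its neighbours on each side."""
    positions = np.arange(seq_len)
    radius = (window - 1) // 2
    return np.abs(positions[:, None] - positions[None, :]) <= radius

mask = sliding_window_mask(seq_len=6, window=3)
print(mask.astype(int))
print("scored pairs:", mask.sum(), "of", mask.size)  # 16 of 36, matching the count above
```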
Like GQA, SWA is widely used to speed up attention calculations while minimizing the impact on quality. SWA was first introduced in a paper on the Longformer model and gained popularity with its adoption by Mistral.
Flash Attention
Flash Attention tackles the Self-Attention problem from a different angle. While GQA, SWA, and KV Cache optimize the model’s architecture or workflow, Flash Attention uses hardware-level optimizations to make Self-Attention faster.
To understand Flash Attention, you first need a basic grasp of how modern GPUs work.
The diagram above shows the hierarchy of GPU memory.
- SRAM (Static RAM): This is the fastest memory, directly accessible by the GPU cores. It’s super fast but has very limited capacity (e.g., 20MB).
- HBM (High-Bandwidth Memory): The main memory for the GPU. It’s slower than SRAM but has a much larger capacity (e.g., 40GB).
- DRAM (Dynamic RAM): The system’s main memory, used by the CPU. It’s much slower than HBM but has the largest capacity (e.g., 1TB or more).
Typically, a GPU computation follows these steps:
1. Data is read from the disk and loaded into CPU DRAM.
2. Data is moved from CPU DRAM to GPU HBM.
3. Data is moved from GPU HBM to the tiny, fast SRAM.
4. The GPU cores perform calculations on the data in SRAM.
5. The results are written back to HBM.
6. Steps 3-5 repeat until the computation is complete.
7. The final result is returned.
The critical observation here is the constant movement of data between the HBM and the SRAM. Every time a calculation is needed, data is loaded from HBM, computed in SRAM, and then saved back to HBM. This data transfer is a major bottleneck.
Flash Attention’s core idea is to minimize this data transfer. How? By reorganizing the Self-Attention computation to perform as many operations as possible directly within the fast SRAM.
The Flash Attention paper goes into the details, but the key takeaway is that the developers analyzed the standard Self-Attention implementation and re-wrote it in CUDA to maximize in-SRAM computation.
As seen in the graph, the traditional PyTorch implementation of Self-Attention involves multiple separate steps (Matmul, Mask, Softmax), each requiring data transfer. Flash Attention fuses these operations into a single “Fused Kernel,” which performs the entire sequence of calculations without saving intermediate results back to HBM.
This approach drastically reduces the data movement bottleneck, leading to significant speed improvements.
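To get a feel for the trick without the CUDA details, here's a NumPy sketch of the "online softmax" idea at its heart: the key/value blocks are processed one chunk at a time while only a running max, a running denominator, and a running weighted sum are kept around, so the full attention matrix is never materialized. (The real Flash Attention is a hand-tuned fused GPU kernel; this is only a conceptual model of it.)

```python
import numpy as np

def attention_reference(q, k, v):
    """Standard attention: materializes the full score matrix."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (weights / weights.sum(axis=-1, keepdims=True)) @ v

def attention_tiled(q, k, v, block_size=2):
    """Processes K/V in small blocks, keeping only a running max, a running
    softmax denominator, and a running weighted sum per query row -- the full
    score matrix is never stored, mimicking what Flash Attention keeps in SRAM."""
    scale = 1.0 / np.sqrt(q.shape[-1])
    row_max = np.full(q.shape[0], -np.inf)
    denom = np.zeros(q.shape[0])
    acc = np.zeros_like(q)
    for start in range(0, k.shape[0], block_size):
        k_blk, v_blk = k[start:start + block_size], v[start:start + block_size]
        scores = q @ k_blk.T * scale
        new_max = np.maximum(row_max, scores.max(axis=-1))
        correction = np.exp(row_max - new_max)          # rescale what we had so far
        probs = np.exp(scores - new_max[:, None])
        denom = denom * correction + probs.sum(axis=-1)
        acc = acc * correction[:, None] + probs @ v_blk
        row_max = new_max
    return acc / denom[:, None]

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(6, 8)) for _ in range(3))
print(np.allclose(attention_tiled(q, k, v), attention_reference(q, k, v)))  # True
```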
A practical challenge with Flash Attention is that implementing these fused kernels requires low-level programming (like CUDA) and is highly dependent on the specific GPU architecture. This is why early versions of Flash Attention didn’t support older GPUs like the V100.
However, modern libraries like Hugging Face have integrated Flash Attention, making it much easier for developers to use.
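For example, recent versions of the Transformers library let you opt in when loading a model. This sketch assumes you have the flash-attn package installed and a supported GPU; the model name is just an example:

```python
# Enabling Flash Attention through Hugging Face Transformers (requires the
# `flash-attn` package and a supported GPU); the model name is just an example.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,               # Flash Attention kernels expect fp16/bf16
    attn_implementation="flash_attention_2",
    device_map="auto",
)

inputs = tokenizer("Welcome to this beautiful planet", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=10)[0]))
```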
Speculative Decoding
While the previous methods focused on speeding up the Self-Attention calculation itself, Speculative Decoding takes a completely different approach.
The core idea is this:
- Instead of using your large, slow model for every step, use a smaller, faster “draft” model.
- Have this small model generate a sequence of tokens. Because it’s small, it’s fast.
- Then, have the large, slow model quickly check if each of the small model’s generated tokens is correct.
- If all tokens are correct, you use the small model’s output.
- If a token is incorrect, you discard it and the rest of the draft, keep the large model’s own prediction for that position (it already produced one during the check), and continue drafting from there.
In essence, the small model handles the heavy lifting of generation, while the large model acts as a fast checker (and a fallback generator only when needed).
Why is this so effective?
First, it operates on the assumption that some tokens are easier to predict than others. For example, after the phrase “The sky is,” the next word is almost certainly “blue.” This is an easy prediction, and a small model can handle it with high accuracy. Other phrases, however, might require the full power of a massive model. The beauty of this approach is that it uses the fast, small model for easy predictions and only engages the slow, large model when the task is difficult.
Second, autoregressive generation is slow, but text validation is fast. Let’s say we’re generating the sentence “The capital of the United States is Washington, D.C.”
- Using only the large model, it would need to run once for each new token. If it generates five tokens at 20ms per run, that’s 100ms in total.
- With Speculative Decoding, the small, fast model drafts all five tokens first (say, 20ms in total).
- The large model then checks the entire draft in a single forward pass, which is also fast (e.g., 20ms).
- The total comes to 20ms (drafting) + 20ms (verification) = 40ms, a significant speedup.
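Here's a schematic sketch of the loop. The `draft_model` and `target_model` interfaces are hypothetical, and for simplicity it accepts a drafted token only when it exactly matches the large model's greedy choice, whereas the published algorithm uses a probabilistic acceptance rule:

```python
def speculative_generate(prompt_ids, draft_model, target_model, eos_id,
                         draft_len=5, max_new_tokens=64):
    """Schematic greedy variant of speculative decoding. `draft_model` and
    `target_model` are hypothetical interfaces: the draft model proposes
    `draft_len` tokens cheaply, the target model scores the whole proposal in
    one forward pass, and everything up to the first mismatch is kept."""
    ids = list(prompt_ids)
    generated = 0
    while generated < max_new_tokens and ids[-1] != eos_id:
        draft = draft_model.greedy_generate(ids, num_tokens=draft_len)   # cheap drafting
        # One pass of the big model yields its own greedy choice at every draft position.
        verified = target_model.greedy_next_tokens(ids, draft)
        for proposed, checked in zip(draft, verified):
            ids.append(checked)              # the big model's token is always safe to keep
            generated += 1
            if checked != proposed or checked == eos_id or generated >= max_new_tokens:
                break                        # mismatch: discard the rest of the draft
    return ids
```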
For an even better visual explanation of Speculative Decoding, Google has also published a great video walkthrough of the technique.
The success of Speculative Decoding hinges on the quality of the small “draft” model. If it generates incorrect tokens frequently, you’ll constantly be switching to the slow, large model, defeating the purpose.
A more recent development is Self-Speculative Decoding, which uses only a single large model. Instead of a separate draft model, it uses a few layers of the large model as the “draft” and the full model as the “verifier.”
LLM training might grab the headlines, but inference is where the real battle is fought day after day. Every millisecond shaved off response time, every optimization that reduces server load, and every clever trick to reuse past computations directly translates into lower costs and a better user experience.
From architectural innovations like GQA and SWA, to hardware-aware methods like Flash Attention, and clever workflows like Speculative Decoding, the industry is racing to make inference faster, cheaper, and more scalable.
The takeaway is simple: if you want your LLM service to survive in production, don’t just focus on training bigger models. Pay equal—if not greater—attention to how you serve them. After all, training might be a one-time marathon, but inference is a never-ending sprint.