A New Axis of Sparsity for Large Language Models
Introduction: The Memory Problem in AI
Imagine having to solve a complex math problem while simultaneously trying to remember basic facts like "2+2=4" or "Paris is the capital of France." That's essentially what current Large Language Models (LLMs) do every single time they process text. They waste precious computational power "recomputing" simple facts they've seen thousands of times before.
DeepSeek AI and Peking University have just published a groundbreaking paper titled "Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models" that introduces Engram – a revolutionary approach that gives AI models a dedicated "memory bank" for instant recall, freeing up their computational power for actual reasoning.
What is Engram? The TL;DR
Think of it like this:
- Before Engram: Your AI is like a student who has to re-derive every formula from scratch during an exam, even for basic equations they've seen a million times.
- With Engram: Your AI is like a student with a well-organized formula sheet – they can instantly reference known facts and spend their mental energy on solving the actual problem.

Why Do We Need This?
The Inefficiency of Traditional Transformers
Current LLMs based on the Transformer architecture have a fundamental limitation: they lack a native way to efficiently store and retrieve static knowledge.
When a model encounters common phrases like:
- "The capital of France is..."
- "E = mc²"
- "Once upon a time..."
It has to process them through dozens of transformer layers, consuming:
- ✗ Valuable GPU compute power (FLOPs)
- ✗ Expensive high-bandwidth memory (HBM)
- ✗ Attention capacity that could be used for understanding context
The Sparsity Revolution
The AI community has been exploring "sparsity" – the idea that you don't need to activate the entire model for every task. The most famous example is Mixture-of-Experts (MoE), which routes different inputs to specialized "expert" sub-networks.
MoE gave us conditional computation (only use the parts of the brain you need).
Engram introduces conditional memory (only store and retrieve what you need, when you need it).
Together, they form a powerful duo: smart computation + smart memory.
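To make the contrast concrete, here is a toy top-2 MoE router in PyTorch: each token is sent to only a few experts, so most parameters stay idle. This is a generic illustration of conditional computation, not the routing used in the paper; all names and sizes are made up.

```python
# Toy top-2 Mixture-of-Experts layer (illustrative sizes, not the paper's).
# Each token is routed to only a few experts, so most parameters stay idle:
# this is "conditional computation".
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    def __init__(self, dim: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Score all experts, but run only the top-k per token.
        weights, idx = self.router(x).softmax(-1).topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(x[mask])
        return out

moe = TinyMoE(dim=32)
print(moe(torch.randn(5, 32)).shape)  # torch.Size([5, 32])
```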
How Does Engram Work?
1. The Memory Bank: Modernized N-gram Embeddings
Engram creates a searchable database of word sequences (N-grams) – think of it as a massive lookup table for common patterns.
• Input: "The capital of France"
• Engram lookup: → "is Paris" (retrieved in O(1) constant time)
• No need to "compute" this through 40+ transformer layers!
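To make the lookup idea concrete, here is a minimal sketch in Python. The dictionary, keys, and dimensions are illustrative stand-ins, not the paper's actual data structures; the point is simply that recall is a hash-table access, not a forward pass.

```python
# Minimal sketch of an N-gram memory lookup (toy data, not the paper's tables).
import numpy as np

EMBED_DIM = 64  # illustrative embedding width

# Pretend these vectors were learned during training.
ngram_table = {
    ("the", "capital", "of", "france"): np.random.randn(EMBED_DIM),
    ("once", "upon", "a", "time"): np.random.randn(EMBED_DIM),
}

def engram_lookup(tokens: list[str], n: int = 4):
    """Return the stored embedding for the trailing n-gram, or None.

    An average-case O(1) dictionary access -- no transformer layers involved.
    """
    return ngram_table.get(tuple(t.lower() for t in tokens[-n:]))

memory = engram_lookup(["the", "capital", "of", "france"])
print(memory is not None)  # True: recalled without a single matrix multiply
```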

2. Three Key Innovations
A. Tokenizer Compression: Reducing Redundancy
Before storing information, Engram "normalizes" tokens:
- "Apple", "apple", "APPLE" → all treated as the same concept
- This reduces the effective vocabulary by ~23%
- Less storage, faster lookups
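A minimal sketch of what such normalization could look like, using a hypothetical `canonicalize` helper (the paper's exact compression rules aren't reproduced here):

```python
# Illustrative token canonicalization; the real compression rules differ,
# but the effect is the same: surface variants share one memory slot.
def canonicalize(token: str) -> str:
    # Drop a leading BPE word-boundary marker and fold case, so
    # "Apple", "ĠApple", and "APPLE" all map to the same key.
    return token.lstrip("Ġ▁ ").lower()

vocab = ["Apple", "apple", "APPLE", "ĠParis", "paris"]
print({canonicalize(t) for t in vocab})  # {'apple', 'paris'} -- fewer keys to store
```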
B. Multi-Head Hashing: Collision-Free Storage
Storing every possible word combination would require infinite memory. Engram uses multiple hash functions (like having several phone books):
- Different patterns may "collide" (map to the same slot) under one hash function, but rarely under all of them
- Combining the lookups from every head keeps distinct patterns distinguishable (see the sketch after this list)
- Enables massive memory tables stored efficiently in system RAM (DRAM) instead of expensive GPU memory
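Here is a toy multi-head hashing sketch with made-up table sizes; Python's built-in `hash` stands in for a proper stable hash, which a real system would need so slots never change between runs.

```python
# Toy multi-head hashing (made-up sizes; Python's hash() stands in for a
# stable, fixed-seed hash function).
import numpy as np

NUM_HEADS, TABLE_SIZE, EMBED_DIM = 4, 1 << 16, 32
rng = np.random.default_rng(0)
# One embedding table per head; in practice these can live in host DRAM.
tables = rng.standard_normal((NUM_HEADS, TABLE_SIZE, EMBED_DIM), dtype=np.float32)

def multi_head_embed(ngram: tuple[str, ...]) -> np.ndarray:
    parts = []
    for head in range(NUM_HEADS):
        # Salt the hash with the head index so every head maps differently:
        # two n-grams may collide in one head, but rarely in all of them.
        slot = hash((head,) + ngram) % TABLE_SIZE
        parts.append(tables[head, slot])
    return np.concatenate(parts)  # one n-gram vector built from 4 cheap lookups

print(multi_head_embed(("capital", "of", "france")).shape)  # (128,)
```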
C. Context-Aware Gating: Smart Retrieval
Not all retrieved memories are relevant. Engram includes a gating mechanism:
- Evaluates: "Does this retrieved pattern fit the current context?"
- If yes → use it
- If no → ignore it
- This prevents noise and ensures only helpful information is used
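A toy version of such a gate, assuming a single learned matrix `w_gate` (the paper's gating network is more elaborate; this only illustrates the scale-by-relevance idea):

```python
# Toy context-aware gate: keep the retrieved memory only to the extent it
# agrees with the current hidden state (w_gate would be learned in practice).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_memory(hidden: np.ndarray, memory: np.ndarray, w_gate: np.ndarray) -> np.ndarray:
    gate = sigmoid(hidden @ w_gate @ memory)  # scalar in (0, 1)
    return gate * memory                      # irrelevant retrievals shrink toward zero

dim = 16
rng = np.random.default_rng(1)
hidden, memory = rng.standard_normal(dim), rng.standard_normal(dim)
w_gate = 0.1 * rng.standard_normal((dim, dim))
print(gated_memory(hidden, memory, w_gate)[:3])
```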
3. Integration with the Transformer
Engram doesn't replace the Transformer – it augments it:
- Injected into the model's early layers
- Offloads static pattern reconstruction so the deeper layers don't waste compute on it
- Retrieved memory is lightly processed and added back into the model's state
- The rest of the network focuses on complex reasoning
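The sketch below shows one plausible way to wire a memory branch into an early transformer block in PyTorch: the retrieved embedding is lightly projected and added to the residual stream before attention and the MLP run. Module names and shapes are illustrative, not the paper's architecture.

```python
# One plausible wiring of a memory branch into an early transformer block
# (module names and shapes are illustrative, not the paper's architecture).
import torch
import torch.nn as nn

class MemoryAugmentedBlock(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.mem_proj = nn.Linear(dim, dim)  # "lightly processed" retrieved memory
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        # Inject the looked-up memory into the residual stream first, so
        # attention and the MLP start from a state that already "recalls" it.
        x = x + self.mem_proj(memory)
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]
        return x + self.mlp(self.norm2(x))

block = MemoryAugmentedBlock(dim=64)
tokens = torch.randn(2, 10, 64)     # (batch, sequence, dim)
memory = torch.randn(2, 10, 64)     # per-token retrieved embeddings
print(block(tokens, memory).shape)  # torch.Size([2, 10, 64])
```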
The U-Shaped Scaling Law
The researchers discovered something fascinating: the optimal balance between MoE (computation) and Engram (memory) follows a U-shaped curve.
Allocating roughly 20-25% of the sparse parameter budget to Engram and the rest to MoE gives the best performance:
- Too little memory → model wastes compute on static patterns
- Too much memory → not enough compute for reasoning
- Just right → best of both worlds
This is called the Sparsity Allocation Problem, and the U-shaped curve gives a practical recipe for solving it.
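As a back-of-the-envelope illustration of that guideline (the budget below is made up, not the paper's configuration):

```python
# Split a (made-up) sparse parameter budget per the ~20-25% guideline.
sparse_budget = 20e9        # total sparse parameters to allocate
engram_fraction = 0.25      # upper end of the reported sweet spot
engram_params = engram_fraction * sparse_budget
moe_params = sparse_budget - engram_params
print(f"Engram: {engram_params / 1e9:.0f}B params, MoE: {moe_params / 1e9:.0f}B params")
```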

Performance: The Numbers Don't Lie
The researchers scaled Engram to 27 billion parameters and compared it against pure MoE models with the same parameter count and computational budget (FLOPs).

| Benchmark | Improvement over the MoE baseline |
|---|---|
| MMLU (knowledge) | +3.4 points |
| CMMLU (Chinese knowledge) | +4.0 points |
| BBH (reasoning) | +5.0 points |
| ARC-Challenge (reasoning) | +3.7 points |
| HumanEval (coding) | +3.0 points |
| MATH (mathematics) | +2.4 points |
| Multi-Query NIAH (long-context) | 84.2 → 97.0 |
Why Does Engram Boost Reasoning?
The researchers conducted "mechanistic analyses" to understand why Engram improves reasoning:
1. Deepening the Network
- By handling static patterns early, Engram frees up the model's later layers
- These layers can now focus entirely on complex, multi-step reasoning
- Effect: The network becomes "effectively deeper" for reasoning tasks
2. Freeing Attention for Global Context
- Attention mechanisms no longer waste capacity on local, repetitive patterns
- They can focus on understanding global context and long-range dependencies
- Effect: Massive boost in long-context retrieval (84.2 → 97.0 on NIAH benchmark)
Infrastructure Efficiency: The Hidden Superpower
Deterministic Addressing = Runtime Prefetching
- Engram's lookups are deterministic: which table slots are needed can be computed from the token IDs alone
- The system can prefetch those rows from host memory (DRAM) before they're needed
- The prefetch runs in the background with negligible overhead
The practical payoff:
- Massive memory tables can live in cheap system RAM
- You're no longer limited by expensive GPU memory (HBM)
- Compute power is decoupled from memory capacity
- Memory can be scaled up without buying more GPUs
Real-World Impact: This makes Engram incredibly cost-effective for deployment at scale.
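Below is a toy sketch of the prefetching idea: because the next lookup's slot depends only on the token stream, a background thread can fetch it from a host-memory table while the current step computes. The table sizes, helper names, and threading setup are illustrative, not the paper's system design.

```python
# Toy prefetching sketch: the next lookup's slot depends only on the token
# stream, so a background thread can fetch it from a DRAM-resident table
# while the current step computes. Sizes and helpers are illustrative.
import numpy as np
from concurrent.futures import ThreadPoolExecutor

TABLE_SIZE, EMBED_DIM = 1 << 18, 64
host_table = np.zeros((TABLE_SIZE, EMBED_DIM), dtype=np.float32)  # lives in DRAM

def slot_for(ngram):                 # deterministic: needs no model state
    return hash(ngram) % TABLE_SIZE

def fetch_rows(slots):               # stand-in for a host-to-GPU copy
    return host_table[slots]

executor = ThreadPoolExecutor(max_workers=1)

def run_step(current_rows, next_ngram):
    # Start copying the rows for the *next* step in the background...
    future = executor.submit(fetch_rows, [slot_for(next_ngram)])
    # ...while the transformer forward pass consumes `current_rows` here.
    _ = current_rows.sum()
    return future.result()           # ready (or nearly ready) when needed

rows = fetch_rows([slot_for(("the", "capital"))])
rows = run_step(rows, ("capital", "of"))
print(rows.shape)  # (1, 64)
```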
The Future: Conditional Memory as a Modeling Primitive
The researchers envision conditional memory as an indispensable component for next-generation AI models.
Just as:
- Attention mechanisms became standard in 2017 (Transformers)
- Mixture-of-Experts became the standard recipe for scaling (reportedly used in GPT-4 and Gemini)
Engram-style conditional memory could become the third pillar of modern LLM architecture.
Potential Applications:
- Retrieval-Augmented Generation (RAG): Built-in, no external database needed
- Domain-Specific Models: Pre-load specialized knowledge (medical, legal, scientific)
- Personalized AI: Store user-specific patterns and preferences
- Multimodal Models: Efficient storage of visual patterns, audio signatures, etc.
Code and Reproducibility
The best part? The code is open-source!
🔗 GitHub Repository:
github.com/deepseek-ai/Engram
The research community can now:
- Experiment with Engram
- Integrate it into existing models
- Extend the concept to new domains
Key Takeaways
- What: Engram is a conditional memory module that enables O(1) knowledge lookup in LLMs
- Why: Current Transformers waste compute recomputing static patterns; Engram separates memory from reasoning
- How: Uses modernized N-grams with tokenizer compression, multi-head hashing, and context-aware gating
- Impact: Significant gains in knowledge (+3-4 points), reasoning (+5 points), and long-context handling (84→97)
- Efficiency: Deterministic addressing enables prefetching from cheap DRAM, not expensive GPU memory
- Future: Conditional memory is poised to become a standard primitive in next-gen AI architectures
Conclusion: A New Era of Efficient AI
The Engram paper represents a fundamental shift in how we think about LLM architecture. By introducing conditional memory as a complement to conditional computation, DeepSeek has opened a new dimension for scaling AI models efficiently.
As we move toward increasingly capable AI systems, innovations like Engram will be crucial for making them not just more powerful, but also more efficient, cost-effective, and accessible.
The age of conditional memory has begun.
References
- Paper: "Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models"
- arXiv: arxiv.org/abs/2601.07372
- Authors: DeepSeek AI & Peking University
- Published: January 12, 2026
- Code: github.com/deepseek-ai/Engram