The Latents

A tiny AI that beats OpenAI on reasoning puzzles

An interactive and illustrated explanation of Hierarchical Reasoning Models

In transformer models, the input embedding moves through a series of blocks that transform it into the output embedding.

The input embedding comes from your input text tokens, and the output embeddings can be multiplied by a matrix to get probabilities for the next token.
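To make that pipeline concrete, here is a minimal PyTorch sketch of the flow. It is only an illustration, not the paper's code: the class name, the toy sizes, and the use of nn.TransformerEncoderLayer as "the block" are assumptions chosen for brevity.

```python
import torch
import torch.nn as nn

class StackedTransformer(nn.Module):
    """Input embeddings pass through N distinct blocks, then an output matrix."""
    def __init__(self, vocab_size=1000, d_model=128, n_heads=4, n_blocks=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # N separate blocks, each with its own parameters.
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(n_blocks)
        )
        self.unembed = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):                  # tokens: (batch, seq_len) of token ids
        x = self.embed(tokens)                  # input embeddings
        for block in self.blocks:               # each block transforms the embeddings
            x = block(x)
        return self.unembed(x).softmax(dim=-1)  # probabilities for the next token
```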

I'm pretty sure you've seen countless explanations of how transformers work, so we won't go into more detail here. Let's instead focus on how the transformer architecture can evolve.

Transformers have a drawback. They work well when you stack up a lot of blocks, but a large number of blocks adds a lot of parameters to the model.

One way to solve this is by using a Looped Transformer.

In a looped transformer, we just have one block. The input embeddings go through this block multiple times in a loop.

Let's go through the steps of the loop one by one.


This significantly reduces the number of parameters, since we reuse the same block to transform the embeddings.

Plus, we can spend more computation on an input by simply increasing the number of loops.
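Here is a minimal sketch of this variant, under the same illustrative assumptions as before: a single block whose weights are reused, applied in a loop.

```python
import torch
import torch.nn as nn

class LoopedTransformer(nn.Module):
    """One shared block applied n_loops times instead of N distinct blocks."""
    def __init__(self, vocab_size=1000, d_model=128, n_heads=4, n_loops=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # A single block: the parameter count no longer grows with depth.
        self.block = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.unembed = nn.Linear(d_model, vocab_size)
        self.n_loops = n_loops

    def forward(self, tokens, n_loops=None):
        steps = self.n_loops if n_loops is None else n_loops
        x = self.embed(tokens)
        # More compute on a hard input = more loop iterations, zero new parameters.
        for _ in range(steps):
            x = self.block(x)
        return self.unembed(x).softmax(dim=-1)
```

With the toy sizes above, the looped model keeps roughly one block's worth of transformer weights (plus the embedding matrices) no matter how many loops it runs, while an 8-block stack pays for eight blocks.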

You might be wondering whether using fewer parameters gives performance that is better than, or at least on par with, regular transformers.

Let's look at an experiment.

[Figure: an example Sudoku puzzle and its given clues]

Sudoku puzzles are a tough task for transformer models. This includes both transformer models directly trained on sudoku puzzles and popular large language models.

The authors of the paper Hierarchical Reasoning Models introduced "Sudoku Extreme": a dataset of challenging sudoku puzzles.

The sudoku puzzles are challenging enough that o3-mini, Claude 3.7 and DeepSeek R1 are able to solve exactly 0% of them!

What about transformer models that are directly trained on this task? The dataset provides 1000 training examples. The authors trained transformers with different numbers of layers on these examples and evaluated their performance.

[Chart: Sudoku Extreme accuracy (%) vs. number of layers, from 8 to 512, for regular transformers]

Okay, so the transformer models are able to learn this task to some extent. Models with more layers/blocks perform better.

What about looped transformers? What if we replace, say, the 64-layer transformer with a looped transformer that passes the embedding through its single block 64 times?

[Chart: Sudoku Extreme accuracy (%) vs. number of layers/loops, looped transformer vs. regular transformer]

The looped transformer performs much better. But the authors improve it further with a modification; the modified architecture is called the Hierarchical Reasoning Model (HRM).

Let's go through each step within this model. Instead of a single block, an HRM loops the embeddings through two blocks that update at different rates.


The intuition behind having these two blocks comes from biology. As the paper states: "The brain processes information across a hierarchy of cortical areas. Higher-level areas integrate information over longer timescales and form abstract representations, while lower-level areas handle more immediate, detailed sensory and motor processing."

The two blocks mirror this hierarchy: one block (block 1) processes information at a higher frequency, similar to the lower-level areas in our brain, while the other (block 2) updates at a lower frequency, similar to the higher-level areas.
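Here is a minimal sketch of that two-timescale loop. It illustrates the idea rather than the paper's exact update rule: combining the states by simple addition, reusing nn.TransformerEncoderLayer for both blocks, and updating the slow block once every `period` steps are all assumptions made for brevity.

```python
import torch
import torch.nn as nn

class TwoTimescaleLoop(nn.Module):
    """A fast block (block 1) runs every step; a slow block (block 2)
    updates only once every `period` steps."""
    def __init__(self, d_model=128, n_heads=4, n_loops=64, period=8):
        super().__init__()
        self.low_block = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.high_block = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.n_loops, self.period = n_loops, period

    def forward(self, x_input):                  # x_input: (batch, seq, d_model) embeddings
        z_low = torch.zeros_like(x_input)        # fast, low-level state
        z_high = torch.zeros_like(x_input)       # slow, high-level state
        for step in range(self.n_loops):
            # Block 1 updates every iteration, conditioned on the input
            # embeddings and the current slow state.
            z_low = self.low_block(z_low + z_high + x_input)
            # Block 2 integrates the fast state only once per `period` steps.
            if (step + 1) % self.period == 0:
                z_high = self.high_block(z_high + z_low)
        return z_high
```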

Does this improve performance on the sudoku puzzles compared to the other models?

[Chart: Sudoku Extreme accuracy (%) vs. number of layers/loops, HRM vs. looped and regular transformers]

The performance increase is substantial. The authors also attempted to use this architecture on the ARC challenge.

The ARC challenge is a set of puzzles designed to be really easy for humans but tough for LLMs and for every other AI system available right now. Each puzzle is unique and provides only 2-4 examples of input-output pairs, which makes it impractical to train most machine learning models.

In the next post, we will show how the authors applied this architecture to the ARC challenge and managed to outperform some of the latest LLMs like DeepSeek R1, Claude 3.7, and o3-mini-high. We will also go over ablation studies conducted by the ARC team comparing the contribution of this architecture with that of the other parts of the process used to train this model on the ARC tasks.