Theoretical Aspects of Transformers

Precision, Depth, Encoders vs. Decoders, & CoT

A Complexity Theory Perspective

Topic 0: Background

Formalizing the Transformer

1. The Layer Function

Each layer maps a matrix to a matrix:

$$ \Phi_i :\mathbb{R}^{n\times d} \to \mathbb{R}^{n\times d} $$

Model Composition:

$$ \Phi = \Phi_L \circ \Phi_{L-1} \circ \cdots \circ \Phi_1 $$
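A minimal NumPy sketch of this composition view; `make_layer` and the ReLU map inside it are placeholders for illustration, not actual Transformer layers:

```python
import numpy as np

# Minimal sketch of the composition view (placeholder layers, not a real
# Transformer): each layer maps an (n x d) matrix of token vectors to
# another (n x d) matrix, and the model is the composition of L such maps.

def make_layer(d, seed):
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(d, d)) / np.sqrt(d)
    def phi(X):                      # Phi_i : R^{n x d} -> R^{n x d}
        return np.maximum(X @ W, 0)  # placeholder position-wise map
    return phi

def model(X, layers):
    for phi in layers:               # Phi = Phi_L o ... o Phi_1
        X = phi(X)
    return X

n, d, L = 6, 8, 3
layers = [make_layer(d, seed=i) for i in range(L)]
X = np.random.default_rng(0).normal(size=(n, d))
print(model(X, layers).shape)        # (6, 8): shape preserved layer by layer
```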

2. Tokens vs. Vectors

graph LR I["Tokens {0,1}"] -->|Embedding| E["Vectors R^d"] E --> P["Phi: Layers"] P -->|Unembedding| O["Tokens {0,1}"]

3. Two Types of Layers

Local Layers (MLP)


Applies the same $\mathsf{TC}^0$ circuit to every vector independently.

Global Layers (Attention)


Transfers information between vectors (Routing).
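As a sanity check of the locality claim, here is a minimal NumPy sketch with random placeholder weights (`W1`, `W2` are arbitrary): perturbing one token changes only that token's row under the MLP layer, in contrast to the global mixing done by attention (sketched in 4a below).

```python
import numpy as np

# Locality check with random placeholder weights: perturbing token 0 can only
# change row 0 of an MLP layer's output, because the same map is applied to
# each row independently. (Attention, sketched in 4a, has no such restriction.)
rng = np.random.default_rng(0)
n, d = 5, 4
X = rng.normal(size=(n, d))
X_perturbed = X.copy()
X_perturbed[0] += 1.0                         # change a single token

W1, W2 = rng.normal(size=(d, d)), rng.normal(size=(d, d))
mlp = lambda Z: np.maximum(Z @ W1, 0) @ W2    # applied row-wise

rows_changed = np.any(mlp(X) != mlp(X_perturbed), axis=1)
print(rows_changed)                           # only row 0 can differ
```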

4a. Attention (Q-K-V associative memory)

Query ($q$)

What I want to know

Key ($k$)

The key for an item

Value ($v$)

The content of an item

$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d}}\right)V $$
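A minimal NumPy sketch of this formula (single head, no masking, no learned projections; `softmax` and `attention` are just illustrative helper names):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d)) V
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)        # query-key match (n x n logits)
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # every output is a mix of values

rng = np.random.default_rng(0)
n, d = 4, 8
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
print(attention(Q, K, V).shape)          # (4, 8)
```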

4b. Interactive QKV Mechanism

Interactive demo: vary the angle between query and key and follow the pipeline: dot-product match → softmax weights → output (mixed value).

$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d}}\right)V $$

4c. Matrix Attention Demo

Interactive demo: edit the raw logits matrix and watch the attention weights update (row-wise softmax of the logits).
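As a static stand-in for the demo, here is a small hand-picked logits matrix and the attention weights obtained from it by row-wise softmax (the numbers are arbitrary):

```python
import numpy as np

# Raw logits (the "editable" matrix in the demo) -> row-wise softmax
# -> attention weights. The numbers are arbitrary.
logits = np.array([[2.0, 0.0, 0.0],    # token 1 mostly attends to token 1
                   [0.0, 0.0, 0.0],    # token 2 attends uniformly
                   [0.0, 1.0, 1.0]])   # token 3 splits between tokens 2 and 3
weights = np.exp(logits - logits.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)
print(weights.round(2))                # each row is nonnegative and sums to 1
```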

5. Uniformity

The Uniform Algorithm View:
By fixing parameters, the Transformer becomes a uniform algorithm working on inputs of any length.
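A minimal sketch of this point, assuming a single attention layer with arbitrary fixed weights `Wq`, `Wk`, `Wv`: the very same parameters define a map for every sequence length $n$.

```python
import numpy as np

# Uniformity, informally: once the parameters (Wq, Wk, Wv below) are fixed,
# nothing in the layer depends on the sequence length n, so the same weights
# define one algorithm that runs on inputs of every length.
rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))   # fixed parameters

def attention_layer(X):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    S = Q @ K.T / np.sqrt(d)
    W = np.exp(S - S.max(axis=-1, keepdims=True))
    return (W / W.sum(axis=-1, keepdims=True)) @ V

for n in (4, 32, 1024):
    print(attention_layer(rng.normal(size=(n, d))).shape)  # (n, 8) for every n
```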

Part 1: Complexity Upper Bounds

The Complexity Hierarchy

graph LR A["Constant Bit"] -->|is in| B("AC0") C["Log Bit"] -->|is in| D("TC0") D -.->|cannot solve| E["NC1 Complete"] B -.->|cannot solve| F["Parity"]

Visualizing $\mathsf{NC}^1$: Branching Programs

Any $\mathsf{NC}^1$ computation can be visualized as a width-5 Branching Program over $S_5$.

Interactive demo: choose the permutation at each step (e.g., "Id", "(1 2)") and follow the colored lines to track the state evolution.

Barrington's Theorem (1989)

The Statement

$\mathsf{NC}^1$ is exactly the class computed by polynomial-length width-5 branching programs over $S_5$.

$$ \mathsf{NC}^1 = \mathsf{BWBP}_5 $$

Implication

Computing the iterated product of $S_5$ elements (the $S_5$ word problem) is complete for $\mathsf{NC}^1$; see the sketch below.

Limit

Standard (constant-depth) Transformers cannot solve this (unless $\mathsf{TC}^0 = \mathsf{NC}^1$).
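To make the complete problem concrete, here is a short sketch of the iterated $S_5$ product (the word problem); the permutations in `word` are arbitrary illustrative choices.

```python
# The NC^1-complete task behind Barrington's theorem, concretely: given a word
# g_1, ..., g_m of permutations in S_5, compute the product g_1 ∘ ... ∘ g_m.
# Permutations are tuples p with p[i] = image of i; the word below is arbitrary.

def compose(p, q):
    # (p ∘ q)(i) = p(q(i))
    return tuple(p[q[i]] for i in range(5))

identity = (0, 1, 2, 3, 4)
word = [(1, 2, 3, 4, 0), (1, 0, 2, 3, 4), (0, 1, 4, 2, 3)]

product = identity
for g in word:                 # sequential, left-to-right evaluation
    product = compose(product, g)
print(product)

# Composition is associative, so the same product can also be computed by a
# balanced tree of pairwise compositions of depth O(log m); that is exactly the
# kind of computation a log-depth model can organize.
```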

Part 2: Capabilities (No CoT)

Think-dot-by-dot & Log-Depth

Hidden Computation: "Think Dot by Dot"

Can meaningless tokens provide computational power?

```mermaid
graph LR
  I[Input] --> F1[...] --> F2[...] --> F3[...] --> O[Answer]
  style F1 fill:#444,stroke:#666,color:#aaa
  style F2 fill:#444,stroke:#666,color:#aaa
  style F3 fill:#444,stroke:#666,color:#aaa
  style O fill:#50fa7b,color:#000
```

The Finding

Transformers can use filler tokens ("...") to solve hard tasks (e.g., 3SUM) they can't solve directly.

Implication

Part of the benefit of CoT is simply extra "depth" (time to think), not semantic reasoning per se.

[Pfau-Merrill-Bowman’24] "Let's Think Dot by Dot"

The Power of Depth: $\Theta(\log n)$

If we increase depth to $O(\log n)$, we unlock $\mathsf{NC}^1$.

  • Reachability: Solved via matrix doubling ($A^2, A^4, \dots$); see the sketch below.
  • Regex: Solved via parallel composition of DFA state maps.

```mermaid
graph TD
  S((s)) --> A
  A --> B
  B --> T((t))
  S -.->|Layer 1| A
  S -.->|Layer 2| B
  S -.->|Layer 3| T
```

[Merrill-Sabharwal’25]
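A minimal NumPy sketch of the matrix-doubling idea (plain Python, not a Transformer construction): repeatedly squaring the boolean adjacency matrix decides $s$-$t$ reachability after roughly $\log_2 n$ squarings.

```python
import numpy as np

# Reachability by matrix doubling: with self-loops added, after k squarings the
# boolean matrix records all paths of length up to 2^k, so about log2(n)
# squarings decide s-t reachability.

def reachable(A):
    n = A.shape[0]
    R = (A | np.eye(n, dtype=bool)).astype(int)   # allow staying in place
    for _ in range(max(1, int(np.ceil(np.log2(n))))):
        R = ((R @ R) > 0).astype(int)             # boolean "squaring"
    return R.astype(bool)                         # R[s, t]: is t reachable from s?

A = np.zeros((4, 4), dtype=bool)
A[0, 1] = A[1, 2] = A[2, 3] = True                # path 0 -> 1 -> 2 -> 3
print(reachable(A)[0, 3])                         # True
```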

Part 3: Encoder vs Decoder

Communication Protocols & Trade-offs

Encoders = Massively Parallel Computation

Connecting Transformers to the Massively Parallel Computation (MPC) model of distributed computing.

| Transformer | MPC Model |
| --- | --- |
| Token $x_i$ | Machine $M_i$ |
| Layer | Communication Round |
| Attention | Message Passing |

Theorem: Constant-layer Transformers can simulate, and be simulated by, constant-round MPC.
$\implies$ Log-depth Transformers can solve Graph Connectivity, which is conjectured to be hard for constant-round MPC (and hence for constant-depth Transformers).

[Sanford-Hsu-Telgarsky’24] "Transformers, parallel computation, and logarithmic depth"

The Visual Difference

Encoder

Full Visibility (MPC Broadcast)

Decoder

Causal Masking (Restricted)

The Exponential Separation

Theorem [Chen-Peng-Wu’25]: There exist tasks (function composition) where:
  • Encoder: Solves with $\log L$ layers and width $Hdp = O(\log n)$.
  • Decoder: Requires width $Hdp = n^{\Omega(1)}$ (an exponential blowup) for the same $L$.
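For concreteness, here is a toy version of the function-composition task; this is an illustrative instance, not the exact hard distribution from the paper.

```python
# A toy version of the composition task (illustrative only; not the exact hard
# distribution from the paper): the input lists k functions on {0, ..., m-1}
# as lookup tables plus a start point x, and the answer is f_k(...f_2(f_1(x))...).

def compose_chain(tables, x):
    for f in tables:          # each f is a list with f[i] = f(i)
        x = f[x]              # one pointer-chasing step
    return x

f1, f2, f3 = [1, 2, 0], [2, 0, 1], [0, 2, 1]
print(compose_chain([f1, f2, f3], x=0))   # f3(f2(f1(0))) = 0
```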

Part 4: Chain-of-Thought (CoT)

Breaking the Barrier

CoT = Circuit Simulation

With CoT, we are no longer bound by the depth of the Transformer.

```mermaid
graph LR
  Input --> T1[Step 1]
  T1 --> C1[CoT Token]
  C1 --> T2[Step 2]
  T2 --> C2[CoT Token]
```
Result: CoT + const-precision + $O(\log n)$ embedding size can simulate $\mathsf{P/poly}$ (with polynomial-length CoT)! [Li-Liu-Zhou-Ma’24]

Benefits & Limits

Benefit

Closes the Encoder/Decoder separation gap: composition tasks become solvable with just 2 layers + CoT.

Limit

Length Matters. To solve Parity or Reachability, CoT length must be proportional to circuit size (poly n).
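A toy illustration of the length requirement using Parity: one constant-size step per input bit, so the chain grows linearly with $n$, matching Parity's linear circuit size.

```python
# Parity with a chain of thought: one constant-size step per input bit, each
# step emitting the running parity as a token, so the chain has length n.

def parity_with_cot(bits):
    cot, acc = [], 0
    for b in bits:
        acc ^= b              # one constant-size "reasoning" step
        cot.append(acc)       # emit the intermediate result as a CoT token
    return cot, acc           # n CoT tokens; the last one is the answer

cot, answer = parity_with_cot([1, 0, 1, 1, 0, 1])
print(cot, answer)            # [1, 1, 0, 1, 1, 0] 0
```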

Summary Table

| Model Config | Complexity | Key Limit / Capability |
| --- | --- | --- |
| Const-Bit, Const-Depth | $\mathsf{AC}^0$ | No Parity |
| Log-Bit, Const-Depth | $\mathsf{TC}^0$ | No Reachability ($\mathsf{NC}^1$) |
| Log-Depth Transformer | $\supseteq \mathsf{NC}^1$ | Can do Regex |
| Transformer + CoT | $\mathsf{P/poly}$ | Universal (given time) |