Precision, Depth, Encoders vs. Decoders, & CoT
Formalizing the Transformer
Each layer maps a matrix to a matrix:
Model Composition:
Applies the same $\mathsf{TC}^0$ circuit to every vector independently.
Transfers information between vectors (Routing).
What I wnat to know
The key for an item
The content of an item
Dot Product Match
Softmax Weights
Output (Mixed Value)
Raw Logits (Editable)
Attn Weights
Any $\mathsf{NC}^1$ computation can be visualized as a width-5 Branching Program over $S_5$.
$\mathsf{NC}^1$ is exactly equivalent to width-5 branching programs over $S_5$.
Computing product of $S_5$ elements is complete for $\mathsf{NC}^1$.
Standard Transformers (const-depth) cannot solve this.
Think-dot-by-dot & Log-Depth
Can meaningless tokens provide computational power?
Transformers can use filler tokens ("...") to solve hard tasks (e.g., 3SUM) they can't solve directly.
Benefit of CoT is partly just extra "depth" (time to think), not just semantic reasoning.
[Pfau-Merrill-Bowman’24] "Let's Think Dot by Dot"
If we increase depth to $O(\log n)$, we unlock $\mathsf{NC}^1$.
[Merrill-Sabharwal’25]
Communication Protocols & Trade-offs
Connecting Transformers to Distributed Computing (MPC).
| Transformer | $\longleftrightarrow$ | MPC Model |
|---|---|---|
| Token $x_i$ | $\leftrightarrow$ | Machine $M_i$ |
| Layer | $\leftrightarrow$ | Communication Round |
| Attention | $\leftrightarrow$ | Message Passing |
[Sanford-Hsu-Telgarsky’24] "Transformers, parallel computation, and logarithmic depth"
Full Visibility (MPC Broadcast)
Causal Masking (Restricted)
Breaking the Barrier
With CoT, we are no longer bound by the depth of the Transformer.
Solves the Encoder/Decoder separation gap. Solves composition tasks with just 2 layers + CoT.
Length Matters. To solve Parity or Reachability, CoT length must be proportional to circuit size (poly n).
| Model Config | Complexity | Key Limit |
|---|---|---|
| Const-Bit, Const-Depth | $\mathsf{AC}^0$ | No Parity |
| Log-Bit, Const-Depth | $\mathsf{TC}^0$ | No Reachability ($NC^1$) |
| Log-Depth Transf. | $\supseteq \mathsf{NC}^1$ | Can do Regex |
| Transf. + CoT | $\mathsf{P/poly}$ | Universal (given time) |