Make every attention head queryable in plain English.
We package each of GPT-2's 144 attention heads as a typed tuple (E, S, R, D, G, T), index it, and expose it through hybrid retrieval. Ask a question. Get a grounded answer with traceable citations.
Every attention head becomes a typed tuple.
Each of the 144 heads is profiled across induction, IOI, and WikiText probes, then packaged into the six-field schema (E, S, R, D, G, T). The matrix below shows ablation importance for each head; the cyan outline marks the 13 reference heads from Wang et al. and Olsson et al.
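As a rough illustration of the six-field schema, the sketch below builds one empty record per head. The field names E, S, R, D, G, T and the `L09_H06` identifier format come from this page; the Python types, the choice to store the identifier in E, and the names `HeadUnit`, `head_id`, and `empty_units` are assumptions made for this example only.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

N_LAYERS = 12  # GPT-2 small: 12 layers x 12 heads = 144 heads
N_HEADS = 12


@dataclass
class HeadUnit:
    """One attention head as the tuple (E, S, R, D, G, T).

    The payload types below are assumed; the page does not specify them.
    """
    E: str                                            # assumed: exact-match identifier
    S: Dict[str, float] = field(default_factory=dict)  # assumed: functional-concept scores
    R: Dict[str, Any] = field(default_factory=dict)    # contents not specified here
    D: Dict[str, Any] = field(default_factory=dict)    # contents not specified here
    G: Dict[str, Any] = field(default_factory=dict)    # contents not specified here
    T: Dict[str, float] = field(default_factory=dict)  # assumed: induction / copying / QK / OV scores


def head_id(layer: int, head: int) -> str:
    """Format an identifier such as 'L09_H06'."""
    return f"L{layer:02d}_H{head:02d}"


def empty_units() -> List[HeadUnit]:
    """Create one unpopulated record per head, in layer-major order."""
    return [
        HeadUnit(E=head_id(layer, head))
        for layer in range(N_LAYERS)
        for head in range(N_HEADS)
    ]
```

Profiling results from the induction, IOI, and WikiText probes would then be written into the payload fields of each record.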
S - the four functional concepts
The S-primitive must match the architecture's natural feature axes. Surface linguistic features (POS, dependency relations, NER) recovered only 3 / 16 reference heads. The four functional primitives below recovered 13 / 16.
Each MU is rendered through four deterministic templates.
144 heads × 4 doc-types per head + 7 relational templates = 583 indexed documents. Templates substitute stored field values directly - every rendered statement traces back to a specific number in the JSON.
Sample: L00_H00 / L05_H05 / L09_H06 / L07_H03 / L11_H02. Each shown across all four doc-types (summary, semantics, transformer, guidance).
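The rendering step can be sketched as plain string substitution. The four doc-type names and the document count come from this page; the template wording, the record keys (`ablation_importance`, `top_concept`, `induction_score`), and the function names are hypothetical, since the real templates are not reproduced here.

```python
from typing import Dict, List

DOC_TYPES = ("summary", "semantics", "transformer", "guidance")

# Hypothetical template strings, one per doc-type.
TEMPLATES: Dict[str, str] = {
    "summary": "Head {head_id}: ablation importance {ablation_importance}.",
    "semantics": "Head {head_id}: top functional concept is {top_concept}.",
    "transformer": "Head {head_id}: induction score {induction_score}.",
    "guidance": "Head {head_id}: consult for questions about {top_concept}.",
}


def render_head(record: Dict[str, object]) -> List[Dict[str, str]]:
    """Render one head record into four documents, one per doc-type.

    Values are substituted verbatim, so each statement can be traced
    back to a stored field.
    """
    return [
        {
            "head_id": str(record["head_id"]),
            "doc_type": doc_type,
            "text": TEMPLATES[doc_type].format(**record),
        }
        for doc_type in DOC_TYPES
    ]


def total_documents(n_heads: int = 144, n_doc_types: int = 4, n_relational: int = 7) -> int:
    """Per-head documents plus relational documents."""
    return n_heads * n_doc_types + n_relational
```

With the defaults, `total_documents()` gives 144 × 4 + 7 = 583, matching the index size stated above.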
Hybrid retrieval - exact match meets semantic search.
The retriever runs two rails in parallel: exact match on E for entity-bearing queries, and dense semantic search over S / R / G / T for conceptual queries. Hits are aggregated at the unit level before generation.
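A minimal sketch of the two-rail idea, under several assumptions: the exact-match rail is approximated by a regular expression over head identifiers, the dense rail is replaced by a toy bag-of-words cosine similarity (a stand-in for a real embedding model), and the additive `exact_bonus` and max-per-unit aggregation are illustrative choices rather than the project's actual scoring rule.

```python
import math
import re
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

HEAD_ID = re.compile(r"L\d{2}_H\d{2}")


def embed(text: str) -> Counter:
    """Toy stand-in for a dense encoder: lowercase bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, docs: List[Dict[str, str]], top_k: int = 3,
             exact_bonus: float = 1.0) -> List[Tuple[str, float]]:
    """Score every document, then keep the best score per head (unit level)."""
    exact_ids = set(HEAD_ID.findall(query))      # rail 1: exact match on identifiers
    query_vec = embed(query)                     # rail 2: semantic similarity
    unit_scores: Dict[str, float] = defaultdict(float)
    for doc in docs:
        score = cosine(query_vec, embed(doc["text"]))
        if doc["head_id"] in exact_ids:
            score += exact_bonus
        unit_scores[doc["head_id"]] = max(unit_scores[doc["head_id"]], score)
    ranked = sorted(unit_scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_k]
```

Aggregating by `head_id` means a head with four matching documents still appears once in the ranked list, which is the unit-level behaviour the text describes.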
Pre-registered, falsifiable, matched-budget.
Every claim below was registered with a SHA-256 hash before outcomes were computed. Numbers are read directly from the result JSON files.
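One way such a registration could work is to hash a canonical serialisation of each claim before running the experiment. The sketch below uses Python's standard `hashlib` and `json` modules; the claim structure and the function names `register_claim` and `verify_claim` are assumptions, not the project's actual registration format.

```python
import hashlib
import json
from typing import Any, Dict


def register_claim(claim: Dict[str, Any]) -> str:
    """Return the SHA-256 hex digest of a canonical JSON form of the claim.

    Sorting keys and fixing separators makes the digest independent of
    dictionary ordering and whitespace.
    """
    canonical = json.dumps(claim, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def verify_claim(claim: Dict[str, Any], registered_hash: str) -> bool:
    """Check that a claim is byte-identical to what was registered."""
    return register_claim(claim) == registered_hash
```

Publishing the digest before outcomes are computed lets a reader confirm afterwards that the claim text was not edited to fit the results.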
Per-layer IOI mediation
BABA path-patching, n = 80 prompt pairs. Hotspot at L7-L9 reproduces Wang et al.
Field roles in the schema
Per-field ablation. S+R is the minimal-optimal core.
Necessary: removing them drops recall significantly. Interfere: they hurt retrieval when blended in.
Behavioural lifts (matched-budget random baseline)
Lift = |framework shift| / |random shift|. Mean of 3 random seeds.
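The lift formula can be written out directly. The sketch below assumes that "mean of 3 random seeds" means the random-baseline shift is averaged in absolute value across seeds before the ratio is taken; averaging per-seed ratios instead would be an equally plausible reading and can give a different number.

```python
from statistics import mean
from typing import Sequence


def lift(framework_shift: float, random_shifts: Sequence[float]) -> float:
    """Lift = |framework shift| / mean(|random shift|) over random seeds."""
    baseline = mean(abs(shift) for shift in random_shifts)
    if baseline == 0:
        raise ValueError("random baseline shift is zero; lift is undefined")
    return abs(framework_shift) / baseline
```

A lift above 1 indicates that the framework-selected heads move the behaviour more than randomly chosen heads under the same budget.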
Field correlations
Head-ranking Spearman ρ. The S ↔ R cell is the strongest negative (ρ=−0.20): the two necessary fields retrieve antiparallel head populations.
No field pair exceeds the redundancy threshold (redundant_pairs = ∅). Fields are functionally distinct - interference arises from ranking conflict, not duplication.
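The pairwise check can be sketched with a dependency-free Spearman implementation (Pearson correlation of average ranks). The name `redundant_pairs` mirrors the identifier quoted above; the threshold value is not stated on this page, so it is left as a required argument, and the input layout (one score per head for each field) is an assumption.

```python
from itertools import combinations
from math import sqrt
from typing import Dict, List, Sequence, Tuple


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rho as the Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        raise ValueError("rho is undefined for a constant ranking")
    return cov / (sx * sy)


def redundant_pairs(scores_by_field: Dict[str, Sequence[float]],
                    threshold: float) -> List[Tuple[str, str]]:
    """Field pairs whose head rankings correlate above the threshold in magnitude."""
    return [
        (a, b)
        for a, b in combinations(sorted(scores_by_field), 2)
        if abs(spearman(scores_by_field[a], scores_by_field[b])) > threshold
    ]
```

An empty return value corresponds to the `redundant_pairs = ∅` result reported above.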
Scope and limitations
- Per-head causal mediation: evaluated as set overlap with the canonical IOI circuit, not as strict per-head path-patching enrichment. The 7-head oracle itself restores only 11.7% of the IOI logit difference - the circuit is intrinsically distributed.
- T-field is correlational: the T primitives (induction / copying / QK / OV) correlate with mechanistic roles but are not optimised to predict path-patching scores. Populating T directly from path-patching scores would close the gap, but is costly in practice.
- Concept vocabulary lesson: the 3 / 16 -> 13 / 16 reference-head recovery story is a methodological lesson, not a finding. Surface POS / DEP / NER vocabularies are inappropriate for positional routers like attention heads.
- Scale: no frontier-scale validation. SAE-feature S (Lieberum et al., Gemma Scope) is the natural next step.
- Per-field causal enrichment: all 6 fields produce p = 1.0 after Bonferroni correction on the path-patching null. Retrieval enrichment is not the same as causal enrichment for distributed circuits.
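For reference, the Bonferroni adjustment mentioned in the last item multiplies each raw p-value by the number of tests and caps the result at 1. The sketch below is the textbook procedure, not the project's analysis code, and the raw per-field p-values are not given on this page.

```python
from typing import List, Sequence


def bonferroni(p_values: Sequence[float]) -> List[float]:
    """Bonferroni-adjusted p-values: min(1, m * p) for m simultaneous tests."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]
```

With six fields, any raw p-value of 1/6 or higher saturates at an adjusted value of 1.0, so an all-ones result means no field reached even that level on the path-patching null.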