A typed-tuple protocol for mechanistic interpretability

Make every attention head queryable in plain English.

We package each of GPT-2's 144 attention heads as a typed tuple (E, S, R, D, G, T), index it, and expose it through hybrid retrieval. Ask a question. Get a grounded answer with traceable citations.

144 heads (12 layers x 12 heads) / 583 indexed templates / 4 functional concepts

0.411: S+R oracle recall@30, vs 0.20-0.22 for random partitions (p < 0.01)

5/7: IOI L7-L9 name-mover heads recovered from the canonical Wang et al. circuit

2.0-4.5x: lift over matched-budget random, across 5 IOI-style behaviours, pre-registered

01 / Profile heads

Every attention head becomes a typed tuple.

Each of the 144 heads is profiled across induction, IOI, and WikiText probes, then packaged into the six-field schema (E, S, R, D, G, T). The matrix below shows ablation importance for each head; the cyan outline marks the 13 reference heads from Wang et al. and Olsson et al.

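[Figure: ablation-importance matrix, one cell per head. Colour scale runs from low importance to high importance; cyan outline = literature reference set.]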
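As a rough illustration of the six-field schema, the sketch below holds one record per head in a named tuple. The field names come from the text; the field types, the `HeadTuple` name, and the choice of the head identifier as the `E` value are assumptions made for illustration, since the contents of each field are not specified here.

```python
from typing import Any, NamedTuple


class HeadTuple(NamedTuple):
    """Hypothetical container for one head's (E, S, R, D, G, T) record."""
    E: str             # assumed: an identifier such as "L09_H06"
    S: dict[str, Any]  # field contents are placeholders, not the real schema
    R: dict[str, Any]
    D: dict[str, Any]
    G: dict[str, Any]
    T: dict[str, Any]


def head_id(layer: int, head: int) -> str:
    """Format an identifier in the L00_H00 style used on this page."""
    return f"L{layer:02d}_H{head:02d}"


def empty_profiles(n_layers: int = 12, n_heads: int = 12) -> list[HeadTuple]:
    """Create one empty record per head, in layer-major order."""
    return [
        HeadTuple(head_id(layer, head), {}, {}, {}, {}, {})
        for layer in range(n_layers)
        for head in range(n_heads)
    ]


profiles = empty_profiles()
```

With the default arguments this yields 144 records, one per head, ready to be filled by whatever profiling step populates the fields.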

S - the four functional concepts

The S-primitive must match the architecture's natural feature axes. Surface linguistic features (POS, dependency relations, NER) recovered only 3 / 16 reference heads. The four functional primitives below recovered 13 / 16.

02 / Index

Each MU renders into four deterministic templates.

144 heads x 4 doc-types per head + 7 relational templates = 583 indexed documents. Templates substitute stored field values directly - every rendered statement traces back to a specific number in the JSON.

Sample: L00_H00 / L05_H05 / L09_H06 / L07_H03 / L11_H02. Each shown across all four doc-types (summary, semantics, transformer, guidance).
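The document count and the idea of direct substitution can be sketched as follows. The four doc-type names and the counts come from the text above; the template wording, the `render` helper, the `ablation_importance` field name, and the value 0.42 are made up for illustration.

```python
from string import Template

# The four doc-types named in the text.
DOC_TYPES = ("summary", "semantics", "transformer", "guidance")
N_HEADS = 12 * 12
N_RELATIONAL = 7

# Hypothetical template wording; only the doc-type names come from the text.
TEMPLATES = {
    doc_type: Template(f"[{doc_type}] Head $head_id: $field = $value.")
    for doc_type in DOC_TYPES
}


def render(doc_type: str, head_id: str, field: str, value: float) -> str:
    """Substitute stored values verbatim so the text traces back to the record."""
    return TEMPLATES[doc_type].substitute(head_id=head_id, field=field, value=value)


total_documents = N_HEADS * len(DOC_TYPES) + N_RELATIONAL  # 144 * 4 + 7
example = render("summary", "L09_H06", "ablation_importance", 0.42)  # made-up value
```

Because the template only substitutes stored values, the same record always renders to the same string, which is what makes each statement traceable.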

03 / Ask

Hybrid retrieval - exact match meets semantic search.

The retriever runs two rails in parallel: exact match on E for entity-bearing queries, and dense semantic search over S / R / G / T for conceptual queries. Hits from both rails are aggregated at the unit (head) level before generation.
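A minimal sketch of the two-rail idea is shown below, under several assumptions: a bag-of-words cosine stands in for the dense encoder, head identifiers in the query drive the exact-match rail, per-head aggregation takes the best document score, and an exact hit adds a fixed bonus of 1.0. None of these choices is confirmed by the text, and the toy documents are made-up wording.

```python
import math
import re
from collections import Counter, defaultdict

HEAD_ID = re.compile(r"L\d{2}_H\d{2}")


def embed(text: str) -> Counter:
    """Toy stand-in for a dense encoder: lowercase bag of words."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, docs: list[tuple[str, str]], top_k: int = 3):
    """Rank heads by best semantic score plus an exact-match bonus (assumed)."""
    exact_ids = set(HEAD_ID.findall(query))  # rail 1: exact match on identifiers
    q = embed(query)
    semantic = defaultdict(float)
    for head, text in docs:  # rail 2: similarity, aggregated per head by max
        semantic[head] = max(semantic[head], cosine(q, embed(text)))
    scores = {h: s + (1.0 if h in exact_ids else 0.0) for h, s in semantic.items()}
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


# Toy documents with made-up wording, several per head.
docs = [
    ("L05_H05", "induction head copies earlier token"),
    ("L05_H05", "attends to previous occurrence"),
    ("L09_H06", "name mover head for indirect object"),
    ("L00_H00", "attends to current token"),
]
by_entity = retrieve("What does L09_H06 do?", docs)
by_concept = retrieve("which head does induction", docs)
```

Aggregating by maximum keeps a head with one strongly matching document from being diluted by its other documents; summing scores would be an equally plausible choice.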

04 / Verify

Pre-registered, falsifiable, matched-budget.

Every claim below was registered with a SHA-256 hash before outcomes were computed. Numbers are read directly from the result JSON files.
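One way such a hash commitment could work is sketched below: serialise the claim canonically, hash it before running the experiment, and re-check the hash afterwards. The claim fields and the helper names are hypothetical; the text only states that SHA-256 was used.

```python
import hashlib
import json


def registration_hash(claim: dict) -> str:
    """Hash a canonical JSON serialisation so key order does not matter."""
    canonical = json.dumps(claim, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def matches_registration(claim: dict, registered: str) -> bool:
    """True if the claim is byte-for-byte what was registered."""
    return registration_hash(claim) == registered


# Hypothetical claim, written down before any outcome is computed.
claim = {"metric": "recall@30", "baseline": "random partitions", "alpha": 0.01}
registered = registration_hash(claim)

# Any later edit to the claim changes the hash and is therefore detectable.
tampered = dict(claim, alpha=0.05)
```

Sorting keys and fixing separators matters here: without a canonical form, two semantically identical claims could hash differently.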

Per-layer IOI mediation

BABA path-patching, n = 80 prompt pairs. Hotspot at L7-L9 reproduces Wang et al.

Field roles in the schema

Per-field ablation. S+R is the minimal-optimal core.

Necessary fields: removing them drops recall significantly. Interfering fields: they hurt retrieval when blended in.
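The recall figures behind the field ablation can be read with a small helper like the one below. It assumes recall@k means the fraction of a reference set that appears among the top k retrieved heads; the text does not spell out the definition, and the example identifiers are placeholders.

```python
def recall_at_k(ranked: list[str], reference: set[str], k: int = 30) -> float:
    """Fraction of reference items found in the first k ranked items."""
    if not reference:
        raise ValueError("reference set must be non-empty")
    top_k = set(ranked[:k])
    return len(top_k & reference) / len(reference)


# Placeholder identifiers: one of three reference items is in the top 2.
example_recall = recall_at_k(["a", "b", "c", "d"], {"b", "d", "z"}, k=2)
```

Under this reading, a field is "necessary" if dropping it from the query lowers this number, and "interfering" if dropping it raises it.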

Behavioural lifts (matched-budget random baseline)

Lift = |framework shift| / |random shift|. Mean of 3 random seeds.
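The lift formula can be written out as below. One detail is an assumption: the sketch averages the absolute random shift over the seeds and then divides, whereas averaging per-seed ratios would be another reading of "mean of 3 random seeds". The numbers in the example are made up.

```python
from statistics import mean


def lift(framework_shift: float, random_shifts: list[float]) -> float:
    """|framework shift| divided by the mean |random shift| across seeds."""
    baseline = mean(abs(s) for s in random_shifts)
    if baseline == 0:
        raise ValueError("random baseline shift is zero; lift is undefined")
    return abs(framework_shift) / baseline


# Made-up shifts for three seeds: mean |random shift| is 0.5.
example_lift = lift(-1.0, [0.25, -0.5, 0.75])
```

Taking absolute values means the lift compares magnitudes only, so a framework shift in the opposite direction from the random shift still yields a positive ratio.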

Field correlations

Head-ranking Spearman ρ. The S ↔ R cell is the strongest negative (ρ=−0.20): the two necessary fields retrieve antiparallel head populations.

No field pair exceeds the redundancy threshold (redundant_pairs = ∅). Fields are functionally distinct — interference arises from ranking conflict, not duplication.
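For readers who want to reproduce a head-ranking correlation, Spearman's ρ is the Pearson correlation of the ranks. The sketch below is a plain-Python version with average ranks for ties; it is not the project's code, and the input scores are placeholders.

```python
import math
from statistics import mean


def average_ranks(values: list[float]) -> list[float]:
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x: list[float], y: list[float]) -> float:
    """Spearman rank correlation between two equally long score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have the same length, at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Placeholder per-head scores from two fields that rank heads in opposite order.
rho = spearman([0.1, 0.4, 0.7, 0.9], [5.0, 3.0, 2.0, 1.0])
```

A negative ρ between two fields means the heads one field ranks highly tend to be ranked low by the other, which is the ranking conflict described above.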

Scope and limitations

Per-head causal mediation: the result is a set overlap with the canonical IOI circuit, not a strict per-head path-patching enrichment. The 7-head oracle itself restores only 11.7% of the IOI logit difference, so the circuit is intrinsically distributed.
T-field is correlational: the T primitives (induction / copying / QK / OV) correlate with mechanistic roles but are not optimised to predict path-patching scores. Populating T directly from path-patching scores would close the gap, but is costly in practice.
Concept vocabulary lesson: the 3 / 16 -> 13 / 16 reference-head recovery story is a methodological lesson, not a finding. Surface POS / DEP / NER vocabularies are inappropriate for positional routers like attention heads.
Scale: no frontier-scale validation. SAE-feature S (Lieberum et al., Gemma Scope) is the natural next step.
Per-field causal enrichment: all 6 fields produce p = 1.0 after Bonferroni correction on the path-patching null. Retrieval enrichment is not the same as causal enrichment for distributed circuits.
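The p = 1.0 result in the last item is easier to interpret with the Bonferroni rule in view: each raw p-value is multiplied by the number of tests and capped at 1. The sketch below uses made-up raw p-values for six tests; only the count of six fields comes from the text.

```python
def bonferroni(p_values: list[float]) -> list[float]:
    """Bonferroni-adjusted p-values: p * m, capped at 1.0."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Made-up raw p-values for six tests; any raw p >= 1/6 saturates at 1.0.
adjusted = bonferroni([0.2, 0.5, 0.01, 0.3, 0.9, 0.17])
```

With six tests, an adjusted value of 1.0 only says the raw p-value was at least 1/6, so it signals no evidence of enrichment rather than evidence against it.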