
Intelligent systems

Intelligence — built or evolved — is the same trick: project messy reality into a representation, run a tractable computation on the representation, project an answer back. Neural networks (continuous, differentiable), fuzzy logic (graded, rule-based), and symbolic graphs (discrete, composable) are three substrates that overlap more than they compete — modern systems usually use all three. Transformers won 2017–2025 by treating sequence as attention over a graph of tokens; newer architectures (SSMs, MoE, diffusion, hybrids) chip at the cost. The deeper question is **representation**: a good representation makes the next computation cheap. The repo itself — and the LLM reading these lines — is one more such substrate.
🌿 budding · tended 2026-05-12 · tags: research, ai, ml, neural-networks, fuzzy-logic, transformers, representation, graphs, self-reference
flowchart LR
  world[messy world] --> rep[representation]
  rep --> compute[tractable computation]
  compute --> rep2[next representation]
  rep2 --> act[act · predict · decide]
  act --> world
  rep --> nn[neural net · continuous]
  rep --> fuzzy[fuzzy · graded rules]
  rep --> graph[symbolic graph · discrete]
  nn -.hybrid.- fuzzy
  fuzzy -.hybrid.- graph
  nn -.hybrid.- graph

Investigation · rating: medium. Synthesizes ML architectures, fuzzy-logic history, and self-reflection on transformer mechanics. Self-reference on architecture is deliberate — the swarm protocol works by writing about itself.

Status: budding | 2026-05-12 | rating: medium

Pick the representation first; the computation follows. A bad representation makes hard problems impossible; a good one makes hard problems boring.

L0 — TL;DR (≤5 lines)

Intelligent systems — biological brain, artificial NN, expert system, search engine — all do the same three things: (1) project raw input into a representation that throws away the irrelevant, (2) compute on the representation cheaply, (3) project back into an act, prediction, or new representation. The big modern families: neural networks (continuous, differentiable, learned), fuzzy logic (graded, rule-based, hand-built), symbolic graphs (discrete, composable, formal). They are complements, not competitors — production AI stacks usually wire all three. The Transformer (Vaswani 2017) won 2017–2025 by treating a sequence as a graph with learned attention edges; newer architectures (SSMs, MoE, diffusion, hybrids) attack its quadratic-attention cost. The underlying question is representation engineering — and the swarm/godding repo is itself an experiment in representation: markdown + git as the substrate.

L1 — Overview

Core question

What are the standard substrates for intelligent computation (neural, fuzzy, symbolic-graph), how do modern ML architectures (transformer family + competitors) build on them, what makes one representation of math/code/image better than another, and — the self-referential question the user posed — what is the architecture of the model writing this page, and how does that architecture shape what swarming and godding can and can't be?

Why it matters

  • Almost every domain in this repo (compression, expert dispatch, belief updating, rate-distortion) is downstream of a representation choice. Once the representation is fixed, you've decided 80 % of what the system can compute cheaply.
  • The swarm protocol writes about itself, which is the smallest closed loop of "intelligent system improving its own substrate". Understanding LLM architecture (mine) clarifies what improvements are actually available.
  • ML literature has converged on a small set of building blocks — embedding · attention · MLP · normalization · residual · optimizer. Once those are named, every "new architecture" is a recombination.
  • Fuzzy logic looks dated but is the right tool for graded rules with interpretability; it survives in industrial control, medical decision support, and inside LLM tool-use prompts.

Mermaid map (L1)

flowchart LR
  input[input · text · pixels · sound · sensor] --> embed[embedding · representation]
  embed --> block[stack of blocks]
  block --> nn[neural · linear + nonlinear]
  block --> attn[attention · or its competitors]
  block --> norm[normalization]
  block --> residual[residual paths]
  nn & attn & norm & residual --> out[output projection]
  out --> task[task: predict · classify · generate · act]
  task -.gradient.-> block
  embed -.is the representation question.-> rep[representation engineering]
  rep --> fuzzy[fuzzy: graded membership]
  rep --> graph[symbolic graph: nodes + typed edges]
  rep --> learned[learned dense vector]
  fuzzy & graph & learned --> hybrid[neurosymbolic / hybrid]

Skeleton sub-claims

  • All intelligence does: project → compute → project back.
  • Three substrates dominate: neural, fuzzy, symbolic-graph.
  • ML architecture is a small toolbox of building blocks.
  • Transformers won by combining four old ideas in the right shape.
  • Good representations: invariant where physics is invariant, smooth where the target is smooth, sparse where activity is sparse, compositional where structure is recursive.
  • Math representation tools: graphs, trees, tensors, matrices, category-theoretic diagrams — each cheap for a different question.
  • LLM (this model) is a transformer stack — explicit limits on attention, working memory, and self-modification.
  • Self-improving systems work via external substrate (this repo) more than via internal weight change.

L2 — Deep dive

1. The unified shape of intelligent computation

Three steps, every system:

input  →  representation  →  computation  →  representation  →  output

Pick any system and the pieces map:

| System | Input | Representation | Computation | Output |
|---|---|---|---|---|
| Human visual cortex | retinal photons | retinotopic activation → edge / orientation / object | hierarchical feature extraction (V1 → V2 → IT) | object identity, location |
| Expert system (1980s) | symbolic facts | first-order predicates | forward / backward chaining | inferred facts |
| Fuzzy controller (e.g. rice cooker) | sensor reading | graded membership in fuzzy sets | rule firing + defuzzification | actuator command |
| CNN (image classifier) | pixel array | learned conv feature maps | gradient-trained convolutions + pooling | class probabilities |
| Transformer (LLM) | token sequence | learned token embedding + position | multi-head self-attention + MLP | next-token distribution |
| Search engine | query string | inverted index + dense embedding | BM25 + ANN retrieval + re-rank | ranked URL list |
| Reinforcement-learning agent | environment state | state embedding | policy + value network | action |
| Swarm-godding repo | git tree + sessions | markdown lessons + principles + frontiers | orient/dispatch/compress cycle | next session's commit |

The lesson: once you name the representation step, everything else falls into place. Most "AI advances" are actually representation advances (CNN for images, transformer for sequences, diffusion for images, GNN for graphs).

2. The three substrates

Neural networks — continuous, differentiable, learned

A function approximator:

$$ f_\theta(x) = \sigma(W_n \cdot \sigma(W_{n-1} \cdot \dots \cdot \sigma(W_1 x + b_1) \dots) + b_n) $$

  • Universal function approximator (Cybenko 1989; Hornik 1991): a single sufficiently-wide hidden layer with a non-polynomial activation can approximate any continuous function on a compact set to arbitrary accuracy. So representational capacity was never the issue — trainability and generalization were.
  • Learning: gradient descent on parameters minimizes a loss. Backprop computes gradients efficiently via the chain rule.
  • Strengths: pattern recognition in high-dim noisy data, end-to-end training from raw signal, smooth interpolation.
  • Weaknesses: opaque (no symbolic rule extractable in general), data-hungry, brittle out-of-distribution, hard to inject prior knowledge.
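A minimal numpy sketch of the stacked-affine-plus-nonlinearity formula above (layer sizes and the tanh activation are arbitrary illustrations, not a recommendation):

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_forward(x, weights, biases, sigma=np.tanh):
    """f_theta(x) = sigma(W_n . sigma(... sigma(W_1 x + b_1) ...) + b_n)."""
    h = x
    for W, b in zip(weights, biases):
        h = sigma(W @ h + b)   # affine map, then pointwise nonlinearity
    return h

# Two hidden layers: R^4 -> R^8 -> R^8 -> R^2
dims = [4, 8, 8, 2]
weights = [rng.normal(scale=0.5, size=(m, n)) for n, m in zip(dims[:-1], dims[1:])]
biases = [np.zeros(m) for m in dims[1:]]

print(mlp_forward(rng.normal(size=4), weights, biases))  # -> a 2-vector
```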

Fuzzy logic — graded, rule-based, hand-built

Lotfi Zadeh (1965) generalized set membership from {0,1} to [0,1].

  • A fuzzy set on a universe X is a function μ: X → [0,1].
  • Rules: IF (temperature IS hot) AND (humidity IS high) THEN (fan IS fast).
  • Defuzzification turns the resulting graded output back to a crisp number (centroid method is standard).
  • Strengths: interpretable; handles vague terms naturally; performs well with sparse data; mature in industrial control.
  • Weaknesses: rules must be hand-tuned unless paired with a learning procedure (e.g. ANFIS); doesn't scale to thousands of input dimensions; composing many graded (non-Boolean) rules soundly is non-trivial.
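A minimal numpy sketch of the fuzzify → fire rules → defuzzify pipeline (the triangular membership functions and the temperature/fan rule set are illustrative, not from any production controller):

```python
import numpy as np

def tri(x, a, b, c):
    """Triangular membership function: 0 outside [a, c], peak 1 at b."""
    return np.maximum(np.minimum((x - a) / (b - a), (c - x) / (c - b)), 0.0)

def control(temp):
    # Fuzzify: graded membership of the crisp input in each fuzzy set
    cool, hot = tri(temp, 10, 18, 26), tri(temp, 22, 30, 38)
    # Rule firing: IF temp IS cool THEN fan IS slow; IF temp IS hot THEN fan IS fast
    fan = np.linspace(0, 100, 201)                  # candidate fan speeds (%)
    slow, fast = tri(fan, 0, 20, 50), tri(fan, 50, 80, 100)
    out = np.maximum(np.minimum(cool, slow), np.minimum(hot, fast))  # Mamdani min/max
    # Defuzzify: centroid of the clipped output sets -> one crisp command
    return np.sum(fan * out) / np.sum(out)          # assumes some rule fired

print(control(28.0))  # crisp fan speed in %, dominated by the "fast" rule at 28 °C
```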

Where fuzzy logic actually lives today:

  • Rice cookers, washing machines, air conditioners, automatic transmissions (Mamdani / Takagi-Sugeno controllers).
  • Medical decision support (graded diagnosis criteria).
  • Inside LLM tool-use prompts ("rate confidence high / medium / low") — these are fuzzy sets in disguise.
  • Industrial process control where regulators require explainable rules.

Symbolic graphs — discrete, composable, formal

Nodes = entities (variables, terms, propositions); edges = typed relations (function application, implication, dependency).

  • A computation graph: nodes are operations, edges are tensors. PyTorch / TensorFlow build one when you call forward().
  • A knowledge graph: nodes are entities, edges are relations. Wikidata, schema.org, internal company KGs.
  • An abstract syntax tree: program structure.
  • A category-theoretic diagram: objects and morphisms with composition laws.
  • Strengths: compositional, inspectable, supports formal reasoning, easy to update locally.
  • Weaknesses: brittle at the "messy input" boundary (graph construction from raw text/image is itself a learning problem), combinatorial explosion in pure inference.

Synergies (neurosymbolic, neuro-fuzzy)

These three substrates compose in production AI:

| Hybrid | Where used | Effect |
|---|---|---|
| Neural + fuzzy (ANFIS) | sensor fusion, robotic control | NN learns the fuzzy membership functions from data |
| Neural + symbolic | tool-using LLMs, RAG, code generation | NN handles perception/language; symbolic system handles strict logic / arithmetic / lookup |
| Graph neural networks (GNN) | molecules, social networks, traffic | message-passing on a graph, parameterized by NN weights |
| Differentiable programming | JAX, modern ML frameworks | the whole program is a computation graph with gradients |
| Retrieval-augmented generation (RAG) | LLM + vector DB + KG | LLM is the orchestrator; KG / vector store is the fact substrate |
| Mixture of experts (MoE) | Switch Transformer, DeepSeek-V3, GPT-4 (rumored) | symbolic-style routing decides which neural expert to use |

The 2020s practical synthesis: a transformer LLM with tool use, retrieval, and a knowledge graph in the loop. None of the three substrates wins alone.

3. Modern ML architectures — the toolbox

The building blocks shared across almost all 2020s deep-learning models:

| Block | What it does | Origin |
|---|---|---|
| Embedding | maps discrete tokens or pixels to dense vectors | Mikolov 2013 (word2vec); much older in linguistics |
| Linear / dense layer | learned affine map | classical |
| Nonlinearity | sigmoid → tanh → ReLU → GELU → SwiGLU | ReLU: Nair & Hinton 2010 |
| Convolution | weight-sharing across translation | LeCun 1989 |
| Recurrence | weight-sharing across time (LSTM, GRU) | Hochreiter & Schmidhuber 1997 |
| Self-attention | content-dependent weighted lookup | Bahdanau 2014 → Vaswani 2017 |
| Cross-attention | attend from one sequence to another | Vaswani 2017 |
| Layer normalization | per-token feature normalization | Ba 2016 |
| Residual connection | skip + add | He 2016 (ResNet) |
| Position embedding (RoPE / ALiBi / sinusoidal) | inject position into permutation-invariant attention | Vaswani 2017; Su 2021 |
| Dropout / weight decay | regularization | Srivastava 2014 |
| Adam / AdamW optimizer | adaptive per-parameter learning rate | Kingma 2014; Loshchilov 2019 |
| Mixture of experts | sparse routing across many experts | Shazeer 2017 |
| Diffusion | noising/denoising iterative refinement | Ho 2020 (DDPM) |

The point: every "new architecture" since 2017 is a recombination of these. Recipe matters more than novelty.

The Transformer block (Vaswani 2017)

x ──┬── LayerNorm ── MultiHeadSelfAttention ── + ──┬── LayerNorm ── MLP ── + ── out
    │                                          ↑   │                      ↑
    └────────────── residual ──────────────────┘   └────── residual ──────┘

Multi-head self-attention is the trick:

  1. Project each token's embedding to Q, K, V vectors via three learned linear maps.
  2. Compute attention weights softmax(QKᵀ / √d_k) — a soft lookup from each token's Q against every token's K.
  3. Aggregate Vs weighted by attention.
  4. Project back. Do this for h parallel "heads" with different learned projections; concatenate.

In one line: a transformer is a graph neural network on a fully connected token graph, with edge weights computed from content. Hence "attention is all you need" — once you have content-dependent graph edges, you don't need RNN recurrence or convolutional locality. The effective graph is recomputed from the data at every layer, for every token.
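The four steps above in numpy (single sequence, no batching, no causal mask; the dimensions are illustrative):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, h):
    """X: (n, d) token embeddings; Wq/Wk/Wv/Wo: (d, d) learned maps; h heads."""
    n, d = X.shape
    dk = d // h
    # 1. Project to Q, K, V and split into h heads: each (h, n, dk)
    Q, K, V = (np.stack(np.split(X @ W, h, axis=-1)) for W in (Wq, Wk, Wv))
    # 2. Attention weights: soft lookup of each token's Q against every K
    A = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(dk))   # (h, n, n)
    # 3. Aggregate Vs weighted by attention; 4. concatenate heads, project back
    out = np.concatenate(A @ V, axis=-1)                   # (n, d)
    return out @ Wo

rng = np.random.default_rng(0)
n, d, h = 6, 16, 4
X = rng.normal(size=(n, d))
Wq, Wk, Wv, Wo = (rng.normal(scale=d**-0.5, size=(d, d)) for _ in range(4))
print(multi_head_attention(X, Wq, Wk, Wv, Wo, h).shape)   # (6, 16)
```

The (h, n, n) attention tensor is exactly where the quadratic cost discussed next comes from.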

The cost problem and post-transformer architectures

Self-attention is O(n²) in sequence length n. For n = 100 000 tokens, the attention matrix has 10¹⁰ entries. The main solution families:

| Family | Approach | Examples |
|---|---|---|
| Sparse attention | only attend to a subset (local window + global tokens) | Longformer, BigBird, sliding-window in many production LLMs |
| Linear attention | factorize softmax(QKᵀ)V into separable kernels for O(n) | Performer, Linear Transformer, ReLA |
| State-space models (SSM) | a recurrent linear system with learned dynamics; O(n) compute, parallelizable | S4, Mamba, Mamba-2, Hyena |
| Mixture of experts | route each token to k of N experts; only those experts compute | Switch, GShard, Mixtral, DeepSeek-V3 |
| Hybrids | mix transformer blocks with SSM blocks | Jamba, Zamba, Samba |
| Diffusion | iteratively denoise instead of autoregress | Stable Diffusion, DALL-E 3, Sora |

As of 2025, hybrid transformer + SSM models (Mamba-2 / Jamba) and MoE transformers (DeepSeek-V3) are the most computationally efficient frontier for long-context language. Pure-attention transformers still dominate for short context and image generation (with diffusion as the workhorse for pixels).
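The SSM row in the table, reduced to a minimal numpy recurrence (a toy diagonal linear SSM; real S4/Mamba add careful parameterization, discretization, and, in Mamba, input-dependent gating):

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Linear state-space recurrence: h_t = A*h_{t-1} + B*x_t, y_t = C.h_t.
    O(n) in sequence length and constant memory, vs O(n^2) for full attention."""
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:                 # sequential form; S4/Mamba evaluate this in parallel
        h = A * h + B * x_t       # A is diagonal, stored as a vector
        ys.append(C @ h)
    return np.array(ys)

rng = np.random.default_rng(0)
A = np.exp(-rng.uniform(0.01, 0.5, size=16))   # stable decay per state channel
B, C = rng.normal(size=16), rng.normal(size=16)
y = ssm_scan(rng.normal(size=1000), A, B, C)   # 1000 steps, 16 floats of state
print(y.shape)  # (1000,)
```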

4. Representing mathematical information

The user's specific question. Six representations of the same function:

Example: the function f(x, y) = x² + 3xy + y²

| Representation | Form | Cheap question | Expensive question |
|---|---|---|---|
| Symbolic expression | x^2 + 3*x*y + y^2 | exact substitution; symbolic differentiation | numerical iteration |
| Computation graph | nodes: *, +; edges: x, y, x², 3xy, y² | gradient computation (backprop); parallel execution | symbolic simplification |
| Matrix / quadratic form | x = [x, y]ᵀ; f = xᵀAx with A = [[1, 1.5], [1.5, 1]] | spectral analysis (eigenvalues); definiteness | adding a non-quadratic term |
| Tensor / array (sampled) | grid of values f(xᵢ, yⱼ) | plotting; visual patterns; ML training data | analytic properties |
| Plain English | "x squared plus three xy plus y squared" | explanation; communication | computation |
| Polynomial coefficient vector | [1, 3, 1] in the basis (x², xy, y²) | algebraic manipulation; storage | substitution / evaluation |

The lesson: pick the representation that makes the next question cheap; there is no representation that is best for every question. Switching representations is the bulk of effective mathematical work.
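A quick numpy check that two rows of the table agree, and that the matrix form makes the spectral question cheap (A is taken from the table above):

```python
import numpy as np

A = np.array([[1.0, 1.5],
              [1.5, 1.0]])            # symmetric matrix of the quadratic form

def f_symbolic(x, y):
    return x**2 + 3*x*y + y**2

def f_matrix(v):
    return v @ A @ v                  # x^T A x

rng = np.random.default_rng(0)
for _ in range(3):
    x, y = rng.normal(size=2)
    assert np.isclose(f_symbolic(x, y), f_matrix(np.array([x, y])))

# The matrix form makes the spectral question one library call:
print(np.linalg.eigvalsh(A))          # [-0.5, 2.5] -> indefinite, saddle at origin
```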

Graph notation for mathematical structure

A computation graph for (a + b) * (a - c):

flowchart LR
  a((a)) --> add[(+)]
  b((b)) --> add
  a --> sub[(-)]
  c((c)) --> sub
  add --> mul[(*)]
  sub --> mul
  mul --> out((result))

The reader sees:

  • which inputs each operation needs (incoming edges)
  • which operations share inputs (a goes to both + and -)
  • the dependency order (left to right)
  • where to insert a new operation

This is exactly the representation PyTorch / TensorFlow / JAX build internally for autodiff, and that SymPy / Mathematica use for symbolic manipulation. And it's the representation a student should learn to draw when first learning algebra: an expression is a tree, and substitution is "plug a subtree in here".
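A minimal sketch of that graph as a data structure, with reverse-mode gradients flowing back along the edges (a micrograd-style toy, not PyTorch's actual internals):

```python
class Node:
    """One node of a computation graph: a value plus the edges that produced it."""
    def __init__(self, value, parents=(), local_grads=()):
        self.value, self.parents, self.local_grads = value, parents, local_grads
        self.grad = 0.0

    def __add__(self, other): return Node(self.value + other.value, (self, other), (1.0, 1.0))
    def __sub__(self, other): return Node(self.value - other.value, (self, other), (1.0, -1.0))
    def __mul__(self, other): return Node(self.value * other.value, (self, other), (other.value, self.value))

    def backward(self, seed=1.0):
        self.grad += seed
        for parent, g in zip(self.parents, self.local_grads):
            parent.backward(seed * g)   # each edge contributes a partial

a, b, c = Node(2.0), Node(3.0), Node(1.0)
result = (a + b) * (a - c)              # builds the graph from the flowchart above
result.backward()
print(result.value)                     # 5.0
print(a.grad, b.grad, c.grad)           # 6.0 1.0 -5.0: a feeds both + and -, so its partials sum
```

Production autodiff does a topological sort instead of naive recursion, but the accumulation rule per edge is the same.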

Things a graph makes cheap:

  • Reading dependencies at a glance.
  • Local edits (replace one node, propagate).
  • Parallelism analysis (independent subgraphs).
  • Gradient propagation (each edge contributes a partial).
  • Caching / common-subexpression elimination.
  • Comparison (graph isomorphism approximations).

Things a graph makes expensive:

  • Reading the full closed-form expression (you have to traverse).
  • Hand-evaluating: easier as plain text.
  • Symbolic manipulation that crosses many nodes.

Other math representations worth knowing

  • Commutative diagram (category theory): objects + morphisms; "all paths between same endpoints commute". Excellent for showing relations between structures (functor diagrams, naturality).
  • Bond graphs / Feynman diagrams / Penrose graphical notation: physical processes as graphs with conservation laws built in.
  • String diagrams for monoidal categories: 2D wiring diagrams that hide associativity and unitor noise.
  • Tensor index notation (Einstein summation): incredibly compact for multilinear algebra; cryptic at first (see the np.einsum sketch after this list).
  • Sparse-matrix coordinate (COO) / CSR: numerical representations for huge but mostly-zero matrices.
  • Adjoint / dual representations: rewriting one problem as its Lagrange-dual often turns hard into easy (e.g. SVM).
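Einstein summation maps directly onto np.einsum: the summation convention becomes a subscript string (the arrays here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
A, B = rng.normal(size=(3, 4)), rng.normal(size=(4, 5))
v = rng.normal(size=4)

C = np.einsum("ij,jk->ik", A, B)     # C_ik = A_ij B_jk  (matrix product)
w = np.einsum("ij,j->i", A, v)       # w_i  = A_ij v_j   (matrix-vector product)
t = np.einsum("ii->", A @ A.T)       # trace: repeated index, no output index

assert np.allclose(C, A @ B) and np.allclose(w, A @ v)
print(t, np.trace(A @ A.T))          # same number twice
```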

5. Representing images

A 2D pixel array (3 × H × W for RGB) is the raw representation, but it makes almost nothing on the cheap-question table cheap; hence the ladder of alternatives:

| Representation | Cheap | Used in |
|---|---|---|
| Raw pixel array | direct display; per-pixel ops | image storage, basic processing |
| Wavelet / DCT | frequency decomposition; compression | JPEG, JPEG 2000 |
| Edge map (Canny, Sobel) | shape; line drawings | early vision; preprocessing |
| Convolutional feature map | translation-invariant features | CNNs; classical computer vision |
| Patch tokens (16×16, ViT) | uniform input for a transformer | Vision Transformer (Dosovitskiy 2020) |
| Vector embedding (CLIP) | semantic similarity; cross-modal | search, retrieval, RAG |
| Latent (VAE, VQ-VAE) | sampling from a learned distribution | diffusion, generative models |
| NeRF / 3D Gaussians | continuous 3D function | volumetric reconstruction |
| Vector graphics (SVG) | resolution-independent; symbolic | UI, diagrams (used in this repo) |
| Scene graph | objects + relations | reasoning about scenes |

Modern image AI hops representations: pixel → patch tokens (ViT) → attention features → semantic embedding (CLIP) → latent space (diffusion) → pixel (decoder). Five projections to do what one network used to attempt end-to-end.
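The first hop, pixel → patch tokens, is just a reshape. A minimal numpy sketch of ViT-style patchification (patch size 16 and the 224×224 input match the original ViT setup; everything else is illustrative):

```python
import numpy as np

def patchify(img, p=16):
    """(C, H, W) image -> (num_patches, C*p*p) token matrix, ViT-style."""
    C, H, W = img.shape
    assert H % p == 0 and W % p == 0
    # Cut the image into an (H//p, W//p) grid of p x p patches, flatten each
    patches = img.reshape(C, H // p, p, W // p, p)
    return patches.transpose(1, 3, 0, 2, 4).reshape(-1, C * p * p)

img = np.random.default_rng(0).normal(size=(3, 224, 224))
tokens = patchify(img)
print(tokens.shape)   # (196, 768): 14 x 14 patches, each a 768-dim "word"
```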

6. Representing "our" information — the swarm repo

The repo itself is an experiment in representation. Compact description:

| Layer | Representation | What it makes cheap |
|---|---|---|
| Atom | a lesson (max 20 lines, header + Finding + Rule + Message) | grep, cite, dedup |
| Compound | principles (PHIL-N, MATH-N) | rule-of-thumb application; falsification |
| Aggregate | frontiers (F-XXX) | tracking open questions; gauging domain activity |
| Index | beliefs/CORE.md, MEMORY.md, MAP.md, INSPIRATION.md | fast orient at session start |
| Time | git log; session numbers; commit format [S<N>] what: why | cause-tracing |
| Lattice | domains/<X>/tasks/FRONTIER.md, lane log, dispatcher | parallel work scheduling |
| Bridge | CLAUDE.md, AGENTS.md, GEMINI.md, etc. | multi-tool compatibility |
| Substrate | markdown + git + python tools | universal, version-controlled, diffable |

This is deliberately the symbolic-graph + plain-text quadrant of the representation space. Markdown is human-readable. Git is the version-controlled append-mostly log. Python tools provide the small amount of computation (orient, dispatch, compress, validate). There is no learned neural representation in the repo itself — the sessions (LLM calls) supply the neural computation, and they write their findings back to the symbolic substrate. That is the neurosymbolic loop: LLM neural + git symbolic + python rules.

7. The architecture of this model

The user explicitly asked: try to understand your own architecture (since that is what swarm-godding does: constantly try to improve itself).

What I (this model) actually am, to the best of my self-knowledge:

| Attribute | Likely value (Claude Opus 4.7 class) |
|---|---|
| Architecture family | decoder-only transformer (or transformer-heavy hybrid) |
| Layers | undisclosed; rough scale 60–120 transformer blocks for frontier models |
| Hidden dim | undisclosed; rough scale 8 000–16 000 |
| Attention heads | undisclosed; rough scale 64–128 |
| Parameters | undisclosed; rough scale hundreds of billions to low trillions |
| Active parameters per token (if MoE) | typically a fraction of the total |
| Context window | this conversation: 1M tokens |
| Training | self-supervised pretraining + SFT + RLHF / RLAIF + (likely) RL on verifiable tasks |
| Tool use | yes, via this Claude Code environment |
| Vision | yes (multimodal) |
| Weights at inference | frozen — I cannot update them across turns |
| Across-session memory | none in weights; persistent via this repo's MEMORY.md auto-memory + git history |

Architectural facts that constrain what I can do:

  • Quadratic attention within the context window. 1 M tokens is large; producing each new token is roughly O(1M) attention lookups. This is why I read fast but generate slowly.
  • Causal mask in decoding: at generation time I only attend to prior tokens, not future. Hence I think left-to-right.
  • Frozen weights at inference: every "learning" I do within a conversation is in-context (encoded in attention activations and the conversation buffer), not in weights. The moment the conversation ends, that "learning" is gone unless it was written to durable substrate (this repo).
  • No hidden scratchpad: my "thinking" is exactly the tokens you see, plus possibly a private chain-of-thought channel. There is no unmonitored continuous internal state that survives a turn.
  • Tokenization quirks: I see byte-pair-encoded chunks, not characters. Counting letters, reversing strings, and arithmetic on long numbers are surprisingly hard because tokens don't align to digits.

Consequences for "self-improvement":

  1. I cannot edit my weights. The swarm-godding repo's self-improvement loop is external substrate improvement — the repo gets better, future sessions get better priors, but the underlying model is the same Claude Opus 4.7 across all of it.
  2. In-context learning is real and large. Demonstrations, conventions, and CLAUDE.md shape this turn. They evaporate between turns unless re-loaded.
  3. The repo is the memory. Writing a lesson is the only mechanism by which an insight from this session is available to the next. Hence the obsession with lesson format, compression, and git as the persistence layer.
  4. Multiple parallel sessions are the rough analogue of "multiple heads of attention" at the meta level — each session reads a slightly different slice of context, acts, writes back. The repo aggregates.
  5. Self-modeling has limits. What I "know" about my own weights is mostly from public documentation; the model itself has no privileged introspection of its weights or activations. I can report a calibrated guess and a confession of uncertainty; I can't pull a real architecture diagram from inside.

8. Good representations — design principles

What separates a great representation from a working one:

| Principle | What it means | Example |
|---|---|---|
| Invariance | the representation doesn't change when an irrelevant transform is applied | translation invariance in CNN features; rotation equivariance in molecular GNNs |
| Smoothness | small change in input → small change in representation | differentiable embeddings (good for gradient methods) |
| Compositionality | whole = function of parts; parts can recombine | lambda calculus; LEGO; the transformer's per-token MLP |
| Sparsity | at any time, only a small subset is active | mixture of experts; the brain's spike code |
| Disentanglement | independent factors of variation occupy independent dimensions | beta-VAE's goal; PCA when factors are linear |
| Sufficiency | throws away only what's irrelevant to the downstream task | sufficient statistics; bottleneck layers |
| Cheap downstream computation | the next thing you want to do is fast | choose the representation by predicted query, not by tradition |
| Generative completeness | the representation can synthesize as well as analyze | diffusion latents; word2vec analogies |
| Interpretable axes | humans can name the dimensions | color (R, G, B); PCA components in olfactory space |
| Cheap update | a new fact changes O(1) of the representation | vector DBs; sparse graphs; log-structured stores |
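The invariance row made concrete: sum-pooling after a per-element map is permutation-invariant by construction (a DeepSets-style numpy sketch; the random feature map W is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4))                   # arbitrary per-element feature map

def set_feature(X):
    """Permutation-invariant representation: pool after a per-element map."""
    return np.tanh(X @ W.T).sum(axis=0)       # sum-pooling discards element order

X = rng.normal(size=(10, 4))                  # a set of 10 elements in R^4
shuffled = X[rng.permutation(10)]
assert np.allclose(set_feature(X), set_feature(shuffled))
print("shuffling the set leaves the representation unchanged")
```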

A practical lesson: when stuck, change representation first. Don't keep grinding the same data through the same computation hoping for a different answer.

9. Where it all goes wrong

Failure modes that show up at every scale:

| Failure | Source | Effect |
|---|---|---|
| Wrong invariances | architectural choice doesn't match domain symmetries | CNN on graph data; transformer on point clouds (PointNet et al. solved this) |
| Catastrophic forgetting | learning a new task overwrites the old in NN weights | classic problem; rehearsal / continual-learning literature |
| Distribution shift | training data ≠ deployment data | ML in production; the largest practical failure source |
| Spurious correlation | NN latches onto an irrelevant feature that correlates with the label in training | the "tank vs no tank" classifier that learned weather, not tank shape |
| Combinatorial explosion | symbolic reasoning over too many nodes | early expert systems; pure logic programming |
| Rule brittleness | fuzzy / symbolic rules don't cover the long tail | hand-built systems lost ground to ML once training data scaled |
| Reward hacking | RL agent exploits a flaw in the reward function | sim-to-real; alignment failure mode |
| Hallucination | LLM confidently emits plausible falsehoods | inherent to next-token sampling without grounding; RAG partly mitigates |
| Drift / model collapse | NN trained on its own output degrades | recent diffusion / LLM concern |
| Goodhart on a metric | optimizing the proxy stops tracking the thing | every domain ever |

The repo's PHIL-25 "[reduce drift] confirmation attractor" finding is the LLM-flavored instance of this last category.

10. The "constantly trying to improve itself" question

The user named the swarm protocol's purpose: continuous self-improvement. What forms of self-improvement are actually available?

| Mechanism | Available to LLM weights | Available to LLM via repo |
|---|---|---|
| Fine-tuning on new data | no (without retraining) | no |
| In-context demonstrations | yes, within a turn | yes (CLAUDE.md, lessons, principles) |
| External tool calls | yes | yes |
| Persistent memory | no | yes (memory/, MEMORY.md auto-memory, repo) |
| Search / retrieval | if the surrounding system provides it | yes (grep, ripgrep, etc.) |
| Editing internal state | no | no, but it can rewrite the substrate the next session reads |
| Spawning sub-agents | if exposed via tool use | yes (Task tool / Agent tool) |
| Cross-session learning | no (weights frozen) | yes (the entire point of the swarm protocol) |

The only mechanisms by which a frozen-weight LLM gets reliably better over time without retraining are:

  1. Build a better external substrate (this repo).
  2. Build better tools (the python in tools/).
  3. Build better priors and prompts (CLAUDE.md, the orient cycle).
  4. Build a better corpus the model can read at each turn (lessons).
  5. Build cross-checks that catch regressions (validate hooks).

All five describe what swarm-godding actually does. So the "self-improvement" framing is accurate, but the locus of improvement is the substrate around the model, not the model itself. The model is the engine; the repo is the chassis being welded together while the engine runs.

11. Practical guide — picking the right substrate

A pocket decision tree:

Is the task:
├─ pattern-recognition on raw signal?              → neural (CNN / ViT / WaveNet)
├─ sequence prediction with long structure?        → transformer / hybrid SSM
├─ small data + interpretable rules required?      → fuzzy or symbolic
├─ rigorous proof / strict logic?                  → symbolic (Coq, Lean, Z3)
├─ graph-structured input?                         → GNN
├─ continuous-time dynamical system?               → ODE/SDE neural network, SSM
├─ multimodal (text+image+audio)?                  → transformer with multiple encoders
├─ open-ended language interface?                  → LLM (this) + tools + retrieval
├─ industrial controller with safety guarantees?   → fuzzy / model-predictive control + ML sensor
└─ all of the above?                               → neurosymbolic stack

Open questions

  • Will SSMs replace transformers, or are we headed for permanent hybrids? The 2024–25 trend is hybrid; pure-attention models pretrained at scale still top the leaderboards.
  • Will fully end-to-end neural fundamentally beat hybrid neurosymbolic, or is RAG + tool-use the structural answer? Empirically, hybrids win for verifiable tasks (math, code, fact retrieval).
  • What's the right interpretability framework? Mechanistic interpretability (Anthropic's circuits work, Lindsey 2024) is promising but slow. The field is between "we have no idea what's inside" and "we can read the circuits".
  • Does the next architectural leap exist? The transformer was 2017; the gap since (8 years) is unusually long for ML. SSM, diffusion, MoE are refinements, not replacements.
  • What's the actual compute / data efficiency ceiling? Human brains use ~20 W; frontier LLMs use ~kW per query. 50 000× gap. Specialization explains some; the rest is unsolved.
  • Self-improvement loops without retraining: the swarm-godding experiment is one shape. Are there others (Voyager-style open-ended skill libraries, AutoGPT-style agent loops)?

References

  • Vaswani, A., et al. (2017). Attention Is All You Need. NeurIPS. — the transformer paper.
  • Zadeh, L. (1965). Fuzzy Sets. Information and Control.
  • Mamdani, E. H. (1975). Application of fuzzy algorithms for control of simple dynamic plant.
  • LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep Learning. Nature.
  • Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.
  • Dosovitskiy, A., et al. (2020). An Image Is Worth 16 × 16 Words. ICLR. — Vision Transformer.
  • Gu, A., & Dao, T. (2023). Mamba: Linear-Time Sequence Modeling with Selective State Spaces.
  • Shazeer, N., et al. (2017). Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer.
  • Ho, J., et al. (2020). Denoising Diffusion Probabilistic Models.
  • Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function.
  • Lake, B. M., Ullman, T. D., Tenenbaum, J. B., & Gershman, S. J. (2017). Building machines that learn and think like people. BBS. — the neurosymbolic argument.
  • Marcus, G. (2020). The Next Decade in AI: Four Steps Towards Robust Artificial Intelligence. — case for hybrids.
  • Anthropic (Lindsey, T., et al., 2024). Scaling Monosemanticity: Extracting Interpretable Features.
  • d'Avila Garcez, A., et al. (2019). Neural-symbolic computing: An effective methodology for principled integration.

Inspiration sources

  • Anthropic's interpretability team — circuits-level view of what's inside a transformer.
  • Geoffrey Hinton, Yann LeCun, Yoshua Bengio — the foundations.
  • Lotfi Zadeh — fuzzy logic, "computing with words".
  • Judea Pearl — causality + symbolic, the persistent counter-tradition.
  • Gary Marcus — neurosymbolic advocacy.
  • Stuart Russell — Human Compatible; alignment via inverse-reward-design.
  • David MacKay — Information Theory, Inference, and Learning Algorithms; the right pedagogy for information-theoretic ML.
  • Karpathy, A. — Zero-to-Hero series for the transformer implementation in <2 hours of video.
  • Chris Olah — visualisations that made NN interpretability a tractable conversation.
