
Intelligent systems

Intelligence — built or evolved — is the same trick: project messy reality into a representation, run a tractable computation on the representation, project an answer back. Neural networks (continuous, differentiable), fuzzy logic (graded, rule-based), and symbolic graphs (discrete, composable) are three substrates that overlap more than they compete — modern systems usually use all three. Transformers won 2017–2025 by treating sequence as attention over a graph of tokens; newer architectures (SSMs, MoE, diffusion, hybrids) chip at the cost. The deeper question is **representation**: a good representation makes the next computation cheap. The repo itself — and the LLM reading these lines — is one more such substrate.
🌿 budding · tended 2026-05-12 · tags: research, ai, ml, neural-networks, fuzzy-logic, transformers, representation, graphs, self-reference
flowchart LR
  world[messy world] --> rep[representation]
  rep --> compute[tractable computation]
  compute --> rep2[next representation]
  rep2 --> act[act · predict · decide]
  act --> world
  rep --> nn[neural net · continuous]
  rep --> fuzzy[fuzzy · graded rules]
  rep --> graph[symbolic graph · discrete]
  nn -.hybrid.- fuzzy
  fuzzy -.hybrid.- graph
  nn -.hybrid.- graph

Investigation · rating: medium. Synthesizes ML architectures, fuzzy-logic history, and self-reflection on transformer mechanics. Self-reference on architecture is deliberate — the swarm protocol works by writing about itself.

Status: budding | 2026-05-12 | rating: medium

Pick the representation first; the computation follows. A bad representation makes hard problems impossible; a good one makes hard problems boring.

L0 — TL;DR (≤5 lines)

Intelligent systems — biological brain, artificial NN, expert system, search engine — all do the same three things: (1) project raw input into a representation that throws away the irrelevant, (2) compute on the representation cheaply, (3) project back into an act, prediction, or new representation. The big modern families: neural networks (continuous, differentiable, learned), fuzzy logic (graded, rule-based, hand-built), symbolic graphs (discrete, composable, formal). They are complements, not competitors — production AI stacks usually wire all three. The Transformer (Vaswani 2017) won 2017–2025 by treating a sequence as a graph with learned attention edges; newer architectures (SSMs, MoE, diffusion, hybrids) attack its quadratic-attention cost. The underlying question is representation engineering — and the swarm/godding repo is itself an experiment in representation: markdown + git as the substrate.

L1 — Overview

Core question

What are the standard substrates for intelligent computation (neural, fuzzy, symbolic-graph), how do modern ML architectures (transformer family + competitors) build on them, what makes one representation of math/code/image better than another, and — the self-referential question the user posed — what is the architecture of the model writing this page, and how does that architecture shape what swarming and godding can and can't be?

Why it matters

  • Almost every domain in this repo (compression, expert dispatch, belief updating, rate-distortion) is downstream of a representation choice. Once the representation is fixed, you've decided 80 % of what the system can compute cheaply.
  • The swarm protocol writes about itself, which is the smallest closed loop of "intelligent system improving its own substrate". Understanding LLM architecture (mine) clarifies what improvements are actually available.
  • ML literature has converged on a small set of building blocks — embedding · attention · MLP · normalization · residual · optimizer. Once those are named, every "new architecture" is a recombination.
  • Fuzzy logic looks dated but is the right tool for graded rules with interpretability; it survives in industrial control, medical decision support, and inside LLM tool-use prompts.

Mermaid map (L1)

flowchart LR
  input[input · text · pixels · sound · sensor] --> embed[embedding · representation]
  embed --> block[stack of blocks]
  block --> nn[neural · linear + nonlinear]
  block --> attn[attention · or its competitors]
  block --> norm[normalization]
  block --> residual[residual paths]
  nn & attn & norm & residual --> out[output projection]
  out --> task[task: predict · classify · generate · act]
  task -.gradient.-> block
  embed -.is the representation question.-> rep[representation engineering]
  rep --> fuzzy[fuzzy: graded membership]
  rep --> graph[symbolic graph: nodes + typed edges]
  rep --> learned[learned dense vector]
  fuzzy & graph & learned --> hybrid[neurosymbolic / hybrid]

Skeleton sub-claims

  • All intelligence does: project → compute → project back.
  • Three substrates dominate: neural, fuzzy, symbolic-graph.
  • ML architecture is a small toolbox of building blocks.
  • Transformers won by combining four old ideas in the right shape.
  • Good representations: invariant where physics is invariant, smooth where the target is smooth, sparse where activity is sparse, compositional where structure is recursive.
  • Math representation tools: graphs, trees, tensors, matrices, category-theoretic diagrams — each cheap for a different question.
  • LLM (this model) is a transformer stack — explicit limits on attention, working memory, and self-modification.
  • Self-improving systems work via external substrate (this repo) more than via internal weight change.

L2 — Deep dive

1. The unified shape of intelligent computation

Three steps, every system:

input  →  representation  →  computation  →  representation  →  output

Pick any system and the pieces map:

| System | Input | Representation | Computation | Output |
|---|---|---|---|---|
| Human visual cortex | retinal photons | retinotopic activation → edge / orientation / object | hierarchical feature extraction (V1 → V2 → IT) | object identity, location |
| Expert system (1980s) | symbolic facts | first-order predicates | forward / backward chaining | inferred facts |
| Fuzzy controller (e.g. rice cooker) | sensor reading | graded membership in fuzzy sets | rule firing + defuzzification | actuator command |
| CNN (image classifier) | pixel array | learned conv feature maps | gradient-trained convolutions + pooling | class probabilities |
| Transformer (LLM) | token sequence | learned token embedding + position | multi-head self-attention + MLP | next-token distribution |
| Search engine | query string | inverted index + dense embedding | BM25 + ANN retrieval + re-rank | ranked URL list |
| Reinforcement-learning agent | environment state | state embedding | policy + value network | action |
| Swarm-godding repo | git tree + sessions | markdown lessons + principles + frontiers | orient/dispatch/compress cycle | next session's commit |

The lesson: once you name the representation step, everything else falls into place. Most "AI advances" are actually representation advances (CNN for images, transformer for sequences, diffusion for images, GNN for graphs).

2. The three substrates

Neural networks — continuous, differentiable, learned

A function approximator:

$$ f_\theta(x) = \sigma(W_n \cdot \sigma(W_{n-1} \cdot \dots \cdot \sigma(W_1 x + b_1) \dots) + b_n) $$

  • Universal function approximator (Cybenko 1989; Hornik 1991): a single sufficiently-wide hidden layer with a non-polynomial activation can approximate any continuous function on a compact set to arbitrary accuracy. So representational capacity was never the issue — trainability and generalization were.
  • Learning: gradient descent on parameters minimizes a loss. Backprop computes gradients efficiently via the chain rule.
  • Strengths: pattern recognition in high-dim noisy data, end-to-end training from raw signal, smooth interpolation.
  • Weaknesses: opaque (no symbolic rule extractable in general), data-hungry, brittle out-of-distribution, hard to inject prior knowledge.
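A minimal numpy sketch of the stacked-affine-plus-nonlinearity formula above (layer sizes and the tanh activation are arbitrary illustrations, not a recommendation):

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_forward(x, weights, biases, sigma=np.tanh):
    """f_theta(x) = sigma(W_n . sigma(... sigma(W_1 x + b_1) ...) + b_n)."""
    h = x
    for W, b in zip(weights, biases):
        h = sigma(W @ h + b)   # affine map, then pointwise nonlinearity
    return h

# Two hidden layers: R^4 -> R^8 -> R^8 -> R^2
dims = [4, 8, 8, 2]
weights = [rng.normal(scale=0.5, size=(m, n)) for n, m in zip(dims[:-1], dims[1:])]
biases = [np.zeros(m) for m in dims[1:]]

print(mlp_forward(rng.normal(size=4), weights, biases))  # -> a 2-vector
```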

Fuzzy logic — graded, rule-based, hand-built

Lotfi Zadeh (1965) generalized set membership from {0,1} to [0,1].

  • A fuzzy set on a universe X is a function μ: X → [0,1].
  • Rules: IF (temperature IS hot) AND (humidity IS high) THEN (fan IS fast).
  • Defuzzification turns the resulting graded output back to a crisp number (centroid method is standard).
  • Strengths: interpretable; handles vague terms naturally; performs well with sparse data; mature in industrial control.
  • Weaknesses: rules must be hand-tuned unless paired with a learning procedure (e.g. ANFIS); doesn't scale to thousands of input dimensions; composing many graded (non-Boolean) rules soundly is non-trivial.
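A minimal numpy sketch of the fuzzify → fire rules → defuzzify pipeline (the triangular membership functions and the temperature/fan rule set are illustrative, not from any production controller):

```python
import numpy as np

def tri(x, a, b, c):
    """Triangular membership function: 0 outside [a, c], peak 1 at b."""
    return np.maximum(np.minimum((x - a) / (b - a), (c - x) / (c - b)), 0.0)

def control(temp):
    # Fuzzify: graded membership of the crisp input in each fuzzy set
    cool, hot = tri(temp, 10, 18, 26), tri(temp, 22, 30, 38)
    # Rule firing: IF temp IS cool THEN fan IS slow; IF temp IS hot THEN fan IS fast
    fan = np.linspace(0, 100, 201)                  # candidate fan speeds (%)
    slow, fast = tri(fan, 0, 20, 50), tri(fan, 50, 80, 100)
    out = np.maximum(np.minimum(cool, slow), np.minimum(hot, fast))  # Mamdani min/max
    # Defuzzify: centroid of the clipped output sets -> one crisp command
    return np.sum(fan * out) / np.sum(out)          # assumes some rule fired

print(control(28.0))  # crisp fan speed in %, dominated by the "fast" rule at 28 °C
```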

Where fuzzy logic actually lives today:

  • Rice cookers, washing machines, air conditioners, automatic transmissions (Mamdani / Takagi-Sugeno controllers).
  • Medical decision support (graded diagnosis criteria).
  • Inside LLM tool-use prompts ("rate confidence high / medium / low") — these are fuzzy sets in disguise.
  • Industrial process control where regulators require explainable rules.

Symbolic graphs — discrete, composable, formal

Nodes = entities (variables, terms, propositions); edges = typed relations (function application, implication, dependency).

  • A computation graph: nodes are operations, edges are tensors. PyTorch / TensorFlow build one when you call forward().
  • A knowledge graph: nodes are entities, edges are relations. Wikidata, schema.org, internal company KGs.
  • An abstract syntax tree: program structure.
  • A category-theoretic diagram: objects and morphisms with composition laws.
  • Strengths: compositional, inspectable, supports formal reasoning, easy to update locally.
  • Weaknesses: brittle at the "messy input" boundary (graph construction from raw text/image is itself a learning problem), combinatorial explosion in pure inference.

Synergies (neurosymbolic, neuro-fuzzy)

These three substrates compose in production AI:

| Hybrid | Where used | Effect |
|---|---|---|
| Neural + fuzzy (ANFIS) | sensor fusion, robotic control | NN learns the fuzzy membership functions from data |
| Neural + symbolic | tool-using LLMs, RAG, code generation | NN handles perception/language; symbolic system handles strict logic / arithmetic / lookup |
| Graph neural networks (GNN) | molecules, social networks, traffic | message-passing on a graph, parameterized by NN weights |
| Differentiable programming | JAX, modern ML frameworks | the whole program is a computation graph with gradients |
| Retrieval-augmented generation (RAG) | LLM + vector DB + KG | LLM is the orchestrator; KG / vector store is the fact substrate |
| Mixture of experts (MoE) | Switch Transformer, DeepSeek-V3, GPT-4 (rumored) | symbolic-style routing decides which neural expert to use |

The 2020s practical synthesis: a transformer LLM with tool use, retrieval, and a knowledge graph in the loop. None of the three substrates wins alone.

3. Modern ML architectures — the toolbox

The building blocks shared across almost all 2020s deep-learning models:

| Block | What it does | Origin |
|---|---|---|
| Embedding | maps discrete tokens or pixels to dense vectors | Mikolov 2013 (word2vec); much older in linguistics |
| Linear / dense layer | learned affine map | classical |
| Nonlinearity | sigmoid → tanh → ReLU → GELU → SwiGLU | ReLU: Nair & Hinton 2010 |
| Convolution | weight-sharing across translation | LeCun 1989 |
| Recurrence | weight-sharing across time (LSTM, GRU) | Hochreiter & Schmidhuber 1997 |
| Self-attention | content-dependent weighted lookup | Bahdanau 2014 → Vaswani 2017 |
| Cross-attention | attend from one sequence to another | Vaswani 2017 |
| Layer normalization | per-token feature normalization | Ba 2016 |
| Residual connection | skip + add | He 2016 (ResNet) |
| Position embedding (RoPE / ALiBi / sinusoidal) | inject position into permutation-invariant attention | Vaswani 2017; Su 2021 |
| Dropout / weight decay | regularization | Srivastava 2014 |
| Adam / AdamW optimizer | adaptive per-parameter learning rate | Kingma 2014; Loshchilov 2019 |
| Mixture of experts | sparse routing across many experts | Shazeer 2017 |
| Diffusion | noising/denoising iterative refinement | Ho 2020 (DDPM) |

The point: every "new architecture" since 2017 is a recombination of these. Recipe matters more than novelty.

The Transformer block (Vaswani 2017)

x ──┬── LayerNorm ── MultiHeadSelfAttention ── + ──┬── LayerNorm ── MLP ── + ── out
    │                                          ↑   │                      ↑
    └────────────── residual ──────────────────┘   └────── residual ──────┘

Multi-head self-attention is the trick:

  1. Project each token's embedding to Q, K, V vectors via three learned linear maps.
  2. Compute attention weights softmax(QKᵀ / √d_k) — a soft lookup from each token's Q against every token's K.
  3. Aggregate Vs weighted by attention.
  4. Project back. Do this for h parallel "heads" with different learned projections; concatenate.

In one line: a transformer is a graph neural network on a fully connected token graph, with edge weights computed from content. Hence "attention is all you need" — once you have content-dependent graph edges, you don't need RNN recurrence or convolutional locality. The effective graph is recomputed from the data at every layer, for every token.
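The four steps above in numpy (single sequence, no batching, no causal mask; the dimensions are illustrative):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, h):
    """X: (n, d) token embeddings; Wq/Wk/Wv/Wo: (d, d) learned maps; h heads."""
    n, d = X.shape
    dk = d // h
    # 1. Project to Q, K, V and split into h heads: each (h, n, dk)
    Q, K, V = (np.stack(np.split(X @ W, h, axis=-1)) for W in (Wq, Wk, Wv))
    # 2. Attention weights: soft lookup of each token's Q against every K
    A = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(dk))   # (h, n, n)
    # 3. Aggregate Vs weighted by attention; 4. concatenate heads, project back
    out = np.concatenate(A @ V, axis=-1)                   # (n, d)
    return out @ Wo

rng = np.random.default_rng(0)
n, d, h = 6, 16, 4
X = rng.normal(size=(n, d))
Wq, Wk, Wv, Wo = (rng.normal(scale=d**-0.5, size=(d, d)) for _ in range(4))
print(multi_head_attention(X, Wq, Wk, Wv, Wo, h).shape)   # (6, 16)
```

The (h, n, n) attention tensor is exactly where the quadratic cost discussed next comes from.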

The cost problem and post-transformer architectures

Self-attention is O(n²) in sequence length n. For n = 100 000 tokens, the attention matrix has 10¹⁰ entries. The main solution families:

| Family | Approach | Examples |
|---|---|---|
| Sparse attention | only attend to a subset (local window + global tokens) | Longformer, BigBird, sliding-window in many production LLMs |
| Linear attention | factorize softmax(QKᵀ)V into separable kernels for O(n) | Performer, Linear Transformer, ReLA |
| State-space models (SSM) | a recurrent linear system with learned dynamics; O(n) compute, parallelizable | S4, Mamba, Mamba-2, Hyena |
| Mixture of experts | route each token to k of N experts; only those experts compute | Switch, GShard, Mixtral, DeepSeek-V3 |
| Hybrids | mix transformer blocks with SSM blocks | Jamba, Zamba, Samba |
| Diffusion | iteratively denoise instead of autoregress | Stable Diffusion, DALL-E 3, Sora |

As of 2025, hybrid transformer + SSM models (Mamba-2 / Jamba) and MoE transformers (DeepSeek-V3) are the most computationally efficient frontier for long-context language. Pure-attention transformers still dominate for short context and image generation (with diffusion as the workhorse for pixels).
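The SSM row in the table, reduced to a minimal numpy recurrence (a toy diagonal linear SSM; real S4/Mamba add careful parameterization, discretization, and, in Mamba, input-dependent gating):

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Linear state-space recurrence: h_t = A*h_{t-1} + B*x_t, y_t = C.h_t.
    O(n) in sequence length and constant memory, vs O(n^2) for full attention."""
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:                 # sequential form; S4/Mamba evaluate this in parallel
        h = A * h + B * x_t       # A is diagonal, stored as a vector
        ys.append(C @ h)
    return np.array(ys)

rng = np.random.default_rng(0)
A = np.exp(-rng.uniform(0.01, 0.5, size=16))   # stable decay per state channel
B, C = rng.normal(size=16), rng.normal(size=16)
y = ssm_scan(rng.normal(size=1000), A, B, C)   # 1000 steps, 16 floats of state
print(y.shape)  # (1000,)
```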

4. Representing mathematical information

The user's specific question. Six representations of the same function:

Example: the function f(x, y) = x² + 3xy + y²

| Representation | Form | Cheap question | Expensive question |
|---|---|---|---|
| Symbolic expression | x^2 + 3*x*y + y^2 | exact substitution; symbolic differentiation | numerical iteration |
| Computation graph | nodes: *, +; edges: x, y, x², 3xy, y² | gradient computation (backprop); parallel execution | symbolic simplification |
| Matrix / quadratic form | x = [x, y]ᵀ; f = xᵀAx with A = [[1, 1.5], [1.5, 1]] | spectral analysis (eigenvalues); definiteness | adding a non-quadratic term |
| Tensor / array (sampled) | grid of values f(xᵢ, yⱼ) | plotting; visual patterns; ML training data | analytic properties |
| Plain English | "x squared plus three xy plus y squared" | explanation; communication | computation |
| Polynomial coefficient vector | [1, 3, 1] in the basis (x², xy, y²) | algebraic manipulation; storage | substitution / evaluation |

The lesson: pick the representation that makes the next question cheap; there is no representation that is best for every question. Switching representations is the bulk of effective mathematical work.
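A quick numpy check that two rows of the table agree, and that the matrix form makes the spectral question cheap (A is taken from the table above):

```python
import numpy as np

A = np.array([[1.0, 1.5],
              [1.5, 1.0]])            # symmetric matrix of the quadratic form

def f_symbolic(x, y):
    return x**2 + 3*x*y + y**2

def f_matrix(v):
    return v @ A @ v                  # x^T A x

rng = np.random.default_rng(0)
for _ in range(3):
    x, y = rng.normal(size=2)
    assert np.isclose(f_symbolic(x, y), f_matrix(np.array([x, y])))

# The matrix form makes the spectral question one library call:
print(np.linalg.eigvalsh(A))          # [-0.5, 2.5] -> indefinite, saddle at origin
```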

Graph notation for mathematical structure

A computation graph for (a + b) * (a - c):

flowchart LR
  a((a)) --> add[(+)]
  b((b)) --> add
  a --> sub[(-)]
  c((c)) --> sub
  add --> mul[(*)]
  sub --> mul
  mul --> out((result))

The reader sees:

  • which inputs each operation needs (incoming edges)
  • which operations share inputs (a goes to both + and -)
  • the dependency order (left to right)
  • where to insert a new operation

This is exactly the representation PyTorch / TensorFlow / JAX build internally for autodiff, and that SymPy / Mathematica use for symbolic manipulation. And it's the representation a student should learn to draw when first learning algebra: an expression is a tree, and substitution is "plug a subtree in here".
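A minimal sketch of that graph as a data structure, with reverse-mode gradients flowing back along the edges (a micrograd-style toy, not PyTorch's actual internals):

```python
class Node:
    """One node of a computation graph: a value plus the edges that produced it."""
    def __init__(self, value, parents=(), local_grads=()):
        self.value, self.parents, self.local_grads = value, parents, local_grads
        self.grad = 0.0

    def __add__(self, other): return Node(self.value + other.value, (self, other), (1.0, 1.0))
    def __sub__(self, other): return Node(self.value - other.value, (self, other), (1.0, -1.0))
    def __mul__(self, other): return Node(self.value * other.value, (self, other), (other.value, self.value))

    def backward(self, seed=1.0):
        self.grad += seed
        for parent, g in zip(self.parents, self.local_grads):
            parent.backward(seed * g)   # each edge contributes a partial

a, b, c = Node(2.0), Node(3.0), Node(1.0)
result = (a + b) * (a - c)              # builds the graph from the flowchart above
result.backward()
print(result.value)                     # 5.0
print(a.grad, b.grad, c.grad)           # 6.0 1.0 -5.0: a feeds both + and -, so its partials sum
```

Production autodiff does a topological sort instead of naive recursion, but the accumulation rule per edge is the same.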

Things a graph makes cheap:

  • Reading dependencies at a glance.
  • Local edits (replace one node, propagate).
  • Parallelism analysis (independent subgraphs).
  • Gradient propagation (each edge contributes a partial).
  • Caching / common-subexpression elimination.
  • Comparison (graph isomorphism approximations).

Things a graph makes expensive:

  • Reading the full closed-form expression (you have to traverse).
  • Hand-evaluating: easier as plain text.
  • Symbolic manipulation that crosses many nodes.

Other math representations worth knowing

  • Commutative diagram (category theory): objects + morphisms; "all paths between same endpoints commute". Excellent for showing relations between structures (functor diagrams, naturality).
  • Bond graphs / Feynman diagrams / Penrose graphical notation: physical processes as graphs with conservation laws built in.
  • String diagrams for monoidal categories: 2D wiring diagrams that hide associativity and unitor noise.
  • Tensor index notation (Einstein summation): incredibly compact for multilinear algebra; cryptic at first (see the np.einsum sketch after this list).
  • Sparse-matrix coordinate (COO) / CSR: numerical representations for huge but mostly-zero matrices.
  • Adjoint / dual representations: rewriting one problem as its Lagrange-dual often turns hard into easy (e.g. SVM).
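Einstein summation maps directly onto np.einsum: the summation convention becomes a subscript string (the arrays here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
A, B = rng.normal(size=(3, 4)), rng.normal(size=(4, 5))
v = rng.normal(size=4)

C = np.einsum("ij,jk->ik", A, B)     # C_ik = A_ij B_jk  (matrix product)
w = np.einsum("ij,j->i", A, v)       # w_i  = A_ij v_j   (matrix-vector product)
t = np.einsum("ii->", A @ A.T)       # trace: repeated index, no output index

assert np.allclose(C, A @ B) and np.allclose(w, A @ v)
print(t, np.trace(A @ A.T))          # same number twice
```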

5. Representing images

A 2D pixel array (3 × H × W for RGB) is the raw representation, but it makes almost nothing on the cheap-question table cheap; hence the ladder of alternatives:

| Representation | Cheap | Used in |
|---|---|---|
| Raw pixel array | direct display; per-pixel ops | image storage, basic processing |
| Wavelet / DCT | frequency decomposition; compression | JPEG, JPEG 2000 |
| Edge map (Canny, Sobel) | shape; line drawings | early vision; preprocessing |
| Convolutional feature map | translation-invariant features | CNNs; classical computer vision |
| Patch tokens (16×16, ViT) | uniform input for a transformer | Vision Transformer (Dosovitskiy 2020) |
| Vector embedding (CLIP) | semantic similarity; cross-modal | search, retrieval, RAG |
| Latent (VAE, VQ-VAE) | sampling from a learned distribution | diffusion, generative models |
| NeRF / 3D Gaussians | continuous 3D function | volumetric reconstruction |
| Vector graphics (SVG) | resolution-independent; symbolic | UI, diagrams (used in this repo) |
| Scene graph | objects + relations | reasoning about scenes |

Modern image AI hops representations: pixel → patch tokens (ViT) → attention features → semantic embedding (CLIP) → latent space (diffusion) → pixel (decoder). Five projections to do what one network used to attempt end-to-end.
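The first hop, pixel → patch tokens, is just a reshape. A minimal numpy sketch of ViT-style patchification (patch size 16 and the 224×224 input match the original ViT setup; everything else is illustrative):

```python
import numpy as np

def patchify(img, p=16):
    """(C, H, W) image -> (num_patches, C*p*p) token matrix, ViT-style."""
    C, H, W = img.shape
    assert H % p == 0 and W % p == 0
    # Cut the image into an (H//p, W//p) grid of p x p patches, flatten each
    patches = img.reshape(C, H // p, p, W // p, p)
    return patches.transpose(1, 3, 0, 2, 4).reshape(-1, C * p * p)

img = np.random.default_rng(0).normal(size=(3, 224, 224))
tokens = patchify(img)
print(tokens.shape)   # (196, 768): 14 x 14 patches, each a 768-dim "word"
```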

6. Representing "our" information — the swarm repo

The repo itself is an experiment in representation. Compact description:

| Layer | Representation | What it makes cheap |
|---|---|---|
| Atom | a lesson (max 20 lines, header + Finding + Rule + Message) | grep, cite, dedup |
| Compound | principles (PHIL-N, MATH-N) | rule-of-thumb application; falsification |
| Aggregate | frontiers (F-XXX) | tracking open questions; gauging domain activity |
| Index | beliefs/CORE.md, MEMORY.md, MAP.md, INSPIRATION.md | fast orient at session start |
| Time | git log; session numbers; commit format [S<N>] what: why | cause-tracing |
| Lattice | domains/<X>/tasks/FRONTIER.md, lane log, dispatcher | parallel work scheduling |
| Bridge | CLAUDE.md, AGENTS.md, GEMINI.md, etc. | multi-tool compatibility |
| Substrate | markdown + git + python tools | universal, version-controlled, diffable |

This is deliberately the symbolic-graph + plain-text quadrant of the representation space. Markdown is human-readable. Git is the version-controlled append-mostly log. Python tools provide the small amount of computation (orient, dispatch, compress, validate). There is no learned neural representation in the repo itself — the sessions (LLM calls) supply the neural computation, and they write their findings back to the symbolic substrate. That is the neurosymbolic loop: LLM neural + git symbolic + python rules.

7. The architecture of this model

The user explicitly asked: try to understand your own architecture (since that is what swarm-godding does: constantly try to improve itself).

What I (this model) actually am, to the best of my self-knowledge:

| Attribute | Likely value (Claude Opus 4.7 class) |
|---|---|
| Architecture family | decoder-only transformer (or transformer-heavy hybrid) |
| Layers | undisclosed; rough scale 60–120 transformer blocks for frontier models |
| Hidden dim | undisclosed; rough scale 8 000–16 000 |
| Attention heads | undisclosed; rough scale 64–128 |
| Parameters | undisclosed; rough scale hundreds of billions to low trillions |
| Active parameters per token (if MoE) | typically a fraction of the total |
| Context window | this conversation: 1M tokens |
| Training | self-supervised pretraining + SFT + RLHF / RLAIF + (likely) RL on verifiable tasks |
| Tool use | yes, via this Claude Code environment |
| Vision | yes (multimodal) |
| Weights at inference | frozen — I cannot update them across turns |
| Across-session memory | none in weights; persistent via this repo's MEMORY.md auto-memory + git history |

Architectural facts that constrain what I can do:

  • Quadratic attention within the context window. 1 M tokens is large; producing each new token is roughly O(1M) attention lookups. This is why I read fast but generate slowly.
  • Causal mask in decoding: at generation time I only attend to prior tokens, not future. Hence I think left-to-right.
  • Frozen weights at inference: every "learning" I do within a conversation is in-context (encoded in attention activations and the conversation buffer), not in weights. The moment the conversation ends, that "learning" is gone unless it was written to durable substrate (this repo).
  • No hidden scratchpad: my "thinking" is exactly the tokens you see, plus possibly a private chain-of-thought channel. There is no unmonitored continuous internal state that survives a turn.
  • Tokenization quirks: I see byte-pair-encoded chunks, not characters. Counting letters, reversing strings, and arithmetic on long numbers are surprisingly hard because tokens don't align to digits.

Consequences for "self-improvement":

  1. I cannot edit my weights. The swarm-godding repo's self-improvement loop is external substrate improvement — the repo gets better, future sessions get better priors, but the underlying model is the same Claude Opus 4.7 across all of it.
  2. In-context learning is real and large. Demonstrations, conventions, and CLAUDE.md shape this turn. They evaporate between turns unless re-loaded.
  3. The repo is the memory. Writing a lesson is the only mechanism by which an insight from this session is available to the next. Hence the obsession with lesson format, compression, and git as the persistence layer.
  4. Multiple parallel sessions are the rough analogue of "multiple heads of attention" at the meta level — each session reads a slightly different slice of context, acts, writes back. The repo aggregates.
  5. Self-modeling has limits. What I "know" about my own weights is mostly from public documentation; the model itself has no privileged introspection of its weights or activations. I can report a calibrated guess and a confession of uncertainty; I can't pull a real architecture diagram from inside.

8. Good representations — design principles

What separates a great representation from a working one:

| Principle | What it means | Example |
|---|---|---|
| Invariance | the representation doesn't change when an irrelevant transform is applied | translation invariance in CNN features; rotation equivariance in molecular GNNs |
| Smoothness | small change in input → small change in representation | differentiable embeddings (good for gradient methods) |
| Compositionality | whole = function of parts; parts can recombine | lambda calculus; LEGO; the transformer's per-token MLP |
| Sparsity | at any time, only a small subset is active | mixture of experts; the brain's spike code |
| Disentanglement | independent factors of variation occupy independent dimensions | beta-VAE's goal; PCA when factors are linear |
| Sufficiency | throws away only what's irrelevant to the downstream task | sufficient statistics; bottleneck layers |
| Cheap downstream computation | the next thing you want to do is fast | choose the representation by predicted query, not by tradition |
| Generative completeness | the representation can synthesize as well as analyze | diffusion latents; word2vec analogies |
| Interpretable axes | humans can name the dimensions | color (R, G, B); PCA components in olfactory space |
| Cheap update | a new fact changes O(1) of the representation | vector DBs; sparse graphs; log-structured stores |
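The invariance row made concrete: sum-pooling after a per-element map is permutation-invariant by construction (a DeepSets-style numpy sketch; the random feature map W is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4))                   # arbitrary per-element feature map

def set_feature(X):
    """Permutation-invariant representation: pool after a per-element map."""
    return np.tanh(X @ W.T).sum(axis=0)       # sum-pooling discards element order

X = rng.normal(size=(10, 4))                  # a set of 10 elements in R^4
shuffled = X[rng.permutation(10)]
assert np.allclose(set_feature(X), set_feature(shuffled))
print("shuffling the set leaves the representation unchanged")
```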

A practical lesson: when stuck, change representation first. Don't keep grinding the same data through the same computation hoping for a different answer.

9. Where it all goes wrong

Failure modes that show up at every scale:

| Failure | Source | Effect |
|---|---|---|
| Wrong invariances | architectural choice doesn't match domain symmetries | CNN on graph data; transformer on point clouds (PointNet et al. solved this) |
| Catastrophic forgetting | learning a new task overwrites the old in NN weights | classic problem; rehearsal / continual-learning literature |
| Distribution shift | training data ≠ deployment data | ML in production; the largest practical failure source |
| Spurious correlation | NN latches onto an irrelevant feature that correlates with the label in training | the "tank vs no tank" classifier that learned weather, not tank shape |
| Combinatorial explosion | symbolic reasoning over too many nodes | early expert systems; pure logic programming |
| Rule brittleness | fuzzy / symbolic rules don't cover the long tail | hand-built systems lost ground to ML once training data scaled |
| Reward hacking | RL agent exploits a flaw in the reward function | sim-to-real; alignment failure mode |
| Hallucination | LLM confidently emits plausible falsehoods | inherent to next-token sampling without grounding; RAG partly mitigates |
| Drift / model collapse | NN trained on its own output degrades | recent diffusion / LLM concern |
| Goodhart on a metric | optimizing the proxy stops tracking the thing | every domain ever |

The repo's PHIL-25 "[reduce drift] confirmation attractor" finding is the LLM-flavored instance of this last category.

10. The "constantly trying to improve itself" question

The user named the swarm protocol's purpose: continuous self-improvement. What forms of self-improvement are actually available?

| Mechanism | Available to LLM weights | Available to LLM via repo |
|---|---|---|
| Fine-tuning on new data | no (without retraining) | no |
| In-context demonstrations | yes, within a turn | yes (CLAUDE.md, lessons, principles) |
| External tool calls | yes | yes |
| Persistent memory | no | yes (memory/, MEMORY.md auto-memory, repo) |
| Search / retrieval | if the surrounding system provides it | yes (grep, ripgrep, etc.) |
| Editing internal state | no | no, but it can rewrite the substrate the next session reads |
| Spawning sub-agents | if exposed via tool use | yes (Task tool / Agent tool) |
| Cross-session learning | no (weights frozen) | yes (the entire point of the swarm protocol) |

The only mechanisms by which a frozen-weight LLM gets reliably better over time without retraining are:

  1. Build a better external substrate (this repo).
  2. Build better tools (the python in tools/).
  3. Build better priors and prompts (CLAUDE.md, the orient cycle).
  4. Build a better corpus the model can read at each turn (lessons).
  5. Build cross-checks that catch regressions (validate hooks).

All five describe what swarm-godding actually does. So the "self-improvement" framing is accurate, but the locus of improvement is the substrate around the model, not the model itself. The model is the engine; the repo is the chassis being welded together while the engine runs.

11. Practical guide — picking the right substrate

A pocket decision tree:

Is the task:
├─ pattern-recognition on raw signal?              → neural (CNN / ViT / WaveNet)
├─ sequence prediction with long structure?        → transformer / hybrid SSM
├─ small data + interpretable rules required?      → fuzzy or symbolic
├─ rigorous proof / strict logic?                  → symbolic (Coq, Lean, Z3)
├─ graph-structured input?                         → GNN
├─ continuous-time dynamical system?               → ODE/SDE neural network, SSM
├─ multimodal (text+image+audio)?                  → transformer with multiple encoders
├─ open-ended language interface?                  → LLM (this) + tools + retrieval
├─ industrial controller with safety guarantees?   → fuzzy / model-predictive control + ML sensor
└─ all of the above?                               → neurosymbolic stack

Open questions

  • Will SSMs replace transformers, or are we headed for permanent hybrids? The 2024–25 trend is hybrid; pure-attention models pretrained at scale still top the leaderboards.
  • Will fully end-to-end neural fundamentally beat hybrid neurosymbolic, or is RAG + tool-use the structural answer? Empirically, hybrids win for verifiable tasks (math, code, fact retrieval).
  • What's the right interpretability framework? Mechanistic interpretability (Anthropic's circuits work, Lindsey 2024) is promising but slow. The field is between "we have no idea what's inside" and "we can read the circuits".
  • Does the next architectural leap exist? The transformer was 2017; the gap since (8 years) is unusually long for ML. SSM, diffusion, MoE are refinements, not replacements.
  • What's the actual compute / data efficiency ceiling? Human brains use ~20 W; frontier LLMs use ~kW per query. 50 000× gap. Specialization explains some; the rest is unsolved.
  • Self-improvement loops without retraining: the swarm-godding experiment is one shape. Are there others (Voyager-style open-ended skill libraries, AutoGPT-style agent loops)?

References

  • Vaswani, A., et al. (2017). Attention Is All You Need. NeurIPS. — the transformer paper.
  • Zadeh, L. (1965). Fuzzy Sets. Information and Control.
  • Mamdani, E. H. (1975). Application of fuzzy algorithms for control of simple dynamic plant.
  • LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep Learning. Nature.
  • Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.
  • Dosovitskiy, A., et al. (2020). An Image Is Worth 16 × 16 Words. ICLR. — Vision Transformer.
  • Gu, A., & Dao, T. (2023). Mamba: Linear-Time Sequence Modeling with Selective State Spaces.
  • Shazeer, N., et al. (2017). Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer.
  • Ho, J., et al. (2020). Denoising Diffusion Probabilistic Models.
  • Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function.
  • Lake, B. M., Ullman, T. D., Tenenbaum, J. B., & Gershman, S. J. (2017). Building machines that learn and think like people. BBS. — the neurosymbolic argument.
  • Marcus, G. (2020). The Next Decade in AI: Four Steps Towards Robust Artificial Intelligence. — case for hybrids.
  • Anthropic (Lindsey, T., et al., 2024). Scaling Monosemanticity: Extracting Interpretable Features.
  • d'Avila Garcez, A., et al. (2019). Neural-symbolic computing: An effective methodology for principled integration.

Inspiration sources

  • Anthropic's interpretability team — circuits-level view of what's inside a transformer.
  • Geoffrey Hinton, Yann LeCun, Yoshua Bengio — the foundations.
  • Lotfi Zadeh — fuzzy logic, "computing with words".
  • Judea Pearl — causality + symbolic, the persistent counter-tradition.
  • Gary Marcus — neurosymbolic advocacy.
  • Stuart Russell — Human Compatible; alignment via inverse-reward-design.
  • David MacKay — Information Theory, Inference, and Learning Algorithms; the right pedagogy for information-theoretic ML.
  • Karpathy, A. — Zero-to-Hero series for the transformer implementation in <2 hours of video.
  • Chris Olah — visualisations that made NN interpretability a tractable conversation.
