# Intelligent systems

```mermaid
flowchart LR
    world[messy world] --> rep[representation]
    rep --> compute[tractable computation]
    compute --> rep2[next representation]
    rep2 --> act[act · predict · decide]
    act --> world
    rep --> nn[neural net · continuous]
    rep --> fuzzy[fuzzy · graded rules]
    rep --> graph[symbolic graph · discrete]
    nn -.hybrid.- fuzzy
    fuzzy -.hybrid.- graph
    nn -.hybrid.- graph
```
- brain structure — the evolved intelligence comparator
- humans as generators — generative-not-retrieval framing
- electron management — compute as an energy ledger
- swarm rate-distortion — the repo's compression substrate
Investigation · rating: medium. Synthesizes ML architectures, fuzzy-logic history, and self-reflection on transformer mechanics. Self-reference on architecture is deliberate — the swarm protocol works by writing about itself.
Status: budding | 2026-05-12 | rating: medium
Pick the representation first; the computation follows. A bad representation makes hard problems impossible; a good one makes hard problems boring.
## L0 — TL;DR (≤5 lines)
Intelligent systems — biological brain, artificial NN, expert system, search engine — all do the same three things: (1) project raw input into a representation that throws away the irrelevant, (2) compute on the representation cheaply, (3) project back into an act, prediction, or new representation. The big modern families: neural networks (continuous, differentiable, learned), fuzzy logic (graded, rule-based, hand-built), symbolic graphs (discrete, composable, formal). They are complements, not competitors — production AI stacks usually wire all three. The Transformer (Vaswani 2017) won 2017–2025 by treating a sequence as a graph with learned attention edges; newer architectures (SSMs, MoE, diffusion, hybrids) attack its quadratic-attention cost. The underlying question is representation engineering — and the swarm/godding repo is itself an experiment in representation: markdown + git as the substrate.
## L1 — Overview

### Core question
What are the standard substrates for intelligent computation (neural, fuzzy, symbolic-graph), how do modern ML architectures (transformer family + competitors) build on them, what makes one representation of math/code/image better than another, and — the self-referential question the user posed — what is the architecture of the model writing this page, and how does that architecture shape what swarming and godding can and can't be?
### Why it matters
- Almost every domain in this repo (compression, expert dispatch, belief updating, rate-distortion) is downstream of a representation choice. Once the representation is fixed, you've decided 80 % of what the system can compute cheaply.
- The swarm protocol writes about itself, which is the smallest closed loop of "intelligent system improving its own substrate". Understanding LLM architecture (mine) clarifies what improvements are actually available.
- ML literature has converged on a small set of building blocks — embedding · attention · MLP · normalization · residual · optimizer. Once those are named, every "new architecture" is a recombination.
- Fuzzy logic looks dated but is the right tool for graded rules with interpretability; it survives in industrial control, medical decision support, and inside LLM tool-use prompts.
### Mermaid map (L1)

```mermaid
flowchart LR
    input[input · text · pixels · sound · sensor] --> embed[embedding · representation]
    embed --> block[stack of blocks]
    block --> nn[neural · linear + nonlinear]
    block --> attn[attention · or its competitors]
    block --> norm[normalization]
    block --> residual[residual paths]
    nn & attn & norm & residual --> out[output projection]
    out --> task[task: predict · classify · generate · act]
    task -.gradient.-> block
    embed -.is the representation question.-> rep[representation engineering]
    rep --> fuzzy[fuzzy: graded membership]
    rep --> graph[symbolic graph: nodes + typed edges]
    rep --> learned[learned dense vector]
    fuzzy & graph & learned --> hybrid[neurosymbolic / hybrid]
```
### Skeleton sub-claims
- All intelligence does: project → compute → project back.
- Three substrates dominate: neural, fuzzy, symbolic-graph.
- ML architecture is a small toolbox of building blocks.
- Transformers won by combining four old ideas in the right shape.
- Good representations: invariant where physics is invariant, smooth where the target is smooth, sparse where activity is sparse, compositional where structure is recursive.
- Math representation tools: graphs, trees, tensors, matrices, category-theoretic diagrams — each cheap for a different question.
- LLM (this model) is a transformer stack — explicit limits on attention, working memory, and self-modification.
- Self-improving systems work via external substrate (this repo) more than via internal weight change.
## L2 — Deep dive

### 1. The unified shape of intelligent computation

Three steps, every system: (1) project the raw input into a representation that discards the irrelevant, (2) compute on that representation cheaply, (3) project the result back out as an act, prediction, or new representation.

Pick any system and the pieces map:
| System | Input | Representation | Computation | Output |
|---|---|---|---|---|
| Human visual cortex | retinal photons | retinotopic activation → edge / orientation / object | hierarchical feature extraction (V1 → V2 → IT) | object identity, location |
| Expert system (1980s) | symbolic facts | first-order predicates | forward / backward chaining | inferred facts |
| Fuzzy controller (e.g. rice cooker) | sensor reading | graded membership in fuzzy sets | rule firing + defuzzification | actuator command |
| CNN (image classifier) | pixel array | learned conv feature maps | gradient-trained convolutions + pooling | class probabilities |
| Transformer (LLM) | token sequence | learned token embedding + position | multi-head self-attention + MLP | next-token distribution |
| Search engine | query string | inverted index + dense embedding | BM25 + ANN retrieval + re-rank | ranked URL list |
| Reinforcement-learning agent | environment state | state embedding | policy + value network | action |
| Swarm-godding repo | git tree + sessions | markdown lessons + principles + frontiers | orient/dispatch/compress cycle | next session's commit |
The lesson: once you name the representation step, everything else falls into place. Most "AI advances" are actually representation advances (CNN for images, transformer for sequences, diffusion for images, GNN for graphs).
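A minimal sketch of the three-step shape in Python (the names `project`, `compute`, `project_back` are illustrative, not drawn from any of the systems above):

```python
# Toy sketch of project -> compute -> project back.
def intelligent_step(project, compute, project_back):
    """Compose the three steps into a single input -> output map."""
    return lambda raw: project_back(compute(project(raw)))

# Toy instantiation: classify a string by length.
classify = intelligent_step(
    project=len,                                     # representation: one integer
    compute=lambda n: n > 10,                        # computation: a cheap threshold
    project_back=lambda b: "long" if b else "short", # act / predict / decide
)
assert classify("a messy world of raw input") == "long"
```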
### 2. The three substrates

#### Neural networks — continuous, differentiable, learned
A function approximator:
$$ f_\theta(x) = \sigma(W_n \cdot \sigma(W_{n-1} \cdot \dots \cdot \sigma(W_1 x + b_1) \dots) + b_n) $$
- Universal function approximator (Cybenko 1989; Hornik 1991): a single sufficiently-wide hidden layer with a non-polynomial activation can approximate any continuous function on a compact set to arbitrary accuracy. So representational capacity was never the issue — trainability and generalization were.
- Learning: gradient descent on parameters minimizes a loss. Backprop computes gradients efficiently via the chain rule.
- Strengths: pattern recognition in high-dim noisy data, end-to-end training from raw signal, smooth interpolation.
- Weaknesses: opaque (no symbolic rule extractable in general), data-hungry, brittle out-of-distribution, hard to inject prior knowledge.
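A minimal NumPy sketch of the f_θ above — two layers, tanh as σ, toy shapes, random weights standing in for learned ones:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigma(z):
    return np.tanh(z)          # the nonlinearity

# x in R^4, one hidden layer of width 8, output in R^2.
W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)
W2, b2 = rng.normal(size=(2, 8)), np.zeros(2)

def f_theta(x):
    return sigma(W2 @ sigma(W1 @ x + b1) + b2)

print(f_theta(np.ones(4)))     # a smooth, differentiable map R^4 -> R^2
```

Training would wrap `f_theta` in a loss and run gradient descent on `W1, b1, W2, b2`; the forward pass is all the formula itself specifies.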
#### Fuzzy logic — graded, rule-based, hand-built
Lotfi Zadeh (1965) generalized set membership from {0,1} to [0,1].
- A fuzzy set on a universe X is a function μ: X → [0,1].
- Rules: `IF (temperature IS hot) AND (humidity IS high) THEN (fan IS fast)`.
- Defuzzification turns the resulting graded output back into a crisp number (the centroid method is standard).
- Strengths: interpretable; handles vague terms naturally; performs well with sparse data; mature in industrial control.
- Weaknesses: rules don't adapt to data unless you bolt on a learning procedure (ANFIS does exactly this); doesn't scale to thousands of dimensions; non-Boolean composition is non-trivial.
Where fuzzy logic actually lives today:
- Rice cookers, washing machines, air conditioners, automatic transmissions (Mamdani / Takagi-Sugeno controllers).
- Medical decision support (graded diagnosis criteria).
- Inside LLM tool-use prompts ("rate confidence high / medium / low") — these are fuzzy sets in disguise.
- Industrial process control where regulators require explainable rules.
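To make the rule pipeline concrete, a toy Mamdani-style step in Python — triangular memberships, min for AND, centroid defuzzification. All set boundaries are invented for illustration:

```python
import numpy as np

def tri(x, a, b, c):
    """Triangular membership: 0 at a and c, 1 at b."""
    return np.maximum(np.minimum((x - a) / (b - a), (c - x) / (c - b)), 0.0)

temp, humidity = 31.0, 0.75              # crisp sensor readings
hot  = tri(temp, 25, 35, 45)             # mu_hot(31)  = 0.6
high = tri(humidity, 0.5, 0.8, 1.1)      # mu_high(0.75) ≈ 0.83

# Rule: IF temperature IS hot AND humidity IS high THEN fan IS fast.
firing = min(hot, high)                  # AND as min (Mamdani)

speeds  = np.linspace(0, 1, 101)         # universe of fan speeds
fast    = tri(speeds, 0.5, 1.0, 1.5)     # consequent fuzzy set "fast"
clipped = np.minimum(fast, firing)       # rule output, clipped at firing strength

fan = (speeds * clipped).sum() / clipped.sum()   # centroid defuzzification
print(round(float(fan), 3))              # crisp actuator command
```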
#### Symbolic graphs — discrete, composable, formal
Nodes = entities (variables, terms, propositions); edges = typed relations (function application, implication, dependency).
- A computation graph: nodes are operations, edges are tensors. PyTorch / TensorFlow build one when you call forward().
- A knowledge graph: nodes are entities, edges are relations. Wikidata, schema.org, internal company KGs.
- An abstract syntax tree: program structure.
- A category-theoretic diagram: objects and morphisms with composition laws.
- Strengths: compositional, inspectable, supports formal reasoning, easy to update locally.
- Weaknesses: brittle at the "messy input" boundary (graph construction from raw text/image is itself a learning problem), combinatorial explosion in pure inference.
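A minimal sketch of "nodes + typed edges" as a plain triple store — enough to show why inspection and local update are cheap (the facts are toy examples):

```python
# A knowledge graph as a set of (subject, relation, object) triples.
triples = {
    ("water", "boils_at", "100C"),
    ("100C", "equals", "373K"),
    ("water", "is_a", "liquid"),
}

def neighbors(node, relation):
    """Follow one typed edge out of a node."""
    return [o for (s, r, o) in triples if s == node and r == relation]

# Local update is O(1): one new fact, nothing else touched.
triples.add(("water", "freezes_at", "0C"))
print(neighbors("water", "boils_at"))    # ['100C']
```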
#### Synergies (neurosymbolic, neuro-fuzzy)
These three substrates compose in production AI:
| Hybrid | Where used | Effect |
|---|---|---|
| Neural + fuzzy (ANFIS) | sensor fusion, robotic control | NN learns the fuzzy membership functions from data |
| Neural + symbolic | tool-using LLMs, RAG, code generation | NN handles perception/language; symbolic system handles strict logic / arithmetic / lookup |
| Graph neural networks (GNN) | molecules, social networks, traffic | message-passing on a graph, parameterized by NN weights |
| Differentiable programming | Jax, modern ML frameworks | the whole program is a computation graph with gradient |
| Retrieval-augmented generation (RAG) | LLM + vector DB + KG | LLM is the orchestrator; KG / vector store is fact substrate |
| Mixture of experts (MoE) | Switch Transformer, DeepSeek-V3, GPT-4 (rumored) | symbolic-style routing decides which neural expert to use |
The 2020s practical synthesis: a transformer LLM with tool use, retrieval, and a knowledge graph in the loop. None of the three substrates wins alone.
### 3. Modern ML architectures — the toolbox
The building blocks shared across almost all 2020s deep-learning models:
| Block | What it does | Origin |
|---|---|---|
| Embedding | maps discrete tokens or pixels to dense vectors | Mikolov 2013 (word2vec); much older in linguistics |
| Linear / dense layer | learned affine map | classical |
| Nonlinearity | sigmoid → tanh → ReLU → GELU → SwiGLU | ReLU: Nair & Hinton 2010 |
| Convolution | weight-sharing across translation | LeCun 1989 |
| Recurrence | weight-sharing across time (LSTM, GRU) | Hochreiter 1997 |
| Self-attention | content-dependent weighted lookup | Bahdanau 2014 → Vaswani 2017 |
| Cross-attention | attend from one sequence to another | Vaswani 2017 |
| Layer normalization | per-token feature normalization | Ba 2016 |
| Residual connection | skip + add | He 2016 (ResNet) |
| Position embedding (RoPE / ALiBi / sinusoidal) | inject position into permutation-invariant attention | Vaswani 2017; Su 2021 |
| Dropout / weight decay | regularization | Srivastava 2014 |
| Adam / AdamW optimizer | adaptive per-parameter learning rate | Kingma 2014; Loshchilov 2019 |
| Mixture of experts | sparse routing across many experts | Shazeer 2017 |
| Diffusion noising/denoising | iterative refinement | Ho 2020 (DDPM) |
The point: every "new architecture" since 2017 is a recombination of these. Recipe matters more than novelty.
#### The Transformer block (Vaswani 2017)

```
x ─► LayerNorm ─► MultiHeadSelfAttention ─► (+) ─► LayerNorm ─► MLP ─► (+) ─► out
│                                            ▲ │                        ▲
└──────────────── residual ──────────────────┘ └─────── residual ───────┘
```
Multi-head self-attention is the trick:
- Project each token's embedding to Q, K, V vectors via three learned linear maps.
- Compute attention weights `softmax(QKᵀ / √d_k)` — a soft lookup from each token's Q against every token's K.
- Aggregate the Vs weighted by attention.
- Project back. Do this for h parallel "heads" with different learned projections; concatenate.
In one line: a transformer is a graph neural network on a fully connected token graph, with edge weights computed from content. Hence "attention is all you need" — once you have content-dependent graph edges, you don't need RNN recurrence or convolutional locality. The graph structure emerges from the data per token.
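A single attention head in NumPy, following the recipe above — toy shapes, random matrices standing in for the learned projections:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_model, d_k = 5, 16, 8             # 5 tokens, toy dimensions

X  = rng.normal(size=(n, d_model))     # token embeddings
Wq = rng.normal(size=(d_model, d_k))   # the three learned linear maps
Wk = rng.normal(size=(d_model, d_k))
Wv = rng.normal(size=(d_model, d_k))

Q, K, V = X @ Wq, X @ Wk, X @ Wv
scores = Q @ K.T / np.sqrt(d_k)        # content-dependent edge weights, n x n

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax

out = weights @ V                      # each token aggregates all the Vs
print(out.shape)                       # (5, 8); h heads = h of these, concatenated
```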
#### The cost problem and post-transformer architectures

Self-attention is O(n²) in sequence length n. For n = 100 000 tokens, the attention matrix has 10¹⁰ entries. The main solution families:
| Family | Approach | Examples |
|---|---|---|
| Sparse attention | only attend to a subset (local window + global tokens) | Longformer, BigBird, Sliding-window in many production LLMs |
| Linear attention | factorize softmax(QKᵀ)V into separable kernels for O(n) | Performer, Linear Transformer, ReLA |
| State-space models (SSM) | a recurrent linear system with learned dynamics; O(n) compute, parallelizable | S4, Mamba, Mamba-2; Hyena |
| Mixture of experts | route each token to k of N experts; only those experts compute | Switch, GShard, Mixtral, DeepSeek-V3 |
| Hybrids | mix transformer blocks with SSM blocks | Jamba, Zamba, Samba |
| Diffusion | iteratively denoise instead of autoregress | Stable Diffusion, DALL-E 3, Sora |
As of 2025, hybrid transformer + SSM models (Mamba-2 / Jamba) and MoE transformers (DeepSeek-V3) are the most computationally efficient frontier for long-context language. Pure-attention transformers still dominate for short context and image generation (with diffusion as the workhorse for pixels).
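Back-of-envelope arithmetic for the cost claim — full attention vs a sliding window (the window width 4096 is an arbitrary illustrative choice):

```python
def full_attention_entries(n):
    return n * n                   # every token attends to every token

def sliding_window_entries(n, w):
    return n * w                   # each token attends only to its last w tokens

n = 100_000
print(f"{full_attention_entries(n):.1e}")         # 1.0e+10
print(f"{sliding_window_entries(n, 4096):.1e}")   # 4.1e+08 — roughly 24x fewer
```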
### 4. Representing mathematical information

The user's specific question. Six representations of the same math:
Example: the function f(x, y) = x² + 3xy + y²
| Representation | Form | Cheap question | Expensive question |
|---|---|---|---|
| Symbolic expression | `x^2 + 3*x*y + y^2` | exact substitution; symbolic differentiation | numerical iteration |
| Computation graph | nodes: `*`, `+`; edges: x, y, x², 3xy, y² | gradient computation (backprop); parallel execution | symbolic simplification |
| Matrix / quadratic form | v = [x, y]ᵀ; f = vᵀAv with A = [[1, 1.5], [1.5, 1]] | spectral analysis (eigenvalues); definiteness | adding a non-quadratic term |
| Tensor / array (sampled) | grid of values f(xᵢ, yⱼ) | plotting; visual pattern; ML training data | analytic properties |
| Plain English | "x squared plus three xy plus y squared" | explanation; communication | computation |
| Polynomial coefficient vector | [1, 3, 1] in a monomial basis | algebraic manipulation; storage | substitution / evaluation |
The lesson: pick the representation that makes your next question cheap, not the question you have already answered. Switching representations is the bulk of effective mathematical work.
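The switch is mechanical once you see it. Two of the table's representations of the same f in Python — SymPy for the symbolic question, NumPy for the spectral one:

```python
import numpy as np
import sympy as sp

x, y = sp.symbols("x y")
f = x**2 + 3*x*y + y**2
print(sp.diff(f, x))            # symbolic: differentiation is cheap -> 2*x + 3*y

A = np.array([[1.0, 1.5],
              [1.5, 1.0]])      # quadratic form: f = v^T A v
print(np.linalg.eigvalsh(A))    # matrix: definiteness is cheap -> [-0.5, 2.5], a saddle
```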
#### Graph notation for mathematical structure

A computation graph for (a + b) * (a - c):

```mermaid
flowchart LR
    a((a)) --> add[(+)]
    b((b)) --> add
    a --> sub[(-)]
    c((c)) --> sub
    add --> mul[(*)]
    sub --> mul
    mul --> out((result))
```
The reader sees:

- which inputs each operation needs (incoming edges)
- which operations share inputs (a goes to both + and −)
- the dependency order (left to right)
- the place to insert a new operation
This is exactly the representation PyTorch / TensorFlow / JAX / SymPy / Mathematica all use internally for autodiff and symbolic manipulation. And it's the representation a student should learn to draw when first learning algebra: an expression is a tree, and substitution is "plug a subtree in here".
Cheap things a graph makes cheap:
- Reading dependencies at a glance.
- Local edits (replace one node, propagate).
- Parallelism analysis (independent subgraphs).
- Gradient propagation (each edge contributes a partial).
- Caching / common-subexpression elimination.
- Comparison (graph isomorphism approximations).
Cheap things a graph makes expensive:
- Reading the full closed-form expression (you have to traverse).
- Hand evaluation (plain text is easier to scan).
- Symbolic manipulation that crosses many nodes.
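Reverse-mode differentiation on exactly this graph, written out by hand — each edge contributes a partial, and a's two outgoing edges sum:

```python
a, b, c = 2.0, 3.0, 5.0

add = a + b             # forward pass: one value per node
sub = a - c
out = add * sub         # (a + b) * (a - c) = 5 * -3 = -15

d_out = 1.0             # backward pass: walk the edges in reverse
d_add = d_out * sub     # d out / d add = sub
d_sub = d_out * add     # d out / d sub = add
d_a = d_add + d_sub     # a feeds both + and -, so its partials sum
d_b = d_add
d_c = -d_sub            # sub = a - c, so d sub / d c = -1

print(d_a, d_b, d_c)    # 2.0 -3.0 -5.0
```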
#### Other math representations worth knowing
- Commutative diagram (category theory): objects + morphisms; "all paths between same endpoints commute". Excellent for showing relations between structures (functor diagrams, naturality).
- Bond / Feynman / Penrose graphical notation: physical processes as graphs with conservation built in.
- String diagrams for monoidal categories: 2D wiring diagrams that hide associativity and unitor noise.
- Tensor index notation (Einstein summation): incredibly compact for multilinear algebra; cryptic at first.
- Sparse-matrix coordinate (COO) / CSR: numerical representations for huge but mostly-zero matrices.
- Adjoint / dual representations: rewriting one problem as its Lagrange-dual often turns hard into easy (e.g. SVM).
### 5. Representing images

A 2D pixel array (3 × H × W) is the raw representation, but it makes almost nothing on the cheap-question table cheap.
| Representation | Cheap | Used in |
|---|---|---|
| Raw pixel array | direct display; per-pixel ops | image storage, basic processing |
| Wavelet / DCT | frequency decomposition; compression | JPEG, JPEG 2000 |
| Edge map (Canny, Sobel) | shape; line drawing | early vision; preprocessing |
| Convolutional feature map | translation-invariant features | CNN; classical computer vision |
| Patch tokens (16×16 ViT) | uniform input for transformer | Vision Transformer (Dosovitskiy 2020) |
| Vector embedding (CLIP) | semantic similarity; cross-modal | search, retrieval, RAG |
| Latent (VAE, VQ-VAE) | sampling from learned distribution | diffusion, generative |
| NeRF / 3D Gaussian | continuous 3D function | volumetric reconstruction |
| Vector graphics (SVG) | resolution-independent; symbolic | UI, diagrams (used in this repo) |
| Scene graph | objects + relations | reasoning about scenes |
Modern image AI hops representations: pixel → patch tokens (ViT) → attention features → semantic embedding (CLIP) → latent space (diffusion) → pixel (decoder). Five projections to do what one network used to attempt end-to-end.
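The first hop, pixels → patch tokens, is a pure reshape. A NumPy sketch of ViT-style patchification at the standard 224 × 224 / 16 × 16 sizes:

```python
import numpy as np

H = W = 224
P = 16                                  # patch size
img = np.zeros((H, W, 3))               # raw pixel array

patches = (img
           .reshape(H // P, P, W // P, P, 3)
           .transpose(0, 2, 1, 3, 4)    # group the two patch-grid axes together
           .reshape(-1, P * P * 3))     # one flat vector per patch

print(patches.shape)                    # (196, 768): 14 x 14 tokens for a transformer
```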
### 6. Representing "our" information — the swarm repo
The repo itself is an experiment in representation. Compact description:
| Layer | Representation | What it makes cheap |
|---|---|---|
| Atom | a lesson (max 20 lines, header + Finding + Rule + Message) | grep, cite, dedup |
| Compound | principles (PHIL-N, MATH-N) | rule-of-thumb application; falsification |
| Aggregate | frontiers (F-XXX) | tracking open questions; gauging domain activity |
| Index | `beliefs/CORE.md`, `MEMORY.md`, `MAP.md`, `INSPIRATION.md` | fast orient at session start |
| Time | git log; session numbers; commit format `[S<N>] what: why` | cause-tracing |
| Lattice | `domains/<X>/tasks/FRONTIER.md`, lane log, dispatcher | parallel work scheduling |
| Bridge | `CLAUDE.md`, `AGENTS.md`, `GEMINI.md`, etc. | multi-tool compatibility |
| Substrate | markdown + git + python tools | universal, version-controlled, diffable |
This is deliberately the symbolic-graph + plain-text quadrant of the representation space. Markdown is human-readable. Git is the version-controlled append-mostly log. Python tools provide the small amount of computation (orient, dispatch, compress, validate). There is no learned neural representation in the repo itself — the sessions (LLM calls) supply the neural computation, and they write their findings back to the symbolic substrate. That is the neurosymbolic loop: LLM neural + git symbolic + python rules.
### 7. The architecture of this model

The user explicitly asked: try to understand your own architecture — it is what swarm-godding does, constantly trying to improve itself.
What I (this model) actually am, to the best of my self-knowledge:
| Attribute | Likely value (Claude Opus 4.7 class) |
|---|---|
| Architecture family | decoder-only transformer (or transformer-heavy hybrid) |
| Layers | undisclosed; rough scale 60–120 transformer blocks for frontier models |
| Hidden dim | undisclosed; rough scale 8 000–16 000 |
| Attention heads | undisclosed; rough scale 64–128 |
| Parameters | undisclosed; rough scale of hundreds of billions to low trillions |
| Active parameters per token (if MoE) | typically a fraction of total |
| Context window | this conversation: 1M tokens |
| Training | self-supervised pretraining + SFT + RLHF / RLAIF + (likely) RL on verifiable tasks |
| Tool-use | yes, via this Claude Code environment |
| Vision | yes (multimodal) |
| Weights at inference | frozen — I cannot update them across turns |
| Across-session memory | none in weights; persistent via this repo's MEMORY.md auto-memory + git history |
Architectural facts that constrain what I can do:
- Quadratic attention within the context window. 1 M tokens is large; producing each new token is roughly O(1M) attention lookups. This is why I read fast but generate slowly.
- Causal mask in decoding: at generation time I only attend to prior tokens, not future. Hence I think left-to-right.
- Frozen weights at inference: every "learning" I do within a conversation is in-context (encoded in attention activations and the conversation buffer), not in weights. The moment the conversation ends, that "learning" is gone unless it was written to durable substrate (this repo).
- No hidden scratchpad: my "thinking" is exactly the tokens you see, plus possibly a private chain-of-thought channel. There is no unmonitored continuous internal state that survives a turn.
- Tokenization quirks: I see byte-pair-encoded chunks, not characters. Counting letters, reversing strings, and arithmetic on long numbers are surprisingly hard because tokens don't align to digits.
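An illustration of the tokenization point — a toy BPE with an invented merge table (real vocabularies are learned and their merges differ):

```python
merges = [("s", "t"), ("st", "r"), ("a", "w"), ("str", "aw"),
          ("e", "r"), ("er", "r"), ("err", "y"), ("b", "erry")]

def toy_bpe(word):
    """Apply merges greedily, as BPE does at encode time."""
    toks = list(word)
    for a, b in merges:
        i = 0
        while i < len(toks) - 1:
            if toks[i] == a and toks[i + 1] == b:
                toks[i:i + 2] = [a + b]   # fuse the pair, recheck this position
            else:
                i += 1
    return toks

print(toy_bpe("strawberry"))   # ['straw', 'berry'] — 2 tokens, not 10 letters
```

A model that sees `['straw', 'berry']` has to infer the letter count rather than read it off.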
Consequences for "self-improvement":
- I cannot edit my weights. The swarm-godding repo's self-improvement loop is external substrate improvement — the repo gets better, future sessions get better priors, but the underlying model is the same Claude Opus 4.7 across all of it.
- In-context learning is real and large. Demonstrations, conventions, and `CLAUDE.md` shape this turn. They evaporate between turns unless re-loaded.
- The repo is the memory. Writing a lesson is the only mechanism by which an insight from this session is available to the next. Hence the obsession with lesson format, compression, and git as the persistence layer.
- Multiple parallel sessions are the rough analogue of "multiple heads of attention" at the meta level — each session reads a slightly different slice of context, acts, writes back. The repo aggregates.
- Self-modeling has limits. What I "know" about my own weights is mostly from public documentation; the model itself has no privileged introspection of its weights or activations. I can report a calibrated guess and a confession of uncertainty; I can't pull a real architecture diagram from inside.
### 8. Good representations — design principles
What separates a great representation from a working one:
| Principle | What it means | Example |
|---|---|---|
| Invariance | The representation doesn't change when an irrelevant transform is applied. | Translation-invariance in CNN features; rotation-equivariance in molecular GNN. |
| Smoothness | Small change in input → small change in representation. | Differentiable embeddings (good for gradient methods). |
| Compositionality | Whole = function of parts; parts can recombine. | Lambda calculus; LEGO; transformer's MLP per token. |
| Sparsity | At any time, only a small subset is active. | Mixture of experts; the brain's spike code. |
| Disentanglement | Independent factors of variation occupy independent dimensions. | Beta-VAE goal; PCA when factors are linear. |
| Sufficiency | Throws away only what's irrelevant to the downstream task. | Sufficient statistic; bottleneck layer. |
| Cheap downstream computation | The next thing you want to do is fast. | Choose representation by predicted query, not by tradition. |
| Generative completeness | The representation can synthesize as well as analyze. | Diffusion latent; word2vec analogies. |
| Interpretable axes | Humans can name dimensions. | Color (R, G, B); principal components of olfactory space. |
| Cheap update | A new fact changes O(1) of the representation. | Vector DBs; sparse graphs; log-structured stores. |
A practical lesson: when stuck, change representation first. Don't keep grinding the same data through the same computation hoping for a different answer.
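One of these principles, runnable — a translation moves every pixel, but a histogram (a crude bag-of-pixels representation) is unchanged. Toy arrays, not a real vision model:

```python
import numpy as np

img = np.zeros((8, 8))
img[2, 2] = 1.0                                     # a single bright pixel
shifted = np.roll(img, shift=(0, 3), axis=(0, 1))   # same content, translated

print(np.allclose(img, shifted))                    # False: raw pixels change
print(np.allclose(np.histogram(img, bins=4)[0],
                  np.histogram(shifted, bins=4)[0]))  # True: the histogram is invariant
```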
### 9. Where it all goes wrong
Failure modes that show up at every scale:
| Failure | Source | Effect |
|---|---|---|
| Wrong invariances | architectural choice doesn't match domain symmetries | CNN on graph data; transformer on point cloud (PointNet et al solved this) |
| Catastrophic forgetting | learning new task overwrites old in NN weights | classic problem; rehearsal / continual learning literature |
| Distribution shift | training data ≠ deployment data | ML in production; the largest practical failure source |
| Spurious correlation | NN latches onto irrelevant feature that correlates with label in train | "tank vs no tank" classifier learned weather, not tank shape |
| Combinatorial explosion | symbolic reasoning over too many nodes | early expert systems; pure logic programming |
| Rule brittleness | fuzzy / symbolic rules don't cover the long tail | hand-built systems lost ground to ML as training data scaled |
| Reward hacking | RL agent exploits flaw in reward function | sim-to-real, alignment failure mode |
| Hallucination | LLM confidently emits plausible falsehood | inherent to next-token sampling without grounding; RAG mitigates partly |
| Drift / model collapse | NN trained on its own output degrades | recent diffusion / LLM concern |
| Goodhart on a metric | optimizing the proxy stops tracking the thing | every domain ever |
The repo's PHIL-25 "[reduce drift] confirmation attractor" finding is just the LLM-flavor of this last category.
10. The "constantly trying to improve itself" question¶
The user named the swarm protocol's purpose: continuous self-improvement. What forms of self-improvement are actually available?
| Mechanism | Available to LLM weights | Available to LLM via repo |
|---|---|---|
| Fine-tuning on new data | No (weights are frozen at inference) | No |
| In-context demonstrations | Yes, within turn | Yes (CLAUDE.md, lessons, principles) |
| External tool calls | Yes | Yes |
| Persistent memory | No | Yes (memory/, MEMORY.md auto-memory, repo) |
| Search / retrieval | If the surrounding system provides it | Yes (grep, ripgrep, etc.) |
| Editing internal state | No | No, but can rewrite the substrate the next session reads |
| Spawning sub-agents | If exposed via tool-use | Yes (Task tool / Agent tool) |
| Cross-session learning | No (weights frozen) | Yes (the entire point of the swarm protocol) |
The only mechanisms by which a frozen-weight LLM gets reliably better over time without retraining are:
- Build a better external substrate (this repo).
- Build better tools (the python in `tools/`).
- Build better priors and prompts (`CLAUDE.md`, the orient cycle).
- Build a better corpus the model can read at each turn (lessons).
- Build cross-checks that catch regressions (validate hooks).
All five describe what swarm-godding actually does. So the "self-improvement" framing is accurate, but the locus of improvement is the substrate around the model, not the model itself. The model is the engine; the repo is the chassis being welded together while the engine runs.
### 11. Practical guide — picking the right substrate
A pocket decision tree:
```
Is the task:
├─ pattern-recognition on raw signal?           → neural (CNN / ViT / WaveNet)
├─ sequence prediction with long structure?     → transformer / hybrid SSM
├─ small data + interpretable rules required?   → fuzzy or symbolic
├─ rigorous proof / strict logic?               → symbolic (Coq, Lean, Z3)
├─ graph-structured input?                      → GNN
├─ continuous-time dynamical system?            → ODE/SDE neural network, SSM
├─ multimodal (text+image+audio)?               → transformer with multiple encoders
├─ open-ended language interface?               → LLM (this) + tools + retrieval
├─ industrial controller with safety guarantees? → fuzzy / model-predictive control + ML sensor
└─ all of the above?                            → neurosymbolic stack
```
## Open questions
- Will SSMs replace transformers, or are we headed for a permanent hybrid? The 2024–25 trend is hybrid; pure-attention models still sit at the top of most leaderboards.
- Will fully end-to-end neural systems eventually beat hybrid neurosymbolic ones, or is RAG + tool-use the structural answer? Empirically, hybrids win for verifiable tasks (math, code, fact retrieval).
- What's the right interpretability framework? Mechanistic interpretability (Anthropic's circuits work; Templeton 2024) is promising but slow. The field is between "we have no idea what's inside" and "we can read the circuits".
- Does the next architectural leap exist? The transformer was 2017; the gap since (8 years) is unusually long for ML. SSM, diffusion, MoE are refinements, not replacements.
- What's the actual compute / data efficiency ceiling? A human brain runs on ~20 W; serving a frontier LLM draws kilowatts, and per-task energy estimates put the gap at several orders of magnitude. Specialization explains some of it; the rest is unsolved.
- Self-improvement loops without retraining: the swarm-godding experiment is one shape. Are there others (Voyager-style open-ended skill libraries, AutoGPT-style agent loops)?
## References
- Vaswani, A., et al. (2017). Attention Is All You Need. NeurIPS. — the transformer paper.
- Zadeh, L. (1965). Fuzzy Sets. Information and Control.
- Mamdani, E. H. (1975). Application of fuzzy algorithms for control of simple dynamic plant.
- LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep Learning. Nature.
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.
- Dosovitskiy, A., et al. (2020). An Image Is Worth 16 × 16 Words. ICLR. — Vision Transformer.
- Gu, A., & Dao, T. (2023). Mamba: Linear-Time Sequence Modeling with Selective State Spaces.
- Shazeer, N., et al. (2017). Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer.
- Ho, J., et al. (2020). Denoising Diffusion Probabilistic Models.
- Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function.
- Lake, B. M., Ullman, T. D., Tenenbaum, J. B., & Gershman, S. J. (2017). Building machines that learn and think like people. BBS. — the neurosymbolic argument.
- Marcus, G. (2020). The Next Decade in AI: Four Steps Towards Robust Artificial Intelligence. — case for hybrids.
- Templeton, A., et al. (2024). Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet. Anthropic.
- d'Avila Garcez, A., et al. (2019). Neural-symbolic computing: An effective methodology for principled integration.
## Inspiration sources
- Anthropic's interpretability team — circuits-level view of what's inside a transformer.
- Geoffrey Hinton, Yann LeCun, Yoshua Bengio — the foundations.
- Lotfi Zadeh — fuzzy logic, "computing with words".
- Judea Pearl — causality + symbolic, the persistent counter-tradition.
- Gary Marcus — neurosymbolic advocacy.
- Stuart Russell — Human Compatible; alignment via inverse-reward-design.
- David MacKay — Information Theory, Inference, and Learning Algorithms; the right pedagogy for information-theoretic ML.
- Karpathy, A. — Zero-to-Hero series for the transformer implementation in <2 hours of video.
- Chris Olah — visualisations that made NN interpretability a tractable conversation.
## See also

- `BRAIN-STRUCTURE` — the evolved comparator.
- `HUMANS-AS-GENERATORS` — the brain as generator, not retriever.
- `BRAIN-MEMORY-MANAGEMENT` — the closest biological analogue to context window + retrieval.
- `ELECTRON-MANAGEMENT` — compute as energy ledger; the W/query question.
- `MIXTURES` — combinatorial representations of smell/taste; smell embeddings as a worked example.
- `UNIVERSE-EVOLUTION-AS-COMPRESSION` — representation and compression as universe-scale themes.
- `BUREAUCRACY-AND-AI` — the systemic effect of AI at scale.
- `../SWARM-RATE-DISTORTION.md` — the repo's compression substrate.
- `../SWARM-CATEGORY-THEORY.md` — symbolic graph view of the swarm.
- `../ISOMORPHISM-ATLAS.md` — the cross-domain pattern atlas.