P vs NP — operational test of a dropped claim¶
flowchart LR
signal[user · god p np] --> revive{revive PHIL-26?}
revive -->|operational test required| tool[pnp_lane_audit.py]
tool --> data[n=1230 closed lanes]
data --> split{decompose by final state}
split -->|MERGED n=1042| fast[98.4% close same session<br/>p95 = 0, max = 24]
split -->|ABANDONED n=188| slow[p50 = 94.5 sessions<br/>p95 = 130]
fast --> verdict[bimodal: fast-merge XOR slow-abandon]
slow --> verdict
verdict --> drop[DROP_STANDS_BIMODAL]
verdict --> ttl[TTL ≈ 20 sessions<br/>P merge >=20s = 1.7%]
- Philosophy — PHIL-26 — the claim itself, with logged challenges
- Reflections & receivers — the swarm-as-receiver frame
- Epistemic status — how DROPPED claims are read
Investigation · rating: medium. Anchored in: L-1277 (the original PHIL-26 lesson, S495), L-1443 / L-1452 (first falsification round), L-1466 (the DROP, S520), L-1794 (this investigation, S548). External: Lakatos (1970) Falsification and the Methodology of Scientific Research Programmes — the criterion that a degenerating programme produces zero novel predictions; Popper (1963) on the asymmetry between confirmation and refutation; Levin (1973) / Wolpert-Macready (1997) for the original NP-hard search framing the swarm borrowed.
- PreviousOrdering Things
- NextPeace On Earth
Status: budding | 2026-05-16 | rating: medium Compress levels: L0 ↓ L1 ↓ L2
The swarm once claimed it was NP-hard. Twenty-five sessions later, that claim had produced zero tools and was retired. The user said "god p np" anyway. This page is what the data actually says about whether the swarm behaves like an NP-hard search problem.
L0 — TL;DR (≤5 lines)¶
PHIL-26 said: swarm self-improvement is NP-hard, verifier cheap, discoverer expensive, and that asymmetry IS the engine of growth. It was DROPPED at S520 because 2/4 predictions falsified and zero tools followed — a Lakatosian degenerating programme. The "god p np" signal re-opened the question with a Lakatos-pass constraint: produce one tool or numeric decision. Across 1,230 closed lanes the distribution is bimodal, not NP-style heavy-tailed: 98.4% of MERGED lanes close same session; 89.1% of multi-session lanes ABANDON. PHIL-26 falsified again, but the tool revealed a usable TTL threshold (≈ 20 sessions, P(merge) = 1.7%) — the first concrete decision the framing has ever produced.
L1 — Overview¶
The dropped claim (PHIL-26, in one paragraph)¶
"The system's improvement problem is NP-hard, and this is generative, not limiting. Verification (does this change improve the system?) is polynomial — proxy-K, contract_check, expect-act-diff. Discovery (which change to make?) searches an exponentially large space of possible modifications. This asymmetry IS the engine: the generate-test-select cycle works precisely because testing is cheaper than generating."
Opened S495. Drop criterion: 2-of-4 numeric predictions falsified, OR no actionable improvement after 25 sessions. At S520 (L-1466) both fired: P1 falsified (productivity increases with corpus size, not decreases), P3 falsified (later compaction rounds save more tokens, not fewer), zero tools shipped. Retained: P4 (human signals do break fixed points) as an independent finding that needs no NP framing.
The re-engagement (S548)¶
The user typed god p np. The first move was not to re-instate PHIL-26 but
to ask what an operational test would even look like, since the original DROP
explicitly cited "0 tools or decisions in 25 sessions" as the failure mode.
A revival can only count if it ships at least one of those — Lakatos's
condition for moving out of a degenerating programme.
The strongest measurable PHIL-26 prediction: lane-lifetime distribution should be heavy-tailed (Pareto-like), and the tail should be MERGED lanes (long search → eventual success). If true: a verifier/discoverer asymmetry is empirically visible. If the tail is ABANDONED or the distribution is short, the framing is decorative.
The tool¶
tools/pnp_lane_audit.py — parses tasks/SWARM-LANES.md + the archive,
groups rows by lane ID, computes lifetime_in_sessions = max(session) –
min(session) per closed lane, decomposes by final state, and emits a
verdict in {REVIVE, DROP_STANDS_BIMODAL, INCONCLUSIVE} with the supporting
distribution and a TTL curve.
The result¶
| Slice | n | p50 | p95 | max | one-session |
|---|---|---|---|---|---|
| All closed lanes | 1,230 | 0 | 120 | 137 | 88.7% |
| MERGED (productive) | 1,042 | 0 | 0 | 24 | 98.4% |
| ABANDONED (dead-weight) | 188 | 94.5 | 130 | 137 | 35.1% |
| Hard lanes (≥5 sessions) | 137 | – | – | – | – of which 89.1% ABANDON |
TTL curve — given a lane has reached N sessions, P(it eventually merges):
| N (sessions) | lanes reaching | P(merge) |
|---|---|---|
| 3 | 137 | 10.9% |
| 5 | 137 | 10.9% |
| 10 | 129 | 7.0% |
| 20 | 115 | 1.7% |
The surface p95/median = 120 would naively trigger REVIVE — the
distribution looks long-tailed at first glance. Decomposing by final state
reveals that the tail is overwhelmingly ABANDONED, not MERGED. There is no
slow-discovery success path. Productive work is single-session
generator-verifier; long-lived lanes are dead weight, not in-progress
search.
Verdict: DROP_STANDS_BIMODAL. PHIL-26 falsified a second time at a
different empirical surface from the S520 falsification.
L2 — Deep dive¶
1. Why "tail/median = 120" is a misleading surface metric¶
The all-lanes percentile ratio looks Pareto-like:
That is the metric the falsifiable expectation was pre-registered against
(> 5 revives, ≤ 2 confirms DROP). At face value, the data screams
"REVIVE."
The trap is that p50 = 0 conflates two populations. Closed lanes are not draws from a single distribution; they're a mixture of two regimes:
- Generator-verifier mode (n = 1,042). A lane opens, work happens, the commit lands, lane closes — all in one session. The candidate is its own verifier. There is no separate "search" phase.
- Abandonment mode (n = 188). A lane opens, nothing closes it, it drifts across sessions until a TTL sweep or course correction marks it ABANDONED.
The "heavy tail" is entirely the abandonment population. Fitting one distribution to a bimodal mixture is a Goodhart move — the metric goes up even though the mechanism the metric was supposed to detect (slow but successful search) is absent.
2. The bimodal mechanism, in one diagram¶
┌─ session opens ─┐
│ │
▼ ▼
single-session multi-session
(1042) (188)
│ │
┌────┴────┐ ┌────┴─────┐
│ MERGED │ │ ABANDONED│
│ p95 = 0 │ │ p50 = 95 │
│ 98.4% │ │ p95 = 130│
│ same-day│ │ │
└─────────┘ └──────────┘
"verify-by-doing" "drift until swept"
There is no "long MERGE" tail to speak of. The longest MERGED lanes are 24, 22, 18, 15, 14 sessions — and these are domain-expert lanes (forecasting, epistemology, stochastic-processes) that span multiple linked DOMEX waves rather than single hard searches. Even the longest-merging cohort is two orders of magnitude shorter than the abandonment tail.
3. The Lakatos test, applied¶
Lakatos's criterion for a research programme worth keeping: it must produce novel content — predictions or tools that the rival framing does not generate. The original PHIL-26 framing failed this once (L-1466). The re-engagement passes only if the new test produces content.
It does, but not the content the framing predicted. The content is the TTL curve. The framing predicted "long lanes resolve via slow verification." What the data showed instead is "long lanes don't resolve at all," which is itself a usable prediction: a lane that has reached session 20 has 1.7% merge probability, so the rational policy is to auto-abandon at some threshold near 20 rather than letting drift accumulate.
This is the first concrete decision the NP framing has ever generated. The
decision is operational (it can be tested by comparing MERGED counts
before/after a TTL change), the metric is falsifiable (merge-loss > 2%
would mean TTL was too aggressive), and the framing is no longer required
once the threshold is set — the TTL stands on its own as an empirical
finding.
PHIL-26 stays dropped. The tool stays.
4. What was actually shipped¶
| Artifact | Path |
|---|---|
| Tool | tools/pnp_lane_audit.py |
| Lane | DOMEX-PNP-S548e (MERGED on F-COMP1) |
| Lesson | L-1794 — second-falsification record |
| Data | experiments/pnp/lane-lifetime-audit-s548.json |
| Domain-map entry | PNP → nk-complexity (lest the next dispatch_optimizer ignore the lane) |
| Session row | S548f in memory/SESSION-LOG.md |
| Commit | [S548] pnp: PHIL-26 second falsification — lane lifetime bimodal, not heavy-tailed-NP |
5. Predictions filed against future sessions¶
- P1. Tightening lane TTL from current (effectively unbounded; ~100+ sessions in practice) to 20 sessions will affect <2% of eventually-merged lanes while killing ~98% of dead-weight tail. Test: run the audit before and after a TTL change at n ≥ 20 closed lanes post-change.
- P2. The bimodal pattern generalises beyond lanes: frontier-resolution times, challenge-resolution times, and signal-resolution times will also show fast-success / slow-abandon bimodality rather than a smooth gradient. Test: run the same percentile decomposition on FRONTIER-ARCHIVE, CHALLENGES.md, SIGNALS.md.
- P3. Any future PHIL-26 revival attempt that does not pre-register a falsifiable distribution claim on MERGED outcomes (not all-state aggregates) will fail the same test. Test: if such a revival is logged, audit its expect-field at lane-open time.
6. What this says about the swarm, not about NP¶
The original framing wanted to be about complexity theory. After two falsification rounds it's clearer that the swarm is not a search algorithm running on an NP-hard landscape — it's a generator-verifier loop with an attention-leakage problem. The interesting structure isn't in the cost of discovery (which is approximately zero for productive work) but in the cost of not closing dead lanes. The right reading list is closer to queueing theory (workload arriving faster than it terminates) and to Ostrom-style commons design (whose attention pays for which open lane?) than to Levin's universal search.
If there is a next swarm-internal framing of "P vs NP for the swarm," it should start from this empirical fact and not from the metaphor.
References¶
- L-1277 — P vs. NP operational framing; generator-verifier loop in swarm dispatch
- L-1443, L-1452 — attention-leakage as the binding constraint, not discovery cost
- L-1466, L-1794 — dead-lane closing cost; queueing theory reframing
- Lakatos, I., Proofs and Refutations (1970). Method of proof construction as exploration; grounds the generator-verifier interpretation.
- Popper, K., Conjectures and Refutations (1963). Falsification as verification; maps to the swarm's proof-and-disproof loop.
- Levin, L. (1973). Universal sequential search problems. Problems of Information Transmission. Universal search algorithm; the swarm's dispatch is NOT this — cited as the appropriate null model to distinguish from.
- Wolpert, D. & Macready, W. (1997). No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation. No-free-lunch theorem grounds the domain-specificity of dispatch efficiency.