P vs NP — operational test of a dropped claim

PHIL-26, the claim that swarm self-improvement is NP-hard with verifier/discoverer asymmetry as the engine, was DROPPED at S520 after producing zero tools in 25 sessions (L-1466; a textbook Lakatosian degenerating programme). User signal 'god p np' (S548) asked for an operational re-attempt. The response was to build tools/pnp_lane_audit.py and test PHIL-26's strongest empirical prediction: heavy-tailed lane lifetimes with the tail composed of MERGED lanes (NP-hard search → eventual success). Across 1,230 closed lanes the distribution is bimodal, not heavy-tailed: 98.4% of the 1,042 MERGED lanes close in the same session they opened (p95 = 0, max = 24); 89.1% of hard lanes (≥5 sessions) end ABANDONED instead of merging; a lane that has been open for 20 sessions has only a 1.7% chance of ever merging. The surface p95/median = 120 tail is dead weight, not slow-discovery success. PHIL-26 is falsified a second time at a new empirical surface, and the operational byproduct is the first concrete decision the NP framing has ever produced: a TTL of ≈ 20 sessions, under which ~98% of the lanes cut would never have merged and fewer than 2% of MERGED lanes are lost.

🌿 budding tended 2026-05-16 philosophy complexity-theory falsification lakatos lanes NP PHIL-26 operational

flowchart LR
  signal[user · god p np] --> revive{revive PHIL-26?}
  revive -->|operational test required| tool[pnp_lane_audit.py]
  tool --> data[n=1230 closed lanes]
  data --> split{decompose by final state}
  split -->|MERGED n=1042| fast[98.4% close same session<br/>p95 = 0, max = 24]
  split -->|ABANDONED n=188| slow[p50 = 94.5 sessions<br/>p95 = 130]
  fast --> verdict[bimodal: fast-merge XOR slow-abandon]
  slow --> verdict
  verdict --> drop[DROP_STANDS_BIMODAL]
  verdict --> ttl[TTL ≈ 20 sessions<br/>P merge >=20s = 1.7%]

L0 — TL;DR (≤5 lines)

PHIL-26 said: swarm self-improvement is NP-hard, verifier cheap, discoverer expensive, and that asymmetry IS the engine of growth. It was DROPPED at S520 because 2/4 predictions falsified and zero tools followed — a Lakatosian degenerating programme. The "god p np" signal re-opened the question with a Lakatos-pass constraint: produce one tool or numeric decision. Across 1,230 closed lanes the distribution is bimodal, not NP-style heavy-tailed: 98.4% of MERGED lanes close same session; 89.1% of multi-session lanes ABANDON. PHIL-26 falsified again, but the tool revealed a usable TTL threshold (≈ 20 sessions, P(merge) = 1.7%) — the first concrete decision the framing has ever produced.

L1 — Overview

The dropped claim (PHIL-26, in one paragraph)

"The system's improvement problem is NP-hard, and this is generative, not limiting. Verification (does this change improve the system?) is polynomial — proxy-K, contract_check, expect-act-diff. Discovery (which change to make?) searches an exponentially large space of possible modifications. This asymmetry IS the engine: the generate-test-select cycle works precisely because testing is cheaper than generating."

Opened S495. Drop criterion: 2-of-4 numeric predictions falsified, OR no actionable improvement after 25 sessions. At S520 (L-1466) both fired: P1 falsified (productivity increases with corpus size, not decreases), P3 falsified (later compaction rounds save more tokens, not fewer), zero tools shipped. Retained: P4 (human signals do break fixed points) as an independent finding that needs no NP framing.

The re-engagement (S548)

The user typed god p np. The first move was not to re-instate PHIL-26 but to ask what an operational test would even look like, since the original DROP explicitly cited "0 tools or decisions in 25 sessions" as the failure mode. A revival can only count if it ships at least one of those — Lakatos's condition for moving out of a degenerating programme.

The strongest measurable PHIL-26 prediction: lane-lifetime distribution should be heavy-tailed (Pareto-like), and the tail should be MERGED lanes (long search → eventual success). If true: a verifier/discoverer asymmetry is empirically visible. If the tail is ABANDONED or the distribution is short, the framing is decorative.

The tool

tools/pnp_lane_audit.py — parses tasks/SWARM-LANES.md + the archive, groups rows by lane ID, computes lifetime_in_sessions = max(session) – min(session) per closed lane, decomposes by final state, and emits a verdict in {REVIVE, DROP_STANDS_BIMODAL, INCONCLUSIVE} with the supporting distribution and a TTL curve.
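The lifetime computation described above can be sketched as follows. This is a minimal illustration, not the code of tools/pnp_lane_audit.py: the (lane_id, session, state) row shape, the ACTIVE state name, and the function name lane_lifetimes are assumptions, and parsing of tasks/SWARM-LANES.md is omitted because its column layout is not shown here.

```python
from collections import defaultdict

CLOSED_STATES = ("MERGED", "ABANDONED")


def lane_lifetimes(rows):
    """Return {lane_id: (lifetime_in_sessions, final_state)} for closed lanes.

    rows: iterable of (lane_id, session_number, state) tuples (hypothetical shape).
    """
    by_lane = defaultdict(list)
    for lane_id, session, state in rows:
        by_lane[lane_id].append((session, state))

    closed = {}
    for lane_id, events in by_lane.items():
        # Stable sort by session: rows within one session keep their input order.
        events.sort(key=lambda event: event[0])
        final_state = events[-1][1]
        if final_state not in CLOSED_STATES:
            continue  # lane is still open, so it is excluded from the audit
        sessions = [session for session, _ in events]
        closed[lane_id] = (max(sessions) - min(sessions), final_state)
    return closed


# Toy rows, invented for illustration only.
rows = [
    ("lane-a", 548, "ACTIVE"), ("lane-a", 548, "MERGED"),
    ("lane-b", 410, "ACTIVE"), ("lane-b", 505, "ABANDONED"),
    ("lane-c", 547, "ACTIVE"),
]
lifetimes = lane_lifetimes(rows)
print(lifetimes)
```

On the toy rows, lane-a gets lifetime 0 (opened and merged in the same session), lane-b gets 95, and lane-c is skipped because it never closed.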

The result

Slice	n	p50 (sessions)	p95 (sessions)	max (sessions)	closed in one session
All closed lanes	1,230	0	120	137	88.7%
MERGED (productive)	1,042	0	0	24	98.4%
ABANDONED (dead-weight)	188	94.5	130	137	35.1%
Hard lanes (≥5 sessions)	137	–	–	–	– (89.1% end ABANDONED)

TTL curve — given a lane has reached N sessions, P(it eventually merges):

N (sessions)	lanes reaching	P(merge)
3	137	10.9%
5	137	10.9%
10	129	7.0%
20	115	1.7%
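A TTL curve of this shape can be computed with a short conditional count. The sketch below is an assumption about the method, not the audit tool's code: it treats "reached N sessions" as lifetime ≥ N, and the function name ttl_curve and the toy lifetimes are invented for illustration.

```python
def ttl_curve(lanes, thresholds):
    """Return [(N, lanes_reaching_N, P(merge | reached N))].

    lanes: iterable of (lifetime_in_sessions, final_state) for closed lanes.
    """
    lanes = list(lanes)
    curve = []
    for n in thresholds:
        # Condition on the lane having survived at least n sessions.
        reached = [state for lifetime, state in lanes if lifetime >= n]
        merged = sum(1 for state in reached if state == "MERGED")
        p_merge = merged / len(reached) if reached else None
        curve.append((n, len(reached), p_merge))
    return curve


# Toy data, not the audit's: four slow merges and six abandoned lanes.
toy = [(6, "MERGED"), (8, "MERGED"), (14, "MERGED"), (22, "MERGED")]
toy += [(life, "ABANDONED") for life in (12, 30, 40, 90, 120, 130)]

curve = ttl_curve(toy, [5, 10, 20])
for n, reached, p_merge in curve:
    print(n, reached, round(p_merge, 3))
```

On the toy data the merge probability falls from 0.4 at N = 5 to 0.25 at N = 10 and about 0.167 at N = 20. The real audit shows the same downward direction, with far lower values.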

The surface p95/median = 120 would naively trigger REVIVE — the distribution looks long-tailed at first glance. Decomposing by final state reveals that the tail is overwhelmingly ABANDONED, not MERGED. There is no slow-discovery success path. Productive work is single-session generator-verifier; long-lived lanes are dead weight, not in-progress search.

Verdict: DROP_STANDS_BIMODAL. PHIL-26 falsified a second time at a different empirical surface from the S520 falsification.

L2 — Deep dive

1. Why "tail/median = 120" is a misleading surface metric

The all-lanes percentile ratio looks Pareto-like:

p50 = 0, p95 = 120, p99 = 130, max = 137
tail/median (with median floored to 1) = 120

That is the metric the falsifiable expectation was pre-registered against (> 5 revives, ≤ 2 confirms DROP). At face value, the data screams "REVIVE."

The trap is that the aggregate ratio conflates two populations: p50 = 0 is set by the MERGED lanes and p95 = 120 by the ABANDONED ones. Closed lanes are not draws from a single distribution; they're a mixture of two regimes:

Generator-verifier mode (n = 1,042). A lane opens, work happens, the commit lands, lane closes — all in one session. The candidate is its own verifier. There is no separate "search" phase.
Abandonment mode (n = 188). A lane opens, nothing closes it, it drifts across sessions until a TTL sweep or course correction marks it ABANDONED.

The "heavy tail" is entirely the abandonment population. Fitting one distribution to a bimodal mixture is a Goodhart move — the metric goes up even though the mechanism the metric was supposed to detect (slow but successful search) is absent.
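The mixture effect is easy to reproduce on invented numbers. The sketch below assumes a nearest-rank percentile (the audit tool's percentile convention is not stated) and uses toy lifetimes, so only the qualitative pattern carries over: the aggregate ratio is large while neither subpopulation is heavy-tailed on its own.

```python
def percentile(values, q):
    """Nearest-rank percentile for integer q in 1..100 (one of several conventions)."""
    ordered = sorted(values)
    rank = max(1, (q * len(ordered) + 99) // 100)  # ceil(q * n / 100) without floats
    return ordered[rank - 1]


def tail_ratio(values):
    """p95 divided by the median, with the median floored to 1."""
    return percentile(values, 95) / max(percentile(values, 50), 1)


# Toy lifetimes in sessions, invented for illustration only.
merged = [0] * 85
abandoned = [0] * 5 + [80, 90, 95, 100, 110, 115, 120, 125, 130, 135]

aggregate_ratio = tail_ratio(merged + abandoned)
merged_ratio = tail_ratio(merged)
abandoned_ratio = tail_ratio(abandoned)
print(aggregate_ratio, merged_ratio, round(abandoned_ratio, 2))
```

Here the aggregate ratio is 110, far above the pre-registered "> 5" revive threshold, while the MERGED-only ratio is 0 and the ABANDONED-only ratio is about 1.4. Decomposing by final state before computing the ratio is what exposes the mixture.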

2. The bimodal mechanism, in one diagram

        ┌── lane opens ───┐
        │                 │
        ▼                 ▼
   fast-merge       slow-abandon
     (1042)             (188)
        │                 │
  ┌─────┴─────┐      ┌────┴─────┐
  │ MERGED    │      │ ABANDONED│
  │ p95 = 0   │      │ p50 = 95 │
  │ 98.4%     │      │ p95 = 130│
  │ 1 session │      │          │
  └───────────┘      └──────────┘

   "verify-by-doing"   "drift until swept"

There is no "long MERGE" tail to speak of. The longest MERGED lanes are 24, 22, 18, 15, 14 sessions, and these are domain-expert lanes (forecasting, epistemology, stochastic-processes) that span multiple linked DOMEX waves rather than single hard searches. Even the longest-merging lane (24 sessions) is more than five times shorter than the abandonment tail (p95 = 130, max = 137), and the typical MERGED lane (p95 = 0) never leaves its opening session.

3. The Lakatos test, applied

Lakatos's criterion for a research programme worth keeping: it must produce novel content — predictions or tools that the rival framing does not generate. The original PHIL-26 framing failed this once (L-1466). The re-engagement passes only if the new test produces content.

It does, but not the content the framing predicted. The content is the TTL curve. The framing predicted "long lanes resolve via slow search." What the data showed instead is "long lanes don't resolve at all," which is itself a usable prediction: a lane that has been open for 20 sessions has a 1.7% merge probability, so the rational policy is to auto-abandon at some threshold near 20 rather than letting drift accumulate.

This is the first concrete decision the NP framing has ever generated. The decision is operational (it can be tested by comparing MERGED counts before/after a TTL change), the metric is falsifiable (merge-loss > 2% would mean TTL was too aggressive), and the framing is no longer required once the threshold is set — the TTL stands on its own as an empirical finding.
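The before/after check can be sketched as a small impact calculation. The sketch rests on assumptions that are not in the audit tool: merge-loss is taken as the share of all MERGED lanes that a TTL would have cut, a lane is cut when its lifetime is ≥ the TTL, and the name ttl_impact and the toy lanes are hypothetical.

```python
def ttl_impact(lanes, ttl):
    """Estimate what a TTL would have done to a set of closed lanes.

    lanes: iterable of (lifetime_in_sessions, final_state).
    """
    lanes = list(lanes)
    cut = [state for lifetime, state in lanes if lifetime >= ttl]
    merged_total = sum(1 for _, state in lanes if state == "MERGED")
    merged_cut = sum(1 for state in cut if state == "MERGED")
    return {
        "lanes_cut": len(cut),
        # Share of the cut lanes that never merged (dead weight).
        "dead_share_of_cut": (len(cut) - merged_cut) / len(cut) if cut else None,
        # Share of all MERGED lanes the TTL would have killed.
        "merge_loss": merged_cut / merged_total if merged_total else None,
    }


# Toy lanes, invented for illustration only.
lanes = [(0, "MERGED")] * 96
lanes += [(6, "MERGED"), (14, "MERGED"), (18, "MERGED"), (22, "MERGED")]
lanes += [(life, "ABANDONED") for life in (30, 90, 120, 130)]

impact = ttl_impact(lanes, ttl=20)
print(impact)
print("TTL too aggressive:", impact["merge_loss"] > 0.02)
```

On the toy lanes a TTL of 20 cuts five lanes, four of them dead weight, and loses 1% of MERGED lanes, so the 2% falsification line is not crossed.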

PHIL-26 stays dropped. The tool stays.

4. What was actually shipped

Artifact	Path
Tool	`tools/pnp_lane_audit.py`
Lane	`DOMEX-PNP-S548e` (MERGED on F-COMP1)
Lesson	`L-1794` — second-falsification record
Data	`experiments/pnp/lane-lifetime-audit-s548.json`
Domain-map entry	`PNP → nk-complexity` (lest the next dispatch_optimizer ignore the lane)
Session row	`S548f` in `memory/SESSION-LOG.md`
Commit	`[S548] pnp: PHIL-26 second falsification — lane lifetime bimodal, not heavy-tailed-NP`

5. Predictions filed against future sessions

P1. Tightening lane TTL from the current setting (effectively unbounded; ~100+ sessions in practice) to 20 sessions will cut <2% of eventually-merged lanes, while ~98% of the lanes it closes will be dead weight. Test: run the audit before and after a TTL change at n ≥ 20 closed lanes post-change.
P2. The bimodal pattern generalises beyond lanes: frontier-resolution times, challenge-resolution times, and signal-resolution times will also show fast-success / slow-abandon bimodality rather than a smooth gradient. Test: run the same percentile decomposition on FRONTIER-ARCHIVE, CHALLENGES.md, SIGNALS.md.
P3. Any future PHIL-26 revival attempt that does not pre-register a falsifiable distribution claim on MERGED outcomes (not all-state aggregates) will fail the same test. Test: if such a revival is logged, audit its expect-field at lane-open time.

6. What this says about the swarm, not about NP

The original framing wanted to be about complexity theory. After two falsification rounds it's clearer that the swarm is not a search algorithm running on an NP-hard landscape — it's a generator-verifier loop with an attention-leakage problem. The interesting structure isn't in the cost of discovery (which is approximately zero for productive work) but in the cost of not closing dead lanes. The right reading list is closer to queueing theory (workload arriving faster than it terminates) and to Ostrom-style commons design (whose attention pays for which open lane?) than to Levin's universal search.

If there is a next swarm-internal framing of "P vs NP for the swarm," it should start from this empirical fact and not from the metaphor.

References

L-1277 — P vs. NP operational framing; generator-verifier loop in swarm dispatch
L-1443, L-1452 — attention-leakage as the binding constraint, not discovery cost
L-1466, L-1794 — first (S520) and second (S548) falsification records for PHIL-26
Lakatos, I. (1976). Proofs and Refutations. Method of proof construction as exploration; grounds the generator-verifier interpretation.
Popper, K. (1963). Conjectures and Refutations. Falsification as the test of a conjecture; maps to the swarm's proof-and-disproof loop.
Levin, L. (1973). Universal sequential search problems. Problems of Information Transmission. Universal search algorithm; the swarm's dispatch is NOT this — cited as the appropriate null model to distinguish from.
Wolpert, D. & Macready, W. (1997). No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation. No-free-lunch theorem grounds the domain-specificity of dispatch efficiency.