Rate-Distortion Theory for Knowledge Systems¶
Version: 1.0 (S515) Status: Empirically validated (R²=0.996, N=1380) External grounding: Shannon (1959), Dantzig (1957), Lorenz (1905), Tishby (1999)
1. Setup¶
A knowledge corpus is a collection of N items, each with: - size s_i (tokens, bytes, pages) - utility u_i (citations received, access frequency, importance score)
The compaction problem: remove items to reduce total size while minimizing utility loss.
2. Definitions¶
Compression ratio α ∈ [0,1]: fraction of total size removed.
Distortion D(α) ∈ [0,1]: fraction of total utility lost when compressing by α.
Utility density ρ_i = u_i / s_i: utility per unit size for item i.
3. Theorem 1: Compaction Optimality (Dantzig 1957)¶
Statement: The minimum-distortion compaction at any compression ratio α is achieved by removing items in ascending order of utility density ρ_i.
Proof: This is the fractional knapsack problem. Maximize size removed subject to utility lost ≤ D·U_total. The greedy algorithm by ρ_i ascending is optimal.
Swarm validation: At 30% compression, optimal (Sharpe) achieves D=9.4% vs random D=27.8% vs size-only D=35.7%.
4. Theorem 2: Lorenz-Distortion Identity¶
Statement: The optimal distortion curve D(α) is exactly the Lorenz curve L(α) of the utility distribution, evaluated at the cumulative size fraction.
Proof: Sort items by ρ_i ascending. D(α) = Σ_{i≤αN} u_i / Σ_i u_i = L(α) by definition of the Lorenz curve.
Corollary 1 (Favorable Compression): For any non-degenerate utility distribution, L(α) < α for α ∈ (0,1). Therefore distortion is always less than compression ratio. Compression is always favorable.
Corollary 2 (Gini = Compression Advantage): The Gini coefficient G = 2∫₀¹[α − L(α)]dα measures the total area of advantage between random and optimal compression strategies.
Corollary 3 (Universal β > 1): If D(α) ≈ A·α^β, then β > 1 for any non-degenerate distribution (by Lorenz convexity).
5. Theorem 3: R(D) Power Law¶
Empirical finding: D(α) = A·α^β with: - Swarm (N=1380, Gini=0.46): A=0.67, β=1.65, R²=0.996 - Synthetic library (N=100, Gini=0.66): A=0.46, β=2.26, R²=0.991 - Synthetic Zipf(0.3) (N=200, Gini=0.17): A=?, β=1.26, R²=0.992
Prediction: β > 1 universally (proven via Lorenz convexity). Higher concentration → higher β → more efficient compression.
Closed-form compression budget: For distortion budget D_max, the maximum safe compression is:
α_max = (D_max / A)^(1/β)
Swarm examples: | D_max | α_max | Tokens freed | Lesson-equivalents | |-------|-------|--------------|--------------------| | 1% | 7.8% | 34,400 | 108 | | 5% | 20.7% | 91,300 | 286 | | 10% | 31.6% | 139,200 | 436 |
6. Theorem 4: Information Bottleneck¶
Statement: Given a fixed-capacity channel (context window W), the maximum retrievable utility per session is:
F_max = max{F : T_W(F) ≤ W}
where T_W(F) is the total size of the working set needed to capture fraction F of utility.
Swarm measurement: | Utility fraction | Working set | Tokens needed | Fits in 180k? | |-----------------|-------------|---------------|---------------| | 50% | 268 lessons | 81,938 | ✓ | | 80% | 714 lessons | 219,795 | ✗ | | 90% | 955 lessons | 296,974 | ✗ |
F_max = 73.2% — the swarm can access at most 73.2% of its utility per session. 26.8% is structurally inaccessible.
7. Theorem 5: Growth Limit¶
Statement: As N grows with fixed channel capacity W:
F_max(N) ≈ W / (f_ws · N · t̄)
where f_ws ≈ 0.194 is the working-set fraction for 50% utility and t̄ ≈ 320 is mean item size.
Critical threshold: F_max drops to 50% at N_critical = W / (f_ws · t̄) ≈ 2,900 lessons.
Current N = 1,380. Growth ratio = 0.476 (47.6% of crisis threshold).
Falsifiable prediction: If N reaches 2,500 and F_max remains >60%, the growth limit model is falsified.
8. Strategy Comparison¶
At 30% compression (removing 30% of items by count):
| Strategy | Distortion | Relative to optimal |
|---|---|---|
| Optimal (density sort) | 9.6% | 1.0× |
| Random | 26.3% | 2.7× |
| Size-only (remove largest) | 35.7% | 3.7× |
The optimal strategy achieves 2.7× less distortion than random at the same compression level.
9. External Applications¶
This theory applies to any knowledge system. The tool tools/rate_distortion.py accepts JSON input:
Usage:
python3 tools/rate_distortion.py --input data.json --capacity 10000
python3 tools/rate_distortion.py --swarm # analyze swarm corpus
Universal principle: The Gini coefficient of your utility distribution tells you how much you can compress for free. Higher Gini = more compressible = less pain from archival.
10. Negative Results¶
- Gini→β prediction: Weak (R²=0.17). β does not scale linearly with Gini — the relationship is more complex and distribution-dependent.
- Shannon R(D) function: Classical Gaussian R(D) = ½log(σ²/D) does NOT apply — the discrete, heavy-tailed citation distribution requires the power-law model instead.
- Channel capacity formula: Shannon's C = B·log(1+SNR) is ill-suited because the "noise" in knowledge retrieval is search failure, not additive Gaussian noise.
References¶
- Shannon, C. E. (1959). Coding theorems for a discrete source with a fidelity criterion.
- Dantzig, G. B. (1957). Discrete-variable extremum problems. Operations Research.
- Lorenz, M. O. (1905). Methods of measuring the concentration of wealth.
- Tishby, N., Pereira, F. C., & Bialek, W. (1999). The information bottleneck method.
- Gini, C. (1912). Variabilità e mutabilità.