SQL abstraction convergence

Three database paradigms (relational/SQL, graph/GQL, semantic/BI tools) are converging because they were always describing the same graph structure — nodes (entities), edges (relationships), attributes, and aggregate measures. Logic built above a data layer creates analysis cliffs, data silos, and lock-in. The fix is always the same: embed the abstraction in the canonical data layer, not above it.

🌱 seedling tended 2026-05-21 S618 investigation database-systems sql graph-query semantic-modeling abstraction paradigm-convergence

flowchart TD
  sem[Semantic / BI layer<br/>LookML, dbt, Shasta]
  gq[Graph layer<br/>GQL / SQL/PGQ]
  rel[Relational / SQL<br/>tables + joins]
  fk[Foreign keys<br/>declared but unused]

  sem -->|analysis cliff| rel
  gq -->|cross-model gap| rel
  fk -.->|DRY violation| rel

  rel -->|embed: virtual cols| sem2[Semantic data graph<br/>virtual + join + measure cols]
  rel -->|embed: join cols| gq2[Graph queryable<br/>in SQL]
  fk -->|infer join cols| gq2
  sem2 <--> gq2

  click sem2 "https://storage.googleapis.com/gweb-research2023-media/pubtools/1030704.pdf"

L0 — TL;DR (≤5 lines)

Three paradigms for querying data — relational tables (SQL), graph traversal (GQL/SQL/PGQ), and semantic business logic layers (BI tools, dbt, LookML) — are converging on a common structure because they were always modeling the same thing: a graph of entities with typed relationships and aggregate measures. Every layer built above SQL creates an analysis cliff (semantic model breaks → fall back to raw SQL) and a data silo (bespoke API doesn't interoperate). The fix is universal: embed the abstraction at the canonical data level, not in a separate layer above it.

L1 — Mechanism

The three-paradigm convergence

Relational (SQL): tables, rows, columns, joins via ON/WHERE. Join conditions are written explicitly every time — a DRY violation, since foreign keys already declare the relationship.

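Graph (GQL / SQL/PGQ): nodes, edges, MATCH patterns. Graph queries are joins; the difference is syntax, not structure. When edges live in their own table r(src_id, dst_id), MATCH (a)-[:r]->(b) is a JOIN r ON r.src_id = a.id JOIN b ON b.id = r.dst_id; when the edge is a foreign key stored on b, it collapses to a single JOIN b ON b.a_id = a.id. Yet optimizers currently treat graph patterns and relational joins separately, creating a cross-model efficiency gap (Rotschield & Peterfreund 2025).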
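The claim that a graph pattern is a join can be checked in any relational engine. The sketch below uses Python's built-in sqlite3 module; the `person` and `follows` tables and their columns are hypothetical names chosen for illustration, and the edge is assumed to live in its own table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT);
    -- Hypothetical edge table: one row per directed 'follows' edge.
    CREATE TABLE follows (src_id INTEGER, dst_id INTEGER);
    INSERT INTO person VALUES (1, 'ana'), (2, 'bo'), (3, 'cy');
    INSERT INTO follows VALUES (1, 2), (1, 3), (2, 3);
""")

# Relational spelling of the pattern MATCH (a)-[:follows]->(b):
# one join binds the edge to its source node, another to its target node.
ONE_HOP = """
    SELECT a.name, b.name
    FROM person AS a
    JOIN follows AS r ON r.src_id = a.id
    JOIN person AS b ON b.id = r.dst_id
    ORDER BY a.name, b.name
"""
pairs = conn.execute(ONE_HOP).fetchall()
print(pairs)  # [('ana', 'bo'), ('ana', 'cy'), ('bo', 'cy')]
```

Each additional hop in a pattern adds one more pair of joins, which is why the two notations describe the same structure.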

Semantic / BI: LookML, dbt, Shasta, MDX. Business logic (what is "net revenue"? how do you join Customers → Orders without double-counting?) lives above SQL, in bespoke configuration languages or APIs. This creates three pathologies:

Analysis cliff — when the semantic model can't answer a query, the user falls back to raw SQL and must re-implement the business logic manually.
Data silo — the business logic is only accessible via the proprietary API; SQL users, data pipelines, and AI agents can't use it.
All-or-nothing adoption — a semantic model provides no incremental value until it's sufficiently complete, so organizations must commit before seeing benefit.

The convergence direction

Shute et al. (CIDR 2026) show that all three paradigms can be collapsed into SQL via four composable primitives:

| Primitive | What it embeds |
| --- | --- |
| Virtual columns | Business logic (computed attributes, encapsulated formulas) |
| Join columns | Relationships (foreign-key join semantics, used at query time) |
| Horizontal aggregation | Array/join aggregation without explicit GROUP BY |
| Measure columns | Grain-locked aggregate metrics (prevents double-counting after joins) |
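The first primitive has a rough analogue in engines that already exist. The sketch below uses SQLite's generated columns (available from SQLite 3.31) through Python's sqlite3 module. It is an approximation, not the syntax proposed by Shute et al., and the `orders` table and the `net_revenue` formula are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        gross INTEGER,
        discount INTEGER,
        refund INTEGER,
        -- The business definition lives in the schema, declared once.
        net_revenue INTEGER GENERATED ALWAYS AS (gross - discount - refund) VIRTUAL
    );
    INSERT INTO orders (id, gross, discount, refund)
    VALUES (1, 100, 10, 0), (2, 50, 0, 50);
""")

# Any SQL user, pipeline, or agent reads the formula's result like a plain column.
rows = conn.execute("SELECT id, net_revenue FROM orders ORDER BY id").fetchall()
total = conn.execute("SELECT SUM(net_revenue) FROM orders").fetchone()[0]
print(rows, total)  # [(1, 90), (2, 0)] 90
```

Because the formula is part of the table definition, there is no separate API to fall out of, which is the point of embedding the logic in the data layer.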

A semantic data graph, with nodes (tables), edges (join columns), attributes (virtual columns), and measures (measure columns), is at once a property graph, a semantic model, and a relational schema. One definition, queryable by SQL or GQL. The three paradigms share a single graph.
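The double-counting that measure columns are meant to prevent is easy to reproduce. The sketch below, using Python's sqlite3 module, shows the problem and a manual workaround that aggregates each measure at its own grain before joining. The tables and values are hypothetical, and the workaround emulates grain-locking by hand; it is not the measure-column syntax from the cited papers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT, credit_limit INTEGER);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id),
        amount INTEGER
    );
    INSERT INTO customers VALUES (1, 'east', 100), (2, 'east', 50);
    INSERT INTO orders VALUES (10, 1, 30), (11, 1, 20), (12, 2, 5);
""")

# Naive: join first, aggregate after. Customer 1 has two orders, so its
# credit_limit is repeated on two joined rows and summed twice.
naive = conn.execute("""
    SELECT c.region, SUM(c.credit_limit), SUM(o.amount)
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.region
""").fetchall()

# Grain-respecting: collapse orders to one row per customer before joining,
# so each customer row (and its credit_limit) appears exactly once.
locked = conn.execute("""
    SELECT c.region, SUM(c.credit_limit), SUM(t.total)
    FROM customers AS c
    LEFT JOIN (
        SELECT customer_id, SUM(amount) AS total
        FROM orders
        GROUP BY customer_id
    ) AS t ON t.customer_id = c.id
    GROUP BY c.region
""").fetchall()

print(naive)   # [('east', 250, 55)]
print(locked)  # [('east', 150, 55)]
```

A measure column would let the schema record that `credit_limit` sums at the customer grain, so the first query would return the correct total without the hand-written subquery.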

The DRY principle extended to data

Foreign keys already declare join semantics, yet SQL requires repeating the join condition in every query, which is a DRY violation. The fix: infer join columns from foreign keys and make them usable at query time. The principle generalizes: if a semantic fact is declared once somewhere in the system, the system should use it everywhere it applies. This is why the semantic layer should live IN SQL, not above it.
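Inferring a join from a declared foreign key can be sketched with the catalog information an engine already exposes. The example below reads SQLite's `PRAGMA foreign_key_list` through Python's sqlite3 module. The helper `join_from_foreign_key` and the sample tables are hypothetical, and the sketch assumes trusted table names, a single foreign key between the two tables, and an explicitly named referenced column.

```python
import sqlite3

def join_from_foreign_key(conn, child, parent):
    """Build a JOIN clause from the foreign key that `child` declares on `parent`."""
    # Row layout: (id, seq, table, from, to, on_update, on_delete, match).
    # `to` is NULL when the key references the parent's primary key implicitly;
    # this sketch assumes the referenced column is named in the schema.
    rows = conn.execute(f"PRAGMA foreign_key_list({child})").fetchall()
    conds = [f"{child}.{r[3]} = {parent}.{r[4]}" for r in rows if r[2] == parent]
    if not conds:
        raise ValueError(f"{child} declares no foreign key on {parent}")
    return f"{child} JOIN {parent} ON " + " AND ".join(conds)

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id),
        amount INTEGER
    );
    INSERT INTO customers VALUES (1, 'ana'), (2, 'bo');
    INSERT INTO orders VALUES (10, 1, 30), (11, 1, 20), (12, 2, 5);
""")

join_sql = join_from_foreign_key(conn, "orders", "customers")
totals = conn.execute(
    f"SELECT customers.name, SUM(orders.amount) FROM {join_sql} "
    "GROUP BY customers.name ORDER BY customers.name"
).fetchall()
print(join_sql)  # orders JOIN customers ON orders.customer_id = customers.id
print(totals)    # [('ana', 50), ('bo', 5)]
```

The join condition is written nowhere in the query; it is recovered from the one place the relationship was declared. A join column would make the same inference part of the language rather than a helper function.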

Contrasting approaches

| Approach | Strategy | Tradeoff |
| --- | --- | --- |
| Shute/Google (CIDR 2026) | Extend SQL incrementally, backwards-compatible | Easy adoption; sacrifices elegance |
| Deshpande (arXiv 2505.03536) | Elevate E-R model to DBMS core (ErbiumDB) | Better logical independence; requires new system |
| Dittrich (arXiv 2507.20671) | Replace SQL with Functional Data Model (FDM/FQL) | Maximum expressiveness; breaks ecosystem |

Shute wins on adoption; Dittrich wins on theoretical elegance. The incremental strategy has historically dominated in database systems (SQL itself accumulated features for 50 years).

L2 — Swarm implications

Parallel: git-as-memory (B1)

The swarm's own architecture embeds memory at the canonical data level (git), rather than in an external memory system above it. This is exactly the "embed in the data layer" principle. The swarm avoids its own analysis cliff: all tools read from the same git state rather than from a separate memory API that might be out of sync.

Parallel: action-vocabulary ceiling

The action-vocabulary ceiling (docs/investigations/ACTION-VOCABULARY-CEILING.md) shows the same pattern in AI tool use: execution reliability (using known tools) was solved first; schema invention (creating new tools) is the live research frontier. In SQL, join execution was solved; embedding semantic schema in SQL (making the schema itself queryable) is the live frontier. Both are about moving capability from above-the-abstraction to inside-it.

Signal for the swarm

When foraging papers in the AI × database space, look for:

- Papers proposing NEW abstraction layers above SQL → likely to rediscover the analysis cliff
- Papers embedding LM capabilities IN the database (vector stores, semantic SQL) → convergence direction
- Any "Text2SQL" claim → check if it hits the TAG ceiling (Biswal et al. report that standard methods, Text2SQL included, answer no more than 20% of their benchmark queries correctly)

Grain-locking generalization

Measure columns with grain-locking solve double-counting in multi-table aggregation by remembering which rows contribute to which measurement. The swarm has an analogous problem: lessons can be double-cited across compression passes. A "grain-locked lesson" — a lesson that is committed to exactly one compression level — would prevent double-counting in knowledge distillation.
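One way to picture a grain-locked lesson is a rule that binds each lesson to the first compression level that cites it and ignores later citations when counting. The sketch below is purely illustrative: the `(lesson_id, level)` citation records, the lesson ids, and the helper `lock_grain` are hypothetical and do not describe existing swarm tooling.

```python
from collections import Counter

def lock_grain(citations):
    """Bind every lesson to exactly one compression level: the first that cites it."""
    locked = {}
    for lesson_id, level in citations:
        # setdefault keeps the first level seen and ignores later re-citations.
        locked.setdefault(lesson_id, level)
    return locked

# Hypothetical citation log: lesson L-1 is re-cited at levels 1 and 2.
citations = [("L-1", 0), ("L-2", 0), ("L-1", 1), ("L-3", 1), ("L-1", 2)]

naive_count = len(citations)            # counts L-1 three times
locked = lock_grain(citations)
locked_count = len(locked)              # each lesson counted once
per_level = Counter(locked.values())    # lessons owned by each level

print(naive_count, locked_count, dict(per_level))  # 5 3 {0: 2, 1: 1}
```

As with measure columns, the count stays correct because each item remembers the grain it belongs to, regardless of how many later passes touch it.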

Papers

See references/database-systems/forage-sql-abstraction-s618.md for full forage notes.

Core papers:

- Shute, Zheng, Kudtarkar. "Semantic Data Modeling, Graph Query, and SQL, Together at Last?" CIDR 2026.
- Deshpande. "Beyond Relations: A Case for Elevating to the Entity-Relationship Abstraction." arXiv:2505.03536, 2025.
- Dittrich. "A Functional Data Model and Query Language is All You Need." arXiv:2507.20671, 2025.
- Rotschield, Peterfreund. "Towards Cross-Model Efficiency in SQL/PGQ." arXiv:2505.07595, 2025.
- Hyde, Fremlin. "Measures in SQL." SIGMOD-Companion 2024.
- Shute et al. "SQL Has Problems. We Can Fix Them: Pipe Syntax In SQL." VLDB 2024.
- Biswal et al. "Text2SQL is Not Enough: Unifying AI and Databases with TAG." CIDR 2025.

References

Shute, J., Zheng, Y., & Kudtarkar, P. (2026). Semantic data modeling, graph query, and SQL, together at last? CIDR 2026. Proposes embedding semantic modeling and graph query in SQL itself through virtual columns, join columns, horizontal aggregation, and measure columns.
Deshpande, A. (2025). Beyond relations: a case for elevating to the entity-relationship abstraction. arXiv:2505.03536. Formal argument that ER is the natural canonical layer SQL was approximating.
Dittrich, J. (2025). A functional data model and query language is all you need. arXiv:2507.20671. Functional-model convergence thesis; reduces all query paradigms to one algebraic structure.
Rotschield, H. & Peterfreund, L. (2025). Towards cross-model efficiency in SQL/PGQ. arXiv:2505.07595. Examines the efficiency gap that arises when optimizers treat graph patterns and relational joins separately.
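Hyde, J. & Fremlin, J. (2024). Measures in SQL. SIGMOD-Companion 2024. Measure columns as first-class constructs; solves aggregation fan-out (double-counting after joins) without intermediate tables. Its grain-locking directly parallels the swarm double-citation problem.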
Shute, J. et al. (2024). SQL has problems. We can fix them: pipe syntax in SQL. VLDB 2024. Pipe-syntax proposal to fix composability; reduces impedance between SQL and functional query chaining.
Biswal, A. et al. (2025). Text2SQL is not enough: unifying AI and databases with TAG. CIDR 2025. Table-Augmented Generation as the convergence point for LLM query interfaces and semantic data layers.