Storage Architecture Deep Dive
How CSharpDB.Storage is designed and how its components collaborate — written for mid-level C# engineers comfortable with .NET async patterns, interfaces, and clean architecture.
This document lives at samples/storage-tutorials/architecture.md in the repository. See also the companion Usage & Extensibility guide.
Role of CSharpDB.Storage in the Ecosystem
CSharpDB.Storage is the durability and physical layout layer. It is responsible for:
- Persisting pages to a backing store via `IStorageDevice`
- Managing in-memory page caching and dirty tracking (`Pager`)
- Providing crash safety and commit durability through a write-ahead log (`WriteAheadLog`)
- Offering a row-id keyed B+tree for table and index data (`BTree`, `BTreeCursor`, `SlottedPage`)
- Encoding row values and schema metadata into compact byte payloads (`RecordEncoder`, `SchemaSerializer`)
- Maintaining a system catalog for tables, indexes, views, and triggers (`SchemaCatalog`)
The higher layers (CSharpDB.Engine, CSharpDB.Execution, and the SQL planner/executor) treat storage as a stable persistence substrate. The engine delegates transaction lifecycle methods (BeginTransactionAsync, CommitAsync, RollbackAsync) to the Pager.
High-Level Architecture
At runtime, write/read operations flow through this stack:
┌──────────────────────────────────────────────────────────┐
│ Engine / Planner │
│ Decides what logical records to read/write │
├──────────────────────────────────────────────────────────┤
│ SchemaCatalog + BTree │
│ Resolves table/index trees, performs key-level ops │
├──────────────────────────────────────────────────────────┤
│ Pager │
│ Page access, cache, dirty bookkeeping, transactions │
├──────────────────────────────────────────────────────────┤
│ WriteAheadLog + WalIndex │
│ Durability, commit visibility, recovery, checkpointing │
├──────────────────────────────────────────────────────────┤
│ IStorageDevice │
│ Physical file reads/writes/flushes │
└──────────────────────────────────────────────────────────┘
Conceptually:
- BTree is the logical page consumer (structure + algorithms)
- Pager is the transactional page manager (state + coordination)
- WAL is the durability journal (redo log with commit markers)
- Storage device is the pluggable I/O boundary
File Format & On-Disk Layout
Database Pages
- Fixed page size: 4096 bytes (`PageConstants.PageSize`)
- Page 0 contains a 100-byte file header, then usable page content
- Remaining pages are fully usable for B+tree or freelist content
The file header tracks:
┌───────────────────────┬──────────────┐
│ magic / version │ bytes 0-15 │
│ page size │ bytes 16-19 │
│ total page count │ bytes 20-23 │
│ schema catalog root │ bytes 24-27 │
│ freelist head page │ bytes 28-31 │
│ change counter │ bytes 32-35 │
│ (reserved) │ bytes 36-99 │
└───────────────────────┴──────────────┘
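As a sketch, these fields can be read straight off page 0 at the offsets in the table above. The `DbHeader` record name and the little-endian byte order are assumptions for illustration, not the repository's actual types:

```csharp
using System;
using System.Buffers.Binary;

// Build a synthetic page 0 and round-trip a couple of header fields.
var buf = new byte[4096];
BinaryPrimitives.WriteUInt32LittleEndian(buf.AsSpan(16, 4), 4096); // page size
BinaryPrimitives.WriteUInt32LittleEndian(buf.AsSpan(20, 4), 7);    // total page count
var header = ParseHeader(buf);
Console.WriteLine($"{header.PageSize} {header.PageCount}");

// Offsets follow the layout table above; little-endian is an assumption.
DbHeader ParseHeader(ReadOnlySpan<byte> page0) => new(
    PageSize:      BinaryPrimitives.ReadUInt32LittleEndian(page0.Slice(16, 4)),
    PageCount:     BinaryPrimitives.ReadUInt32LittleEndian(page0.Slice(20, 4)),
    CatalogRoot:   BinaryPrimitives.ReadUInt32LittleEndian(page0.Slice(24, 4)),
    FreelistHead:  BinaryPrimitives.ReadUInt32LittleEndian(page0.Slice(28, 4)),
    ChangeCounter: BinaryPrimitives.ReadUInt32LittleEndian(page0.Slice(32, 4)));

// Hypothetical container for the page-0 header fields; names are illustrative.
record DbHeader(uint PageSize, uint PageCount, uint CatalogRoot, uint FreelistHead, uint ChangeCounter);
```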
Slotted Page Structure
Each B+tree page uses a slotted layout:
┌──────────────────────────────────────────────────────────┐
│ Page Header (type, cell count, content-start, │
│ right-child / next-leaf pointer) │
├──────────────┬───────────────────────┬───────────────────┤
│ Pointer Array│ Free Space │ Cell Content │
│ (grows →) │ │ (← grows) │
└──────────────┴───────────────────────┴───────────────────┘
SlottedPage encapsulates this structure: cell insert/delete, free-space accounting, defragmentation, and typed access to page header fields.
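The free-space math behind that accounting is simple: the pointer array grows right from the header while cell content grows left from the end of the page. The header size and slot width below are illustrative constants, not the repository's actual values:

```csharp
using System;

// Free space in a slotted page = gap between the end of the slot pointer
// array (growing right) and the start of cell content (growing left).
// Header size (8) and slot width (2) are illustrative assumptions.
const int PageSize = 4096, HeaderSize = 8, SlotWidth = 2;

int FreeSpace(int cellCount, int contentStart) =>
    contentStart - (HeaderSize + cellCount * SlotWidth);

// Empty page: content starts at the end of the page.
Console.WriteLine(FreeSpace(0, PageSize));       // 4088
// After inserting a 100-byte cell: one slot pointer, content moved left by 100.
Console.WriteLine(FreeSpace(1, PageSize - 100)); // 3986
```

When deletes leave the cell content region fragmented, defragmentation compacts cells back against the end of the page so this gap is contiguous again.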
WAL File Layout
The WAL file (<db>.wal) contains:
- 32-byte WAL header (magic/version/page size/salts)
- Repeated frames, each containing:
- 24-byte frame header
- Full 4096-byte page image
Frames are appended for dirty pages. A transaction is committed by marking the last frame with non-zero dbPageCount and flushing the WAL stream.
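Because frames are fixed-size, locating frame *i* is pure arithmetic. A sketch using the sizes given above (the helper names are illustrative):

```csharp
using System;

// With a 32-byte WAL header and fixed-size frames (24-byte frame header +
// 4096-byte page image), frame i lives at a computable offset.
const int WalHeaderSize = 32, FrameHeaderSize = 24, PageSize = 4096;
const int FrameSize = FrameHeaderSize + PageSize; // 4120 bytes per frame

long FrameOffset(int frameIndex) =>
    WalHeaderSize + (long)frameIndex * FrameSize;

// A frame marks a commit when its header's dbPageCount field is non-zero.
bool IsCommitFrame(uint dbPageCount) => dbPageCount != 0;

Console.WriteLine(FrameOffset(0)); // 32
Console.WriteLine(FrameOffset(2)); // 32 + 2*4120 = 8272
Console.WriteLine(IsCommitFrame(0)); // False: an ordinary, uncommitted frame
```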
Core Components & Responsibilities
I/O Abstraction: IStorageDevice
- `IStorageDevice` defines async primitives for random reads/writes, flushing, resizing, and disposal
- `FileStorageDevice` is the default implementation, built on the `RandomAccess` APIs for async random I/O
- Reads beyond the end of the file zero-fill the unread portion of the buffer
Page Manager & Transaction Coordinator: Pager
Pager is the central coordinator with these responsibilities:
- Caches pages via `IPageCache` (`DictionaryPageCache` by default, `LruPageCache` when bounded, or a custom cache)
- Tracks dirty page IDs within a write transaction
- Manages page allocation and freelist reuse
- Reads/writes file header state
- Owns transaction flow: begin, commit, rollback
- Delegates WAL appends/commit/rollback/checkpoint
- Supports concurrent snapshot readers and checkpoint coordination
Concurrency Model
| Mechanism | Purpose |
|---|---|
| `SemaphoreSlim` writer lock | Single-writer exclusion with timeout |
| Checkpoint lock | Serializes checkpoint operations |
| Active reader counter | Prevents checkpoint while snapshot readers are alive |
| Immutable `WalSnapshot` | Snapshot readers use frozen state + read-only pager |
Transaction Behavior
// BeginTransactionAsync
// → acquires writer lock, starts WAL transaction
// CommitAsync
// → writes all dirty pages as WAL frames
// → WAL commits (mark final frame + flush)
// → clears dirty set, releases writer lock
// RollbackAsync
// → truncates uncommitted WAL frames
// → clears cache/dirty state
// → rehydrates header state, releases lock
Checkpoint Behavior
- Triggered manually or automatically when WAL frame count threshold is reached
- Skips if active readers hold snapshots
- Applies committed WAL pages back into main DB file and resets WAL/index
Durability & Recovery: WriteAheadLog + WalIndex
WriteAheadLog
- Create/open WAL file
- Append page frames for dirty pages
- Mark commit by rewriting the final frame header (`dbPageCount`) plus checksum, then flushing
- Recover by scanning the WAL, validating frame salts/checksums, rebuilding the committed index, and truncating any partial/corrupt tail
- Checkpoint committed pages to DB file and reset WAL
WalIndex
- Maps `pageId` → latest committed WAL offset
- Tracks committed frame count
- Produces `WalSnapshot` copies for readers
WalSnapshot — Snapshot Isolation
- Readers resolve page lookups through a frozen page-to-offset map
- New commits do not affect an existing snapshot
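The mechanism can be illustrated with a plain dictionary copy standing in for the page-to-offset map (the offsets and the copy-based acquisition below are illustrative):

```csharp
using System;
using System.Collections.Generic;

// A snapshot copies the page → WAL-offset map at acquisition time, so later
// commits (which mutate the live map) are invisible to the snapshot.
var liveIndex = new Dictionary<long, long> { [1] = 4152, [2] = 8272 };

// Hypothetical snapshot acquisition: freeze a copy of the current map.
var snapshot = new Dictionary<long, long>(liveIndex);

liveIndex[1] = 12392; // a new commit updates page 1's latest frame offset

Console.WriteLine(snapshot[1]);  // still 4152: the snapshot is frozen
Console.WriteLine(liveIndex[1]); // 12392: future readers see the new frame
```

This is also why snapshot creation cost scales with the size of the WAL page map, as noted in the tradeoffs table below.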
Data Structure: BTree & BTreeCursor
BTree is a row-id keyed B+tree (long key).
| Cell Type | Contents |
|---|---|
| Leaf cell | key + payload bytes |
| Interior cell | left child + separator key |
| Page header pointer | interior pages store a right-child pointer; leaf pages chain via nextLeaf |
Supported operations: point lookup (FindAsync + cached fast path TryFindCached), insert with recursive split, delete (no underflow rebalance), full count by leaf traversal, and cursor creation.
Performance: Leaf hint cache (_hintLeafPageId, min/max key range) bypasses root-to-leaf traversal for clustered lookups. Binary search utilities (LowerBoundLeaf, UpperBoundInterior) operate on per-page cells.
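As a sketch of what a `LowerBoundLeaf`-style helper does per page, here is a classic lower-bound search over a leaf's sorted cell keys:

```csharp
using System;

// Lower bound: index of the first key >= target, or keys.Length if every
// key is smaller. This is the per-page primitive behind leaf lookups.
int LowerBound(long[] keys, long target)
{
    int lo = 0, hi = keys.Length;
    while (lo < hi)
    {
        int mid = lo + (hi - lo) / 2;
        if (keys[mid] < target) lo = mid + 1; // target is in the right half
        else hi = mid;                        // keys[mid] is a candidate
    }
    return lo;
}

long[] cellKeys = { 10, 20, 30, 40 };
Console.WriteLine(LowerBound(cellKeys, 20)); // 1 (exact hit)
Console.WriteLine(LowerBound(cellKeys, 25)); // 2 (first key >= 25)
Console.WriteLine(LowerBound(cellKeys, 99)); // 4 (past the end)
```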
BTreeCursor: Forward-only iterator starting at the leftmost leaf and advancing through the leaf-linked list. Supports SeekAsync(targetKey) then incremental MoveNextAsync.
Serialization: RecordEncoder, Varint, SchemaSerializer
Record Encoding
Rows are encoded as a column-count varint followed by repeated [typeTag][valueData] pairs. Optimized APIs avoid materializing full rows:
DecodeInto // reuse destination span
DecodeUpTo // prefix decode (first N columns)
DecodeColumn // single-column extraction
TryColumnTextEquals // UTF-8 equality without string alloc
IsColumnNull // null check without decode
TryDecodeNumericColumn // numeric extraction
Varint
Unsigned LEB128-style encoding used by record and schema serializers for compact lengths/sizes.
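A minimal sketch of that encoding (this mirrors standard unsigned LEB128; the repository's `Varint` helper may differ in API shape):

```csharp
using System;
using System.Collections.Generic;

// Unsigned LEB128: 7 payload bits per byte, high bit set on every byte
// except the last. Values below 128 fit in a single byte.
List<byte> EncodeVarint(ulong value)
{
    var bytes = new List<byte>();
    do
    {
        byte b = (byte)(value & 0x7F);
        value >>= 7;
        if (value != 0) b |= 0x80; // continuation bit: more bytes follow
        bytes.Add(b);
    } while (value != 0);
    return bytes;
}

ulong DecodeVarint(IReadOnlyList<byte> bytes)
{
    ulong value = 0;
    int shift = 0;
    foreach (byte b in bytes)
    {
        value |= (ulong)(b & 0x7F) << shift;
        if ((b & 0x80) == 0) break; // high bit clear: last byte
        shift += 7;
    }
    return value;
}

Console.WriteLine(EncodeVarint(300).Count);         // 2 bytes
Console.WriteLine(DecodeVarint(EncodeVarint(300))); // 300
```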
Schema Metadata
SchemaSerializer serializes/deserializes TableSchema, IndexSchema, and TriggerSchema. Maps names to long keys via deterministic hash functions for catalog storage.
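The exact hash function is not specified here; 64-bit FNV-1a sketches the idea of a deterministic name-to-key mapping (the `NameToKey` helper name is hypothetical):

```csharp
using System;
using System.Text;

// The catalog needs a deterministic name → long mapping so the same table
// name always resolves to the same B+tree key. FNV-1a is an illustration;
// any stable 64-bit hash with low collision risk serves the same role.
long NameToKey(string name)
{
    const ulong FnvOffset = 14695981039346656037UL;
    const ulong FnvPrime  = 1099511628211UL;
    ulong hash = FnvOffset;
    foreach (byte b in Encoding.UTF8.GetBytes(name))
    {
        hash ^= b;
        hash *= FnvPrime; // wraps on overflow, which is intended for FNV
    }
    return unchecked((long)hash);
}

Console.WriteLine(NameToKey("users") == NameToKey("users")); // True: deterministic
Console.WriteLine(NameToKey("users") == NameToKey("Users")); // False: case-sensitive
```

Determinism is what makes this safe to persist: the key can be recomputed from the name alone, with only the theoretical collision risk noted in the tradeoffs table.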
Catalog & Metadata: SchemaCatalog
SchemaCatalog stores and caches metadata in B+trees:
- Main catalog tree stores table definitions keyed by hashed table name
- Sentinel keys point to separate subtrees for indexes, views, and triggers
- In-memory caches mirror catalog for fast lookups
Responsibilities: create/drop/update table schemas, create/drop indexes/views/triggers, provide table/index BTree instances, track root page IDs, persist root-page changes after splits (PersistRootPageChangesAsync).
Data Flow Walkthroughs
Database Open / Recovery
- Engine creates `FileStorageDevice`, `WalIndex`, `WriteAheadLog`, and `Pager`
- New DB: initialize the page 0 header, create/open the WAL
- Existing DB: pager loads the header, WAL recovery scans the WAL
- If committed WAL frames exist after recovery, a checkpoint runs to bring the DB file current
- `SchemaCatalog` loads metadata trees and warms in-memory schema caches
Write Transaction
- `Pager.BeginTransactionAsync` acquires the writer lock and starts a WAL transaction
- Planner/catalog/BTree mutate page byte arrays
- Modified pages are marked dirty in the pager
- `Pager.CommitAsync`: updates the page 0 header → appends dirty pages to the WAL → WAL commits (mark final frame + flush) → WAL index becomes visible to future readers → writer lock released
- An optional auto-checkpoint may run when the frame-count threshold is reached
Rollback
- WAL is truncated to tx start offset
- Pager clears dirty state and cache
- Header state is reloaded from DB file (or latest committed WAL page 0 if present)
- Writer lock is released
Read Path (Normal & Snapshot)
Normal read: BTree asks pager for pages. Pager resolves from cache first, then latest WAL frame (if any), then DB file.
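That resolution order can be sketched with stand-in stores (plain dictionaries here; the real pager consults `IPageCache`, the `WalIndex`, and the storage device):

```csharp
using System;
using System.Collections.Generic;

// Resolution order on a normal read: page cache first, then the latest
// committed WAL frame, then the main database file. The string "images"
// are placeholders for real 4096-byte page buffers.
var cache    = new Dictionary<long, string> { [1] = "cached image of page 1" };
var walIndex = new Dictionary<long, string> { [2] = "WAL frame image of page 2" };

string Resolve(long pageId,
               IReadOnlyDictionary<long, string> cached,
               IReadOnlyDictionary<long, string> wal) =>
    cached.TryGetValue(pageId, out var c) ? c                 // 1. cache hit
    : wal.TryGetValue(pageId, out var w) ? w                  // 2. latest WAL frame
    : $"page {pageId} read from DB file";                     // 3. fall back to file

Console.WriteLine(Resolve(1, cache, walIndex)); // served from cache
Console.WriteLine(Resolve(2, cache, walIndex)); // served from WAL
Console.WriteLine(Resolve(3, cache, walIndex)); // falls through to the DB file
```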
Snapshot read:
- Reader acquires a `WalSnapshot` from the pager
- A read-only snapshot pager is created
- Page resolution uses snapshot mapping only; later commits are invisible
- On disposal, reader count decrements, allowing future checkpoint
Query Scans
BTreeCursor begins at leftmost leaf and walks nextLeaf pointers. Row payloads are decoded via RecordEncoder as needed by execution operators.
Architectural Patterns
| Pattern | Where |
|---|---|
| Abstraction boundary | IStorageDevice decouples persistence medium from page/WAL logic |
| Coordinator pattern | Pager centralizes transactional lifecycle and page state |
| Redo logging | WAL stores post-image page frames and commit markers |
| Snapshot pattern | WalSnapshot provides immutable read view for concurrent readers |
| Repository-like metadata | SchemaCatalog encapsulates schema persistence and in-memory indexing |
| Cursor iteration | BTreeCursor for ordered streaming traversal |
| Format utility classes | RecordEncoder, SchemaSerializer, Varint isolate binary format concerns |
Extensibility Model
Current Extension Points
| Concern | Interface | Default |
|---|---|---|
| Storage backend | `IStorageDevice` | `FileStorageDevice` |
| Page cache | `IPageCache` | `DictionaryPageCache` / `LruPageCache` |
| Auto-checkpoint | `ICheckpointPolicy` | `FrameCountCheckpointPolicy` |
| Lifecycle hooks | `IPageOperationInterceptor` | `NoOpPageOperationInterceptor` |
| WAL checksum | `IPageChecksumProvider` | `AdditiveChecksumProvider` |
| Index implementation | `IIndexProvider` | `BTreeIndexProvider` |
| Serialization | `ISerializerProvider` | `DefaultSerializerProvider` |
| Catalog codec | `ICatalogStore` | `CatalogStore` |
| Composition root | `IStorageEngineFactory` | `DefaultStorageEngineFactory` |
Future Plugin Support
- Expose WAL replacement through `StorageEngineOptionsBuilder` instead of requiring a custom `IStorageEngineFactory`
- Introduce a higher-level abstraction around pager/WAL orchestration for alternate journaling schemes
- Consider whether page layout and B+tree algorithms should ever be swappable, or remain intentionally concrete
Tradeoffs & Rationale
| Decision | Benefit | Tradeoff |
|---|---|---|
| Whole-page WAL frames | Implementation clarity, recovery simplicity | Larger write amplification for small updates |
| Single-writer lock | Predictable conflict/consistency handling | Write throughput bottleneck under heavy concurrent writes |
| Snapshot readers via copied map | Cheap creation, easy to reason about | Copy cost grows with WAL page map size |
| Delete without rebalance | Simpler implementation, fewer edge-case bugs | Potential space inefficiency over time |
| Hash-based catalog keys | Deterministic and lightweight | Theoretical collision risk |
| Static utility serializers | Low overhead, fast | Reduced runtime configurability |
Contributor Quick Guide
A) Add a New Storage Backend
- Implement `IStorageDevice`
- Ensure read semantics match expectations (especially zero-fill-beyond-EOF behavior)
- Validate with WAL + pager tests (transaction commit/rollback/recovery)
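A minimal in-memory backend illustrates the shape of such a device. The `ISimpleStorageDevice` interface below is a simplified stand-in, not the real `IStorageDevice` signature; match the repository's interface when implementing for real:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

var device = new MemoryStorageDevice();
await device.WriteAsync(0, new byte[] { 1, 2, 3 });
var buf = new byte[5];
await device.ReadAsync(0, buf); // bytes 3-4 are beyond EOF → zero-filled
Console.WriteLine(string.Join(",", buf)); // 1,2,3,0,0

// Simplified stand-in interface: async random reads/writes only.
interface ISimpleStorageDevice
{
    ValueTask ReadAsync(long offset, Memory<byte> buffer);
    ValueTask WriteAsync(long offset, ReadOnlyMemory<byte> buffer);
}

class MemoryStorageDevice : ISimpleStorageDevice
{
    private readonly List<byte> _data = new();

    public ValueTask ReadAsync(long offset, Memory<byte> buffer)
    {
        var span = buffer.Span;
        for (int i = 0; i < span.Length; i++)
        {
            long pos = offset + i;
            // Key semantic: reads past EOF zero-fill rather than throw.
            span[i] = pos < _data.Count ? _data[(int)pos] : (byte)0;
        }
        return ValueTask.CompletedTask;
    }

    public ValueTask WriteAsync(long offset, ReadOnlyMemory<byte> buffer)
    {
        while (_data.Count < offset + buffer.Length) _data.Add(0); // grow "file"
        var span = buffer.Span;
        for (int i = 0; i < span.Length; i++) _data[(int)offset + i] = span[i];
        return ValueTask.CompletedTask;
    }
}
```

An in-memory device like this is also convenient for running the WAL and pager test suites without touching disk.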
B) Add Row Encoding Optimizations
- Extend `RecordEncoder` with focused decode helpers for execution hot paths
- Keep the wire format backward-compatible unless a versioning header strategy is introduced
- Add tests for mixed type decoding and edge cases (missing columns, null semantics)
C) Improve BTree Behavior
- Add rebalance/merge logic for delete underflow
- Preserve cursor and lookup correctness with split/merge interactions
- Ensure root-page updates are persisted through `SchemaCatalog.PersistRootPageChangesAsync`
D) Pluggable Index Providers (Larger Refactor)
- Define an interface for index operations used by planner/catalog
- Provide an adapter for the existing `BTree` implementation
- Migrate factory points in engine/database construction
E) Concurrency Changes
- Be cautious around writer/checkpoint/reader interactions
- Any checkpoint policy changes must preserve snapshot guarantees
- Re-run WAL recovery and concurrent reader tests after modifications
Practical Mental Model
A useful way to reason about this storage engine:
BTree → "Where is this key/value logically?"
Pager → "Which page image is authoritative right now?"
WAL → "What is durable and recoverable after a crash?"
SchemaCatalog → "Which trees represent which database objects?"
Those four concerns are intentionally separated, and that separation is the main architectural strength of CSharpDB.Storage.