Storage Architecture Deep Dive

How CSharpDB.Storage is designed and how its components collaborate — written for mid-level C# engineers comfortable with .NET async patterns, interfaces, and clean architecture.

Source: This page mirrors samples/storage-tutorials/architecture.md in the repository. See also the companion Usage & Extensibility guide.

Role of CSharpDB.Storage in the Ecosystem

CSharpDB.Storage is the durability and physical layout layer. It is responsible for:

  • Persisting pages to a backing store via IStorageDevice
  • Managing in-memory page caching and dirty tracking (Pager)
  • Providing crash safety and commit durability through a write-ahead log (WriteAheadLog)
  • Offering a row-id keyed B+tree for table and index data (BTree, BTreeCursor, SlottedPage)
  • Encoding row values and schema metadata into compact byte payloads (RecordEncoder, SchemaSerializer)
  • Maintaining a system catalog for tables, indexes, views, and triggers (SchemaCatalog)

The higher layers (CSharpDB.Engine, CSharpDB.Execution, and the SQL planner/executor) treat storage as a stable persistence substrate. The engine delegates transaction lifecycle methods (BeginTransactionAsync, CommitAsync, RollbackAsync) to the Pager.

High-Level Architecture

At runtime, write/read operations flow through this stack:

┌──────────────────────────────────────────────────────────┐
│  Engine / Planner                                        │
│  Decides what logical records to read/write              │
├──────────────────────────────────────────────────────────┤
│  SchemaCatalog + BTree                                   │
│  Resolves table/index trees, performs key-level ops      │
├──────────────────────────────────────────────────────────┤
│  Pager                                                   │
│  Page access, cache, dirty bookkeeping, transactions     │
├──────────────────────────────────────────────────────────┤
│  WriteAheadLog + WalIndex                                │
│  Durability, commit visibility, recovery, checkpointing  │
├──────────────────────────────────────────────────────────┤
│  IStorageDevice                                          │
│  Physical file reads/writes/flushes                      │
└──────────────────────────────────────────────────────────┘

Conceptually:

  • BTree is the logical page consumer (structure + algorithms)
  • Pager is the transactional page manager (state + coordination)
  • WAL is the durability journal (redo log with commit markers)
  • Storage device is the pluggable I/O boundary

File Format & On-Disk Layout

Database Pages

  • Fixed page size: 4096 bytes (PageConstants.PageSize)
  • Page 0 contains a 100-byte file header, then usable page content
  • Remaining pages are fully usable for B+tree or freelist content

The file header tracks:

┌───────────────────────┬──────────────┐
│ magic / version       │ bytes 0-15   │
│ page size             │ bytes 16-19  │
│ total page count      │ bytes 20-23  │
│ schema catalog root   │ bytes 24-27  │
│ freelist head page    │ bytes 28-31  │
│ change counter        │ bytes 32-35  │
│ (reserved)            │ bytes 36-99  │
└───────────────────────┴──────────────┘
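A sketch of how these header fields might be read and written. The offsets follow the table above; the little-endian byte order and the type/method names here are assumptions, not the project's actual code:

```csharp
using System;
using System.Buffers.Binary;

// Illustrative reader/writer for the 100-byte file header described above.
// Offsets match the table; byte order and names are assumptions.
public static class FileHeaderSketch
{
    public const int Size = 100;

    public static void Write(Span<byte> header, int pageSize, int pageCount,
                             int catalogRoot, int freelistHead, int changeCounter)
    {
        BinaryPrimitives.WriteInt32LittleEndian(header.Slice(16, 4), pageSize);
        BinaryPrimitives.WriteInt32LittleEndian(header.Slice(20, 4), pageCount);
        BinaryPrimitives.WriteInt32LittleEndian(header.Slice(24, 4), catalogRoot);
        BinaryPrimitives.WriteInt32LittleEndian(header.Slice(28, 4), freelistHead);
        BinaryPrimitives.WriteInt32LittleEndian(header.Slice(32, 4), changeCounter);
    }

    public static (int PageSize, int PageCount, int CatalogRoot, int FreelistHead, int ChangeCounter)
        Read(ReadOnlySpan<byte> header) =>
        (BinaryPrimitives.ReadInt32LittleEndian(header.Slice(16, 4)),
         BinaryPrimitives.ReadInt32LittleEndian(header.Slice(20, 4)),
         BinaryPrimitives.ReadInt32LittleEndian(header.Slice(24, 4)),
         BinaryPrimitives.ReadInt32LittleEndian(header.Slice(28, 4)),
         BinaryPrimitives.ReadInt32LittleEndian(header.Slice(32, 4)));
}
```

Keeping all header access behind one helper like this is what lets the Pager rehydrate header state cheaply on rollback.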

Slotted Page Structure

Each B+tree page uses a slotted layout:

┌──────────────────────────────────────────────────────────┐
│  Page Header (type, cell count, content-start,           │
│               right-child / next-leaf pointer)           │
├──────────────┬───────────────────────┬───────────────────┤
│ Pointer Array│    Free Space         │   Cell Content    │
│ (grows →)    │                       │        (← grows)  │
└──────────────┴───────────────────────┴───────────────────┘

SlottedPage encapsulates this structure: cell insert/delete, free-space accounting, defragmentation, and typed access to page header fields.
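A minimal sketch of the free-space accounting this layout implies: the pointer array grows forward from the header while cell content grows backward from the page end. The header size, slot width, and member names are illustrative, not the project's actual constants:

```csharp
using System;

// Toy slotted page: forward-growing slot array, backward-growing cells.
// HeaderSize and SlotSize are assumed values for illustration only.
public sealed class SlottedPageSketch
{
    const int HeaderSize = 8;   // assumed header width
    const int SlotSize = 2;     // one ushort offset per cell
    readonly byte[] _page;
    int _cellCount;
    int _contentStart;

    public SlottedPageSketch(int pageSize)
    {
        _page = new byte[pageSize];
        _contentStart = pageSize;   // cells grow down from the end
    }

    public int FreeSpace => _contentStart - (HeaderSize + _cellCount * SlotSize);

    public bool TryInsert(ReadOnlySpan<byte> cell)
    {
        // A cell costs its payload plus one slot entry.
        if (cell.Length + SlotSize > FreeSpace) return false;
        _contentStart -= cell.Length;
        cell.CopyTo(_page.AsSpan(_contentStart));
        int slot = HeaderSize + _cellCount * SlotSize;
        _page[slot] = (byte)(_contentStart & 0xFF);
        _page[slot + 1] = (byte)(_contentStart >> 8);
        _cellCount++;
        return true;
    }

    public byte[] GetCell(int index, int length)
    {
        int slot = HeaderSize + index * SlotSize;
        int offset = _page[slot] | (_page[slot + 1] << 8);
        return _page.AsSpan(offset, length).ToArray();
    }
}
```

Deleting cells leaves holes in the content region, which is why SlottedPage also needs defragmentation.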

WAL File Layout

The WAL file (<db>.wal) contains:

  • 32-byte WAL header (magic/version/page size/salts)
  • Repeated frames, each containing:
    • 24-byte frame header
    • Full 4096-byte page image

Frames are appended for dirty pages. A transaction is committed by marking the last frame with non-zero dbPageCount and flushing the WAL stream.
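The commit rule above can be sketched as follows. The record and field names are illustrative, but the logic mirrors the description: a recovery scan keeps frames only up to the last one carrying a non-zero dbPageCount; anything after it belongs to an uncommitted or torn transaction:

```csharp
using System;
using System.Collections.Generic;

// Illustrative frame: a commit marker is a frame whose DbPageCount != 0.
public sealed record WalFrameSketch(int PageId, int DbPageCount, byte[] PageImage);

public static class WalCommitRule
{
    // Returns the durable prefix of the WAL: everything up to and
    // including the last frame marked with a non-zero DbPageCount.
    public static List<WalFrameSketch> CommittedPrefix(IReadOnlyList<WalFrameSketch> frames)
    {
        int lastCommit = -1;
        for (int i = 0; i < frames.Count; i++)
            if (frames[i].DbPageCount != 0) lastCommit = i;
        return new List<WalFrameSketch>(frames).GetRange(0, lastCommit + 1);
    }
}
```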

Core Components & Responsibilities

I/O Abstraction: IStorageDevice

  • IStorageDevice defines async primitives for random reads/writes, flushing, resizing, and disposal
  • FileStorageDevice is the default, using RandomAccess APIs and async random I/O
  • Reads zero-fill any requested bytes beyond end-of-file

Why it matters: this is the main extensibility seam for alternate backends (in-memory, encrypted, remote block device) without changing Pager/BTree logic.
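For instance, a hypothetical in-memory backend behind this seam might look like the following. The interface shape is inferred from the primitives listed above, not copied from the real IStorageDevice; note how reads zero-fill beyond end-of-file:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Assumed interface shape, inferred from the primitives listed above.
public interface IStorageDeviceSketch : IAsyncDisposable
{
    ValueTask ReadAsync(long offset, Memory<byte> buffer, CancellationToken ct = default);
    ValueTask WriteAsync(long offset, ReadOnlyMemory<byte> buffer, CancellationToken ct = default);
    ValueTask FlushAsync(CancellationToken ct = default);
}

public sealed class InMemoryStorageDevice : IStorageDeviceSketch
{
    byte[] _data = Array.Empty<byte>();

    public ValueTask ReadAsync(long offset, Memory<byte> buffer, CancellationToken ct = default)
    {
        buffer.Span.Clear(); // zero-fill anything beyond EOF, matching the contract above
        if (offset < _data.Length)
        {
            int n = (int)Math.Min(buffer.Length, _data.Length - offset);
            _data.AsSpan((int)offset, n).CopyTo(buffer.Span);
        }
        return ValueTask.CompletedTask;
    }

    public ValueTask WriteAsync(long offset, ReadOnlyMemory<byte> buffer, CancellationToken ct = default)
    {
        long end = offset + buffer.Length;
        if (end > _data.Length) Array.Resize(ref _data, (int)end);
        buffer.Span.CopyTo(_data.AsSpan((int)offset));
        return ValueTask.CompletedTask;
    }

    public ValueTask FlushAsync(CancellationToken ct = default) => ValueTask.CompletedTask;
    public ValueTask DisposeAsync() => ValueTask.CompletedTask;
}
```

An in-memory device like this is also convenient for exercising WAL and pager tests without touching disk.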

Page Manager & Transaction Coordinator: Pager

Pager is the central coordinator with these responsibilities:

  • Caches pages via IPageCache (DictionaryPageCache by default, LruPageCache when bounded, or a custom cache)
  • Tracks dirty page IDs within a write transaction
  • Manages page allocation and freelist reuse
  • Reads/writes file header state
  • Owns transaction flow: begin, commit, rollback
  • Delegates WAL appends/commit/rollback/checkpoint
  • Supports concurrent snapshot readers and checkpoint coordination

Concurrency Model

┌───────────────────────────┬──────────────────────────────────────────────────────┐
│ Mechanism                 │ Purpose                                              │
├───────────────────────────┼──────────────────────────────────────────────────────┤
│ SemaphoreSlim writer lock │ Single-writer exclusion with timeout                 │
│ Checkpoint lock           │ Serializes checkpoint operations                     │
│ Active reader counter     │ Prevents checkpoint while snapshot readers are alive │
│ Immutable WalSnapshot     │ Snapshot readers use frozen state + read-only pager  │
└───────────────────────────┴──────────────────────────────────────────────────────┘
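The writer-lock mechanism can be sketched like this; the timeout handling and releaser type are illustrative, not the actual Pager internals:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Single-writer exclusion with a timeout, sketched with SemaphoreSlim(1, 1).
public sealed class WriterLockSketch
{
    readonly SemaphoreSlim _writerLock = new(1, 1);

    public async Task<IDisposable> AcquireAsync(TimeSpan timeout)
    {
        if (!await _writerLock.WaitAsync(timeout))
            throw new TimeoutException("another write transaction holds the lock");
        return new Releaser(_writerLock);
    }

    sealed class Releaser : IDisposable
    {
        readonly SemaphoreSlim _sem;
        public Releaser(SemaphoreSlim sem) => _sem = sem;
        public void Dispose() => _sem.Release();
    }
}
```

The key property is that a second BeginTransactionAsync fails fast with a timeout instead of deadlocking.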

Transaction Behavior

// BeginTransactionAsync
// → acquires writer lock, starts WAL transaction

// CommitAsync
// → writes all dirty pages as WAL frames
// → WAL commits (mark final frame + flush)
// → clears dirty set, releases writer lock

// RollbackAsync
// → truncates uncommitted WAL frames
// → clears cache/dirty state
// → rehydrates header state, releases lock

Checkpoint Behavior

  • Triggered manually or automatically when WAL frame count threshold is reached
  • Skips if active readers hold snapshots
  • Applies committed WAL pages back into main DB file and resets WAL/index

Durability & Recovery: WriteAheadLog + WalIndex

WriteAheadLog

  • Create/open WAL file
  • Append page frames for dirty pages
  • Mark commit by rewriting final frame header (dbPageCount) + checksum + flush
  • Recover by scanning WAL, validating frame salts/checksums, rebuilding committed index, truncating partial/corrupt tail
  • Checkpoint committed pages to DB file and reset WAL

WalIndex

  • Maps pageId → latest committed WAL offset
  • Tracks committed frame count
  • Produces WalSnapshot copies for readers

WalSnapshot — Snapshot Isolation

  • Readers resolve page lookups through a frozen page-to-offset map
  • New commits do not affect an existing snapshot
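The freeze-at-acquisition behavior can be sketched with a copied dictionary; the class and method names are illustrative:

```csharp
using System;
using System.Collections.Generic;

// Sketch of WalSnapshot semantics: a reader copies the pageId → WAL offset
// map at snapshot time, so later commits never change what it resolves.
public sealed class WalIndexSketch
{
    readonly Dictionary<int, long> _latest = new();

    public void Commit(int pageId, long walOffset) => _latest[pageId] = walOffset;

    // A snapshot is simply a frozen copy of the committed map.
    public IReadOnlyDictionary<int, long> Snapshot() => new Dictionary<int, long>(_latest);
}
```

This copy-on-acquire design is what makes snapshot creation cheap, at the cost noted later in the tradeoffs table: the copy grows with the WAL page map.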

Data Structure: BTree & BTreeCursor

BTree is a row-id keyed B+tree (long key).

┌─────────────────┬────────────────────────────────────────────────────┐
│ Cell Type       │ Contents                                           │
├─────────────────┼────────────────────────────────────────────────────┤
│ Leaf cell       │ key + payload bytes                                │
│ Interior cell   │ left child + separator key                         │
│ Interior header │ right-child pointer; leaf pages chain via nextLeaf │
└─────────────────┴────────────────────────────────────────────────────┘

Supported operations: point lookup (FindAsync + cached fast path TryFindCached), insert with recursive split, delete (no underflow rebalance), full count by leaf traversal, and cursor creation.

Performance: Leaf hint cache (_hintLeafPageId, min/max key range) bypasses root-to-leaf traversal for clustered lookups. Binary search utilities (LowerBoundLeaf, UpperBoundInterior) operate on per-page cells.
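A lower-bound binary search in the spirit of LowerBoundLeaf looks like the following sketch; the real method operates on per-page cells, while this version uses a plain sorted key array:

```csharp
using System;

// Returns the first index whose key is >= target (standard lower bound),
// which is the insert/lookup position within a sorted leaf.
public static class LeafSearchSketch
{
    public static int LowerBound(ReadOnlySpan<long> keys, long target)
    {
        int lo = 0, hi = keys.Length;
        while (lo < hi)
        {
            int mid = lo + ((hi - lo) >> 1);
            if (keys[mid] < target) lo = mid + 1;
            else hi = mid;
        }
        return lo;
    }
}
```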

BTreeCursor: Forward-only iterator starting at the leftmost leaf and advancing through the leaf-linked list. Supports SeekAsync(targetKey) then incremental MoveNextAsync.
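The leaf-chain traversal can be sketched as follows; the node and cursor types are illustrative stand-ins for SlottedPage leaves linked via nextLeaf:

```csharp
using System;
using System.Collections.Generic;

// Toy leaf: sorted keys plus a link to the next leaf in the chain.
public sealed class LeafNodeSketch
{
    public List<long> Keys = new();
    public LeafNodeSketch? NextLeaf;
}

// Forward-only cursor: exhausts one leaf's keys, then follows NextLeaf.
public sealed class CursorSketch
{
    LeafNodeSketch? _leaf;
    int _index = -1;

    public CursorSketch(LeafNodeSketch leftmost) => _leaf = leftmost;

    public long Current => _leaf!.Keys[_index];

    public bool MoveNext()
    {
        while (_leaf != null)
        {
            if (++_index < _leaf.Keys.Count) return true;
            _leaf = _leaf.NextLeaf;   // hop to the next leaf in the chain
            _index = -1;
        }
        return false;
    }
}
```

Because leaves are chained, a full scan never revisits interior pages after reaching the leftmost leaf.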

Serialization: RecordEncoder, Varint, SchemaSerializer

Record Encoding

Rows are encoded as a column-count varint followed by repeated [typeTag][valueData] pairs. Optimized APIs avoid materializing full rows:

DecodeInto          // reuse destination span
DecodeUpTo          // prefix decode (first N columns)
DecodeColumn        // single-column extraction
TryColumnTextEquals // UTF-8 equality without string alloc
IsColumnNull        // null check without decode
TryDecodeNumericColumn // numeric extraction

Varint

Unsigned LEB128-style encoding used by record and schema serializers for compact lengths/sizes.
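A minimal LEB128-style encoder/decoder matching this description (seven payload bits per byte, high bit set on every byte except the last; names are illustrative):

```csharp
using System;

// Unsigned LEB128-style varint: 7 value bits per byte, MSB as continuation flag.
public static class VarintSketch
{
    public static int Write(Span<byte> dest, ulong value)
    {
        int i = 0;
        while (value >= 0x80)
        {
            dest[i++] = (byte)(value | 0x80);  // continuation bit set
            value >>= 7;
        }
        dest[i++] = (byte)value;               // final byte, high bit clear
        return i;
    }

    public static ulong Read(ReadOnlySpan<byte> src, out int bytesRead)
    {
        ulong value = 0;
        int shift = 0, i = 0;
        byte b;
        do
        {
            b = src[i++];
            value |= (ulong)(b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        bytesRead = i;
        return value;
    }
}
```

Small values (the common case for column counts and lengths) encode in a single byte, which is the point of the format.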

Schema Metadata

SchemaSerializer serializes/deserializes TableSchema, IndexSchema, and TriggerSchema. Maps names to long keys via deterministic hash functions for catalog storage.
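The exact hash function is not specified here; as one deterministic possibility with the right shape (string in, stable long key out), 64-bit FNV-1a over UTF-8 bytes can stand in, purely as an illustration:

```csharp
using System;
using System.Text;

// Hypothetical name → long key mapping. FNV-1a is an illustrative choice,
// NOT the project's actual hash; any deterministic 64-bit hash has the shape
// SchemaCatalog needs for keying catalog entries.
public static class NameKeySketch
{
    public static long ToKey(string name)
    {
        const ulong offsetBasis = 14695981039346656037UL;
        const ulong prime = 1099511628211UL;
        ulong hash = offsetBasis;
        foreach (byte b in Encoding.UTF8.GetBytes(name))
        {
            hash ^= b;
            hash = unchecked(hash * prime);
        }
        return unchecked((long)hash);
    }
}
```

Determinism is the essential property: the same table name must map to the same catalog key across process restarts.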

Catalog & Metadata: SchemaCatalog

SchemaCatalog stores and caches metadata in B+trees:

  • Main catalog tree stores table definitions keyed by hashed table name
  • Sentinel keys point to separate subtrees for indexes, views, and triggers
  • In-memory caches mirror catalog for fast lookups

Responsibilities: create/drop/update table schemas, create/drop indexes/views/triggers, provide table/index BTree instances, track root page IDs, persist root-page changes after splits (PersistRootPageChangesAsync).

Data Flow Walkthroughs

Database Open / Recovery

  1. Engine creates FileStorageDevice, WalIndex, WriteAheadLog, Pager
  2. New DB: initialize page 0 header, create/open WAL
  3. Existing DB: pager loads header, WAL recovery scans WAL
  4. If committed WAL frames exist after recovery, checkpoint runs to bring DB file current
  5. SchemaCatalog loads metadata trees and warms in-memory schema caches

Write Transaction

  1. Pager.BeginTransactionAsync acquires writer lock and starts WAL transaction
  2. Planner/catalog/BTree mutate page byte arrays
  3. Modified pages are marked dirty in pager
  4. Pager.CommitAsync: updates page 0 header → appends dirty pages to WAL → WAL commits (mark final frame + flush) → WAL index visible to future readers → writer lock released
  5. Optional auto-checkpoint may run when threshold reached

Rollback

  1. WAL is truncated to tx start offset
  2. Pager clears dirty state and cache
  3. Header state is reloaded from DB file (or latest committed WAL page 0 if present)
  4. Writer lock is released

Read Path (Normal & Snapshot)

Normal read: BTree asks pager for pages. Pager resolves from cache first, then latest WAL frame (if any), then DB file.
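That resolution order can be sketched with dictionaries standing in for the real cache, WAL index, and storage device:

```csharp
using System;
using System.Collections.Generic;

// Sketch of the normal read path: cache first, then the latest committed
// WAL frame, then the main database file.
public sealed class PageResolverSketch
{
    public Dictionary<int, byte[]> Cache = new();
    public Dictionary<int, byte[]> WalFrames = new();  // latest committed image per page
    public Dictionary<int, byte[]> DbFile = new();

    public byte[]? Resolve(int pageId)
    {
        if (Cache.TryGetValue(pageId, out var p)) return p;
        if (WalFrames.TryGetValue(pageId, out p)) return p;
        return DbFile.TryGetValue(pageId, out p) ? p : null;
    }
}
```

The ordering matters: a committed WAL frame must shadow the stale image in the DB file until a checkpoint copies it back.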

Snapshot read:

  1. Reader acquires WalSnapshot from pager
  2. Read-only snapshot pager is created
  3. Page resolution uses snapshot mapping only; later commits are invisible
  4. On disposal, reader count decrements, allowing future checkpoint

Query Scans

BTreeCursor begins at leftmost leaf and walks nextLeaf pointers. Row payloads are decoded via RecordEncoder as needed by execution operators.

Architectural Patterns

┌──────────────────────────┬────────────────────────────────────────────────────────────────────────┐
│ Pattern                  │ Where                                                                  │
├──────────────────────────┼────────────────────────────────────────────────────────────────────────┤
│ Abstraction boundary     │ IStorageDevice decouples persistence medium from page/WAL logic        │
│ Coordinator pattern      │ Pager centralizes transactional lifecycle and page state               │
│ Redo logging             │ WAL stores post-image page frames and commit markers                   │
│ Snapshot pattern         │ WalSnapshot provides immutable read view for concurrent readers        │
│ Repository-like metadata │ SchemaCatalog encapsulates schema persistence and in-memory indexing   │
│ Cursor iteration         │ BTreeCursor for ordered streaming traversal                            │
│ Format utility classes   │ RecordEncoder, SchemaSerializer, Varint isolate binary format concerns │
└──────────────────────────┴────────────────────────────────────────────────────────────────────────┘

Plugin architecture: The project is partially pluggable today. First-class provider seams exist for storage device, page cache, checkpoint policy, lifecycle interceptors, checksum provider, index provider, serializer provider, catalog store, and the storage engine factory. Pager/WAL/BTree internals remain concrete.

Extensibility Model

Current Extension Points

┌──────────────────────┬───────────────────────────┬────────────────────────────────────┐
│ Concern              │ Interface                 │ Default                            │
├──────────────────────┼───────────────────────────┼────────────────────────────────────┤
│ Storage backend      │ IStorageDevice            │ FileStorageDevice                  │
│ Page cache           │ IPageCache                │ DictionaryPageCache / LruPageCache │
│ Auto-checkpoint      │ ICheckpointPolicy         │ FrameCountCheckpointPolicy         │
│ Lifecycle hooks      │ IPageOperationInterceptor │ NoOpPageOperationInterceptor       │
│ WAL checksum         │ IPageChecksumProvider     │ AdditiveChecksumProvider           │
│ Index implementation │ IIndexProvider            │ BTreeIndexProvider                 │
│ Serialization        │ ISerializerProvider       │ DefaultSerializerProvider          │
│ Catalog codec        │ ICatalogStore             │ CatalogStore                       │
│ Composition root     │ IStorageEngineFactory     │ DefaultStorageEngineFactory        │
└──────────────────────┴───────────────────────────┴────────────────────────────────────┘

Future Plugin Support

  • Expose WAL replacement through StorageEngineOptionsBuilder instead of requiring a custom IStorageEngineFactory
  • Introduce a higher-level abstraction around pager/WAL orchestration for alternate journaling schemes
  • Consider whether page layout and B+tree algorithms should ever be swappable, or remain intentionally concrete

Tradeoffs & Rationale

┌─────────────────────────────────┬──────────────────────────────────────────────┬───────────────────────────────────────────────────────────┐
│ Decision                        │ Benefit                                      │ Tradeoff                                                  │
├─────────────────────────────────┼──────────────────────────────────────────────┼───────────────────────────────────────────────────────────┤
│ Whole-page WAL frames           │ Implementation clarity, recovery simplicity  │ Larger write amplification for small updates              │
│ Single-writer lock              │ Predictable conflict/consistency handling    │ Write throughput bottleneck under heavy concurrent writes │
│ Snapshot readers via copied map │ Cheap creation, easy to reason about         │ Copy cost grows with WAL page map size                    │
│ Delete without rebalance        │ Simpler implementation, fewer edge-case bugs │ Potential space inefficiency over time                    │
│ Hash-based catalog keys         │ Deterministic and lightweight                │ Theoretical collision risk                                │
│ Static utility serializers      │ Low overhead, fast                           │ Reduced runtime configurability                           │
└─────────────────────────────────┴──────────────────────────────────────────────┴───────────────────────────────────────────────────────────┘

Contributor Quick Guide

A) Add a New Storage Backend

  1. Implement IStorageDevice
  2. Ensure read semantics match expectations (especially zero-fill beyond EOF behavior)
  3. Validate with WAL + pager tests (transaction commit/rollback/recovery)

B) Add Row Encoding Optimizations

  1. Extend RecordEncoder with focused decode helpers for execution hot paths
  2. Keep wire format backward-compatible unless versioning header strategy is introduced
  3. Add tests for mixed type decoding and edge cases (missing columns, null semantics)

C) Improve BTree Behavior

  1. Add rebalance/merge logic for delete underflow
  2. Preserve cursor and lookup correctness with split/merge interactions
  3. Ensure root-page updates are persisted through SchemaCatalog.PersistRootPageChangesAsync

D) Pluggable Index Providers (Larger Refactor)

  1. Define an interface for index operations used by planner/catalog
  2. Provide adapter for existing BTree implementation
  3. Migrate factory points in engine/database construction

E) Concurrency Changes

  1. Be cautious around writer/checkpoint/reader interactions
  2. Any checkpoint policy changes must preserve snapshot guarantees
  3. Re-run WAL recovery and concurrent reader tests after modifications

Practical Mental Model

A useful way to reason about this storage engine:

  • BTree: "Where is this key/value logically?"
  • Pager: "Which page image is authoritative right now?"
  • WAL: "What is durable and recoverable after a crash?"
  • SchemaCatalog: "Which trees represent which database objects?"

Those four concerns are intentionally separated, and that separation is the main architectural strength of CSharpDB.Storage.