Storage Architecture Deep Dive
How CSharpDB.Storage is designed and how its components collaborate — written for mid-level C# engineers comfortable with .NET async patterns, interfaces, and clean architecture.
This document lives at samples/storage-tutorials/architecture.md in the repository. See also the companion Usage & Extensibility guide.
Role of CSharpDB.Storage in the Ecosystem
CSharpDB.Storage is the durability and physical layout layer. It is responsible for:
- Persisting pages to a backing store via `IStorageDevice`
- Managing in-memory page caching and dirty tracking (`Pager`)
- Providing crash safety and commit durability through a write-ahead log (`WriteAheadLog`)
- Offering a row-id keyed B+tree for table and index data (`BTree`, `BTreeCursor`, `SlottedPage`)
- Encoding row values and schema metadata into compact byte payloads (`RecordEncoder`, `SchemaSerializer`)
- Maintaining a system catalog for tables, indexes, views, and triggers (`SchemaCatalog`)
The higher layers (CSharpDB.Engine, CSharpDB.Execution, and the SQL planner/executor) treat storage as a stable persistence substrate. The engine delegates transaction lifecycle methods (BeginTransactionAsync, CommitAsync, RollbackAsync) to the Pager.
High-Level Architecture
At runtime, write/read operations flow through this stack:
┌──────────────────────────────────────────────────────────┐
│ Engine / Planner │
│ Decides what logical records to read/write │
├──────────────────────────────────────────────────────────┤
│ SchemaCatalog + BTree │
│ Resolves table/index trees, performs key-level ops │
├──────────────────────────────────────────────────────────┤
│ Pager │
│ Page access, cache, dirty bookkeeping, transactions │
├──────────────────────────────────────────────────────────┤
│ WriteAheadLog + WalIndex │
│ Durability, commit visibility, recovery, checkpointing │
├──────────────────────────────────────────────────────────┤
│ IStorageDevice │
│ Physical file reads/writes/flushes │
└──────────────────────────────────────────────────────────┘
Conceptually:
- BTree is the logical page consumer (structure + algorithms)
- Pager is the transactional page manager (state + coordination)
- WAL is the durability journal (redo log with commit markers)
- Storage device is the pluggable I/O boundary
File Format & On-Disk Layout
Database Pages
- Fixed page size: 4096 bytes (`PageConstants.PageSize`)
- Page 0 contains a 100-byte file header, then usable page content
- Remaining pages are fully usable for B+tree or freelist content
The file header tracks:
┌───────────────────────┬──────────────┐
│ magic / version │ bytes 0-15 │
│ page size │ bytes 16-19 │
│ total page count │ bytes 20-23 │
│ schema catalog root │ bytes 24-27 │
│ freelist head page │ bytes 28-31 │
│ change counter │ bytes 32-35 │
│ (reserved) │ bytes 36-99 │
└───────────────────────┴──────────────┘
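As a sketch, these fields can be read straight off page 0 at the offsets in the table above. The `DbHeader` record name and the little-endian byte order are assumptions for illustration, not the repository's actual types:

```csharp
using System;
using System.Buffers.Binary;

// Build a synthetic page 0 and round-trip a couple of header fields.
var buf = new byte[4096];
BinaryPrimitives.WriteUInt32LittleEndian(buf.AsSpan(16, 4), 4096); // page size
BinaryPrimitives.WriteUInt32LittleEndian(buf.AsSpan(20, 4), 7);    // total page count
var header = ParseHeader(buf);
Console.WriteLine($"{header.PageSize} {header.PageCount}");

// Offsets follow the layout table above; little-endian is an assumption.
DbHeader ParseHeader(ReadOnlySpan<byte> page0) => new(
    PageSize:      BinaryPrimitives.ReadUInt32LittleEndian(page0.Slice(16, 4)),
    PageCount:     BinaryPrimitives.ReadUInt32LittleEndian(page0.Slice(20, 4)),
    CatalogRoot:   BinaryPrimitives.ReadUInt32LittleEndian(page0.Slice(24, 4)),
    FreelistHead:  BinaryPrimitives.ReadUInt32LittleEndian(page0.Slice(28, 4)),
    ChangeCounter: BinaryPrimitives.ReadUInt32LittleEndian(page0.Slice(32, 4)));

// Hypothetical container for the page-0 header fields; names are illustrative.
record DbHeader(uint PageSize, uint PageCount, uint CatalogRoot, uint FreelistHead, uint ChangeCounter);
```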
Slotted Page Structure
Each B+tree page uses a slotted layout:
┌──────────────────────────────────────────────────────────┐
│ Page Header (type, cell count, content-start, │
│ right-child / next-leaf pointer) │
├──────────────┬───────────────────────┬───────────────────┤
│ Pointer Array│ Free Space │ Cell Content │
│ (grows →) │ │ (← grows) │
└──────────────┴───────────────────────┴───────────────────┘
SlottedPage encapsulates this structure: cell insert/delete, free-space accounting, defragmentation, and typed access to page header fields.
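The free-space math behind that accounting is simple: the pointer array grows right from the header while cell content grows left from the end of the page. The header size and slot width below are illustrative constants, not the repository's actual values:

```csharp
using System;

// Free space in a slotted page = gap between the end of the slot pointer
// array (growing right) and the start of cell content (growing left).
// Header size (8) and slot width (2) are illustrative assumptions.
const int PageSize = 4096, HeaderSize = 8, SlotWidth = 2;

int FreeSpace(int cellCount, int contentStart) =>
    contentStart - (HeaderSize + cellCount * SlotWidth);

// Empty page: content starts at the end of the page.
Console.WriteLine(FreeSpace(0, PageSize));       // 4088
// After inserting a 100-byte cell: one slot pointer, content moved left by 100.
Console.WriteLine(FreeSpace(1, PageSize - 100)); // 3986
```

When deletes leave the cell content region fragmented, defragmentation compacts cells back against the end of the page so this gap is contiguous again.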
WAL File Layout
The WAL file (<db>.wal) contains:
- 32-byte WAL header (magic/version/page size/salts)
- Repeated frames, each containing:
- 24-byte frame header
- Full 4096-byte page image
Frames are appended for dirty pages. A transaction is committed by marking the last frame with non-zero dbPageCount and flushing the WAL stream.
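Because frames are fixed-size, locating frame *i* is pure arithmetic. A sketch using the sizes given above (the helper names are illustrative):

```csharp
using System;

// With a 32-byte WAL header and fixed-size frames (24-byte frame header +
// 4096-byte page image), frame i lives at a computable offset.
const int WalHeaderSize = 32, FrameHeaderSize = 24, PageSize = 4096;
const int FrameSize = FrameHeaderSize + PageSize; // 4120 bytes per frame

long FrameOffset(int frameIndex) =>
    WalHeaderSize + (long)frameIndex * FrameSize;

// A frame marks a commit when its header's dbPageCount field is non-zero.
bool IsCommitFrame(uint dbPageCount) => dbPageCount != 0;

Console.WriteLine(FrameOffset(0)); // 32
Console.WriteLine(FrameOffset(2)); // 32 + 2*4120 = 8272
Console.WriteLine(IsCommitFrame(0)); // False: an ordinary, uncommitted frame
```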
Core Components & Responsibilities
I/O Abstraction: IStorageDevice
- `IStorageDevice` defines async primitives for random reads/writes, flushing, resizing, and disposal
- `FileStorageDevice` is the default implementation, built on the `RandomAccess` APIs for async random I/O
- Reads beyond the end of the file zero-fill the unread portion of the buffer
Page Manager & Transaction Coordinator: Pager
Pager is the central coordinator with these responsibilities:
- Caches pages via `IPageCache` (`DictionaryPageCache` by default, `LruPageCache` when bounded, or a custom cache)
- Tracks dirty page IDs within a write transaction
- Manages page allocation and freelist reuse
- Reads/writes file header state
- Owns transaction flow: begin, commit, rollback
- Delegates WAL appends/commit/rollback/checkpoint
- Supports concurrent snapshot readers and checkpoint coordination
Concurrency Model
| Mechanism | Purpose |
|---|---|
| `SemaphoreSlim` writer lock | Single-writer exclusion with timeout |
| Checkpoint lock | Serializes checkpoint operations |
| Active reader counter | Prevents checkpoint while snapshot readers are alive |
| Immutable `WalSnapshot` | Snapshot readers use frozen state + read-only pager |
Transaction Behavior
// BeginTransactionAsync
// → acquires writer lock, starts WAL transaction
// CommitAsync
// → writes all dirty pages as WAL frames
// → WAL commits (mark final frame + flush)
// → clears dirty set, releases writer lock
// RollbackAsync
// → truncates uncommitted WAL frames
// → clears cache/dirty state
// → rehydrates header state, releases lock
Checkpoint Behavior
- Triggered manually or automatically when WAL frame count threshold is reached
- Skips if active readers hold snapshots
- Applies committed WAL pages back into main DB file and resets WAL/index
Durability & Recovery: WriteAheadLog + WalIndex
WriteAheadLog
- Create/open WAL file
- Append page frames for dirty pages
- Mark commit by rewriting the final frame header (`dbPageCount`) plus checksum, then flushing
- Recover by scanning the WAL, validating frame salts/checksums, rebuilding the committed index, and truncating any partial/corrupt tail
- Checkpoint committed pages to DB file and reset WAL
WalIndex
- Maps `pageId` → latest committed WAL offset
- Tracks committed frame count
- Produces `WalSnapshot` copies for readers
WalSnapshot — Snapshot Isolation
- Readers resolve page lookups through a frozen page-to-offset map
- New commits do not affect an existing snapshot
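The mechanism can be illustrated with a plain dictionary copy standing in for the page-to-offset map (the offsets and the copy-based acquisition below are illustrative):

```csharp
using System;
using System.Collections.Generic;

// A snapshot copies the page → WAL-offset map at acquisition time, so later
// commits (which mutate the live map) are invisible to the snapshot.
var liveIndex = new Dictionary<long, long> { [1] = 4152, [2] = 8272 };

// Hypothetical snapshot acquisition: freeze a copy of the current map.
var snapshot = new Dictionary<long, long>(liveIndex);

liveIndex[1] = 12392; // a new commit updates page 1's latest frame offset

Console.WriteLine(snapshot[1]);  // still 4152: the snapshot is frozen
Console.WriteLine(liveIndex[1]); // 12392: future readers see the new frame
```

This is also why snapshot creation cost scales with the size of the WAL page map, as noted in the tradeoffs table below.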
Data Structure: BTree & BTreeCursor
BTree is a row-id keyed B+tree (long key).
| Cell Type | Contents |
|---|---|
| Leaf cell | key + payload bytes |
| Interior cell | left child + separator key |
| Page header pointer | interior pages store a right-child pointer; leaf pages chain via nextLeaf |
Supported operations: point lookup (FindAsync + cached fast path TryFindCached), insert with recursive split, delete (no underflow rebalance), full count by leaf traversal, and cursor creation.
Performance: Leaf hint cache (_hintLeafPageId, min/max key range) bypasses root-to-leaf traversal for clustered lookups. Binary search utilities (LowerBoundLeaf, UpperBoundInterior) operate on per-page cells.
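As a sketch of what a `LowerBoundLeaf`-style helper does per page, here is a classic lower-bound search over a leaf's sorted cell keys:

```csharp
using System;

// Lower bound: index of the first key >= target, or keys.Length if every
// key is smaller. This is the per-page primitive behind leaf lookups.
int LowerBound(long[] keys, long target)
{
    int lo = 0, hi = keys.Length;
    while (lo < hi)
    {
        int mid = lo + (hi - lo) / 2;
        if (keys[mid] < target) lo = mid + 1; // target is in the right half
        else hi = mid;                        // keys[mid] is a candidate
    }
    return lo;
}

long[] cellKeys = { 10, 20, 30, 40 };
Console.WriteLine(LowerBound(cellKeys, 20)); // 1 (exact hit)
Console.WriteLine(LowerBound(cellKeys, 25)); // 2 (first key >= 25)
Console.WriteLine(LowerBound(cellKeys, 99)); // 4 (past the end)
```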
BTreeCursor: Forward-only iterator starting at the leftmost leaf and advancing through the leaf-linked list. Supports SeekAsync(targetKey) then incremental MoveNextAsync.
Serialization: RecordEncoder, Varint, SchemaSerializer
Record Encoding
Rows are encoded as a column-count varint followed by repeated [typeTag][valueData] pairs. Optimized APIs avoid materializing full rows:
DecodeInto // reuse destination span
DecodeUpTo // prefix decode (first N columns)
DecodeColumn // single-column extraction
TryColumnTextEquals // UTF-8 equality without string alloc
IsColumnNull // null check without decode
TryDecodeNumericColumn // numeric extraction
Varint
Unsigned LEB128-style encoding used by record and schema serializers for compact lengths/sizes.
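A minimal sketch of that encoding (this mirrors standard unsigned LEB128; the repository's `Varint` helper may differ in API shape):

```csharp
using System;
using System.Collections.Generic;

// Unsigned LEB128: 7 payload bits per byte, high bit set on every byte
// except the last. Values below 128 fit in a single byte.
List<byte> EncodeVarint(ulong value)
{
    var bytes = new List<byte>();
    do
    {
        byte b = (byte)(value & 0x7F);
        value >>= 7;
        if (value != 0) b |= 0x80; // continuation bit: more bytes follow
        bytes.Add(b);
    } while (value != 0);
    return bytes;
}

ulong DecodeVarint(IReadOnlyList<byte> bytes)
{
    ulong value = 0;
    int shift = 0;
    foreach (byte b in bytes)
    {
        value |= (ulong)(b & 0x7F) << shift;
        if ((b & 0x80) == 0) break; // high bit clear: last byte
        shift += 7;
    }
    return value;
}

Console.WriteLine(EncodeVarint(300).Count);         // 2 bytes
Console.WriteLine(DecodeVarint(EncodeVarint(300))); // 300
```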
Schema Metadata
SchemaSerializer serializes/deserializes TableSchema, IndexSchema, and TriggerSchema. Maps names to long keys via deterministic hash functions for catalog storage.
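The exact hash function is not specified here; 64-bit FNV-1a sketches the idea of a deterministic name-to-key mapping (the `NameToKey` helper name is hypothetical):

```csharp
using System;
using System.Text;

// The catalog needs a deterministic name → long mapping so the same table
// name always resolves to the same B+tree key. FNV-1a is an illustration;
// any stable 64-bit hash with low collision risk serves the same role.
long NameToKey(string name)
{
    const ulong FnvOffset = 14695981039346656037UL;
    const ulong FnvPrime  = 1099511628211UL;
    ulong hash = FnvOffset;
    foreach (byte b in Encoding.UTF8.GetBytes(name))
    {
        hash ^= b;
        hash *= FnvPrime; // wraps on overflow, which is intended for FNV
    }
    return unchecked((long)hash);
}

Console.WriteLine(NameToKey("users") == NameToKey("users")); // True: deterministic
Console.WriteLine(NameToKey("users") == NameToKey("Users")); // False: case-sensitive
```

Determinism is what makes this safe to persist: the key can be recomputed from the name alone, with only the theoretical collision risk noted in the tradeoffs table.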
Catalog & Metadata: SchemaCatalog
SchemaCatalog stores and caches metadata in B+trees:
- Main catalog tree stores table definitions keyed by hashed table name
- Sentinel keys point to separate subtrees for indexes, views, and triggers
- In-memory caches mirror catalog for fast lookups
Responsibilities: create/drop/update table schemas, create/drop indexes/views/triggers, provide table/index BTree instances, track root page IDs, persist root-page changes after splits (PersistRootPageChangesAsync).
Data Flow Walkthroughs
Database Open / Recovery
- Engine creates `FileStorageDevice`, `WalIndex`, `WriteAheadLog`, and `Pager`
- New DB: initialize the page 0 header, create/open the WAL
- Existing DB: pager loads the header, WAL recovery scans the WAL
- If committed WAL frames exist after recovery, a checkpoint runs to bring the DB file current
- `SchemaCatalog` loads metadata trees and warms in-memory schema caches
Write Transaction
- `Pager.BeginTransactionAsync` acquires the writer lock and starts a WAL transaction
- Planner/catalog/BTree mutate page byte arrays
- Modified pages are marked dirty in the pager
- `Pager.CommitAsync`: updates the page 0 header → appends dirty pages to the WAL → WAL commits (mark final frame + flush) → WAL index becomes visible to future readers → writer lock released
- An optional auto-checkpoint may run when the frame-count threshold is reached
Rollback
- WAL is truncated to tx start offset
- Pager clears dirty state and cache
- Header state is reloaded from DB file (or latest committed WAL page 0 if present)
- Writer lock is released
Read Path (Normal & Snapshot)
Normal read: BTree asks pager for pages. Pager resolves from cache first, then latest WAL frame (if any), then DB file.
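That resolution order can be sketched with stand-in stores (plain dictionaries here; the real pager consults `IPageCache`, the `WalIndex`, and the storage device):

```csharp
using System;
using System.Collections.Generic;

// Resolution order on a normal read: page cache first, then the latest
// committed WAL frame, then the main database file. The string "images"
// are placeholders for real 4096-byte page buffers.
var cache    = new Dictionary<long, string> { [1] = "cached image of page 1" };
var walIndex = new Dictionary<long, string> { [2] = "WAL frame image of page 2" };

string Resolve(long pageId,
               IReadOnlyDictionary<long, string> cached,
               IReadOnlyDictionary<long, string> wal) =>
    cached.TryGetValue(pageId, out var c) ? c                 // 1. cache hit
    : wal.TryGetValue(pageId, out var w) ? w                  // 2. latest WAL frame
    : $"page {pageId} read from DB file";                     // 3. fall back to file

Console.WriteLine(Resolve(1, cache, walIndex)); // served from cache
Console.WriteLine(Resolve(2, cache, walIndex)); // served from WAL
Console.WriteLine(Resolve(3, cache, walIndex)); // falls through to the DB file
```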
Snapshot read:
- Reader acquires a `WalSnapshot` from the pager
- A read-only snapshot pager is created
- Page resolution uses snapshot mapping only; later commits are invisible
- On disposal, reader count decrements, allowing future checkpoint
Query Scans
BTreeCursor begins at leftmost leaf and walks nextLeaf pointers. Row payloads are decoded via RecordEncoder as needed by execution operators.
Architectural Patterns
| Pattern | Where |
|---|---|
| Abstraction boundary | IStorageDevice decouples persistence medium from page/WAL logic |
| Coordinator pattern | Pager centralizes transactional lifecycle and page state |
| Redo logging | WAL stores post-image page frames and commit markers |
| Snapshot pattern | WalSnapshot provides immutable read view for concurrent readers |
| Repository-like metadata | SchemaCatalog encapsulates schema persistence and in-memory indexing |
| Cursor iteration | BTreeCursor for ordered streaming traversal |
| Format utility classes | RecordEncoder, SchemaSerializer, Varint isolate binary format concerns |
Extensibility Model
Current Extension Points
| Concern | Interface | Default |
|---|---|---|
| Storage backend | `IStorageDevice` | `FileStorageDevice` |
| Page cache | `IPageCache` | `DictionaryPageCache` / `LruPageCache` |
| Auto-checkpoint | `ICheckpointPolicy` | `FrameCountCheckpointPolicy` |
| Lifecycle hooks | `IPageOperationInterceptor` | `NoOpPageOperationInterceptor` |
| WAL checksum | `IPageChecksumProvider` | `AdditiveChecksumProvider` |
| Index implementation | `IIndexProvider` | `BTreeIndexProvider` |
| Serialization | `ISerializerProvider` | `DefaultSerializerProvider` |
| Catalog codec | `ICatalogStore` | `CatalogStore` |
| Composition root | `IStorageEngineFactory` | `DefaultStorageEngineFactory` |
Future Plugin Support
- Expose WAL replacement through `StorageEngineOptionsBuilder` instead of requiring a custom `IStorageEngineFactory`
- Introduce a higher-level abstraction around pager/WAL orchestration for alternate journaling schemes
- Consider whether page layout and B+tree algorithms should ever be swappable, or remain intentionally concrete
Tradeoffs & Rationale
| Decision | Benefit | Tradeoff |
|---|---|---|
| Whole-page WAL frames | Implementation clarity, recovery simplicity | Larger write amplification for small updates |
| Single-writer lock | Predictable conflict/consistency handling | Write throughput bottleneck under heavy concurrent writes |
| Snapshot readers via copied map | Cheap creation, easy to reason about | Copy cost grows with WAL page map size |
| Delete without rebalance | Simpler implementation, fewer edge-case bugs | Potential space inefficiency over time |
| Hash-based catalog keys | Deterministic and lightweight | Theoretical collision risk |
| Static utility serializers | Low overhead, fast | Reduced runtime configurability |
Contributor Quick Guide
A) Add a New Storage Backend
- Implement `IStorageDevice`
- Ensure read semantics match expectations (especially zero-fill-beyond-EOF behavior)
- Validate with WAL + pager tests (transaction commit/rollback/recovery)
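A minimal in-memory backend illustrates the shape of such a device. The `ISimpleStorageDevice` interface below is a simplified stand-in, not the real `IStorageDevice` signature; match the repository's interface when implementing for real:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

var device = new MemoryStorageDevice();
await device.WriteAsync(0, new byte[] { 1, 2, 3 });
var buf = new byte[5];
await device.ReadAsync(0, buf); // bytes 3-4 are beyond EOF → zero-filled
Console.WriteLine(string.Join(",", buf)); // 1,2,3,0,0

// Simplified stand-in interface: async random reads/writes only.
interface ISimpleStorageDevice
{
    ValueTask ReadAsync(long offset, Memory<byte> buffer);
    ValueTask WriteAsync(long offset, ReadOnlyMemory<byte> buffer);
}

class MemoryStorageDevice : ISimpleStorageDevice
{
    private readonly List<byte> _data = new();

    public ValueTask ReadAsync(long offset, Memory<byte> buffer)
    {
        var span = buffer.Span;
        for (int i = 0; i < span.Length; i++)
        {
            long pos = offset + i;
            // Key semantic: reads past EOF zero-fill rather than throw.
            span[i] = pos < _data.Count ? _data[(int)pos] : (byte)0;
        }
        return ValueTask.CompletedTask;
    }

    public ValueTask WriteAsync(long offset, ReadOnlyMemory<byte> buffer)
    {
        while (_data.Count < offset + buffer.Length) _data.Add(0); // grow "file"
        var span = buffer.Span;
        for (int i = 0; i < span.Length; i++) _data[(int)offset + i] = span[i];
        return ValueTask.CompletedTask;
    }
}
```

An in-memory device like this is also convenient for running the WAL and pager test suites without touching disk.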
B) Add Row Encoding Optimizations
- Extend `RecordEncoder` with focused decode helpers for execution hot paths
- Keep the wire format backward-compatible unless a versioning header strategy is introduced
- Add tests for mixed type decoding and edge cases (missing columns, null semantics)
C) Improve BTree Behavior
- Add rebalance/merge logic for delete underflow
- Preserve cursor and lookup correctness with split/merge interactions
- Ensure root-page updates are persisted through `SchemaCatalog.PersistRootPageChangesAsync`
D) Pluggable Index Providers (Larger Refactor)
- Define an interface for index operations used by planner/catalog
- Provide an adapter for the existing `BTree` implementation
- Migrate factory points in engine/database construction
E) Concurrency Changes
- Be cautious around writer/checkpoint/reader interactions
- Any checkpoint policy changes must preserve snapshot guarantees
- Re-run WAL recovery and concurrent reader tests after modifications
Practical Mental Model
A useful way to reason about this storage engine:
BTree → "Where is this key/value logically?"
Pager → "Which page image is authoritative right now?"
WAL → "What is durable and recoverable after a crash?"
SchemaCatalog → "Which trees represent which database objects?"
Those four concerns are intentionally separated, and that separation is the main architectural strength of CSharpDB.Storage.