Storage Engine Tutorial
A hands-on walkthrough of every layer in the CSharpDB storage stack — from raw file I/O up to a complete folder/file storage system.
Examples use the CSharpDB.Storage NuGet package (plus CSharpDB.Engine for the higher-level sections).
Architecture Overview
The storage engine is organized in layers. Each layer depends only on the one below it, making individual components testable and replaceable.
Additional read-path optimizations layered on top:
- Memory-mapped reads for clean main-file pages when the storage device supports it
- Speculative B+tree leaf read-ahead during sequential forward scans
- Checkpoint residency preservation so already-owned pages stay hot in lazy-resident hybrid mode
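These optimizations are opt-in; a possible configuration sketch follows, with the flag names taken from the Key Design Notes table at the end of this document. Only UseMemoryMappedReads is explicitly documented there as a PagerOptions member, so assuming all three live on PagerOptions is a guess.

```csharp
// Sketch only: enabling all three read-path optimizations. Flag names come
// from the Key Design Notes table; their exact home on PagerOptions is an
// assumption, not confirmed API surface.
var options = new PagerOptions
{
    UseMemoryMappedReads = true,           // mmap clean main-file pages when supported
    EnableSequentialLeafReadAhead = true,  // prefetch the next B+tree leaf on forward scans
    PreserveOwnedPagesOnCheckpoint = true  // keep hybrid-owned pages resident across checkpoints
};
```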
Page Layout
Page 0 (File Header):
[Magic: 4 bytes "CSDB"]
[FormatVersion: 4 bytes]
[PageSize: 4 bytes = 4096]
[PageCount: 4 bytes]
[SchemaRootPage: 4 bytes]
[FreelistHead: 4 bytes]
[ChangeCounter: 4 bytes]
[... reserved to 100 bytes ...]
[Slotted page content: 3996 bytes]
Pages 1+:
[SlottedPage: 4096 bytes]
[Header: 9 bytes]
PageType (1) ─ CellCount (2) ─ CellContentStart (2) ─ RightChild/NextLeaf (4)
[CellPointers: 2 bytes each]
[Free space]
[Cells: growing backward from page end]
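The fixed header fields of page 0 can be read straight off the buffer with BinaryPrimitives. A minimal sketch, assuming little-endian on-disk integers (the diagram above does not state byte order):

```csharp
using System;
using System.Buffers.Binary;
using System.Text;

// Build a fake page-0 header and read it back. Field offsets follow the
// diagram above (Magic at 0, FormatVersion at 4, PageSize at 8, PageCount
// at 12); little-endian encoding is an assumption here.
byte[] page0 = new byte[4096];
"CSDB"u8.CopyTo(page0);                                          // Magic
BinaryPrimitives.WriteInt32LittleEndian(page0.AsSpan(4), 1);     // FormatVersion
BinaryPrimitives.WriteInt32LittleEndian(page0.AsSpan(8), 4096);  // PageSize
BinaryPrimitives.WriteInt32LittleEndian(page0.AsSpan(12), 8);    // PageCount

string magic = Encoding.ASCII.GetString(page0, 0, 4);
int pageSize = BinaryPrimitives.ReadInt32LittleEndian(page0.AsSpan(8));
Console.WriteLine($"{magic}, pageSize={pageSize}"); // CSDB, pageSize=4096
```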
FileStorageDevice
FileStorageDevice wraps a SafeFileHandle opened with FileOptions.Asynchronous | FileOptions.RandomAccess, providing:
- True async I/O via RandomAccess.ReadAsync/WriteAsync
- Position-independent reads and writes — no seek, no shared file pointer
- Concurrent reads from other processes (FileShare.Read)
- Direct fsync to durable storage via RandomAccess.FlushToDisk
public FileStorageDevice(string filePath, bool createNew = false)
| Parameter | Description |
|---|---|
| filePath | Path to the database file. |
| createNew | true → FileMode.CreateNew (fails if file exists). false → OpenOrCreate. |
IStorageDevice Interface
All storage operations go through IStorageDevice, making it easy to swap implementations for in-memory testing.
public interface IStorageDevice : IAsyncDisposable, IDisposable
{
long Length { get; }
ValueTask<int> ReadAsync(long offset, Memory<byte> buffer, CancellationToken ct = default);
ValueTask WriteAsync(long offset, ReadOnlyMemory<byte> buffer, CancellationToken ct = default);
ValueTask FlushAsync(CancellationToken ct = default);
ValueTask SetLengthAsync(long length, CancellationToken ct = default);
}
Device Scenarios
1. Create a New File
await using var device = new FileStorageDevice("mydb.cdb", createNew: true);
Console.WriteLine($"File created. Length: {device.Length}"); // 0
2. Open an Existing File
await using var device = new FileStorageDevice("mydb.cdb");
Console.WriteLine($"Opened. Length: {device.Length}");
3. Write Raw Bytes
Writes are position-independent. Multiple writes at different offsets can be issued concurrently without locking.
await using var device = new FileStorageDevice("mydb.cdb");
byte[] payload = "Hello, CSharpDB!"u8.ToArray();
await device.WriteAsync(offset: 0, payload);
4. Read Raw Bytes
ReadAsync loops internally until the buffer is fully filled or EOF is reached.
await using var device = new FileStorageDevice("mydb.cdb");
var buffer = new byte[16];
int bytesRead = await device.ReadAsync(offset: 0, buffer);
Console.WriteLine($"Read {bytesRead} byte(s)");
5. Zero-Fill on Short Reads
Reading past EOF zero-fills the remainder of the buffer — useful for treating uninitialized pages as zeroed memory.
await using var device = new FileStorageDevice("mydb.cdb");
var buffer = new byte[4096];
int bytesRead = await device.ReadAsync(offset: 0, buffer);
Console.WriteLine($"Bytes on disk: {bytesRead}");
Console.WriteLine($"Remainder is zeros: {buffer[16] == 0}");
6. Pre-allocate / Extend File
Pre-allocating avoids fragmentation and is required before writing pages beyond the current end-of-file on some file systems.
await using var device = new FileStorageDevice("mydb.cdb", createNew: true);
const int PageSize = 4096;
await device.SetLengthAsync(PageSize * 8);
Console.WriteLine($"Pre-allocated: {device.Length} bytes"); // 32768
7. Flush to Disk (fsync)
FlushAsync calls RandomAccess.FlushToDisk, issuing a full fsync. Call this after committing a transaction to guarantee durability.
await using var device = new FileStorageDevice("mydb.cdb");
byte[] data = new byte[4096];
await device.WriteAsync(offset: 0, data);
await device.FlushAsync(); // durable on disk after this returns
8. Writing Fixed-Size Pages (4 KB)
The storage engine works in 4096-byte pages. Write a page at a computed offset.
await using var device = new FileStorageDevice("mydb.cdb", createNew: true);
const int PageSize = 4096;
await device.SetLengthAsync(PageSize * 4);
byte[] page = new byte[PageSize];
System.Text.Encoding.UTF8.GetBytes("page-0 data").CopyTo(page.AsSpan());
uint pageId = 0;
await device.WriteAsync(offset: (long)pageId * PageSize, page);
9. Cancellation Support
All async methods accept a CancellationToken for clean abort of long I/O operations.
using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(5));
await using var device = new FileStorageDevice("mydb.cdb");
var buffer = new byte[4096];
try
{
await device.ReadAsync(offset: 0, buffer, cts.Token);
}
catch (OperationCanceledException)
{
Console.WriteLine("Read was cancelled.");
}
10. Testability via IStorageDevice
Program against IStorageDevice so you can substitute an in-memory implementation in tests.
public sealed class MemoryStorageDevice : IStorageDevice
{
private byte[] _data = [];
public long Length => _data.Length;
public ValueTask<int> ReadAsync(long offset, Memory<byte> buffer, CancellationToken ct = default)
{
    int available = (int)Math.Max(0, _data.Length - offset);
    int toCopy = Math.Min(buffer.Length, available);
    if (toCopy > 0)                                  // guard: an offset at/past EOF copies nothing
        _data.AsMemory((int)offset, toCopy).CopyTo(buffer);
    buffer[toCopy..].Span.Clear();                   // zero-fill remainder, matching FileStorageDevice
    return ValueTask.FromResult(toCopy);
}
public ValueTask WriteAsync(long offset, ReadOnlyMemory<byte> buffer, CancellationToken ct = default)
{
long needed = offset + buffer.Length;
if (needed > _data.Length) Array.Resize(ref _data, (int)needed);
buffer.CopyTo(_data.AsMemory((int)offset));
return ValueTask.CompletedTask;
}
public ValueTask FlushAsync(CancellationToken ct = default) => ValueTask.CompletedTask;
public ValueTask SetLengthAsync(long length, CancellationToken ct = default)
{
Array.Resize(ref _data, (int)length);
return ValueTask.CompletedTask;
}
public ValueTask DisposeAsync() => ValueTask.CompletedTask;
public void Dispose() { }
}
Pager
The Pager sits between the B+tree layer and the storage device. It owns the page cache, tracks dirty pages, coordinates transactions, manages WAL integration, and drives checkpointing.
P1. Create a New Database
await using var device = new FileStorageDevice("mydb.cdb", createNew: true);
var walIndex = new WalIndex();
await using var wal = new WriteAheadLog("mydb.cdb", walIndex);
await wal.OpenAsync(currentDbPageCount: 0);
var pager = await Pager.CreateAsync(device, wal, walIndex);
await pager.InitializeNewDatabaseAsync(); // writes file header (page 0)
Console.WriteLine($"Pages: {pager.PageCount}"); // 1
P2. Open and Recover
On startup, call RecoverAsync to redo any committed WAL frames that were not yet checkpointed.
await using var device = new FileStorageDevice("mydb.cdb");
var walIndex = new WalIndex();
await using var wal = new WriteAheadLog("mydb.cdb", walIndex);
var pager = await Pager.CreateAsync(device, wal, walIndex);
await pager.RecoverAsync(); // replays committed WAL frames
P3. Read and Write Pages
// Read a page (checks cache -> WAL -> device)
byte[] page = await pager.GetPageAsync(pageId: 1);
// Modify in-place, then mark dirty
page[0] = 0xFF;
await pager.MarkDirtyAsync(pageId: 1); // tracked for WAL write on commit
P4. Allocate and Free Pages
// Allocate (extends file or reuses from freelist)
uint newPageId = await pager.AllocatePageAsync();
// Free (adds to freelist for reuse)
await pager.FreePageAsync(newPageId);
P5. Transaction Lifecycle
Single writer per database. Reads do not require transactions.
await pager.BeginTransactionAsync();
try
{
// ... modify pages via B+tree ...
await pager.CommitAsync(); // writes dirty pages to WAL, fsync
}
catch
{
await pager.RollbackAsync(); // discards uncommitted WAL frames
throw;
}
P6. Snapshot Isolation
Multiple readers run concurrently with the writer. Each reader sees a consistent point-in-time snapshot.
WalSnapshot snapshot = pager.AcquireReaderSnapshot();
Pager snapshotPager = pager.CreateSnapshotReader(snapshot);
byte[] page = await snapshotPager.GetPageAsync(pageId: 1);
pager.ReleaseReaderSnapshot();
P7. Checkpoint Policies
var options = new PagerOptions
{
CheckpointPolicy = new AnyCheckpointPolicy(
new FrameCountCheckpointPolicy(threshold: 500),
new TimeIntervalCheckpointPolicy(TimeSpan.FromMinutes(5))
),
AutoCheckpointExecutionMode = AutoCheckpointExecutionMode.Background,
AutoCheckpointMaxPagesPerStep = 64
};
| Policy | Triggers When |
|---|---|
| FrameCountCheckpointPolicy(n) | Committed frame count exceeds n |
| WalSizeCheckpointPolicy(bytes) | Estimated WAL size exceeds bytes |
| TimeIntervalCheckpointPolicy(span) | Elapsed time since last checkpoint exceeds span |
| AnyCheckpointPolicy(...) | Any sub-policy triggers |
B+Tree
B+tree keyed by signed 64-bit long keys. Leaf pages store (key, payload) pairs; interior pages store routing keys and child pointers. Supports forward-only cursor iteration and cache-only fast paths.
B1. Create a New B+Tree
uint rootPageId = await BTree.CreateNewAsync(pager);
var tree = new BTree(pager, rootPageId);
B2. Insert a Key-Value Pair
Payload is raw bytes — the B+tree has no opinion on format. If the leaf is full, the tree automatically splits.
byte[] payload = System.Text.Encoding.UTF8.GetBytes("Hello, B+tree!");
await tree.InsertAsync(key: 42, payload);
B3. Point Lookup
byte[]? result = await tree.FindAsync(key: 42);
if (result is not null)
Console.WriteLine(System.Text.Encoding.UTF8.GetString(result));
B4. Cache-Only Fast Path
Avoids async I/O when all required pages are already cached.
if (tree.TryFindCached(key: 42, out byte[]? payload))
Console.WriteLine($"Cache hit: {payload is not null}");
else
payload = await tree.FindAsync(key: 42);
B5. Delete a Key
bool deleted = await tree.DeleteAsync(key: 42);
Console.WriteLine(deleted ? "Deleted." : "Key not found.");
B6. Forward Cursor Scan
Iterate all entries in key order. The cursor follows leaf-to-leaf next pointers without interior page I/O.
var cursor = tree.CreateCursor();
while (await cursor.MoveNextAsync())
{
long key = cursor.CurrentKey;
ReadOnlyMemory<byte> value = cursor.CurrentValue;
Console.WriteLine($"Key={key}, PayloadSize={value.Length}");
}
B7. Seek to a Key
Position the cursor at the first key >= target, then iterate forward.
var cursor = tree.CreateCursor();
if (await cursor.SeekAsync(targetKey: 100))
{
do
{
Console.WriteLine($"Key={cursor.CurrentKey}");
} while (await cursor.MoveNextAsync());
}
B8. Count Entries
long count = await tree.CountEntriesAsync();
Console.WriteLine($"Tree contains {count} entries.");
Write-Ahead Log (WAL)
Redo-style WAL for crash recovery and concurrent snapshot-isolated readers. Each commit writes dirty pages as frames to the WAL file. On checkpoint, committed frames are copied to the main database file.
WAL File Format:
[WAL Header: 32 bytes]
Magic ─ Version ─ PageSize ─ Checksum salt
[Frame 0: 4120 bytes]
[FrameHeader: 24 bytes] ─ PageId ─ DbPageCount ─ Checksum
[PageData: 4096 bytes]
[Frame 1: 4120 bytes]
...
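The fixed-size frames make locating a frame pure arithmetic: a 32-byte WAL header followed by 4120-byte frames (24-byte frame header plus a 4096-byte page image). A sketch of that arithmetic, with constants read off the diagram rather than the engine's API:

```csharp
using System;

// Frame placement implied by the WAL layout above.
const int WalHeaderSize = 32;
const int FrameHeaderSize = 24;
const int PageSize = 4096;
const int FrameSize = FrameHeaderSize + PageSize; // 4120

// Byte offset of frame i within the WAL file.
static long FrameOffset(long i) => WalHeaderSize + i * FrameSize;
// Byte offset of frame i's 4096-byte page image.
static long PageDataOffset(long i) => FrameOffset(i) + FrameHeaderSize;

Console.WriteLine(FrameOffset(0));    // 32
Console.WriteLine(PageDataOffset(1)); // 32 + 4120 + 24 = 4176
```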
W1. Open or Create a WAL
var walIndex = new WalIndex();
await using var wal = new WriteAheadLog("mydb.cdb", walIndex);
await wal.OpenAsync(currentDbPageCount: pager.PageCount);
W2. Write Transaction to WAL
wal.BeginTransaction();
await wal.AppendFrameAsync(pageId: 1, pageData);
await wal.AppendFrameAsync(pageId: 5, pageData);
await wal.CommitAsync(newDbPageCount: pager.PageCount);
// Or rollback: await wal.RollbackAsync();
W3. Reader Snapshot
WalSnapshot snapshot = walIndex.TakeSnapshot();
if (snapshot.TryGet(pageId: 1, out long walOffset))
{
byte[] page = await wal.ReadPageAsync(walOffset);
}
W4. Checkpoint
await wal.CheckpointAsync(device, pageCount: pager.PageCount);
walIndex.Reset();
// WAL is now empty; all data is in the main file
Slotted Page Layout
SlottedPage is a struct that overlays a byte[4096] buffer, providing structured access to variable-size cells within a fixed-size page.
[Header: 9 bytes]
PageType (1) ─ CellCount (2) ─ CellContentStart (2) ─ RightChild/NextLeaf (4)
[Cell Pointers: 2 bytes each, growing forward]
[Free Space]
[Cell Data: growing backward from page end]
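Free space is whatever lies between the forward-growing pointer array and the backward-growing cell data. A back-of-the-envelope sketch of that accounting, assuming only the 9-byte header and 2-byte-per-cell pointers consume the front of the page (the engine may reserve a few extra bytes, which would explain the "~4085" printed in example S1):

```csharp
using System;

// Approximate free-space accounting implied by the layout above.
const int PageSize = 4096;
const int HeaderSize = 9;
const int CellPointerSize = 2;

static int ApproxFreeSpace(int cellContentStart, int cellCount)
    => cellContentStart - (HeaderSize + cellCount * CellPointerSize);

// Fresh page: cell content starts at the page end, no cells yet.
Console.WriteLine(ApproxFreeSpace(PageSize, 0));     // 4087 by this formula
// Inserting a 4-byte cell costs 4 (data) + 2 (pointer):
Console.WriteLine(ApproxFreeSpace(PageSize - 4, 1)); // 4081
```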
S1. Initialize a Page
byte[] buffer = new byte[4096];
var sp = new SlottedPage(buffer, pageId: 1);
sp.Initialize(PageConstants.PageTypeLeaf);
Console.WriteLine($"Type: {sp.PageType}"); // Leaf
Console.WriteLine($"Cells: {sp.CellCount}"); // 0
Console.WriteLine($"Free: {sp.FreeSpace} bytes"); // ~4085
S2. Insert and Read Cells
byte[] cellData = new byte[] { 0x01, 0x02, 0x03, 0x04 };
bool inserted = sp.InsertCell(index: 0, cellData);
Span<byte> cell = sp.GetCell(index: 0);
Console.WriteLine($"Cell[0] length: {cell.Length}"); // 4
S3. Delete and Defragment
sp.DeleteCell(index: 0);
Console.WriteLine($"Cells after delete: {sp.CellCount}"); // 0
// After many inserts/deletes, free space may be fragmented
sp.Defragment(); // rewrites cells contiguously at end of page
Indexing
Secondary B+tree-backed indexes with optional caching and ordered range scan support.
public interface IIndexStore
{
uint RootPageId { get; }
ValueTask<byte[]?> FindAsync(long key, CancellationToken ct = default);
ValueTask InsertAsync(long key, ReadOnlyMemory<byte> payload, CancellationToken ct = default);
ValueTask<bool> DeleteAsync(long key, CancellationToken ct = default);
IIndexCursor CreateCursor(IndexScanRange range);
}
I1. Create an Index Store
uint indexRootPage = await BTree.CreateNewAsync(pager);
var indexTree = new BTree(pager, indexRootPage);
IIndexStore index = new BTreeIndexStore(indexTree);
I2. Insert and Lookup
// Insert: key = hashed column value, payload = rowid
byte[] rowIdPayload = BitConverter.GetBytes(42L);
await index.InsertAsync(key: hashOfColumnValue, rowIdPayload);
// Lookup
byte[]? result = await index.FindAsync(key: hashOfColumnValue);
if (result is not null)
{
long rowId = BitConverter.ToInt64(result);
Console.WriteLine($"Found rowid: {rowId}");
}
I3. Range Scan with Cursor
var range = new IndexScanRange(
LowerBound: 100, LowerInclusive: true,
UpperBound: 200, UpperInclusive: false);
var cursor = index.CreateCursor(range);
while (await cursor.MoveNextAsync())
Console.WriteLine($"IndexKey={cursor.CurrentKey}");
// Full scan / point lookup:
var fullCursor = index.CreateCursor(IndexScanRange.All);
var pointCursor = index.CreateCursor(IndexScanRange.At(42));
I4. Add Caching to an Index
IIndexStore cached = new CachingIndexStore(
inner: new BTreeIndexStore(indexTree),
capacity: 2048);
byte[]? result = await cached.FindAsync(key: 42);
Record Serialization
Compact binary encoding for database rows. Supports selective column projection and fast filter evaluation without materializing managed strings.
Binary Format:
[columnCount: varint]
[col0_typeTag: 1 byte] [col0_data: ...]
[col1_typeTag: 1 byte] [col1_data: ...]
Type Tags:
Null (0x00) -> no data
Integer (0x01) -> 8 bytes, little-endian long
Text (0x02) -> [length: varint] [UTF-8 bytes]
Real (0x03) -> 8 bytes, little-endian double (IEEE 754)
Blob (0x04) -> [length: varint] [raw bytes]
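To make the layout concrete, here is a hand-assembled encoding of the row (1, "Alice") following the byte format above. Varints are assumed to be LEB128-style, so any value below 128 (as all lengths here are) occupies a single byte:

```csharp
using System;
using System.Collections.Generic;
using System.Text;

var row = new List<byte> { 0x02 };        // columnCount = 2 (one-byte varint)
row.Add(0x01);                            // col0 type tag: Integer
row.AddRange(BitConverter.GetBytes(1L));  // 8-byte little-endian long (on LE hosts)
row.Add(0x02);                            // col1 type tag: Text
byte[] name = Encoding.UTF8.GetBytes("Alice");
row.Add((byte)name.Length);               // length varint = 5
row.AddRange(name);                       // UTF-8 payload
Console.WriteLine(row.Count);             // 1 + 1 + 8 + 1 + 1 + 5 = 17 bytes
```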
R1. Encode and Decode a Row
var values = new DbValue[]
{
DbValue.FromInteger(1),
DbValue.FromText("Alice"),
DbValue.FromInteger(30)
};
byte[] encoded = RecordEncoder.Encode(values);
DbValue[] decoded = RecordEncoder.Decode(encoded);
Console.WriteLine($"Id={decoded[0].AsInteger}, Name={decoded[1].AsText}, Age={decoded[2].AsInteger}");
R2. Selective Column Projection
Decode only the columns you need — avoids materializing unused fields.
// Decode only columns 0 and 1 (skip column 2)
DbValue[] partial = RecordEncoder.DecodeUpTo(encoded, maxColumnIndexInclusive: 1);
// Decode a single column by index
DbValue age = RecordEncoder.DecodeColumn(encoded, columnIndex: 2);
R3. Fast Filter Without Materialization
// Check if column 1 equals "Alice" without creating a string
byte[] expectedUtf8 = "Alice"u8.ToArray();
if (RecordEncoder.TryColumnTextEquals(encoded, columnIndex: 1, expectedUtf8, out bool equals))
Console.WriteLine($"Column 1 is Alice: {equals}");
// Check numeric column
if (RecordEncoder.TryDecodeNumericColumn(encoded, columnIndex: 2,
out long intValue, out double realValue, out bool isReal))
Console.WriteLine($"Age: {intValue}");
// Check for null
bool isNull = RecordEncoder.IsColumnNull(encoded, columnIndex: 0);
R4. Varint Encoding
Span<byte> buffer = stackalloc byte[10];
int bytesWritten = Varint.Write(buffer, 300UL);
ulong value = Varint.Read(buffer, out int bytesRead);
Console.WriteLine($"Value: {value}, Bytes: {bytesRead}"); // 300, 2
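A minimal codec reproducing the behavior shown above (300 round-trips through two bytes). Assuming the engine's Varint class uses the standard LEB128 scheme — 7 payload bits per byte, high bit as continuation flag — this hypothetical implementation matches it:

```csharp
using System;

static int WriteVarint(Span<byte> dest, ulong value)
{
    int i = 0;
    while (value >= 0x80)
    {
        dest[i++] = (byte)(value | 0x80); // low 7 bits + continuation bit
        value >>= 7;
    }
    dest[i++] = (byte)value;              // final byte, continuation bit clear
    return i;
}

static ulong ReadVarint(ReadOnlySpan<byte> src, out int bytesRead)
{
    ulong result = 0; int shift = 0; bytesRead = 0;
    byte b;
    do
    {
        b = src[bytesRead++];
        result |= (ulong)(b & 0x7F) << shift;
        shift += 7;
    } while ((b & 0x80) != 0);            // continuation bit set → more bytes
    return result;
}

Span<byte> buf = stackalloc byte[10];
int written = WriteVarint(buf, 300);
Console.WriteLine($"{written} bytes: 0x{buf[0]:X2} 0x{buf[1]:X2}"); // 2 bytes: 0xAC 0x02
```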
Schema Catalog
B+tree-backed metadata store for tables, indexes, views, and triggers with in-memory caching and schema version tracking.
C1. Initialize the Catalog
var catalog = await SchemaCatalog.CreateAsync(pager);
Console.WriteLine($"Schema version: {catalog.SchemaVersion}");
C2. Create and Query Tables
var schema = new TableSchema
{
TableName = "users",
Columns = new[]
{
new ColumnDefinition { Name = "id", Type = DbType.Integer, IsPrimaryKey = true },
new ColumnDefinition { Name = "name", Type = DbType.Text },
new ColumnDefinition { Name = "age", Type = DbType.Integer },
}
};
await catalog.CreateTableAsync(schema);
TableSchema? users = catalog.GetTable("users");
uint rootPage = catalog.GetTableRootPage("users");
BTree tableTree = catalog.GetTableTree("users");
C3. Create and Query Indexes
var indexSchema = new IndexSchema
{
IndexName = "idx_users_name",
TableName = "users",
Columns = new[] { "name" },
IsUnique = false,
};
await catalog.CreateIndexAsync(indexSchema);
IIndexStore indexStore = catalog.GetIndexStore("idx_users_name");
IReadOnlyList<IndexSchema> indexes = catalog.GetIndexesForTable("users");
C4. Views and Triggers
// Views
await catalog.CreateViewAsync("active_users", "SELECT * FROM users WHERE age > 18");
string? viewSql = catalog.GetViewSql("active_users");
// Triggers
var trigger = new TriggerSchema
{
TriggerName = "trg_users_audit",
TableName = "users",
Event = TriggerEvent.AfterInsert,
Body = "INSERT INTO audit_log (table_name, action) VALUES ('users', 'INSERT')",
};
await catalog.CreateTriggerAsync(trigger);
Folder & File Storage
Build a complete folder/file storage system on top of the storage engine using the higher-level Database + Collection<T> API from CSharpDB.Engine.
Domain Models
public record FolderEntry(
string Name,
string Path,
DateTime CreatedAt,
string? Description = null);
public record FileEntry(
string Name,
string FolderPath,
string Content,
string ContentType,
long SizeBytes,
DateTime CreatedAt,
DateTime UpdatedAt);
F1. Bootstrap the Storage
await using var db = await Database.OpenAsync("storage.cdb");
var folders = await db.GetCollectionAsync<FolderEntry>("folders");
var files = await db.GetCollectionAsync<FileEntry>("files");
For long-lived processes, use hybrid lazy-resident mode:
await using var db = await Database.OpenHybridAsync(
"storage.cdb",
new DatabaseOptions(),
new HybridDatabaseOptions
{
PersistenceMode = HybridPersistenceMode.IncrementalDurable,
HotCollectionNames = ["folders", "files"]
});
F2. Create a Folder
async Task CreateFolderAsync(string path, string? description = null)
{
var entry = new FolderEntry(
Name: Path.GetFileName(path.TrimEnd('/')),
Path: path,
CreatedAt: DateTime.UtcNow,
Description: description);
await folders.PutAsync(path, entry);
}
await CreateFolderAsync("/documents");
await CreateFolderAsync("/documents/reports", description: "Monthly reports");
F3. Create a File
async Task CreateFileAsync(string folderPath, string fileName,
string content, string contentType = "text/plain")
{
string key = $"{folderPath}/{fileName}";
var entry = new FileEntry(
Name: fileName, FolderPath: folderPath, Content: content,
ContentType: contentType,
SizeBytes: System.Text.Encoding.UTF8.GetByteCount(content),
CreatedAt: DateTime.UtcNow, UpdatedAt: DateTime.UtcNow);
await files.PutAsync(key, entry);
}
await CreateFileAsync("/documents/reports", "q1.txt", "Q1 earnings: $1.2M");
F4. Read a File
FileEntry? file = await files.GetAsync("/documents/reports/q1.txt");
if (file is not null)
Console.WriteLine($"{file.Name}: {file.Content}");
F5. List Files in a Folder
await foreach (var kvp in files.FindAsync(f => f.FolderPath == "/documents/reports"))
Console.WriteLine($" {kvp.Value.Name} ({kvp.Value.SizeBytes} bytes)");
F6. Update File Content
PutAsync is an upsert — it replaces the document at the key if it already exists.
FileEntry? existing = await files.GetAsync("/documents/reports/q1.txt");
if (existing is not null)
{
var updated = existing with
{
Content = "Q1 earnings: $1.4M (revised)",
SizeBytes = System.Text.Encoding.UTF8.GetByteCount("Q1 earnings: $1.4M (revised)"),
UpdatedAt = DateTime.UtcNow
};
await files.PutAsync("/documents/reports/q1.txt", updated);
}
F7. Delete a Folder and Contents
var toDelete = new List<string>();
await foreach (var kvp in files.FindAsync(f =>
f.FolderPath.StartsWith("/documents/reports", StringComparison.Ordinal)))
toDelete.Add(kvp.Key);
await db.BeginTransactionAsync();
try
{
foreach (var key in toDelete) await files.DeleteAsync(key);
await folders.DeleteAsync("/documents/reports");
await db.CommitAsync();
}
catch { await db.RollbackAsync(); throw; }
F8. SQL-Based Approach
If you prefer a relational model, create tables with SQL and use ExecuteAsync.
await using var db = await Database.OpenAsync("storage.cdb");
await db.ExecuteAsync(@"
CREATE TABLE IF NOT EXISTS folders (
id INTEGER PRIMARY KEY,
path TEXT NOT NULL,
name TEXT NOT NULL,
created_at TEXT NOT NULL
)");
await db.ExecuteAsync(@"
CREATE TABLE IF NOT EXISTS files (
id INTEGER PRIMARY KEY,
folder_path TEXT NOT NULL,
name TEXT NOT NULL,
content TEXT NOT NULL,
content_type TEXT NOT NULL,
size_bytes INTEGER NOT NULL,
created_at TEXT NOT NULL,
updated_at TEXT NOT NULL
)");
var result = await db.ExecuteAsync(
"SELECT name, size_bytes FROM files WHERE folder_path = '/documents'");
foreach (var row in result.Rows)
Console.WriteLine($"{row[0]} ({row[1]} bytes)");
Key Design Notes
| Concern | Detail |
|---|---|
| No shared file pointer | RandomAccess APIs are stateless w.r.t. position, so concurrent reads at different offsets are safe without locking. |
| Async-first | All I/O is issued via RandomAccess.ReadAsync / WriteAsync, keeping the storage graph on the platform async file-I/O path. |
| Zero-fill on short reads | Pages beyond EOF are returned as zeros, matching the uninitialized page convention used by the Pager. |
| fsync on flush | FlushAsync maps to FlushFileBuffers (Windows) or fsync (Linux/macOS), guaranteeing crash durability. |
| 4 KB page size | All pages are PageConstants.PageSize (4096 bytes). Page 0 reserves 100 bytes for the file header. |
| Single writer, multiple readers | TransactionCoordinator enforces a single writer via SemaphoreSlim. Readers use WAL snapshots for isolation. |
| Memory-mapped reads | Optional via PagerOptions.UseMemoryMappedReads for clean main-file pages when the storage device supports it. |
| Sequential scan read-ahead | Speculative next-leaf prefetch during forward scans via EnableSequentialLeafReadAhead. |
| Checkpoint residency | With PreserveOwnedPagesOnCheckpoint, owned pages stay resident across checkpoint for the hybrid engine. |
| B+tree leaf linking | Leaf pages are linked via RightChildOrNextLeaf pointers for efficient cursor scans without interior I/O. |
| Pluggable checkpoint policies | ICheckpointPolicy allows frame-count, WAL-size, time-interval, or custom composite triggers. |
| Schema versioning | SchemaCatalog.SchemaVersion increments on every DDL operation, enabling cache invalidation in upper layers. |