Benchmarks
Real performance numbers from BenchmarkDotNet running against the CSharpDB engine. All results are from the CSharpDB.Benchmarks project.
At a Glance
Point Lookups
Single-row lookups by primary key on a warm 10,000-row dataset. This is the most common database operation and the one where CSharpDB's B+tree leaf-hint cache shines.
Why are file-backed and in-memory reads identical?
On warm lookups, all pages are already cached in the IPageCache. The Pager resolves from cache without touching the storage device or WAL, so the I/O backend makes no difference. The B+tree leaf-hint cache often skips root-to-leaf traversal entirely for clustered access patterns, which is why lookups stay under 500 ns even at 10K rows.
The Collection API is ~2x faster than SQL because it bypasses SQL parsing, planning, and row decoding — it goes straight from key to B+tree to JSON deserialization.
Allocations: SQL lookup allocates 1.05 KB per call (QueryResult + DbValue array). Collection get allocates just 208 bytes (deserialized object only).
Cold Lookups (Under Cache Pressure)
What happens when your dataset does not fit in cache? This benchmark seeds 200,000 rows but limits the page cache to just 16 pages, forcing constant eviction and re-reads. Each iteration performs 256 randomized lookups.
Why the 15–18x difference?
With only 16 cached pages and 200K rows, almost every lookup is a cache miss. File-backed lookups must read 4 KB pages from the FileStorageDevice via RandomAccess — each page read involves a kernel syscall and disk I/O. In-memory lookups hit the MemoryStorageDevice which is a simple array copy, staying in user-space memory.
This benchmark uses PagerOptions.MaxCachedPages = 16 to simulate cache pressure. In production, set MaxCachedPages large enough to hold your hot working set and you will see warm-cache performance (sub-500 ns) for most lookups.
Allocations: Cold file-backed lookups allocate ~9.5 KB (page buffers for cache misses + result). In-memory cold lookups allocate ~8 KB (no page buffer copies needed).
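As a concrete sketch of the cache-sizing advice above — `PagerOptions.MaxCachedPages` is the option named on this page, while the surrounding construction (`Database.OpenFile`, the initializer shape) is an assumption and may differ from the actual CSharpDB API:

```csharp
// Hypothetical wiring; only MaxCachedPages is taken from this document.
// With 4 KB pages, 4,096 cached pages keeps roughly 16 MB of hot data
// resident, so lookups on the working set stay on the warm path (~460 ns)
// instead of the cold path (~32 us per file-backed cache miss).
var options = new PagerOptions
{
    // Size for the hot working set: hot rows / rows-per-page,
    // plus headroom for B+tree interior pages.
    MaxCachedPages = 4096,
};

using var db = Database.OpenFile("app.db", options); // assumed factory method
```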
Single Inserts
Single-row INSERT with auto-commit on a 10,000-row table. Each insert is a complete transaction: parse → plan → B+tree insert → WAL append → commit → flush.
| API | File-Backed | In-Memory |
|---|---|---|
| SQL `INSERT` | ~16,300 ops/s | ~349,000 ops/s |
| Collection put | ~17,900 ops/s | ~481,000 ops/s |
Why is file-backed 21–27x slower for writes?
Every auto-committed write must durably flush the WAL to disk before returning. This means at minimum one fsync per insert — the dominant cost is waiting for the NVMe write to complete (~50–60 μs on this hardware). In-memory mode skips the WAL entirely, so inserts are just B+tree mutations in RAM.
Optimization tip: If you need file-backed write throughput, batch multiple inserts into a single explicit transaction. The WAL flush cost is amortized across all rows in the batch (see Batch Inserts below).
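A sketch of that batching pattern. The transaction-API names here (`BeginTransactionAsync`, `ExecuteAsync`, `CommitAsync`) are illustrative assumptions, not taken from this page:

```csharp
// Amortize the per-commit WAL flush across many rows by wrapping the
// inserts in one explicit transaction. API names are illustrative.
await using var tx = await db.BeginTransactionAsync();
for (int i = 0; i < 100; i++)
{
    await tx.ExecuteAsync(
        "INSERT INTO items (id, value) VALUES (?, ?)",
        i, $"value-{i}");
}
await tx.CommitAsync(); // a single WAL flush covers all 100 rows
```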
Allocations: SQL insert allocates 2.63 KB (file) / 1.64 KB (in-memory). Collection put allocates 2.10 KB (file) / 1.14 KB (in-memory). The difference is WAL frame buffers.
Batch Inserts
Inserting 100 rows per transaction, repeated 256 times (25,600 total rows). Batching amortizes the WAL flush across many rows, dramatically improving file-backed throughput.
| Mode | Throughput |
|---|---|
| File-backed | ~223,000 rows/s |
| In-memory | ~584,000 rows/s |
The batching effect
Compare single inserts (16,300 ops/s file-backed) to batched inserts (223,000 rows/s file-backed) — that is a 13.7x throughput improvement just from batching 100 rows per transaction. The cost of one WAL flush (~60 μs) is now shared across 100 rows instead of paid per row.
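That 13.7x figure follows directly from the measured throughputs on this page:

```csharp
// Derived from the measured file-backed numbers above.
double singlePerRowUs  = 1_000_000.0 / 16_300;  // ~61 us per row (auto-commit)
double batchedPerRowUs = 1_000_000.0 / 223_000; // ~4.5 us per row (100 rows/tx)
double speedup = singlePerRowUs / batchedPerRowUs;

// The ~60 us WAL flush now contributes only ~0.6 us per row; the remaining
// ~3.9 us per row is SQL parsing and B+tree page work.
Console.WriteLine($"batching speedup: {speedup:F1}x"); // ~13.7x
```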
In-memory mode sees a smaller improvement (349K → 584K) because the baseline was already fast. The remaining overhead per row is SQL parsing and B+tree page splits. Use the PrepareInsertBatch API to skip repeated parsing for maximum throughput.
Allocations: ~10.9 MB total for 25,600 rows (~430 bytes per row including SQL strings, parsed AST, and B+tree operations).
File-Backed vs In-Memory: When Does It Matter?
The storage mode only affects performance when I/O is involved. Here is the complete picture:
| Operation | File-Backed | In-Memory | Ratio | Why |
|---|---|---|---|---|
| SQL point lookup (warm) | 460.7 ns | 460.0 ns | 1.0x | Both served from page cache |
| Collection get (warm) | 220.1 ns | 216.8 ns | 1.0x | Both served from page cache |
| SQL point lookup (cold) | 32.4 μs | 2.20 μs | 14.7x | File I/O on cache miss |
| Collection get (cold) | 32.9 μs | 1.86 μs | 17.7x | File I/O on cache miss |
| SQL single insert | 61.2 μs | 2.86 μs | 21.4x | WAL flush + fsync |
| Collection single put | 56.0 μs | 2.08 μs | 26.9x | WAL flush + fsync |
| Batch insert (100 rows/tx) | 223K rows/s | 584K rows/s | 2.6x | WAL flush amortized |
The takeaway
Reads are free — once pages are cached, file-backed and in-memory are identical. The page cache is the performance equalizer.
Writes pay for durability — file-backed writes are 21–27x slower because each commit requires flushing the WAL to disk. Batching reduces the gap to 2.6x. If you do not need crash recovery, use in-memory mode for write-heavy workloads.
Cache sizing matters — if your hot working set fits in MaxCachedPages, even file-backed reads stay at ~460 ns. Cold reads are 15–18x slower because each cache miss is a kernel I/O call.
Scan & Query Patterns
The ScanBenchmarks suite tests 10 query patterns at 1K, 10K, and 100K row scales. These measure the full SQL pipeline: parse → plan → execute → iterate.
| Pattern | What It Tests | Key Insight |
|---|---|---|
| `SELECT *` (full scan) | BTreeCursor leaf-walk + full RecordEncoder decode | Baseline throughput — measures raw sequential read speed |
| `WHERE value < X` (20% selectivity) | Full scan + row-level predicate filter | Filter overhead is minimal — most cost is page I/O, not evaluation |
| `WHERE value < X` (1% selectivity) | Full scan with high rejection rate | Still scans all pages; index scan would be faster at 1% |
| `ORDER BY value` | Full materialization + in-memory sort | Sort cost grows with row count; dominates at 100K rows |
| `ORDER BY (value + id)` | Expression evaluation per row during sort | Expression overhead is small vs sort/materialization |
| `ORDER BY value LIMIT 100` | Top-N optimization with heap-based partial sort | Much faster than full sort — only materializes top 100 |
| `SELECT * LIMIT 100` | Early termination after 100 rows | Constant time regardless of table size |
| `COUNT(*)` | Aggregate scan without row materialization | Faster than `SELECT *` — skips RecordEncoder decode |
| `GROUP BY category` | Hash-based grouping + COUNT + AVG aggregates | Aggregate cost grows linearly with input rows |
What to expect at scale
All scan benchmarks run at three dataset sizes (1K, 10K, 100K rows) to show how performance scales. Full scans are O(n) — they touch every leaf page sequentially. LIMIT queries are effectively O(1) because they stop early. ORDER BY + LIMIT is O(n) but with a constant factor much smaller than full sort because only the top-N rows are retained in memory.
The PointLookupBenchmarks suite additionally tests indexed vs non-indexed column lookups, projected columns, and miss-path performance at the same three scales.
Collection API Performance
The CollectionBenchmark macro suite runs 9 workloads for 15 seconds each, measuring sustained throughput and P50/P99 latencies.
| Workload | Dataset | What It Measures |
|---|---|---|
| Single Put (15s) | Growing | Sustained single-document insert throughput with auto-commit |
| Batch Put ×100 (15s) | Growing | Batched puts in explicit transactions — amortized commit cost |
| Point Get (15s) | 10K docs | Random key lookups on pre-seeded collection |
| 80/20 Mixed (15s) | 10K docs | Realistic read-heavy workload with concurrent writes |
| Full Scan (15s) | 1K docs | ScanAsync cursor iteration — measures deserialization throughput |
| Find ~20% Match (15s) | 1K docs | Filtered scan with FindByPathAsync — ~20% selectivity |
| Indexed Find (15s) | 10K docs | FindByIndexAsync on secondary index — O(log n) lookup |
| Put with Index (15s) | Growing | Insert cost with secondary index maintenance overhead |
| SQL vs Collection (10s) | 10K docs | Head-to-head: SELECT WHERE id = ? vs GetAsync(key) |
SQL vs Collection: which is faster?
For point operations, the Collection API is consistently ~2x faster than SQL (217 ns vs 461 ns for reads, 2.1 μs vs 2.9 μs for writes). The Collection API bypasses SQL parsing, AST construction, query planning, and row-level RecordEncoder decoding. It goes directly from key → B+tree lookup → JSON deserialization.
Use SQL when you need joins, aggregates, or complex predicates. Use the Collection API when you know the key and want maximum single-document throughput.
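A sketch of the two equivalent point reads being compared — `GetAsync` and the `SELECT WHERE id = ?` shape come from this page, while the rest of the call surface (`ExecuteQueryAsync`, `GetCollection`) is assumed:

```csharp
// SQL path: parse -> plan -> execute -> decode row (~461 ns warm).
var result = await db.ExecuteQueryAsync(
    "SELECT * FROM users WHERE id = ?", userId);

// Collection path: key -> B+tree lookup -> JSON deserialize (~217 ns warm).
// Skips parsing, planning, and RecordEncoder decoding entirely.
var users = db.GetCollection<User>("users"); // assumed accessor
var user = await users.GetAsync(userId);
```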
Mixed Workloads
The MixedWorkloadBenchmark and InMemoryWorkloadBenchmark suites simulate realistic application patterns: 80% reads / 20% writes running for 10–15 seconds on 10,000 pre-seeded rows.
| Workload | Mode | Reports |
|---|---|---|
| SQL 80/20 mix | File-backed | ops/sec, P50/P99 latency (reads and writes separately) |
| SQL 80/20 mix | In-memory | ops/sec, P50/P99 latency (reads and writes separately) |
| Collection 80/20 mix | File-backed | ops/sec, P50/P99 latency (reads and writes separately) |
| Collection 80/20 mix | In-memory | ops/sec, P50/P99 latency (reads and writes separately) |
What to expect in production
In an 80/20 mix, read latency stays near warm-cache levels (~460 ns SQL, ~220 ns Collection) because the hot working set fits in cache. Write latency is dominated by the WAL flush in file-backed mode (~60 μs) and is near-instant in in-memory mode (~2–3 μs).
The single-writer lock means writes are serialized, but reads proceed concurrently via snapshot isolation. Read throughput scales linearly with reader count (verified by the ReaderScalingBenchmark), while write throughput remains constant.
Methodology & Environment
| Detail | Value |
|---|---|
| Framework | BenchmarkDotNet v0.15.8 |
| Runtime | .NET 10.0.5 (SDK 10.0.201) |
| OS | Windows 11 (10.0.26300.8068) |
| CPU | 11th Gen Intel Core i9-11900K 3.50 GHz (8C/16T) |
| Storage | NVMe SSD |
| RAM | 32 GB |
Benchmark Categories
| Category | Suite | Description |
|---|---|---|
| Micro | BenchmarkDotNet `[Benchmark]` | Precise mean/error/stddev/allocated per operation. Uses warmup, pilot, and statistical analysis. |
| Macro | Custom MacroBenchmarkRunner | Duration-based (10–15s). Reports total ops, ops/sec, P50/P99 latency histograms. |
| Scaling | Custom | Tests behavior at increasing dataset sizes (BTree depth, row count scaling). |
| Stress | Custom | Crash recovery correctness, WAL growth under sustained writes. |
Running the Benchmarks
```shell
# Run all micro benchmarks (BenchmarkDotNet)
cd tests/CSharpDB.Benchmarks
dotnet run -c Release -- --filter "*"

# Run a specific benchmark suite
dotnet run -c Release -- --filter "*ColdLookupBenchmarks*"
dotnet run -c Release -- --filter "*InMemorySqlBenchmarks*"

# Run macro benchmarks (duration-based)
dotnet run -c Release -- --filter "*MixedWorkloadBenchmark*"
dotnet run -c Release -- --filter "*CollectionBenchmark*"
```
Results are written to BenchmarkDotNet.Artifacts/results/ in CSV, GitHub Markdown, and HTML formats. The full benchmark source is at tests/CSharpDB.Benchmarks.