Collation Guide

How text comparison and sorting work in CSharpDB today, when to use BINARY, NOCASE, NOCASE_AI, or ICU:<locale>, and how to apply collation in SQL and collection indexes.

Available now: CSharpDB supports opt-in BINARY, NOCASE, NOCASE_AI, and ICU:<locale> collation across SQL schema definitions, query expressions, metadata surfaces, ordered SQL text indexes, and collection path indexes.

Overview

CSharpDB already stores text as UTF-8, so data from any language can be stored and retrieved correctly. Collation is about something different: how text is compared, filtered, and sorted.

If you only need multilingual storage, you do not need to do anything special. If you need case-insensitive matching, case-insensitive ordering, or stable text lookup semantics across SQL and collection indexes, declare a collation explicitly.

What Collation Solves

Collation helps with text comparison behavior, not translation or UI localization.

  • Case-insensitive equality: make 'Alice', 'alice', and 'ALICE' compare as equal when you want that behavior.
  • Case-insensitive ordering: sort names or tags without uppercase values grouping separately from lowercase values.
  • Consistent index semantics: make indexed lookups and unique enforcement follow the same text rules as query execution.
  • Collection path behavior: let collection indexes on text fields or array elements use the same matching semantics as SQL.

Collation does not provide translation, localized UI text, or language-specific full-text tokenization.

Current Support

Surface Support
Text storage UTF-8 / full Unicode storage is supported by default
Default text behavior Ordinal / byte-order semantics unless a collation is declared
SQL DDL TEXT COLLATE ... in table definitions and COLLATE in index definitions
SQL queries Query-level COLLATE overrides for comparisons and ordering
SQL indexes Collation-aware equality and uniqueness behavior for text indexes
Collections API Optional collation on EnsureIndexAsync for text path indexes, including array-element paths
Metadata Visible through system catalogs, client models, REST, gRPC, Admin, and ADO.NET GetSchema()

Supported Collations

Name Use When Behavior
BINARY You want exact ordinal behavior Default byte/code-point comparison
NOCASE You want case-insensitive matching without locale-specific rules Invariant uppercase normalization for indexed and runtime compare paths
NOCASE_AI You want case-insensitive and accent-insensitive behavior Accent-stripped, invariant-case normalization for equality, ordering, and index semantics
ICU:<locale> You need locale-aware linguistic sorting rules Built-in locale-aware comparison and ordering using .NET globalization sort keys for the specified culture

SQL Usage

You can apply collation in three places: at the column definition, on an index expression, or directly inside a query.

Column-Level Default

Use this when a text column should normally behave case-insensitively.

CREATE TABLE users (
    id INTEGER PRIMARY KEY,
    name TEXT COLLATE NOCASE,
    email TEXT COLLATE NOCASE
);

With this schema, comparisons and ordering against name and email use NOCASE unless a query overrides them.

Index-Level Collation

Use this when only a specific index needs case-insensitive behavior.

CREATE INDEX idx_users_email_nocase
ON users (email COLLATE NOCASE);

This is useful when you want fast case-insensitive lookups but do not want to change the default semantics of every query that touches the column.

Query-Level Override

Use this when you want an explicit override in a single query.

SELECT *
FROM users
WHERE name = 'alice' COLLATE NOCASE;

SELECT *
FROM users
ORDER BY name COLLATE NOCASE;

Query-level collation is the right choice when the table defaults to BINARY but one report, search, or sort needs case-insensitive behavior.

Locale-Aware Example

Use ICU:<locale> when sorting or matching should follow a specific culture's linguistic rules.

CREATE TABLE products (
    id INTEGER PRIMARY KEY,
    name TEXT COLLATE ICU:sv-SE
);

SELECT *
FROM products
ORDER BY name COLLATE ICU:sv-SE;

Collection Usage

The collection API supports the same idea for text path indexes. Declare the collation when you create the index; lookups and range scans then use that index's rules.

Case-Insensitive Text Index

var users = await db.GetCollectionAsync<User>("users", ct);

await users.EnsureIndexAsync(x => x.Email, "NOCASE", ct);

await foreach (var match in users.FindByPathAsync("Email", "ALPHA@EXAMPLE.COM", ct))
{
    Console.WriteLine(match.Value.Email);
}

Case-Insensitive Array Element Index

var users = await db.GetCollectionAsync<UserWithTags>("users", ct);

await users.EnsureIndexAsync("$.tags[]", "NOCASE", ct);

await foreach (var match in users.FindByPathAsync("$.tags[]", "GREEN", ct))
{
    Console.WriteLine(match.Key);
}

If you create a collection index with one collation and later request the same path with a different collation, CSharpDB treats that as a mismatch. Recreate the index instead of expecting the engine to silently reinterpret existing index data.

Choosing a Collation

If you need... Use
Exact case-sensitive matching or exact ordinal ordering BINARY
Case-insensitive matching for names, emails, usernames, tags, or search fields NOCASE
Accent-insensitive behavior NOCASE_AI
Locale-aware linguistic ordering for a specific language or region ICU:<locale>

Performance

Collation is opt-in. If you do not use it, the default ordinal path keeps its current behavior and performance profile.

Area Default / BINARY NOCASE / NOCASE_AI ICU:<locale>
Reads without collation No extra overhead Not applicable Not applicable
Indexed equality lookup Ordinal key match Normalized key match Culture sort-key match
INSERT / UPDATE on collated index Raw text path Extra normalization work on write Extra sort-key generation work on write
Collection text index Ordinal text key Normalized text key Normalized culture sort-key text

The benchmark suite includes focused ordered-text collation coverage for equality lookups, range scans, top-N ordering, and indexed writes.

Current Limitations

  • Default behavior is still ordinal. If you never declare collation, text comparisons and ordering stay case-sensitive and byte-order based.
  • ICU behavior follows runtime globalization data. If you move a database across environments or upgrade globalization data, rebuild ICU-backed indexes with REINDEX if you need deterministic persisted ordering.
  • Index-size growth is workload-dependent. Collated keys, especially ICU sort keys, can be materially larger than raw ordinal text keys, and the public docs do not yet publish a stable size-growth table.
  • Changing collation means rebuilding index state. Existing index bytes were written with a specific interpretation and should not be silently reused under different rules.

FAQ

Does this give me multilingual support?

Partly. CSharpDB already stores multilingual text because it uses UTF-8. Collation adds better multilingual comparison and sorting behavior, especially for case-insensitive scenarios. It does not provide translation, localized UI strings, or language-aware full-text tokenization.

If I do not use COLLATE, is there a performance penalty?

No meaningful one is expected. Collation is opt-in, and the default ordinal path stays in place when you do not declare it.

Can I use locale-aware sorting now?

Yes. Use ICU:<locale> such as ICU:sv-SE or ICU:tr-TR when you want sorting and equality to follow a specific culture's rules.

Will NOCASE change the stored text?

No. Your original text is still stored and returned as written. Collation only affects how values are compared, sorted, and indexed.

Should I use column-level or query-level collation?

Use column-level collation when a field should usually behave case-insensitively. Use query-level collation when most queries should stay ordinal but one report or lookup needs an override.

See Also