how-to · schemas

Define schemas

A schema is a TOML manifest. One file declares the columns, the primary key, secondary indexes, graph relations, full-text fields, and vector fields for one table. The substrate hashes everything off the manifest — registering it is the prerequisite to any row write.

The catalog is itself stored as rows. Adding a column or extracting a new vector is an online migration, not downtime — see ops → migrations. Reference for vector / full-text / graph query syntax lives on vector, full-text, and graph.

A full example.

One manifest. One shop.products table with rows, two indexes, a supplier relation, a BM25 description field, and a 768-dimensional embedding extraction. Every section below this one is a piece of the same shape.

# manifest.toml — full surface in one declaration
id = "shop.products"

[primary_key]
columns = ["id"]

# ── Columns ─────────────────────────────────────────────────────────
[[columns]]
name = "id"           ; type = "str"
[[columns]]
name = "name"         ; type = "str"
[[columns]]
name = "supplier_id"  ; type = "str"
[[columns]]
name = "price"        ; type = "f64"
[[columns]]
name = "category"     ; type = "str"
[[columns]]
name = "description"  ; type = "str"
[[columns]]
name = "created_at"   ; type = "timestamp"

# ── Secondary indexes ───────────────────────────────────────────────
[[indexes]]
columns = ["category"]
[[indexes]]
columns = ["supplier_id", "created_at"]

# ── Graph relations ─────────────────────────────────────────────────
[[relations]]
name          = "supplied_by"
column        = "supplier_id"
target        = "shop.suppliers"
bidirectional = true                # default; reverse edge written atomically

# ── Full-text extractions ───────────────────────────────────────────
[[extractions.fts]]
field     = "description"
tokenizer = "unicode"               # unicode (UAX #29) | ascii
analyzer  = ["lowercase", "stem:english", "stop:english"]

# ── Vector extractions ──────────────────────────────────────────────
[[extractions.vector]]
field  = "description"
dim    = 768
metric = "cosine"                   # cosine | dot | l2

Columns & primary keys.

Each [[columns]] block declares one field. [primary_key] can list one column (single-column PK) or several (composite PK; lookups go through a Plan-tree query, not the typed /rows/:pk route).

type	description
str	UTF-8 string. The default for IDs, emails, names.
i64	Signed 64-bit integer.
f64	Double-precision float.
bool	Boolean.
ulid	ULID (lexicographically sortable). Recommended for time-ordered primary keys.
uuid	UUID v4.
timestamp	RFC3339 instant. Stored as UTC microseconds.
decimal	Arbitrary-precision decimal. Use for money.
json	Opaque JSON blob. Not indexable, not extractable.

Secondary indexes.

Hash-keyed lookups. Single-column or composite (in declared order). The index entry is written and retired in the same WAL frame as the row — never out of sync.

[[indexes]]
columns = ["status"]                # single-column hash index

[[indexes]]
columns = ["customer_id", "placed_at"]   # composite, in declared order

Index byte layout: idx|<schema>|<column>|<value_bytes>|<pk_bytes>. Range scans work via prefix iteration on (schema, column, value).

Graph relations.

Declared on the table that holds the foreign-key column. When a row writes that column, the substrate emits a forward edge and (with bidirectional = true, the default) a reverse edge — both atomic with the row. Walk them with neighbors / BFS / Dijkstra.

[[relations]]
name          = "supplied_by"        # the verb you'll walk in code
column        = "supplier_id"        # the FK column on this table
target        = "shop.suppliers"     # the target table
bidirectional = true                 # default; reverse edge written atomically

# Self-relations are fine — direction tags in the key resolve the collision:
[[relations]]
name          = "follows"
column        = "followee_id"
target        = "social.users"       # same table; self-loop is allowed
bidirectional = true

Full-text fields.

Each [[extractions.fts]] block makes one (table, field) BM25-searchable. The optional analyzer pipeline runs after tokenisation in declared order. Index-time and query-time use the same pipeline — never analyse one and not the other.

[[extractions.fts]]
field     = "body"
tokenizer = "unicode"                # unicode (UAX #29) | ascii
analyzer  = ["lowercase", "stem:english", "stop:english"]
                                     # any subset; order is the pipeline order

Tokenizers

unicode (default)

UAX #29 word segmentation + full Unicode lowercase. Handles Latin, Cyrillic, Greek, CJK, Arabic, Devanagari, Hebrew without per-language config.

ascii (fast-path)

Whitespace + punctuation split, ASCII lowercase, ~3× faster on pure-ASCII corpora. Falls back to unicode if it sees a non-ASCII byte.

Analyzer pipeline

lowercase Full Unicode case fold (handles Turkish dotted-i, German ß, etc).
fold_diacritics NFKD + drop combining marks. "café" matches "cafe".
stop:<lang> Per-locale stop-word elimination. 18 languages.
stem:<lang> Snowball stemmer. "running" / "ran" / "runs" → one token.

stemming & stop-words available for

ArabicDanishDutchEnglishFinnishFrenchGermanHungarianItalianNorwegianPortugueseRomanianRussianSpanishSwedishTamilTurkishHindi

Vector fields.

Each [[extractions.vector]] block declares one HNSW index. Per-table dimensionality is enforced — a write at the wrong dim is rejected at parse time.

[[extractions.vector]]
field  = "summary_embedding"
dim    = 1024                        # enforced at write time
metric = "cosine"                    # cosine | dot | l2

Index defaults: M=16, ef_construction=200. Each topk call picks fast or high_recall at request time — the index is built once, the search width tunes per call. See vector reference.

Register a manifest.

POST the raw TOML to the schemas route. The body is text/plain, not JSON.

curl -X POST "https://<tenant>.<region>.db.originchain.ai/v1/tenants/$T/schemas" \
  -H "Authorization: Bearer $OC_TOKEN" \
  -H "Content-Type: text/plain" \
  --data-binary @manifest.toml

Re-registering the same id with a higher version triggers an online migration. The server stays writable through backfill and cuts over atomically.

Schema evolution.

Manifest changes follow a strict contract: monotonic version int, one of four allowed shapes per migration (add column with default, drop column, add index, add extraction), a 10% backfill rate so live traffic stays prioritised, dual-read transform during backfill, atomic cutover, abort-only-pre-cutover.

The full failure model and cutover procedure live on ops → migrations. The HTTP migration endpoints are in the API reference.