OriginChain docs

Schema for Full-text.

FTS indexes are NOT declared in the schema TOML. Index a (table, field) pair with one POST; query it with one GET. Synonyms, stopwords, lemmatizers, facets, and highlights are all configurable per pair at runtime.

Engine surface: POST /v1/tenants/:t/fts/:table/:field family — plain text, JSON-aware, doc store, synonyms, stopwords. GET /v1/tenants/:t/fts/:table/:field — boolean / bm25 / phrase search.
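
The URL family above can be assembled mechanically. The following is a minimal shell sketch, assuming the same BASE and T variables that the curl examples on this page use. The function name fts_endpoint is hypothetical and not part of the product.

```shell
# Hypothetical helper: print the FTS endpoint URL for a (table, field) pair.
#   $1 = table bucket, $2 = field bucket
#   $3 = optional sub-endpoint: json | doc | synonyms | stopwords
# BASE and T are assumed to be set by the caller, as in the curl examples.
fts_endpoint() {
  _suffix=${3:-}
  case $_suffix in
    ""|json|doc|synonyms|stopwords) ;;
    *) echo "unknown FTS sub-endpoint: $_suffix" >&2; return 1 ;;
  esac
  # Append "/<suffix>" only when a sub-endpoint was requested.
  printf '%s\n' "${BASE}/v1/tenants/${T}/fts/$1/$2${_suffix:+/$_suffix}"
}
```

With this in place, `curl -X POST "$(fts_endpoint shop_products description doc)" ...` targets the doc-store endpoint, and the bare two-argument form gives the URL used for both plain-text indexing and GET search.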

Required schema fields.

Without these, this query surface doesn't function at all.

field effect
(none) FTS indexes live entirely at runtime. The :table and :field on the URL are logical buckets; they do NOT need to correspond to a registered schema or column.

Optional fields — what each one unlocks.

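Add only the fields whose effect you need. Each one buys a specific capability: speeding up a predicate, guarding a write, or unlocking a new query shape. In the table, /fts/:t/:f is shorthand for /fts/:table/:field under the /v1/tenants/:t prefix, so :t in the short form is the table, not the tenant.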

field | type | default | effect
POST /fts/:t/:f, body { doc_id, text } | object | - | Index plain text. BM25 inverted index built lazily on first call.
POST /fts/:t/:f/json, body { doc_id, json, paths } | object | - | JSON-aware: walks dotted paths and indexes string leaves. Omit paths to index every string leaf.
POST /fts/:t/:f/doc, body { doc_id, text, facets } | object | - | Store doc text + per-facet values. REQUIRED for highlight=true and facets= query params.
POST /fts/:t/:f/synonyms, body { synonyms: {...} } | object | - | Per (table, field) synonym class map. Both index and query treat each class as equivalent. Re-install replaces in full.
POST /fts/:t/:f/stopwords, body { stopwords: [...] } | object | - | Per (table, field) drop list. Applied at both index and query time. Re-install replaces in full.
GET ?q= (query string) | string | - | Query string. Whitespace-split; each token runs through the analyzer pipeline.
GET ?mode= | string | boolean | One of boolean, bm25, phrase. Boolean = AND-of-terms. BM25 = ranked. Phrase = in-order match.
GET ?k= | int | 10 | Top-K cap for bm25 mode. Ignored in boolean / phrase.
GET ?fuzzy=0..3 | int | 0 | Edit-distance budget per term in bm25 mode. Catches typos. Capped at MAX_EDIT_DISTANCE = 3.
GET ?highlight=true | bool | false | Return per-hit snippet highlights. Requires the doc text stored via /doc endpoint. BM25 mode only.
GET ?facets=csv | string | - | Comma-separated facet field names. Returns aggregated counts alongside hits.
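
The GET parameters in the table interact: k and fuzzy only matter in bm25 mode, and fuzzy is capped at 3. The sketch below encodes those rules in a small shell function. The name fts_search_url and its argument order are hypothetical; BASE and T are the same placeholders used in the curl examples. It covers only q, mode, k, and fuzzy, and its query encoding is deliberately minimal (spaces become `+`; other reserved characters are not percent-encoded).

```shell
# Hypothetical helper: print the search URL for a (table, field) pair.
# Usage: fts_search_url TABLE FIELD QUERY [MODE] [K] [FUZZY]
# Defaults mirror the table: mode=boolean, k=10, fuzzy=0.
fts_search_url() {
  _table=$1; _field=$2; _q=$3
  _mode=${4:-boolean}; _k=${5:-10}; _fuzzy=${6:-0}

  # Reject values the table does not list.
  case $_mode in
    boolean|bm25|phrase) ;;
    *) echo "unknown mode: $_mode" >&2; return 1 ;;
  esac
  case $_fuzzy in
    0|1|2|3) ;;
    *) echo "fuzzy must be 0..3" >&2; return 1 ;;
  esac

  # Minimal encoding only: turn spaces into '+', as the curl examples do.
  _q=$(printf '%s' "$_q" | tr ' ' '+')

  _url="${BASE}/v1/tenants/${T}/fts/${_table}/${_field}?q=${_q}&mode=${_mode}"
  # k and fuzzy are only sent in bm25 mode, where they have an effect.
  if [ "$_mode" = bm25 ]; then
    _url="${_url}&k=${_k}"
    if [ "$_fuzzy" -gt 0 ]; then
      _url="${_url}&fuzzy=${_fuzzy}"
    fi
  fi
  printf '%s\n' "$_url"
}
```

A call such as `curl "$(fts_search_url shop_products description 'wirless' bm25 5 1)" -H "Authorization: Bearer $BEARER"` would then issue a fuzzy ranked search without hand-assembling the query string.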

What you can call (no schema knob needed).

  • POST /fts/:t/:f — index plain text
  • POST /fts/:t/:f/json — JSON-aware index walks dotted paths
  • POST /fts/:t/:f/doc — store text + facets (required for highlight + facet aggregation)
  • POST /fts/:t/:f/synonyms — install per-(table, field) synonym map
  • POST /fts/:t/:f/stopwords — install per-(table, field) stopword list
  • GET /fts/:t/:f?q=… — boolean / bm25 / phrase search with optional fuzzy + highlight + facets

Abbreviation legend.

token meaning
BM25 Okapi BM25 — the ranking function used in mode=bm25. Same scoring family as Lucene / Elasticsearch
doc_id Unique identifier per indexed document (within a table, field pair)
facet Categorical attribute stored alongside the doc text for aggregate counts
highlight Per-hit snippet of the matching doc text with the query terms wrapped in `<em>` tags
fuzzy Edit-distance tolerance — fuzzy=1 catches single typos, fuzzy=2 catches two-char edits, etc
tokenizer unicode (UAX #29 default, multilingual) | ascii (fast-path for pure-ASCII corpora)
stemmer Snowball stemmer reducing word forms to a root token. 18 languages supported
stopwords Tokens dropped at both index and query time (e.g. 'the', 'a', 'and')
synonyms Class-based equivalence — every member of a class scores against every query that matches any other member
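
For readers who want the shape of the ranking function named in the legend, the textbook Okapi BM25 score is shown below, with the Lucene-style IDF. This is the standard formulation, not a statement of this engine's internals: the values of $k_1$ and $b$, and the exact IDF variant used in mode=bm25, are not specified on this page.

```latex
\mathrm{score}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i)\,
  \frac{f(q_i, D)\,(k_1 + 1)}
       {f(q_i, D) + k_1\left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)},
\qquad
\mathrm{IDF}(q_i) = \ln\!\left(\frac{N - n(q_i) + 0.5}{n(q_i) + 0.5} + 1\right)
```

Here $f(q_i, D)$ is the frequency of query term $q_i$ in document $D$, $|D|$ is the document length in tokens, $\mathrm{avgdl}$ is the average document length in the index, $N$ is the number of indexed documents, and $n(q_i)$ is the number of documents containing $q_i$.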

Worked example.

Schema TOML — copy + register via POST /v1/tenants/:t/schemas with Content-Type: text/plain.

# ──────────────────────────────────────────────────────────────────────
# Important: FTS indexes are NOT declared in the schema TOML.
# The grammar oc_schema::Manifest accepts does NOT have a [[extractions.fts]]
# block. Tokenizer / analyzer / stem language all live on the runtime
# POST /fts/:table/:field body — set once per (table, field) pair.
#
# What the TOML IS for (FTS workflows): registering the ROW schema so the
# same row is reachable via SQL alongside FTS search. Reusing the row's id
# as the FTS doc_id keeps the two aligned.
# ──────────────────────────────────────────────────────────────────────

namespace   = "shop"
table       = "products"
primary_key = ["id"]

[[columns]]
name = "id"
ty = "str"
required = true

[[columns]]
name = "name"
ty = "str"

[[columns]]
name = "description"
ty = "str"

[[columns]]
name = "category"
ty = "str"

[[indexes]]
name    = "by_category"
columns = ["category"]
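
Registering the TOML above is a single POST with Content-Type: text/plain, as stated at the top of this example. A minimal sketch follows, assuming the schema is saved to a local file and that BASE, T, and BEARER are set as in the runtime calls. The function name register_schema and the CURL override variable are hypothetical conveniences, not part of the product.

```shell
# Hypothetical helper: upload a schema TOML file to the schemas endpoint.
#   $1 = path to the TOML file (for example products.toml, an assumed name)
# Set CURL to another command (e.g. echo) to preview the call without sending it.
register_schema() {
  [ -r "$1" ] || { echo "cannot read schema file: $1" >&2; return 1; }
  # --data-binary sends the file byte for byte; plain -d would strip the
  # newlines when reading from a file, which would corrupt the TOML.
  ${CURL:-curl} -sS -X POST "${BASE}/v1/tenants/${T}/schemas" \
    -H "Authorization: Bearer ${BEARER}" \
    -H "Content-Type: text/plain" \
    --data-binary "@$1"
}
```

Running `CURL=echo register_schema products.toml` prints the arguments that would be passed to curl, which is a cheap way to check the URL and headers before touching a live tenant.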

Runtime calls.

# ════════════════════════════════════════════════════════════════════
# INDEX — 3 ways to load text
# ════════════════════════════════════════════════════════════════════

# 1) Plain text per doc — simplest
curl -X POST $BASE/v1/tenants/$T/fts/shop_products/description -H "Authorization: Bearer $BEARER" \
  -H "Content-Type: application/json" \
  -d '{
    "doc_id": "p001",
    "text":   "Wireless Bluetooth headphones with active noise cancellation"
  }'

# 2) JSON-aware — walks dotted paths inside a nested doc
curl -X POST $BASE/v1/tenants/$T/fts/shop_products/description/json -H "Authorization: Bearer $BEARER" \
  -H "Content-Type: application/json" \
  -d '{
    "doc_id": "p001",
    "json": {
      "name": "Wireless Headphones",
      "desc": { "short": "BT 5.3 ANC", "long": "Over-ear noise-cancelling..." },
      "tags": ["audio", "premium"]
    },
    "paths": ["name", "desc.short", "desc.long", "tags"]
  }'
# Omit "paths" to index every string leaf in the document.

# 3) Store doc text + facets — REQUIRED for highlight=true and facets= queries
curl -X POST $BASE/v1/tenants/$T/fts/shop_products/description/doc -H "Authorization: Bearer $BEARER" \
  -H "Content-Type: application/json" \
  -d '{
    "doc_id": "p001",
    "text":   "Wireless Bluetooth headphones with active noise cancellation",
    "facets": {
      "category":     ["electronics"],
      "brand":        ["acme"],
      "price_bucket": ["100-200"]
    }
  }'

# ════════════════════════════════════════════════════════════════════
# CONFIG — synonyms + stopwords (optional, per (table, field) pair)
# ════════════════════════════════════════════════════════════════════

# Install synonyms — each class is treated as equivalent at both index and query time
curl -X POST $BASE/v1/tenants/$T/fts/shop_products/description/synonyms -H "Authorization: Bearer $BEARER" \
  -H "Content-Type: application/json" \
  -d '{
    "synonyms": {
      "headphones": ["earbuds", "earphones", "cans"],
      "laptop":     ["notebook", "computer"],
      "tv":         ["television"]
    }
  }'

# Install stopwords — dropped at both index and query time
curl -X POST $BASE/v1/tenants/$T/fts/shop_products/description/stopwords -H "Authorization: Bearer $BEARER" \
  -H "Content-Type: application/json" \
  -d '{
    "stopwords": ["the", "a", "an", "and", "or", "of", "in", "with", "for"]
  }'

# ════════════════════════════════════════════════════════════════════
# SEARCH — 3 modes × 6 query params
# ════════════════════════════════════════════════════════════════════

# Boolean mode (DEFAULT) — AND of terms, no scoring, fastest
curl "$BASE/v1/tenants/$T/fts/shop_products/description?q=wireless+headphones" \
  -H "Authorization: Bearer $BEARER"

# BM25 mode — ranked, top-k cap via k=
curl "$BASE/v1/tenants/$T/fts/shop_products/description?q=wireless&mode=bm25&k=10" \
  -H "Authorization: Bearer $BEARER"

# Phrase mode — exact in-order match
curl "$BASE/v1/tenants/$T/fts/shop_products/description?q=noise+cancellation&mode=phrase" \
  -H "Authorization: Bearer $BEARER"

# Fuzzy — every BM25 term treated as term~N. fuzzy=1 catches single typos.
# Capped at MAX_EDIT_DISTANCE = 3.
curl "$BASE/v1/tenants/$T/fts/shop_products/description?q=wirless&mode=bm25&fuzzy=1&k=10" \
  -H "Authorization: Bearer $BEARER"

# Highlight — returns {highlights: {description: ["…<em>wireless</em>…"]}} per hit
# Requires that the doc text was stored via the POST .../doc call above. BM25 mode only.
curl "$BASE/v1/tenants/$T/fts/shop_products/description?q=wireless&mode=bm25&k=5&highlight=true" \
  -H "Authorization: Bearer $BEARER"

# Facets — comma-separated facet field names from the stored doc.
# Returns aggregated {category: {electronics: 5, books: 2}} alongside hits.
curl "$BASE/v1/tenants/$T/fts/shop_products/description?q=wireless&mode=bm25&k=5&facets=category,brand" \
  -H "Authorization: Bearer $BEARER"

# Kitchen-sink — fuzzy + highlight + facets in one call
curl "$BASE/v1/tenants/$T/fts/shop_products/description?q=wirless&mode=bm25&k=5&fuzzy=1&highlight=true&facets=category,brand,price_bucket" \
  -H "Authorization: Bearer $BEARER"
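
The plain-text indexing call shown earlier handles one document per request, so loading a catalogue means looping. The sketch below reads `doc_id<TAB>text` lines from stdin and posts each one. All function names and the DRY_RUN switch are hypothetical; the JSON escaping is deliberately minimal (backslashes and double quotes only, no control characters), so treat it as a starting point rather than a robust loader.

```shell
# Escape backslashes and double quotes for use inside a JSON string literal.
# Minimal on purpose: control characters and newlines are NOT handled.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Build the { doc_id, text } body used by POST /fts/:table/:field.
fts_index_body() {
  printf '{"doc_id":"%s","text":"%s"}\n' "$(json_escape "$1")" "$(json_escape "$2")"
}

# Read "doc_id<TAB>text" lines from stdin and index each as plain text.
#   $1 = table bucket, $2 = field bucket
# With DRY_RUN set, print the request bodies instead of sending them.
fts_bulk_index() {
  _tab=$(printf '\t')
  while IFS="$_tab" read -r _id _text; do
    [ -n "$_id" ] || continue   # skip blank lines
    _body=$(fts_index_body "$_id" "$_text")
    if [ -n "${DRY_RUN:-}" ]; then
      printf '%s\n' "$_body"
    else
      curl -sS -X POST "${BASE}/v1/tenants/${T}/fts/$1/$2" \
        -H "Authorization: Bearer ${BEARER}" \
        -H "Content-Type: application/json" \
        -d "$_body"
    fi
  done
}
```

A typical invocation would be `fts_bulk_index shop_products description < products.tsv`, where products.tsv is an assumed file name. Running it once with DRY_RUN=1 first lets you inspect the generated bodies.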