

What is a Knowledge Base?

A Knowledge Base (KB) is a container that holds one or more sources and makes their content searchable. When you add a source to a KB, the platform processes the extracted text using the configured indexing strategy and stores the results for retrieval. The indexing strategy determines how content is structured — from simple chunking to hierarchical document trees to structured JSON extraction. The retrieval strategy determines how queries find relevant content — from fast vector similarity to LLM-driven reasoning over document structure.

Indexing Pipeline

The indexing pipeline runs automatically when you add a source to a knowledge base. What happens during indexing depends on the strategy: ChunkEmbed splits text and generates embeddings, PageIndex builds a hierarchical tree with LLM-generated summaries, GraphIndex extends PageIndex with cross-reference detection and node embeddings, and Doc2JSON extracts structured fields. You can reindex at any time to reprocess with different settings.
Indexing pipeline: Source page texts → Indexing Strategy (ChunkEmbed, PageIndex, GraphIndex, or Doc2JSON) → Storage (pgvector chunks, tree nodes, or structured JSON).
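As a minimal sketch of how that trigger looks in practice, the request below assumes a sources sub-endpoint and a URL-based payload; the exact route and fields for adding a source may differ in your deployment, so treat it as illustrative only.
# Sketch: attaching a source starts the indexing pipeline for that KB.
# The "/sources" route and the payload shape are assumptions for illustration.
response = requests.post(
    f"{BASE_URL}/api/knowledge-bases/{kb_id}/sources",
    headers=headers,
    json={"url": "https://example.com/product-guide.pdf"},
)
print(response.status_code)  # indexing then runs in the background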

Indexing Strategies

The indexing strategy controls how source content is processed and stored. Each strategy produces different data structures, supports different retrieval methods, and has different cost profiles. You set the strategy when creating or updating a knowledge base via the indexing_config field.

| Strategy | Best For | Cost | Indexing Speed | Retrieval Speed | Compatible Retrieval |
| --- | --- | --- | --- | --- | --- |
| ChunkEmbed | General RAG, most documents | Low (embedding only) | Fast | Fast | Vector, Hybrid, Full-text |
| PageIndex | Long structured PDFs, complex docs | Medium–High (many LLM calls) | Slow (many LLM calls) | Slow (LLM at query time) | Tree Search only |
| GraphIndex | Cross-referenced documents | High (PageIndex + enrichment + embedding) | Slow (PageIndex + enrichment + embedding) | Fast (vector/hybrid, no LLM) | Vector, Hybrid, Full-text + graph expansion |
| Doc2JSON | Structured field extraction | Medium (LLM per window) | Medium (LLM per window) | Fast | Vector on summary |

ChunkEmbed (Default)

The standard RAG approach. Text is split into overlapping chunks using a configurable chunking strategy, each chunk is embedded into a vector, and the vectors are stored in pgvector for similarity search. A BM25 sparse index is also built over the chunk text, enabling keyword-based full-text search alongside vector search. This is the fastest and cheapest indexing strategy — no LLM calls are needed, only an embedding API call. ChunkEmbed pairs with vector search, hybrid search, or full-text search for retrieval. Hybrid search (which runs both vector and BM25 in parallel and fuses results) is the recommended default for production RAG applications.

Chunking Strategies

ChunkEmbed supports three chunking strategies that control how text is split into searchable units. The chunking strategy is independent of the embedding model.

| Chunker | How It Works | Best For |
| --- | --- | --- |
| markdown_header (default) | Splits at Markdown headers (h1–h6), then subdivides each section by length using recursive chunking. Prepends the section header as context to each chunk. | The platform default — works well for documents with heading structure |
| recursive | Splits at natural boundaries in order of preference: paragraph breaks, line breaks, sentence endings, then words. Respects document flow. | Unstructured prose, articles, general text without clear heading structure |
| fixed_size | Splits at fixed token counts with word-boundary snapping. Simple and predictable chunk sizes. | Uniform content like logs, transcripts, or code where structure doesn't matter |

Chunk size and overlap control the tradeoff between precision and context. Smaller chunks (500–1000 tokens) give more precise retrieval but may lose surrounding context. Larger chunks (2000–4000 tokens) preserve context but can dilute relevance. The defaults (2000 tokens, 50 token overlap) work well for most use cases.
# Create a KB with ChunkEmbed (the default strategy)
response = requests.post(
    f"{BASE_URL}/api/knowledge-bases",
    headers=headers,
    json={
        "name": "Product Docs",
        "indexing_config": {
            "strategy": "chunk_embed",
            "chunk_size": 1500,
            "overlap": 100,
            "embedding_model": "text-embedding-3-small",
        },
    },
)
kb = response.json()
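If you want a different chunker, it is presumably selected in indexing_config alongside chunk_size; the "chunking_strategy" key in the sketch below is an assumed field name, so check the API reference for the exact key.
# Sketch: ChunkEmbed with the recursive chunker for unstructured prose.
# The "chunking_strategy" key is an assumed field name for illustration.
response = requests.post(
    f"{BASE_URL}/api/knowledge-bases",
    headers=headers,
    json={
        "name": "Blog Archive",
        "indexing_config": {
            "strategy": "chunk_embed",
            "chunking_strategy": "recursive",  # or "markdown_header" / "fixed_size"
            "chunk_size": 1000,
            "overlap": 100,
        },
    },
)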

PageIndex

PageIndex builds a hierarchical tree of your document's structure — sections, subsections, and their content — using LLM-powered analysis. The result is stored as two artifacts: a lightweight ToC (titles and LLM-generated summaries, no full text) and a flat list of section nodes (with the actual text). Retrieval works in two phases: an LLM first reasons over the lightweight ToC to identify relevant sections by structure, then the platform fetches the full text of those sections.

PageIndex has two pipelines that are selected automatically. For Markdown content, it parses headers into a nested tree, splits oversized leaf nodes using LLM calls, and generates per-node summaries. For PDFs (when page_texts are provided), it scans the first pages for a table of contents, calibrates page-number offsets against actual headings, infers structure via LLM if no ToC is found, then assigns page text to tree nodes and merges small siblings. Both pipelines cap LLM concurrency at 7 calls by default to avoid rate limits.
Note: PageIndex is expensive to index. PageIndex makes many LLM calls during indexing — for ToC detection, structure inference, oversized node splitting, and summary generation. A 100-page PDF may take several minutes and consume significant LLM tokens. Use this strategy when document structure matters for retrieval quality — compliance manuals, legal contracts, technical specifications, academic papers.
Note: PageIndex retrieval uses LLM reasoning, not vectors. Tree Search retrieval does not use vector similarity. It sends the document's structural outline (titles + summaries, no text) to an LLM and asks it to identify the most relevant sections. This means each retrieval call incurs LLM cost and latency — typically 1–3 seconds per query. For high-throughput, latency-sensitive workloads, ChunkEmbed with hybrid search is more appropriate.
# Create a KB with PageIndex for structured document retrieval
response = requests.post(
    f"{BASE_URL}/api/knowledge-bases",
    headers=headers,
    json={
        "name": "Compliance Manual",
        "indexing_config": {
            "strategy": "page_index",
            "extra": {
                "model": "gpt-4o",
                "if_add_node_summary": "yes",
            },
        },
        "retrieval_config": {
            "method": "tree_search",
            "top_k": 5,
        },
    },
)

GraphIndex

GraphIndex builds on PageIndex with two additional stages: cross-reference enrichment and node embedding. In stage one, it runs the same PageIndex pipeline to build the hierarchical document tree. In stage two, an LLM analyzes each node's text against the full table of contents and identifies which other sections the node explicitly references — citations, mentions, dependencies, or cross-references (not structural parent/child relationships). These references are stored in each node's metadata. In stage three, each node's title and summary (plus its reference list) are embedded into a vector, and a BM25 sparse index is built over node text.

The key advantage over plain PageIndex is retrieval flexibility. Because nodes have embeddings and a BM25 index, GraphIndex knowledge bases support vector search, hybrid search, and full-text search — the same fast retrieval methods as ChunkEmbed but over structured document sections instead of arbitrary chunks. After the initial retrieval, graph expansion automatically pulls in first-degree referenced nodes (sections that the matched sections explicitly cite), enriching results with related context. This makes GraphIndex suited for documents with dense internal references — regulatory frameworks, technical standards, codebases with cross-module dependencies.
Note: GraphIndex is expensive to index, but cheap to retrieve. GraphIndex performs all the LLM work of PageIndex (tree building, node splitting, summary generation) plus one additional LLM call per node for cross-reference detection, plus embedding computation for every node. For a document with 50 sections, that means ~50 extra LLM calls on top of the PageIndex work. Enrichment concurrency is capped at 7 by default. However, unlike PageIndex, retrieval is fast and cheap — it uses vector, hybrid, or full-text search over node embeddings (no LLM calls at query time). Use GraphIndex when you want structural awareness during indexing with fast, scalable retrieval.
# Create a KB with GraphIndex for cross-referenced document retrieval
response = requests.post(
    f"{BASE_URL}/api/knowledge-bases",
    headers=headers,
    json={
        "name": "Regulatory Framework",
        "indexing_config": {
            "strategy": "graph_index",
            "extra": {
                "model": "gpt-4o",
                "enrichment_model": "gpt-4o",
                "embedding_model": "text-embedding-3-small",
                "if_add_node_summary": "yes",
            },
        },
        "retrieval_config": {
            "method": "hybrid",
            "top_k": 10,
        },
    },
)

Doc2JSON

Doc2JSON extracts structured data from documents using a sliding-window LLM approach. You define a JSON schema with the fields you want to extract (names, types, descriptions, examples), and the platform slides a window across the document content. For each window, an LLM extracts a brief summary and fills in schema fields from the visible text. Extractions are merged across windows: scalar fields use last-value-wins, arrays accumulate new items, and objects are deep-merged. After all windows are processed, a final LLM call generates a combined document summary, which is embedded for vector retrieval.

Doc2JSON supports two modes. Text mode (default) processes extracted text using token-based windows (default 4000 tokens, 200 overlap). Image mode processes page screenshots directly as multimodal content — useful for documents with complex layouts, tables, or forms where text extraction loses formatting. In image mode, pages are grouped into windows (default 3 pages per window) and sent as images to the LLM.
# Create a KB with Doc2JSON for invoice extraction
response = requests.post(
    f"{BASE_URL}/api/knowledge-bases",
    headers=headers,
    json={
        "name": "Invoice Extraction",
        "indexing_config": {
            "strategy": "doc2json",
            "extra": {
                "json_schema": {
                    "fields": [
                        {"name": "vendor_name", "type": "string", "description": "Company that issued the invoice"},
                        {"name": "invoice_date", "type": "string", "description": "Date of the invoice"},
                        {"name": "total_amount", "type": "number", "description": "Total amount due"},
                        {"name": "line_items", "type": "array", "description": "Individual line items",
                         "item_type": "object", "items": {
                            "type": "object", "fields": [
                                {"name": "description", "type": "string"},
                                {"name": "quantity", "type": "integer"},
                                {"name": "unit_price", "type": "number"},
                            ]
                        }},
                    ]
                },
                "extraction_model": "gpt-4o",
            },
        },
    },
)
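For layout-heavy documents, image mode can be worth the extra cost. The sketch below assumes the mode and window size are set via extra; the "mode" and "pages_per_window" keys are illustrative guesses rather than confirmed field names.
# Sketch: Doc2JSON in image mode for scanned or layout-heavy documents.
# "mode" and "pages_per_window" are assumed field names for illustration.
response = requests.post(
    f"{BASE_URL}/api/knowledge-bases",
    headers=headers,
    json={
        "name": "Scanned Purchase Orders",
        "indexing_config": {
            "strategy": "doc2json",
            "extra": {
                "mode": "image",           # default is text mode
                "pages_per_window": 3,     # pages sent to the LLM per window
                "json_schema": {
                    "fields": [
                        {"name": "po_number", "type": "string"},
                        {"name": "total_amount", "type": "number"},
                    ]
                },
                "extraction_model": "gpt-4o",
            },
        },
    },
)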

Retrieval Strategies

The retrieval strategy controls how queries find relevant content within a knowledge base. You set the retrieval method when creating a KB or when calling the search endpoint. The right choice depends on your indexing strategy, query patterns, latency requirements, and budget.

| Method | How It Works | Latency | Cost | Best For |
| --- | --- | --- | --- | --- |
| vector_search | Embeds the query and finds nearest vectors via cosine similarity in pgvector | Very low (~100ms) | Low (one embed call) | Semantic matching — captures meaning even without shared keywords |
| full_text | BM25 keyword scoring with stemming via PostgreSQL tsvector | Low | None (no API calls) | Exact phrases, product names, error codes, IDs, proper nouns |
| hybrid (recommended) | Runs vector + BM25 in parallel, fuses results with Reciprocal Rank Fusion (k=60) | Low | Low (one embed call) | Production RAG — robust across query types |
| tree_search | LLM selects documents, then selects sections by reasoning over ToC structure | Medium (1–3s) | Medium (two LLM calls) | PageIndex KBs only — complex structural queries |

Vector Search

Vector search embeds the query using the same model as indexing, then finds the most similar chunk embeddings via cosine similarity in pgvector. It captures semantic meaning — "How do I reset my credentials?" will match chunks about password resets even without shared keywords. Results are ranked by similarity score (higher = more relevant). An optional similarity_threshold filters out low-quality matches.
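As an illustrative sketch, a one-off vector search with a similarity floor might look like the request below; whether the search endpoint accepts "method" and "similarity_threshold" overrides in exactly this shape is an assumption.
# Sketch: vector search with a similarity floor. Passing "method" and
# "similarity_threshold" in the search payload is assumed for illustration.
response = requests.post(
    f"{BASE_URL}/api/knowledge-bases/{kb_id}/search",
    headers=headers,
    json={
        "query": "How do I reset my credentials?",
        "method": "vector_search",
        "top_k": 5,
        "similarity_threshold": 0.35,  # drop weak matches
    },
)
for chunk in response.json()["results"]:
    print(f"{chunk['similarity']:.3f}  {chunk['content'][:80]}")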

Full-Text Search (BM25)

Full-text search uses BM25 scoring — a keyword relevance algorithm that considers term frequency, document length, and inverse document frequency. Terms are stemmed using PostgreSQL's English dictionary (to_tsvector), so "running" matches "run". BM25 uses standard parameters: k1=1.2 for term frequency saturation and b=0.75 for length normalization. No API calls are needed — scoring runs entirely in PostgreSQL. This complements vector search by catching results that share keywords but may not be semantically close in embedding space.

Hybrid Search

Hybrid search runs both vector search and full-text search in parallel, then fuses the results using Reciprocal Rank Fusion (RRF). RRF merges ranked lists without needing to normalize incompatible score ranges — it uses rank positions, not scores. The formula: rrf_score(d) = sum over lists of weight / (k + rank), with k=60 (the constant from the original RRF paper). The vector_weight parameter (default 0.5) balances the two signals: higher values favor semantic matches, lower values favor keyword matches. Results are normalized so the top result has score 1.0.
# Create a KB with hybrid search retrieval
response = requests.post(
    f"{BASE_URL}/api/knowledge-bases",
    headers=headers,
    json={
        "name": "Product Docs",
        "indexing_config": {
            "strategy": "chunk_embed",
            "chunk_size": 2000,
            "overlap": 50,
        },
        "retrieval_config": {
            "method": "hybrid",
            "top_k": 10,
            "vector_weight": 0.6,
        },
    },
)
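The fusion step itself is small enough to show directly. The following is a self-contained sketch of weighted Reciprocal Rank Fusion as described above, not the platform's internal implementation.
# Minimal sketch of weighted Reciprocal Rank Fusion (RRF) as described above.
# vector_hits and bm25_hits are ranked lists of result IDs, best first.
def rrf_fuse(vector_hits, bm25_hits, vector_weight=0.5, k=60):
    scores = {}
    for weight, ranked in ((vector_weight, vector_hits), (1 - vector_weight, bm25_hits)):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    fused = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    top_score = fused[0][1]
    # normalize so the top result has score 1.0
    return [(doc_id, score / top_score) for doc_id, score in fused]

print(rrf_fuse(["a", "b", "c"], ["b", "d", "a"], vector_weight=0.6))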

Tree Search

Tree Search is a two-phase LLM-driven retrieval method for PageIndex knowledge bases. In phase one, the LLM reviews compact summaries of each indexed document (name, description, top-level section titles) and selects which documents are relevant. In phase two, the LLM examines the selected documents' full ToC structure (section titles and summaries, no full text) and identifies the most relevant sections — returning up to top_k node IDs. The platform then fetches the full text of those sections from the database.

For multi-document KBs, node IDs are globally prefixed (e.g., d0:0001, d1:0005) so the LLM can reference sections across documents. Response parsing is robust: it tries JSON first, then falls back to regex pattern matching, and validates all returned IDs against the actual tree structure to prevent hallucinated references.
Note: Tree Search requires PageIndex. Tree Search reads from the page_index_toc and page_index_nodes tables. It is only compatible with the PageIndex strategy. GraphIndex uses vector/hybrid/full-text search over node embeddings instead.

Reranking

Reranking is an optional second stage that improves retrieval precision. The initial retrieval (vector, hybrid, or full-text) fetches a broad candidate pool — by default 20 items (the candidate_count parameter). A cross-encoder reranker then re-scores each candidate by evaluating the query-document pair jointly. Cross-encoders are more accurate than bi-encoder embeddings because they see the query and document together, but they can’t be used for initial retrieval because they don’t produce storable vectors. After reranking, the top_k results are returned to the caller.

| Reranker | Provider | Notes |
| --- | --- | --- |
| Rerank English v3.0 (default) | Cohere | High quality, English-optimized |
| Rerank Multilingual v3.0 | Cohere | Multilingual support |
| Jina Reranker v2 Base Multilingual | Jina AI | Multilingual, competitive quality |
| Rerank 2.5 | Voyage | Strong general-purpose reranker |
| Rerank 2.5 Lite | Voyage | Lighter variant, lower cost |

Note: Reranker API keys are platform-managed. Reranker API keys (Cohere, Jina, Voyage) are configured at the platform level by your administrator, not per-organization. If reranking returns errors, contact your platform admin to verify the reranker provider key is configured.
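Enabling a reranker presumably lives in retrieval_config next to top_k. Only candidate_count and top_k are documented on this page, so the "reranker" key and model identifier in the sketch below are assumptions.
# Sketch: hybrid retrieval with a reranking stage. candidate_count (first-stage
# pool) and top_k are described above; the "reranker" key and the model
# identifier are assumed names for illustration.
response = requests.post(
    f"{BASE_URL}/api/knowledge-bases",
    headers=headers,
    json={
        "name": "Support Articles",
        "indexing_config": {"strategy": "chunk_embed"},
        "retrieval_config": {
            "method": "hybrid",
            "candidate_count": 20,
            "top_k": 5,
            "reranker": "rerank-english-v3.0",
        },
    },
)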

Embedding Models

Embeddings convert text into high-dimensional vectors that capture semantic meaning. The platform uses OpenAI’s text-embedding-3-small by default (1536 dimensions). All chunks in a knowledge base must use the same embedding model — if you change the model, you must reindex. Embedding calls are batched at up to 250,000 tokens per API call for efficiency. The platform supports embedding models from multiple providers via LiteLLM — select your preferred model in Settings > Knowledge Indexing.

| Model | Provider | Dimensions | Tradeoff |
| --- | --- | --- | --- |
| text-embedding-3-small (default) | OpenAI | 1536 | Best balance of quality, cost, and speed. Fits within HNSW index limit. |
| text-embedding-3-large | OpenAI | 3072 | Higher quality, 2x storage. Exceeds HNSW dimension limit — see note below. |
| text-embedding-ada-002 | OpenAI | 1536 | Legacy model — use text-embedding-3-small instead. |
| embed-english-v3.0 | Cohere | 1024 | High-quality English embeddings. Fits within HNSW limit. |
| embed-multilingual-v3.0 | Cohere | 1024 | Multilingual support across 100+ languages. |
| embed-english-light-v3.0 / embed-multilingual-light-v3.0 | Cohere | 384 | Lightweight variants — faster and cheaper, lower quality. |
| voyage/voyage-01 | Voyage AI | 1024 | Strong general-purpose embeddings from Voyage AI. |
| gemini/text-embedding-004 | Google | 768 | Google Gemini embedding model. |
| mistral/mistral-embed | Mistral | 1024 | Mistral AI embedding model. |

Note on embedding provider API keys: OpenAI embeddings use the platform-managed OPENAI_API_KEY. For other providers (Cohere, Voyage, Mistral, Google), the corresponding API key environment variable (e.g. COHERE_API_KEY, VOYAGE_API_KEY, MISTRAL_API_KEY) must be configured at the platform level by your administrator before creating projects. These keys are passed through to LiteLLM at runtime. Contact your platform admin if a non-OpenAI embedding model returns authentication errors.
Note on the HNSW index dimension limit: The platform uses pgvector HNSW indexes for fast approximate nearest-neighbor search. HNSW indexes support a maximum of 2000 dimensions. The default model text-embedding-3-small (1536 dimensions) fits within this limit and gets full HNSW acceleration. Models with more than 2000 dimensions (like text-embedding-3-large at 3072) fall back to sequential scan — still correct but significantly slower for large knowledge bases.
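For example, a KB indexed with Cohere embeddings (1024 dimensions, within the HNSW limit) can be created as shown below, assuming your administrator has configured COHERE_API_KEY at the platform level per the note above.
# Create a KB that embeds chunks with Cohere's embed-english-v3.0.
# Requires the platform-level COHERE_API_KEY to be configured (see note above).
response = requests.post(
    f"{BASE_URL}/api/knowledge-bases",
    headers=headers,
    json={
        "name": "English Support Docs",
        "indexing_config": {
            "strategy": "chunk_embed",
            "embedding_model": "embed-english-v3.0",
        },
    },
)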

Searching a Knowledge Base

Once indexed, you can search a knowledge base with any natural language query. The search uses whichever retrieval method was configured on the KB, or you can override it per-request. Results include the matched text, relevance scores, and source metadata.
response = requests.post(
    f"{BASE_URL}/api/knowledge-bases/{kb_id}/search",
    headers=headers,
    json={"query": "How do I reset my password?", "top_k": 5},
)
results = response.json()
for chunk in results["results"]:
    print(f"Score: {chunk['similarity']:.3f}")
    print(chunk["content"][:200])

Reindexing

Note: Reindexing replaces all indexed content. When you reindex a knowledge base, all existing chunks, tree nodes, or extracted JSON are deleted and recreated from scratch. The KB remains searchable during reindexing but results may be incomplete until it finishes. For large KBs with PageIndex or GraphIndex, reindexing can take significant time and LLM tokens.
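A reindex is typically triggered with a single call; the route in the sketch below is an assumed endpoint name, so check the API reference for the exact path.
# Sketch: trigger a full reindex after changing indexing settings.
# The "/reindex" route is an assumed endpoint name for illustration.
response = requests.post(
    f"{BASE_URL}/api/knowledge-bases/{kb_id}/reindex",
    headers=headers,
)
print(response.status_code)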

Choosing a Configuration

| Use Case | Indexing | Retrieval | Notes |
| --- | --- | --- | --- |
| General RAG (default) | ChunkEmbed (2000 tokens, 50 overlap) | Hybrid Search | Works for most documents. Add a reranker for higher precision. |
| Long structured PDFs | PageIndex | Tree Search | Compliance, legal, technical specs. Higher cost but superior structure-aware retrieval. |
| Cross-referenced documents | GraphIndex | Hybrid Search | Regulations, standards. Vector/hybrid search over node embeddings with automatic graph expansion of referenced sections. |
| Keyword-heavy content | ChunkEmbed | Full-Text Search | Logs, code, error messages. BM25 excels at exact matches without embedding cost. |
| Invoice / form extraction | Doc2JSON | Vector Search | Define a schema and extract structured fields from documents. |

Project-Level Defaults

You can configure project-wide defaults for all indexing and retrieval parameters in Settings > Knowledge Indexing and Settings > Knowledge Retrieval. These defaults apply to newly created knowledge bases unless overridden in the indexing_config or retrieval_config at creation time. Settings include chunk sizes, embedding models, LLM models for PageIndex and GraphIndex, reranker configuration, and many advanced tuning parameters.

Next Steps

Create a Knowledge Base

Step-by-step guide to creating and indexing a KB.

Agents & Tools

Attach a KB to an agent for RAG-powered conversations.

Knowledge Bases API Reference

Full endpoint documentation.