
Features

A map of what Kreuzberg can do. Each section links to the guide or reference page with configuration details and code examples.

Kreuzberg features overview -- 96 input formats flow through extraction, OCR, and processing to produce text, tables, chunks, and metadata


Format Support

96 file formats handled by native Rust extractors — no LibreOffice or other external tools required.

PDF .pdf Word .docx .doc Pages .pages PowerPoint .pptx .ppt Keynote .key OpenDocument .odt Plain text .txt Markdown .md Djot .djot MDX .mdx RTF .rtf reStructuredText .rst Org .org Hangul .hwp .hwpx

Excel .xlsx .xls .xlsm .xlsb Numbers .numbers OpenDocument .ods CSV .csv TSV .tsv dBASE .dbf

JPEG .jpg .jpeg PNG .png GIF .gif BMP .bmp TIFF .tiff .tif WebP .webp JPEG 2000 .jp2 .jpx .jpm .mj2 JBIG2 .jbig2 PNM .pnm .pbm .pgm .ppm HEIC .heic .heics HEIF .heif AVIF .avif AVCS .avcs

HEIF / HEIC / AVIF

Pixel decoding for HEIF-family containers requires the heic Cargo feature (included in full) and the system libheif library at build and runtime. Native targets only — not available on wasm-target or android-target. EXIF metadata extraction from HEIC / AVIF works on every target via the pure-Rust nom-exif integration. See the installation guide.

MP3 .mp3 .mpga M4A .m4a WAV .wav WebM audio .webm MP4 audio track .mp4 .mpeg WebM audio track .webm

Since v5.0

Enable the transcription feature and set a transcription config block to extract Whisper ONNX transcripts from audio files and video audio tracks. See Audio and Video Transcription.

EML .eml MSG .msg

HTML .html .htm XHTML .xhtml XML .xml SVG .svg

JSON .json YAML .yaml TOML .toml

ZIP .zip TAR .tar .tgz GZIP .gz 7-Zip .7z

EPUB .epub BibTeX .bib RIS .ris CSL .csl LaTeX .tex Typst .typ JATS .jats DocBook .docbook OPML .opml

For the full format matrix with MIME types, extraction methods, and special capabilities, see the Format Support Reference.


Feature Availability

Use these labels when matching docs to deployed packages. Labels use major.minor only.

| Version | Feature area |
| --- | --- |
| v4.0 | HTML metadata extraction and the pdf_oxide PDF provider. |
| v4.3 | LibreOffice-free extraction for legacy .doc and .ppt files. |
| v4.5 | OCR pipeline fallback, layout detection, and document-level OCR. |
| v4.6 | PDF page rendering. |
| v4.8 | LLM/VLM intelligence through liter-llm. |
| v5.0 | Image-index references, SVG/image output normalization, HEIC aggregate formats, list_supported_formats, Whisper audio/video transcription, reranking, NER, redaction, summarization, translation, page classification, image captions, QR-code detection, and the windows-target feature aggregate. |

Extraction Pipeline

Every file flows through the same multi-stage pipeline:

flowchart LR
    A[Input File] --> B[MIME Detection]
    B --> C[Format Extractor]
    C --> D{OCR Needed?}
    D -->|Yes| E[OCR Engine]
    D -->|No| F[Post-Processing]
    E --> F
    F --> G[ExtractionResult]
  1. MIME detection -- Kreuzberg identifies the file type from magic bytes and extension, then selects the matching native extractor from the registry.
  2. Format extraction -- The extractor pulls text, tables, metadata, and optionally images from the file. PDF extraction uses pdf_oxide (pure Rust); Office formats use native XML or OLE/CFB parsers; images pass directly to OCR.
  3. OCR -- When the extractor finds no text layer (or force_ocr is set), the file is routed to the configured OCR backend. The OCR result replaces or supplements the extracted text.
  4. Post-processing -- Validators, quality processing, chunking, embeddings, keyword extraction, and any registered post-processor plugins run in sequence.
  5. Caching -- If caching is enabled, results are stored keyed by a content hash so repeated extractions skip the entire pipeline.
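
Stages 1 and 5 can be illustrated with a minimal Python sketch. This is not Kreuzberg's implementation: the signature table, the extension map, the SHA-256 key, and the names detect_mime and extract_cached are all assumptions made for illustration.

```python
import hashlib
from pathlib import Path

# Hypothetical signature table: a few well-known magic-byte prefixes.
MAGIC_BYTES = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
]

# Hypothetical extension fallback for formats without a distinctive header.
EXTENSIONS = {".md": "text/markdown", ".csv": "text/csv", ".txt": "text/plain"}


def detect_mime(data: bytes, filename: str) -> str:
    """Prefer magic bytes; fall back to the file extension."""
    for prefix, mime in MAGIC_BYTES:
        if data.startswith(prefix):
            return mime
    return EXTENSIONS.get(Path(filename).suffix.lower(), "application/octet-stream")


_cache: dict[str, str] = {}


def extract_cached(data: bytes, filename: str, extract) -> str:
    """Run `extract` once per distinct content; later calls hit the cache."""
    key = hashlib.sha256(data).hexdigest()  # assumed hash; the key depends on content only
    if key not in _cache:
        _cache[key] = extract(data, detect_mime(data, filename))
    return _cache[key]
```

Because the key is derived from the bytes rather than the path, a renamed copy of the same file is served from the cache without re-running extraction.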

For a deep dive into each stage, see Extraction Pipeline.

Output Formats

Kreuzberg supports five output formats: Plain text, Markdown, Djot, HTML, and Structured (JSON). The HTML format includes a styled renderer with semantic kb-* CSS classes, five built-in themes, and CSS custom properties for full customization. See HTML Output for details.


OCR Engines

Three OCR backends, usable individually or chained into a quality-driven fallback pipeline.

Backend Comparison

| | Tesseract | PaddleOCR | EasyOCR |
| --- | --- | --- | --- |
| Languages | 100+ | 80+ (11 script families) | 80+ |
| Best for | General purpose, broad language coverage | CJK, complex scripts, high accuracy | GPU-accelerated workloads |
| Platform | Native and WASM targets | Native ONNX Runtime builds | Python only |
| Install | System package (tesseract-ocr) | Cargo feature paddle-ocr (bundled in Python package by v4.8) | pip install kreuzberg[easyocr] |
| Runtime | C library (Tesseract 4.0+) | ONNX Runtime (models downloaded on first use) | PyTorch (optional CUDA) |
| Python version | Any | Any | Any |

Multi-Backend Pipeline

Available by v4.5

When the paddle-ocr feature is enabled, Kreuzberg automatically constructs a fallback pipeline: Tesseract runs first, and if the output falls below configurable quality thresholds (16 tunable parameters), PaddleOCR takes over. You can also define a custom ordering across all three backends.

The pipeline supports auto-rotate for page orientation detection (0/90/180/270 degrees) and per-stage language and backend-specific settings.

flowchart TD
    A[Image / Scanned Page] --> B[Primary Backend]
    B --> C{Quality Above Threshold?}
    C -->|Yes| D[Return Result]
    C -->|No| E[Fallback Backend]
    E --> F{Quality Above Threshold?}
    F -->|Yes| D
    F -->|No| G[Return Best Result]
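
The control flow in the diagram can be sketched in a few lines of Python. The OcrResult type, the single quality score, and run_with_fallback are hypothetical; the real pipeline evaluates several tunable thresholds rather than one number.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class OcrResult:
    backend: str
    text: str
    quality: float  # hypothetical score in [0, 1]


def run_with_fallback(
    image: bytes,
    backends: list[Callable[[bytes], OcrResult]],
    threshold: float,
) -> OcrResult:
    """Try backends in order; stop at the first result that clears the threshold."""
    best = None
    for backend in backends:
        result = backend(image)
        if result.quality >= threshold:
            return result
        # Remember the strongest attempt in case nothing clears the threshold.
        if best is None or result.quality > best.quality:
            best = result
    if best is None:
        raise ValueError("no OCR backends configured")
    return best
```

If every backend falls short, the function returns the highest-scoring attempt, which corresponds to the "Return Best Result" node.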

Document-Level Optimization

Available by v4.5

Some OCR backends (including EasyOCR) now support document-level processing. When a file path is provided, the extractor can bypass the expensive page-by-page rendering stage and delegate the entire document to the OCR engine. This significantly reduces memory overhead and improves throughput for large PDFs and multi-page images.

For backend configuration, language selection, and PSM/OEM modes, see the OCR Guide.

Candle GLM-OCR

Added in v5.0.0-rc.18

Pure-Rust VLM OCR via the candle-glm-ocr feature. Wraps the zai-org/GLM-OCR 0.9B-param vision-language model running natively through the candle transformer framework. No ONNX Runtime dependency.

Feature flag: candle-glm-ocr

Implies: candle-ocr, kreuzberg-candle-ocr/glm-ocr, layout-detection

Deployment:

  • CPU & Metal (macOS) — Full support
  • CUDA (Linux/Windows with NVIDIA GPU) — Full support
  • WASM — Excluded (candle not available on WASM)
  • Android x86_64 emulator — Excluded (no prebuilt candle targets)

Model & performance:

  • Model size: ~3 GB on first download; cached at ~/.cache/huggingface/
  • Default layout mode: paired — PP-DocLayout-V3 detects regions, per-region task-specific OCR (ocr/table/formula/chart/caption), outputs merged into reading-order markdown
  • Alternative mode: whole_page — Single OCR pass over entire page with optional task override
  • Metal dtype: F32 (BF16 matmul unavailable in candle 0.10)

Configure via --ocr-backend candle-glm-ocr or ocr.backend = "candle-glm-ocr" in config. Set layout mode and device via backend_options: {"layout_mode":"paired"}, {"layout_mode":"whole_page"}, {"device":"metal"}, {"device":"cuda"}.

Candle Hunyuan-OCR

Added in v5.0.0-rc.18

Pure-Rust VLM OCR via the candle-hunyuan-ocr feature. Tencent Hunyuan-OCR vision-language model with document layout understanding and multilingual support. No ONNX Runtime dependency.

Feature flag: candle-hunyuan-ocr

Implies: candle-ocr, kreuzberg-candle-ocr/hunyuan-ocr

Deployment:

  • CPU & Metal (macOS) — Full support
  • CUDA (Linux/Windows with NVIDIA GPU) — Full support
  • WASM — Excluded (candle not available on WASM)
  • Android x86_64 emulator — Excluded (no prebuilt candle targets)

Model & performance:

  • Model size: ~2 GB on first download; cached at ~/.cache/huggingface/
  • Detects layout and text regions, outputs merged into reading-order markdown
  • CPU dtype: F32; CUDA dtype: F16

Configure via --ocr-backend candle-hunyuan-ocr or ocr.backend = "candle-hunyuan-ocr" in config. Set device via backend_options: {"device":"metal"}, {"device":"cuda"}.

Attribution: Model vendored from jhqxxx/aha (Apache-2.0). See ATTRIBUTIONS.md.

Candle DeepSeek-OCR

Added in v5.0.0-rc.18

Pure-Rust VLM OCR via the candle-deepseek-ocr feature. DeepSeek-OCR vision-language model combining SAM, CLIP, Qwen2, and DeepSeek-V2 MoE architecture. Advanced document understanding with multilingual support. No ONNX Runtime dependency.

Feature flag: candle-deepseek-ocr

Implies: candle-ocr, kreuzberg-candle-ocr/deepseek-ocr

Deployment:

  • CPU & Metal (macOS) — Full support
  • CUDA (Linux/Windows with NVIDIA GPU) — Full support
  • WASM — Excluded (candle not available on WASM)
  • Android x86_64 emulator — Excluded (no prebuilt candle targets)

Model & performance:

  • Model size: ~3 GB+ on first download; cached at ~/.cache/huggingface/
  • Fine-grained layout detection, table region recognition, text extraction with confidence scores
  • CPU dtype: F32; CUDA dtype: F16

Configure via --ocr-backend candle-deepseek-ocr or ocr.backend = "candle-deepseek-ocr" in config. Set device via backend_options: {"device":"metal"}, {"device":"cuda"}.

Attribution: Model vendored from jhqxxx/aha (Apache-2.0). See ATTRIBUTIONS.md.

Candle PaddleOCR-VL 1.5

Added in v5.0.0-rc.18

Pure-Rust VLM OCR via the candle-paddleocr-vl-15 feature. PaddleOCR-VL 1.5 vision-language model with SigLIP+Ernie integration. Fast multilingual document OCR with strong CJK support. No ONNX Runtime dependency.

Feature flag: candle-paddleocr-vl-15

Implies: candle-ocr, kreuzberg-candle-ocr/paddleocr-vl-15

Deployment:

  • CPU & Metal (macOS) — Full support
  • CUDA (Linux/Windows with NVIDIA GPU) — Full support
  • WASM — Excluded (candle not available on WASM)
  • Android x86_64 emulator — Excluded (no prebuilt candle targets)

Model & performance:

  • Model size: ~1 GB on first download; cached at ~/.cache/huggingface/
  • Lightweight architecture optimized for speed and accuracy on scanned documents
  • CPU dtype: F32; CUDA dtype: F16

Configure via --ocr-backend candle-paddleocr-vl-15 or ocr.backend = "candle-paddleocr-vl-15" in config. Set device via backend_options: {"device":"metal"}, {"device":"cuda"}.

Attribution: Model vendored from jhqxxx/aha (Apache-2.0). See ATTRIBUTIONS.md.

Candle VLM-OCR Umbrella

The candle-vlm-ocr feature aggregates all Candle VLM-OCR backends: candle-hunyuan-ocr, candle-deepseek-ocr, candle-paddleocr-vl-15, candle-glm-ocr, and candle-trocr. Use this aggregate to enable all pure-Rust vision-language OCR options in a single feature flag.


Processing Features

Optional post-extraction steps, each configured independently through ExtractionConfig.

For RAG Pipelines

Content Chunking -- Split extracted text into sized chunks for LLM consumption. Strategies include recursive (paragraph/sentence/word splitting), semantic, and Markdown-aware chunking that preserves heading hierarchy. Chunks can be sized by character count or by token count using any HuggingFace tokenizer.
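
As an illustration of the recursive strategy, the sketch below splits on paragraph breaks first, then sentence ends, then spaces, and hard-cuts only as a last resort. The function name, the separator list, and the character-based sizing are assumptions, not Kreuzberg's chunker.

```python
def recursive_chunks(text: str, max_chars: int, separators=("\n\n", ". ", " ")) -> list[str]:
    """Split on the coarsest separator first, recursing into pieces that are still too long."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if not separators:
        # No separator left: hard-cut by character count.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    head, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(head):
        candidate = piece if not current else current + head + piece
        if len(candidate) <= max_chars:
            current = candidate  # keep packing pieces into the current chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= max_chars:
            current = piece
        else:
            # Still too long: retry this piece with the next, finer separator.
            chunks.extend(recursive_chunks(piece, max_chars, rest))
    if current:
        chunks.append(current)
    return chunks
```

In this simplified version, a separator that falls on a chunk boundary is dropped, so a chunk that ends at a sentence break loses its trailing period.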

Embeddings -- Generate vector embeddings locally using FastEmbed. Choose from preset models ("fast", "balanced", "quality") or any FastEmbed-compatible model. Embeddings are generated in-process with no external API calls.

Page Tracking -- Extract per-page content with byte-accurate offsets for O(1) page lookups. Chunks are automatically mapped to their source pages, enabling precise citations in retrieval systems. Supported for PDF (byte-accurate), PPTX (slide boundaries), and DOCX (best-effort page breaks). See Extraction Basics for usage.
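
The following sketch shows how per-page byte offsets make both lookups cheap. PageBoundary, page_text, and pages_for_span are hypothetical names; only the idea of storing byte offsets per page comes from the description above.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class PageBoundary:
    page_number: int  # 1-based
    byte_start: int   # inclusive offset into the UTF-8 content
    byte_end: int     # exclusive


def page_text(content: bytes, pages: list[PageBoundary], page_number: int) -> str:
    """Direct index plus slice: no scanning of the content."""
    page = pages[page_number - 1]
    return content[page.byte_start:page.byte_end].decode("utf-8")


def pages_for_span(pages: list[PageBoundary], start: int, end: int) -> list[int]:
    """Pages overlapped by the byte span [start, end), for example a chunk."""
    starts = [p.byte_start for p in pages]
    first = bisect_right(starts, start) - 1  # last page starting at or before `start`
    result = []
    for page in pages[max(first, 0):]:
        if page.byte_start >= end:
            break
        if page.byte_end > start:
            result.append(page.page_number)
    return result
```

Offsets are counted in bytes rather than characters, so slicing must happen on the encoded content; a multi-byte character such as "á" shifts every later offset by more than one.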

PDF Hierarchy Detection -- Detect document structure from PDFs using K-means clustering on block characteristics (font size, weight, indentation, position). Blocks are assigned to semantic levels (title, section, subsection, paragraph) without relying on explicit heading tags. See the Output Formats Guide.
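
To make the clustering idea concrete, here is a deliberately reduced sketch that runs K-means on font size alone. The real feature set also includes weight, indentation, and position; the function name, the deterministic initialisation, and the level numbering are assumptions.

```python
def kmeans_1d(values: list[float], k: int, iterations: int = 20) -> list[int]:
    """Cluster scalars with Lloyd's algorithm; returns a level per value (0 = largest)."""
    distinct = sorted(set(values), reverse=True)
    if k < 1 or len(distinct) < k:
        raise ValueError("need at least k distinct values")
    if k == 1:
        centroids = [distinct[0]]
    else:
        # Deterministic init: spread the centroids evenly across the distinct values.
        centroids = [distinct[i * (len(distinct) - 1) // (k - 1)] for i in range(k)]
    assignment = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centroid.
        assignment = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(values, assignment) if a == c]
            if members:
                centroids[c] = sum(members) / len(members)
    # Rank clusters by centroid so that level 0 is the largest font.
    order = sorted(range(k), key=lambda c: -centroids[c])
    rank = {cluster: level for level, cluster in enumerate(order)}
    return [rank[a] for a in assignment]
```

With k = 3, the returned levels could be mapped to labels such as title, section, and paragraph; blocks set in 16 pt and 16.5 pt land in the same level even though no heading tag is present.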

PDF Page Rendering v4.6 -- Render individual PDF pages as PNG images for thumbnails, vision model input, or custom processing pipelines. Memory-efficient iterator renders one page at a time. Configurable DPI (default 150). Available across all language bindings. See Extraction Guide.

LLM-Powered Intelligence

Available by v4.8

Kreuzberg integrates with 143 LLM providers including local inference (Ollama, LM Studio, vLLM, llama.cpp) via liter-llm to unlock three new capabilities that complement the local extraction pipeline.

VLM OCR -- Vision language models as an OCR backend. Use OpenAI GPT-4o, Anthropic Claude, Google Gemini, or any vision-capable model as an OCR engine. VLM OCR delivers superior accuracy on low-quality scans, handwriting, Arabic/Farsi scripts, and complex layouts where traditional OCR struggles. Configure via `ocr.backend = "vlm"` with `ocr.vlm_config` in your extraction config or `kreuzberg.toml`.

Structured Extraction -- Extract typed JSON from documents using a schema. Provide a JSON schema and an optional Jinja2 prompt template; the LLM returns conforming structured data. Supports strict mode (OpenAI) with automatic `additionalProperties` sanitization for cross-provider compatibility. Available through the `kreuzberg extract-structured` CLI command, `POST /extract-structured` API endpoint, and `extract_structured` MCP tool. Example schema:
{
  "type": "object",
  "properties": {
    "invoice_number": { "type": "string" },
    "total": { "type": "number" },
    "line_items": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "description": { "type": "string" },
          "amount": { "type": "number" }
        }
      }
    }
  }
}

VLM Embeddings -- Provider-hosted embedding models. Use provider-hosted embedding models (for example, `openai/text-embedding-3-small`, `mistral/mistral-embed`) as an alternative to local ONNX models. Works through the existing `/embed` API endpoint, `embed_text` MCP tool, and `embed` CLI command with `--provider llm`.

Custom Jinja2 Prompts -- Minijinja template engine for LLM prompts. Customize the prompts sent to LLMs with Minijinja templates. Available variables for structured extraction: `{{ content }}`, `{{ schema }}`, `{{ schema_name }}`, `{{ schema_description }}`. For VLM OCR prompts: `{{ language }}`. Override the default prompt per-request or in configuration.
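
To show how the structured-extraction variables fit into a prompt, the sketch below renders a template with a tiny regex-based substitute for a template engine. It handles plain `{{ name }}` placeholders only and is a stand-in, not Minijinja; the template wording and the sample values are hypothetical, and only the four variable names come from the text above.

```python
import json
import re

SCHEMA = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total": {"type": "number"},
    },
}

# Hypothetical template; only the variable names are taken from the documentation.
TEMPLATE = (
    "Extract data matching the schema '{{ schema_name }}' ({{ schema_description }}).\n"
    "Schema:\n{{ schema }}\n\nDocument:\n{{ content }}"
)


def render(template: str, variables: dict[str, str]) -> str:
    """Replace each {{ name }} placeholder; unknown names raise KeyError."""
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", lambda m: variables[m.group(1)], template)


prompt = render(TEMPLATE, {
    "content": "Invoice INV-42, total 99.50",
    "schema": json.dumps(SCHEMA, indent=2),
    "schema_name": "invoice",
    "schema_description": "fields of a supplier invoice",
})
```

A real template engine adds conditionals, loops, and filters; the point here is only which values are substituted where.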

LlmConfig and StructuredExtractionConfig types are exposed in Python, Node.js, and PHP bindings. Five new environment variables (KREUZBERG_LLM_MODEL, KREUZBERG_LLM_API_KEY, KREUZBERG_LLM_BASE_URL, KREUZBERG_VLM_OCR_MODEL, KREUZBERG_VLM_EMBEDDING_MODEL) provide zero-code configuration.

Document Enrichment

Available by v5.0

Named-Entity Recognition -- Detect people, organisations, locations, dates, money, percentages, emails, phones, URLs, and caller-supplied zero-shot labels via gline-rs (ONNX) or any liter-llm provider. Results populate ExtractionResult.entities. See the NER Guide.

Redaction & Anonymisation -- Late-stage post-processor that rewrites content, formatted_content, chunks, entities, summary, translation, and page classifications. Pattern engine covers emails, phones, SSNs, credit cards, IBANs, IP addresses, SWIFT/BIC, postal codes, dates of birth; pair with NER for PERSON / ORGANIZATION / LOCATION. Strategies: mask, hash, token-replace, drop. Caller can supply literal terms and regex patterns. See the Redaction Guide.
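
The four strategies differ only in what replaces a match, which a short sketch makes visible. The two regular expressions are simplified and hypothetical, as are the truncated SHA-256 digest and the `[LABEL_n]` token format; none of this is Kreuzberg's pattern engine.

```python
import hashlib
import re

# Simplified, hypothetical patterns; production-grade detectors are stricter.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "IPV4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def redact(text: str, strategy: str) -> str:
    """Rewrite every pattern match using one of four strategies."""
    counters: dict[str, int] = {}

    def replace(label: str, match: re.Match) -> str:
        value = match.group(0)
        if strategy == "mask":
            return "*" * len(value)  # same length, content hidden
        if strategy == "hash":
            return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]
        if strategy == "token-replace":
            counters[label] = counters.get(label, 0) + 1
            return f"[{label}_{counters[label]}]"
        if strategy == "drop":
            return ""
        raise ValueError(f"unknown strategy: {strategy}")

    for label, pattern in PATTERNS.items():
        text = pattern.sub(lambda m, label=label: replace(label, m), text)
    return text
```

Hashing keeps equal values linkable across a document without revealing them, while token replacement keeps the text readable and records which kind of value was removed.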

Document Summarisation -- Pure-Rust TextRank (extractive, local, deterministic) or any liter-llm provider (abstractive). Result on ExtractionResult.summary. See the Summarisation Guide.

Document Translation -- Translate content, formatted_content, and per-chunk text into a BCP-47 target language with any liter-llm provider. Optional Markdown/HTML preservation. Result on ExtractionResult.translation. See the Translation Guide.

Page Classification -- Per-page LLM classification against caller-supplied labels. Single-label or multi-label. Result on ExtractionResult.page_classifications. See the Page Classification Guide.

VLM Image Captions -- Describe extracted images with any vision-capable liter-llm provider. Result on ExtractedImage.caption. See the Image Captions Guide.

QR-Code Detection -- Pure-Rust rqrr decoder runs over extracted images. Result on ExtractedImage.qr_codes. Ships in wasm-target and android-target. See the QR Codes Guide.

For Search and Indexing

Keyword Extraction -- Extract key phrases using YAKE (unsupervised, language-independent) or RAKE (fast statistical method). Configurable n-gram ranges and language-specific stopword filtering. See the Keyword Extraction Guide.
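
For readers unfamiliar with RAKE, the sketch below implements its core scoring: stopwords and punctuation delimit candidate phrases, and each phrase scores the sum of degree/frequency over its words. The stopword list is a small hypothetical sample, and the function is an illustration rather than Kreuzberg's extractor.

```python
import re
from collections import defaultdict

# Hypothetical, abbreviated stopword list.
STOPWORDS = {"a", "an", "and", "are", "for", "from", "in", "is", "of", "the", "to", "with"}


def rake(text: str, top_n: int = 3) -> list[str]:
    """Rank candidate phrases by the summed degree/frequency score of their words."""
    phrases = []
    # Punctuation ends a candidate phrase, and so does any stopword.
    for fragment in re.split(r"[.,;:!?()\n]", text.lower()):
        current = []
        for word in re.findall(r"[a-z0-9-]+", fragment):
            if word in STOPWORDS:
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)

    frequency = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            frequency[word] += 1
            degree[word] += len(phrase)  # co-occurrences, counting the word itself

    scores = {" ".join(p): sum(degree[w] / frequency[w] for w in p) for p in phrases}
    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores, key=lambda p: (-scores[p], p))[:top_n]
```

Longer phrases tend to score higher because every word inherits the length of the phrases it appears in, which is why real implementations usually cap the n-gram range.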

Language Detection -- Identify 60+ languages with confidence scoring using fast-langdetect. Supports multi-language detection for documents with mixed content.

Metadata Extraction -- Pull document properties (title, author, creation date), page/word/character counts, and format-specific metadata (Excel sheet names, PDF annotations).

For Code

Code Intelligence -- Extract functions, classes, imports, exports, symbols, docstrings, and diagnostics from 306 programming languages via tree-sitter. Results are available in ExtractionResult.code_intelligence as a ProcessResult. Code files produce semantic chunks (function/class-aware) that bypass the text-splitter entirely. Configure content mode with CodeContentMode: chunks (default, semantic TSLP chunks), raw (source as-is), or structure (headings + docstrings only).

For Data Quality

Quality Processing -- Unicode normalization (NFC/NFD/NFKC/NFKD), whitespace and line break standardization, encoding detection, and mojibake correction.
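
A rough standard-library sketch of these steps follows. The function names and the specific whitespace rules are assumptions, and the mojibake repair covers only one common case: UTF-8 bytes that were wrongly decoded as Latin-1.

```python
import re
import unicodedata


def fix_mojibake(text: str) -> str:
    """Undo one common failure: UTF-8 bytes that were decoded as Latin-1."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text  # not this kind of mojibake; leave untouched


def clean(text: str, form: str = "NFC") -> str:
    """Repair, normalize, and standardize whitespace and line breaks."""
    text = fix_mojibake(text)
    text = unicodedata.normalize(form, text)
    text = text.replace("\r\n", "\n").replace("\r", "\n")  # unify line breaks
    text = re.sub(r"[ \t\u00a0]+", " ", text)               # collapse horizontal whitespace
    text = re.sub(r" *\n *", "\n", text)                    # trim around line breaks
    text = re.sub(r"\n{3,}", "\n\n", text)                  # at most one blank line
    return text.strip()
```

NFC composes "e" plus a combining accent into a single code point, while NFKC additionally folds compatibility characters such as the "ﬁ" ligature into plain letters, which usually suits search indexing better.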

Token Reduction -- Reduce token count while preserving meaning through TF-IDF-based extractive summarization. Three modes: light (~15% reduction), moderate (~30%), and aggressive (~50%).

Table Extraction -- Structured table data from PDFs, spreadsheets, and Word documents with cell-level row/column indexing, merged cell support, and Markdown or JSON output.


Layout Detection

Available by v4.5

Detect and classify document regions using ONNX-based deep learning. Layout detection identifies 17 element types (text, tables, figures, headers, code, forms, captions, and more), enabling accurate region-aware extraction and structured table recovery.

RT-DETR v2 -- The layout detection model that identifies document structure with high precision. Automatically selects and configures separate table structure models (TATR, SLANeXT variants, or SLANet-plus) for cell-level analysis within detected table regions.

Table Structure Recognition -- When layout detection identifies a table, a configurable table structure model analyzes rows, columns, headers, and spanning cells for HTML recovery with colspan/rowspan support. Choose from:

  • TATR (30 MB) — General-purpose, fast, default
  • SLANeXT Wired/Wireless/Auto (365–737 MB) — Optimized for bordered/borderless tables with auto-detection
  • SLANet-plus (7.78 MB) — Lightweight, resource-constrained environments

GPU acceleration via ONNX Runtime (CUDA, CoreML, TensorRT) significantly reduces inference time. Models are automatically downloaded and cached on first use.

Availability: Native builds that include ONNX Runtime. It is excluded from wasm-target, android-target, and the curated windows-target aggregate.

For configuration and usage, see the Layout Detection Guide.


Plugin System

The extraction pipeline and query-time APIs are extensible through six plugin categories:

flowchart LR
    A[File Input] --> B[Document Extractor Plugin]
    B --> C[OCR Backend Plugin]
    C --> D[Validator Plugin]
    D --> E[Post-Processor Plugin]
    E --> F[Renderer Plugin]
    F --> G[Output]
    H[Query + Documents] --> I[Reranker Backend Plugin]
    I --> J[Reranked Documents]

| Plugin Type | Purpose | Example |
| --- | --- | --- |
| Document Extractors | Add support for custom file formats or override defaults | Proprietary format parser |
| OCR Backends | Integrate cloud OCR services or custom engines | AWS Textract, Google Vision |
| Reranker Backends | Score query/document pairs for search ranking | Cross-encoder or provider API |
| Validators | Enforce quality standards on extraction results | Minimum word count check |
| Post-Processors | Transform or enrich results after extraction | PII redaction, custom metadata |
| Renderers | Convert document structures into output formats | Custom Markdown or HTML writer |

Plugins are registered programmatically through typed registries. Built-in plugins register at initialization when their Cargo feature is active; runtime configuration selects registered backends and processors.
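
The registry pattern itself is simple, as the following generic sketch suggests. The Registry class, its methods, and the min-word-count validator are hypothetical illustrations of a typed registry, not Kreuzberg's plugin API.

```python
from typing import Callable, Generic, TypeVar

T = TypeVar("T")


class Registry(Generic[T]):
    """Name-to-plugin map for one plugin category."""

    def __init__(self, category: str) -> None:
        self.category = category
        self._plugins: dict[str, T] = {}

    def register(self, name: str, plugin: T) -> None:
        if name in self._plugins:
            raise ValueError(f"{self.category} plugin already registered: {name}")
        self._plugins[name] = plugin

    def get(self, name: str) -> T:
        try:
            return self._plugins[name]
        except KeyError:
            raise LookupError(f"no {self.category} plugin named {name!r}") from None

    def names(self) -> list[str]:
        return sorted(self._plugins)


Validator = Callable[[str], None]  # raises ValueError to reject a result
validators: Registry[Validator] = Registry("validator")


def min_word_count(text: str, minimum: int = 3) -> None:
    """Example validator: reject results with too few words."""
    if len(text.split()) < minimum:
        raise ValueError(f"expected at least {minimum} words")


validators.register("min-word-count", min_word_count)
```

Keeping one registry per category lets runtime configuration refer to plugins by name while the type parameter documents what each registry is expected to hold.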

For the architecture overview, see Plugin System. For implementation guidance, see Creating Plugins.


Deployment Modes

| Mode | When to Use | Details |
| --- | --- | --- |
| Library | Embedding extraction into your application | Import the package in Python, TypeScript, Rust, Go, Java/Kotlin JVM, Kotlin Android, Ruby, C#, PHP, Elixir, R, Dart, Swift, Zig, C, or Wasm |
| CLI | One-off extractions, scripting, CI pipelines | kreuzberg extract document.pdf --format json -- see CLI Usage |
| REST API | Multi-service architectures, language-agnostic access | kreuzberg serve --port 8000 -- see API Server Guide |
| MCP Server | AI agent integration (Claude Desktop, Continue.dev) | kreuzberg mcp -- stdio transport with JSON-RPC 2.0 |
| Docker | Reproducible deployments with all dependencies bundled | ghcr.io/kreuzberg-dev/kreuzberg:latest -- see Docker Guide |

Language Bindings

Polyglot bindings share the Rust core and expose the same generated types where the target platform supports the underlying feature.

Binding Tiers

Full feature parity with async API -- Rust, Python (PyO3), TypeScript/Node.js (NAPI-RS)

Full features, synchronous API -- Go, Ruby, C#, Java, PHP, Elixir

Native FFI surfaces -- C, R, Dart, Swift, Zig, Kotlin Android

TypeScript: Two flavors

  • Native (@kreuzberg/node) — Full speed, complete feature parity (servers, plugins, config file discovery)
  • WASM (@kreuzberg/wasm) — Browser/edge runtime, 60–80% of native speed, no native dependencies required. Excluded features: ORT-dependent inference (paddle-ocr, layout detection, embeddings, reranker, auto-rotate, transcription), liter-llm/VLM features, server modes (api/mcp), CLI binary, and browser filesystem paths. Pure-Rust extraction formats, Tesseract WASM OCR, chunking, keywords, language detection, stopwords, tree-sitter, redaction, summarization, SVG, and QR-code detection are supported.

Choose Native for server-side Node.js; choose WASM for browser or edge deployments.

Rust Feature Flags

Rust builds are modular through Cargo features. The default feature set is tokio-runtime plus simd-utf8; enable format and analysis features explicitly for the surface you need.

| Category | Features |
| --- | --- |
| Format extractors | pdf, excel, office, hwp, hwpx, iwork, email, html, xml, archives, mdx, svg, heic |
| OCR and ML | ocr, ocr-wasm, paddle-ocr, layout-detection, embeddings, reranker, transcription, liter-llm |
| Text analysis | language-detection, chunking, quality, keywords, stopwords, diff, ner, redaction, summarization, translation, classification, captioning, qr-codes |
| Servers | api, mcp, mcp-http, otel |
| Bundles | formats, analysis, services, full, server, cli, wasm-target, android-target, windows-target |

Package Installation

pip install kreuzberg                  # Core + Tesseract + PaddleOCR
pip install kreuzberg[easyocr]         # + EasyOCR
pip install kreuzberg[all]             # Everything

npm install @kreuzberg/node            # Native (Node.js/Bun)
npm install @kreuzberg/wasm            # WASM (browser/edge)

# Cargo.toml
[dependencies]
kreuzberg = { version = "5", features = ["pdf", "ocr", "chunking"] }

gem install kreuzberg                  # Ruby
go get github.com/kreuzberg-dev/kreuzberg/packages/go/v5  # Go
dotnet add package Kreuzberg           # C#

For API details per language, see the API Reference.


Configuration

Four configuration methods, checked in this order:

  1. Programmatic -- Construct ExtractionConfig objects in code (all bindings)
  2. TOML -- kreuzberg.toml
  3. YAML -- kreuzberg.yaml
  4. JSON -- kreuzberg.json

Config files are auto-discovered from the current directory, ~/.config/kreuzberg/, and /etc/kreuzberg/. Environment variables (KREUZBERG_CONFIG_PATH, KREUZBERG_CACHE_DIR, KREUZBERG_OCR_BACKEND, KREUZBERG_OCR_LANGUAGE) override file-based settings.
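
One plausible reading of the discovery rules is sketched below. The precedence is an assumption: this version treats KREUZBERG_CONFIG_PATH as an explicit override, searches directories in the listed order, and prefers TOML over YAML over JSON within a directory. The function name and parameters are hypothetical.

```python
import os
from pathlib import Path
from typing import Optional

FILENAMES = ("kreuzberg.toml", "kreuzberg.yaml", "kreuzberg.json")


def discover_config(cwd: Path, home: Path, etc: Path = Path("/etc/kreuzberg")) -> Optional[Path]:
    """Return the first config file found, or None."""
    explicit = os.environ.get("KREUZBERG_CONFIG_PATH")
    if explicit:
        return Path(explicit)  # assumed: an explicit path skips discovery entirely
    for directory in (cwd, home / ".config" / "kreuzberg", etc):
        for name in FILENAMES:
            candidate = directory / name
            if candidate.is_file():
                return candidate
    return None
```

A typical call would be `discover_config(Path.cwd(), Path.home())`; passing the directories in as arguments keeps the function easy to test against a temporary directory tree.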

For the full configuration schema and examples, see the Configuration Guide.


AI Coding Assistants

Added in v4.2

Kreuzberg ships with an Agent Skill that teaches AI coding assistants the complete API across Python, TypeScript, Rust, and CLI. Install it with:

npx skills add kreuzberg-dev/kreuzberg

Compatible with Claude Code, Codex, Gemini CLI, Cursor, VS Code, Amp, Goose, Roo Code, and any tool supporting the Agent Skills standard. See the AI Coding Assistants Guide.


Next Steps
