Features¶
A map of what Kreuzberg can do. Each section links to the guide or reference page with configuration details and code examples.

Format Support¶
96 file formats handled by native Rust extractors — no LibreOffice or other external tools required.
PDF .pdf
Word .docx .doc
Pages .pages
PowerPoint .pptx .ppt
Keynote .key
OpenDocument .odt
Plain text .txt
Markdown .md
Djot .djot
MDX .mdx
RTF .rtf
reStructuredText .rst
Org .org
Hangul .hwp .hwpx
Excel .xlsx .xls .xlsm .xlsb
Numbers .numbers
OpenDocument .ods
CSV .csv
TSV .tsv
dBASE .dbf
JPEG .jpg .jpeg
PNG .png
GIF .gif
BMP .bmp
TIFF .tiff .tif
WebP .webp
JPEG 2000 .jp2 .jpx .jpm .mj2
JBIG2 .jbig2
PNM .pnm .pbm .pgm .ppm
HEIC .heic .heics
HEIF .heif
AVIF .avif
AVCS .avcs
HEIF / HEIC / AVIF
Pixel decoding for HEIF-family containers requires the heic Cargo
feature (included in full) and the system libheif library at build
and runtime. Native targets only — not available on wasm-target or
android-target. EXIF metadata extraction from HEIC / AVIF works on
every target via the pure-Rust nom-exif integration. See the
installation guide.
MP3 .mp3 .mpga
M4A .m4a
WAV .wav
WebM audio .webm
MP4 audio track .mp4 .mpeg
WebM audio track .webm
Since v5.0
Enable the transcription feature and set a transcription config block to extract Whisper ONNX transcripts from audio files and video audio tracks. See Audio and Video Transcription.
EML .eml
MSG .msg
HTML .html .htm
XHTML .xhtml
XML .xml
SVG .svg
JSON .json
YAML .yaml
TOML .toml
ZIP .zip
TAR .tar .tgz
GZIP .gz
7-Zip .7z
EPUB .epub
BibTeX .bib
RIS .ris
CSL .csl
LaTeX .tex
Typst .typ
JATS .jats
DocBook .docbook
OPML .opml
For the full format matrix with MIME types, extraction methods, and special capabilities, see the Format Support Reference.
Feature Availability¶
Use these labels when matching docs to deployed packages. Labels use major.minor only.
| Version | Feature area |
|---|---|
| v4.0 | HTML metadata extraction and the pdf_oxide PDF provider. |
| v4.3 | LibreOffice-free extraction for legacy .doc and .ppt files. |
| v4.5 | OCR pipeline fallback, layout detection, and document-level OCR. |
| v4.6 | PDF page rendering. |
| v4.8 | LLM/VLM intelligence through liter-llm. |
| v5.0 | Image-index references, SVG/image output normalization, HEIC aggregate formats, list_supported_formats, Whisper audio/video transcription, reranking, NER, redaction, summarization, translation, page classification, image captions, QR-code detection, and the windows-target feature aggregate. |
Extraction Pipeline¶
Every file flows through the same multi-stage pipeline:
flowchart LR
A[Input File] --> B[MIME Detection]
B --> C[Format Extractor]
C --> D{OCR Needed?}
D -->|Yes| E[OCR Engine]
D -->|No| F[Post-Processing]
E --> F
F --> G[ExtractionResult]
- MIME detection -- Kreuzberg identifies the file type from magic bytes and extension, then selects the matching native extractor from the registry.
- Format extraction -- The extractor pulls text, tables, metadata, and optionally images from the file. PDF extraction uses pdf_oxide (pure Rust); Office formats use native XML or OLE/CFB parsers; images pass directly to OCR.
- OCR -- When the extractor finds no text layer (or force_ocr is set), the file is routed to the configured OCR backend. The OCR result replaces or supplements the extracted text.
- Post-processing -- Validators, quality processing, chunking, embeddings, keyword extraction, and any registered post-processor plugins run in sequence.
- Caching -- If caching is enabled, results are stored keyed by a content hash so repeated extractions skip the entire pipeline.
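The detection and caching stages can be illustrated with a small standard-library sketch. It is not Kreuzberg's implementation: the magic-byte table, `detect_mime`, and `cached_extract` are hypothetical names chosen for illustration, the table covers only three formats, and SHA-256 is an assumed choice of content hash.

```python
import hashlib
import mimetypes
from typing import Callable, Dict

# Hypothetical magic-byte table; a real registry covers far more formats.
MAGIC_BYTES = {
    b"%PDF-": "application/pdf",
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"PK\x03\x04": "application/zip",
}


def detect_mime(data: bytes, filename: str) -> str:
    """Prefer magic bytes, then fall back to the file extension."""
    for magic, mime in MAGIC_BYTES.items():
        if data.startswith(magic):
            return mime
    guessed, _ = mimetypes.guess_type(filename)
    return guessed or "application/octet-stream"


# In-memory cache for the sketch; a real cache would persist to disk.
_cache: Dict[str, str] = {}


def cached_extract(
    data: bytes,
    filename: str,
    extract: Callable[[bytes, str], str],
) -> str:
    """Key results by a content hash so identical bytes skip the pipeline."""
    key = hashlib.sha256(data).hexdigest()
    if key not in _cache:
        _cache[key] = extract(data, detect_mime(data, filename))
    return _cache[key]
```

Because the cache key depends only on the bytes, renaming a file does not invalidate its cached result in this sketch.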
For a deep dive into each stage, see Extraction Pipeline.
Output Formats¶
Kreuzberg supports five output formats: Plain text, Markdown, Djot, HTML, and Structured (JSON). The HTML format includes a styled renderer with semantic kb-* CSS classes, five built-in themes, and CSS custom properties for full customization. See HTML Output for details.
OCR Engines¶
Three OCR backends, usable individually or chained into a quality-driven fallback pipeline.
Backend Comparison¶
| | Tesseract | PaddleOCR | EasyOCR |
|---|---|---|---|
| Languages | 100+ | 80+ (11 script families) | 80+ |
| Best for | General purpose, broad language coverage | CJK, complex scripts, high accuracy | GPU-accelerated workloads |
| Platform | Native and WASM targets | Native ONNX Runtime builds | Python only |
| Install | System package (tesseract-ocr) | Cargo feature paddle-ocr (bundled in Python package by v4.8) | pip install kreuzberg[easyocr] |
| Runtime | C library (Tesseract 4.0+) | ONNX Runtime (models downloaded on first use) | PyTorch (optional CUDA) |
| Python version | Any | Any | Any |
Multi-Backend Pipeline¶
Available by v4.5
When the paddle-ocr feature is enabled, Kreuzberg automatically constructs a fallback pipeline: Tesseract runs first, and if the output falls below configurable quality thresholds (16 tunable parameters), PaddleOCR takes over. You can also define a custom ordering across all three backends.
The pipeline supports auto-rotate for page orientation detection (0/90/180/270 degrees) and per-stage language and backend-specific settings.
flowchart TD
A[Image / Scanned Page] --> B[Primary Backend]
B --> C{Quality Above Threshold?}
C -->|Yes| D[Return Result]
C -->|No| E[Fallback Backend]
E --> F{Quality Above Threshold?}
F -->|Yes| D
F -->|No| G[Return Best Result]
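The control flow in the diagram can be sketched as follows. This is an illustration under stated assumptions, not Kreuzberg's code: `OcrResult`, `run_fallback_pipeline`, and the single `quality` score are hypothetical, whereas the real pipeline evaluates multiple tunable thresholds. The stand-in backends return fixed scores purely for demonstration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class OcrResult:
    backend: str
    text: str
    quality: float  # hypothetical score in [0, 1]


def run_fallback_pipeline(
    image: bytes,
    backends: List[Callable[[bytes], OcrResult]],
    threshold: float = 0.7,
) -> OcrResult:
    """Try backends in order; return the first result that meets the
    threshold, otherwise the best-scoring result seen."""
    if not backends:
        raise ValueError("at least one OCR backend is required")
    best: Optional[OcrResult] = None
    for backend in backends:
        result = backend(image)
        if result.quality >= threshold:
            return result
        if best is None or result.quality > best.quality:
            best = result
    return best


# Stand-in backends with fixed scores, for demonstration only.
def primary(image: bytes) -> OcrResult:
    return OcrResult("primary", "t3xt", 0.4)


def fallback(image: bytes) -> OcrResult:
    return OcrResult("fallback", "text", 0.9)
```

Returning the best result when no backend meets the threshold mirrors the "Return Best Result" branch, so a caller always receives some output.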
Document-Level Optimization¶
Available by v4.5
Some OCR backends (including EasyOCR) now support document-level processing. When a file path is provided, the extractor can bypass the expensive page-by-page rendering stage and delegate the entire document to the OCR engine. This significantly reduces memory overhead and improves throughput for large PDFs and multi-page images.
For backend configuration, language selection, and PSM/OEM modes, see the OCR Guide.
Candle GLM-OCR¶
Added in v5.0.0-rc.18
Pure-Rust VLM OCR via the candle-glm-ocr feature. Wraps the zai-org/GLM-OCR 0.9B-param vision-language model running natively through the candle transformer framework. No ONNX Runtime dependency.
Feature flag: candle-glm-ocr
Implies: candle-ocr, kreuzberg-candle-ocr/glm-ocr, layout-detection
Deployment:
- CPU & Metal (macOS) — Full support
- CUDA (Linux/Windows with NVIDIA GPU) — Full support
- WASM — Excluded (candle not available on WASM)
- Android x86_64 emulator — Excluded (no prebuilt candle targets)
Model & performance:
- Model size: ~3 GB on first download; cached at ~/.cache/huggingface/
- Default layout mode: paired -- PP-DocLayout-V3 detects regions, per-region task-specific OCR (ocr/table/formula/chart/caption), outputs merged into reading-order markdown
- Alternative mode: whole_page -- Single OCR pass over entire page with optional task override
- Metal dtype: F32 (BF16 matmul unavailable in candle 0.10)
Configure via --ocr-backend candle-glm-ocr or ocr.backend = "candle-glm-ocr" in config. Set layout mode and device via backend_options: {"layout_mode":"paired"}, {"layout_mode":"whole_page"}, {"device":"metal"}, {"device":"cuda"}.
Candle Hunyuan-OCR¶
Added in v5.0.0-rc.18
Pure-Rust VLM OCR via the candle-hunyuan-ocr feature. Tencent Hunyuan-OCR vision-language model with document layout understanding and multilingual support. No ONNX Runtime dependency.
Feature flag: candle-hunyuan-ocr
Implies: candle-ocr, kreuzberg-candle-ocr/hunyuan-ocr
Deployment:
- CPU & Metal (macOS) — Full support
- CUDA (Linux/Windows with NVIDIA GPU) — Full support
- WASM — Excluded (candle not available on WASM)
- Android x86_64 emulator — Excluded (no prebuilt candle targets)
Model & performance:
- Model size: ~2 GB on first download; cached at ~/.cache/huggingface/
- Detects layout and text regions, outputs merged into reading-order markdown
- CPU dtype: F32; CUDA dtype: F16
Configure via --ocr-backend candle-hunyuan-ocr or ocr.backend = "candle-hunyuan-ocr" in config. Set device via backend_options: {"device":"metal"}, {"device":"cuda"}.
Attribution: Model vendored from jhqxxx/aha (Apache-2.0). See ATTRIBUTIONS.md.
Candle DeepSeek-OCR¶
Added in v5.0.0-rc.18
Pure-Rust VLM OCR via the candle-deepseek-ocr feature. DeepSeek-OCR vision-language model combining SAM, CLIP, Qwen2, and DeepSeek-V2 MoE architecture. Advanced document understanding with multilingual support. No ONNX Runtime dependency.
Feature flag: candle-deepseek-ocr
Implies: candle-ocr, kreuzberg-candle-ocr/deepseek-ocr
Deployment:
- CPU & Metal (macOS) — Full support
- CUDA (Linux/Windows with NVIDIA GPU) — Full support
- WASM — Excluded (candle not available on WASM)
- Android x86_64 emulator — Excluded (no prebuilt candle targets)
Model & performance:
- Model size: ~3 GB+ on first download; cached at ~/.cache/huggingface/
- Fine-grained layout detection, table region recognition, text extraction with confidence scores
- CPU dtype: F32; CUDA dtype: F16
Configure via --ocr-backend candle-deepseek-ocr or ocr.backend = "candle-deepseek-ocr" in config. Set device via backend_options: {"device":"metal"}, {"device":"cuda"}.
Attribution: Model vendored from jhqxxx/aha (Apache-2.0). See ATTRIBUTIONS.md.
Candle PaddleOCR-VL 1.5¶
Added in v5.0.0-rc.18
Pure-Rust VLM OCR via the candle-paddleocr-vl-15 feature. PaddleOCR-VL 1.5 vision-language model with SigLIP+Ernie integration. Fast multilingual document OCR with strong CJK support. No ONNX Runtime dependency.
Feature flag: candle-paddleocr-vl-15
Implies: candle-ocr, kreuzberg-candle-ocr/paddleocr-vl-15
Deployment:
- CPU & Metal (macOS) — Full support
- CUDA (Linux/Windows with NVIDIA GPU) — Full support
- WASM — Excluded (candle not available on WASM)
- Android x86_64 emulator — Excluded (no prebuilt candle targets)
Model & performance:
- Model size: ~1 GB on first download; cached at ~/.cache/huggingface/
- Lightweight architecture optimized for speed and accuracy on scanned documents
- CPU dtype: F32; CUDA dtype: F16
Configure via --ocr-backend candle-paddleocr-vl-15 or ocr.backend = "candle-paddleocr-vl-15" in config. Set device via backend_options: {"device":"metal"}, {"device":"cuda"}.
Attribution: Model vendored from jhqxxx/aha (Apache-2.0). See ATTRIBUTIONS.md.
Candle VLM-OCR Umbrella¶
The candle-vlm-ocr feature aggregates all Candle VLM-OCR backends: candle-hunyuan-ocr, candle-deepseek-ocr, candle-paddleocr-vl-15, candle-glm-ocr, and candle-trocr. Use this aggregate to enable all pure-Rust vision-language OCR options in a single feature flag.
Processing Features¶
Optional post-extraction steps, each configured independently through ExtractionConfig.
For RAG Pipelines¶
Content Chunking -- Split extracted text into sized chunks for LLM consumption. Strategies include recursive (paragraph/sentence/word splitting), semantic, and Markdown-aware chunking that preserves heading hierarchy. Chunks can be sized by character count or by token count using any HuggingFace tokenizer.
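As a rough illustration of the recursive strategy, the sketch below tries paragraph, then sentence, then word boundaries, and only hard-cuts text when no separator is left. The function name `recursive_split` and the separator list are assumptions made for this example; it sizes chunks by character count only and does not reproduce Kreuzberg's chunker.

```python
from typing import List

# Separators tried in order: paragraph, sentence, word.
SEPARATORS = ["\n\n", ". ", " "]


def recursive_split(text: str, max_chars: int, level: int = 0) -> List[str]:
    """Split text into chunks of at most max_chars, preferring coarse boundaries.

    Separators that fall on a chunk boundary are dropped in this sketch.
    """
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if level >= len(SEPARATORS):
        # No separators left: fall back to a hard cut.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

    sep = SEPARATORS[level]
    chunks: List[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= max_chars:
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= max_chars:
            current = piece
        else:
            # The piece alone is too long: retry with a finer separator.
            chunks.extend(recursive_split(piece, max_chars, level + 1))
    if current:
        chunks.append(current)
    return chunks
```

Token-based sizing would replace `len(...)` with a tokenizer's count; that substitution is left out to keep the sketch dependency-free.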
Embeddings -- Generate vector embeddings locally using FastEmbed. Choose from preset models ("fast", "balanced", "quality") or any FastEmbed-compatible model. Embeddings are generated in-process with no external API calls.
Page Tracking -- Extract per-page content with byte-accurate offsets for O(1) page lookups. Chunks are automatically mapped to their source pages, enabling precise citations in retrieval systems. Supported for PDF (byte-accurate), PPTX (slide boundaries), and DOCX (best-effort page breaks). See Extraction Basics for usage.
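To make the offset idea concrete, here is a minimal sketch that records where each page starts in the concatenated UTF-8 text and maps a chunk's byte span back to page numbers. The helper names are hypothetical. Looking up a page's start by index is a constant-time list access; mapping a span to pages uses binary search here, which is a choice of this sketch and not a statement about Kreuzberg's internals.

```python
from bisect import bisect_right
from typing import List, Tuple


def page_offsets(pages: List[str]) -> List[int]:
    """Return the starting byte offset of each page in the concatenated UTF-8 text."""
    offsets: List[int] = []
    position = 0
    for page in pages:
        offsets.append(position)
        position += len(page.encode("utf-8"))
    return offsets


def pages_for_span(offsets: List[int], start: int, end: int) -> Tuple[int, int]:
    """Map a byte span [start, end) to 1-based (first_page, last_page)."""
    first = bisect_right(offsets, start)
    last = bisect_right(offsets, max(start, end - 1))
    return first, last
```

Offsets are counted in bytes, not characters, so a page containing multi-byte characters shifts every later page accordingly.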
PDF Hierarchy Detection -- Detect document structure from PDFs using K-means clustering on block characteristics (font size, weight, indentation, position). Blocks are assigned to semantic levels (title, section, subsection, paragraph) without relying on explicit heading tags. See the Output Formats Guide.
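The clustering idea can be shown on a single feature. The sketch below runs a one-dimensional K-means over font sizes and ranks the clusters from largest to smallest. It is deliberately simplified and should be read as an assumption-laden illustration: it ignores weight, indentation, and position, uses only three levels, and the names `kmeans_1d` and `assign_levels` are hypothetical.

```python
from typing import List

LEVELS = ["title", "section", "paragraph"]


def kmeans_1d(values: List[float], k: int, iterations: int = 20) -> List[float]:
    """Cluster scalar values with Lloyd's algorithm; returns sorted centroids."""
    distinct = sorted(set(values))
    if not 1 <= k <= len(distinct):
        raise ValueError("k must be between 1 and the number of distinct values")
    # Deterministic initialisation: spread centroids across the sorted values.
    step = max(k - 1, 1)
    centroids = [distinct[i * (len(distinct) - 1) // step] for i in range(k)]
    for _ in range(iterations):
        groups: List[List[float]] = [[] for _ in range(k)]
        for value in values:
            nearest = min(range(k), key=lambda i: abs(value - centroids[i]))
            groups[nearest].append(value)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [
            sum(group) / len(group) if group else centroids[i]
            for i, group in enumerate(groups)
        ]
    return sorted(centroids)


def assign_levels(font_sizes: List[float]) -> List[str]:
    """Label each block: the largest-font cluster is the title level."""
    ranked = sorted(kmeans_1d(font_sizes, k=len(LEVELS)), reverse=True)
    labels = []
    for size in font_sizes:
        nearest = min(range(len(ranked)), key=lambda i: abs(size - ranked[i]))
        labels.append(LEVELS[nearest])
    return labels
```

Because levels come from relative cluster rank, the same code handles documents whose body text is 9 pt or 12 pt without fixed size cut-offs.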
PDF Page Rendering (v4.6) -- Render individual PDF pages as PNG images for thumbnails, vision model input, or custom processing pipelines. A memory-efficient iterator renders one page at a time, at a configurable DPI (default 150). Available across all language bindings. See Extraction Guide.
LLM-Powered Intelligence¶
Available by v4.8
Kreuzberg integrates with 143 LLM providers including local inference (Ollama, LM Studio, vLLM, llama.cpp) via liter-llm to unlock three new capabilities that complement the local extraction pipeline.
VLM OCR -- Vision language models as an OCR backend
Use OpenAI GPT-4o, Anthropic Claude, Google Gemini, or any vision-capable model as an OCR engine. VLM OCR delivers superior accuracy on low-quality scans, handwriting, Arabic/Farsi scripts, and complex layouts where traditional OCR struggles. Configure via `ocr.backend = "vlm"` with `ocr.vlm_config` in your extraction config or `kreuzberg.toml`.
Structured Extraction -- Extract typed JSON from documents using a schema
Provide a JSON schema and an optional Jinja2 prompt template; the LLM returns conforming structured data. Supports strict mode (OpenAI) with automatic `additionalProperties` sanitization for cross-provider compatibility. Available through the `kreuzberg extract-structured` CLI command, `POST /extract-structured` API endpoint, and `extract_structured` MCP tool.
VLM Embeddings -- Provider-hosted embedding models
Use provider-hosted embedding models (for example, `openai/text-embedding-3-small`, `mistral/mistral-embed`) as an alternative to local ONNX models. Works through the existing `/embed` API endpoint, `embed_text` MCP tool, and `embed` CLI command with `--provider llm`.
Custom Jinja2 Prompts -- Minijinja template engine for LLM prompts
Customize the prompts sent to LLMs with Minijinja templates. Available variables for structured extraction: `{{ content }}`, `{{ schema }}`, `{{ schema_name }}`, `{{ schema_description }}`. For VLM OCR prompts: `{{ language }}`. Override the default prompt per-request or in configuration.
LlmConfig and StructuredExtractionConfig types are exposed in Python, Node.js, and PHP bindings. Five new environment variables (KREUZBERG_LLM_MODEL, KREUZBERG_LLM_API_KEY, KREUZBERG_LLM_BASE_URL, KREUZBERG_VLM_OCR_MODEL, KREUZBERG_VLM_EMBEDDING_MODEL) provide zero-code configuration.
Document Enrichment¶
Available by v5.0
Named-Entity Recognition -- Detect people, organisations, locations, dates, money, percentages, emails, phones, URLs, and caller-supplied zero-shot labels via gline-rs (ONNX) or any liter-llm provider. Results populate ExtractionResult.entities. See the NER Guide.
Redaction & Anonymisation -- Late-stage post-processor that rewrites content, formatted_content, chunks, entities, summary, translation, and page classifications. Pattern engine covers emails, phones, SSNs, credit cards, IBANs, IP addresses, SWIFT/BIC, postal codes, dates of birth; pair with NER for PERSON / ORGANIZATION / LOCATION. Strategies: mask, hash, token-replace, drop. Caller can supply literal terms and regex patterns. See the Redaction Guide.
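A pattern-based redactor with the four strategies named above might look roughly like this. The sketch is illustrative only: the two regular expressions are simplified, `redact` and `PATTERNS` are hypothetical names, and truncating a SHA-256 digest to 12 characters is an arbitrary choice made here, not Kreuzberg's hashing scheme.

```python
import hashlib
import re

# Simplified patterns for illustration; production patterns need to be stricter.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "IPV4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}

STRATEGIES = {"mask", "hash", "token", "drop"}


def redact(text: str, strategy: str = "mask") -> str:
    """Rewrite every pattern match using the chosen strategy."""
    if strategy not in STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy}")

    def make_replacer(label: str):
        def replace(match) -> str:
            value = match.group(0)
            if strategy == "mask":
                return "*" * len(value)  # preserve length, hide content
            if strategy == "hash":
                digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
                return digest[:12]  # stable pseudonym for the same value
            if strategy == "token":
                return f"[{label}]"  # replace with a category token
            return ""  # "drop": remove the match entirely

        return replace

    for label, pattern in PATTERNS.items():
        text = pattern.sub(make_replacer(label), text)
    return text
```

The hash strategy keeps equal values equal after redaction, which helps when records still need to be joined on the redacted field.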
Document Summarisation -- Pure-Rust TextRank (extractive, local, deterministic) or any liter-llm provider (abstractive). Result on ExtractionResult.summary. See the Summarisation Guide.
Document Translation -- Translate content, formatted_content, and per-chunk text into a BCP-47 target language with any liter-llm provider. Optional Markdown/HTML preservation. Result on ExtractionResult.translation. See the Translation Guide.
Page Classification -- Per-page LLM classification against caller-supplied labels. Single-label or multi-label. Result on ExtractionResult.page_classifications. See the Page Classification Guide.
VLM Image Captions -- Describe extracted images with any vision-capable liter-llm provider. Result on ExtractedImage.caption. See the Image Captions Guide.
QR-Code Detection -- Pure-Rust rqrr decoder runs over extracted images. Result on ExtractedImage.qr_codes. Ships in wasm-target and android-target. See the QR Codes Guide.
For Search and Indexing¶
Keyword Extraction -- Extract key phrases using YAKE (unsupervised, language-independent) or RAKE (fast statistical method). Configurable n-gram ranges and language-specific stopword filtering. See the Keyword Extraction Guide.
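For intuition about the statistical approach, the following is a compact sketch of the classic RAKE scoring scheme: candidate phrases are the runs of words between stopwords and punctuation, and each phrase scores the sum of its words' degree-to-frequency ratios. The stopword list is a tiny stand-in and the function name `rake` is hypothetical; this is not the implementation Kreuzberg ships.

```python
import re
from collections import defaultdict
from typing import Dict, List, Tuple

# Tiny stopword list for illustration; real lists are language-specific and larger.
STOPWORDS = {"a", "an", "and", "for", "from", "in", "is", "of", "the", "to", "with"}


def rake(text: str, top_n: int = 3) -> List[Tuple[str, float]]:
    """Score candidate phrases by the summed degree/frequency ratio of their words."""
    phrases: List[List[str]] = []
    # Punctuation always ends a candidate phrase.
    for fragment in re.split(r"[.,;:!?()\n]+", text.lower()):
        current: List[str] = []
        for word in re.findall(r"[a-z0-9'-]+", fragment):
            if word in STOPWORDS:
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)

    frequency: Dict[str, int] = defaultdict(int)
    degree: Dict[str, int] = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            frequency[word] += 1
            degree[word] += len(phrase)  # co-occurrences, including the word itself

    scores = {
        " ".join(phrase): sum(degree[w] / frequency[w] for w in phrase)
        for phrase in phrases
    }
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:top_n]
```

Longer phrases made of rarely repeated words score highest, which is why the stopword list matters so much for output quality.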
Language Detection -- Identify 60+ languages with confidence scoring using fast-langdetect. Supports multi-language detection for documents with mixed content.
Metadata Extraction -- Pull document properties (title, author, creation date), page/word/character counts, and format-specific metadata (Excel sheet names, PDF annotations).
For Code¶
Code Intelligence -- Extract functions, classes, imports, exports, symbols, docstrings, and diagnostics from 306 programming languages via tree-sitter. Results are available in ExtractionResult.code_intelligence as a ProcessResult. Code files produce semantic chunks (function/class-aware) that bypass the text-splitter entirely. Configure content mode with CodeContentMode: chunks (default, semantic TSLP chunks), raw (source as-is), or structure (headings + docstrings only).
For Data Quality¶
Quality Processing -- Unicode normalization (NFC/NFD/NFKC/NFKD), whitespace and line break standardization, encoding detection, and mojibake correction.
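The steps listed here map closely onto Python's standard library, so a small sketch can show what each one does. The function names are hypothetical, and the mojibake repair handles only the common case of UTF-8 bytes mis-decoded as Latin-1, which is an assumption of this example and not a description of Kreuzberg's correction logic.

```python
import re
import unicodedata


def normalize_text(text: str, form: str = "NFC") -> str:
    """Apply Unicode normalization, then standardize line breaks and whitespace."""
    text = unicodedata.normalize(form, text)
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Collapse runs of spaces and tabs, then trim trailing spaces on each line.
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r" *\n", "\n", text)
    # Allow at most one blank line between paragraphs.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


def fix_mojibake(text: str) -> str:
    """Undo the common case of UTF-8 bytes mis-decoded as Latin-1."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text  # not this kind of mojibake; leave unchanged
```

NFC and NFD preserve characters exactly, while NFKC and NFKD also fold compatibility forms such as ligatures, so the choice of form affects downstream search and matching.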
Token Reduction -- Reduce token count while preserving meaning through TF-IDF-based extractive summarization. Three modes: light (~15% reduction), moderate (~30%), and aggressive (~50%).
Table Extraction -- Structured table data from PDFs, spreadsheets, and Word documents with cell-level row/column indexing, merged cell support, and Markdown or JSON output.
Layout Detection¶
Available by v4.5
Detect and classify document regions using ONNX-based deep learning. Layout detection identifies 17 element types (text, tables, figures, headers, code, forms, captions, and more), enabling accurate region-aware extraction and structured table recovery.
RT-DETR v2 -- The layout detection model that identifies document structure with high precision. Automatically selects and configures separate table structure models (TATR, SLANeXT variants, or SLANet-plus) for cell-level analysis within detected table regions.
Table Structure Recognition -- When layout detection identifies a table, a configurable table structure model analyzes rows, columns, headers, and spanning cells for HTML recovery with colspan/rowspan support. Choose from:
- TATR (30 MB) — General-purpose, fast, default
- SLANeXT Wired/Wireless/Auto (365–737 MB) — Optimized for bordered/borderless tables with auto-detection
- SLANet-plus (7.78 MB) — Lightweight, resource-constrained environments
GPU acceleration via ONNX Runtime (CUDA, CoreML, TensorRT) significantly reduces inference time. Models are automatically downloaded and cached on first use.
Availability: Native builds that include ONNX Runtime. It is excluded from wasm-target, android-target, and the curated windows-target aggregate.
For configuration and usage, see the Layout Detection Guide.
Plugin System¶
The extraction pipeline and query-time APIs are extensible through six plugin categories:
flowchart LR
A[File Input] --> B[Document Extractor Plugin]
B --> C[OCR Backend Plugin]
C --> D[Validator Plugin]
D --> E[Post-Processor Plugin]
E --> F[Renderer Plugin]
F --> G[Output]
H[Query + Documents] --> I[Reranker Backend Plugin]
I --> J[Reranked Documents]
| Plugin Type | Purpose | Example |
|---|---|---|
| Document Extractors | Add support for custom file formats or override defaults | Proprietary format parser |
| OCR Backends | Integrate cloud OCR services or custom engines | AWS Textract, Google Vision |
| Reranker Backends | Score query/document pairs for search ranking | Cross-encoder or provider API |
| Validators | Enforce quality standards on extraction results | Minimum word count check |
| Post-Processors | Transform or enrich results after extraction | PII redaction, custom metadata |
| Renderers | Convert document structures into output formats | Custom Markdown or HTML writer |
Plugins are registered programmatically through typed registries. Built-in plugins register at initialization when their Cargo feature is active; runtime configuration selects registered backends and processors.
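The registry pattern can be sketched in a few lines. Everything below is hypothetical and written for illustration: the `Validator` protocol, `MinWordCount`, and `ValidatorRegistry` are not Kreuzberg's types, and the real plugin traits carry more methods than shown. The example mirrors the "minimum word count check" from the table.

```python
from typing import Dict, List, Protocol


class Validator(Protocol):
    name: str

    def validate(self, text: str) -> None:
        """Raise ValueError when the extraction result is unacceptable."""


class MinWordCount:
    """Example validator: reject results with too few words."""

    name = "min-word-count"

    def __init__(self, minimum: int) -> None:
        self.minimum = minimum

    def validate(self, text: str) -> None:
        count = len(text.split())
        if count < self.minimum:
            raise ValueError(f"expected at least {self.minimum} words, got {count}")


class ValidatorRegistry:
    """Hypothetical typed registry: one instance per plugin category."""

    def __init__(self) -> None:
        self._plugins: Dict[str, Validator] = {}

    def register(self, plugin: Validator) -> None:
        if plugin.name in self._plugins:
            raise KeyError(f"validator already registered: {plugin.name}")
        self._plugins[plugin.name] = plugin

    def run_all(self, text: str) -> List[str]:
        """Run validators in registration order and collect failure messages."""
        failures = []
        for plugin in self._plugins.values():
            try:
                plugin.validate(text)
            except ValueError as error:
                failures.append(f"{plugin.name}: {error}")
        return failures
```

Keeping one registry per category means a type checker can reject, for example, an OCR backend passed where a validator is expected.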
For the architecture overview, see Plugin System. For implementation guidance, see Creating Plugins.
Deployment Modes¶
| Mode | When to Use | Details |
|---|---|---|
| Library | Embedding extraction into your application | Import the package in Python, TypeScript, Rust, Go, Java/Kotlin JVM, Kotlin Android, Ruby, C#, PHP, Elixir, R, Dart, Swift, Zig, C, or Wasm |
| CLI | One-off extractions, scripting, CI pipelines | kreuzberg extract document.pdf --format json -- see CLI Usage |
| REST API | Multi-service architectures, language-agnostic access | kreuzberg serve --port 8000 -- see API Server Guide |
| MCP Server | AI agent integration (Claude Desktop, Continue.dev) | kreuzberg mcp -- stdio transport with JSON-RPC 2.0 |
| Docker | Reproducible deployments with all dependencies bundled | ghcr.io/kreuzberg-dev/kreuzberg:latest -- see Docker Guide |
Language Bindings¶
Polyglot bindings share the Rust core and expose the same generated types where the target platform supports the underlying feature.
Binding Tiers¶
Full feature parity with async API -- Rust, Python (PyO3), TypeScript/Node.js (NAPI-RS)
Full features, synchronous API -- Go, Ruby, C#, Java, PHP, Elixir
Native FFI surfaces -- C, R, Dart, Swift, Zig, Kotlin Android
TypeScript: Two flavors
- Native (@kreuzberg/node) -- Full speed, complete feature parity (servers, plugins, config file discovery)
- WASM (@kreuzberg/wasm) -- Browser/edge runtime, 60–80% of native speed, no native dependencies required. Excluded features: ORT-dependent inference (paddle-ocr, layout detection, embeddings, reranker, auto-rotate, transcription), liter-llm/VLM features, server modes (api/mcp), CLI binary, and browser filesystem paths. Pure-Rust extraction formats, Tesseract WASM OCR, chunking, keywords, language detection, stopwords, tree-sitter, redaction, summarization, SVG, and QR-code detection are supported.
Choose Native for server-side Node.js; choose WASM for browser or edge deployments.
Rust Feature Flags¶
Rust builds are modular through Cargo features. The default feature set is tokio-runtime plus simd-utf8; enable format and analysis features explicitly for the surface you need.
| Category | Features |
|---|---|
| Format extractors | pdf, excel, office, hwp, hwpx, iwork, email, html, xml, archives, mdx, svg, heic |
| OCR and ML | ocr, ocr-wasm, paddle-ocr, layout-detection, embeddings, reranker, transcription, liter-llm |
| Text analysis | language-detection, chunking, quality, keywords, stopwords, diff, ner, redaction, summarization, translation, classification, captioning, qr-codes |
| Servers | api, mcp, mcp-http, otel |
| Bundles | formats, analysis, services, full, server, cli, wasm-target, android-target, windows-target |
Package Installation¶
For API details per language, see the API Reference.
Configuration¶
Four configuration methods, checked in this order:
- Programmatic -- Construct ExtractionConfig objects in code (all bindings)
- TOML -- kreuzberg.toml
- YAML -- kreuzberg.yaml
- JSON -- kreuzberg.json
Config files are auto-discovered from the current directory, ~/.config/kreuzberg/, and /etc/kreuzberg/. Environment variables (KREUZBERG_CONFIG_PATH, KREUZBERG_CACHE_DIR, KREUZBERG_OCR_BACKEND, KREUZBERG_OCR_LANGUAGE) override file-based settings.
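One way to picture the discovery order is the sketch below. It rests on several assumptions that the text above does not confirm: that directories are searched before file types (so any config in the current directory beats one in the home directory), that KREUZBERG_CONFIG_PATH short-circuits discovery, and that the first match wins. The function `discover_config` is hypothetical.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

# File names in the order the list above gives them.
CONFIG_NAMES = ["kreuzberg.toml", "kreuzberg.yaml", "kreuzberg.json"]


def discover_config(
    cwd: Path,
    home: Path,
    system_dir: Path = Path("/etc/kreuzberg"),
    env: Optional[Mapping[str, str]] = None,
) -> Optional[Path]:
    """Return the first config file found, or None if there is none."""
    env = os.environ if env is None else env
    # Assumption: an explicit KREUZBERG_CONFIG_PATH skips discovery entirely.
    explicit = env.get("KREUZBERG_CONFIG_PATH")
    if explicit:
        return Path(explicit)
    search_dirs = [cwd, home / ".config" / "kreuzberg", system_dir]
    for directory in search_dirs:
        for name in CONFIG_NAMES:
            candidate = directory / name
            if candidate.is_file():
                return candidate
    return None
```

Passing the directories and environment as arguments keeps the function testable without touching the real home directory or `/etc`.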
For the full configuration schema and examples, see the Configuration Guide.
AI Coding Assistants¶
Added in v4.2
Kreuzberg ships with an Agent Skill that teaches AI coding assistants the complete API across Python, TypeScript, Rust, and CLI. Install it with:
Compatible with Claude Code, Codex, Gemini CLI, Cursor, VS Code, Amp, Goose, Roo Code, and any tool supporting the Agent Skills standard. See the AI Coding Assistants Guide.
Next Steps¶
- Installation -- Install Kreuzberg for your language
- Quick Start -- Extract your first document in 5 minutes
- Architecture -- Understand the Rust core and binding layers
- Development Workflow -- Performance benchmarks and optimization guidance