OCR (Optical Character Recognition)¶
Extract text from images and scanned PDFs. Kreuzberg automatically determines when OCR is needed — images always require it, scanned PDFs trigger it per-page, and hybrid PDFs only OCR the pages that lack a text layer. Set force_ocr=True to OCR all pages regardless.
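The routing rules above can be read as a small decision function. The sketch below is an illustrative model only, not Kreuzberg's internal implementation; the `Page` type and the `pages_to_ocr` helper are hypothetical names introduced for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    """Hypothetical page descriptor used only for this sketch."""
    number: int
    is_image: bool        # True for standalone image inputs (PNG, JPEG, ...)
    has_text_layer: bool  # True when a PDF page carries extractable text


def pages_to_ocr(pages: list[Page], force_ocr: bool = False) -> list[int]:
    """Return the page numbers that would be sent to OCR under the rules above."""
    selected = []
    for page in pages:
        # force_ocr overrides everything; images always need OCR;
        # PDF pages need OCR only when they lack a text layer.
        if force_ocr or page.is_image or not page.has_text_layer:
            selected.append(page.number)
    return selected


# A hybrid PDF: page 2 is a scan without a text layer.
hybrid = [
    Page(1, is_image=False, has_text_layer=True),
    Page(2, is_image=False, has_text_layer=False),
    Page(3, is_image=False, has_text_layer=True),
]
print(pages_to_ocr(hybrid))                  # [2]
print(pages_to_ocr(hybrid, force_ocr=True))  # [1, 2, 3]
```

With `force_ocr` left at its default, only the page without a text layer is selected, which is why hybrid PDFs are usually much cheaper to process than fully scanned ones.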
Backend Comparison¶
Nine OCR backends are available. Pick one based on platform, accuracy needs, and language coverage.
| | Tesseract | PaddleOCR | EasyOCR | Candle GLM-OCR v5.0 | Candle TrOCR v5.0 | Candle Hunyuan-OCR v5.0 | Candle DeepSeek-OCR v5.0 | Candle PaddleOCR-VL v5.0 | VLM |
|---|---|---|---|---|---|---|---|---|---|
| Speed | Fast | Very fast | Moderate | Moderate | Moderate | Moderate | Moderate | Moderate | Slow (API latency) |
| Accuracy | Good | Excellent | Excellent | Excellent | Good | Excellent | Excellent | Excellent | Highest |
| Languages | 100+ | 80+ (11 script families) | 80+ | All | 100+ | 20+ (CJK + Latin) | 20+ (CJK + Latin) | 20+ (CJK + Latin) | All (provider-dependent) |
| Installation | System package | Built-in (native) or Python package | Python package only | Cargo feature | Cargo feature | Cargo feature | Cargo feature | Cargo feature | API key only |
| Model size | ~10 MB | Mobile ~8 MB, Server ~120 MB | ~100 MB | ~3 GB | ~250 MB | ~3.5 GB | ~4 GB | ~2.5 GB | None (cloud-hosted) |
| GPU support | No | Yes | Yes | Yes (Metal/CUDA) | Yes (Metal/CUDA) | Yes (Metal/CUDA) | Yes (Metal/CUDA) | Yes (Metal/CUDA) | N/A (server-side) |
| Platform | All (including Wasm) | All except Wasm | Python only | Native only | Native only | Native only | Native only | Native only | All |
| Cost | Free | Free | Free | Free | Free | Free | Free | Free | Per-token API cost |
When to use which:
- Tesseract — Default choice. Works everywhere, low overhead, broadest platform support.
- PaddleOCR — Best speed-to-accuracy ratio. Preferred for CJK languages. Mobile tier is fast; server tier maximizes accuracy with GPU.
- EasyOCR — Highest accuracy with deep learning models. Python-only, heavier dependency.
- Candle GLM-OCR v5.0 — Excellent accuracy with VLM-level reasoning on 0.9B-param GLM model. Pure Rust, GPU-accelerated (Metal on macOS, CUDA on Linux). Region-aware layout dispatch. First download ~3 GB.
- Candle TrOCR v5.0 — Smaller model footprint (~250 MB) with solid accuracy across languages. Pure Rust, GPU-accelerated. Good balance of speed and quality.
- Candle Hunyuan-OCR v5.0 — Tencent Hunyuan-OCR with comprehensive document parsing and multilingual support including CJK and Latin scripts. Pure Rust, GPU-accelerated. First download ~3.5 GB.
- Candle DeepSeek-OCR v5.0 — Deep learning-based OCR combining SAM + CLIP + Qwen2 + DeepSeek MoE. Multilingual with strong CJK coverage. Pure Rust, GPU-accelerated. First download ~4 GB.
- Candle PaddleOCR-VL v5.0 — SigLIP vision encoder + Ernie-4.5 text decoder. Lightweight multilingual model with CJK and Latin support. Pure Rust, GPU-accelerated. First download ~2.5 GB.
- VLM — Best for handwritten text, poor scans, Arabic/Farsi, and complex layouts. Requires an API key and incurs per-token costs. See LLM Integration for full details.
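The guidance above can be condensed into a rule-of-thumb helper. This is a minimal sketch: `suggest_backend` and its parameters are hypothetical, the priority order is an assumption drawn from the comparison table, and the function returns display names, not configuration identifiers.

```python
def suggest_backend(
    *,
    platform: str = "native",   # "native" or "wasm"
    handwriting: bool = False,  # handwritten text or very poor scans
    cjk: bool = False,          # Chinese, Japanese, or Korean content
    allow_cloud: bool = False,  # an API key and per-token cost are acceptable
) -> str:
    """Suggest a backend name following the 'When to use which' notes."""
    if handwriting and allow_cloud:
        # VLM is listed as the best option for handwriting and poor scans.
        return "VLM"
    if platform == "wasm":
        # PaddleOCR is listed as unavailable on Wasm; Tesseract runs everywhere.
        return "Tesseract"
    if cjk:
        # PaddleOCR is the preferred local backend for CJK languages.
        return "PaddleOCR"
    # Tesseract is the default choice.
    return "Tesseract"


print(suggest_backend(cjk=True))
print(suggest_backend(platform="wasm", cjk=True))
```

The Candle backends and EasyOCR are left out to keep the sketch short; they would slot in as further branches keyed on GPU availability and acceptable download size.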
Installation¶
Tesseract¶
Download from GitHub releases.
Additional language packs:
# macOS — all languages
brew install tesseract-lang
# Ubuntu/Debian — individual languages
sudo apt-get install tesseract-ocr-deu # German
sudo apt-get install tesseract-ocr-fra # French
# Verify installed languages
tesseract --list-langs
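To check programmatically that every code in a `+`-joined language string is installed, the output of `tesseract --list-langs` can be parsed. The sketch below uses only the Python standard library; the helper names and the sample tessdata path are hypothetical, and the parser assumes the header line starts with "List of available", as it does in current Tesseract releases.

```python
import shutil
import subprocess


def parse_list_langs(output: str) -> set[str]:
    """Parse the text printed by `tesseract --list-langs` into a set of codes."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    # The first line is a header, e.g.: List of available languages in "..." (2):
    return {line for line in lines if not line.startswith("List of available")}


def missing_languages(requested: str, installed: set[str]) -> list[str]:
    """Return codes from a '+'-joined string (e.g. 'eng+deu') that are not installed."""
    return [code for code in requested.split("+") if code not in installed]


def installed_languages() -> set[str]:
    """Run the real CLI; returns an empty set when tesseract is not on PATH."""
    if shutil.which("tesseract") is None:
        return set()
    proc = subprocess.run(
        ["tesseract", "--list-langs"], capture_output=True, text=True, check=False
    )
    # Some Tesseract versions print the list to stderr, so read both streams.
    return parse_list_langs(proc.stdout + "\n" + proc.stderr)


# Sample output with a made-up tessdata path, used instead of calling the CLI.
sample = 'List of available languages in "/usr/share/tessdata/" (2):\neng\nosd\n'
print(missing_languages("eng+deu+fra", parse_list_langs(sample)))  # ['deu', 'fra']
```

On a real machine, `missing_languages("eng+deu+fra", installed_languages())` reports which language packs still need installing before OCR is configured with that string.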
PaddleOCR¶
Built in via the paddle-ocr feature flag. PaddleOCR is bundled via the native Rust bindings and works out of the box since 4.8.5. Models download automatically on first use, so no extra installation is needed.
EasyOCR (Python only)¶
!!! Info "Python 3.14"
    EasyOCR 1.7.3+ and PyTorch 2.9.1+ support Python 3.14. Install kreuzberg[easyocr] on any supported Python version (3.10–3.14).
Tesseract marker extra
pip install "kreuzberg[tesseract]" is available as a metadata-only marker to document a dependency on the Tesseract system package. It installs no Python packages — Tesseract itself must still be installed via your OS package manager (see above).
Configuration¶
Basic OCR¶
from kreuzberg import extract_file_sync, ExtractionConfig, OcrConfig
config: ExtractionConfig = ExtractionConfig(
ocr=OcrConfig(backend="tesseract", language="eng")
)
result = extract_file_sync("scanned.pdf", config=config)
content: str = result.content
preview: str = content[:100]
total_length: int = len(content)
print(f"Extracted content (preview): {preview}")
print(f"Total characters: {total_length}")
use kreuzberg::{extract_file_sync, ExtractionConfig, OcrConfig};
fn main() -> kreuzberg::Result<()> {
let config = ExtractionConfig {
ocr: Some(OcrConfig {
backend: "tesseract".to_string(),
language: "eng".to_string(),
..Default::default()
}),
..Default::default()
};
let result = extract_file_sync("scanned.pdf", None, &config)?;
println!("{}", result.content);
Ok(())
}
package main
import (
"log"
"github.com/kreuzberg-dev/kreuzberg/packages/go/v5"
)
func main() {
cfg := kreuzberg.ExtractionConfig{
Ocr: &kreuzberg.OcrConfig{
Backend: "tesseract",
Language: "eng",
},
}
result, err := kreuzberg.ExtractFileSync("scanned.pdf", nil, cfg)
if err != nil {
log.Fatalf("extract failed: %v", err)
}
log.Println(len(result.Content))
}
import dev.kreuzberg.Kreuzberg;
import dev.kreuzberg.ExtractionResult;
import dev.kreuzberg.KreuzbergException;
import dev.kreuzberg.ExtractionConfig;
import dev.kreuzberg.OcrConfig;
import java.io.IOException;
public class Main {
public static void main(String[] args) {
try {
ExtractionConfig config = ExtractionConfig.builder()
.ocr(OcrConfig.builder()
.backend("tesseract")
.language("eng")
.build())
.build();
ExtractionResult result = Kreuzberg.extractFile("scanned.pdf", config);
System.out.println(result.getContent());
} catch (IOException | KreuzbergException e) {
System.err.println("Extraction failed: " + e.getMessage());
}
}
}
library(kreuzberg)
# Configure Tesseract OCR
config <- list(
force_ocr = TRUE,
ocr = list(backend = "tesseract", language = "eng")
)
# Extract text from a scanned image
json <- extract_file_sync("scan.png", "image/png", config)
result <- jsonlite::fromJSON(json, simplifyVector = FALSE)
cat(sprintf("Extracted %d characters\n", nchar(result$content)))
cat("Content preview:\n")
cat(substr(result$content, 1, 200))
import { enableOcr, extractFromFile, initWasm } from "@kreuzberg/wasm";
await initWasm();
await enableOcr();
const fileInput = document.getElementById("file") as HTMLInputElement;
const file = fileInput.files?.[0];
if (file) {
const result = await extractFromFile(file, file.type, {
ocr: {
backend: "kreuzberg-tesseract",
language: "eng",
},
});
console.log(result.content);
}
import { enableOcr, extractFile, initWasm } from "@kreuzberg/wasm";
await initWasm();
await enableOcr(); // Uses native kreuzberg-tesseract backend
const result = await extractFile("./scanned_document.png", "image/png", {
ocr: {
backend: "kreuzberg-tesseract",
language: "eng",
},
});
console.log(result.content);
Multiple Languages¶
Specify multiple language codes separated by + (Tesseract) or as a list (EasyOCR/PaddleOCR):
from kreuzberg import extract_file_sync, ExtractionConfig, OcrConfig
config: ExtractionConfig = ExtractionConfig(
ocr=OcrConfig(backend="tesseract", language="eng+deu+fra")
)
result = extract_file_sync("multilingual.pdf", config=config)
content: str = result.content
preview: str = content[:100]
total_length: int = len(content)
print(f"Extracted content (preview): {preview}")
print(f"Total characters: {total_length}")
use kreuzberg::{extract_file_sync, ExtractionConfig, OcrConfig};
fn main() -> kreuzberg::Result<()> {
let config = ExtractionConfig {
ocr: Some(OcrConfig {
backend: "tesseract".to_string(),
language: "eng+deu+fra".to_string(),
..Default::default()
}),
..Default::default()
};
let result = extract_file_sync("multilingual.pdf", None, &config)?;
println!("{}", result.content);
Ok(())
}
package main
import (
"log"
"github.com/kreuzberg-dev/kreuzberg/packages/go/v5"
)
func main() {
result, err := kreuzberg.ExtractFileSync("multilingual.pdf", nil, kreuzberg.ExtractionConfig{
Ocr: &kreuzberg.OcrConfig{
Backend: "tesseract",
Language: "eng+deu+fra",
},
})
if err != nil {
log.Fatalf("extract failed: %v", err)
}
log.Println(result.Content)
}
import dev.kreuzberg.Kreuzberg;
import dev.kreuzberg.ExtractionResult;
import dev.kreuzberg.ExtractionConfig;
import dev.kreuzberg.OcrConfig;
ExtractionConfig config = ExtractionConfig.builder()
.ocr(OcrConfig.builder()
.backend("tesseract")
.language("eng+deu+fra")
.build())
.build();
ExtractionResult result = Kreuzberg.extractFile("multilingual.pdf", config);
System.out.println(result.getContent());
library(kreuzberg)
# Configure multi-language OCR (English, French, German)
config <- list(
force_ocr = TRUE,
ocr = list(backend = "tesseract", language = "eng+fra+deu")
)
# Extract from a multilingual document
json <- extract_file_sync("multilingual.png", "image/png", config)
result <- jsonlite::fromJSON(json, simplifyVector = FALSE)
cat(sprintf("Detected language: %s\n", result$detected_language))
cat(sprintf("Extracted %d characters\n", nchar(result$content)))
cat("Content preview:\n")
cat(substr(result$content, 1, 200))
Force OCR¶
Process PDFs with OCR even when they have a text layer:
from kreuzberg import extract_file_sync, ExtractionConfig, OcrConfig
config: ExtractionConfig = ExtractionConfig(
ocr=OcrConfig(backend="tesseract"),
force_ocr=True,
)
result = extract_file_sync("document.pdf", config=config)
content: str = result.content
preview: str = content[:100]
total_length: int = len(content)
print(f"Extracted content (preview): {preview}")
print(f"Total characters: {total_length}")
use kreuzberg::{extract_file_sync, ExtractionConfig, OcrConfig};
fn main() -> kreuzberg::Result<()> {
let config = ExtractionConfig {
ocr: Some(OcrConfig {
backend: "tesseract".to_string(),
..Default::default()
}),
force_ocr: true,
..Default::default()
};
let result = extract_file_sync("document.pdf", None, &config)?;
println!("{}", result.content);
Ok(())
}
package main
import (
"fmt"
"log"
"github.com/kreuzberg-dev/kreuzberg/packages/go/v5"
)
func main() {
result, err := kreuzberg.ExtractFileSync("document.pdf", nil, kreuzberg.ExtractionConfig{
Ocr: &kreuzberg.OcrConfig{
Backend: "tesseract",
},
ForceOcr: true,
})
if err != nil {
log.Fatalf("extract failed: %v", err)
}
fmt.Println(result.Content)
}
import dev.kreuzberg.Kreuzberg;
import dev.kreuzberg.ExtractionResult;
import dev.kreuzberg.ExtractionConfig;
import dev.kreuzberg.OcrConfig;
ExtractionConfig config = ExtractionConfig.builder()
.ocr(OcrConfig.builder()
.backend("tesseract")
.build())
.forceOcr(true)
.build();
ExtractionResult result = Kreuzberg.extractFile("document.pdf", config);
System.out.println(result.getContent());
library(kreuzberg)
config <- list(force_ocr = TRUE)
json <- extract_file_sync("multipage_document.pdf", "application/pdf", config)
result <- jsonlite::fromJSON(json, simplifyVector = FALSE)
cat(sprintf("Total pages: %d\n", length(result$pages)))
cat(sprintf("Content extracted via OCR: %d characters\n", nchar(result$content)))
cat(sprintf("Detected language: %s\n", result$detected_language))
Using EasyOCR¶
from kreuzberg import extract_file_sync, ExtractionConfig, OcrConfig
config: ExtractionConfig = ExtractionConfig(
ocr=OcrConfig(backend="easyocr", language="en")
)
# EasyOCR-specific options (use_gpu, beam_width, etc.) go in easyocr_kwargs,
# not in OcrConfig — OcrConfig only accepts backend, language, and backend-specific configs.
result = extract_file_sync("scanned.pdf", config=config, easyocr_kwargs={"use_gpu": True})
content: str = result.content
preview: str = content[:100]
total_length: int = len(content)
print(f"Extracted content (preview): {preview}")
print(f"Total characters: {total_length}")
EasyOCR is only available in Python.
use kreuzberg::{extract_file, ExtractionConfig, OcrConfig};
#[tokio::main]
async fn main() -> kreuzberg::Result<()> {
let config = ExtractionConfig {
ocr: Some(OcrConfig {
backend: "easyocr".to_string(),
language: "en".to_string(),
..Default::default()
}),
..Default::default()
};
let result = extract_file("document.pdf", None, &config).await?;
println!("Extracted text: {}", result.content);
Ok(())
}
Disable OCR¶
Added in v4.7
When disable_ocr is set, image files return empty content instead of raising MissingDependencyError.
Using PaddleOCR¶
from kreuzberg import extract_file_sync, ExtractionConfig, OcrConfig
config: ExtractionConfig = ExtractionConfig(
ocr=OcrConfig(backend="paddleocr", language="en") # model_tier="server" for max accuracy
)
result = extract_file_sync("scanned.pdf", config=config)
content: str = result.content
preview: str = content[:100]
total_length: int = len(content)
print(f"Extracted content (preview): {preview}")
print(f"Total characters: {total_length}")
use kreuzberg::{extract_file, ExtractionConfig, OcrConfig};
#[tokio::main]
async fn main() -> kreuzberg::Result<()> {
let config = ExtractionConfig {
ocr: Some(OcrConfig {
backend: "paddleocr".to_string(),
language: "en".to_string(),
// paddle_ocr_config: Some(serde_json::json!({"model_tier": "server"})), // for max accuracy
..Default::default()
}),
..Default::default()
};
let result = extract_file("document.pdf", None, &config).await?;
println!("Extracted text: {}", result.content);
Ok(())
}
package main
import (
"log"
"github.com/kreuzberg-dev/kreuzberg/packages/go/v5"
)
func main() {
cfg := kreuzberg.ExtractionConfig{
Ocr: &kreuzberg.OcrConfig{
Backend: "paddle-ocr",
Language: "en",
},
}
result, err := kreuzberg.ExtractFileSync("scanned.pdf", nil, cfg)
if err != nil {
log.Fatalf("extract failed: %v", err)
}
log.Println(len(result.Content))
}
import dev.kreuzberg.Kreuzberg;
import dev.kreuzberg.ExtractionResult;
import dev.kreuzberg.KreuzbergException;
import dev.kreuzberg.ExtractionConfig;
import dev.kreuzberg.OcrConfig;
import java.io.IOException;
public class Main {
public static void main(String[] args) {
try {
ExtractionConfig config = ExtractionConfig.builder()
.ocr(OcrConfig.builder()
.backend("paddle-ocr")
.language("en")
// .paddleOcrConfig(PaddleOcrConfig.builder().modelTier("server").build()) // for max accuracy
.build())
.build();
ExtractionResult result = Kreuzberg.extractFile("scanned.pdf", config);
System.out.println(result.getContent());
} catch (IOException | KreuzbergException e) {
System.err.println("Extraction failed: " + e.getMessage());
}
}
}
require 'kreuzberg'
config = Kreuzberg::ExtractionConfig.new(
ocr: Kreuzberg::OcrConfig.new(
backend: 'paddleocr',
language: 'en'
# model_tier: 'server' # for max accuracy
)
)
result = Kreuzberg.extract_file_sync('scanned.pdf', config: config)
puts result.content[0..100]
puts "Total length: #{result.content.length}"
library(kreuzberg)
# Configure PaddleOCR backend (defaults to mobile tier)
config <- list(
force_ocr = TRUE,
ocr = list(backend = "paddle-ocr", language = "en")
)
# Extract text from an image using PaddleOCR
json <- extract_file_sync("document.jpg", "image/jpeg", config)
result <- jsonlite::fromJSON(json, simplifyVector = FALSE)
cat(sprintf("Extracted %d characters\n", nchar(result$content)))
cat(sprintf("MIME type: %s\n", result$mime_type))
cat("Content preview:\n")
cat(substr(result$content, 1, 200))
Candle GLM-OCR¶
Added in v5.0.0-rc.18
Built in via the candle-glm-ocr feature flag. The GLM-OCR model downloads automatically on first use (~3 GB) and is cached at ~/.cache/huggingface/.
[dependencies]
kreuzberg = { version = "5", features = ["candle-glm-ocr"] }
GPU support:
- Metal (macOS) — Default, F32 dtype (BF16 matmul unavailable in candle 0.10)
- CUDA (Linux/Windows with NVIDIA GPU) — Auto-detected
- CPU fallback — Slowest, but always available
Using Candle GLM-OCR¶
Added in v5.0.0-rc.18
Candle GLM-OCR dispatches by detected layout region using PP-DocLayout-V3. Each region runs through the appropriate task prompt (ocr/table/formula/chart/caption) and outputs are merged into reading-order markdown.
from kreuzberg import ExtractionConfig, OcrConfig, extract_file_sync
# Paired mode: per-region dispatch (default)
config = ExtractionConfig(
force_ocr=True,
ocr=OcrConfig(
backend="candle-glm-ocr",
language="en",
backend_options={"layout_mode": "paired"},
),
)
result = extract_file_sync("document.pdf", config=config)
print(result.content)
# Whole-page mode: single OCR pass over entire page
config_whole = ExtractionConfig(
force_ocr=True,
ocr=OcrConfig(
backend="candle-glm-ocr",
language="en",
backend_options={"layout_mode": "whole_page"},
),
)
result_whole = extract_file_sync("document.pdf", config=config_whole)
import { extractFileSync } from '@kreuzberg/node';
// Paired mode: per-region dispatch (default)
const result = extractFileSync('document.pdf', {
forceOcr: true,
ocr: {
backend: 'candle-glm-ocr',
language: 'en',
backendOptions: { layout_mode: 'paired' },
},
});
console.log(result.content);
// Whole-page mode
const resultWholePage = extractFileSync('document.pdf', {
forceOcr: true,
ocr: {
backend: 'candle-glm-ocr',
language: 'en',
backendOptions: { layout_mode: 'whole_page' },
},
});
use kreuzberg::{extract_file, ExtractionConfig, OcrConfig};
use serde_json::json;
// Paired mode: per-region dispatch (default)
let config = ExtractionConfig {
force_ocr: true,
ocr: Some(OcrConfig {
backend: "candle-glm-ocr".into(),
language: "en".into(),
backend_options: Some(json!({"layout_mode": "paired"})),
..Default::default()
}),
..Default::default()
};
let result = extract_file("document.pdf", None, &config).await?;
println!("{}", result.content);
// Whole-page mode
let config_whole = ExtractionConfig {
force_ocr: true,
ocr: Some(OcrConfig {
backend: "candle-glm-ocr".into(),
language: "en".into(),
backend_options: Some(json!({"layout_mode": "whole_page"})),
..Default::default()
}),
..Default::default()
};
let result_whole = extract_file("document.pdf", None, &config_whole).await?;
Backend options:
| Option | Values | Description |
|---|---|---|
| layout_mode | "paired" (default), "whole_page" | Paired: dispatch per-region via PP-DocLayout-V3. Whole-page: single OCR pass on entire page. |
| task | "ocr" (default), "table", "formula", "chart", "caption" | Task prompt for whole-page mode only; ignored in paired mode where the region type determines the prompt. |
| device | "auto" (default), "cpu", "metal", "cuda" | Device selection. Auto detects Metal on macOS, CUDA on Linux, CPU fallback. |
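Because backend options are passed as a free-form mapping, a typo in a value can go unnoticed until runtime. The following sketch checks a `backend_options` dictionary against the table above before it is handed to `OcrConfig`. `check_glm_options` is a hypothetical helper, not part of Kreuzberg, and it passes unknown keys through untouched since the backend may accept options not listed here.

```python
ALLOWED = {
    "layout_mode": {"paired", "whole_page"},
    "task": {"ocr", "table", "formula", "chart", "caption"},
    "device": {"auto", "cpu", "metal", "cuda"},
}
DEFAULTS = {"layout_mode": "paired", "task": "ocr", "device": "auto"}


def check_glm_options(options: dict[str, str]) -> tuple[dict[str, str], list[str]]:
    """Validate options against the table; return (resolved options, warnings)."""
    for key, value in options.items():
        allowed = ALLOWED.get(key)
        # Only keys documented in the table are validated.
        if allowed is not None and value not in allowed:
            raise ValueError(f"invalid value {value!r} for {key!r}")
    # Fill in documented defaults for anything the caller left out.
    resolved = {**DEFAULTS, **options}
    warnings = []
    if "task" in options and resolved["layout_mode"] == "paired":
        # In paired mode the detected region type determines the prompt.
        warnings.append("task is ignored in paired mode")
    return resolved, warnings


resolved, warnings = check_glm_options({"task": "table"})
print(resolved)
print(warnings)
```

Setting `task` without switching `layout_mode` to `"whole_page"` produces a warning here, which mirrors the note in the table that the task prompt only applies to whole-page mode.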
Candle Hunyuan-OCR¶
Added in v5.0.0-rc.18
Tencent Hunyuan-OCR — vision-language model for comprehensive document parsing with markdown output and multilingual support.
Built in via the candle-hunyuan-ocr feature flag or the candle-vlm-ocr umbrella feature. The model downloads automatically on first use (~3.5 GB) and is cached at ~/.cache/huggingface/.
[dependencies]
kreuzberg = { version = "5", features = ["candle-hunyuan-ocr"] }
GPU support:
- Metal (macOS) — Default, F32 dtype
- CUDA (Linux/Windows with NVIDIA GPU) — Auto-detected
- CPU fallback — Slowest, but always available
Using Candle Hunyuan-OCR¶
Added in v5.0.0-rc.18
from kreuzberg import ExtractionConfig, OcrConfig, extract_file_sync
config = ExtractionConfig(
force_ocr=True,
ocr=OcrConfig(
backend="candle-hunyuan-ocr",
language="en",
backend_options={"device": "auto", "model_path": "~/.cache/huggingface/"},
),
)
result = extract_file_sync("document.pdf", config=config)
print(result.content)
use kreuzberg::{extract_file, ExtractionConfig, OcrConfig};
use serde_json::json;
let config = ExtractionConfig {
force_ocr: true,
ocr: Some(OcrConfig {
backend: "candle-hunyuan-ocr".into(),
language: "en".into(),
backend_options: Some(json!({"device": "auto", "model_path": "~/.cache/huggingface/"})),
..Default::default()
}),
..Default::default()
};
let result = extract_file("document.pdf", None, &config).await?;
println!("{}", result.content);
Supported languages: English, Chinese, Japanese, Korean, French, German, Spanish, Italian, Portuguese, Russian, Arabic, Hindi, Thai, Vietnamese, and others.
Model source: Download from Hugging Face Hub.
Candle DeepSeek-OCR¶
Added in v5.0.0-rc.18
DeepSeek-OCR — combination of SAM + CLIP encoder fused with Qwen2 decoder and DeepSeek V2 MoE for comprehensive multilingual document understanding. Markdown output.
Built in via the candle-deepseek-ocr feature flag or the candle-vlm-ocr umbrella feature. The model downloads automatically on first use (~4 GB) and is cached at ~/.cache/huggingface/.
[dependencies]
kreuzberg = { version = "5", features = ["candle-deepseek-ocr"] }
GPU support:
- Metal (macOS) — Default, F32 dtype
- CUDA (Linux/Windows with NVIDIA GPU) — Auto-detected
- CPU fallback — Slowest, but always available
Using Candle DeepSeek-OCR¶
Added in v5.0.0-rc.18
from kreuzberg import ExtractionConfig, OcrConfig, extract_file_sync
config = ExtractionConfig(
force_ocr=True,
ocr=OcrConfig(
backend="candle-deepseek-ocr",
language="en",
backend_options={"device": "auto", "model_path": "~/.cache/huggingface/"},
),
)
result = extract_file_sync("document.pdf", config=config)
print(result.content)
import { extractFileSync } from '@kreuzberg/node';
const result = extractFileSync('document.pdf', {
forceOcr: true,
ocr: {
backend: 'candle-deepseek-ocr',
language: 'en',
backendOptions: { device: 'auto', model_path: '~/.cache/huggingface/' },
},
});
console.log(result.content);
use kreuzberg::{extract_file, ExtractionConfig, OcrConfig};
use serde_json::json;
let config = ExtractionConfig {
force_ocr: true,
ocr: Some(OcrConfig {
backend: "candle-deepseek-ocr".into(),
language: "en".into(),
backend_options: Some(json!({"device": "auto", "model_path": "~/.cache/huggingface/"})),
..Default::default()
}),
..Default::default()
};
let result = extract_file("document.pdf", None, &config).await?;
println!("{}", result.content);
Supported languages: English, Chinese, Japanese, Korean, French, German, Spanish, Italian, Portuguese, Russian, Arabic, Hindi, Thai, Vietnamese, and others.
Model source: Download from Hugging Face Hub.
Candle PaddleOCR-VL¶
Added in v5.0.0-rc.18
PaddleOCR-VL 1.5 — SigLIP vision encoder + Ernie-4.5 text decoder for lightweight multilingual document understanding. Markdown output.
Built in via the candle-paddleocr-vl-15 feature flag or the candle-vlm-ocr umbrella feature. The model downloads automatically on first use (~2.5 GB) and is cached at ~/.cache/huggingface/.
[dependencies]
kreuzberg = { version = "5", features = ["candle-paddleocr-vl-15"] }
GPU support:
- Metal (macOS) — Default, F32 dtype
- CUDA (Linux/Windows with NVIDIA GPU) — Auto-detected
- CPU fallback — Slowest, but always available
Using Candle PaddleOCR-VL¶
Added in v5.0.0-rc.18
from kreuzberg import ExtractionConfig, OcrConfig, extract_file_sync
config = ExtractionConfig(
force_ocr=True,
ocr=OcrConfig(
backend="candle-paddleocr-vl-15",
language="en",
backend_options={"device": "auto", "model_path": "~/.cache/huggingface/"},
),
)
result = extract_file_sync("document.pdf", config=config)
print(result.content)
import { extractFileSync } from '@kreuzberg/node';
const result = extractFileSync('document.pdf', {
forceOcr: true,
ocr: {
backend: 'candle-paddleocr-vl-15',
language: 'en',
backendOptions: { device: 'auto', model_path: '~/.cache/huggingface/' },
},
});
console.log(result.content);
use kreuzberg::{extract_file, ExtractionConfig, OcrConfig};
use serde_json::json;
let config = ExtractionConfig {
force_ocr: true,
ocr: Some(OcrConfig {
backend: "candle-paddleocr-vl-15".into(),
language: "en".into(),
backend_options: Some(json!({"device": "auto", "model_path": "~/.cache/huggingface/"})),
..Default::default()
}),
..Default::default()
};
let result = extract_file("document.pdf", None, &config).await?;
println!("{}", result.content);
Supported languages: English, Chinese, Japanese, Korean, French, German, Spanish, Italian, Portuguese, Russian, and others.
Model source: Download from PaddlePaddle Hub.
Using VLM OCR¶
Use a vision-language model (e.g. GPT-4o, Claude) as the OCR backend — each page is rendered and sent to the VLM. Cloud providers need an API key; local engines (Ollama, etc.) use the ollama/ prefix — see Local LLM Support.
import asyncio
from kreuzberg import extract_file, ExtractionConfig, OcrConfig, LlmConfig

async def main() -> None:
    config = ExtractionConfig(
        force_ocr=True,
        ocr=OcrConfig(
            backend="vlm",
            vlm_config=LlmConfig(model="openai/gpt-4o-mini"),
        ),
    )
    result = await extract_file("scan.pdf", config=config)
    print(result.content)

asyncio.run(main())
use kreuzberg::{extract_file, ExtractionConfig, OcrConfig, LlmConfig};
let config = ExtractionConfig {
force_ocr: true,
ocr: Some(OcrConfig {
backend: "vlm".to_string(),
vlm_config: Some(LlmConfig {
model: "openai/gpt-4o-mini".to_string(),
..Default::default()
}),
..Default::default()
}),
..Default::default()
};
let result = extract_file("scan.pdf", None, &config).await?;
For more on VLM OCR, including custom prompts, supported providers, and API key configuration, see LLM Integration.
!!! Tip "GPU Acceleration"
    EasyOCR and PaddleOCR support GPU acceleration. Set use_gpu=True in your OCR config. PaddleOCR's model_tier="server" gives the best accuracy with GPU.
DPI Configuration¶
Higher DPI improves accuracy but increases processing time and memory.
| DPI | Trade-off |
|---|---|
| 150 | Fastest — lower accuracy, less memory |
| 300 (default) | Balanced — good accuracy, reasonable speed |
| 600 | Best accuracy — slower, more memory |
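The memory side of this trade-off follows from simple arithmetic: pixel count grows with the square of the DPI. The sketch below estimates the uncompressed size of one rendered page. It assumes a US Letter page (8.5 x 11 in) and 24-bit RGB; Kreuzberg's actual rasterization buffers may differ, and `raster_cost` is a hypothetical helper.

```python
def raster_cost(
    dpi: int,
    width_in: float = 8.5,     # assumed page width in inches (US Letter)
    height_in: float = 11.0,   # assumed page height in inches (US Letter)
    bytes_per_pixel: int = 3,  # assumed 24-bit RGB
) -> tuple[int, int, int]:
    """Return (width_px, height_px, uncompressed_bytes) for one rendered page."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px, height_px, width_px * height_px * bytes_per_pixel


for dpi in (150, 300, 600):
    w, h, size = raster_cost(dpi)
    print(f"{dpi} DPI: {w} x {h} px, {size / 2**20:.1f} MiB")
```

Under these assumptions a page at 300 DPI is 2550 x 3300 pixels, and going from 300 to 600 DPI quadruples the buffer size, which is why lowering the DPI is the first remedy for out-of-memory errors on large PDFs.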
from kreuzberg import (
extract_file_sync,
ExtractionConfig,
OcrConfig,
TesseractConfig,
ImagePreprocessingConfig,
)
config: ExtractionConfig = ExtractionConfig(
ocr=OcrConfig(
backend="tesseract",
tesseract_config=TesseractConfig(
preprocessing=ImagePreprocessingConfig(target_dpi=300),
),
),
)
result = extract_file_sync("scanned.pdf", config=config)
content_length: int = len(result.content)
table_count: int = len(result.tables)
print(f"Content length: {content_length} characters")
print(f"Tables detected: {table_count}")
use kreuzberg::{extract_file_sync, ExtractionConfig, OcrConfig, PdfConfig};
fn main() -> kreuzberg::Result<()> {
let config = ExtractionConfig {
ocr: Some(OcrConfig {
backend: "tesseract".to_string(),
..Default::default()
}),
pdf_options: Some(PdfConfig {
dpi: Some(300),
..Default::default()
}),
..Default::default()
};
let result = extract_file_sync("scanned.pdf", None, &config)?;
Ok(())
}
package main
import (
"log"
"github.com/kreuzberg-dev/kreuzberg/packages/go/v5"
)
func main() {
targetDpi := int32(300)
result, err := kreuzberg.ExtractFileSync("scanned.pdf", nil, kreuzberg.ExtractionConfig{
Ocr: &kreuzberg.OcrConfig{
Backend: "tesseract",
TesseractConfig: &kreuzberg.TesseractConfig{
Preprocessing: &kreuzberg.ImagePreprocessingConfig{
TargetDpi: &targetDpi,
},
},
},
})
if err != nil {
log.Fatalf("extract failed: %v", err)
}
log.Println("content length:", len(result.Content))
}
import dev.kreuzberg.Kreuzberg;
import dev.kreuzberg.ExtractionResult;
import dev.kreuzberg.ExtractionConfig;
import dev.kreuzberg.OcrConfig;
import dev.kreuzberg.ImagePreprocessingConfig;
ExtractionConfig config = ExtractionConfig.builder()
.ocr(OcrConfig.builder()
.backend("tesseract")
.build())
.imagePreprocessing(ImagePreprocessingConfig.builder()
.targetDpi(300)
.build())
.build();
ExtractionResult result = Kreuzberg.extractFile("scanned.pdf", config);
library(kreuzberg)
# Tesseract OCR via the kreuzberg R bindings does not expose a DPI setting in
# the high-level config; PDF rasterization DPI is determined by the pipeline.
# This example demonstrates running Tesseract OCR end-to-end on a PDF.
config <- list(
force_ocr = TRUE,
ocr = list(backend = "tesseract", language = "eng")
)
json <- extract_file_sync("document.pdf", "application/pdf", config)
result <- jsonlite::fromJSON(json, simplifyVector = FALSE)
cat(sprintf("Characters extracted: %d\n", nchar(result$content)))
PaddleOCR Script Families¶
80+ languages across 11 script families (PP-OCRv5). Recognition models are downloaded on demand from HuggingFace:
| Family | Languages |
|---|---|
| English | English, numbers, punctuation |
| Chinese | Simplified/Traditional Chinese, Japanese |
| Latin | French, German, Spanish, Portuguese, Italian, Polish, Dutch, Turkish, Vietnamese, and so on. |
| Korean | Korean (Hangul) |
| Slavic | Russian, Ukrainian, Belarusian, Bulgarian, Serbian, and so on. |
| Thai | Thai script |
| Greek | Greek script |
| Arabic | Arabic, Persian, Urdu |
| Devanagari | Hindi, Marathi, Sanskrit, Nepali |
| Tamil | Tamil script |
| Telugu | Telugu script |
Models are cached locally after first download, so subsequent runs start immediately.
CLI Usage¶
# Basic OCR extraction
kreuzberg extract scanned.pdf --ocr true
# Specific language
kreuzberg extract french_doc.pdf --ocr true --ocr-language fra
# Specific backend
kreuzberg extract chinese_doc.pdf --ocr true --ocr-backend paddle-ocr --ocr-language ch
# Force OCR on all pages
kreuzberg extract document.pdf --force-ocr true
# VLM OCR backend
kreuzberg extract handwritten.pdf --force-ocr true --vlm-model openai/gpt-4o-mini
# Use a config file
kreuzberg extract scanned.pdf --config kreuzberg.toml --ocr true
| Flag | Description |
|---|---|
| --ocr true | Enable OCR processing |
| --ocr-language <code> | Language code (eng, deu, fra, ch, ja, ru, etc.) |
| --ocr-backend <backend> | Engine: tesseract, paddle-ocr, easyocr, or vlm |
| --force-ocr true | OCR all pages regardless of text layer |
| --vlm-model <model> | VLM model for OCR (for example, openai/gpt-4o-mini). Implies --ocr-backend vlm |
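When the CLI is driven from a script, building the argument list programmatically avoids quoting mistakes. The sketch below assembles a command using only the flags listed in the table; `build_extract_command` is a hypothetical helper, and the command is only printed here, not executed.

```python
import shlex
from typing import Optional


def build_extract_command(
    path: str,
    *,
    backend: Optional[str] = None,
    language: Optional[str] = None,
    force_ocr: bool = False,
    vlm_model: Optional[str] = None,
) -> list[str]:
    """Build an argv list for `kreuzberg extract` from the documented flags."""
    cmd = ["kreuzberg", "extract", path]
    # Mirror the examples above: use --force-ocr when forcing, --ocr otherwise.
    cmd += ["--force-ocr", "true"] if force_ocr else ["--ocr", "true"]
    if backend is not None:
        cmd += ["--ocr-backend", backend]
    if language is not None:
        cmd += ["--ocr-language", language]
    if vlm_model is not None:
        # Per the table, this flag implies --ocr-backend vlm.
        cmd += ["--vlm-model", vlm_model]
    return cmd


argv = build_extract_command("chinese_doc.pdf", backend="paddle-ocr", language="ch")
print(shlex.join(argv))
```

The resulting list can be passed directly to `subprocess.run(argv, check=True)`, which sidesteps shell quoting for file names containing spaces.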
Troubleshooting¶
Tesseract not found
Install Tesseract and verify it's on your PATH:
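One way to run the PATH check from Python, for instance at application start-up, is `shutil.which` from the standard library. This is a minimal sketch; `find_executable` is a hypothetical wrapper name.

```python
import shutil
from typing import Optional


def find_executable(name: str = "tesseract") -> Optional[str]:
    """Return the absolute path of `name` if it is on PATH, else None."""
    return shutil.which(name)


path = find_executable()
if path is None:
    print("tesseract is not on PATH; install it with your OS package manager")
else:
    print(f"tesseract found at {path}")
```

If the binary is installed but still not found, the directory that contains it is probably missing from the PATH of the process running the extraction.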
Language not found
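Install the language data pack for the missing language (for example, tesseract-ocr-deu on Ubuntu/Debian or tesseract-lang on macOS), then confirm it is listed by tesseract --list-langs.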
Poor accuracy
- Increase DPI to 600 for better quality
- Try a different backend — PaddleOCR and EasyOCR often outperform Tesseract on complex layouts
- Specify the correct language code for your document
- Use force_ocr=True if a PDF's embedded text layer is low quality
- For handwritten text or very poor scans, try the VLM backend with a vision-capable model (see LLM Integration)
Slow processing
- Reduce DPI to 150 for faster throughput
- Enable GPU acceleration with EasyOCR or PaddleOCR (use_gpu=True)
- Use batch extraction to process multiple files concurrently
Out of memory on large PDFs
- Reduce DPI — lower resolution uses significantly less memory
- Process pages in smaller batches
- Use PaddleOCR's mobile tier (model_tier="mobile") for a smaller memory footprint
Next Steps¶
- LLM Integration — VLM OCR, structured extraction, and LLM embeddings
- Configuration — all configuration options
- Extraction Basics — core extraction API and supported formats
- Advanced Features — chunking, language detection, embeddings