OCR (Optical Character Recognition)¶

Extract text from images and scanned PDFs. Kreuzberg automatically determines when OCR is needed — images always require it, scanned PDFs trigger it per-page, and hybrid PDFs only OCR the pages that lack a text layer. Set force_ocr=True to OCR all pages regardless.
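The rule above can be sketched in a few lines. This is a minimal illustration of the decision logic as described, not Kreuzberg's implementation; the `Page` record and `pages_needing_ocr` helper are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """Hypothetical page record: only tracks whether a text layer exists."""
    number: int
    has_text_layer: bool


def pages_needing_ocr(pages: list[Page], is_image: bool = False,
                      force_ocr: bool = False) -> list[int]:
    """Return the page numbers that would be sent to OCR."""
    if is_image or force_ocr:
        # Images have no text layer; force_ocr overrides detection entirely.
        return [p.number for p in pages]
    # Scanned and hybrid PDFs: only pages that lack a text layer.
    return [p.number for p in pages if not p.has_text_layer]


hybrid = [Page(1, True), Page(2, False), Page(3, True)]
print(pages_needing_ocr(hybrid))                  # [2]
print(pages_needing_ocr(hybrid, force_ocr=True))  # [1, 2, 3]
```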

Backend Comparison¶

Nine OCR backends: pick based on platform, accuracy needs, and language coverage.

	Tesseract	PaddleOCR	EasyOCR	Candle GLM-OCR v5.0	Candle TrOCR v5.0	Candle Hunyuan-OCR v5.0	Candle DeepSeek-OCR v5.0	Candle PaddleOCR-VL v5.0	VLM
Speed	Fast	Very fast	Moderate	Moderate	Moderate	Moderate	Moderate	Moderate	Slow (API latency)
Accuracy	Good	Excellent	Excellent	Excellent	Good	Excellent	Excellent	Excellent	Highest
Languages	100+	80+ (11 script families)	80+	All	100+	20+ (CJK + Latin)	20+ (CJK + Latin)	20+ (CJK + Latin)	All (provider-dependent)
Installation	System package	Built-in (native) or Python package	Python package only	Cargo feature	Cargo feature	Cargo feature	Cargo feature	Cargo feature	API key only
Model size	~10 MB	Mobile ~8 MB, Server ~120 MB	~100 MB	~3 GB	~250 MB	~3.5 GB	~4 GB	~2.5 GB	None (cloud-hosted)
GPU support	No	Yes	Yes	Yes (Metal/CUDA)	Yes (Metal/CUDA)	Yes (Metal/CUDA)	Yes (Metal/CUDA)	Yes (Metal/CUDA)	N/A (server-side)
Platform	All (including Wasm)	All except Wasm	Python only	Native only	Native only	Native only	Native only	Native only	All
Cost	Free	Free	Free	Free	Free	Free	Free	Free	Per-token API cost

When to use which:

Tesseract — Default choice. Works everywhere, low overhead, broadest platform support.
PaddleOCR — Best speed-to-accuracy ratio. Preferred for CJK languages. Mobile tier is fast; server tier maximizes accuracy with GPU.
EasyOCR — Excellent accuracy with deep learning models. Python-only, heavier dependency.
Candle GLM-OCR v5.0 — Excellent accuracy with VLM-level reasoning on 0.9B-param GLM model. Pure Rust, GPU-accelerated (Metal on macOS, CUDA on Linux). Region-aware layout dispatch. First download ~3 GB.
Candle TrOCR v5.0 — Smaller model footprint (~250 MB) with solid accuracy across languages. Pure Rust, GPU-accelerated. Good balance of speed and quality.
Candle Hunyuan-OCR v5.0 — Tencent Hunyuan-OCR with comprehensive document parsing and multilingual support including CJK and Latin scripts. Pure Rust, GPU-accelerated. First download ~3.5 GB.
Candle DeepSeek-OCR v5.0 — Deep learning-based OCR combining SAM + CLIP + Qwen2 + DeepSeek MoE. Multilingual with strong CJK coverage. Pure Rust, GPU-accelerated. First download ~4 GB.
Candle PaddleOCR-VL v5.0 — SigLIP vision encoder + Ernie-4.5 text decoder. Lightweight multilingual model with CJK and Latin support. Pure Rust, GPU-accelerated. First download ~2.5 GB.
VLM — Best for handwritten text, poor scans, Arabic/Farsi, and complex layouts. Requires an API key and incurs per-token costs. See LLM Integration for full details.
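As a rough illustration, the guidance above can be turned into a small selection helper. The function name, its parameters, and the order of the checks are assumptions made for this sketch; it covers only Tesseract, PaddleOCR, and VLM and is not part of the Kreuzberg API.

```python
def suggest_backend(platform: str = "native", cjk: bool = False,
                    hard_input: bool = False,
                    allow_api_cost: bool = False) -> str:
    """Suggest a backend name from a few coarse requirements (hypothetical helper)."""
    if hard_input and allow_api_cost:
        # Handwriting, poor scans, complex layouts: VLM, if per-token cost is acceptable.
        return "vlm"
    if platform == "wasm":
        # Tesseract is the only local backend listed as running on Wasm.
        return "tesseract"
    if cjk:
        # PaddleOCR is the preferred choice for CJK languages.
        return "paddleocr"
    # Default choice: works everywhere with low overhead.
    return "tesseract"


print(suggest_backend())                                      # tesseract
print(suggest_backend(cjk=True))                              # paddleocr
print(suggest_backend(hard_input=True, allow_api_cost=True))  # vlm
```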

Installation¶

Tesseract¶

macOS

Terminal

brew install tesseract

Ubuntu / Debian

Terminal

sudo apt-get install tesseract-ocr

RHEL / Fedora

Terminal

sudo dnf install tesseract

Windows

Download from GitHub releases.

Additional language packs:

Terminal

# macOS — all languages
brew install tesseract-lang

# Ubuntu/Debian — individual languages
sudo apt-get install tesseract-ocr-deu  # German
sudo apt-get install tesseract-ocr-fra  # French

# Verify installed languages
tesseract --list-langs
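The same check can be scripted before running OCR. The sketch below shells out to `tesseract --list-langs` and parses the result; it assumes the usual output shape (a "List of available languages" header followed by one code per line), and the helper names are illustrative. Any warnings Tesseract prints would be picked up as extra entries, so treat the result as a convenience check only.

```python
import shutil
import subprocess


def parse_list_langs(output: str) -> list[str]:
    """Extract language codes from `tesseract --list-langs` output."""
    langs = []
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the "List of available languages ..." header.
        if not line or line.lower().startswith("list of available"):
            continue
        langs.append(line)
    return langs


def installed_languages() -> list[str]:
    """Return installed language codes, or [] if tesseract is not on PATH."""
    if shutil.which("tesseract") is None:
        return []
    proc = subprocess.run(["tesseract", "--list-langs"],
                          capture_output=True, text=True, check=False)
    # Some builds print the list to stderr, so scan both streams.
    return parse_list_langs(proc.stdout + "\n" + proc.stderr)


missing = {"eng", "deu"} - set(installed_languages())
print("Missing language packs:", sorted(missing))
```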

PaddleOCR¶

Native bindings (Rust, Go, TypeScript, Java, C#, Ruby, PHP, Elixir)

Built in via the paddle-ocr feature flag. Models download automatically on first use, so no extra installation is needed.

Cargo.toml (Rust example)

[dependencies]
kreuzberg = { version = "5", features = ["paddle-ocr"] }

Python

PaddleOCR is bundled via the native Rust bindings and works out of the box since 4.8.5, with no extra installation needed. Models are downloaded automatically on first use.

EasyOCR (Python only)¶

Terminal

pip install "kreuzberg[easyocr]"

Info: Python 3.14

EasyOCR 1.7.3+ and PyTorch 2.9.1+ support Python 3.14. Install kreuzberg[easyocr] on any supported Python version (3.10–3.14).

Tesseract marker extra

pip install "kreuzberg[tesseract]" is available as a metadata-only marker to document a dependency on the Tesseract system package. It installs no Python packages — Tesseract itself must still be installed via your OS package manager (see above).

Configuration¶

Basic OCR¶


Python

from kreuzberg import extract_file_sync, ExtractionConfig, OcrConfig

config: ExtractionConfig = ExtractionConfig(
    ocr=OcrConfig(backend="tesseract", language="eng")
)

result = extract_file_sync("scanned.pdf", config=config)

content: str = result.content
preview: str = content[:100]
total_length: int = len(content)

print(f"Extracted content (preview): {preview}")
print(f"Total characters: {total_length}")

TypeScript

import { extractFileSync } from "@kreuzberg/node";

const config = {
  ocr: {
    backend: "tesseract",
    language: "eng",
  },
};

const result = extractFileSync("scanned.pdf", null, config);
console.log(result.content);

Rust

use kreuzberg::{extract_file_sync, ExtractionConfig, OcrConfig};

fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig {
        ocr: Some(OcrConfig {
            backend: "tesseract".to_string(),
            language: "eng".to_string(),
            ..Default::default()
        }),
        ..Default::default()
    };

    let result = extract_file_sync("scanned.pdf", None, &config)?;
    println!("{}", result.content);
    Ok(())
}

Go

package main

import (
    "log"

    "github.com/kreuzberg-dev/kreuzberg/packages/go/v5"
)

func main() {
    cfg := kreuzberg.ExtractionConfig{
        Ocr: &kreuzberg.OcrConfig{
            Backend:  "tesseract",
            Language: "eng",
        },
    }

    result, err := kreuzberg.ExtractFileSync("scanned.pdf", nil, cfg)
    if err != nil {
        log.Fatalf("extract failed: %v", err)
    }
    log.Println(len(result.Content))
}

Java

import dev.kreuzberg.Kreuzberg;
import dev.kreuzberg.ExtractionResult;
import dev.kreuzberg.KreuzbergException;
import dev.kreuzberg.ExtractionConfig;
import dev.kreuzberg.OcrConfig;
import java.io.IOException;

public class Main {
    public static void main(String[] args) {
        try {
            ExtractionConfig config = ExtractionConfig.builder()
                .ocr(OcrConfig.builder()
                    .backend("tesseract")
                    .language("eng")
                    .build())
                .build();

            ExtractionResult result = Kreuzberg.extractFile("scanned.pdf", config);
            System.out.println(result.getContent());
        } catch (IOException | KreuzbergException e) {
            System.err.println("Extraction failed: " + e.getMessage());
        }
    }
}

Ruby

require 'kreuzberg'

ocr_config = Kreuzberg::OcrConfig.new(
  backend: 'tesseract',
  language: 'eng'
)

config = Kreuzberg::ExtractionConfig.new(ocr: ocr_config)
result = Kreuzberg.extract_file_sync('scanned.pdf', config: config)
puts result.content

R

library(kreuzberg)

# Configure Tesseract OCR
config <- list(
  force_ocr = TRUE,
  ocr = list(backend = "tesseract", language = "eng")
)

# Extract text from a scanned image
json <- extract_file_sync("scan.png", "image/png", config)
result <- jsonlite::fromJSON(json, simplifyVector = FALSE)

cat(sprintf("Extracted %d characters\n", nchar(result$content)))
cat("Content preview:\n")
cat(substr(result$content, 1, 200))

WASM (Browser)

import { enableOcr, extractFromFile, initWasm } from "@kreuzberg/wasm";

await initWasm();
await enableOcr();

const fileInput = document.getElementById("file") as HTMLInputElement;
const file = fileInput.files?.[0];

if (file) {
  const result = await extractFromFile(file, file.type, {
    ocr: {
      backend: "kreuzberg-tesseract",
      language: "eng",
    },
  });
  console.log(result.content);
}

WASM (Node.js / Deno / Bun)

import { enableOcr, extractFile, initWasm } from "@kreuzberg/wasm";

await initWasm();
await enableOcr(); // Uses native kreuzberg-tesseract backend

const result = await extractFile("./scanned_document.png", "image/png", {
  ocr: {
    backend: "kreuzberg-tesseract",
    language: "eng",
  },
});
console.log(result.content);

Multiple Languages¶

Specify multiple language codes separated by + (Tesseract) or as a list (EasyOCR/PaddleOCR):


Python

from kreuzberg import extract_file_sync, ExtractionConfig, OcrConfig

config: ExtractionConfig = ExtractionConfig(
    ocr=OcrConfig(backend="tesseract", language="eng+deu+fra")
)

result = extract_file_sync("multilingual.pdf", config=config)

content: str = result.content
preview: str = content[:100]
total_length: int = len(content)

print(f"Extracted content (preview): {preview}")
print(f"Total characters: {total_length}")

TypeScript

import { extractFileSync } from "@kreuzberg/node";

const config = {
  ocr: {
    backend: "tesseract",
    language: "eng+deu+fra",
  },
};

const result = extractFileSync("multilingual.pdf", null, config);
console.log(result.content);

Rust

use kreuzberg::{extract_file_sync, ExtractionConfig, OcrConfig};

fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig {
        ocr: Some(OcrConfig {
            backend: "tesseract".to_string(),
            language: "eng+deu+fra".to_string(),
            ..Default::default()
        }),
        ..Default::default()
    };

    let result = extract_file_sync("multilingual.pdf", None, &config)?;
    println!("{}", result.content);
    Ok(())
}

Go

package main

import (
    "log"

    "github.com/kreuzberg-dev/kreuzberg/packages/go/v5"
)

func main() {
    result, err := kreuzberg.ExtractFileSync("multilingual.pdf", nil, kreuzberg.ExtractionConfig{
        Ocr: &kreuzberg.OcrConfig{
            Backend:  "tesseract",
            Language: "eng+deu+fra",
        },
    })
    if err != nil {
        log.Fatalf("extract failed: %v", err)
    }

    log.Println(result.Content)
}

Java

import dev.kreuzberg.Kreuzberg;
import dev.kreuzberg.ExtractionResult;
import dev.kreuzberg.ExtractionConfig;
import dev.kreuzberg.OcrConfig;

ExtractionConfig config = ExtractionConfig.builder()
    .ocr(OcrConfig.builder()
        .backend("tesseract")
        .language("eng+deu+fra")
        .build())
    .build();

ExtractionResult result = Kreuzberg.extractFile("multilingual.pdf", config);
System.out.println(result.getContent());

Ruby

require 'kreuzberg'

config = Kreuzberg::ExtractionConfig.new(
  ocr: Kreuzberg::OcrConfig.new(
    backend: 'tesseract',
    language: 'eng+deu+fra'
  )
)

result = Kreuzberg.extract_file_sync('multilingual.pdf', config: config)
puts result.content

R

library(kreuzberg)

# Configure multi-language OCR (English, French, German)
config <- list(
  force_ocr = TRUE,
  ocr = list(backend = "tesseract", language = "eng+fra+deu")
)

# Extract from a multilingual document
json <- extract_file_sync("multilingual.png", "image/png", config)
result <- jsonlite::fromJSON(json, simplifyVector = FALSE)

cat(sprintf("Detected language: %s\n", result$detected_language))
cat(sprintf("Extracted %d characters\n", nchar(result$content)))
cat("Content preview:\n")
cat(substr(result$content, 1, 200))

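Wasm

import { enableOcr, extractFromFile, initWasm } from '@kreuzberg/wasm';

await initWasm();
await enableOcr();

// Assumes an <input type="file" id="file"> element, as in the Basic OCR example
const fileInput = document.getElementById('file') as HTMLInputElement;
const file = fileInput.files?.[0];
if (file) {
  const result = await extractFromFile(file, file.type, {
    ocr: { backend: 'tesseract-wasm', language: 'eng+deu' },
  });
  console.log(result.content);
}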
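The two language formats described at the top of this section can be produced by one small helper. This is a sketch only: `language_spec` is a hypothetical name, and the backend strings follow the Python examples on this page. Note that the examples use three-letter codes for Tesseract (eng) and two-letter codes for EasyOCR and PaddleOCR (en).

```python
def language_spec(backend: str, codes: list[str]):
    """Format language codes the way the given backend expects them."""
    if not codes:
        raise ValueError("at least one language code is required")
    if backend == "tesseract":
        # Tesseract takes a single '+'-separated string, e.g. "eng+deu+fra".
        return "+".join(codes)
    # EasyOCR and PaddleOCR take a list of codes.
    return list(codes)


print(language_spec("tesseract", ["eng", "deu", "fra"]))  # eng+deu+fra
print(language_spec("easyocr", ["en", "de"]))             # ['en', 'de']
```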

Force OCR¶

Process PDFs with OCR even when they have a text layer:


Python

from kreuzberg import extract_file_sync, ExtractionConfig, OcrConfig

config: ExtractionConfig = ExtractionConfig(
    ocr=OcrConfig(backend="tesseract"),
    force_ocr=True,
)

result = extract_file_sync("document.pdf", config=config)

content: str = result.content
preview: str = content[:100]
total_length: int = len(content)

print(f"Extracted content (preview): {preview}")
print(f"Total characters: {total_length}")

TypeScript

import { extractFileSync } from "@kreuzberg/node";

const config = {
  ocr: {
    backend: "tesseract",
  },
  forceOcr: true,
};

const result = extractFileSync("document.pdf", null, config);
console.log(result.content);

Rust

use kreuzberg::{extract_file_sync, ExtractionConfig, OcrConfig};

fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig {
        ocr: Some(OcrConfig {
            backend: "tesseract".to_string(),
            ..Default::default()
        }),
        force_ocr: true,
        ..Default::default()
    };

    let result = extract_file_sync("document.pdf", None, &config)?;
    println!("{}", result.content);
    Ok(())
}

Go

package main

import (
    "fmt"
    "log"

    "github.com/kreuzberg-dev/kreuzberg/packages/go/v5"
)

func main() {
    result, err := kreuzberg.ExtractFileSync("document.pdf", nil, kreuzberg.ExtractionConfig{
        Ocr: &kreuzberg.OcrConfig{
            Backend: "tesseract",
        },
        ForceOcr: true,
    })
    if err != nil {
        log.Fatalf("extract failed: %v", err)
    }

    fmt.Println(result.Content)
}

Java

import dev.kreuzberg.Kreuzberg;
import dev.kreuzberg.ExtractionResult;
import dev.kreuzberg.ExtractionConfig;
import dev.kreuzberg.OcrConfig;

ExtractionConfig config = ExtractionConfig.builder()
    .ocr(OcrConfig.builder()
        .backend("tesseract")
        .build())
    .forceOcr(true)
    .build();

ExtractionResult result = Kreuzberg.extractFile("document.pdf", config);
System.out.println(result.getContent());

Ruby

require 'kreuzberg'

config = Kreuzberg::ExtractionConfig.new(
  ocr: Kreuzberg::OcrConfig.new(backend: 'tesseract'),
  force_ocr: true
)

result = Kreuzberg.extract_file_sync('document.pdf', config: config)
puts result.content

R

library(kreuzberg)

config <- list(force_ocr = TRUE)

json <- extract_file_sync("multipage_document.pdf", "application/pdf", config)
result <- jsonlite::fromJSON(json, simplifyVector = FALSE)

cat(sprintf("Total pages: %d\n", length(result$pages)))
cat(sprintf("Content extracted via OCR: %d characters\n", nchar(result$content)))
cat(sprintf("Detected language: %s\n", result$detected_language))

Using EasyOCR¶


Python

from kreuzberg import extract_file_sync, ExtractionConfig, OcrConfig

config: ExtractionConfig = ExtractionConfig(
    ocr=OcrConfig(backend="easyocr", language="en")
)

# EasyOCR-specific options (use_gpu, beam_width, etc.) go in easyocr_kwargs,
# not in OcrConfig — OcrConfig only accepts backend, language, and backend-specific configs.
result = extract_file_sync("scanned.pdf", config=config, easyocr_kwargs={"use_gpu": True})

content: str = result.content
preview: str = content[:100]
total_length: int = len(content)

print(f"Extracted content (preview): {preview}")
print(f"Total characters: {total_length}")

TypeScript

EasyOCR is only available in Python.

Rust

use kreuzberg::{extract_file, ExtractionConfig, OcrConfig};

#[tokio::main]
async fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig {
        ocr: Some(OcrConfig {
            backend: "easyocr".to_string(),
            language: "en".to_string(),
            ..Default::default()
        }),
        ..Default::default()
    };

    let result = extract_file("document.pdf", None, &config).await?;
    println!("Extracted text: {}", result.content);
    Ok(())
}

Disable OCR¶

Added in v4.7

When disable_ocr is set, image files return empty content instead of raising MissingDependencyError:


disable_ocr.py

from kreuzberg import ExtractionConfig, extract_file_sync

config = ExtractionConfig(disable_ocr=True)
result = extract_file_sync("scanned.png", config=config)
# result.content will be empty — OCR was skipped

disable_ocr.ts

import { extractFileSync } from '@kreuzberg/node';

const result = extractFileSync('scanned.png', null, {
  disableOcr: true,
});
// result.content will be empty — OCR was skipped

disable_ocr.rs

use kreuzberg::{ExtractionConfig, extract_file};

let config = ExtractionConfig {
    disable_ocr: true,
    ..Default::default()
};
let result = extract_file("scanned.png", None, &config).await?;
// result.content will be empty — OCR was skipped

Using PaddleOCR¶


Python

from kreuzberg import extract_file_sync, ExtractionConfig, OcrConfig

config: ExtractionConfig = ExtractionConfig(
    ocr=OcrConfig(backend="paddleocr", language="en")  # model_tier="server" for max accuracy
)

result = extract_file_sync("scanned.pdf", config=config)

content: str = result.content
preview: str = content[:100]
total_length: int = len(content)

print(f"Extracted content (preview): {preview}")
print(f"Total characters: {total_length}")

TypeScript

import { extractFileSync } from "@kreuzberg/node";

const config = {
  ocr: {
    backend: "paddle-ocr",
    language: "en",
    // modelTier: 'server', // for max accuracy
  },
};

const result = extractFileSync("scanned.pdf", null, config);
console.log(result.content);

Rust

use kreuzberg::{extract_file, ExtractionConfig, OcrConfig};

#[tokio::main]
async fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig {
        ocr: Some(OcrConfig {
            backend: "paddleocr".to_string(),
            language: "en".to_string(),
            // paddle_ocr_config: Some(serde_json::json!({"model_tier": "server"})), // for max accuracy
            ..Default::default()
        }),
        ..Default::default()
    };

    let result = extract_file("document.pdf", None, &config).await?;
    println!("Extracted text: {}", result.content);
    Ok(())
}

Go

package main

import (
    "log"

    "github.com/kreuzberg-dev/kreuzberg/packages/go/v5"
)

func main() {
    cfg := kreuzberg.ExtractionConfig{
        Ocr: &kreuzberg.OcrConfig{
            Backend:  "paddle-ocr",
            Language: "en",
        },
    }

    result, err := kreuzberg.ExtractFileSync("scanned.pdf", nil, cfg)
    if err != nil {
        log.Fatalf("extract failed: %v", err)
    }
    log.Println(len(result.Content))
}

Java

import dev.kreuzberg.Kreuzberg;
import dev.kreuzberg.ExtractionResult;
import dev.kreuzberg.KreuzbergException;
import dev.kreuzberg.ExtractionConfig;
import dev.kreuzberg.OcrConfig;
import java.io.IOException;

public class Main {
    public static void main(String[] args) {
        try {
            ExtractionConfig config = ExtractionConfig.builder()
                .ocr(OcrConfig.builder()
                    .backend("paddle-ocr")
                    .language("en")
                    // .paddleOcrConfig(PaddleOcrConfig.builder().modelTier("server").build()) // for max accuracy
                    .build())
                .build();

            ExtractionResult result = Kreuzberg.extractFile("scanned.pdf", config);
            System.out.println(result.getContent());
        } catch (IOException | KreuzbergException e) {
            System.err.println("Extraction failed: " + e.getMessage());
        }
    }
}

Ruby

require 'kreuzberg'

config = Kreuzberg::ExtractionConfig.new(
  ocr: Kreuzberg::OcrConfig.new(
    backend: 'paddleocr',
    language: 'en'
    # model_tier: 'server' # for max accuracy
  )
)

result = Kreuzberg.extract_file_sync('scanned.pdf', config: config)
puts result.content[0..100]
puts "Total length: #{result.content.length}"

R

library(kreuzberg)

# Configure PaddleOCR backend (defaults to mobile tier)
config <- list(
  force_ocr = TRUE,
  ocr = list(backend = "paddle-ocr", language = "en")
)

# Extract text from an image using PaddleOCR
json <- extract_file_sync("document.jpg", "image/jpeg", config)
result <- jsonlite::fromJSON(json, simplifyVector = FALSE)

cat(sprintf("Extracted %d characters\n", nchar(result$content)))
cat(sprintf("MIME type: %s\n", result$mime_type))
cat("Content preview:\n")
cat(substr(result$content, 1, 200))

Candle GLM-OCR¶

Added in v5.0.0-rc.18

Native bindings (Rust, Go, TypeScript, Node.js, Java, C#, Ruby, PHP, Elixir)

Built in via the candle-glm-ocr feature flag. The GLM-OCR model downloads automatically on first use (~3 GB) and is cached at ~/.cache/huggingface/.

Cargo.toml (Rust example)

[dependencies]
kreuzberg = { version = "5", features = ["candle-glm-ocr"] }

GPU support:

Metal (macOS) — Default, F32 dtype (BF16 matmul unavailable in candle 0.10)
CUDA (Linux/Windows with NVIDIA GPU) — Auto-detected
CPU fallback — Slowest, but always available

Using Candle GLM-OCR¶

Added in v5.0.0-rc.18

Candle GLM-OCR dispatches by detected layout region using PP-DocLayout-V3. Each region runs through the appropriate task prompt (ocr/table/formula/chart/caption) and outputs are merged into reading-order markdown.
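The paired-mode flow described above can be pictured with a short sketch. Everything here is illustrative: the `Region` type, the region-to-prompt mapping, and the stand-in `fake_recognize` function are assumptions, not the backend's internals. It shows only the idea of choosing a task prompt per region and merging outputs in reading order.

```python
from dataclasses import dataclass
from typing import Callable

# Assumed mapping from layout region label to task prompt.
PROMPT_FOR_REGION = {
    "text": "ocr",
    "table": "table",
    "formula": "formula",
    "chart": "chart",
    "caption": "caption",
}


@dataclass
class Region:
    reading_order: int
    kind: str
    crop: str  # stand-in for the cropped region image


def run_paired(regions: list[Region],
               recognize: Callable[[str, str], str]) -> str:
    """Dispatch each region to its task prompt and merge in reading order."""
    parts = []
    for region in sorted(regions, key=lambda r: r.reading_order):
        task = PROMPT_FOR_REGION.get(region.kind, "ocr")  # unknown kinds fall back to ocr
        parts.append(recognize(task, region.crop))
    return "\n\n".join(parts)


def fake_recognize(task: str, crop: str) -> str:
    """Stand-in for the model call; echoes the task that was chosen."""
    return f"[{task}] {crop}"


regions = [Region(2, "table", "Q1 figures"), Region(1, "text", "Intro")]
print(run_paired(regions, fake_recognize))
```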


candle_glm_ocr.py

from kreuzberg import ExtractionConfig, OcrConfig, extract_file_sync

# Paired mode: per-region dispatch (default)
config = ExtractionConfig(
    force_ocr=True,
    ocr=OcrConfig(
        backend="candle-glm-ocr",
        language="en",
        backend_options={"layout_mode": "paired"},
    ),
)
result = extract_file_sync("document.pdf", config=config)
print(result.content)

# Whole-page mode: single OCR pass over entire page
config_whole = ExtractionConfig(
    force_ocr=True,
    ocr=OcrConfig(
        backend="candle-glm-ocr",
        language="en",
        backend_options={"layout_mode": "whole_page"},
    ),
)
result_whole = extract_file_sync("document.pdf", config=config_whole)

candle-glm-ocr.ts

import { extractFileSync } from '@kreuzberg/node';

// Paired mode: per-region dispatch (default)
const result = extractFileSync('document.pdf', null, {
  forceOcr: true,
  ocr: {
    backend: 'candle-glm-ocr',
    language: 'en',
    backendOptions: { layout_mode: 'paired' },
  },
});
console.log(result.content);

// Whole-page mode
const resultWholePage = extractFileSync('document.pdf', null, {
  forceOcr: true,
  ocr: {
    backend: 'candle-glm-ocr',
    language: 'en',
    backendOptions: { layout_mode: 'whole_page' },
  },
});

candle_glm_ocr.rs

use kreuzberg::{extract_file, ExtractionConfig, OcrConfig};
use serde_json::json;

// Paired mode: per-region dispatch (default)
let config = ExtractionConfig {
    force_ocr: true,
    ocr: Some(OcrConfig {
        backend: "candle-glm-ocr".into(),
        language: "en".into(),
        backend_options: Some(json!({"layout_mode": "paired"})),
        ..Default::default()
    }),
    ..Default::default()
};
let result = extract_file("document.pdf", None, &config).await?;
println!("{}", result.content);

// Whole-page mode
let config_whole = ExtractionConfig {
    force_ocr: true,
    ocr: Some(OcrConfig {
        backend: "candle-glm-ocr".into(),
        language: "en".into(),
        backend_options: Some(json!({"layout_mode": "whole_page"})),
        ..Default::default()
    }),
    ..Default::default()
};
let result_whole = extract_file("document.pdf", None, &config_whole).await?;

Backend options:

Option	Values	Description
`layout_mode`	`"paired"` (default), `"whole_page"`	Paired: dispatch per-region via PP-DocLayout-V3. Whole-page: single OCR pass on entire page.
`task`	`"ocr"` (default), `"table"`, `"formula"`, `"chart"`, `"caption"`	Task prompt for whole-page mode only; ignored in paired mode where the region type determines the prompt.
`device`	`"auto"` (default), `"cpu"`, `"metal"`, `"cuda"`	Device selection. Auto detects Metal on macOS, CUDA on Linux, CPU fallback.
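Because backend_options is a free-form mapping, it can help to check values against the table before passing them on. The following is a minimal sketch of such a check; `normalize_options` is a hypothetical helper, and dropping `task` in paired mode simply mirrors the note that it is ignored there.

```python
LAYOUT_MODES = {"paired", "whole_page"}
TASKS = {"ocr", "table", "formula", "chart", "caption"}
DEVICES = {"auto", "cpu", "metal", "cuda"}


def normalize_options(options: dict) -> dict:
    """Apply the documented defaults and reject values not listed in the table."""
    merged = {"layout_mode": "paired", "task": "ocr", "device": "auto", **options}
    checks = (("layout_mode", LAYOUT_MODES), ("task", TASKS), ("device", DEVICES))
    for key, allowed in checks:
        if merged[key] not in allowed:
            raise ValueError(f"{key}={merged[key]!r} is not one of {sorted(allowed)}")
    if merged["layout_mode"] == "paired":
        # In paired mode the region type determines the prompt, so task is ignored.
        merged.pop("task")
    return merged


print(normalize_options({}))
print(normalize_options({"layout_mode": "whole_page", "task": "table"}))
```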

Candle Hunyuan-OCR¶

Added in v5.0.0-rc.18

Tencent Hunyuan-OCR — vision-language model for comprehensive document parsing with markdown output and multilingual support.

Native bindings (Rust, Go, TypeScript, Node.js, Java, C#, Ruby, PHP, Elixir)

Built in via the candle-hunyuan-ocr feature flag or the candle-vlm-ocr umbrella feature. The model downloads automatically on first use (~3.5 GB) and is cached at ~/.cache/huggingface/.

Cargo.toml (Rust example)

[dependencies]
kreuzberg = { version = "5", features = ["candle-hunyuan-ocr"] }

GPU support:

Metal (macOS) — Default, F32 dtype
CUDA (Linux/Windows with NVIDIA GPU) — Auto-detected
CPU fallback — Slowest, but always available

Using Candle Hunyuan-OCR¶

Added in v5.0.0-rc.18


candle_hunyuan_ocr.py

from kreuzberg import ExtractionConfig, OcrConfig, extract_file_sync

config = ExtractionConfig(
    force_ocr=True,
    ocr=OcrConfig(
        backend="candle-hunyuan-ocr",
        language="en",
        backend_options={"device": "auto", "model_path": "~/.cache/huggingface/"},
    ),
)
result = extract_file_sync("document.pdf", config=config)
print(result.content)

candle-hunyuan-ocr.ts

import { extractFileSync } from '@kreuzberg/node';

const result = extractFileSync('document.pdf', null, {
  forceOcr: true,
  ocr: {
    backend: 'candle-hunyuan-ocr',
    language: 'en',
    backendOptions: { device: 'auto', model_path: '~/.cache/huggingface/' },
  },
});
console.log(result.content);

candle_hunyuan_ocr.rs

use kreuzberg::{extract_file, ExtractionConfig, OcrConfig};
use serde_json::json;

let config = ExtractionConfig {
    force_ocr: true,
    ocr: Some(OcrConfig {
        backend: "candle-hunyuan-ocr".into(),
        language: "en".into(),
        backend_options: Some(json!({"device": "auto", "model_path": "~/.cache/huggingface/"})),
        ..Default::default()
    }),
    ..Default::default()
};
let result = extract_file("document.pdf", None, &config).await?;
println!("{}", result.content);

Terminal

kreuzberg extract document.pdf --force-ocr true --ocr-backend candle-hunyuan-ocr --ocr-backend-options '{"device":"auto","model_path":"~/.cache/huggingface/"}'

Supported languages: English, Chinese, Japanese, Korean, French, German, Spanish, Italian, Portuguese, Russian, Arabic, Hindi, Thai, Vietnamese, and others.

Model source: Download from Hugging Face Hub.

Candle DeepSeek-OCR¶

Added in v5.0.0-rc.18

DeepSeek-OCR — combination of SAM + CLIP encoder fused with Qwen2 decoder and DeepSeek V2 MoE for comprehensive multilingual document understanding. Markdown output.

Native bindings (Rust, Go, TypeScript, Node.js, Java, C#, Ruby, PHP, Elixir)

Built in via the candle-deepseek-ocr feature flag or the candle-vlm-ocr umbrella feature. The model downloads automatically on first use (~4 GB) and is cached at ~/.cache/huggingface/.

Cargo.toml (Rust example)

[dependencies]
kreuzberg = { version = "5", features = ["candle-deepseek-ocr"] }

GPU support:

Metal (macOS) — Default, F32 dtype
CUDA (Linux/Windows with NVIDIA GPU) — Auto-detected
CPU fallback — Slowest, but always available

Using Candle DeepSeek-OCR¶

Added in v5.0.0-rc.18


candle_deepseek_ocr.py

from kreuzberg import ExtractionConfig, OcrConfig, extract_file_sync

config = ExtractionConfig(
    force_ocr=True,
    ocr=OcrConfig(
        backend="candle-deepseek-ocr",
        language="en",
        backend_options={"device": "auto", "model_path": "~/.cache/huggingface/"},
    ),
)
result = extract_file_sync("document.pdf", config=config)
print(result.content)

candle-deepseek-ocr.ts

import { extractFileSync } from '@kreuzberg/node';

const result = extractFileSync('document.pdf', null, {
  forceOcr: true,
  ocr: {
    backend: 'candle-deepseek-ocr',
    language: 'en',
    backendOptions: { device: 'auto', model_path: '~/.cache/huggingface/' },
  },
});
console.log(result.content);

candle_deepseek_ocr.rs

use kreuzberg::{extract_file, ExtractionConfig, OcrConfig};
use serde_json::json;

let config = ExtractionConfig {
    force_ocr: true,
    ocr: Some(OcrConfig {
        backend: "candle-deepseek-ocr".into(),
        language: "en".into(),
        backend_options: Some(json!({"device": "auto", "model_path": "~/.cache/huggingface/"})),
        ..Default::default()
    }),
    ..Default::default()
};
let result = extract_file("document.pdf", None, &config).await?;
println!("{}", result.content);

Terminal

kreuzberg extract document.pdf --force-ocr true --ocr-backend candle-deepseek-ocr --ocr-backend-options '{"device":"auto","model_path":"~/.cache/huggingface/"}'

Supported languages: English, Chinese, Japanese, Korean, French, German, Spanish, Italian, Portuguese, Russian, Arabic, Hindi, Thai, Vietnamese, and others.

Model source: Download from Hugging Face Hub.

Candle PaddleOCR-VL¶

Added in v5.0.0-rc.18

PaddleOCR-VL 1.5 — SigLIP vision encoder + Ernie-4.5 text decoder for lightweight multilingual document understanding. Markdown output.

Native bindings (Rust, Go, TypeScript, Node.js, Java, C#, Ruby, PHP, Elixir)

Built in via the candle-paddleocr-vl-15 feature flag or the candle-vlm-ocr umbrella feature. The model downloads automatically on first use (~2.5 GB) and is cached at ~/.cache/huggingface/.

Cargo.toml (Rust example)

[dependencies]
kreuzberg = { version = "5", features = ["candle-paddleocr-vl-15"] }

GPU support:

Metal (macOS) — Default, F32 dtype
CUDA (Linux/Windows with NVIDIA GPU) — Auto-detected
CPU fallback — Slowest, but always available

Using Candle PaddleOCR-VL¶

Added in v5.0.0-rc.18


candle_paddleocr_vl.py

from kreuzberg import ExtractionConfig, OcrConfig, extract_file_sync

config = ExtractionConfig(
    force_ocr=True,
    ocr=OcrConfig(
        backend="candle-paddleocr-vl-15",
        language="en",
        backend_options={"device": "auto", "model_path": "~/.cache/huggingface/"},
    ),
)
result = extract_file_sync("document.pdf", config=config)
print(result.content)

candle-paddleocr-vl.ts

import { extractFileSync } from '@kreuzberg/node';

const result = extractFileSync('document.pdf', null, {
  forceOcr: true,
  ocr: {
    backend: 'candle-paddleocr-vl-15',
    language: 'en',
    backendOptions: { device: 'auto', model_path: '~/.cache/huggingface/' },
  },
});
console.log(result.content);

candle_paddleocr_vl.rs

use kreuzberg::{extract_file, ExtractionConfig, OcrConfig};
use serde_json::json;

let config = ExtractionConfig {
    force_ocr: true,
    ocr: Some(OcrConfig {
        backend: "candle-paddleocr-vl-15".into(),
        language: "en".into(),
        backend_options: Some(json!({"device": "auto", "model_path": "~/.cache/huggingface/"})),
        ..Default::default()
    }),
    ..Default::default()
};
let result = extract_file("document.pdf", None, &config).await?;
println!("{}", result.content);

Terminal

kreuzberg extract document.pdf --force-ocr true --ocr-backend candle-paddleocr-vl-15 --ocr-backend-options '{"device":"auto","model_path":"~/.cache/huggingface/"}'

Supported languages: English, Chinese, Japanese, Korean, French, German, Spanish, Italian, Portuguese, Russian, and others.

Model source: Download from PaddlePaddle Hub.

Using VLM OCR¶

Use a vision-language model (e.g. GPT-4o, Claude) as the OCR backend — each page is rendered and sent to the VLM. Cloud providers need an API key; local engines (Ollama, etc.) use the ollama/ prefix — see Local LLM Support.
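Model identifiers in the examples for this backend follow a provider/model pattern. The sketch below splits such a string and applies the rule stated above; it assumes, for illustration, that only the ollama/ prefix marks a local engine. Both helper names and the ollama/llava identifier are illustrative and not taken from the Kreuzberg API.

```python
def split_model(model: str) -> tuple[str, str]:
    """Split 'provider/model-name' into (provider, model-name)."""
    provider, sep, name = model.partition("/")
    if not sep or not provider or not name:
        raise ValueError(f"expected 'provider/model', got {model!r}")
    return provider, name


def needs_api_key(model: str) -> bool:
    """Assumption for this sketch: only the ollama/ prefix marks a local engine."""
    provider, _ = split_model(model)
    return provider != "ollama"


print(split_model("openai/gpt-4o-mini"))    # ('openai', 'gpt-4o-mini')
print(needs_api_key("openai/gpt-4o-mini"))  # True
print(needs_api_key("ollama/llava"))        # False
```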


Python

import asyncio
from kreuzberg import extract_file, ExtractionConfig, OcrConfig, LlmConfig

async def main() -> None:
    config = ExtractionConfig(
        force_ocr=True,
        ocr=OcrConfig(
            backend="vlm",
            vlm_config=LlmConfig(model="openai/gpt-4o-mini"),
        ),
    )
    result = await extract_file("scan.pdf", config=config)
    print(result.content)

asyncio.run(main())

TypeScript

import { extractFileSync } from "@kreuzberg/node";

const config = {
  forceOcr: true,
  ocr: {
    backend: "vlm",
    vlmConfig: {
      model: "openai/gpt-4o-mini",
    },
  },
};

const result = extractFileSync("scan.pdf", null, config);
console.log(result.content);

Rust

use kreuzberg::{extract_file, ExtractionConfig, OcrConfig, LlmConfig};

let config = ExtractionConfig {
    force_ocr: true,
    ocr: Some(OcrConfig {
        backend: "vlm".to_string(),
        vlm_config: Some(LlmConfig {
            model: "openai/gpt-4o-mini".to_string(),
            ..Default::default()
        }),
        ..Default::default()
    }),
    ..Default::default()
};
let result = extract_file("scan.pdf", None, &config).await?;

Terminal

kreuzberg extract scan.pdf --force-ocr true --vlm-model openai/gpt-4o-mini

kreuzberg.toml

force_ocr = true

[ocr]
backend = "vlm"

[ocr.vlm_config]
model = "openai/gpt-4o-mini"

For more on VLM OCR, including custom prompts, supported providers, and API key configuration, see LLM Integration.

Tip: GPU Acceleration

EasyOCR and PaddleOCR support GPU acceleration. Set use_gpu=True in your OCR config. PaddleOCR's model_tier="server" gives the best accuracy with GPU.

DPI Configuration¶

Higher DPI improves accuracy but increases processing time and memory.

DPI	Trade-off
150	Fastest — lower accuracy, less memory
300 (default)	Balanced — good accuracy, reasonable speed
600	Best accuracy — slower, more memory
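
To make the memory trade-off concrete, here is a rough back-of-the-envelope estimate of the raster size of one page at each DPI. It is only a sketch: the page size (US Letter), the 8-bit RGB pixel format, and the `raster_megabytes` helper are assumptions for illustration, and the memory Kreuzberg actually uses depends on its rendering pipeline and the backend.

```python
# Rough estimate of the uncompressed raster size for one page at a given DPI.
# Assumptions (not taken from Kreuzberg): US Letter page, 8-bit RGB, no compression.

def raster_megabytes(
    dpi: int,
    width_in: float = 8.5,
    height_in: float = 11.0,
    bytes_per_pixel: int = 3,
) -> float:
    """Return the uncompressed bitmap size in megabytes (10^6 bytes)."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bytes_per_pixel / 1_000_000


for dpi in (150, 300, 600):
    print(f"{dpi} DPI: {raster_megabytes(dpi):.1f} MB per page")
```

Pixel count grows with the square of the DPI, so doubling the DPI roughly quadruples the per-page raster size under these assumptions.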


Python

from kreuzberg import (
    extract_file_sync,
    ExtractionConfig,
    OcrConfig,
    TesseractConfig,
    ImagePreprocessingConfig,
)

config: ExtractionConfig = ExtractionConfig(
    ocr=OcrConfig(
        backend="tesseract",
        tesseract_config=TesseractConfig(
            preprocessing=ImagePreprocessingConfig(target_dpi=300),
        ),
    ),
)

result = extract_file_sync("scanned.pdf", config=config)

content_length: int = len(result.content)
table_count: int = len(result.tables)

print(f"Content length: {content_length} characters")
print(f"Tables detected: {table_count}")

TypeScript

import { extractFileSync } from "@kreuzberg/node";

const config = {
  ocr: {
    backend: "tesseract",
  },
  pdfOptions: {
    extractImages: true,
  },
};

const result = extractFileSync("scanned.pdf", null, config);
console.log(result.content);

Rust

use kreuzberg::{extract_file_sync, ExtractionConfig, OcrConfig, PdfConfig};

fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig {
        ocr: Some(OcrConfig {
            backend: "tesseract".to_string(),
            ..Default::default()
        }),
        pdf_options: Some(PdfConfig {
            dpi: Some(300),
            ..Default::default()
        }),
        ..Default::default()
    };

    let result = extract_file_sync("scanned.pdf", None, &config)?;
    println!("Content length: {} characters", result.content.len());
    Ok(())
}

Go

package main

import (
    "log"

    "github.com/kreuzberg-dev/kreuzberg/packages/go/v5"
)

func main() {
    targetDpi := int32(300)
    result, err := kreuzberg.ExtractFileSync("scanned.pdf", nil, kreuzberg.ExtractionConfig{
        Ocr: &kreuzberg.OcrConfig{
            Backend: "tesseract",
            TesseractConfig: &kreuzberg.TesseractConfig{
                Preprocessing: &kreuzberg.ImagePreprocessingConfig{
                    TargetDpi: &targetDpi,
                },
            },
        },
    })
    if err != nil {
        log.Fatalf("extract failed: %v", err)
    }

    log.Println("content length:", len(result.Content))
}

Java

import dev.kreuzberg.Kreuzberg;
import dev.kreuzberg.ExtractionResult;
import dev.kreuzberg.ExtractionConfig;
import dev.kreuzberg.OcrConfig;
import dev.kreuzberg.ImagePreprocessingConfig;

ExtractionConfig config = ExtractionConfig.builder()
    .ocr(OcrConfig.builder()
        .backend("tesseract")
        .build())
    .imagePreprocessing(ImagePreprocessingConfig.builder()
        .targetDpi(300)
        .build())
    .build();

ExtractionResult result = Kreuzberg.extractFile("scanned.pdf", config);

Ruby

require 'kreuzberg'

config = Kreuzberg::ExtractionConfig.new(
  ocr: Kreuzberg::OcrConfig.new(backend: 'tesseract'),
  pdf: Kreuzberg::PdfConfig.new(dpi: 300)
)

result = Kreuzberg.extract_file_sync('scanned.pdf', config: config)

R

library(kreuzberg)

# Tesseract OCR via the kreuzberg R bindings does not expose a DPI setting in
# the high-level config; PDF rasterization DPI is determined by the pipeline.
# This example demonstrates running Tesseract OCR end-to-end on a PDF.
config <- list(
  force_ocr = TRUE,
  ocr = list(backend = "tesseract", language = "eng")
)

json <- extract_file_sync("document.pdf", "application/pdf", config)
result <- jsonlite::fromJSON(json, simplifyVector = FALSE)

cat(sprintf("Characters extracted: %d\n", nchar(result$content)))

PaddleOCR Script Families¶

PaddleOCR (PP-OCRv5) covers 80+ languages across 11 script families. Recognition models are downloaded on demand from Hugging Face.

Family	Languages
English	English, numbers, punctuation
Chinese	Simplified/Traditional Chinese, Japanese
Latin	French, German, Spanish, Portuguese, Italian, Polish, Dutch, Turkish, Vietnamese, and others
Korean	Korean (Hangul)
Slavic	Russian, Ukrainian, Belarusian, Bulgarian, Serbian, and others
Thai	Thai script
Greek	Greek script
Arabic	Arabic, Persian, Urdu
Devanagari	Hindi, Marathi, Sanskrit, Nepali
Tamil	Tamil script
Telugu	Telugu script

Models are cached locally after first download, so subsequent runs start immediately.

CLI Usage¶

Terminal

# Basic OCR extraction
kreuzberg extract scanned.pdf --ocr true

# Specific language
kreuzberg extract french_doc.pdf --ocr true --ocr-language fra

# Specific backend
kreuzberg extract chinese_doc.pdf --ocr true --ocr-backend paddle-ocr --ocr-language ch

# Force OCR on all pages
kreuzberg extract document.pdf --force-ocr true

# VLM OCR backend
kreuzberg extract handwritten.pdf --force-ocr true --vlm-model openai/gpt-4o-mini

# Use a config file
kreuzberg extract scanned.pdf --config kreuzberg.toml --ocr true

Flag	Description
`--ocr true`	Enable OCR processing
`--ocr-language <code>`	Language code (`eng`, `deu`, `fra`, `ch`, `ja`, `ru`, etc.)
`--ocr-backend <backend>`	Engine: `tesseract`, `paddle-ocr`, `easyocr`, `vlm`, or a Candle backend such as `candle-paddleocr-vl-15`
`--force-ocr true`	OCR all pages regardless of text layer
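`--vlm-model <model>`	VLM model for OCR (for example, `openai/gpt-4o-mini`). Implies `--ocr-backend vlm`
`--ocr-backend-options <json>`	Backend-specific options as a JSON string (for example, `'{"device":"auto"}'`)
`--config <path>`	Load settings from a config file (for example, `kreuzberg.toml`)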
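
When the CLI is driven from a script, it can help to assemble the argument list in one place. The following is a minimal sketch that uses only flags from the table above. The `build_extract_command` helper and its parameter names are hypothetical and not part of Kreuzberg.

```python
import shlex
from typing import Optional


def build_extract_command(
    path: str,
    *,
    language: Optional[str] = None,
    backend: Optional[str] = None,
    force_ocr: bool = False,
    vlm_model: Optional[str] = None,
) -> list[str]:
    """Assemble a `kreuzberg extract` argument list (hypothetical helper)."""
    cmd = ["kreuzberg", "extract", path]
    # The examples above pass either --force-ocr or --ocr, so pick one.
    cmd += ["--force-ocr", "true"] if force_ocr else ["--ocr", "true"]
    if language:
        cmd += ["--ocr-language", language]
    if vlm_model:
        # --vlm-model implies --ocr-backend vlm, so no explicit backend is added.
        cmd += ["--vlm-model", vlm_model]
    elif backend:
        cmd += ["--ocr-backend", backend]
    return cmd


cmd = build_extract_command("chinese_doc.pdf", backend="paddle-ocr", language="ch")
print(shlex.join(cmd))
# To execute: subprocess.run(cmd, check=True, capture_output=True, text=True)
```

Passing a list to `subprocess.run` instead of a shell string avoids quoting problems with file names that contain spaces.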

Troubleshooting¶

Tesseract not found

Install Tesseract and verify it's on your PATH:

Terminal

# macOS
brew install tesseract

# Ubuntu/Debian
sudo apt-get install tesseract-ocr

# Verify
tesseract --version

Language not found

Install the language data pack:

Terminal

# macOS — all languages
brew install tesseract-lang

# Ubuntu/Debian — individual language
sudo apt-get install tesseract-ocr-deu

# Verify
tesseract --list-langs

Poor accuracy

Increase DPI to 600 for better quality
Try a different backend — PaddleOCR and EasyOCR often outperform Tesseract on complex layouts
Specify the correct language code for your document
Use force_ocr=True if a PDF's embedded text layer is low quality
For handwritten text or very poor scans, try the VLM backend with a vision-capable model (see LLM Integration)

Slow processing

Reduce DPI to 150 for faster throughput
Enable GPU acceleration with EasyOCR or PaddleOCR (use_gpu=True)
Use batch extraction to process multiple files concurrently
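
Kreuzberg's own batch extraction API is not shown on this page, so the following is only a generic asyncio sketch of bounded concurrency. `extract_many`, `fake_extract`, and `max_concurrent` are hypothetical names. In real use the injected coroutine would wrap something like `await extract_file(path, config=config)` and return `result.content`.

```python
import asyncio
from typing import Awaitable, Callable, Iterable


async def extract_many(
    paths: Iterable[str],
    extract: Callable[[str], Awaitable[str]],
    max_concurrent: int = 4,
) -> dict[str, str]:
    """Run `extract` over all paths with at most `max_concurrent` in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def worker(path: str) -> tuple[str, str]:
        async with semaphore:
            return path, await extract(path)

    pairs = await asyncio.gather(*(worker(p) for p in paths))
    return dict(pairs)


async def fake_extract(path: str) -> str:
    # Stand-in so the sketch runs without Kreuzberg installed.
    await asyncio.sleep(0.01)
    return f"text from {path}"


results = asyncio.run(
    extract_many(["a.pdf", "b.pdf", "c.pdf"], fake_extract, max_concurrent=2)
)
print(results)
```

Capping the number of concurrent extractions keeps OCR from rendering many pages at once, which matters because each in-flight file holds its own page images in memory.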

Out of memory on large PDFs

Reduce DPI — lower resolution uses significantly less memory
Process pages in smaller batches
Use PaddleOCR's mobile tier (model_tier="mobile") for a smaller memory footprint
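
One way to plan smaller batches is to split the page count into fixed-size ranges up front. This is a sketch only: `page_batches` is a hypothetical helper, and since this page does not show a page-range option, how each range is then fed to Kreuzberg (for example, by splitting the PDF first) is left open.

```python
from typing import Iterator


def page_batches(total_pages: int, batch_size: int) -> Iterator[tuple[int, int]]:
    """Yield 1-based, inclusive (first_page, last_page) ranges covering the document."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for first in range(1, total_pages + 1, batch_size):
        yield first, min(first + batch_size - 1, total_pages)


for first, last in page_batches(total_pages=45, batch_size=20):
    print(f"process pages {first}-{last}")
```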

Next Steps¶

LLM Integration — VLM OCR, structured extraction, and LLM embeddings
Configuration — all configuration options
Extraction Basics — core extraction API and supported formats
Advanced Features — chunking, language detection, embeddings
