Content to Markdown (Kreuzberg)

The Content to Markdown plugin converts documents into markdown text using Kreuzberg, a Java library for document extraction. It supports a wide range of formats including PDF, DOCX, HTML, images (with OCR), PPTX, XLSX, and more.

Java 25 required

Kreuzberg requires Java 25 (or later) to run. If Otoroshi is running on an earlier JDK, Kreuzberg extraction will fail at runtime. Make sure your Otoroshi instance uses JDK 25+.

How it works

Kreuzberg extracts text content from documents and converts it to markdown format. For images and scanned PDFs, it uses Tesseract OCR as the OCR backend.

The plugin can process documents from three sources:

  • URL query parameter: fetches the document from the remote URL given in the url query parameter and converts it
  • JSON body: accepts either a url to fetch or base64-encoded content
  • Raw body: converts the document bytes sent directly as the request body

Plugin configuration

  • Plugin name: Cloud APIM - Content to Markdown
  • Step: Backend call

| Parameter | Type | Default | Description |
|---|---|---|---|
| maxBodySize | number | 10485760 (10 MiB) | Maximum request body size in bytes |
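
Because the limit is expressed in bytes, it can help to check a document's size before sending it. The sketch below is a client-side illustration only and not part of the plugin. It assumes the limit applies to the full request body, so base64-encoded content inside a JSON body counts at its encoded size (4 bytes for every 3 raw bytes). The helper names `base64_size` and `fits_limit` are hypothetical.

```python
MAX_BODY_SIZE = 10_485_760  # default maxBodySize from the table above (10 * 1024 * 1024 bytes)


def base64_size(raw_size: int) -> int:
    """Length in bytes of the padded base64 encoding of raw_size bytes."""
    return 4 * ((raw_size + 2) // 3)


def fits_limit(body_size: int, limit: int = MAX_BODY_SIZE) -> bool:
    """True when a request body of body_size bytes is within the limit (assumed inclusive)."""
    return body_size <= limit


raw_size = 8 * 1024 * 1024  # an 8 MiB document
print(fits_limit(raw_size))               # sent as a raw body
print(fits_limit(base64_size(raw_size)))  # sent as base64 inside JSON (JSON overhead ignored)
```

Under these assumptions, an 8 MiB document fits the default limit as a raw body but not once base64-encoded (11,184,812 bytes), which is one reason to prefer the raw body or URL mode for large files, or to raise maxBodySize.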

Usage

Via URL query parameter

curl "http://myroute.oto.tools:8080?url=https://example.com/document.pdf"

The response body is the markdown content, served with Content-Type: text/markdown. The X-Source-Content-Type header carries the MIME type of the source document:

HTTP/1.1 200 OK
Content-Type: text/markdown; charset=utf-8
X-Source-Content-Type: application/pdf

# Document Title

This is the extracted content...
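
When the document URL itself contains a query string, passing it unencoded in the url parameter becomes ambiguous. The following is a minimal sketch of building the request URL with percent-encoding, assuming the plugin reads the url parameter through standard query-string decoding. The helper name `build_convert_url` and the example document URL are hypothetical.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_convert_url(route: str, document_url: str) -> str:
    """Return the route URL with the document URL as an encoded `url` query parameter."""
    return f"{route}?{urlencode({'url': document_url})}"


request_url = build_convert_url(
    "http://myroute.oto.tools:8080",
    "https://example.com/report.pdf?version=2&lang=en",  # hypothetical document URL
)
print(request_url)

# Decoding the query string gives back the original document URL.
print(parse_qs(urlsplit(request_url).query)["url"][0])
```

The resulting string can be passed to curl as-is, in place of the hand-written URL shown earlier.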

Via JSON body

curl -X POST http://myroute.oto.tools:8080 \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/document.pdf",
    "method": "GET",
    "headers": {
      "Authorization": "Bearer xxx"
    }
  }'

Or with base64-encoded content:

curl -X POST http://myroute.oto.tools:8080 \
  -H "Content-Type: application/json" \
  -d '{
    "content": "JVBERi0xLjQK...",
    "content_type": "application/pdf"
  }'
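
The content value is the document's bytes encoded as standard base64. As a rough sketch of how such a body could be produced on the client side, the snippet below encodes a few bytes and serializes them into the JSON shape shown above. The helper name `build_content_payload` is hypothetical, and the nine sample bytes stand in for a real PDF file.

```python
import base64
import json


def build_content_payload(document: bytes, content_type: str) -> str:
    """Serialize document bytes into a JSON body with `content` and `content_type`."""
    return json.dumps({
        "content": base64.b64encode(document).decode("ascii"),
        "content_type": content_type,
    })


# Sample bytes only: the start of a PDF header, not a complete document.
payload = build_content_payload(b"%PDF-1.4\n", "application/pdf")
print(payload)
```

For a real file, read it in binary mode (for example `open("document.pdf", "rb").read()`) and pass the bytes to the same function.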

Via raw body

curl -X POST http://myroute.oto.tools:8080 \
  -H "Content-Type: application/pdf" \
  --data-binary @document.pdf
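
The same raw-body call can be prepared from a script. This is a minimal sketch using only the Python standard library: it builds the request object without sending it, so it runs without a live Otoroshi route. The helper name `build_raw_request` is hypothetical.

```python
import urllib.request


def build_raw_request(route: str, document: bytes, content_type: str) -> urllib.request.Request:
    """Prepare a POST request whose body is the document itself."""
    return urllib.request.Request(
        route,
        data=document,
        headers={"Content-Type": content_type},
        method="POST",
    )


# Sample bytes only; a real call would pass the full file contents.
request = build_raw_request("http://myroute.oto.tools:8080", b"%PDF-1.4\n", "application/pdf")
print(request.get_method(), request.full_url)

# To actually send it against a running route:
# markdown = urllib.request.urlopen(request).read().decode("utf-8")
```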

Supported formats

Kreuzberg supports all formats handled by the underlying Java libraries:

| Format | MIME type | Notes |
|---|---|---|
| PDF | application/pdf | Text extraction + OCR for scanned pages |
| DOCX | application/vnd.openxmlformats-officedocument.wordprocessingml.document | |
| PPTX | application/vnd.openxmlformats-officedocument.presentationml.presentation | |
| XLSX | application/vnd.openxmlformats-officedocument.spreadsheetml.sheet | |
| HTML | text/html | Converts to clean markdown |
| Plain text | text/plain | Passed through |
| Images | image/png, image/jpeg, image/tiff, etc. | OCR via Tesseract |
| And more... | Various | Any format supported by Kreuzberg |
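
When sending a raw body or base64 content, the caller has to supply one of these MIME types. One possible way to pick it from a file name is sketched below. The mapping copies the MIME types from the table; the file extensions and the helper name `content_type_for` are illustrative choices and not part of Kreuzberg or the plugin.

```python
from pathlib import Path

# MIME types copied from the table above, keyed by common file extensions.
MIME_BY_EXTENSION = {
    ".pdf": "application/pdf",
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    ".html": "text/html",
    ".txt": "text/plain",
    ".png": "image/png",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".tiff": "image/tiff",
}


def content_type_for(filename: str) -> str:
    """Return the MIME type for a file name, or raise ValueError if it is not in the mapping."""
    suffix = Path(filename).suffix.lower()
    try:
        return MIME_BY_EXTENSION[suffix]
    except KeyError:
        raise ValueError(f"no MIME type known for {filename!r}") from None


print(content_type_for("Quarterly Report.PDF"))
print(content_type_for("slides.pptx"))
```

Formats outside this short mapping would need their own entries; the sketch deliberately fails loudly instead of guessing.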

Workflow function

Kreuzberg is also available as a workflow function for use in Otoroshi workflows:

  • Function name: extensions.com.cloud-apim.llm-extension.content_to_markdown

{
  "kind": "call",
  "function": "extensions.com.cloud-apim.llm-extension.content_to_markdown",
  "args": {
    "url": "https://example.com/document.pdf",
    "method": "GET",
    "headers": {
      "Authorization": "Bearer xxx"
    }
  },
  "result": "markdown_content"
}

Parameters

| Parameter | Type | Description |
|---|---|---|
| url | string | URL of the document to fetch and convert |
| method | string | HTTP method for fetching (default: GET) |
| headers | object | HTTP headers as key-value pairs |
| content | string | Base64-encoded document content (alternative to url) |
| content_type | string | MIME type of the content (required when using content) |
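
The parameter rules above can be checked before a workflow node is generated. The sketch below builds a call node in the shape shown earlier and rejects combinations the table rules out. Treating url and content as mutually exclusive is an assumption drawn from the "alternative to url" wording, and the helper name `build_call_node` is hypothetical.

```python
import json
from typing import Optional

FUNCTION_NAME = "extensions.com.cloud-apim.llm-extension.content_to_markdown"


def build_call_node(
    result: str,
    url: Optional[str] = None,
    content: Optional[str] = None,
    content_type: Optional[str] = None,
    method: str = "GET",
    headers: Optional[dict] = None,
) -> dict:
    """Build a workflow call node for either a url or base64 content."""
    # Assumption: exactly one of url / content is expected.
    if (url is None) == (content is None):
        raise ValueError("provide exactly one of url or content")
    if content is not None and content_type is None:
        raise ValueError("content_type is required when using content")
    if url is not None:
        args = {"url": url, "method": method, "headers": headers or {}}
    else:
        args = {"content": content, "content_type": content_type}
    return {"kind": "call", "function": FUNCTION_NAME, "args": args, "result": result}


node = build_call_node("markdown_content", content="JVBERi0xLjQK", content_type="application/pdf")
print(json.dumps(node, indent=2))
```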

Output

{
  "content": "# Document Title\n\nExtracted markdown content...",
  "source_type": "application/pdf"
}

Agent built-in tool

Kreuzberg is also available as an agent built-in tool called content_to_markdown. Enable it in the agent's built_in_tools configuration:

{
  "built_in_tools": {
    "content_to_markdown": true
  }
}

The agent can then call the tool with either a url parameter or a content and content_type pair. See the AI Agent node documentation for details.