Supported Formats

Complete reference for document formats supported by GateFlow OCR.

Overview

GateFlow OCR extracts text from PDFs, images, Office documents, plain text and markup files, and email archives, with specialized handling for complex layouts, tables, and multi-page documents.

Document Formats

PDF Documents

Format	Extension	OCR Required	Notes
Native PDF	`.pdf`	No	Text directly extractable
Scanned PDF	`.pdf`	Yes	Image-based, requires OCR
Mixed PDF	`.pdf`	Partial	Both native and scanned pages
PDF/A	`.pdf`	Varies	Archival format, full support
Encrypted PDF	`.pdf`	No*	Requires password

python

from openai import OpenAI

client = OpenAI(
    base_url="https://api.gateflow.ai/v1",
    api_key="gw_prod_..."
)

# Process a PDF document
response = client.post(
    "/data/ocr",
    files={"file": open("document.pdf", "rb")},
    data={
        "detect_type": True,  # Auto-detect native vs scanned
        "extract_tables": True,
        "preserve_layout": True
    }
)

print(f"Document type: {response['document_type']}")
print(f"Pages: {response['page_count']}")
print(f"OCR applied: {response['ocr_applied']}")

Image Formats

Format	Extension	Best For	Max Resolution
JPEG	`.jpg`, `.jpeg`	Photos, scans	10000x10000
PNG	`.png`	Screenshots, diagrams	10000x10000
TIFF	`.tif`, `.tiff`	High-quality scans	10000x10000
WebP	`.webp`	Web images	10000x10000
BMP	`.bmp`	Legacy scans	10000x10000
HEIC	`.heic`	iOS photos	10000x10000

python

# Process an image
response = client.post(
    "/data/ocr",
    files={"file": open("scan.jpg", "rb")},
    data={
        "deskew": True,       # Correct rotation
        "denoise": True,      # Remove noise
        "enhance_contrast": True
    }
)

Office Documents

Format	Extension	Native Text	OCR for Images
Word	`.docx`	Yes	Embedded images
Word (Legacy)	`.doc`	Yes	Embedded images
Excel	`.xlsx`	Yes	Embedded images
Excel (Legacy)	`.xls`	Yes	Embedded images
PowerPoint	`.pptx`	Yes	Slides as needed
PowerPoint (Legacy)	`.ppt`	Yes	Slides as needed

python

# Process a Word document with embedded images
response = client.post(
    "/data/ocr",
    files={"file": open("report.docx", "rb")},
    data={
        "ocr_embedded_images": True,
        "preserve_formatting": True
    }
)

Plain Text and Markup

Format	Extension	Processing
Plain Text	`.txt`	Direct extraction
Markdown	`.md`	Structure preserved
HTML	`.html`, `.htm`	Tags stripped, structure kept
XML	`.xml`	Content extraction
CSV	`.csv`	Table structure
JSON	`.json`	Data extraction

Email Formats

Format	Extension	Attachments
EML	`.eml`	Processed recursively
MSG	`.msg`	Processed recursively
MBOX	`.mbox`	Multiple emails

python

# Process email with attachments
response = client.post(
    "/data/ocr",
    files={"file": open("email.eml", "rb")},
    data={
        "process_attachments": True,
        "max_attachment_depth": 2,
        "attachment_types": ["pdf", "docx", "jpg"]
    }
)

print("Email body extracted")
print(f"Attachments processed: {response['attachments_count']}")
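The extension tables in this section can be collected into a client-side pre-check that runs before a file is uploaded. The following is a minimal sketch: the extensions are copied from the tables above, while the category names and the `format_category` helper are hypothetical and not part of the GateFlow API.

```python
from pathlib import Path

# Extensions copied from the format tables on this page.
# The category names are illustrative labels, not API values.
SUPPORTED_EXTENSIONS = {
    "pdf": {".pdf"},
    "image": {".jpg", ".jpeg", ".png", ".tif", ".tiff", ".webp", ".bmp", ".heic"},
    "office": {".docx", ".doc", ".xlsx", ".xls", ".pptx", ".ppt"},
    "text": {".txt", ".md", ".html", ".htm", ".xml", ".csv", ".json"},
    "email": {".eml", ".msg", ".mbox"},
}


def format_category(filename):
    """Return the category for a filename, or None if its extension is not listed."""
    extension = Path(filename).suffix.lower()
    for category, extensions in SUPPORTED_EXTENSIONS.items():
        if extension in extensions:
            return category
    return None
```

A `None` result corresponds to the case the `unsupported_format` error describes, so the file can be converted before any request is sent. The check looks only at the file name, not at the file contents.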

Image Quality Requirements

Minimum Requirements

Parameter	Minimum	Recommended	Optimal
Resolution	150 DPI	300 DPI	600 DPI
Text height	8px	20px	32px+
Contrast ratio	2:1	4:1	7:1+
Skew angle	< 15°	< 5°	< 1°
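As a rough illustration of the resolution row, the effective DPI of a scan can be estimated from its pixel width and the physical width of the page. This sketch assumes the physical size is known; the helper names `scan_dpi` and `resolution_tier` are hypothetical, and the thresholds are taken from the table above.

```python
def scan_dpi(pixels, inches):
    """Estimate dots per inch from a pixel dimension and the physical size in inches."""
    if inches <= 0:
        raise ValueError("inches must be positive")
    return pixels / inches


def resolution_tier(dpi):
    """Map a DPI value onto the tiers in the Minimum Requirements table."""
    if dpi >= 600:
        return "optimal"
    if dpi >= 300:
        return "recommended"
    if dpi >= 150:
        return "minimum"
    return "below minimum"
```

For example, a US Letter page (8.5 inches wide) scanned at 2550 pixels wide works out to 300 DPI, which falls in the recommended tier.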

Quality Preprocessing

python

# Enable automatic quality enhancement
response = client.post(
    "/data/ocr",
    files={"file": open("poor_scan.jpg", "rb")},
    data={
        "preprocessing": {
            "deskew": True,
            "denoise": True,
            "binarize": True,
            "remove_borders": True,
            "enhance_resolution": True  # AI upscaling
        }
    }
)

print(f"Quality score before: {response['quality']['original']}")
print(f"Quality score after: {response['quality']['enhanced']}")

Language Support

Supported Languages

GateFlow OCR supports 100+ languages including:

Category	Languages
Latin	English, Spanish, French, German, Italian, Portuguese, Dutch, Polish, Czech, etc.
Cyrillic	Russian, Ukrainian, Bulgarian, Serbian, etc.
Asian	Chinese (Simplified & Traditional), Japanese, Korean, Thai, Vietnamese
Middle Eastern	Arabic, Hebrew, Persian, Urdu
South Asian	Hindi, Bengali, Tamil, Telugu, Gujarati
Other	Greek, Armenian, Georgian, etc.

python

# Specify language for better accuracy
response = client.post(
    "/data/ocr",
    files={"file": open("german_doc.pdf", "rb")},
    data={
        "languages": ["de", "en"],  # Primary and secondary
        "detect_language": True
    }
)

print(f"Detected languages: {response['detected_languages']}")

Multi-Language Documents

python

# Process document with multiple languages
response = client.post(
    "/data/ocr",
    files={"file": open("multilingual.pdf", "rb")},
    data={
        "languages": ["en", "es", "fr", "de"],
        "per_page_detection": True
    }
)

for page in response["pages"]:
    print(f"Page {page['number']}: {page['detected_language']}")
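Per-page results can be summarized to see which languages dominate a document. The sketch below assumes only the `detected_language` field used in the loop above; `language_counts` is a hypothetical helper.

```python
from collections import Counter


def language_counts(pages):
    """Count how many pages were detected in each language."""
    return Counter(page["detected_language"] for page in pages)
```

Calling `language_counts(response["pages"]).most_common(2)` would give the two most frequent languages, which could then be passed as the `languages` list on a later request.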

File Size Limits

Plan	Max File Size	Max Pages	Max Resolution
Free	10 MB	10 pages	2000x2000
Pro	50 MB	100 pages	5000x5000
Enterprise	500 MB	1000 pages	10000x10000
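The size and page limits in the table can be checked locally before uploading. This is a sketch under stated assumptions: 1 MB is taken to be 1024 × 1024 bytes (the page does not say which definition applies), the plan keys are lowercase versions of the plan names, and `limit_violations` is a hypothetical helper.

```python
MB = 1024 * 1024  # Assumption: binary megabytes; the API may count decimal megabytes.

# Limits copied from the File Size Limits table.
PLAN_LIMITS = {
    "free": {"max_bytes": 10 * MB, "max_pages": 10},
    "pro": {"max_bytes": 50 * MB, "max_pages": 100},
    "enterprise": {"max_bytes": 500 * MB, "max_pages": 1000},
}


def limit_violations(plan, size_bytes, page_count):
    """Return the limits ("size", "pages") that a document exceeds for a plan."""
    limits = PLAN_LIMITS[plan]
    violations = []
    if size_bytes > limits["max_bytes"]:
        violations.append("size")
    if page_count > limits["max_pages"]:
        violations.append("pages")
    return violations
```

An empty list means the document fits within the plan. A non-empty list suggests splitting the document or upgrading, as the `too_large` error recommends.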

Handling Large Documents

python

# Process large document in chunks
response = client.post(
    "/data/ocr",
    files={"file": open("large_doc.pdf", "rb")},
    data={
        "page_range": "1-100",  # First 100 pages
        "async": True           # Process asynchronously
    }
)

job_id = response["job_id"]

# Check status
status = client.get(f"/data/ocr/jobs/{job_id}")
print(f"Status: {status['status']}")
print(f"Progress: {status['pages_processed']}/{status['total_pages']}")
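To cover a whole document in chunks, the `page_range` strings can be generated up front. The sketch below assumes the `"start-end"` format shown in the example above is 1-indexed and inclusive; `page_ranges` is a hypothetical helper, not an API function.

```python
def page_ranges(total_pages, chunk_size=100):
    """Split a document into inclusive, 1-indexed "start-end" page ranges."""
    if total_pages < 0 or chunk_size < 1:
        raise ValueError("total_pages must be >= 0 and chunk_size >= 1")
    ranges = []
    for start in range(1, total_pages + 1, chunk_size):
        end = min(start + chunk_size - 1, total_pages)
        ranges.append(f"{start}-{end}")
    return ranges
```

Each string can then be passed as `page_range` in its own request, so a 250-page document becomes three jobs of at most 100 pages each.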

Output Formats

Text Output

python

# Get plain text output
response = client.post(
    "/data/ocr",
    files={"file": open("document.pdf", "rb")},
    data={"output_format": "text"}
)

print(response["text"])

Structured Output

python

# Get structured output with layout
response = client.post(
    "/data/ocr",
    files={"file": open("document.pdf", "rb")},
    data={"output_format": "structured"}
)

for page in response["pages"]:
    print(f"\n--- Page {page['number']} ---")
    for block in page["blocks"]:
        print(f"[{block['type']}] {block['text'][:100]}...")

JSON Output

python

# Get full JSON output
response = client.post(
    "/data/ocr",
    files={"file": open("document.pdf", "rb")},
    data={
        "output_format": "json",
        "include_coordinates": True,
        "include_confidence": True
    }
)

# Response includes bounding boxes and confidence scores
for word in response["pages"][0]["words"]:
    print(f"{word['text']} ({word['confidence']:.2f})")
    print(f"  Position: {word['bbox']}")
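Confidence scores are most useful when they drive a review step. The following sketch filters the `words` list from the example above; the 0.8 threshold is an arbitrary assumption rather than a documented recommendation, and `low_confidence_words` is a hypothetical helper.

```python
def low_confidence_words(words, threshold=0.8):
    """Return the words whose confidence falls below the threshold."""
    return [word for word in words if word["confidence"] < threshold]
```

Running this on `response["pages"][0]["words"]` yields the entries to route to manual review, and each entry keeps its `bbox` so the region can be highlighted on the page.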

Markdown Output

python

# Get Markdown output preserving structure
response = client.post(
    "/data/ocr",
    files={"file": open("document.pdf", "rb")},
    data={
        "output_format": "markdown",
        "preserve_headings": True,
        "preserve_lists": True,
        "preserve_tables": True
    }
)

print(response["markdown"])

Error Handling

Common Format Errors

Error	Cause	Solution
`unsupported_format`	File type not supported	Convert to supported format
`corrupted_file`	File is damaged	Re-upload or repair file
`password_protected`	PDF requires password	Provide password parameter
`too_large`	Exceeds size limit	Split document or upgrade plan
`low_quality`	Image quality too poor	Enhance or rescan

python

# Handle format errors
try:
    response = client.post(
        "/data/ocr",
        files={"file": open("document.pdf", "rb")}
    )
except Exception as e:
    if "password_protected" in str(e):
        # Retry with the PDF password (placeholder value shown)
        response = client.post(
            "/data/ocr",
            files={"file": open("document.pdf", "rb")},
            data={"password": "secret123"}
        )
    else:
        # Do not swallow errors that a password cannot fix
        raise
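The error table can also be turned into a lookup so that callers see a suggested action for any of the listed codes. This sketch assumes, as the example above does, that the error code appears in the exception message; `suggest_fix` is a hypothetical helper.

```python
# Codes and solutions copied from the Common Format Errors table.
ERROR_SOLUTIONS = {
    "unsupported_format": "Convert to supported format",
    "corrupted_file": "Re-upload or repair file",
    "password_protected": "Provide password parameter",
    "too_large": "Split document or upgrade plan",
    "low_quality": "Enhance or rescan",
}


def suggest_fix(error):
    """Return the documented solution for an error, or None if the code is not listed."""
    message = str(error)
    for code, solution in ERROR_SOLUTIONS.items():
        if code in message:
            return solution
    return None
```

Inside an `except` block, `suggest_fix(e)` can be logged alongside the original exception before it is re-raised.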

Best Practices

Use native PDFs when possible - Faster and more accurate than OCR
Scan at 300 DPI or higher - 150 DPI is the minimum, but 300 DPI gives better OCR accuracy
Specify languages - Improves recognition accuracy
Enable preprocessing - Helps with poor quality scans
Use async for large files - Prevents timeouts
Check confidence scores - Flag low-confidence text for review

Next Steps

Mistral Document AI - OCR engine details
Table Extraction - Extract tables
Document Ingestion - Full pipeline
