
Supported Formats

Complete reference for document formats supported by GateFlow OCR.

Overview

GateFlow OCR supports a wide range of document formats for text extraction, with specialized handling for complex layouts, tables, and multi-page documents.

Document Formats

PDF Documents

| Format | Extension | OCR Required | Notes |
| --- | --- | --- | --- |
| Native PDF | .pdf | No | Text directly extractable |
| Scanned PDF | .pdf | Yes | Image-based, requires OCR |
| Mixed PDF | .pdf | Partial | Both native and scanned pages |
| PDF/A | .pdf | Varies | Archival format, full support |
| Encrypted PDF | .pdf | No* | Requires password |

python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.gateflow.ai/v1",
    api_key="gw_prod_..."
)

# Process a PDF document
response = client.post(
    "/data/ocr",
    files={"file": open("document.pdf", "rb")},
    data={
        "detect_type": True,  # Auto-detect native vs scanned
        "extract_tables": True,
        "preserve_layout": True
    }
)

print(f"Document type: {response['document_type']}")
print(f"Pages: {response['page_count']}")
print(f"OCR applied: {response['ocr_applied']}")

Image Formats

| Format | Extension | Best For | Max Resolution |
| --- | --- | --- | --- |
| JPEG | .jpg, .jpeg | Photos, scans | 10000x10000 |
| PNG | .png | Screenshots, diagrams | 10000x10000 |
| TIFF | .tif, .tiff | High-quality scans | 10000x10000 |
| WebP | .webp | Web images | 10000x10000 |
| BMP | .bmp | Legacy scans | 10000x10000 |
| HEIC | .heic | iOS photos | 10000x10000 |

python
# Process an image
response = client.post(
    "/data/ocr",
    files={"file": open("scan.jpg", "rb")},
    data={
        "deskew": True,       # Correct rotation
        "denoise": True,      # Remove noise
        "enhance_contrast": True
    }
)

Office Documents

| Format | Extension | Native Text | OCR for Images |
| --- | --- | --- | --- |
| Word | .docx | Yes | Embedded images |
| Word (Legacy) | .doc | Yes | Embedded images |
| Excel | .xlsx | Yes | Embedded images |
| Excel (Legacy) | .xls | Yes | Embedded images |
| PowerPoint | .pptx | Yes | Slides as needed |
| PowerPoint (Legacy) | .ppt | Yes | Slides as needed |

python
# Process a Word document with embedded images
response = client.post(
    "/data/ocr",
    files={"file": open("report.docx", "rb")},
    data={
        "ocr_embedded_images": True,
        "preserve_formatting": True
    }
)

Plain Text and Markup

| Format | Extension | Processing |
| --- | --- | --- |
| Plain Text | .txt | Direct extraction |
| Markdown | .md | Structure preserved |
| HTML | .html, .htm | Tags stripped, structure kept |
| XML | .xml | Content extraction |
| CSV | .csv | Table structure |
| JSON | .json | Data extraction |

Email Formats

| Format | Extension | Attachments |
| --- | --- | --- |
| EML | .eml | Processed recursively |
| MSG | .msg | Processed recursively |
| MBOX | .mbox | Multiple emails |

python
# Process email with attachments
response = client.post(
    "/data/ocr",
    files={"file": open("email.eml", "rb")},
    data={
        "process_attachments": True,
        "max_attachment_depth": 2,
        "attachment_types": ["pdf", "docx", "jpg"]
    }
)

print("Email body extracted")
print(f"Attachments processed: {response['attachments_count']}")
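
The extension tables in this section can be folded into a small client-side check that runs before upload. The sketch below is only an illustration: `FORMAT_CATEGORIES` and `classify_format` are hypothetical names, the category labels are made up for this example, and the check looks at the file extension only, so it says nothing about what the service detects from the file contents.

```python
from pathlib import Path
from typing import Optional

# Extension -> category, transcribed from the format tables in this section.
# The category names themselves are illustrative, not part of the API.
FORMAT_CATEGORIES = {
    "pdf": {".pdf"},
    "image": {".jpg", ".jpeg", ".png", ".tif", ".tiff", ".webp", ".bmp", ".heic"},
    "office": {".docx", ".doc", ".xlsx", ".xls", ".pptx", ".ppt"},
    "text": {".txt", ".md", ".html", ".htm", ".xml", ".csv", ".json"},
    "email": {".eml", ".msg", ".mbox"},
}


def classify_format(filename: str) -> Optional[str]:
    """Return the format category for a filename, or None if it is not listed."""
    suffix = Path(filename).suffix.lower()  # case-insensitive match
    for category, extensions in FORMAT_CATEGORIES.items():
        if suffix in extensions:
            return category
    return None


for name in ("Report.DOCX", "scan.tiff", "archive.zip"):
    print(name, "->", classify_format(name))
```

A `None` result corresponds to the case where the file would need converting to a listed format before it is submitted.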

Image Quality Requirements

Minimum Requirements

| Parameter | Minimum | Recommended | Optimal |
| --- | --- | --- | --- |
| Resolution | 150 DPI | 300 DPI | 600 DPI |
| Text height | 8px | 20px | 32px+ |
| Contrast ratio | 2:1 | 4:1 | 7:1+ |
| Skew angle | < 15° | < 5° | < 1° |
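
As a rough illustration, the thresholds in this table can be applied to measurements you already have for a scan. The function names below are hypothetical, and the sketch assumes you measure DPI, text height, contrast, and skew yourself; it does not call GateFlow. Boundary handling (values equal to a threshold count as meeting it, skew limits are strict) is also an assumption read off the table's notation.

```python
# Thresholds transcribed from the table: (minimum, recommended, optimal).
THRESHOLDS = {
    "resolution_dpi": (150, 300, 600),
    "text_height_px": (8, 20, 32),
    "contrast_ratio": (2.0, 4.0, 7.0),
}
# Skew is "lower is better": strict upper limits in degrees.
SKEW_LIMITS = (15.0, 5.0, 1.0)


def grade_higher_is_better(value: float, levels: tuple) -> str:
    """Grade a value where larger numbers mean better quality."""
    minimum, recommended, optimal = levels
    if value >= optimal:
        return "optimal"
    if value >= recommended:
        return "recommended"
    if value >= minimum:
        return "minimum"
    return "below minimum"


def grade_skew(angle_degrees: float) -> str:
    """Grade a skew angle; the sign of the rotation is ignored."""
    angle = abs(angle_degrees)
    loosest, recommended, optimal = SKEW_LIMITS
    if angle < optimal:
        return "optimal"
    if angle < recommended:
        return "recommended"
    if angle < loosest:
        return "minimum"
    return "below minimum"


def grade_scan(resolution_dpi, text_height_px, contrast_ratio, skew_degrees) -> dict:
    """Return a per-parameter grade for one scan."""
    measured = {
        "resolution_dpi": resolution_dpi,
        "text_height_px": text_height_px,
        "contrast_ratio": contrast_ratio,
    }
    report = {
        name: grade_higher_is_better(measured[name], levels)
        for name, levels in THRESHOLDS.items()
    }
    report["skew_degrees"] = grade_skew(skew_degrees)
    return report


print(grade_scan(resolution_dpi=300, text_height_px=12,
                 contrast_ratio=7.5, skew_degrees=0.5))
```

Any parameter graded "below minimum" is a candidate for rescanning or for the preprocessing options described next.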

Quality Preprocessing

python
# Enable automatic quality enhancement
response = client.post(
    "/data/ocr",
    files={"file": open("poor_scan.jpg", "rb")},
    data={
        "preprocessing": {
            "deskew": True,
            "denoise": True,
            "binarize": True,
            "remove_borders": True,
            "enhance_resolution": True  # AI upscaling
        }
    }
)

print(f"Quality score before: {response['quality']['original']}")
print(f"Quality score after: {response['quality']['enhanced']}")

Language Support

Supported Languages

GateFlow OCR supports more than 100 languages, including:

| Category | Languages |
| --- | --- |
| Latin | English, Spanish, French, German, Italian, Portuguese, Dutch, Polish, Czech, etc. |
| Cyrillic | Russian, Ukrainian, Bulgarian, Serbian, etc. |
| Asian | Chinese (Simplified & Traditional), Japanese, Korean, Thai, Vietnamese |
| Middle Eastern | Arabic, Hebrew, Persian, Urdu |
| South Asian | Hindi, Bengali, Tamil, Telugu, Gujarati |
| Other | Greek, Armenian, Georgian, etc. |

python
# Specify language for better accuracy
response = client.post(
    "/data/ocr",
    files={"file": open("german_doc.pdf", "rb")},
    data={
        "languages": ["de", "en"],  # Primary and secondary
        "detect_language": True
    }
)

print(f"Detected languages: {response['detected_languages']}")

Multi-Language Documents

python
# Process document with multiple languages
response = client.post(
    "/data/ocr",
    files={"file": open("multilingual.pdf", "rb")},
    data={
        "languages": ["en", "es", "fr", "de"],
        "per_page_detection": True
    }
)

for page in response["pages"]:
    print(f"Page {page['number']}: {page['detected_language']}")
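
Once per-page detection is on, it can be handy to summarize how many pages fall under each language. The following is a small sketch that works on data shaped like the `pages` entries printed above; `pages_per_language` is a hypothetical helper and the sample values are made up.

```python
from collections import Counter


def pages_per_language(pages: list) -> Counter:
    """Count pages by their 'detected_language' value."""
    return Counter(page["detected_language"] for page in pages)


# Sample data shaped like the per-page entries above (values are invented).
sample_pages = [
    {"number": 1, "detected_language": "en"},
    {"number": 2, "detected_language": "es"},
    {"number": 3, "detected_language": "en"},
]

counts = pages_per_language(sample_pages)
for language, page_count in counts.most_common():
    print(f"{language}: {page_count} page(s)")
```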

File Size Limits

| Plan | Max File Size | Max Pages | Max Resolution |
| --- | --- | --- | --- |
| Free | 10 MB | 10 pages | 2000x2000 |
| Pro | 50 MB | 100 pages | 5000x5000 |
| Enterprise | 500 MB | 1000 pages | 10000x10000 |
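
The limits in this table can be checked locally before uploading. This is a sketch under stated assumptions: `PlanLimits`, `PLAN_LIMITS`, and `limit_violations` are hypothetical names, "MB" is taken to mean 2**20 bytes, and the resolution limit is applied to the longer side of the image. The service may count these differently.

```python
from dataclasses import dataclass

MB = 1024 * 1024  # assumption: the table's "MB" is treated as 2**20 bytes


@dataclass(frozen=True)
class PlanLimits:
    max_bytes: int
    max_pages: int
    max_side_px: int


# Values transcribed from the plan table.
PLAN_LIMITS = {
    "free": PlanLimits(10 * MB, 10, 2000),
    "pro": PlanLimits(50 * MB, 100, 5000),
    "enterprise": PlanLimits(500 * MB, 1000, 10000),
}


def limit_violations(plan: str, size_bytes: int, pages: int = 1,
                     width_px: int = 0, height_px: int = 0) -> list:
    """Return the names of the limits a file exceeds (empty list if none)."""
    limits = PLAN_LIMITS[plan.lower()]
    problems = []
    if size_bytes > limits.max_bytes:
        problems.append("file size")
    if pages > limits.max_pages:
        problems.append("page count")
    if max(width_px, height_px) > limits.max_side_px:
        problems.append("resolution")
    return problems


print(limit_violations("free", size_bytes=5 * MB, pages=12))
```

An empty list means the file fits the plan; otherwise the entries point at what to split, downscale, or upgrade.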

Handling Large Documents

python
# Process large document in chunks
response = client.post(
    "/data/ocr",
    files={"file": open("large_doc.pdf", "rb")},
    data={
        "page_range": "1-100",  # First 100 pages
        "async": True           # Process asynchronously
    }
)

job_id = response["job_id"]

# Check status
status = client.get(f"/data/ocr/jobs/{job_id}")
print(f"Status: {status['status']}")
print(f"Progress: {status['pages_processed']}/{status['total_pages']}")
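
The snippet above submits one page range and checks the job once. A fuller workflow usually needs two more pieces: a way to generate the successive `page_range` strings, and a loop that polls until the job finishes. The sketch below shows both with hypothetical helpers `page_ranges` and `wait_for_job`. The status names `"processing"`, `"completed"`, and `"failed"` are assumptions, since this page does not list the possible values.

```python
import time
from typing import Callable, Iterator


def page_ranges(total_pages: int, chunk_size: int = 100) -> Iterator[str]:
    """Yield "start-end" strings in the same form as page_range="1-100"."""
    if total_pages < 0 or chunk_size < 1:
        raise ValueError("total_pages must be >= 0 and chunk_size >= 1")
    for start in range(1, total_pages + 1, chunk_size):
        end = min(start + chunk_size - 1, total_pages)
        yield f"{start}-{end}"


def wait_for_job(fetch_status: Callable[[], dict],
                 interval_s: float = 2.0,
                 timeout_s: float = 600.0,
                 sleep: Callable[[float], None] = time.sleep) -> dict:
    """Poll fetch_status() until a terminal status is seen or time runs out.

    "completed" and "failed" are assumed terminal status names.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch_status()
        if status.get("status") in ("completed", "failed"):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError("OCR job did not finish in time")
        sleep(interval_s)


# Demo with canned statuses standing in for real API calls.
canned = iter([
    {"status": "processing", "pages_processed": 40, "total_pages": 250},
    {"status": "completed", "pages_processed": 250, "total_pages": 250},
])
final = wait_for_job(lambda: next(canned), sleep=lambda _seconds: None)
print(list(page_ranges(250)))
print(final["status"])
```

With a real client, `fetch_status` would wrap the status request shown above, for example `lambda: client.get(f"/data/ocr/jobs/{job_id}")`. Passing `sleep` as a parameter keeps the loop easy to test without waiting.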

Output Formats

Text Output

python
# Get plain text output
response = client.post(
    "/data/ocr",
    files={"file": open("document.pdf", "rb")},
    data={"output_format": "text"}
)

print(response["text"])

Structured Output

python
# Get structured output with layout
response = client.post(
    "/data/ocr",
    files={"file": open("document.pdf", "rb")},
    data={"output_format": "structured"}
)

for page in response["pages"]:
    print(f"\n--- Page {page['number']} ---")
    for block in page["blocks"]:
        print(f"[{block['type']}] {block['text'][:100]}...")

JSON Output

python
# Get full JSON output
response = client.post(
    "/data/ocr",
    files={"file": open("document.pdf", "rb")},
    data={
        "output_format": "json",
        "include_coordinates": True,
        "include_confidence": True
    }
)

# Response includes bounding boxes and confidence scores
for word in response["pages"][0]["words"]:
    print(f"{word['text']} ({word['confidence']:.2f})")
    print(f"  Position: {word['bbox']}")
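
Confidence scores are most useful when something acts on them. The sketch below collects words that fall under a cutoff so they can be sent for manual review. It assumes confidence is a float between 0 and 1 and uses an arbitrary threshold of 0.80; `low_confidence_words` is a hypothetical helper, and the sample data, including the `bbox` layout, is invented to match the field names used above.

```python
def low_confidence_words(pages: list, threshold: float = 0.80) -> list:
    """Return (page_index, text, confidence) for words scoring below threshold."""
    flagged = []
    for page_index, page in enumerate(pages):
        for word in page.get("words", []):
            if word["confidence"] < threshold:
                flagged.append((page_index, word["text"], word["confidence"]))
    return flagged


# Invented sample shaped like response["pages"] in the example above.
sample_pages = [
    {"words": [
        {"text": "Invoice", "confidence": 0.99, "bbox": [10, 10, 90, 30]},
        {"text": "N0.", "confidence": 0.62, "bbox": [95, 10, 120, 30]},
    ]},
]

for page_index, text, confidence in low_confidence_words(sample_pages):
    print(f"page {page_index}: {text!r} ({confidence:.2f})")
```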

Markdown Output

python
# Get Markdown output preserving structure
response = client.post(
    "/data/ocr",
    files={"file": open("document.pdf", "rb")},
    data={
        "output_format": "markdown",
        "preserve_headings": True,
        "preserve_lists": True,
        "preserve_tables": True
    }
)

print(response["markdown"])

Error Handling

Common Format Errors

| Error | Cause | Solution |
| --- | --- | --- |
| unsupported_format | File type not supported | Convert to supported format |
| corrupted_file | File is damaged | Re-upload or repair file |
| password_protected | PDF requires password | Provide password parameter |
| too_large | Exceeds size limit | Split document or upgrade plan |
| low_quality | Image quality too poor | Enhance or rescan |

python
# Handle format errors
try:
    response = client.post(
        "/data/ocr",
        files={"file": open("document.pdf", "rb")}
    )
except Exception as e:
    if "password_protected" in str(e):
        # Retry with the document password
        response = client.post(
            "/data/ocr",
            files={"file": open("document.pdf", "rb")},
            data={"password": "secret123"}
        )
    else:
        # Not a password error: re-raise instead of silently swallowing it
        raise
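
The same string-matching idea can be extended to every code in the error table. This is a minimal sketch, assuming (as the example above does) that the error code appears somewhere in the exception text; `ERROR_ACTIONS`, `find_error_code`, and `suggest_fix` are hypothetical names, not part of the GateFlow client.

```python
from typing import Optional

# Error code -> suggested action, transcribed from the error table.
ERROR_ACTIONS = {
    "unsupported_format": "Convert to supported format",
    "corrupted_file": "Re-upload or repair file",
    "password_protected": "Provide password parameter",
    "too_large": "Split document or upgrade plan",
    "low_quality": "Enhance or rescan",
}


def find_error_code(message: str) -> Optional[str]:
    """Return the first known error code found in the message, if any."""
    for code in ERROR_ACTIONS:
        if code in message:
            return code
    return None


def suggest_fix(error: Exception) -> str:
    """Map an exception to the suggested action from the table."""
    code = find_error_code(str(error))
    if code is None:
        return "Unrecognized error; re-raise or log it"
    return f"{code}: {ERROR_ACTIONS[code]}"


print(suggest_fix(ValueError("400 too_large: file exceeds plan limit")))
print(suggest_fix(ConnectionError("connection reset")))
```

Matching on message text is fragile; if the client exposes a structured error code, comparing against that field would be the safer choice.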

Best Practices

  1. Use native PDFs when possible - Faster and more accurate than OCR
  2. Scan at 300 DPI or higher - The recommended resolution for accurate OCR (150 DPI is the minimum)
  3. Specify languages - Improves recognition accuracy
  4. Enable preprocessing - Helps with poor quality scans
  5. Use async for large files - Prevents timeouts
  6. Check confidence scores - Flag low-confidence text for review
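
Several of these practices can be combined into one place that builds the request options. The sketch below reuses parameter names that appear on this page (`languages`, `preprocessing`, `async`, `output_format`, `include_confidence`), but `recommended_options` itself is a hypothetical helper, and the 10 MB cutoff for switching to async is an arbitrary assumption rather than a documented value.

```python
from pathlib import Path

IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".tif", ".tiff", ".webp", ".bmp", ".heic"}
ASYNC_THRESHOLD_BYTES = 10 * 1024 * 1024  # assumption: not a documented cutoff


def recommended_options(filename: str, size_bytes: int, languages: list) -> dict:
    """Build an options dict that follows the practices listed above."""
    options = {
        "languages": list(languages),   # practice 3: specify languages
        "output_format": "json",
        "include_confidence": True,     # practice 6: enables review of weak words
    }
    if Path(filename).suffix.lower() in IMAGE_EXTENSIONS:
        # practice 4: preprocessing helps with scanned images
        options["preprocessing"] = {"deskew": True, "denoise": True}
    if size_bytes > ASYNC_THRESHOLD_BYTES:
        options["async"] = True         # practice 5: avoid timeouts on large files
    return options


print(recommended_options("scan.tiff", 20 * 1024 * 1024, ["de", "en"]))
```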
