Supported Formats
Complete reference for document formats supported by GateFlow OCR.
Overview
GateFlow OCR supports a wide range of document formats for text extraction, with specialized handling for complex layouts, tables, and multi-page documents.
Document Formats
PDF Documents
| Format | Extension | OCR Required | Notes |
|---|---|---|---|
| Native PDF | .pdf | No | Text directly extractable |
| Scanned PDF | .pdf | Yes | Image-based, requires OCR |
| Mixed PDF | .pdf | Partial | Both native and scanned pages |
| PDF/A | .pdf | Varies | Archival format, full support |
| Encrypted PDF | .pdf | No* | Requires password |
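The "OCR Required" column can be read as a per-page decision: a page with an extractable text layer needs no OCR, and a page without one does. The sketch below illustrates that reading. It assumes you already know which pages carry a text layer; the `classify_pdf` helper, the `has_text_layer` flags, and the `"native"` / `"scanned"` / `"mixed"` labels are hypothetical and not confirmed GateFlow API values.

```python
from typing import List, Tuple


def classify_pdf(has_text_layer: List[bool]) -> Tuple[str, List[int]]:
    """Return (document_type, pages_needing_ocr) from per-page flags.

    Page numbers in the result are 1-based. The type labels are
    illustrative assumptions, not values taken from the API.
    """
    if not has_text_layer:
        raise ValueError("document has no pages")

    # Pages without a text layer are the ones that would need OCR.
    ocr_pages = [
        number
        for number, has_text in enumerate(has_text_layer, start=1)
        if not has_text
    ]

    if not ocr_pages:
        doc_type = "native"
    elif len(ocr_pages) == len(has_text_layer):
        doc_type = "scanned"
    else:
        doc_type = "mixed"
    return doc_type, ocr_pages


print(classify_pdf([True, False, True]))
```

For a mixed document, only the pages listed in the second element would go through OCR, which is what "Partial" in the table suggests.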
```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.gateflow.ai/v1",
    api_key="gw_prod_..."
)

# Process a PDF document
response = client.post(
    "/data/ocr",
    files={"file": open("document.pdf", "rb")},
    data={
        "detect_type": True,  # Auto-detect native vs scanned
        "extract_tables": True,
        "preserve_layout": True
    }
)

print(f"Document type: {response['document_type']}")
print(f"Pages: {response['page_count']}")
print(f"OCR applied: {response['ocr_applied']}")
```

Image Formats
| Format | Extension | Best For | Max Resolution |
|---|---|---|---|
| JPEG | .jpg, .jpeg | Photos, scans | 10000x10000 |
| PNG | .png | Screenshots, diagrams | 10000x10000 |
| TIFF | .tif, .tiff | High-quality scans | 10000x10000 |
| WebP | .webp | Web images | 10000x10000 |
| BMP | .bmp | Legacy scans | 10000x10000 |
| HEIC | .heic | iOS photos | 10000x10000 |
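A local pre-flight check can catch unsupported extensions and oversized images before upload. This is a minimal sketch built from the table above. `check_image` is an illustrative helper, not part of the GateFlow API, and it assumes the caller already knows the pixel dimensions. Plan-level limits may be lower than the 10000x10000 ceiling used here.

```python
from pathlib import Path

# Extensions and per-side pixel ceiling copied from the table above.
SUPPORTED_IMAGE_EXTENSIONS = {
    ".jpg", ".jpeg", ".png", ".tif", ".tiff", ".webp", ".bmp", ".heic",
}
MAX_SIDE = 10000


def check_image(filename: str, width: int, height: int) -> list:
    """Return a list of problems; an empty list means the image passes."""
    problems = []
    ext = Path(filename).suffix.lower()
    if ext not in SUPPORTED_IMAGE_EXTENSIONS:
        problems.append(f"unsupported extension: {ext or '(none)'}")
    if width > MAX_SIDE or height > MAX_SIDE:
        problems.append(
            f"resolution {width}x{height} exceeds {MAX_SIDE}x{MAX_SIDE}"
        )
    return problems


print(check_image("scan.JPG", 3000, 2000))
print(check_image("big.png", 12000, 100))
```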
```python
# Process an image
response = client.post(
    "/data/ocr",
    files={"file": open("scan.jpg", "rb")},
    data={
        "deskew": True,  # Correct rotation
        "denoise": True,  # Remove noise
        "enhance_contrast": True
    }
)
```

Office Documents
| Format | Extension | Native Text | OCR for Images |
|---|---|---|---|
| Word | .docx | Yes | Embedded images |
| Word (Legacy) | .doc | Yes | Embedded images |
| Excel | .xlsx | Yes | Embedded images |
| Excel (Legacy) | .xls | Yes | Embedded images |
| PowerPoint | .pptx | Yes | Slides as needed |
| PowerPoint (Legacy) | .ppt | Yes | Slides as needed |
```python
# Process a Word document with embedded images
response = client.post(
    "/data/ocr",
    files={"file": open("report.docx", "rb")},
    data={
        "ocr_embedded_images": True,
        "preserve_formatting": True
    }
)
```

Plain Text and Markup
| Format | Extension | Processing |
|---|---|---|
| Plain Text | .txt | Direct extraction |
| Markdown | .md | Structure preserved |
| HTML | .html, .htm | Tags stripped, structure kept |
| XML | .xml | Content extraction |
| CSV | .csv | Table structure |
| JSON | .json | Data extraction |
Email Formats
| Format | Extension | Attachments |
|---|---|---|
| EML | .eml | Processed recursively |
| MSG | .msg | Processed recursively |
| MBOX | .mbox | Multiple emails |
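Before uploading a message, it can help to see which attachments it actually carries. The sketch below uses only Python's standard `email` package. `list_attachments` and `sample_message` are illustrative helpers, the default extension filter simply mirrors the `attachment_types` values used in this section, and the helper looks at top-level attachments only, so it does not reproduce the recursive processing described in the table.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage
from pathlib import Path


def list_attachments(raw: bytes, allowed=("pdf", "docx", "jpg")) -> list:
    """Return filenames of top-level attachments whose extension is allowed."""
    msg = message_from_bytes(raw, policy=policy.default)
    names = []
    for part in msg.iter_attachments():
        name = part.get_filename() or ""
        ext = Path(name).suffix.lower().lstrip(".")
        if ext in allowed:
            names.append(name)
    return names


def sample_message() -> bytes:
    """Build a small in-memory message so the helper can be tried offline."""
    msg = EmailMessage()
    msg["Subject"] = "Quarterly report"
    msg.set_content("See attached.")
    msg.add_attachment(
        b"%PDF-1.4", maintype="application", subtype="pdf",
        filename="report.pdf",
    )
    msg.add_attachment(
        b"MZ", maintype="application", subtype="octet-stream",
        filename="tool.exe",
    )
    return msg.as_bytes()


print(list_attachments(sample_message()))
```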
```python
# Process email with attachments
response = client.post(
    "/data/ocr",
    files={"file": open("email.eml", "rb")},
    data={
        "process_attachments": True,
        "max_attachment_depth": 2,
        "attachment_types": ["pdf", "docx", "jpg"]
    }
)

print("Email body extracted")
print(f"Attachments processed: {response['attachments_count']}")
```

Image Quality Requirements
Minimum Requirements
| Parameter | Minimum | Recommended | Optimal |
|---|---|---|---|
| Resolution | 150 DPI | 300 DPI | 600 DPI |
| Text height | 8px | 20px | 32px+ |
| Contrast ratio | 2:1 | 4:1 | 7:1+ |
| Skew angle | < 15° | < 5° | < 1° |
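The thresholds above can be turned into a quick tier check for a scan whose measurements you already have. This is only a sketch: the function names are made up for illustration, and letting the weakest measurement decide the overall tier is an assumption of this example, not documented GateFlow behavior.

```python
# Thresholds copied from the table above: (minimum, recommended, optimal).
THRESHOLDS = {
    "dpi": (150, 300, 600),
    "text_height_px": (8, 20, 32),
    "contrast_ratio": (2.0, 4.0, 7.0),
}
# Skew is "lower is better": strict upper bounds in degrees.
SKEW_LIMITS = (15.0, 5.0, 1.0)
TIERS = ("below minimum", "minimum", "recommended", "optimal")


def tier_for(name: str, value: float) -> int:
    """Index into TIERS for one 'higher is better' measurement."""
    return sum(value >= bound for bound in THRESHOLDS[name])


def skew_tier(angle_deg: float) -> int:
    """Index into TIERS for the skew angle (smaller angles rank higher)."""
    return sum(abs(angle_deg) < bound for bound in SKEW_LIMITS)


def overall_tier(dpi, text_height_px, contrast_ratio, skew_deg) -> str:
    """Assumption of this sketch: the weakest measurement decides the tier."""
    scores = [
        tier_for("dpi", dpi),
        tier_for("text_height_px", text_height_px),
        tier_for("contrast_ratio", contrast_ratio),
        skew_tier(skew_deg),
    ]
    return TIERS[min(scores)]


print(overall_tier(dpi=300, text_height_px=20, contrast_ratio=4.0, skew_deg=3.0))
```

A scan that lands in "below minimum" or "minimum" is a natural candidate for the preprocessing options in the next section.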
Quality Preprocessing
```python
# Enable automatic quality enhancement
response = client.post(
    "/data/ocr",
    files={"file": open("poor_scan.jpg", "rb")},
    data={
        "preprocessing": {
            "deskew": True,
            "denoise": True,
            "binarize": True,
            "remove_borders": True,
            "enhance_resolution": True  # AI upscaling
        }
    }
)

print(f"Quality score before: {response['quality']['original']}")
print(f"Quality score after: {response['quality']['enhanced']}")
```

Language Support
Supported Languages
GateFlow OCR supports 100+ languages including:
| Category | Languages |
|---|---|
| Latin | English, Spanish, French, German, Italian, Portuguese, Dutch, Polish, Czech, etc. |
| Cyrillic | Russian, Ukrainian, Bulgarian, Serbian, etc. |
| Asian | Chinese (Simplified & Traditional), Japanese, Korean, Thai, Vietnamese |
| Middle Eastern | Arabic, Hebrew, Persian, Urdu |
| South Asian | Hindi, Bengali, Tamil, Telugu, Gujarati |
| Other | Greek, Armenian, Georgian, etc. |
```python
# Specify language for better accuracy
response = client.post(
    "/data/ocr",
    files={"file": open("german_doc.pdf", "rb")},
    data={
        "languages": ["de", "en"],  # Primary and secondary
        "detect_language": True
    }
)

print(f"Detected languages: {response['detected_languages']}")
```

Multi-Language Documents
```python
# Process document with multiple languages
response = client.post(
    "/data/ocr",
    files={"file": open("multilingual.pdf", "rb")},
    data={
        "languages": ["en", "es", "fr", "de"],
        "per_page_detection": True
    }
)

for page in response["pages"]:
    print(f"Page {page['number']}: {page['detected_language']}")
```

File Size Limits
| Plan | Max File Size | Max Pages | Max Resolution |
|---|---|---|---|
| Free | 10 MB | 10 pages | 2000x2000 |
| Pro | 50 MB | 100 pages | 5000x5000 |
| Enterprise | 500 MB | 1000 pages | 10000x10000 |
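The limits above are easy to check locally before an upload, and a long document can be described as a series of `page_range` strings. The following is a sketch under stated assumptions: `fits_plan` and `page_ranges` are illustrative helpers, 1 MB is taken to be 1024 * 1024 bytes (the table does not say), and this page does not state whether submitting ranges separately changes how the page limit is counted.

```python
# Size and page limits copied from the table above.
PLAN_LIMITS = {
    "free": {"max_mb": 10, "max_pages": 10},
    "pro": {"max_mb": 50, "max_pages": 100},
    "enterprise": {"max_mb": 500, "max_pages": 1000},
}
BYTES_PER_MB = 1024 * 1024  # assumption: binary megabytes


def fits_plan(plan: str, size_bytes: int, pages: int) -> bool:
    """True if a file is within both the size and page limit of a plan."""
    limits = PLAN_LIMITS[plan]
    return (
        size_bytes <= limits["max_mb"] * BYTES_PER_MB
        and pages <= limits["max_pages"]
    )


def page_ranges(total_pages: int, chunk_size: int) -> list:
    """Split pages 1..total_pages into 'start-end' strings of chunk_size."""
    if total_pages < 1 or chunk_size < 1:
        raise ValueError("total_pages and chunk_size must be positive")
    return [
        f"{start}-{min(start + chunk_size - 1, total_pages)}"
        for start in range(1, total_pages + 1, chunk_size)
    ]


print(fits_plan("pro", 20 * BYTES_PER_MB, 250))
print(page_ranges(250, 100))
```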
Handling Large Documents
```python
# Process large document in chunks
response = client.post(
    "/data/ocr",
    files={"file": open("large_doc.pdf", "rb")},
    data={
        "page_range": "1-100",  # First 100 pages
        "async": True  # Process asynchronously
    }
)

job_id = response["job_id"]

# Check status
status = client.get(f"/data/ocr/jobs/{job_id}")
print(f"Status: {status['status']}")
print(f"Progress: {status['pages_processed']}/{status['total_pages']}")
```

Output Formats
Text Output
```python
# Get plain text output
response = client.post(
    "/data/ocr",
    files={"file": open("document.pdf", "rb")},
    data={"output_format": "text"}
)

print(response["text"])
```

Structured Output
```python
# Get structured output with layout
response = client.post(
    "/data/ocr",
    files={"file": open("document.pdf", "rb")},
    data={"output_format": "structured"}
)

for page in response["pages"]:
    print(f"\n--- Page {page['number']} ---")
    for block in page["blocks"]:
        print(f"[{block['type']}] {block['text'][:100]}...")
```

JSON Output
```python
# Get full JSON output
response = client.post(
    "/data/ocr",
    files={"file": open("document.pdf", "rb")},
    data={
        "output_format": "json",
        "include_coordinates": True,
        "include_confidence": True
    }
)

# Response includes bounding boxes and confidence scores
for word in response["pages"][0]["words"]:
    print(f"{word['text']} ({word['confidence']:.2f})")
    print(f"  Position: {word['bbox']}")
```

Markdown Output
```python
# Get Markdown output preserving structure
response = client.post(
    "/data/ocr",
    files={"file": open("document.pdf", "rb")},
    data={
        "output_format": "markdown",
        "preserve_headings": True,
        "preserve_lists": True,
        "preserve_tables": True
    }
)

print(response["markdown"])
```

Error Handling
Common Format Errors
| Error | Cause | Solution |
|---|---|---|
| `unsupported_format` | File type not supported | Convert to supported format |
| `corrupted_file` | File is damaged | Re-upload or repair file |
| `password_protected` | PDF requires password | Provide password parameter |
| `too_large` | Exceeds size limit | Split document or upgrade plan |
| `low_quality` | Image quality too poor | Enhance or rescan |
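One way to act on these codes is to look them up in a small table when a request fails. The sketch below assumes the error code appears somewhere in the exception text, which is how the example in this section detects `password_protected`. `classify_error` is an illustrative helper, not part of the GateFlow API.

```python
# Codes and suggested fixes copied from the table above.
ERROR_SOLUTIONS = {
    "unsupported_format": "Convert to supported format",
    "corrupted_file": "Re-upload or repair file",
    "password_protected": "Provide password parameter",
    "too_large": "Split document or upgrade plan",
    "low_quality": "Enhance or rescan",
}


def classify_error(message: str):
    """Return (code, solution) for the first known code found in message.

    Returns (None, None) when no known code is present, so the caller
    can re-raise errors this table does not cover.
    """
    for code, solution in ERROR_SOLUTIONS.items():
        if code in message:
            return code, solution
    return None, None


print(classify_error("400 Bad Request: password_protected"))
print(classify_error("connection reset"))
```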
```python
# Handle format errors
try:
    response = client.post(
        "/data/ocr",
        files={"file": open("document.pdf", "rb")}
    )
except Exception as e:
    if "password_protected" in str(e):
        # Retry with password
        response = client.post(
            "/data/ocr",
            files={"file": open("document.pdf", "rb")},
            data={"password": "secret123"}
        )
    else:
        # Re-raise anything this handler does not cover
        raise
```

Best Practices
- Use native PDFs when possible - Faster and more accurate than OCR
- Scan at 300 DPI or higher - The recommended resolution for accurate OCR (150 DPI is the minimum)
- Specify languages - Improves recognition accuracy
- Enable preprocessing - Helps with poor quality scans
- Use async for large files - Prevents timeouts
- Check confidence scores - Flag low-confidence text for review
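The last practice can be sketched as a small filter over the JSON output. It assumes the response shape shown in the JSON Output example (pages containing `words` with `text` and `confidence`) and a page `number` field as in the structured output. The `low_confidence_words` helper and the 0.8 threshold are illustrative choices, not GateFlow defaults.

```python
def low_confidence_words(response: dict, threshold: float = 0.8) -> list:
    """Collect (page_number, text, confidence) for words below threshold."""
    flagged = []
    for page in response.get("pages", []):
        for word in page.get("words", []):
            if word["confidence"] < threshold:
                flagged.append(
                    (page.get("number"), word["text"], word["confidence"])
                )
    return flagged


# Hand-written sample in the assumed response shape, for offline testing.
sample_response = {
    "pages": [
        {
            "number": 1,
            "words": [
                {"text": "Invoice", "confidence": 0.99},
                {"text": "T0tal", "confidence": 0.41},
            ],
        },
        {"number": 2, "words": [{"text": "Due", "confidence": 0.79}]},
    ]
}

for page_number, text, confidence in low_confidence_words(sample_response):
    print(f"Page {page_number}: {text!r} ({confidence:.2f})")
```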
Next Steps
- Mistral Document AI - OCR engine details
- Table Extraction - Extract tables
- Document Ingestion - Full pipeline