Document Tools

Tools for processing, extracting, and managing documents.

Available Tools

| Tool | Description | Permission |
| --- | --- | --- |
| document/ocr | Extract text from images/PDFs | document/ocr |
| document/process | Process and chunk documents | document/process |
| document/status | Check processing status | document/status |
| document/list | List processed documents | document/list |
| document/delete | Delete a document | document/delete |

document/ocr

Extract text from images and PDFs using OCR.

Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| file | string | Yes | Base64-encoded file or file URL |
| model | string | No | OCR model (default: mistral-document-ai) |
| pages | array | No | Specific pages to process |
| extract_tables | boolean | No | Extract tables as structured data |
| language | string | No | Hint for language detection |
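
Because `file` accepts either a Base64-encoded file or a file URL, it can help to build the arguments in one place. The sketch below is a hypothetical helper, not part of `gateflow_mcp`: the function name and its URL check are assumptions, and the values in `pages` are passed through unchanged because the expected page indexing is not stated here.

```python
import base64
from pathlib import Path


def build_ocr_arguments(source, pages=None, language=None, extract_tables=False):
    """Build the `arguments` dict for a document/ocr call.

    `source` is either a file URL (passed through unchanged) or a local
    path (read and Base64-encoded). Hypothetical helper, not a library API.
    """
    if source.startswith(("http://", "https://")):
        file_value = source
    else:
        file_value = base64.b64encode(Path(source).read_bytes()).decode("ascii")

    arguments = {"file": file_value}
    # Only send optional parameters that were actually set, so the
    # server-side defaults (for example the default model) still apply.
    if pages is not None:
        arguments["pages"] = list(pages)
    if language is not None:
        arguments["language"] = language
    if extract_tables:
        arguments["extract_tables"] = True
    return arguments
```

The returned dict can be passed as `arguments` to `client.call_tool(name="document/ocr", ...)`.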

Example

python
import base64
from gateflow_mcp import MCPClient

client = MCPClient(agent_id="agent_abc123", api_key="gf-agent-...")

# Read and encode document
with open("contract.pdf", "rb") as f:
    file_b64 = base64.b64encode(f.read()).decode()

# Extract text
result = client.call_tool(
    name="document/ocr",
    arguments={
        "file": file_b64,
        "model": "mistral-document-ai",
        "extract_tables": True
    }
)

print(f"Extracted {result['pages']} pages")
print(f"Text: {result['text'][:500]}...")

# Access tables
for table in result.get("tables", []):
    print(f"Table on page {table['page']}: {table['rows']} rows")

Response

json
{
  "text": "CONSULTING AGREEMENT\n\nThis Agreement is entered into...",
  "pages": 5,
  "tables": [
    {
      "page": 2,
      "rows": 10,
      "columns": 4,
      "data": [["Item", "Qty", "Price", "Total"], ...]
    }
  ],
  "metadata": {
    "model": "mistral-document-ai",
    "processing_time_ms": 2500,
    "confidence": 0.95
  }
}
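
Each entry in `tables` carries its cells in `data` as a list of rows. A minimal sketch for turning one table into a list of dicts follows. It assumes the first row is a header row, as in the sample response, which is not a documented guarantee. The function name is hypothetical and the data row in `sample` is made up for illustration.

```python
def table_to_records(table):
    """Turn one entry of result["tables"] into a list of dicts.

    Assumes the first row of table["data"] is the header row.
    """
    data = table.get("data") or []
    if not data:
        return []
    header, *rows = data
    # Pair each cell with its column name from the header row.
    return [dict(zip(header, row)) for row in rows]


# Illustrative input only; the second row is invented for this example.
sample = {
    "page": 2,
    "data": [
        ["Item", "Qty", "Price", "Total"],
        ["Consulting", "10", "150.00", "1500.00"],
    ],
}
records = table_to_records(sample)
print(records[0]["Total"])
```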

document/process

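Process a document for storage and retrieval. The document is split into overlapping chunks, optionally scanned for PII/PHI, and stored in the target collection.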

Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| file | string | Yes | Base64-encoded file |
| filename | string | Yes | Original filename |
| collection | string | No | Target collection |
| chunk_size | integer | No | Characters per chunk (default: 1000) |
| chunk_overlap | integer | No | Overlap between chunks (default: 200) |
| detect_pii | boolean | No | Scan for PII/PHI |
| classification | string | No | Data classification level |
| metadata | object | No | Custom metadata |
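
To see how `chunk_size` and `chunk_overlap` interact, here is a minimal local sketch of fixed-size character chunking with overlap. It is an illustration under the assumption of a plain sliding window; the service's actual chunker may split differently (for example on sentence or page boundaries).

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size character windows that overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    # Each new chunk starts this many characters after the previous one.
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


chunks = chunk_text("x" * 2500)
print([len(c) for c in chunks])  # → [1000, 1000, 900]
```

With the defaults, each chunk repeats the last 200 characters of the previous one, so a larger overlap means more chunks for the same document.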

Example

python
# Reuses `client` and `file_b64` from the document/ocr example
result = client.call_tool(
    name="document/process",
    arguments={
        "file": file_b64,
        "filename": "employee_handbook.pdf",
        "collection": "hr-documents",
        "chunk_size": 1000,
        "chunk_overlap": 200,
        "detect_pii": True,
        "classification": "internal",
        "metadata": {
            "department": "HR",
            "version": "2026"
        }
    }
)

print(f"Document ID: {result['document_id']}")
print(f"Chunks created: {result['chunks']}")
print(f"PII findings: {result['pii_count']}")

Response

json
{
  "document_id": "doc_xyz789",
  "status": "processed",
  "filename": "employee_handbook.pdf",
  "pages": 45,
  "chunks": 120,
  "characters": 95000,
  "pii_count": 5,
  "pii_types": ["email", "phone_number"],
  "collection": "hr-documents",
  "classification": "internal",
  "processed_at": "2026-02-16T10:00:00Z"
}

document/status

Check the processing status of a document.

Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| document_id | string | Yes | Document identifier |

Example

python
result = client.call_tool(
    name="document/status",
    arguments={"document_id": "doc_xyz789"}
)

print(f"Status: {result['status']}")
print(f"Progress: {result['progress']}%")

Response

json
{
  "document_id": "doc_xyz789",
  "status": "processing",
  "progress": 65,
  "stages": {
    "upload": "complete",
    "ocr": "complete",
    "chunking": "in_progress",
    "embedding": "pending",
    "pii_scan": "pending"
  },
  "started_at": "2026-02-16T10:00:00Z",
  "estimated_completion": "2026-02-16T10:02:00Z"
}
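
For large documents, the status call is typically repeated until processing finishes. The sketch below is a hypothetical polling helper, not part of `gateflow_mcp`. It treats any status other than `"processing"` as final, which is an assumption, since only `"processing"` and `"processed"` appear on this page. The interval and timeout defaults are arbitrary.

```python
import time


def wait_for_document(call_tool, document_id, interval_s=2.0, timeout_s=300.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll document/status until the document leaves "processing".

    `call_tool` is any callable with the same (name=, arguments=) shape as
    client.call_tool. Returns the last status response.
    """
    deadline = clock() + timeout_s
    while True:
        result = call_tool(name="document/status",
                           arguments={"document_id": document_id})
        if result["status"] != "processing":
            return result
        if clock() >= deadline:
            raise TimeoutError(
                f"{document_id} still processing after {timeout_s}s")
        sleep(interval_s)
```

A call such as `wait_for_document(client.call_tool, "doc_xyz789")` would block until the status changes or the timeout expires. The caller should still inspect the returned `status` before using the document.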

document/list

List documents, optionally filtered by collection and status.

Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| collection | string | No | Filter by collection |
| status | string | No | Filter by status |
| limit | integer | No | Max results (default: 20) |
| offset | integer | No | Pagination offset |

Example

python
result = client.call_tool(
    name="document/list",
    arguments={
        "collection": "hr-documents",
        "status": "processed",
        "limit": 10
    }
)

for doc in result["documents"]:
    print(f"{doc['filename']}: {doc['pages']} pages, {doc['chunks']} chunks")

Response

json
{
  "documents": [
    {
      "document_id": "doc_xyz789",
      "filename": "employee_handbook.pdf",
      "collection": "hr-documents",
      "status": "processed",
      "pages": 45,
      "chunks": 120,
      "created_at": "2026-02-16T10:00:00Z"
    }
  ],
  "total": 15,
  "limit": 10,
  "offset": 0
}
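
When `total` exceeds `limit`, as in the response above, the remaining documents can be fetched by advancing `offset`. The generator below is a hypothetical sketch. It assumes `total` counts all matching documents rather than the current page, which the sample response suggests but does not state.

```python
def iter_documents(call_tool, collection=None, status=None, page_size=20):
    """Yield every document matching the filters, one page at a time.

    `call_tool` is any callable with the same (name=, arguments=) shape as
    client.call_tool.
    """
    offset = 0
    while True:
        arguments = {"limit": page_size, "offset": offset}
        if collection is not None:
            arguments["collection"] = collection
        if status is not None:
            arguments["status"] = status

        result = call_tool(name="document/list", arguments=arguments)
        documents = result["documents"]
        yield from documents

        offset += len(documents)
        # Also stop on an empty page, so a shrinking result set cannot
        # cause an endless loop.
        if not documents or offset >= result["total"]:
            break
```

For example, `for doc in iter_documents(client.call_tool, collection="hr-documents")` would walk the whole collection without handling offsets by hand.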

document/delete

Delete a document and its embeddings.

Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| document_id | string | Yes | Document to delete |

Example

python
result = client.call_tool(
    name="document/delete",
    arguments={"document_id": "doc_xyz789"}
)

print(f"Deleted: {result['deleted']}")

Response

json
{
  "document_id": "doc_xyz789",
  "deleted": true,
  "chunks_removed": 120,
  "deleted_at": "2026-02-16T11:00:00Z"
}

Supported File Types

| Type | Extensions | Max Size |
| --- | --- | --- |
| PDF | .pdf | 50 MB |
| Word | .docx, .doc | 50 MB |
| Text | .txt, .md | 10 MB |
| Images | .png, .jpg, .tiff | 20 MB |
| Spreadsheets | .xlsx, .csv | 50 MB |
| HTML | .html | 10 MB |
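
Checking a file against these limits before encoding it avoids a wasted upload. The sketch below is a hypothetical client-side check with limits copied from the table above. It assumes 1 MB means 1024 * 1024 bytes and that the limit applies to the raw file rather than its Base64 form. Neither is specified here.

```python
from pathlib import Path

MB = 1024 * 1024  # assumption: binary megabytes

# Limits copied from the supported file types table, keyed by extension.
MAX_SIZE_BYTES = {
    ".pdf": 50 * MB,
    ".docx": 50 * MB, ".doc": 50 * MB,
    ".txt": 10 * MB, ".md": 10 * MB,
    ".png": 20 * MB, ".jpg": 20 * MB, ".tiff": 20 * MB,
    ".xlsx": 50 * MB, ".csv": 50 * MB,
    ".html": 10 * MB,
}


def check_upload(filename, size_bytes):
    """Return None if the file looks acceptable, else a reason string."""
    ext = Path(filename).suffix.lower()
    limit = MAX_SIZE_BYTES.get(ext)
    if limit is None:
        return f"unsupported file type: {ext or '(no extension)'}"
    if size_bytes > limit:
        return f"{filename} is {size_bytes} bytes; limit for {ext} is {limit}"
    return None
```

A caller might run `check_upload(path.name, path.stat().st_size)` and skip the tool call when a reason string comes back.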

Permissions

Grant document tool access:

yaml
permissions:
  tools:
    - document/ocr        # Extract text
    - document/process    # Process documents
    - document/status     # Check status
    - document/list       # List documents
    - document/delete     # Delete documents
  collections:
    - hr-documents        # Specific collection access
    - legal-contracts

Best Practices

  1. Use collections - Group related documents so they can be filtered in document/list and scoped in permissions
  2. Enable PII detection - Set detect_pii for documents that may contain sensitive data
  3. Set classification - Apply the appropriate data classification level
  4. Add metadata - Improve searchability
  5. Check status - Poll document/status for completion on large documents
