
Documents API

Manage documents in the Data Pillar for RAG (Retrieval-Augmented Generation) pipelines.

Overview

The Documents API provides endpoints to:

  • Ingest documents for RAG pipelines
  • List and retrieve documents
  • Delete documents (GDPR Article 17 compliant)
  • Get document chunks for direct injection

Documents are automatically processed through:

  1. Text extraction (PDF, DOC/DOCX, TXT, MD, HTML)
  2. PII/PHI detection and optional redaction
  3. Chunking (semantic, fixed, or paragraph-based)
  4. Embedding generation
  5. Vector storage for semantic search

Endpoints

| Method | Endpoint | Description |
|--------|----------|-------------|
| POST | /v1/data/documents | Upload and process document |
| GET | /v1/data/documents | List documents |
| GET | /v1/data/documents/:id | Get document details |
| GET | /v1/data/documents/:id/chunks | Get document chunks |
| DELETE | /v1/data/documents/:id | Delete document |

Upload Document

Upload and process a document for RAG.

POST /v1/data/documents
Content-Type: multipart/form-data

Request

bash
curl -X POST https://api.gateflow.ai/v1/data/documents \
  -H "Authorization: Bearer gw_prod_..." \
  -F "file=@document.pdf" \
  -F "name=Company Policies" \
  -F "data_classification=internal" \
  -F "chunking_strategy=semantic" \
  -F "chunk_size_tokens=512" \
  -F "residency_region=eu"
python
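import requests

# Open the file in a context manager so the handle is closed after the upload.
with open("document.pdf", "rb") as f:
    response = requests.post(
        "https://api.gateflow.ai/v1/data/documents",
        headers={"Authorization": "Bearer gw_prod_..."},
        files={"file": f},
        data={
            "name": "Company Policies",
            "data_classification": "internal",
            "chunking_strategy": "semantic",
            "chunk_size_tokens": 512,
            "residency_region": "eu",
        },
    )

response.raise_for_status()  # raise on 4xx/5xx responses
document = response.json()   # parsed response body (see Response below)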

Parameters

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| file | file | Yes | - | Document file to upload |
| name | string | No | filename | Document name |
| data_classification | string | No | internal | Classification level |
| chunking_strategy | string | No | semantic | How to split document |
| chunk_size_tokens | integer | No | 512 | Target chunk size (100-2000) |
| chunk_overlap_tokens | integer | No | 50 | Overlap between chunks (0-200) |
| residency_region | string | No | eu | Data residency region |
| retention_days | integer | No | null | Auto-delete after N days |
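
The ranges and allowed values can be checked on the client before a file is sent. The sketch below is one way to do that; `build_upload_fields` is a hypothetical helper (not part of any official SDK), the bounds are assumed to be inclusive, and the allowed values are copied from the Data Classifications and Chunking Strategies tables in this reference.

```python
from typing import Optional

# Allowed values copied from the tables in this reference.
CLASSIFICATIONS = {"public", "internal", "confidential", "phi", "privileged"}
STRATEGIES = {"semantic", "fixed", "paragraph"}


def build_upload_fields(
    name: Optional[str] = None,
    data_classification: str = "internal",
    chunking_strategy: str = "semantic",
    chunk_size_tokens: int = 512,
    chunk_overlap_tokens: int = 50,
    residency_region: str = "eu",
    retention_days: Optional[int] = None,
) -> dict:
    """Validate upload parameters and return form fields (hypothetical helper)."""
    if data_classification not in CLASSIFICATIONS:
        raise ValueError(f"unknown data_classification: {data_classification!r}")
    if chunking_strategy not in STRATEGIES:
        raise ValueError(f"unknown chunking_strategy: {chunking_strategy!r}")
    # Assumption: the documented ranges are inclusive at both ends.
    if not 100 <= chunk_size_tokens <= 2000:
        raise ValueError("chunk_size_tokens must be between 100 and 2000")
    if not 0 <= chunk_overlap_tokens <= 200:
        raise ValueError("chunk_overlap_tokens must be between 0 and 200")

    fields = {
        "data_classification": data_classification,
        "chunking_strategy": chunking_strategy,
        "chunk_size_tokens": chunk_size_tokens,
        "chunk_overlap_tokens": chunk_overlap_tokens,
        "residency_region": residency_region,
    }
    # Optional fields are left out so the server applies its own defaults.
    if name is not None:
        fields["name"] = name
    if retention_days is not None:
        fields["retention_days"] = retention_days
    return fields
```

The returned dictionary can be passed as the `data` argument of `requests.post` alongside the `files` argument.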

Data Classifications

| Level | Description |
|-------|-------------|
| public | No restrictions |
| internal | Internal use only |
| confidential | Sensitive business data |
| phi | Protected Health Information |
| privileged | Attorney-client privilege |

Chunking Strategies

| Strategy | Description |
|----------|-------------|
| semantic | Split on semantic boundaries (paragraphs, sections) |
| fixed | Fixed-size chunks with overlap |
| paragraph | Split on paragraph boundaries |
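
To make the `fixed` strategy and the `chunk_overlap_tokens` parameter concrete, here is a rough illustration of fixed-size chunking with overlap. It is not the service's implementation: `fixed_chunks` is a hypothetical function, and it works on an already tokenized list because the tokenizer used by the API is not specified in this reference.

```python
from typing import List, Sequence


def fixed_chunks(tokens: Sequence[str], size: int, overlap: int) -> List[List[str]]:
    """Split tokens into windows of `size`, each sharing `overlap` tokens with the previous one."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")

    step = size - overlap  # how far the window advances each iteration
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # the window reached the end; avoid a redundant tail chunk
    return chunks


# Ten tokens, windows of four, one token shared between neighbours.
print(fixed_chunks("a b c d e f g h i j".split(), size=4, overlap=1))
# [['a', 'b', 'c', 'd'], ['d', 'e', 'f', 'g'], ['g', 'h', 'i', 'j']]
```

A larger overlap keeps more context across chunk boundaries at the cost of storing and embedding more tokens in total.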

Response

json
{
  "id": "550e8400-e29b-41d4-a716-446655440000",
  "name": "Company Policies",
  "original_filename": "document.pdf",
  "mime_type": "application/pdf",
  "file_size_bytes": 102400,
  "data_classification": "internal",
  "chunking_strategy": "semantic",
  "chunk_count": 45,
  "embedding_model": "text-embedding-3-small",
  "status": "completed",
  "error_message": null,
  "residency_region": "eu",
  "retention_days": null,
  "metadata": {},
  "created_at": "2024-01-15T10:30:00Z",
  "updated_at": "2024-01-15T10:30:15Z",
  "processing_time_seconds": 15.2
}

List Documents

List documents for the current organization.

GET /v1/data/documents

Parameters

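| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| page | integer | 1 | Page number |
| page_size | integer | 20 | Items per page (1-100) |
| status | string | null | Filter by status |
| classification | string | null | Filter by classification |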

Response

json
{
  "documents": [
    {
      "id": "550e8400-e29b-41d4-a716-446655440000",
      "name": "Company Policies",
      "status": "completed",
      "chunk_count": 45,
      "created_at": "2024-01-15T10:30:00Z"
    }
  ],
  "total": 150,
  "page": 1,
  "page_size": 20,
  "has_more": true
}
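
The `has_more` flag makes it straightforward to walk every page. A minimal sketch, assuming a caller-supplied `fetch_page(page, page_size)` callable (hypothetical) that performs the `GET /v1/data/documents` request and returns the decoded JSON body:

```python
from typing import Callable, Dict, Iterator


def iter_documents(
    fetch_page: Callable[[int, int], Dict],
    page_size: int = 20,
) -> Iterator[Dict]:
    """Yield every document summary, following `has_more` across pages."""
    if not 1 <= page_size <= 100:
        raise ValueError("page_size must be between 1 and 100")
    page = 1
    while True:
        body = fetch_page(page, page_size)
        yield from body["documents"]
        if not body.get("has_more"):
            break  # last page reached
        page += 1
```

With `requests`, `fetch_page` could be a small wrapper that calls `requests.get(...)` with `params={"page": page, "page_size": page_size}` and returns `.json()`. Injecting the callable keeps the paging logic testable without network access.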

Get Document

Get document details by ID.

GET /v1/data/documents/:id
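
This reference gives no request example for this endpoint. The sketch below builds one with the Python standard library, assuming the same bearer-token header as the upload example; `build_get_document_request` is a hypothetical helper name.

```python
import urllib.request
from urllib.parse import quote

BASE_URL = "https://api.gateflow.ai/v1/data/documents"


def build_get_document_request(doc_id: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a GET request for one document's details."""
    # quote() guards against IDs containing characters that are unsafe in a URL path.
    url = f"{BASE_URL}/{quote(doc_id, safe='')}"
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {api_key}"},
        method="GET",
    )
```

The request can be sent with `urllib.request.urlopen(req)` and the body decoded with `json.load`. The response shape is not shown here; it plausibly matches the upload response, but that is an assumption.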

Get Document Chunks

Retrieve document chunks for full document injection into prompts.

GET /v1/data/documents/:id/chunks

Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| limit | integer | 50 | Maximum chunks (1-500) |
| offset | integer | 0 | Skip N chunks |

Response

json
{
  "document_id": "550e8400-e29b-41d4-a716-446655440000",
  "document_name": "Company Policies",
  "total_chunks": 45,
  "total_tokens": 23400,
  "chunks": [
    {
      "chunk_id": "chunk-001",
      "chunk_index": 0,
      "content": "Chapter 1: Introduction...",
      "token_count": 512,
      "page_number": 1,
      "section_header": "Introduction",
      "metadata": {}
    }
  ],
  "offset": 0,
  "limit": 50,
  "has_more": false
}
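
For full-document injection, the chunks can be paged with `offset` and `limit` and then joined in `chunk_index` order. A sketch under the assumption that `fetch_chunks(offset, limit)` is a caller-supplied function (hypothetical) that returns the decoded JSON shown above:

```python
from typing import Callable, Dict, List


def collect_document_text(
    fetch_chunks: Callable[[int, int], Dict],
    limit: int = 50,
    separator: str = "\n\n",
) -> str:
    """Fetch all chunks of a document and join their content in index order."""
    if not 1 <= limit <= 500:
        raise ValueError("limit must be between 1 and 500")

    chunks: List[Dict] = []
    offset = 0
    while True:
        body = fetch_chunks(offset, limit)
        batch = body["chunks"]
        chunks.extend(batch)
        if not body.get("has_more") or not batch:
            break  # stop on the last page, or if the server returns nothing
        offset += len(batch)

    # Sort defensively in case pages arrive out of order.
    chunks.sort(key=lambda c: c["chunk_index"])
    return separator.join(c["content"] for c in chunks)
```

If the document was chunked with a non-zero overlap, the joined text may repeat the overlapping tokens. Comparing `total_tokens` against the model's context window before injecting is a sensible precaution.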

Delete Document

Delete a document and all its chunks. Implements GDPR Article 17 (Right to Erasure).

DELETE /v1/data/documents/:id

Response

json
{
  "message": "Document deleted successfully",
  "document_id": "550e8400-e29b-41d4-a716-446655440000",
  "chunks_deleted": 45
}

Document Status

| Status | Description |
|--------|-------------|
| pending | Awaiting processing |
| processing | Being processed |
| completed | Ready for search |
| failed | Processing failed |
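
If an upload returns while the document is still `pending` or `processing`, a client may need to poll until it reaches `completed` or `failed`. Whether the upload call blocks until processing finishes is not stated in this reference, so treat the following as a sketch; `get_document` is a caller-supplied function (hypothetical) that returns the document JSON.

```python
import time
from typing import Callable, Dict


def wait_until_processed(
    get_document: Callable[[], Dict],
    timeout_seconds: float = 120.0,
    poll_interval_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict:
    """Poll until the document leaves the pending/processing states."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        doc = get_document()
        if doc["status"] == "completed":
            return doc
        if doc["status"] == "failed":
            raise RuntimeError(f"processing failed: {doc.get('error_message')}")
        if time.monotonic() >= deadline:
            raise TimeoutError("document was not processed before the timeout")
        sleep(poll_interval_seconds)
```

The `sleep` parameter exists only so the loop can be exercised in tests without real delays. The timeout and interval defaults are arbitrary choices, not documented values.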

Supported File Types

| Type | Extensions | Max Size |
|------|------------|----------|
| PDF | .pdf | 50 MB |
| Word | .docx, .doc | 25 MB |
| Text | .txt, .md | 10 MB |
| HTML | .html, .htm | 10 MB |
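
Checking the extension and size locally avoids a round trip that would end in a 400 or 413. A minimal sketch; `check_upload` is a hypothetical helper, and 1 MB is assumed to mean 10**6 bytes because the reference does not say whether the limits are in MB or MiB.

```python
import os

# Limits copied from the table above.
# Assumption: 1 MB = 10**6 bytes (the reference does not specify MB vs MiB).
MB = 10**6
MAX_SIZE_BY_EXTENSION = {
    ".pdf": 50 * MB,
    ".docx": 25 * MB,
    ".doc": 25 * MB,
    ".txt": 10 * MB,
    ".md": 10 * MB,
    ".html": 10 * MB,
    ".htm": 10 * MB,
}


def check_upload(filename: str, size_bytes: int) -> None:
    """Raise ValueError if the file would be rejected under the listed limits."""
    ext = os.path.splitext(filename)[1].lower()
    limit = MAX_SIZE_BY_EXTENSION.get(ext)
    if limit is None:
        raise ValueError(f"unsupported file type: {ext or '(no extension)'}")
    if size_bytes > limit:
        raise ValueError(f"{filename} is {size_bytes} bytes; limit for {ext} is {limit}")
```

A typical call is `check_upload(path, os.path.getsize(path))` immediately before the upload request. The server remains the authority, since it may inspect the file content rather than the extension.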

Error Codes

| Code | Description |
|------|-------------|
| 400 | Invalid file type or parameters |
| 404 | Document not found |
| 413 | File too large |
| 422 | Processing failed |
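
Client code can translate these status codes into exceptions with readable messages. The sketch below is illustrative only: `DocumentsApiError` and `raise_for_documented_status` are hypothetical names, and the format of the error response body is not covered because this reference does not describe it.

```python
# Meanings copied from the table above.
ERROR_MEANINGS = {
    400: "Invalid file type or parameters",
    404: "Document not found",
    413: "File too large",
    422: "Processing failed",
}


class DocumentsApiError(Exception):
    """Hypothetical exception carrying the HTTP status code."""

    def __init__(self, status_code: int, message: str) -> None:
        super().__init__(f"{status_code}: {message}")
        self.status_code = status_code


def raise_for_documented_status(status_code: int) -> None:
    """Raise DocumentsApiError for any 4xx/5xx status; do nothing otherwise."""
    if status_code < 400:
        return
    message = ERROR_MEANINGS.get(status_code, "Undocumented error")
    raise DocumentsApiError(status_code, message)
```

With `requests`, this would be called as `raise_for_documented_status(response.status_code)` before reading `response.json()`.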
