Structured Data Extraction

Extract structured data from documents using JSON Schema definitions. The response contains the extracted data, confidence indicators for extracted fields, and the result of validating the data against the schema.

Extract Data

POST /api/extract

Request Body

Field         Type      Required  Description
projectId     string    Yes       Project containing the files
fileId        string    Yes*      Single file to extract from
fileIds       string[]  Yes*      Multiple files to extract from
schema        object    Yes       JSON Schema defining fields
instructions  string    No        Additional extraction context
model         string    No        "fast" or "accurate" (default)

*Either fileId or fileIds is required.
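
The either/or rule can be checked on the client before a request is sent. The following is a minimal sketch, not part of the API: build_extract_body is a hypothetical helper name, and it assumes that supplying both fileId and fileIds (or neither) should be rejected, which the table above does not state explicitly.

```python
from typing import Any, Optional, Sequence


def build_extract_body(
    project_id: str,
    schema: dict[str, Any],
    file_id: Optional[str] = None,
    file_ids: Optional[Sequence[str]] = None,
    instructions: Optional[str] = None,
    model: Optional[str] = None,
) -> dict[str, Any]:
    """Assemble a request body for POST /api/extract (hypothetical helper)."""
    # Assumption: exactly one of fileId / fileIds should be present.
    if (file_id is None) == (file_ids is None):
        raise ValueError("Provide exactly one of file_id or file_ids")
    if model is not None and model not in ("fast", "accurate"):
        raise ValueError('model must be "fast" or "accurate"')

    body: dict[str, Any] = {"projectId": project_id, "schema": schema}
    if file_id is not None:
        body["fileId"] = file_id
    else:
        body["fileIds"] = list(file_ids)
    # Optional fields are left out when unset so the server-side default applies.
    if instructions is not None:
        body["instructions"] = instructions
    if model is not None:
        body["model"] = model
    return body
```

Leaving model out of the body, rather than sending a guessed value, keeps the request aligned with whatever default the service applies.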

Schema Format

The schema follows the JSON Schema specification:

{
  "type": "object",
  "properties": {
    "companyName": {
      "type": "string",
      "description": "Official registered company name"
    },
    "founders": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "name": { "type": "string" },
          "title": { "type": "string" }
        }
      }
    },
    "incorporationDate": {
      "type": "string",
      "format": "date"
    }
  },
  "required": ["companyName"]
}

Example - Company Document

curl -X POST "https://api.getneji.com/api/extract" \
  -H "Authorization: Bearer sk_your_key" \
  -H "Content-Type: application/json" \
  -d '{
    "projectId": "proj_abc123",
    "fileId": "file_xyz789",
    "schema": {
      "type": "object",
      "properties": {
        "companyName": { "type": "string" },
        "founders": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "name": { "type": "string" },
              "title": { "type": "string" }
            }
          }
        },
        "registrationNumber": { "type": "string" }
      },
      "required": ["companyName"]
    }
  }'
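
The same request can be issued from Python using only the standard library. This is a sketch under the assumption that the endpoint and headers are exactly those in the curl command above; make_extract_request and send are hypothetical helper names, and error handling is omitted.

```python
import json
import urllib.request
from typing import Any

API_URL = "https://api.getneji.com/api/extract"  # endpoint taken from the curl example


def make_extract_request(api_key: str, body: dict[str, Any]) -> urllib.request.Request:
    """Build (but do not send) the POST request shown in the curl example."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request: urllib.request.Request) -> dict[str, Any]:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Separating request construction from sending makes it possible to inspect or log the payload before any network call is made.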

Example - Receipt Image

{
  "projectId": "proj_abc123",
  "fileId": "file_receipt_img",
  "schema": {
    "type": "object",
    "properties": {
      "storeName": { "type": "string" },
      "date": { "type": "string" },
      "items": {
        "type": "array",
        "items": {
          "type": "object",
          "properties": {
            "name": { "type": "string" },
            "price": { "type": "number" }
          }
        }
      },
      "total": { "type": "number" }
    },
    "required": ["storeName", "total"]
  }
}

Response - Completed

{
  "extractionId": "extr_abc123",
  "status": "completed",
  "data": {
    "companyName": "Acme Technologies Inc.",
    "founders": [
      { "name": "John Smith", "title": "CEO" },
      { "name": "Jane Doe", "title": "CTO" }
    ],
    "registrationNumber": "C1234567"
  },
  "confidence": {
    "companyName": {
      "level": "high",
      "source": "Page 1, header"
    },
    "founders": {
      "level": "medium",
      "source": "Signature block",
      "note": "Titles inferred from context"
    }
  },
  "validation": {
    "valid": true,
    "errors": [],
    "missingRequired": []
  }
}
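
A client will usually want to check the validation object before trusting data. The sketch below assumes the response shape shown above; validation_problems is a hypothetical helper, and because the structure of entries in errors is not documented here, they are rendered with str().

```python
from typing import Any


def validation_problems(response: dict[str, Any]) -> list[str]:
    """Collect human-readable problems from the "validation" object."""
    validation = response.get("validation", {})
    problems = [
        f"missing required field: {name}"
        for name in validation.get("missingRequired", [])
    ]
    # Assumption: the shape of each error entry is unknown, so it is shown as-is.
    problems += [f"schema error: {error}" for error in validation.get("errors", [])]
    # Cover the case where the response is flagged invalid but gives no details.
    if not validation.get("valid", False) and not problems:
        problems.append("response marked invalid without details")
    return problems
```

An empty list means the response reported valid: true with no errors and no missing required fields.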

Confidence Levels

Level   Description
high    Data is clearly stated and unambiguous
medium  Data requires interpretation
low     Data is inferred or uncertain
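
The levels can be used to route uncertain fields to manual review. The following sketch assumes the ordering high > medium > low and treats a field that has no confidence entry (such as registrationNumber in the sample response) as needing review; both choices, and the name fields_needing_review, are illustrative assumptions rather than documented behavior.

```python
from typing import Any

# Ordering assumed from the level descriptions: high > medium > low.
LEVEL_RANK = {"low": 0, "medium": 1, "high": 2}


def fields_needing_review(response: dict[str, Any], minimum: str = "high") -> list[str]:
    """Return extracted field names whose confidence is below `minimum`."""
    confidence = response.get("confidence", {})
    flagged = []
    for field in response.get("data", {}):
        entry = confidence.get(field)
        # Assumption: a missing entry or unknown level counts as below any minimum.
        if entry is None or LEVEL_RANK.get(entry.get("level"), -1) < LEVEL_RANK[minimum]:
            flagged.append(field)
    return flagged
```

Applied to the sample response above with the default minimum of "high", this flags founders (medium) and registrationNumber (no confidence entry), and leaves companyName alone.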

Get Extraction Result

GET /api/extract/:extractionId

Poll this endpoint to get the result of an async extraction.
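
A polling loop might look like the sketch below. Only the "completed" status appears in this document, so any other status is treated as "not finished yet"; that treatment, the default interval, and the attempt limit are assumptions. fetch is a hypothetical callable that performs GET /api/extract/:extractionId and returns the decoded JSON.

```python
import time
from typing import Any, Callable


def wait_for_extraction(
    fetch: Callable[[str], dict[str, Any]],
    extraction_id: str,
    interval_seconds: float = 2.0,
    max_attempts: int = 30,
) -> dict[str, Any]:
    """Poll until the extraction reports status "completed" or attempts run out."""
    for attempt in range(max_attempts):
        result = fetch(extraction_id)
        if result.get("status") == "completed":
            return result
        # Assumption: every other status means the extraction is still in progress.
        if attempt < max_attempts - 1:
            time.sleep(interval_seconds)
    raise TimeoutError(
        f"extraction {extraction_id} not completed after {max_attempts} attempts"
    )
```

Passing fetch in as a parameter keeps the loop independent of any particular HTTP client and makes it easy to exercise with a stub.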