Structured Data Extraction
Extract structured data from documents using a JSON Schema definition. The response contains the extracted data, field-level confidence indicators, and the result of validating the data against the schema.
Extract Data
POST /api/extract
Request Body
| Field | Type | Required | Description |
|---|---|---|---|
| projectId | string | Yes | Project containing the files |
| fileId | string | Yes* | Single file to extract from |
| fileIds | string[] | Yes* | Multiple files to extract from |
| schema | object | Yes | JSON Schema defining fields |
| instructions | string | No | Additional extraction context |
| model | string | No | "fast" or "accurate"; defaults to "accurate" |
*Either fileId or fileIds is required.
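The field rules above can be sketched as a small client-side helper that assembles the request body. This is a minimal sketch, not part of the API: the function name `build_extract_request` and its parameter names are hypothetical, and rejecting a request that sets both `fileId` and `fileIds` is an assumption, since the table only states that one of them is required.

```python
from typing import Any, Optional, Sequence


def build_extract_request(
    project_id: str,
    schema: dict[str, Any],
    file_id: Optional[str] = None,
    file_ids: Optional[Sequence[str]] = None,
    instructions: Optional[str] = None,
    model: Optional[str] = None,
) -> dict[str, Any]:
    """Assemble a request body for POST /api/extract (hypothetical helper)."""
    # The table marks fileId and fileIds as "either ... is required".
    # Assumption: sending both at once is not intended, so it is rejected here.
    if (file_id is None) == (file_ids is None):
        raise ValueError("provide exactly one of file_id or file_ids")
    if model is not None and model not in ("fast", "accurate"):
        raise ValueError('model must be "fast" or "accurate"')

    body: dict[str, Any] = {"projectId": project_id, "schema": schema}
    if file_id is not None:
        body["fileId"] = file_id
    else:
        body["fileIds"] = list(file_ids)
    # Optional fields are left out when unset so the server applies its defaults.
    if instructions is not None:
        body["instructions"] = instructions
    if model is not None:
        body["model"] = model
    return body
```

Leaving `model` out of the body relies on the documented default rather than hard-coding "accurate" on the client.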
Schema Format
The schema follows the JSON Schema specification:
{
  "type": "object",
  "properties": {
    "companyName": {
      "type": "string",
      "description": "Official registered company name"
    },
    "founders": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "name": { "type": "string" },
          "title": { "type": "string" }
        }
      }
    },
    "incorporationDate": {
      "type": "string",
      "format": "date"
    }
  },
  "required": ["companyName"]
}
Example - Company Document
curl -X POST "https://api.getneji.com/api/extract" \
  -H "Authorization: Bearer sk_your_key" \
  -H "Content-Type: application/json" \
  -d '{
    "projectId": "proj_abc123",
    "fileId": "file_xyz789",
    "schema": {
      "type": "object",
      "properties": {
        "companyName": { "type": "string" },
        "founders": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "name": { "type": "string" },
              "title": { "type": "string" }
            }
          }
        },
        "registrationNumber": { "type": "string" }
      },
      "required": ["companyName"]
    }
  }'
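For callers working in Python rather than the shell, the same request might look like the sketch below. It uses only the standard library; the URL and header names are taken from the curl example, while the function names `make_extract_request` and `send` are hypothetical. Error handling for non-2xx responses is not covered here because the error format is not described in this section.

```python
import json
import urllib.request
from typing import Any

# Endpoint taken from the curl example above.
EXTRACT_URL = "https://api.getneji.com/api/extract"


def make_extract_request(api_key: str, body: dict[str, Any]) -> urllib.request.Request:
    """Build (but do not send) the POST request shown in the curl example."""
    return urllib.request.Request(
        EXTRACT_URL,
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


def send(request: urllib.request.Request) -> dict[str, Any]:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Separating request construction from sending keeps the first step easy to inspect or log before any network call is made.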
Example - Receipt Image
{
  "projectId": "proj_abc123",
  "fileId": "file_receipt_img",
  "schema": {
    "type": "object",
    "properties": {
      "storeName": { "type": "string" },
      "date": { "type": "string" },
      "items": {
        "type": "array",
        "items": {
          "type": "object",
          "properties": {
            "name": { "type": "string" },
            "price": { "type": "number" }
          }
        }
      },
      "total": { "type": "number" }
    },
    "required": ["storeName", "total"]
  }
}
Response - Completed
{
  "extractionId": "extr_abc123",
  "status": "completed",
  "data": {
    "companyName": "Acme Technologies Inc.",
    "founders": [
      { "name": "John Smith", "title": "CEO" },
      { "name": "Jane Doe", "title": "CTO" }
    ],
    "registrationNumber": "C1234567"
  },
  "confidence": {
    "companyName": {
      "level": "high",
      "source": "Page 1, header"
    },
    "founders": {
      "level": "medium",
      "source": "Signature block",
      "note": "Titles inferred from context"
    }
  },
  "validation": {
    "valid": true,
    "errors": [],
    "missingRequired": []
  }
}
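Before consuming `data`, a client will usually want to check `status` and the `validation` block. The following is a minimal sketch under the assumption that the response has the shape shown above; the exception class `ExtractionNotUsable` and the function `require_valid_data` are hypothetical names, and the structure of individual `errors` entries is not documented here, so they are only echoed back in the message.

```python
from typing import Any


class ExtractionNotUsable(Exception):
    """Raised when a response should not be consumed as-is (hypothetical)."""


def require_valid_data(response: dict[str, Any]) -> dict[str, Any]:
    """Return response["data"] only for a completed, schema-valid extraction."""
    status = response.get("status")
    if status != "completed":
        raise ExtractionNotUsable(f"extraction status is {status!r}, not 'completed'")

    validation = response.get("validation", {})
    # A missing "valid" flag is treated as a failure rather than assumed true.
    if not validation.get("valid", False):
        missing = validation.get("missingRequired", [])
        errors = validation.get("errors", [])
        raise ExtractionNotUsable(
            f"schema validation failed: missingRequired={missing!r}, errors={errors!r}"
        )
    return response["data"]
```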
Confidence Levels
| Level | Description |
|---|---|
| high | Data is clearly stated and unambiguous |
| medium | Data requires interpretation |
| low | Data is inferred or uncertain |
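One way to act on these levels is to flag fields that fall below a chosen threshold for manual review. This is a sketch, assuming the three levels are ordered low < medium < high as the table suggests; the function name `fields_below` is hypothetical. In the completed response example, `registrationNumber` has no confidence entry, so the sketch treats a missing entry as below every level, which is an assumption and not documented behavior.

```python
from typing import Any

# Ordering implied by the confidence table: low < medium < high.
CONFIDENCE_RANK = {"low": 0, "medium": 1, "high": 2}


def fields_below(response: dict[str, Any], minimum: str = "high") -> list[str]:
    """List top-level data fields whose confidence is below `minimum`."""
    threshold = CONFIDENCE_RANK[minimum]
    confidence = response.get("confidence", {})
    flagged = []
    for field in response.get("data", {}):
        level = confidence.get(field, {}).get("level")
        # Assumption: a field with no confidence entry ranks below "low".
        if CONFIDENCE_RANK.get(level, -1) < threshold:
            flagged.append(field)
    return flagged
```

Applied to the completed response example with the default threshold, this flags `founders` (medium) and `registrationNumber` (no entry) and leaves `companyName` (high) alone.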
Get Extraction Result
GET /api/extract/:extractionId
Poll this endpoint with the extractionId returned by POST /api/extract to retrieve the result of an asynchronous extraction.
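A polling loop for this endpoint could be sketched as follows. The only status value shown in this section is "completed", so the sketch keeps polling on any other status until a client-side timeout; if the API reports a distinct failure status, it should be handled explicitly. The function `wait_for_extraction`, its `fetch` callback (which would perform the GET and return the decoded JSON), and the interval and timeout defaults are all hypothetical.

```python
import time
from typing import Any, Callable


def wait_for_extraction(
    fetch: Callable[[str], dict[str, Any]],
    extraction_id: str,
    interval: float = 2.0,
    timeout: float = 120.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict[str, Any]:
    """Call fetch(extraction_id) until status is "completed" or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = fetch(extraction_id)
        if result.get("status") == "completed":
            return result
        # Any other status is treated as "not finished yet" (assumption).
        if time.monotonic() >= deadline:
            raise TimeoutError(f"extraction {extraction_id} not completed in {timeout}s")
        sleep(interval)
```

Passing `fetch` and `sleep` in as parameters keeps the loop independent of any particular HTTP client and makes it straightforward to exercise without network access.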