Structify: From Chaos to Structure in Seconds
Technical deep dive into the Structify tool. Learn schema design patterns, validation strategies, advanced use cases, error handling, and production-ready implementation patterns for transforming unstructured data.
TL;DR
- Structify transforms unstructured text into structured JSON using AI-powered extraction
- Design schemas with proper field types, validation rules, and nested structures
- Implement error handling for extraction failures, validation errors, and edge cases
- Use validation strategies: strict mode for critical data, lenient mode for exploration
- Costs 3 points per call—process 500 documents on the Starter plan (1500 points)
- Production patterns: batch processing, retry logic, quality validation, and caching
What is Structify?
AI-Powered Data Extraction
Structify is one of AppHighway's most powerful tools for transforming unstructured text into structured data. Whether you're parsing emails, extracting information from documents, or cleaning messy datasets, Structify uses advanced AI models to understand context and extract exactly what you need.
Schema Design Patterns
Build Effective Extraction Schemas
The quality of your results depends heavily on schema design. Here's how to create schemas that extract exactly what you need.
1. Basic Schema Structure
Start with a simple flat schema for straightforward extraction
Example: Contact Extraction
{
"name": "string",
"email": "string",
"phone": "string",
"company": "string",
"role": "string"
}
Input: "Hi, I'm Sarah Johnson from TechCorp (sarah.j@techcorp.com, +1-555-0123). I'm the VP of Engineering."
Output: { "name": "Sarah Johnson", "email": "sarah.j@techcorp.com", "phone": "+1-555-0123", "company": "TechCorp", "role": "VP of Engineering" }
2. Field Type Definitions
Specify exact types for better validation and type safety
Example: Invoice Schema with Types
{
"invoice_number": "string",
"date": "date",
"total_amount": "number",
"currency": "string",
"is_paid": "boolean",
"line_items": "array",
"vendor": {
"name": "string",
"address": "string",
"tax_id": "string"
}
}
3. Nested Object Schemas
Extract hierarchical data with nested objects
Example: Product with Nested Details
{
"product": {
"name": "string",
"sku": "string",
"price": {
"amount": "number",
"currency": "string",
"tax_included": "boolean"
},
"availability": {
"in_stock": "boolean",
"quantity": "number",
"warehouse": "string"
},
"specifications": {
"dimensions": "string",
"weight": "number",
"color": "string"
}
}
}
Nested schemas keep related data organized and make downstream processing easier.
4. Array Field Patterns
Extract lists and collections from text
Simple Arrays (primitives)
{ "tags": ["string"], "prices": ["number"] }
Object Arrays (structured lists)
{
"line_items": [
{
"description": "string",
"quantity": "number",
"unit_price": "number",
"total": "number"
}
]
}
Perfect for invoices, shopping carts, multi-item forms, and product lists.
5. Optional vs Required Fields
Mark fields as optional when they might not appear in all documents
Example: Contact with Optional Fields
{
"name": "string", // Required
"email": "string", // Required
"phone?": "string", // Optional
"company?": "string", // Optional
"role?": "string" // Optional
}
Use the `?` suffix or specify `required: false` in JSON Schema format.
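Before submitting a schema, it can help to normalize the `?` shorthand into explicit required/optional lists. The helper below is an illustrative sketch, not part of any Structify SDK:

```javascript
// Split a flat shorthand schema into required and optional field names.
// Fields ending in "?" are optional; the suffix is stripped from the name.
function splitRequiredOptional(schema) {
  const required = [];
  const optional = [];
  for (const key of Object.keys(schema)) {
    if (key.endsWith('?')) {
      optional.push(key.slice(0, -1));
    } else {
      required.push(key);
    }
  }
  return { required, optional };
}

const contactSchema = {
  'name': 'string',
  'email': 'string',
  'phone?': 'string',
  'company?': 'string',
  'role?': 'string'
};
const { required, optional } = splitRequiredOptional(contactSchema);
// required: ['name', 'email']; optional: ['phone', 'company', 'role']
```

Keeping the split explicit also gives you a checklist to validate against after extraction.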
Validation Strategies
Ensure Data Quality
Validation ensures extracted data meets your quality standards before downstream processing.
1. Strict Mode
Reject responses that don't match the schema exactly
When: Use for critical data: financial records, legal documents, customer orders
Behavior: Returns error if any required field is missing or type mismatches occur
2. Lenient Mode
Return partial results with missing fields as null
When: Use for exploratory analysis, fuzzy matching, optional data extraction
Behavior: Returns best-effort extraction with null for missing fields
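In lenient mode, a simple completeness score can decide whether a partial result is usable or should be re-extracted. A minimal sketch (the 0.6 threshold is an arbitrary example, not a Structify default):

```javascript
// Count non-null fields in a lenient-mode result and accept it only
// when a minimum fraction of the fields were populated.
function completeness(result) {
  const keys = Object.keys(result);
  if (keys.length === 0) return 0;
  const filled = keys.filter((k) => result[k] !== null && result[k] !== undefined);
  return filled.length / keys.length;
}

function acceptLenient(result, threshold = 0.6) {
  return completeness(result) >= threshold;
}

const partial = { name: 'Sarah Johnson', email: 'sarah.j@techcorp.com', phone: null, company: 'TechCorp', role: null };
// 3 of 5 fields populated => completeness 0.6, accepted at the default threshold
```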
Field-Level Validation
Custom Validation Rules
Implement business logic validation after extraction
Example: Invoice Amount Validation
const result = await structifyAPI.extract(text, schema);
// Custom validation
if (result.total_amount < 0) {
throw new ValidationError('Total amount cannot be negative');
}
if (result.line_items.length === 0) {
throw new ValidationError('Invoice must have at least one line item');
}
const calculatedTotal = result.line_items
.reduce((sum, item) => sum + item.total, 0);
if (Math.abs(calculatedTotal - result.total_amount) > 0.01) {
throw new ValidationError('Line items do not match total amount');
}
Advanced Use Cases
Real-World Implementation Patterns
1. Email Conversation Threading
Extract structured data from multi-party email threads
Challenge: Email threads contain multiple messages, quoted replies, signatures
Solution: Extract array of messages with sender, timestamp, body
{
"subject": "string",
"thread_id": "string",
"messages": [
{
"sender": { "name": "string", "email": "string" },
"timestamp": "date",
"body": "string",
"is_reply": "boolean"
}
]
}
Enables sentiment analysis, response time tracking, and conversation history
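Once messages are extracted with this schema, response-time tracking falls out directly. A sketch that averages the gap between consecutive messages (field names follow the thread schema above):

```javascript
// Compute the average gap between consecutive messages, in minutes.
// Timestamps are sorted first, so message order in the array doesn't matter.
function averageResponseMinutes(messages) {
  const times = messages
    .map((m) => new Date(m.timestamp).getTime())
    .sort((a, b) => a - b);
  if (times.length < 2) return 0;
  let total = 0;
  for (let i = 1; i < times.length; i++) total += times[i] - times[i - 1];
  return total / (times.length - 1) / 60000;
}

const thread = [
  { sender: { name: 'A', email: 'a@x.com' }, timestamp: '2024-01-01T10:00:00Z', body: 'Hi', is_reply: false },
  { sender: { name: 'B', email: 'b@x.com' }, timestamp: '2024-01-01T10:30:00Z', body: 'Hello', is_reply: true },
  { sender: { name: 'A', email: 'a@x.com' }, timestamp: '2024-01-01T11:30:00Z', body: 'Thanks', is_reply: true }
];
// Gaps of 30 and 60 minutes => average 45 minutes
```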
2. Contract Clause Extraction
Extract specific clauses and terms from legal documents
Challenge: Contracts have complex structure, legal jargon, nested clauses
Solution: Define schema for standard clauses (payment terms, termination, liability)
{
"parties": [{ "name": "string", "role": "string" }],
"effective_date": "date",
"term": { "duration": "string", "renewal": "boolean" },
"payment_terms": { "amount": "number", "frequency": "string", "due_date": "string" },
"termination": { "notice_period": "string", "conditions": ["string"] },
"liability_cap": "number"
}
Automate contract review, compare terms across vendors, flag risky clauses
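Flagging risky clauses can then be plain post-processing over the extracted fields. The checks and thresholds below are illustrative examples, not legal criteria:

```javascript
// Flag potentially risky terms in an extracted contract.
// Field names follow the clause schema above; the liability threshold
// is an arbitrary example value.
function flagRiskyClauses(contract, { maxLiability = 1000000 } = {}) {
  const flags = [];
  if (contract.liability_cap == null) flags.push('No liability cap extracted');
  else if (contract.liability_cap > maxLiability) flags.push('Liability cap above threshold');
  if (contract.term && contract.term.renewal) flags.push('Auto-renewal clause present');
  if (!contract.termination || !contract.termination.notice_period) {
    flags.push('Missing termination notice period');
  }
  return flags;
}

const contract = {
  parties: [{ name: 'Acme', role: 'vendor' }],
  effective_date: '2024-01-01',
  term: { duration: '12 months', renewal: true },
  termination: { notice_period: '30 days', conditions: [] },
  liability_cap: 250000
};
// This contract trips only the auto-renewal check
```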
3. Multi-Page Form Extraction
Extract data from scanned forms (applications, surveys, registrations)
Challenge: Forms span multiple pages, handwritten entries, checkbox fields
Solution: OCR → Text cleanup → Structify with form field schema
1. OCR with Tesseract/Cloud Vision
2. Text cleaning (remove artifacts, fix encoding)
3. Structify with checkbox handling
4. Validate extracted data
5. Flag low-confidence fields for review
10x faster than manual data entry, enables bulk form processing
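Step 2 of the pipeline (text cleanup) might look like the sketch below; the specific fixes are common OCR cleanup steps, not an exhaustive list:

```javascript
// Clean OCR output before extraction: strip control characters,
// rejoin words hyphenated across line breaks, and collapse whitespace.
function cleanOcrText(raw) {
  return raw
    .replace(/[\u0000-\u0008\u000B\u000C\u000E-\u001F]/g, '') // control chars (keeps \n and \r)
    .replace(/(\w)-\n(\w)/g, '$1$2')                          // rejoin hyphenated line breaks
    .replace(/[ \t]+/g, ' ')                                  // collapse runs of spaces/tabs
    .replace(/\n{3,}/g, '\n\n')                               // collapse blank-line runs
    .trim();
}

const raw = 'Applica-\ntion   Form\n\n\n\nName:  Sarah';
// => 'Application Form\n\nName: Sarah'
```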
4. Product Catalog Migration
Migrate legacy product data from PDFs or text files to structured database
Challenge: Inconsistent formatting, missing fields, mixed units
Solution: Batch processing with schema normalization
{
"sku": "string",
"name": "string",
"category": "string",
"description": "string",
"price": { "amount": "number", "currency": "string" },
"specifications": { "weight": "number", "dimensions": "string", "color": "string" },
"inventory": { "quantity": "number", "warehouse": "string" }
}
Normalize units, deduplicate SKUs, validate prices, enrich missing fields
Migrate 10,000+ products in hours instead of weeks
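Normalization and deduplication are ordinary post-processing once the data is structured. A minimal sketch (the unit table and field names are illustrative):

```javascript
// Normalize weights to kilograms and drop duplicate SKUs, keeping the
// first occurrence. A real migration would cover more units and log changes.
const KG_PER_UNIT = { kg: 1, g: 0.001, lb: 0.453592 };

function normalizeWeightKg(value, unit) {
  const factor = KG_PER_UNIT[unit];
  if (factor === undefined) throw new Error(`Unknown weight unit: ${unit}`);
  return value * factor;
}

function dedupeBySku(products) {
  const seen = new Set();
  return products.filter((p) => {
    if (seen.has(p.sku)) return false;
    seen.add(p.sku);
    return true;
  });
}

const products = [
  { sku: 'A-1', name: 'Widget' },
  { sku: 'A-2', name: 'Gadget' },
  { sku: 'A-1', name: 'Widget (dup)' }
];
// dedupeBySku keeps A-1 and A-2; normalizeWeightKg(500, 'g') is 0.5 kg
```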
Production Implementation
Best Practices for Production Use
1. Batch Processing Pattern
Process multiple documents efficiently
async function batchStructify(documents, schema) {
const results = [];
const errors = [];
// Process in parallel (batch size: 10)
for (let i = 0; i < documents.length; i += 10) {
const batch = documents.slice(i, i + 10);
const promises = batch.map(async (doc) => {
try {
const result = await structifyAPI.extract(doc.text, schema);
return { id: doc.id, data: result, status: 'success' };
} catch (error) {
return { id: doc.id, error: error.message, status: 'failed' };
}
});
const batchResults = await Promise.allSettled(promises);
batchResults.forEach(result => {
if (result.status === 'fulfilled') {
if (result.value.status === 'success') {
results.push(result.value);
} else {
errors.push(result.value);
}
}
});
}
return { results, errors };
}
Process 1000 documents in 15 minutes instead of 3+ hours sequentially
2. Retry Logic for Transient Failures
Handle temporary errors gracefully
async function extractWithRetry(text, schema, maxRetries = 3) {
let lastError;
for (let attempt = 1; attempt <= maxRetries; attempt++) {
try {
return await structifyAPI.extract(text, schema);
} catch (error) {
lastError = error;
// Retry only on transient errors
if (error.code === 'RATE_LIMIT' || error.code === 'TIMEOUT') {
const delay = Math.pow(2, attempt) * 1000; // Exponential backoff
await new Promise(resolve => setTimeout(resolve, delay));
continue;
}
// Don't retry validation errors
throw error;
}
}
throw lastError;
}
Implement exponential backoff: 2s, 4s, 8s delays between retries
3. Quality Validation Pipeline
Validate extraction quality before downstream use
function validateExtraction(result) {
const issues = [];
// Completeness check
if (!result.email || !result.name) {
issues.push('Missing required fields');
}
// Format validation
if (result.email && !isValidEmail(result.email)) {
issues.push('Invalid email format');
}
// Range validation
if (result.price && result.price < 0) {
issues.push('Price cannot be negative');
}
return { valid: issues.length === 0, issues };
}
Flag low-quality extractions for manual review instead of using bad data
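The `isValidEmail` helper referenced above is not defined in the snippet; a minimal version could be a deliberately loose pattern check rather than a full RFC 5322 validator:

```javascript
// Loose email format check: one "@", a non-empty local part, and a
// domain containing at least one dot. Intentionally permissive.
function isValidEmail(value) {
  return /^[^\s@]+@[^\s@]+\.[^\s@]+$/.test(value);
}

// isValidEmail('sarah.j@techcorp.com') => true
// isValidEmail('not-an-email') => false
```

A permissive check is usually the right call here: the goal is to catch obviously broken extractions, not to reject unusual but valid addresses.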
4. Result Caching
Cache extraction results to save points and improve performance
Hash input text + schema → Cache key
const crypto = require('crypto');
function getCacheKey(text, schema) {
const hash = crypto.createHash('sha256');
hash.update(text + JSON.stringify(schema));
return hash.digest('hex');
}
async function extractWithCache(text, schema) {
const cacheKey = getCacheKey(text, schema);
// Check cache
const cached = await cache.get(cacheKey);
if (cached) return JSON.parse(cached);
// Extract
const result = await structifyAPI.extract(text, schema);
// Cache for 24 hours
await cache.set(cacheKey, JSON.stringify(result), 86400);
return result;
}
Save 70% of points on repeated extractions, 10x faster response times
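The `cache` object in the snippet above is assumed to exist; for local testing, a minimal in-memory stand-in with per-entry TTL is enough (production systems would typically use Redis or similar):

```javascript
// Minimal in-memory cache compatible with the cache.get /
// cache.set(key, value, ttlSeconds) calls above. Since `await` works on
// plain values, synchronous methods are fine for a local stand-in.
const store = new Map();

const cache = {
  get(key) {
    const entry = store.get(key);
    if (!entry) return null;
    if (Date.now() > entry.expiresAt) {
      store.delete(key); // lazy eviction on read
      return null;
    }
    return entry.value;
  },
  set(key, value, ttlSeconds) {
    store.set(key, { value, expiresAt: Date.now() + ttlSeconds * 1000 });
  }
};

cache.set('k', '{"name":"Sarah"}', 60);
// cache.get('k') returns the stored string until the TTL expires
```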
5. Monitoring & Observability
Track extraction quality and performance
Alert on: success rate < 95%, extraction time > 10s, daily points > budget
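A lightweight in-process tracker can enforce those thresholds before you wire up a real metrics backend. A sketch (the threshold defaults mirror the alerts listed above):

```javascript
// Track per-call outcomes and surface alerts when success rate drops
// below 95% or an extraction exceeds 10 seconds.
function createMonitor({ minSuccessRate = 0.95, maxDurationMs = 10000 } = {}) {
  let total = 0;
  let succeeded = 0;
  const alerts = [];
  return {
    record({ ok, durationMs }) {
      total += 1;
      if (ok) succeeded += 1;
      if (durationMs > maxDurationMs) alerts.push(`Slow extraction: ${durationMs}ms`);
    },
    successRate() {
      return total === 0 ? 1 : succeeded / total;
    },
    check() {
      if (this.successRate() < minSuccessRate) alerts.push('Success rate below threshold');
      return alerts;
    }
  };
}

const monitor = createMonitor();
monitor.record({ ok: true, durationMs: 1200 });
monitor.record({ ok: false, durationMs: 15000 });
// successRate 0.5 plus one slow call => two alerts after check()
```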
Error Handling & Troubleshooting
Common Issues and Solutions
InsufficientPointsError
Cause: Account balance too low (< 3 points)
Solution: Purchase more points or implement queueing for batch processing
SchemaValidationError
Cause: Extracted data doesn't match the schema (missing required fields, type mismatch)
Solution: Switch to lenient mode, simplify schema, or improve input text quality
EmptyExtractionError
Cause: No data extracted from input text
Solution: Check if input text contains expected data, improve text preprocessing (OCR quality)
TimeoutError
Cause: Extraction took longer than 30 seconds (very large documents)
Solution: Split large documents into smaller chunks, increase timeout, or use async processing
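For the chunking approach, splitting on paragraph boundaries keeps extraction context intact. A sketch; the character budget is an illustrative default, not a documented Structify limit:

```javascript
// Split a large document into chunks below a character budget, breaking
// on blank-line paragraph boundaries so sentences stay whole.
function chunkDocument(text, maxChars = 8000) {
  const paragraphs = text.split(/\n\n+/);
  const chunks = [];
  let current = '';
  for (const para of paragraphs) {
    if (current && current.length + para.length + 2 > maxChars) {
      chunks.push(current);
      current = para;
    } else {
      current = current ? current + '\n\n' + para : para;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}

const doc = ['a'.repeat(30), 'b'.repeat(30), 'c'.repeat(30)].join('\n\n');
// With a 70-char budget, the first two paragraphs share a chunk (62 chars)
// and the third starts a new one
```

Note that a paragraph longer than the budget still becomes its own oversized chunk; a production version would fall back to sentence-level splitting for that case.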
RateLimitExceededError
Cause: Too many requests per minute (default: 60 requests/min)
Solution: Implement exponential backoff, reduce request rate, or request rate limit increase
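On the client side, a small sliding-window limiter can keep you under the documented 60 requests/min before retries ever trigger. An illustrative sketch:

```javascript
// Sliding-window rate limiter: track request timestamps and report
// whether a new call is allowed right now.
function createRateLimiter(maxPerMinute = 60) {
  const timestamps = [];
  return {
    tryAcquire(now = Date.now()) {
      // Drop timestamps that have aged out of the one-minute window
      while (timestamps.length && now - timestamps[0] >= 60000) timestamps.shift();
      if (timestamps.length >= maxPerMinute) return false;
      timestamps.push(now);
      return true;
    }
  };
}

const limiter = createRateLimiter(2);
// Two calls in the same instant succeed; a third is rejected; once the
// window slides past one minute, calls succeed again
```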
Best Practices
1. Start Simple, Iterate
Begin with basic flat schemas and add complexity as needed
2. Use Type Definitions
Always specify field types for better validation and type safety
3. Handle Missing Fields
Design schemas with optional fields for real-world messy data
4. Validate Before Use
Never use extracted data without validation—implement quality checks
5. Cache Results
Cache extraction results for repeated documents to save points and time
6. Monitor Quality
Track success rates, field population, and validation failures over time
7. Batch Process
Process documents in parallel batches for 10x performance improvement
8. Implement Retry Logic
Handle transient failures with exponential backoff retry logic
9. Preprocess Text
Clean OCR output, fix encoding issues, and remove artifacts before extraction
10. Test with Real Data
Test schemas with production-like data to catch edge cases early
Real-World Example: Resume Parser
Complete Implementation
Scenario
HR department needs to parse 500 resumes into structured candidate records
Requirements
Extract: name, email, phone, experience, education, skills
Validate: email format, phone format, required fields present
Process: 500 resumes in under 20 minutes
Quality: 95%+ success rate, flag incomplete records for review
Implementation
Schema:
const resumeSchema = {
personal: {
name: 'string',
email: 'string',
phone: 'string',
location: 'string'
},
experience: [
{
company: 'string',
role: 'string',
duration: 'string',
description: 'string'
}
],
education: [
{
institution: 'string',
degree: 'string',
field: 'string',
year: 'number'
}
],
skills: ['string']
};
Implementation:
async function parseResumes(resumeTexts) {
const results = [];
const flagged = [];
// Batch process (10 concurrent)
for (let i = 0; i < resumeTexts.length; i += 10) {
const batch = resumeTexts.slice(i, i + 10);
const promises = batch.map(async (resume) => {
try {
// Extract with caching
const data = await extractWithCache(resume.text, resumeSchema);
// Validate
const validation = validateExtraction(data);
if (validation.valid) {
return { id: resume.id, data, status: 'success' };
} else {
return { id: resume.id, data, issues: validation.issues, status: 'flagged' };
}
} catch (error) {
return { id: resume.id, error: error.message, status: 'failed' };
}
});
const batchResults = await Promise.allSettled(promises);
batchResults.forEach(result => {
if (result.status === 'fulfilled') {
if (result.value.status === 'success') {
results.push(result.value);
} else {
flagged.push(result.value);
}
}
});
}
return { results, flagged };
}
Results
**Processed**: 500 resumes in 18 minutes
**Success rate**: 96.4% (482 complete, 18 flagged for review)
**Cost**: 1500 points (500 resumes × 3 points) = $15
**Time saved**: 40+ hours of manual data entry
**Quality**: 98% field accuracy on validated records
Next Steps
1. Get Your API Token
Sign up at apphighway.com/dashboard to get your API token and 100 free points
2. Design Your Schema
Define the structure you want to extract using the patterns in this guide
3. Test with Sample Data
Test your schema with representative documents to validate extraction quality
4. Implement Production Patterns
Add batch processing, retry logic, validation, and caching from this guide
5. Monitor & Optimize
Track success rates, field population, and points usage to optimize costs
Transform Unstructured Data with Confidence
Structify is a powerful tool for transforming messy, unstructured text into clean, structured data. By following the schema design patterns, validation strategies, and production best practices in this guide, you can build reliable data extraction pipelines that save hours of manual work and enable new automation workflows. Start with simple schemas, iterate based on real data, and implement quality validation to ensure production-ready results.