Structify: From Chaos to Structure in Seconds
Technical deep dive into the Structify tool. Learn schema design patterns, validation strategies, advanced use cases, error handling, and production-ready implementation patterns for transforming unstructured data.
TL;DR
- Structify transforms unstructured text into structured JSON using AI-powered extraction
- Design schemas with proper field types, validation rules, and nested structures
- Implement error handling for extraction failures, validation errors, and edge cases
- Use validation strategies: strict mode for critical data, lenient mode for exploration
- Costs 3 points per call—process 500 documents on the Starter plan (1500 points)
- Production patterns: batch processing, retry logic, quality validation, and caching
What is Structify?
AI-Powered Data Extraction
Structify is one of AppHighway's most powerful tools for transforming unstructured text into structured data. Whether you're parsing emails, extracting information from documents, or cleaning messy datasets, Structify uses advanced AI models to understand context and extract exactly what you need.
Schema Design Patterns
Build Effective Extraction Schemas
The quality of your results depends heavily on schema design. Here's how to create schemas that extract exactly what you need.
1. Basic Schema Structure
Start with a simple flat schema for straightforward extraction
Example: Contact Extraction
{
"name": "string",
"email": "string",
"phone": "string",
"company": "string",
"role": "string"
}
Input: "Hi, I'm Sarah Johnson from TechCorp (sarah.j@techcorp.com, +1-555-0123). I'm the VP of Engineering."
Output: { "name": "Sarah Johnson", "email": "sarah.j@techcorp.com", "phone": "+1-555-0123", "company": "TechCorp", "role": "VP of Engineering" }
2. Field Type Definitions
Specify exact types for better validation and type safety
Example: Invoice Schema with Types
{
"invoice_number": "string",
"date": "date",
"total_amount": "number",
"currency": "string",
"is_paid": "boolean",
"line_items": "array",
"vendor": {
"name": "string",
"address": "string",
"tax_id": "string"
}
}
3. Nested Object Schemas
Extract hierarchical data with nested objects
Example: Product with Nested Details
{
"product": {
"name": "string",
"sku": "string",
"price": {
"amount": "number",
"currency": "string",
"tax_included": "boolean"
},
"availability": {
"in_stock": "boolean",
"quantity": "number",
"warehouse": "string"
},
"specifications": {
"dimensions": "string",
"weight": "number",
"color": "string"
}
}
}
Nested schemas keep related data organized and make downstream processing easier.
4. Array Field Patterns
Extract lists and collections from text
Simple Arrays (primitives)
{ "tags": ["string"], "prices": ["number"] }
Object Arrays (structured lists)
{
"line_items": [
{
"description": "string",
"quantity": "number",
"unit_price": "number",
"total": "number"
}
]
}
Perfect for invoices, shopping carts, multi-item forms, and product lists.
5. Optional vs Required Fields
Mark fields as optional when they might not appear in all documents
Example: Contact with Optional Fields
{
"name": "string", // Required
"email": "string", // Required
"phone?": "string", // Optional
"company?": "string", // Optional
"role?": "string" // Optional
}
Use the `?` suffix or specify `required: false` in JSON Schema format.
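Before submitting a schema, it can help to normalize the `?` shorthand into explicit required/optional lists. The helper below is an illustrative sketch, not part of any Structify SDK:

```javascript
// Split a flat shorthand schema into required and optional field names.
// Fields ending in "?" are optional; the suffix is stripped from the name.
function splitRequiredOptional(schema) {
  const required = [];
  const optional = [];
  for (const key of Object.keys(schema)) {
    if (key.endsWith('?')) {
      optional.push(key.slice(0, -1));
    } else {
      required.push(key);
    }
  }
  return { required, optional };
}

const contactSchema = {
  'name': 'string',
  'email': 'string',
  'phone?': 'string',
  'company?': 'string',
  'role?': 'string'
};
const { required, optional } = splitRequiredOptional(contactSchema);
// required: ['name', 'email']; optional: ['phone', 'company', 'role']
```

Keeping the split explicit also gives you a checklist to validate against after extraction.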
Validation Strategies
Ensure Data Quality
Validation ensures extracted data meets your quality standards before downstream processing.
1. Strict Mode
Reject responses that don't match the schema exactly
When: Use for critical data: financial records, legal documents, customer orders
Behavior: Returns error if any required field is missing or type mismatches occur
2. Lenient Mode
Return partial results with missing fields as null
When: Use for exploratory analysis, fuzzy matching, optional data extraction
Behavior: Returns best-effort extraction with null for missing fields
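In lenient mode, a simple completeness score can decide whether a partial result is usable or should be re-extracted. A minimal sketch (the 0.6 threshold is an arbitrary example, not a Structify default):

```javascript
// Count non-null fields in a lenient-mode result and accept it only
// when a minimum fraction of the fields were populated.
function completeness(result) {
  const keys = Object.keys(result);
  if (keys.length === 0) return 0;
  const filled = keys.filter((k) => result[k] !== null && result[k] !== undefined);
  return filled.length / keys.length;
}

function acceptLenient(result, threshold = 0.6) {
  return completeness(result) >= threshold;
}

const partial = { name: 'Sarah Johnson', email: 'sarah.j@techcorp.com', phone: null, company: 'TechCorp', role: null };
// 3 of 5 fields populated => completeness 0.6, accepted at the default threshold
```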
Field-Level Validation
Custom Validation Rules
Implement business logic validation after extraction
Example: Invoice Amount Validation
const result = await structifyAPI.extract(text, schema);
// Custom validation
if (result.total_amount < 0) {
throw new ValidationError('Total amount cannot be negative');
}
if (result.line_items.length === 0) {
throw new ValidationError('Invoice must have at least one line item');
}
const calculatedTotal = result.line_items
.reduce((sum, item) => sum + item.total, 0);
if (Math.abs(calculatedTotal - result.total_amount) > 0.01) {
throw new ValidationError('Line items do not match total amount');
}
Advanced Use Cases
Real-World Implementation Patterns
1. Email Conversation Threading
Extract structured data from multi-party email threads
Challenge: Email threads contain multiple messages, quoted replies, signatures
Solution: Extract array of messages with sender, timestamp, body
{
"subject": "string",
"thread_id": "string",
"messages": [
{
"sender": { "name": "string", "email": "string" },
"timestamp": "date",
"body": "string",
"is_reply": "boolean"
}
]
}
Enables sentiment analysis, response time tracking, and conversation history
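Once messages are extracted with this schema, response-time tracking falls out directly. A sketch that averages the gap between consecutive messages (field names follow the thread schema above):

```javascript
// Compute the average gap between consecutive messages, in minutes.
// Timestamps are sorted first, so message order in the array doesn't matter.
function averageResponseMinutes(messages) {
  const times = messages
    .map((m) => new Date(m.timestamp).getTime())
    .sort((a, b) => a - b);
  if (times.length < 2) return 0;
  let total = 0;
  for (let i = 1; i < times.length; i++) total += times[i] - times[i - 1];
  return total / (times.length - 1) / 60000;
}

const thread = [
  { sender: { name: 'A', email: 'a@x.com' }, timestamp: '2024-01-01T10:00:00Z', body: 'Hi', is_reply: false },
  { sender: { name: 'B', email: 'b@x.com' }, timestamp: '2024-01-01T10:30:00Z', body: 'Hello', is_reply: true },
  { sender: { name: 'A', email: 'a@x.com' }, timestamp: '2024-01-01T11:30:00Z', body: 'Thanks', is_reply: true }
];
// Gaps of 30 and 60 minutes => average 45 minutes
```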
2. Contract Clause Extraction
Extract specific clauses and terms from legal documents
Challenge: Contracts have complex structure, legal jargon, nested clauses
Solution: Define schema for standard clauses (payment terms, termination, liability)
{
"parties": [{ "name": "string", "role": "string" }],
"effective_date": "date",
"term": { "duration": "string", "renewal": "boolean" },
"payment_terms": { "amount": "number", "frequency": "string", "due_date": "string" },
"termination": { "notice_period": "string", "conditions": ["string"] },
"liability_cap": "number"
}
Automate contract review, compare terms across vendors, flag risky clauses
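Flagging risky clauses can then be plain post-processing over the extracted fields. The checks and thresholds below are illustrative examples, not legal criteria:

```javascript
// Flag potentially risky terms in an extracted contract.
// Field names follow the clause schema above; the liability threshold
// is an arbitrary example value.
function flagRiskyClauses(contract, { maxLiability = 1000000 } = {}) {
  const flags = [];
  if (contract.liability_cap == null) flags.push('No liability cap extracted');
  else if (contract.liability_cap > maxLiability) flags.push('Liability cap above threshold');
  if (contract.term && contract.term.renewal) flags.push('Auto-renewal clause present');
  if (!contract.termination || !contract.termination.notice_period) {
    flags.push('Missing termination notice period');
  }
  return flags;
}

const contract = {
  parties: [{ name: 'Acme', role: 'vendor' }],
  effective_date: '2024-01-01',
  term: { duration: '12 months', renewal: true },
  termination: { notice_period: '30 days', conditions: [] },
  liability_cap: 250000
};
// This contract trips only the auto-renewal check
```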
3. Multi-Page Form Extraction
Extract data from scanned forms (applications, surveys, registrations)
Challenge: Forms span multiple pages, handwritten entries, checkbox fields
Solution: OCR → Text cleanup → Structify with form field schema
1. OCR with Tesseract/Cloud Vision
2. Text cleaning (remove artifacts, fix encoding)
3. Structify with checkbox handling
4. Validate extracted data
5. Flag low-confidence fields for review
10x faster than manual data entry, enables bulk form processing
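Step 2 of the pipeline (text cleanup) might look like the sketch below; the specific fixes are common OCR cleanup steps, not an exhaustive list:

```javascript
// Clean OCR output before extraction: strip control characters,
// rejoin words hyphenated across line breaks, and collapse whitespace.
function cleanOcrText(raw) {
  return raw
    .replace(/[\u0000-\u0008\u000B\u000C\u000E-\u001F]/g, '') // control chars (keeps \n and \r)
    .replace(/(\w)-\n(\w)/g, '$1$2')                          // rejoin hyphenated line breaks
    .replace(/[ \t]+/g, ' ')                                  // collapse runs of spaces/tabs
    .replace(/\n{3,}/g, '\n\n')                               // collapse blank-line runs
    .trim();
}

const raw = 'Applica-\ntion   Form\n\n\n\nName:  Sarah';
// => 'Application Form\n\nName: Sarah'
```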
4. Product Catalog Migration
Migrate legacy product data from PDFs or text files to structured database
Challenge: Inconsistent formatting, missing fields, mixed units
Solution: Batch processing with schema normalization
{
"sku": "string",
"name": "string",
"category": "string",
"description": "string",
"price": { "amount": "number", "currency": "string" },
"specifications": { "weight": "number", "dimensions": "string", "color": "string" },
"inventory": { "quantity": "number", "warehouse": "string" }
}
Normalize units, deduplicate SKUs, validate prices, enrich missing fields
Migrate 10,000+ products in hours instead of weeks
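Normalization and deduplication are ordinary post-processing once the data is structured. A minimal sketch (the unit table and field names are illustrative):

```javascript
// Normalize weights to kilograms and drop duplicate SKUs, keeping the
// first occurrence. A real migration would cover more units and log changes.
const KG_PER_UNIT = { kg: 1, g: 0.001, lb: 0.453592 };

function normalizeWeightKg(value, unit) {
  const factor = KG_PER_UNIT[unit];
  if (factor === undefined) throw new Error(`Unknown weight unit: ${unit}`);
  return value * factor;
}

function dedupeBySku(products) {
  const seen = new Set();
  return products.filter((p) => {
    if (seen.has(p.sku)) return false;
    seen.add(p.sku);
    return true;
  });
}

const products = [
  { sku: 'A-1', name: 'Widget' },
  { sku: 'A-2', name: 'Gadget' },
  { sku: 'A-1', name: 'Widget (dup)' }
];
// dedupeBySku keeps A-1 and A-2; normalizeWeightKg(500, 'g') is 0.5 kg
```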
Production Implementation
Best Practices for Production Use
1. Batch Processing Pattern
Process multiple documents efficiently
async function batchStructify(documents, schema) {
const results = [];
const errors = [];
// Process in parallel (batch size: 10)
for (let i = 0; i < documents.length; i += 10) {
const batch = documents.slice(i, i + 10);
const promises = batch.map(async (doc) => {
try {
const result = await structifyAPI.extract(doc.text, schema);
return { id: doc.id, data: result, status: 'success' };
} catch (error) {
return { id: doc.id, error: error.message, status: 'failed' };
}
});
const batchResults = await Promise.allSettled(promises);
batchResults.forEach(result => {
if (result.status === 'fulfilled') {
if (result.value.status === 'success') {
results.push(result.value);
} else {
errors.push(result.value);
}
}
});
}
return { results, errors };
}
Process 1000 documents in 15 minutes instead of 3+ hours sequentially
2. Retry Logic for Transient Failures
Handle temporary errors gracefully
async function extractWithRetry(text, schema, maxRetries = 3) {
let lastError;
for (let attempt = 1; attempt <= maxRetries; attempt++) {
try {
return await structifyAPI.extract(text, schema);
} catch (error) {
lastError = error;
// Retry only on transient errors
if (error.code === 'RATE_LIMIT' || error.code === 'TIMEOUT') {
const delay = Math.pow(2, attempt) * 1000; // Exponential backoff
await new Promise(resolve => setTimeout(resolve, delay));
continue;
}
// Don't retry validation errors
throw error;
}
}
throw lastError;
}
Implement exponential backoff: 2s, 4s, 8s delays between retries
3. Quality Validation Pipeline
Validate extraction quality before downstream use
function validateExtraction(result) {
const issues = [];
// Completeness check
if (!result.email || !result.name) {
issues.push('Missing required fields');
}
// Format validation
if (result.email && !isValidEmail(result.email)) {
issues.push('Invalid email format');
}
// Range validation
if (result.price && result.price < 0) {
issues.push('Price cannot be negative');
}
return { valid: issues.length === 0, issues };
}
Flag low-quality extractions for manual review instead of using bad data
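The `isValidEmail` helper referenced above is not defined in the snippet; a minimal version could be a deliberately loose pattern check rather than a full RFC 5322 validator:

```javascript
// Loose email format check: one "@", a non-empty local part, and a
// domain containing at least one dot. Intentionally permissive.
function isValidEmail(value) {
  return /^[^\s@]+@[^\s@]+\.[^\s@]+$/.test(value);
}

// isValidEmail('sarah.j@techcorp.com') => true
// isValidEmail('not-an-email') => false
```

A permissive check is usually the right call here: the goal is to catch obviously broken extractions, not to reject unusual but valid addresses.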
4. Result Caching
Cache extraction results to save points and improve performance
Hash input text + schema → Cache key
const crypto = require('crypto');
function getCacheKey(text, schema) {
const hash = crypto.createHash('sha256');
hash.update(text + JSON.stringify(schema));
return hash.digest('hex');
}
async function extractWithCache(text, schema) {
const cacheKey = getCacheKey(text, schema);
// Check cache
const cached = await cache.get(cacheKey);
if (cached) return JSON.parse(cached);
// Extract
const result = await structifyAPI.extract(text, schema);
// Cache for 24 hours
await cache.set(cacheKey, JSON.stringify(result), 86400);
return result;
}
Save 70% of points on repeated extractions, 10x faster response times
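The `cache` object in the snippet above is assumed to exist; for local testing, a minimal in-memory stand-in with per-entry TTL is enough (production systems would typically use Redis or similar):

```javascript
// Minimal in-memory cache compatible with the cache.get /
// cache.set(key, value, ttlSeconds) calls above. Since `await` works on
// plain values, synchronous methods are fine for a local stand-in.
const store = new Map();

const cache = {
  get(key) {
    const entry = store.get(key);
    if (!entry) return null;
    if (Date.now() > entry.expiresAt) {
      store.delete(key); // lazy eviction on read
      return null;
    }
    return entry.value;
  },
  set(key, value, ttlSeconds) {
    store.set(key, { value, expiresAt: Date.now() + ttlSeconds * 1000 });
  }
};

cache.set('k', '{"name":"Sarah"}', 60);
// cache.get('k') returns the stored string until the TTL expires
```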
5. Monitoring & Observability
Track extraction quality and performance
Alert on: success rate < 95%, extraction time > 10s, daily points > budget
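A lightweight in-process tracker can enforce those thresholds before you wire up a real metrics backend. A sketch (the threshold defaults mirror the alerts listed above):

```javascript
// Track per-call outcomes and surface alerts when success rate drops
// below 95% or an extraction exceeds 10 seconds.
function createMonitor({ minSuccessRate = 0.95, maxDurationMs = 10000 } = {}) {
  let total = 0;
  let succeeded = 0;
  const alerts = [];
  return {
    record({ ok, durationMs }) {
      total += 1;
      if (ok) succeeded += 1;
      if (durationMs > maxDurationMs) alerts.push(`Slow extraction: ${durationMs}ms`);
    },
    successRate() {
      return total === 0 ? 1 : succeeded / total;
    },
    check() {
      if (this.successRate() < minSuccessRate) alerts.push('Success rate below threshold');
      return alerts;
    }
  };
}

const monitor = createMonitor();
monitor.record({ ok: true, durationMs: 1200 });
monitor.record({ ok: false, durationMs: 15000 });
// successRate 0.5 plus one slow call => two alerts after check()
```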
Error Handling & Troubleshooting
Common Issues and Solutions
InsufficientPointsError
Cause: Account balance too low (< 3 points)
Solution: Purchase more points or implement queueing for batch processing
SchemaValidationError
Cause: Extracted data doesn't match the schema (missing required fields, type mismatch)
Solution: Switch to lenient mode, simplify schema, or improve input text quality
EmptyExtractionError
Cause: No data extracted from input text
Solution: Check if input text contains expected data, improve text preprocessing (OCR quality)
TimeoutError
Cause: Extraction took longer than 30 seconds (very large documents)
Solution: Split large documents into smaller chunks, increase timeout, or use async processing
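For the chunking approach, splitting on paragraph boundaries keeps extraction context intact. A sketch; the character budget is an illustrative default, not a documented Structify limit:

```javascript
// Split a large document into chunks below a character budget, breaking
// on blank-line paragraph boundaries so sentences stay whole.
function chunkDocument(text, maxChars = 8000) {
  const paragraphs = text.split(/\n\n+/);
  const chunks = [];
  let current = '';
  for (const para of paragraphs) {
    if (current && current.length + para.length + 2 > maxChars) {
      chunks.push(current);
      current = para;
    } else {
      current = current ? current + '\n\n' + para : para;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}

const doc = ['a'.repeat(30), 'b'.repeat(30), 'c'.repeat(30)].join('\n\n');
// With a 70-char budget, the first two paragraphs share a chunk (62 chars)
// and the third starts a new one
```

Note that a paragraph longer than the budget still becomes its own oversized chunk; a production version would fall back to sentence-level splitting for that case.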
RateLimitExceededError
Cause: Too many requests per minute (default: 60 requests/min)
Solution: Implement exponential backoff, reduce request rate, or request rate limit increase
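On the client side, a small sliding-window limiter can keep you under the documented 60 requests/min before retries ever trigger. An illustrative sketch:

```javascript
// Sliding-window rate limiter: track request timestamps and report
// whether a new call is allowed right now.
function createRateLimiter(maxPerMinute = 60) {
  const timestamps = [];
  return {
    tryAcquire(now = Date.now()) {
      // Drop timestamps that have aged out of the one-minute window
      while (timestamps.length && now - timestamps[0] >= 60000) timestamps.shift();
      if (timestamps.length >= maxPerMinute) return false;
      timestamps.push(now);
      return true;
    }
  };
}

const limiter = createRateLimiter(2);
// Two calls in the same instant succeed; a third is rejected; once the
// window slides past one minute, calls succeed again
```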
Best Practices
1. Start Simple, Iterate
Begin with basic flat schemas and add complexity as needed
2. Use Type Definitions
Always specify field types for better validation and type safety
3. Handle Missing Fields
Design schemas with optional fields for real-world messy data
4. Validate Before Use
Never use extracted data without validation—implement quality checks
5. Cache Results
Cache extraction results for repeated documents to save points and time
6. Monitor Quality
Track success rates, field population, and validation failures over time
7. Batch Process
Process documents in parallel batches for 10x performance improvement
8. Implement Retry Logic
Handle transient failures with exponential backoff retry logic
9. Preprocess Text
Clean OCR output, fix encoding issues, and remove artifacts before extraction
10. Test with Real Data
Test schemas with production-like data to catch edge cases early
Real-World Example: Resume Parser
Complete Implementation
Scenario
HR department needs to parse 500 resumes into structured candidate records
Requirements
Extract: name, email, phone, experience, education, skills
Validate: email format, phone format, required fields present
Process: 500 resumes in under 20 minutes
Quality: 95%+ success rate, flag incomplete records for review
Implementation
Schema:
const resumeSchema = {
personal: {
name: 'string',
email: 'string',
phone: 'string',
location: 'string'
},
experience: [
{
company: 'string',
role: 'string',
duration: 'string',
description: 'string'
}
],
education: [
{
institution: 'string',
degree: 'string',
field: 'string',
year: 'number'
}
],
skills: ['string']
};
Implementation:
async function parseResumes(resumeTexts) {
const results = [];
const flagged = [];
// Batch process (10 concurrent)
for (let i = 0; i < resumeTexts.length; i += 10) {
const batch = resumeTexts.slice(i, i + 10);
const promises = batch.map(async (resume) => {
try {
// Extract with caching
const data = await extractWithCache(resume.text, resumeSchema);
// Validate
const validation = validateExtraction(data);
if (validation.valid) {
return { id: resume.id, data, status: 'success' };
} else {
return { id: resume.id, data, issues: validation.issues, status: 'flagged' };
}
} catch (error) {
return { id: resume.id, error: error.message, status: 'failed' };
}
});
const batchResults = await Promise.allSettled(promises);
batchResults.forEach(result => {
if (result.status === 'fulfilled') {
if (result.value.status === 'success') {
results.push(result.value);
} else {
flagged.push(result.value);
}
}
});
}
return { results, flagged };
}
Results
**Processed**: 500 resumes in 18 minutes
**Success rate**: 96.4% (482 complete, 18 flagged for review)
**Cost**: 1500 points (500 resumes × 3 points) = $15
**Time saved**: 40+ hours of manual data entry
**Quality**: 98% field accuracy on validated records
Next Steps
1. Get Your API Token
Sign up at apphighway.com/dashboard to get your API token and 100 free points
2. Design Your Schema
Define the structure you want to extract using the patterns in this guide
3. Test with Sample Data
Test your schema with representative documents to validate extraction quality
4. Implement Production Patterns
Add batch processing, retry logic, validation, and caching from this guide
5. Monitor & Optimize
Track success rates, field population, and points usage to optimize costs
Transform Unstructured Data with Confidence
Structify is a powerful tool for transforming messy, unstructured text into clean, structured data. By following the schema design patterns, validation strategies, and production best practices in this guide, you can build reliable data extraction pipelines that save hours of manual work and enable new automation workflows. Start with simple schemas, iterate based on real data, and implement quality validation to ensure production-ready results.