Automating invoice processing can significantly streamline your accounts-payable workflow, reduce manual effort, and minimize data-entry errors. In this tutorial, you’ll build a PHP application that extracts structured invoice data—vendor, totals, and line items—in near real time with AWS Textract.

Why automate invoice data extraction?

Processing invoices by hand is repetitive and error-prone. Optical Character Recognition (OCR) services such as AWS Textract automatically read documents and return machine-readable data. When you combine Textract’s purpose-built AnalyzeExpense feature with PHP and a queue-based callback architecture (SNS → SQS), you get a scalable pipeline that:

  • Cuts manual data entry by up to 90 percent.
  • Flags inconsistent totals before they reach your accounting system.
  • Scales from one to thousands of invoices a day with minimal code changes.

Set up AWS Textract and the PHP SDK v3

  1. Prerequisites

    • AWS account with access to Textract, S3, SNS, and SQS.
    • PHP 8.1+ with the following extensions enabled:
      • SimpleXML (required by the SDK)
      • openssl (cryptographic operations)
      • curl >= 7.16.2 (network requests)
    • Composer 2.x.
  2. Install the SDK

composer require aws/aws-sdk-php
  3. Bootstrap the Textract client
use Aws\Textract\TextractClient;

$textractClient = new TextractClient([
    'region'  => 'us-east-1',   // Adjust to your region
    'version' => 'latest',
    // Credentials resolution order: env → shared credentials → IAM role.
]);

For production workloads, prefer IAM roles (EC2 instance profiles, ECS task roles, or Lambda execution roles) over long-lived access keys.
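For local development, the SDK's default credential chain can also read a shared credentials file:

```ini
; ~/.aws/credentials
[default]
aws_access_key_id     = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY
```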

Understand the AnalyzeExpense API

AnalyzeExpense is optimized for invoices and receipts. Highlights:

  • Returns arrays of SummaryFields (high-level data such as VendorName, InvoiceTotal) and LineItemGroups (individual rows).
  • Synchronous call supports images/PDFs ≤ 10 MiB.
  • For larger files (multi-page PDFs up to 500 MiB), use the asynchronous StartExpenseAnalysis / GetExpenseAnalysis flow.

Full API reference: https://docs.aws.amazon.com/textract/latest/dg/API_AnalyzeExpense.html

Build a synchronous invoice-upload endpoint

Suitable for small invoices when you need instant feedback.

<?php
require __DIR__ . '/vendor/autoload.php';

use Aws\Textract\TextractClient;
use Aws\Exception\AwsException;

$textract = new TextractClient([
    'region'  => 'us-east-1',
    'version' => 'latest',
]);

if ($_SERVER['REQUEST_METHOD'] === 'POST' && isset($_FILES['invoice'])) {
    $tmpPath = $_FILES['invoice']['tmp_name'];
    $size    = $_FILES['invoice']['size'];

    if ($size > 10 * 1024 * 1024) { // 10 MiB synchronous limit
        http_response_code(413);
        exit('File exceeds the synchronous limit; submit it through the async flow instead.');
    }

    try {
        $result = $textract->analyzeExpense([
            'Document' => ['Bytes' => file_get_contents($tmpPath)],
        ]);

        header('Content-Type: application/json');
        echo json_encode(parseExpense($result), JSON_PRETTY_PRINT);
    } catch (AwsException $e) {
        http_response_code(500);
        echo 'Textract error: ' . $e->getAwsErrorMessage();
    }
}

parseExpense is a small helper that converts Textract’s nested array into a compact structure—see Extract and structure invoice data.

Implement asynchronous processing with SNS + SQS

High-volume or large PDFs benefit from an async flow so end-users are not blocked while Textract works in the background.

1 · upload the document to S3

// $s3 is an Aws\S3\S3Client configured the same way as the Textract client.
$s3Key = sprintf('uploads/%s.pdf', uniqid());
$s3->putObject([
    'Bucket' => $bucket,
    'Key'    => $s3Key,
    'Body'   => fopen($localPath, 'rb'),
]);

2 · kick off StartExpenseAnalysis

$start = $textract->startExpenseAnalysis([
    'DocumentLocation' => [
        'S3Object' => ['Bucket' => $bucket, 'Name' => $s3Key],
    ],
    'NotificationChannel' => [
        'SNSTopicArn' => $snsTopicArn,
        'RoleArn'     => $roleArn,        // Textract → SNS publish permissions
    ],
    'JobTag' => 'invoice-' . uniqid(),
]);

$jobId = $start->get('JobId');

Store $jobId and the S3 object key in a database table invoice_jobs with status PENDING.
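A minimal sketch of that bookkeeping step, assuming a PDO connection and an invoice_jobs table with job_id, s3_key, and status columns (the schema here is illustrative; SQLite stands in for your real database):

```php
// job_id is the primary key, so re-recording the same job fails loudly
// instead of creating duplicates.
$pdo = new PDO('sqlite::memory:'); // swap for your real DSN
$pdo->exec("CREATE TABLE IF NOT EXISTS invoice_jobs (
    job_id     TEXT PRIMARY KEY,
    s3_key     TEXT NOT NULL,
    status     TEXT NOT NULL DEFAULT 'PENDING',
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
)");

function recordJob(PDO $pdo, string $jobId, string $s3Key): void
{
    $stmt = $pdo->prepare(
        'INSERT INTO invoice_jobs (job_id, s3_key, status) VALUES (?, ?, ?)'
    );
    $stmt->execute([$jobId, $s3Key, 'PENDING']);
}

recordJob($pdo, 'abc123', 'uploads/665f1a2b.pdf');
```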

3 · configure the notification channel

  1. Create an SNS topic textract-complete.
  2. Create an SQS queue textract-events.
  3. Subscribe the queue to the topic with RawMessageDelivery = true.
  4. Attach an IAM policy so Textract can publish to the topic.
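Step 4 is a permissions policy on the RoleArn you pass to StartExpenseAnalysis; a sketch of the statement (region, account ID, and topic name are placeholders for your own values):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "sns:Publish",
      "Resource": "arn:aws:sns:us-east-1:123456789012:textract-complete"
    }
  ]
}
```

The role's trust policy must also allow the textract.amazonaws.com service principal to assume it.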

4 · run a long-poll worker

use Aws\Sqs\SqsClient;

$sqs = new SqsClient([
    'region'  => 'us-east-1',
    'version' => 'latest',
]);

while (true) {
    $resp = $sqs->receiveMessage([
        'QueueUrl'            => $queueUrl,
        'MaxNumberOfMessages' => 10,
        'WaitTimeSeconds'     => 20,
    ]);

    foreach ($resp['Messages'] ?? [] as $msg) {
        // With RawMessageDelivery enabled, the body is the Textract
        // notification itself (no SNS envelope to unwrap).
        $event = json_decode($msg['Body'], true);

        handleTextractEvent($event['JobId'], $event['Status']);

        $sqs->deleteMessage([
            'QueueUrl'      => $queueUrl,
            'ReceiptHandle' => $msg['ReceiptHandle'],
        ]);
    }
}

When Status === 'SUCCEEDED', call getExpenseAnalysis(JobId) and persist the parsed output. If it failed, move the record to a dead-letter table for manual review.

Extract and structure invoice data

AnalyzeExpense returns deeply nested arrays; flatten them for easier consumption.

function parseExpense(array $result): array
{
    $docs = $result['ExpenseDocuments'] ?? [];

    return array_map(function ($doc) {
        $summary = [];
        foreach ($doc['SummaryFields'] as $field) {
            $key   = $field['Type']['Text'] ?? 'Unknown';
            $value = $field['ValueDetection']['Text'] ?? '';
            $summary[$key] = $value;
        }

        $items = [];
        foreach ($doc['LineItemGroups'] as $group) {
            foreach ($group['LineItems'] as $li) {
                $row = [];
                foreach ($li['LineItemExpenseFields'] as $exp) {
                    $k = $exp['Type']['Text'] ?? 'Unknown';
                    $v = $exp['ValueDetection']['Text'] ?? '';
                    $row[$k] = $v;
                }
                $items[] = $row;
            }
        }

        return [
            'summary'    => $summary,
            'line_items' => $items,
        ];
    }, $docs);
}

Sample response snippet

[
  {
    "summary": {
      "INVOICE_RECEIPT_ID": "INV-4912",
      "VENDOR_NAME": "Acme Supplies Inc.",
      "INVOICE_TOTAL": "1,280.55",
      "DUE_DATE": "2025-07-01"
    },
    "line_items": [
      { "DESCRIPTION": "Printer paper", "QUANTITY": "10", "PRICE": "5.00", "ITEM_TOTAL": "50.00" },
      { "DESCRIPTION": "Laser toner", "QUANTITY": "3", "PRICE": "70.00", "ITEM_TOTAL": "210.00" }
    ]
  }
]
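The "flags inconsistent totals" promise from the introduction falls out of this structure almost for free. A minimal sketch (the helper names are ours; Textract only guarantees the field strings):

```php
// Parse a Textract money string such as "1,280.55" into integer cents,
// stripping currency symbols and thousands separators.
function toCents(string $amount): int
{
    $clean = preg_replace('/[^0-9.\-]/', '', $amount);
    return (int) round(((float) $clean) * 100);
}

// Compare the stated invoice total against the sum of line-item totals.
// A small tolerance absorbs per-line rounding.
function totalsConsistent(array $summary, array $lineItems, int $toleranceCents = 1): bool
{
    $stated = toCents($summary['INVOICE_TOTAL'] ?? '0');
    $sum = 0;
    foreach ($lineItems as $row) {
        $sum += toCents($row['ITEM_TOTAL'] ?? '0');
    }
    return abs($stated - $sum) <= $toleranceCents;
}
```

Invoices whose tax or shipping appears only in the summary will legitimately fail this check, which is exactly the kind of record you want routed to a human.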

Build a lightweight dashboard

With the parsed data stored in a relational table (invoices, invoice_items), you can:

  1. List recent jobs with status chips (Pending, Processing, Failed, Done).
  2. Click into an invoice to view summary fields and a <table> of line items.
  3. Offer CSV/JSON export or a “Send to ERP” button.

For a quick start, use a micro-framework like Slim 4 or Laravel 10 with Inertia + Vue. Keep the UI minimal—your goal is to validate the pipeline, not perfect the design.

Error handling and retry strategies

  • AWS SDK: built-in exponential backoff retries transient 5xx/429 errors.
  • SQS worker: delete a message only after successful processing; otherwise it becomes visible again.
  • SQS DLQ: configure maxReceiveCount (e.g., 5) so messages that exceed it are shunted to a dead-letter queue.
  • Database writes: use unique constraints (job_id) so re-processing is idempotent.
  • Application logs: aggregate with CloudWatch Logs or an ELK stack for searchability.

Performance optimization and cost considerations

  • Choose sync vs. async wisely: Each synchronous call blocks a PHP-FPM worker, so prefer async for anything beyond a handful of pages.
  • Batch traffic: Group SQS polls (MaxNumberOfMessages = 10, long-poll 20 seconds) to cut down on empty receives by ≈ 90 percent.
  • Lifecycle policies: Move S3 invoices to Glacier Deep Archive after 30 days if compliance allows.
  • Concurrency quotas: Textract caps concurrent asynchronous jobs (default 10). If you hit LimitExceededException, open an AWS support ticket.
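While waiting on a quota increase, the standard mitigation is retrying with full-jitter exponential backoff. A sketch (the constants are tunable assumptions, and RuntimeException stands in for the SDK's throttling exception):

```php
// Full-jitter backoff: sleep a random time in [0, min(cap, base * 2^attempt)].
function backoffDelayMs(int $attempt, int $baseMs = 200, int $capMs = 20000): int
{
    $ceiling = min($capMs, $baseMs * (2 ** $attempt));
    return random_int(0, $ceiling);
}

// Invoke $fn, retrying on failure up to $maxAttempts times.
function retryWithJitter(callable $fn, int $maxAttempts = 5)
{
    for ($attempt = 0; ; $attempt++) {
        try {
            return $fn();
        } catch (RuntimeException $e) {
            if ($attempt + 1 >= $maxAttempts) {
                throw $e; // out of attempts: surface the error
            }
            usleep(backoffDelayMs($attempt) * 1000);
        }
    }
}
```

In production you would catch the Textract-specific exception class and only retry when the error code indicates throttling.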

Security best practices

  • Grant the Textract publishing role only sns:Publish on the specific topic.
  • Enable server-side encryption (SSE-KMS) on the S3 bucket and the SQS queue.
  • Validate MIME type (application/pdf, image/jpeg, image/png) and enforce a size limit before uploading to S3.
  • Sanitize filenames—avoid path traversal by generating UUID-based keys instead of user-supplied names.
  • Return generic error messages to clients; log full stack traces internally.
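The MIME check should inspect file contents rather than trust the client-supplied name or Content-Type header. A sketch using PHP's finfo extension:

```php
// Sniff the actual file contents and accept only the formats we process.
function isAllowedUpload(string $path): bool
{
    $allowed = ['application/pdf', 'image/jpeg', 'image/png'];
    $finfo   = new finfo(FILEINFO_MIME_TYPE);
    return in_array($finfo->file($path), $allowed, true);
}
```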

Troubleshooting cheat-sheet

  • AccessDeniedException on StartExpenseAnalysis: the role in NotificationChannel lacks sns:Publish. Attach an inline policy.
  • Worker never receives messages: SNS topic not subscribed to SQS, or RawMessageDelivery disabled.
  • JSON fields come back empty: low-resolution scan (< 150 DPI) or skewed pages. Re-scan or deskew with ImageMagick.
  • LimitExceededException: too many parallel jobs. Back off and retry with jitter.

Handle multi-page invoices and batching

Textract paginates asynchronous results: a single GetExpenseAnalysis call may return only part of the output. Use the NextToken field to iterate until exhausted:

$pages = [];
$token = null;

do {
    // Only include NextToken once we have one; the SDK rejects null values.
    $params = ['JobId' => $jobId];
    if ($token !== null) {
        $params['NextToken'] = $token;
    }

    $resp    = $textract->getExpenseAnalysis($params);
    $pages[] = $resp;
    $token   = $resp['NextToken'] ?? null;
} while ($token !== null);

To batch-process hundreds of invoices, enqueue each S3 object key and spin up multiple worker containers. Autoscale with Kubernetes horizontal pod autoscalers based on SQS-queue depth.
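Each page collected above carries its own ExpenseDocuments slice. A small helper (ours, not part of the SDK) flattens them back into one array that parseExpense can consume:

```php
// Merge the ExpenseDocuments arrays from paginated GetExpenseAnalysis
// responses into a single flat list.
function mergeExpensePages(array $pages): array
{
    $docs = [];
    foreach ($pages as $page) {
        foreach ($page['ExpenseDocuments'] ?? [] as $doc) {
            $docs[] = $doc;
        }
    }
    return $docs;
}
```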

Next steps

You can now read invoices automatically and surface reliable, structured data. The obvious follow-up is sending that data downstream—ERP import, approval workflows, or real-time anomaly detection with Lambda + Amazon EventBridge. If you need to preprocess or convert incoming files, check out our Document Processing Service that chains OCR, thumbnail creation, and format conversion in a single Robot pipeline.