Real-time invoice data extraction with PHP and AWS Textract

Automating invoice processing can significantly streamline your accounts-payable workflow, reduce manual effort, and minimize data-entry errors. In this tutorial, you’ll build a PHP application that extracts structured invoice data—vendor, totals, and line items—in near real time with AWS Textract.
Why automate invoice data extraction?
Processing invoices by hand is repetitive and error-prone. Optical Character Recognition (OCR) services such as AWS Textract automatically read documents and return machine-readable data. When you combine Textract’s purpose-built AnalyzeExpense feature with PHP and a queue-based callback architecture (SNS → SQS), you get a scalable pipeline that:
- Cuts manual data entry by up to 90 percent.
- Flags inconsistent totals before they reach your accounting system.
- Scales from one to thousands of invoices a day with minimal code changes.
Set up AWS Textract and the PHP SDK v3
Prerequisites
- AWS account with access to Textract, S3, SNS, and SQS.
- PHP 8.1+ with the following extensions enabled:
  - SimpleXML (required by the SDK)
  - openssl (cryptographic operations)
  - curl >= 7.16.2 (network requests)
- Composer 2.x.
Install the SDK
composer require aws/aws-sdk-php
Bootstrap the Textract client
require __DIR__ . '/vendor/autoload.php';

use Aws\Textract\TextractClient;

$textractClient = new TextractClient([
    'region'  => 'us-east-1', // Adjust to your region
    'version' => 'latest',
    // Credentials resolution order: env → shared credentials → IAM role.
]);
For production workloads, prefer IAM roles (EC2 instance profiles, ECS task roles, or Lambda execution roles) over long-lived access keys.
Understand the AnalyzeExpense API
AnalyzeExpense is optimized for invoices and receipts. Highlights:
- Returns arrays of SummaryFields (high-level data such as the vendor name and invoice total) and LineItemGroups (individual rows).
- Synchronous call supports images/PDFs ≤ 10 MiB.
- For larger files (multi-page PDFs up to 500 MiB), use the asynchronous StartExpenseAnalysis → GetExpenseAnalysis flow.
Full API reference: https://docs.aws.amazon.com/textract/latest/dg/API_AnalyzeExpense.html
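The size limits above suggest a simple routing rule. The following is a minimal sketch, assuming a hypothetical helper named chooseTextractFlow() and the limits quoted in this section; it is not part of the AWS SDK, and the limits should be checked against the current Textract quotas.

```php
<?php
// Limits quoted in this section; verify them against the current Textract quotas.
const SYNC_LIMIT_BYTES  = 10 * 1024 * 1024;  // 10 MiB
const ASYNC_LIMIT_BYTES = 500 * 1024 * 1024; // 500 MiB

/**
 * Decide which Textract flow a file of the given size should use.
 * Hypothetical helper: returns 'sync', 'async', or 'reject'.
 */
function chooseTextractFlow(int $sizeInBytes): string
{
    if ($sizeInBytes <= 0 || $sizeInBytes > ASYNC_LIMIT_BYTES) {
        return 'reject';
    }

    return $sizeInBytes <= SYNC_LIMIT_BYTES ? 'sync' : 'async';
}
```

An upload handler could call chooseTextractFlow($_FILES['invoice']['size']) and either run the synchronous call directly or hand the file to the asynchronous flow.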
Build a synchronous invoice-upload endpoint
Suitable for small invoices when you need instant feedback.
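<?php
require __DIR__ . '/vendor/autoload.php';

use Aws\Textract\TextractClient;
use Aws\Exception\AwsException;

$textract = new TextractClient([
    'region'  => 'us-east-1',
    'version' => 'latest',
]);

if ($_SERVER['REQUEST_METHOD'] === 'POST' && isset($_FILES['invoice'])) {
    $upload = $_FILES['invoice'];

    // Reject failed uploads before touching the file.
    if ($upload['error'] !== UPLOAD_ERR_OK || !is_uploaded_file($upload['tmp_name'])) {
        http_response_code(400);
        exit('Upload failed.');
    }

    // The synchronous API accepts at most 10 MiB; larger files need the async flow.
    if ($upload['size'] > 10 * 1024 * 1024) {
        http_response_code(413);
        exit('File exceeds the 10 MiB synchronous limit; use the asynchronous flow.');
    }

    try {
        $result = $textract->analyzeExpense([
            'Document' => ['Bytes' => file_get_contents($upload['tmp_name'])],
        ]);

        header('Content-Type: application/json');
        // parseExpense() expects a plain array, so convert the SDK result first.
        echo json_encode(parseExpense($result->toArray()), JSON_PRETTY_PRINT);
    } catch (AwsException $e) {
        // Log details server-side and return a generic message to the client.
        error_log('Textract error: ' . $e->getAwsErrorMessage());
        http_response_code(500);
        echo 'Invoice analysis failed.';
    }
}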
parseExpense is a small helper that converts Textract’s nested array into a compact structure; see "Extract and structure invoice data".
Implement asynchronous processing with SNS + SQS
High-volume or large PDFs benefit from an async flow so end-users are not blocked while Textract works in the background.
1 · Upload the document to S3
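use Aws\S3\S3Client;
$s3 = new S3Client([
    'region'  => 'us-east-1',
    'version' => 'latest',
]);
// Generate a random key rather than trusting the client-supplied filename.
$s3Key = sprintf('uploads/%s.pdf', bin2hex(random_bytes(16)));
$s3->putObject([
    'Bucket' => $bucket, // Name of your upload bucket
    'Key'    => $s3Key,
    'Body'   => fopen($localPath, 'rb'),
]);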
2 · Kick off StartExpenseAnalysis
$start = $textract->startExpenseAnalysis([
    'DocumentLocation' => [
        'S3Object' => ['Bucket' => $bucket, 'Name' => $s3Key],
    ],
    'NotificationChannel' => [
        'SNSTopicArn' => $snsTopicArn,
        'RoleArn'     => $roleArn, // Textract → SNS publish permissions
    ],
    'JobTag' => 'invoice-' . uniqid(),
]);
$jobId = $start->get('JobId');
Store $jobId and the S3 object key in a database table invoice_jobs with status PENDING.
3 · Configure the notification channel
- Create an SNS topic textract-complete.
- Create an SQS queue textract-events.
- Subscribe the queue to the topic with RawMessageDelivery = true.
- Attach an IAM policy so Textract can publish to the topic.
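One detail the steps above leave implicit is that the queue itself needs an access policy allowing the topic to deliver messages to it. The following is a sketch only, assuming a hypothetical helper buildQueuePolicy() and placeholder ARNs; adapt the account ID, Region, and resource names to your own setup.

```php
<?php
/**
 * Build an SQS access policy that lets one SNS topic send messages to one queue.
 * Hypothetical helper; both ARNs are supplied by the caller.
 */
function buildQueuePolicy(string $queueArn, string $topicArn): string
{
    $policy = [
        'Version'   => '2012-10-17',
        'Statement' => [[
            'Sid'       => 'AllowTextractTopicToSend',
            'Effect'    => 'Allow',
            'Principal' => ['Service' => 'sns.amazonaws.com'],
            'Action'    => 'sqs:SendMessage',
            'Resource'  => $queueArn,
            // Restrict delivery to the one topic that carries Textract events.
            'Condition' => ['ArnEquals' => ['aws:SourceArn' => $topicArn]],
        ]],
    ];

    return json_encode($policy, JSON_UNESCAPED_SLASHES | JSON_THROW_ON_ERROR);
}

// Placeholder ARNs for illustration only.
$policyJson = buildQueuePolicy(
    'arn:aws:sqs:us-east-1:123456789012:textract-events',
    'arn:aws:sns:us-east-1:123456789012:textract-complete'
);
```

The resulting JSON string can be set as the queue's Policy attribute, for example through SqsClient::setQueueAttributes() or the console.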
4 · Run a long-poll worker
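use Aws\Sqs\SqsClient;

$sqs = new SqsClient([
    'region'  => 'us-east-1',
    'version' => 'latest',
]);

while (true) {
    $resp = $sqs->receiveMessage([
        'QueueUrl'            => $queueUrl,
        'MaxNumberOfMessages' => 10,
        'WaitTimeSeconds'     => 20, // Long polling
    ]);

    foreach ($resp['Messages'] ?? [] as $msg) {
        $body = json_decode($msg['Body'], true);
        // With raw message delivery the body is the Textract notification itself;
        // otherwise it is an SNS envelope carrying it as a JSON string in 'Message'.
        $event = isset($body['Message']) ? json_decode($body['Message'], true) : $body;

        if (!isset($event['JobId'], $event['Status'])) {
            continue; // Leave malformed messages for the redrive policy
        }

        handleTextractEvent($event['JobId'], $event['Status']);

        // Delete only after successful processing so failures are redelivered.
        $sqs->deleteMessage([
            'QueueUrl'      => $queueUrl,
            'ReceiptHandle' => $msg['ReceiptHandle'],
        ]);
    }
}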
When Status === 'SUCCEEDED', call getExpenseAnalysis with the JobId and persist the parsed output. If the job failed, move the record to a dead-letter table for manual review.
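The branching described above can be sketched as a small router. This is an illustration under assumptions: routeTextractEvent() and the two callbacks are hypothetical names, and any status other than SUCCEEDED is treated as a failure.

```php
<?php
/**
 * Route a Textract completion event to the matching handler.
 * $onSuccess and $onFailure are hypothetical callbacks supplied by your app.
 */
function routeTextractEvent(string $jobId, string $status, callable $onSuccess, callable $onFailure): string
{
    if ($status === 'SUCCEEDED') {
        $onSuccess($jobId); // e.g. fetch the results and persist them
        return 'DONE';
    }

    $onFailure($jobId, $status); // e.g. copy the job into a dead-letter table
    return 'FAILED';
}

// Usage with stub callbacks that only record what happened.
$log = [];
$onSuccess = function (string $jobId) use (&$log): void {
    $log[] = "stored:$jobId";
};
$onFailure = function (string $jobId, string $status) use (&$log): void {
    $log[] = "dead-letter:$jobId:$status";
};

$first  = routeTextractEvent('job-1', 'SUCCEEDED', $onSuccess, $onFailure);
$second = routeTextractEvent('job-2', 'FAILED', $onSuccess, $onFailure);
```

Passing the handlers in as callbacks keeps the routing logic testable without AWS access; in a real worker the success callback would wrap the getExpenseAnalysis call.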
Extract and structure invoice data
AnalyzeExpense returns deeply nested arrays; flatten them for easier consumption.
function parseExpense(array $result): array
{
    $docs = $result['ExpenseDocuments'] ?? [];

    return array_map(function (array $doc): array {
        // Summary fields: map each detected type to its text value.
        $summary = [];
        foreach ($doc['SummaryFields'] ?? [] as $field) {
            $key   = $field['Type']['Text'] ?? 'Unknown';
            $value = $field['ValueDetection']['Text'] ?? '';
            $summary[$key] = $value; // A repeated type overwrites the earlier value
        }

        // Line items: one associative row per detected line.
        $items = [];
        foreach ($doc['LineItemGroups'] ?? [] as $group) {
            foreach ($group['LineItems'] ?? [] as $li) {
                $row = [];
                foreach ($li['LineItemExpenseFields'] ?? [] as $exp) {
                    $k = $exp['Type']['Text'] ?? 'Unknown';
                    $v = $exp['ValueDetection']['Text'] ?? '';
                    $row[$k] = $v;
                }
                $items[] = $row;
            }
        }

        return [
            'summary'    => $summary,
            'line_items' => $items,
        ];
    }, $docs);
}
Sample response snippet
[
  {
    "summary": {
      "INVOICE_RECEIPT_ID": "INV-4912",
      "VENDOR_NAME": "Acme Supplies Inc.",
      "INVOICE_TOTAL": "1,280.55",
      "DUE_DATE": "2025-07-01"
    },
    "line_items": [
      { "DESCRIPTION": "Printer paper", "QUANTITY": "10", "PRICE": "5.00", "ITEM_TOTAL": "50.00" },
      { "DESCRIPTION": "Laser toner", "QUANTITY": "3", "PRICE": "70.00", "ITEM_TOTAL": "210.00" }
    ]
  }
]
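Once the data is in this shape, flagging inconsistent totals becomes a small arithmetic check. The sketch below makes several assumptions: the helper names amountToCents() and totalsMatch() are hypothetical, the key names default to those in the sample above (the labels in real Textract responses may differ, so they are parameters), amounts use a dot as the decimal separator, and the invoice total is expected to equal the plain sum of line items with no tax or shipping.

```php
<?php
/** Convert an amount such as "1,280.55" or "$50.00" to integer cents; null if unparseable. */
function amountToCents(string $text): ?int
{
    // Drop currency symbols, spaces, and thousands separators.
    $clean = preg_replace('/[^0-9.\-]/', '', $text);
    if ($clean === '' || !is_numeric($clean)) {
        return null;
    }

    return (int) round(((float) $clean) * 100);
}

/**
 * Compare the sum of line-item totals with the invoice total.
 * Key names are assumptions taken from the sample response.
 */
function totalsMatch(array $invoice, string $totalKey = 'INVOICE_TOTAL', string $itemKey = 'ITEM_TOTAL'): bool
{
    $expected = amountToCents($invoice['summary'][$totalKey] ?? '');
    if ($expected === null) {
        return false;
    }

    $sum = 0;
    foreach ($invoice['line_items'] ?? [] as $row) {
        $cents = amountToCents($row[$itemKey] ?? '');
        if ($cents === null) {
            return false; // A missing or unreadable line total counts as a mismatch
        }
        $sum += $cents;
    }

    return $sum === $expected;
}

// Hand-written invoice for illustration.
$invoice = [
    'summary'    => ['INVOICE_TOTAL' => '260.00'],
    'line_items' => [
        ['ITEM_TOTAL' => '50.00'],
        ['ITEM_TOTAL' => '210.00'],
    ],
];
$consistent = totalsMatch($invoice);
```

Working in integer cents avoids floating-point drift when many rows are summed. Invoices that fail the check can be routed to manual review instead of the accounting system.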
Build a lightweight dashboard
With the parsed data stored in relational tables (invoices, invoice_items), you can:
- List recent jobs with status chips (Pending, Processing, Failed, Done).
- Click into an invoice to view summary fields and a <table> of line items.
- Offer CSV/JSON export or a “Send to ERP” button.
For a quick start, use a micro-framework like Slim 4, or Laravel 10 with Inertia + Vue. Keep the UI minimal: your goal is to validate the pipeline, not perfect the design.
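As one way to approach the CSV export mentioned above, here is a minimal sketch. The helper name lineItemsToCsv() is hypothetical, and the input is assumed to be the line_items array produced by parseExpense.

```php
<?php
/**
 * Render line items as CSV text.
 * The header row is the union of keys seen across all rows.
 */
function lineItemsToCsv(array $lineItems): string
{
    $columns = [];
    foreach ($lineItems as $row) {
        foreach (array_keys($row) as $key) {
            $columns[$key] = true;
        }
    }
    $columns = array_keys($columns);

    $handle = fopen('php://memory', 'w+');
    fputcsv($handle, $columns, ',', '"', '\\');
    foreach ($lineItems as $row) {
        $line = [];
        foreach ($columns as $column) {
            $line[] = $row[$column] ?? ''; // Missing fields become empty cells
        }
        fputcsv($handle, $line, ',', '"', '\\');
    }

    rewind($handle);
    $csv = stream_get_contents($handle);
    fclose($handle);

    return $csv;
}

$csv = lineItemsToCsv([
    ['DESCRIPTION' => 'Printer paper', 'QUANTITY' => '10', 'PRICE' => '5.00', 'ITEM_TOTAL' => '50.00'],
    ['DESCRIPTION' => 'Laser toner', 'QUANTITY' => '3', 'PRICE' => '70.00', 'ITEM_TOTAL' => '210.00'],
]);
```

Writing to php://memory keeps the export free of temporary files; a controller can send $csv with a text/csv content type.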
Error handling and retry strategies
Layer | Strategy |
---|---|
AWS SDK | Built-in exponential backoff for retryable 5xx/429 errors. |
SQS worker | Delete message only after successful processing; otherwise it becomes visible again. |
SQS DLQ | Configure maxReceiveCount (e.g., 5). Messages that exceed it are shunted to a dead-letter queue. |
Database writes | Use unique constraints (job_id) so re-processing is idempotent. |
Application logs | Aggregate with CloudWatch Logs or an ELK stack for searchability. |
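The idempotent-write row in the table can be illustrated with a unique key on job_id. This sketch assumes an in-memory SQLite database, a simplified invoices schema, and a hypothetical helper saveInvoiceOnce(); none of these come from the original pipeline, and the SQL syntax differs on other databases.

```php
<?php
// In-memory SQLite stands in for your real database; the schema is an assumption.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE invoices (
    job_id  TEXT PRIMARY KEY,
    payload TEXT NOT NULL
)');

/** Store a parsed invoice once; returns false when the job was already saved. */
function saveInvoiceOnce(PDO $pdo, string $jobId, array $parsed): bool
{
    // INSERT OR IGNORE is SQLite syntax. MySQL uses INSERT IGNORE and
    // PostgreSQL uses INSERT ... ON CONFLICT DO NOTHING.
    $stmt = $pdo->prepare('INSERT OR IGNORE INTO invoices (job_id, payload) VALUES (?, ?)');
    $stmt->execute([$jobId, json_encode($parsed)]);

    return $stmt->rowCount() === 1;
}

$parsed = ['summary' => ['VENDOR_NAME' => 'Acme Supplies Inc.'], 'line_items' => []];
$first  = saveInvoiceOnce($pdo, 'job-1', $parsed); // first delivery is stored
$second = saveInvoiceOnce($pdo, 'job-1', $parsed); // redelivery is ignored
```

Because SQS can deliver a message more than once, letting the database reject the duplicate is simpler than tracking processed messages in the worker.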
Performance optimization and cost considerations
- Choose sync vs. async wisely: Each synchronous call blocks a PHP-FPM worker, so prefer async for anything beyond a handful of pages.
- Batch traffic: Group SQS polls (MaxNumberOfMessages = 10, long-poll 20 seconds) to cut down on empty receives by ≈ 90 percent.
- Lifecycle policies: Move S3 invoices to Glacier Deep Archive after 30 days if compliance allows.
- Concurrency quotas: Textract caps concurrent asynchronous jobs per account and Region (the default quoted here is 10; confirm the current value in Service Quotas). If you hit LimitExceededException, request a quota increase or open an AWS support ticket.
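The lifecycle policy from the list above can be expressed as a parameter array for the S3 client. This is a sketch under assumptions: deepArchiveLifecycle() is a hypothetical helper, and the bucket name, prefix, and rule ID are placeholders.

```php
<?php
/**
 * Build the argument array for S3Client::putBucketLifecycleConfiguration().
 * Hypothetical helper; bucket, prefix, and rule ID are placeholders.
 */
function deepArchiveLifecycle(string $bucket, string $prefix, int $afterDays): array
{
    return [
        'Bucket' => $bucket,
        'LifecycleConfiguration' => [
            'Rules' => [[
                'ID'          => 'archive-invoices',
                'Status'      => 'Enabled',
                'Filter'      => ['Prefix' => $prefix],
                'Transitions' => [[
                    'Days'         => $afterDays,
                    'StorageClass' => 'DEEP_ARCHIVE',
                ]],
            ]],
        ],
    ];
}

$params = deepArchiveLifecycle('my-invoice-bucket', 'uploads/', 30);
// $s3->putBucketLifecycleConfiguration($params); // requires a configured S3Client
```

Scoping the rule to the uploads/ prefix leaves other objects in the bucket untouched.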
Security best practices
- Grant the Textract publishing role only sns:Publish on the specific topic.
- Enable server-side encryption (SSE-KMS) on the S3 bucket and the SQS queue.
- Validate MIME type (application/pdf, image/jpeg, image/png) and enforce a size limit before uploading to S3.
- Sanitize filenames: avoid path traversal by generating UUID-based keys instead of user-supplied names.
- Return generic error messages to clients; log full stack traces internally.
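Two of the checks above, MIME validation and generated object keys, can be sketched as follows. The helper names are hypothetical, the fileinfo extension is assumed to be enabled, and random hex strings stand in for UUIDs.

```php
<?php
const ALLOWED_MIME_TYPES = ['application/pdf', 'image/jpeg', 'image/png'];

/** Return the detected MIME type if the file is acceptable, or null otherwise. */
function detectAllowedMime(string $path, int $maxBytes): ?string
{
    clearstatcache(true, $path);
    $size = filesize($path);
    if ($size === false || $size === 0 || $size > $maxBytes) {
        return null;
    }

    // Inspect the file contents rather than trusting the client-supplied type.
    $mime = (new finfo(FILEINFO_MIME_TYPE))->file($path);

    return in_array($mime, ALLOWED_MIME_TYPES, true) ? $mime : null;
}

/** Build an S3 key from random bytes so user-supplied names never reach the path. */
function randomObjectKey(string $extension): string
{
    return sprintf('uploads/%s.%s', bin2hex(random_bytes(16)), $extension);
}
```

In an upload handler, detectAllowedMime($_FILES['invoice']['tmp_name'], 10 * 1024 * 1024) would run before the file is sent to S3, and the returned type can decide which extension to pass to randomObjectKey().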
Troubleshooting cheat-sheet
Symptom | Likely Cause / Fix |
---|---|
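AccessDeniedException on StartExpenseAnalysis | The role in NotificationChannel lacks sns:Publish. Attach an inline policy. |
Worker never receives messages | The SQS queue is not subscribed to the SNS topic, or the queue's access policy does not allow the topic to send messages. |
Worker cannot parse messages | The RawMessageDelivery setting does not match what the worker expects: without raw delivery, the Textract notification is wrapped in an SNS envelope. |
JSON fields come back empty | Low-resolution scan (< 150 DPI) or skewed pages. Re-scan or deskew with ImageMagick. |
LimitExceededException | Too many parallel jobs. Back off and retry with jitter. |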
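The "back off and retry with jitter" advice in the last row can be sketched like this. It is an illustration, not the SDK's built-in retry logic: the helper names, the full-jitter formula, and the default delays are all assumptions.

```php
<?php
/**
 * Full-jitter backoff: a random delay between 0 and min(cap, base * 2^attempt).
 */
function backoffDelayMs(int $attempt, int $baseMs = 200, int $capMs = 20000): int
{
    $ceiling = min($capMs, $baseMs * (2 ** min($attempt, 20)));

    return random_int(0, (int) $ceiling);
}

/**
 * Call $operation until it succeeds or $maxAttempts is reached.
 * $isRetryable decides whether a thrown exception is worth retrying.
 */
function retryWithJitter(callable $operation, callable $isRetryable, int $maxAttempts = 5, int $baseMs = 200)
{
    for ($attempt = 0; ; $attempt++) {
        try {
            return $operation();
        } catch (Throwable $e) {
            if ($attempt + 1 >= $maxAttempts || !$isRetryable($e)) {
                throw $e; // Out of attempts, or not a throttling error
            }
            usleep(backoffDelayMs($attempt, $baseMs) * 1000);
        }
    }
}
```

With the AWS SDK, $operation could wrap the startExpenseAnalysis call and $isRetryable could check whether the exception is an AwsException whose getAwsErrorCode() returns 'LimitExceededException'.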
Handle multi-page invoices and batching
Textract paginates asynchronous results, so a single GetExpenseAnalysis call may return only part of the output. Use the NextToken field to iterate until it is exhausted:
$pages = [];
$token = null;
do {
    $params = ['JobId' => $jobId];
    if ($token !== null) {
        $params['NextToken'] = $token; // Omit the key on the first request
    }
    $resp    = $textract->getExpenseAnalysis($params);
    $pages[] = $resp->toArray();
    $token   = $resp['NextToken'] ?? null;
} while ($token !== null);
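The $pages array collected by the loop above can then be combined before parsing. The following is a sketch, assuming a hypothetical helper mergeExpensePages() and assuming that each page contains whole expense documents; the sample pages are hand-written stand-ins for real responses.

```php
<?php
/**
 * Combine the ExpenseDocuments from every paginated response into one
 * array shaped like a single AnalyzeExpense result.
 */
function mergeExpensePages(array $pages): array
{
    $documents = [];
    foreach ($pages as $page) {
        foreach ($page['ExpenseDocuments'] ?? [] as $doc) {
            $documents[] = $doc;
        }
    }

    return ['ExpenseDocuments' => $documents];
}

// Two hand-written pages standing in for real GetExpenseAnalysis responses.
$pages = [
    [
        'ExpenseDocuments' => [['ExpenseIndex' => 1, 'SummaryFields' => [], 'LineItemGroups' => []]],
        'NextToken'        => 'abc',
    ],
    [
        'ExpenseDocuments' => [['ExpenseIndex' => 2, 'SummaryFields' => [], 'LineItemGroups' => []]],
    ],
];
$merged = mergeExpensePages($pages);
```

Because the merged array has the same top-level shape as a synchronous result, it can be passed to parseExpense unchanged.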
To batch-process hundreds of invoices, enqueue each S3 object key and spin up multiple worker containers. Autoscale with Kubernetes horizontal pod autoscalers based on SQS-queue depth.
References
- AWS Textract Developer Guide — https://docs.aws.amazon.com/textract/latest/dg/
- AWS SDK for PHP v3 Source — https://github.com/aws/aws-sdk-php
- PHP-SDK Retry Configuration — https://docs.aws.amazon.com/sdk-for-php/v3/developer-guide/retries.html
Next steps
You can now read invoices automatically and surface reliable, structured data. The obvious follow-up is sending that data downstream—ERP import, approval workflows, or real-time anomaly detection with Lambda + Amazon EventBridge. If you need to preprocess or convert incoming files, check out our Document Processing Service that chains OCR, thumbnail creation, and format conversion in a single Robot pipeline.