Stream Tar archives in PHP without memory limits

Creating large tar archives in PHP can be tricky once your data set grows beyond a few hundred
megabytes. While PHP's PharData class is stream-oriented for many operations, the overall process
still operates within PHP’s configured memory_limit. When archiving a vast number of files or very
large individual files, you might encounter the "Allowed memory size exhausted" error, especially
with default limits like 128M. Fortunately, we can call the system-level tar utility with
proc_open() and let the operating system handle the heavy lifting. The result is a constant-memory
stream that you can pipe straight to the browser, object storage, or another process, enabling
efficient PHP tar streaming.
Understand PHP memory limits with PharData
While PharData is efficient for many operations, it may face memory constraints with very large
archives or when performing bulk operations. The main limitation comes from PHP's memory_limit
setting rather than PharData itself. For instance, buffering, metadata handling, and opcode memory
can collectively contribute to exceeding this limit when dealing with thousands of files or
exceptionally large ones.
If you only need a write-once, read-never archive, there is often no significant benefit in keeping
the entire operation within the PHP process. Offloading the work to the system's tar command frees
you from PHP’s memory limitations and provides streaming capabilities inherently. This makes it an
excellent solution for memory-efficient tar creation.
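To see how much headroom a script has before it hits that limit, we can compare memory_limit with the current peak usage. The sketch below is only an illustration: memoryLimitToBytes() is a hypothetical helper name, and it assumes the usual php.ini shorthand suffixes (K, M, G) plus -1 for "no limit".

```php
<?php
/**
 * Hypothetical helper: convert a php.ini shorthand size such as "128M" into bytes.
 * Returns null for "-1" (no limit) and for an empty value.
 */
function memoryLimitToBytes(string $value): ?int
{
    $value = trim($value);
    if ($value === '' || $value === '-1') {
        return null;
    }
    $number = (int) $value;                     // "128M" casts to 128
    $suffix = strtolower(substr($value, -1));   // last character carries the unit
    switch ($suffix) {
        case 'g':
            return $number * 1024 ** 3;
        case 'm':
            return $number * 1024 ** 2;
        case 'k':
            return $number * 1024;
        default:
            return $number;                     // plain byte count
    }
}

// Report the configured limit next to the peak usage of the current script.
$limit = memoryLimitToBytes((string) ini_get('memory_limit'));
$peak = memory_get_peak_usage(true);
echo $limit === null
    ? "memory_limit: unlimited, peak so far: {$peak} bytes\n"
    : sprintf("memory_limit: %d bytes, peak so far: %d bytes\n", $limit, $peak);
```

Logging these two numbers before and after an archiving job is a cheap way to check whether the PHP process is getting close to its limit.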
Stream Tar output with proc_open()
proc_open() starts an external command and exposes its standard input, output, and error streams
as PHP resources. The following helper function reads from the tar command's stdout and pushes
the bytes directly to the client, ensuring memory usage remains constant, which is ideal for a PHP
tar download.
<?php
function streamTarArchive(string $directory, string $downloadName = 'archive.tar'): void
{
    // Sanitize the download name to prevent header injection or other issues
    $safeDownloadName = basename(str_replace(['/', '\\'], '', $downloadName));
    if (empty($safeDownloadName)) {
        $safeDownloadName = 'archive.tar';
    }
    $cmd = 'tar -cf - ' . escapeshellarg($directory);
    $descriptorSpec = [
        0 => ["pipe", "r"], // stdin
        1 => ["pipe", "w"], // stdout
        2 => ["pipe", "w"]  // stderr
    ];
    $pipes = [];
    $process = null;
    try {
        $process = proc_open($cmd, $descriptorSpec, $pipes);
        if (!is_resource($process)) {
            throw new RuntimeException('Failed to create tar process. Is tar installed and in PATH?');
        }
        // tar -c reads nothing from stdin, so close it right away
        fclose($pipes[0]);
        // Read stderr without blocking so a burst of warnings cannot fill the pipe and stall tar
        stream_set_blocking($pipes[2], false);
        $stderrOutput = '';
        header('Content-Type: application/x-tar');
        header('Content-Disposition: attachment; filename="' . $safeDownloadName . '"');
        header('X-Content-Type-Options: nosniff'); // Security: prevent MIME-sniffing
        // Disable output buffering if active
        if (ob_get_level()) {
            ob_end_clean();
        }
        // Stream the tar output
        while (!feof($pipes[1])) {
            $chunk = fread($pipes[1], 8192); // Read in 8KB chunks
            $stderrOutput .= (string) fread($pipes[2], 8192); // Drain any pending warnings
            if ($chunk === false) {
                break; // Read error: stop instead of spinning on a broken pipe
            }
            if ($chunk === '') {
                continue; // Nothing read; feof() ends the loop once tar closes stdout
            }
            echo $chunk;
            flush(); // Flush output to the client
        }
        // Collect whatever tar still has to say on stderr
        stream_set_blocking($pipes[2], true);
        $stderrOutput .= (string) stream_get_contents($pipes[2]);
        if ($stderrOutput !== '') {
            // Log tar errors but don't send them to the client if headers are already sent
            error_log('tar process stderr: ' . $stderrOutput);
        }
    } catch (Throwable $e) {
        // Log the exception
        error_log('Error during tar streaming: ' . $e->getMessage());
        // If headers haven't been sent, we can send an error response
        if (!headers_sent()) {
            // Potentially clear any headers already set if appropriate
            header_remove();
            http_response_code(500);
            echo 'An error occurred while generating the archive.';
        }
        // Otherwise, the stream might be incomplete, which is hard to recover from at this point.
    } finally {
        // Ensure all pipes are closed
        if (isset($pipes[0]) && is_resource($pipes[0])) fclose($pipes[0]);
        if (isset($pipes[1]) && is_resource($pipes[1])) fclose($pipes[1]);
        if (isset($pipes[2]) && is_resource($pipes[2])) fclose($pipes[2]);
        // Close the process and record a non-zero exit status
        if (is_resource($process)) {
            $exitCode = proc_close($process);
            if ($exitCode !== 0) {
                error_log('tar exited with status ' . $exitCode);
            }
        }
    }
}
This function emits a valid tar stream without writing any intermediate file to disk. On a modern VPS, this approach happily pushes multi-gigabyte archives while PHP memory usage stays remarkably low, often below 10 MB.
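The same pipe can feed a destination other than the browser, such as a local file or a stream wrapper for object storage. Here is a minimal sketch under a few assumptions: tarToStream() is a hypothetical helper, the host is Unix-like with tar on the PATH, and stderr is simply discarded to keep the example short. stream_copy_to_stream() moves the data chunk by chunk, so the archive is never held in PHP memory as a whole.

```php
<?php
/**
 * Hypothetical helper: pipe `tar -cf -` into any writable stream.
 * Returns the number of bytes copied.
 */
function tarToStream(string $directory, $destination): int
{
    $cmd = 'tar -cf - ' . escapeshellarg($directory);
    $spec = [
        0 => ['file', '/dev/null', 'r'], // tar -c reads nothing from stdin
        1 => ['pipe', 'w'],              // archive bytes
        2 => ['file', '/dev/null', 'w'], // discard warnings so no pipe can fill up
    ];
    $process = proc_open($cmd, $spec, $pipes);
    if (!is_resource($process)) {
        throw new RuntimeException('Could not start tar.');
    }
    $bytes = stream_copy_to_stream($pipes[1], $destination);
    fclose($pipes[1]);
    $exitCode = proc_close($process);
    if ($bytes === false || $exitCode !== 0) {
        throw new RuntimeException('tar failed with exit code ' . $exitCode);
    }
    return $bytes;
}

// Demo with made-up paths: archive a throwaway directory into a local file.
$source = sys_get_temp_dir() . '/tar-demo-' . uniqid();
mkdir($source);
file_put_contents($source . '/build.log', "step 1 ok\n");
$target = $source . '.tar';
$out = fopen($target, 'wb');
$written = tarToStream($source, $out);
fclose($out);
echo $written . " bytes written to " . $target . "\n";
```

Any writable stream resource works as the destination, so swapping the local file for another target only changes the fopen() call.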
Archive CI/CD logs in real time
Continuous integration and delivery (CI/CD) servers often generate numerous small log files, which
makes them good candidates for real-time archiving. The snippet below validates input, helps prevent
directory traversal, and then calls the streamTarArchive helper.
<?php
function downloadCiLogs(string $logDirIdentifier): void
{
    // Example: map an identifier to an actual path
    // In a real app, this might come from a config or database
    $logPathMappings = [
        'project-alpha-build-123' => '/var/logs/ci-cd/project-alpha/build-123',
        'project-beta-deploy-45' => '/var/logs/ci-cd/project-beta/deploy-45',
    ];
    if (!isset($logPathMappings[$logDirIdentifier])) {
        throw new InvalidArgumentException('Invalid log directory identifier.');
    }
    $actualLogDir = $logPathMappings[$logDirIdentifier];
    // Use realpath to resolve symbolic links and '..'
    $realDir = realpath($actualLogDir);
    $allowedRoot = realpath('/var/logs/ci-cd'); // Ensure this base path is secure
    if ($realDir === false || !is_dir($realDir)) {
        throw new InvalidArgumentException('Directory does not exist or is not accessible.');
    }
    // Check that the resolved path sits inside the allowed root. The trailing separator
    // keeps a sibling such as /var/logs/ci-cd-old from matching the prefix.
    if ($allowedRoot === false
        || strpos($realDir . DIRECTORY_SEPARATOR, $allowedRoot . DIRECTORY_SEPARATOR) !== 0) {
        throw new RuntimeException('Access denied to the specified directory.');
    }
    streamTarArchive($realDir, 'ci-logs-' . $logDirIdentifier . '-' . date('Y-m-d') . '.tar');
}
// Example usage in a controller action:
try {
    // Accept an identifier rather than a raw path, and validate its format before the lookup.
    $logIdentifier = $_GET['log_id'] ?? '';
    if (empty($logIdentifier) || !preg_match('/^[a-zA-Z0-9_-]+$/', $logIdentifier)) {
        throw new InvalidArgumentException('Invalid log identifier format.');
    }
    downloadCiLogs($logIdentifier);
} catch (InvalidArgumentException $e) {
    http_response_code(400); // Bad Request
    echo 'Error: ' . htmlspecialchars($e->getMessage(), ENT_QUOTES, 'UTF-8');
} catch (RuntimeException $e) {
    http_response_code(403); // Forbidden or 500 Internal Server Error depending on context
    echo 'Error: ' . htmlspecialchars($e->getMessage(), ENT_QUOTES, 'UTF-8');
} catch (Throwable $e) {
    http_response_code(500); // Internal Server Error
    error_log('Unhandled error in CI log download: ' . $e->getMessage()); // Log for admin
    echo 'An unexpected error occurred. Please try again later.';
}
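One detail of the allowed-root check deserves a closer look: a plain string-prefix comparison also accepts sibling directories that merely share the prefix, for example /var/logs/ci-cd-old when the root is /var/logs/ci-cd. A small sketch of a stricter comparison follows; isInsideRoot() is a hypothetical helper and it expects both paths to have been resolved with realpath() already.

```php
<?php
/**
 * Hypothetical helper: true when $resolvedPath is the root itself or lies below it.
 * Both arguments are expected to be realpath() results.
 */
function isInsideRoot(string $resolvedPath, string $resolvedRoot): bool
{
    $root = rtrim($resolvedRoot, DIRECTORY_SEPARATOR);
    if ($resolvedPath === $root) {
        return true;
    }
    // Compare against "root/" so that "root-old" cannot match.
    $prefix = $root . DIRECTORY_SEPARATOR;
    return strncmp($resolvedPath, $prefix, strlen($prefix)) === 0;
}

var_dump(isInsideRoot('/var/logs/ci-cd/project-alpha/build-123', '/var/logs/ci-cd'));
var_dump(isInsideRoot('/var/logs/ci-cd-old/build-1', '/var/logs/ci-cd'));
```

Comparing against the root plus a trailing separator is the whole trick; the rest of the validation stays as it is.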
Report progress with server-sent events
When archiving thousands of files, providing progress feedback can greatly improve user experience.
When the archive itself is written to stdout (-f -), tar -v (verbose mode) prints each filename to
stderr as it's processed, so we can stream these lines as Server-Sent Events (SSE).
<?php
function sseTarProgress(string $directory): void
{
    // Validate and secure the directory path as in previous examples
    $realDir = realpath($directory);
    $allowedRoot = realpath('/var/logs/ci-cd'); // Example allowed root
    if ($realDir === false || !is_dir($realDir)) {
        throw new InvalidArgumentException('Directory does not exist or is not accessible.');
    }
    // The trailing separator keeps a sibling such as /var/logs/ci-cd-old from matching
    if ($allowedRoot === false
        || strpos($realDir . DIRECTORY_SEPARATOR, $allowedRoot . DIRECTORY_SEPARATOR) !== 0) {
        throw new RuntimeException('Access denied to the specified directory for progress reporting.');
    }
    header('Content-Type: text/event-stream');
    header('Cache-Control: no-cache');
    header('Connection: keep-alive'); // Important for SSE
    // Use -v for verbose output (filenames to stderr)
    // Redirect tar's stdout to /dev/null as we only care about stderr for progress
    // and don't want the actual tar archive data blocking the stderr pipe.
    $cmd = 'tar -cvf - ' . escapeshellarg($realDir) . ' > /dev/null';
    $descriptorSpec = [
        0 => ["pipe", "r"], // stdin - not used by tar -c
        1 => ["pipe", "w"], // stdout - redirected to /dev/null in $cmd
        2 => ["pipe", "w"]  // stderr - where tar -v outputs filenames
    ];
    $pipes = [];
    $process = null;
    try {
        $process = proc_open($cmd, $descriptorSpec, $pipes);
        if (!is_resource($process)) {
            throw new RuntimeException('tar process failed to start for progress reporting.');
        }
        // Disable output buffering for SSE
        if (ob_get_level()) {
            ob_end_clean();
        }
        // Keep the script running even if the client disconnects, so the pipes get closed cleanly
        ignore_user_abort(true);
        set_time_limit(0); // Allow script to run indefinitely for long tar processes
        // Read from stderr for progress
        while (is_resource($pipes[2]) && !feof($pipes[2])) {
            $line = fgets($pipes[2]); // Read verbose output line by line
            if ($line === false || trim($line) === '') {
                // Check for keep-alive or if process ended
                if (!is_resource($process) || proc_get_status($process)['running'] === false) {
                    break;
                }
                // Send a comment to keep the connection alive if no output from tar
                echo ": keepalive\n\n";
                flush();
                usleep(250000); // Sleep for 250ms
                continue;
            }
            // A blank line terminates each SSE event
            echo 'data: ' . json_encode(['file' => trim($line)]) . "\n\n";
            flush(); // Flush data to the client
        }
        $stderrRemaining = stream_get_contents($pipes[2]); // Get any remaining stderr
        if (!empty(trim($stderrRemaining))) {
            // Could be an error message if tar failed after last file
            error_log("sseTarProgress remaining stderr: " . trim($stderrRemaining));
        }
    } catch (Throwable $e) {
        error_log('Error in sseTarProgress: ' . $e->getMessage());
        echo "event: error\n";
        echo 'data: ' . json_encode(['message' => 'An error occurred during progress reporting.']) . "\n\n";
        flush();
    } finally {
        if (isset($pipes[0]) && is_resource($pipes[0])) fclose($pipes[0]);
        if (isset($pipes[1]) && is_resource($pipes[1])) fclose($pipes[1]);
        if (isset($pipes[2]) && is_resource($pipes[2])) fclose($pipes[2]);
        if (is_resource($process)) {
            proc_close($process);
        }
        // Final event to signal completion or error to the client
        echo "event: complete\n";
        echo 'data: {"done":true}' . "\n\n";
        flush();
    }
}
The client consumes this stream via the JavaScript EventSource API and can display a live list of
processed files or update a progress bar.
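The wire format is easy to get subtly wrong, because each event must end with a blank line and the data field must not contain raw newlines. A small formatter can take care of both. This is a sketch only: formatSseEvent() is a hypothetical helper, and it relies on json_encode() escaping newlines inside the payload.

```php
<?php
/**
 * Hypothetical helper: build one Server-Sent Events frame.
 * The optional event name goes on its own line; the blank line ends the frame.
 */
function formatSseEvent(array $payload, ?string $event = null): string
{
    // Substitute invalid UTF-8 (possible in file names) instead of failing to encode.
    $json = json_encode($payload, JSON_UNESCAPED_SLASHES | JSON_INVALID_UTF8_SUBSTITUTE);
    $frame = $event !== null ? "event: {$event}\n" : '';
    return $frame . 'data: ' . $json . "\n\n";
}

echo formatSseEvent(['file' => 'project-alpha/build-123/step1.log']);
echo formatSseEvent(['done' => true], 'complete');
```

With a helper like this, the progress loop only has to echo the returned string and call flush().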
Harden proc_open() calls
Calling shell utilities from PHP invites command-injection bugs unless inputs are handled carefully. Keep these rules in mind:
- Validate and Sanitize Inputs: Always validate any user-provided data. For file paths, resolve them using realpath() to get the canonicalized absolute pathname and to check existence.
- Allowlist Directories: Maintain a strict allowlist of directories from which operations are permitted. Reject any paths not conforming to this list.
- Escape Shell Arguments: Escape every argument passed to shell commands using escapeshellarg() or escapeshellcmd() as appropriate. escapeshellarg() is generally preferred for individual arguments, because escapeshellcmd() escapes metacharacters across a whole command string and still lets input add extra arguments. Never concatenate raw input directly into a shell command string.
- Principle of Least Privilege: Run the PHP (and web server) process with the minimum necessary privileges. Avoid running as root. The user should only have read access to the directories being archived and execute permission for the tar utility.
- Monitor stderr and Exit Codes: Always check stderr for error messages from the external command and inspect its exit code (via proc_close() or proc_get_status()) to detect failures early. Log these errors for monitoring and debugging.
- Error Handling: Implement robust error handling around proc_open() calls to manage scenarios like the command not being found, failing to start, or exiting with an error.
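The exit-code and stderr rules above can be sketched as follows. The example makes a few assumptions that are not part of the helpers in this article: createTarFile() is a hypothetical function, it writes to a file instead of streaming, and it passes the command to proc_open() as an array, which PHP 7.4 and newer run without a shell so that no escaping is required.

```php
<?php
/**
 * Hypothetical helper: create a tar file and fail loudly on a non-zero exit code.
 * Assumes PHP 7.4+ (array command form) and a Unix-like host with tar on the PATH.
 */
function createTarFile(string $directory, string $archivePath): void
{
    $argv = ['tar', '-cf', $archivePath, '-C', $directory, '.'];
    $spec = [
        0 => ['file', '/dev/null', 'r'],
        1 => ['file', '/dev/null', 'w'],
        2 => ['pipe', 'w'], // only stderr is piped, so reading it to EOF cannot deadlock
    ];
    $process = proc_open($argv, $spec, $pipes);
    if (!is_resource($process)) {
        throw new RuntimeException('Could not start tar.');
    }
    $stderr = (string) stream_get_contents($pipes[2]);
    fclose($pipes[2]);
    $exitCode = proc_close($process);
    if ($exitCode !== 0) {
        throw new RuntimeException(sprintf('tar exited with %d: %s', $exitCode, trim($stderr)));
    }
}

// Demo with made-up paths: one run that succeeds and one that fails.
$dir = sys_get_temp_dir() . '/tar-exit-demo-' . uniqid();
mkdir($dir);
file_put_contents($dir . '/build.log', "step 1 ok\n");
$archive = $dir . '.tar';
createTarFile($dir, $archive);

$failed = false;
try {
    createTarFile($dir . '-missing', $archive . '.unused');
} catch (RuntimeException $e) {
    $failed = true;
    error_log($e->getMessage());
}
```

Turning a non-zero exit code into an exception keeps a truncated or empty archive from passing as a success.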
The SecureTarStreamer class example demonstrates a good approach by encapsulating path validation
and allowlist checks:
<?php
class SecureTarStreamer
{
    private array $allowedRoots;
    public function __construct(array $allowedRoots)
    {
        // Ensure allowedRoots are absolute and valid paths during construction
        $this->allowedRoots = array_map(function($root) {
            $realRoot = realpath($root);
            if ($realRoot === false || !is_dir($realRoot)) {
                throw new InvalidArgumentException("Invalid allowed root directory: {$root}");
            }
            return $realRoot;
        }, $allowedRoots);
    }
    public function send(string $userSuppliedDir, string $downloadName = 'archive.tar'): void
    {
        $path = realpath($userSuppliedDir); // Resolve the user-supplied path
        if ($path === false || !is_dir($path)) {
            throw new InvalidArgumentException('Invalid or non-existent directory specified.');
        }
        $isAllowed = false;
        foreach ($this->allowedRoots as $root) {
            // Accept the root itself or anything below it. The trailing separator keeps a
            // sibling such as /mnt/user_data_old from matching /mnt/user_data.
            // str_starts_with() requires PHP 8.0 or newer.
            $prefix = rtrim($root, DIRECTORY_SEPARATOR) . DIRECTORY_SEPARATOR;
            if ($path === $root || str_starts_with($path, $prefix)) {
                $isAllowed = true;
                break;
            }
        }
        if (!$isAllowed) {
            throw new RuntimeException('Access to the specified directory is not allowed.');
        }
        // Now it's safer to call the streaming function
        streamTarArchive($path, $downloadName);
    }
}
// Example Usage:
// $streamer = new SecureTarStreamer(['/var/www/safe_uploads', '/mnt/user_data']);
// $streamer->send($_GET['directory_to_archive']);
Compare approaches
Performance characteristics vary by use case:
- PharData: Optimal for archive manipulation (reading/writing individual files within an archive) when memory limits are not a concern or for smaller archives.
- proc_open + tar streaming: Best for creating large archives with minimal PHP memory usage, especially for write-once scenarios like backups or log archiving.
- Transloadit Robot: Ideal for scalable, production use where you want to offload the entire process. It offers automatic optimization, handles various formats (tar, zip, etc.), and integrates with cloud storage.
Choose the method that best matches your workload: random file access, on-premises streaming, or fully managed cloud compression.
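For the small-archive case, a minimal PharData sketch might look like the following. The temporary paths and the build.log file are made up for the demo, and the example assumes the phar extension is enabled, which is the default in most PHP builds.

```php
<?php
// Demo input: a throwaway directory with one small log file (hypothetical names).
$sourceDir = sys_get_temp_dir() . '/phardata-demo-' . uniqid();
mkdir($sourceDir);
file_put_contents($sourceDir . '/build.log', "step 1 ok\n");

// PharData expects a recognizable archive extension such as .tar.
$tarPath = $sourceDir . '.tar';
$archive = new PharData($tarPath);
$archive->buildFromDirectory($sourceDir);

// Entries can be read back individually, which is where PharData is convenient.
echo $archive['build.log']->getContent();
printf("peak memory: %.1f MB\n", memory_get_peak_usage(true) / 1048576);
```

Running the same measurement against a much larger directory is a simple way to find the point where streaming through tar becomes the better choice.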
What about interrupted downloads?
The tar format is sequential, so true byte-range resumes are not natively supported in a simple HTTP download without third-party tools or server-side logic that can index the stream. In practice, clients might not always gracefully handle interruptions of multi-gigabyte downloads. If resuming is critical for your application:
- Split Data: Consider splitting the data into multiple smaller tar files. Clients can then download these individually, and a failure only affects one part.
- Resumable Transport Protocols: Use a resumable transport mechanism. After the tar archive is produced (even if streamed to a temporary local file first, or directly piped), you could then serve it or upload it using protocols like tus.io or leverage features like S3 multipart uploads if the destination is cloud storage.
- Let Transloadit Handle It: Transloadit's platform is designed for robust file processing and delivery, inherently managing many complexities of large file transfers.
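For the splitting option, the main decision is which files go into which part. One simple approach, sketched below, fills each part in order until a size budget would be exceeded. planTarParts() is a hypothetical helper, the file names and sizes are invented, and a single file larger than the budget still gets a part of its own.

```php
<?php
/**
 * Hypothetical helper: group files into parts of at most $maxBytesPerPart each.
 * $sizesByPath maps a file path to its size in bytes.
 */
function planTarParts(array $sizesByPath, int $maxBytesPerPart): array
{
    $parts = [];
    $current = [];
    $currentBytes = 0;
    foreach ($sizesByPath as $path => $bytes) {
        // Start a new part when adding this file would exceed the budget.
        if ($current !== [] && $currentBytes + $bytes > $maxBytesPerPart) {
            $parts[] = $current;
            $current = [];
            $currentBytes = 0;
        }
        $current[] = $path;
        $currentBytes += $bytes;
    }
    if ($current !== []) {
        $parts[] = $current;
    }
    return $parts;
}

$plan = planTarParts(['a.log' => 40, 'b.log' => 50, 'c.log' => 30, 'd.log' => 200], 100);
print_r($plan);
```

Each planned part can then be archived and downloaded separately, so an interrupted transfer only costs one part.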
Wrap-up
Using proc_open() in combination with the system's tar utility is a simple yet powerful pattern
for creating large archives in PHP with a near-zero memory footprint within the PHP process itself.
This technique is particularly effective for tasks like archiving PHP CI/CD logs, creating nightly
backups, or handling any large dataset that needs to be written once and streamed efficiently. It's
a great way to achieve tar without memory limits in PHP.
Need something even more hands-off? Transloadit’s 🤖 /file/compress Robot can create .tar
(optionally gzipped) archives for you. A minimal Assembly looks like this:
{
  "steps": {
    "compressed": {
      "robot": "/file/compress",
      "use": ":original",
      "format": "tar",
      "gzip": true
    }
  }
}
The Robot supports both tar and zip formats, with optional gzip compression. Give it a try and let
our infrastructure worry about memory limits, concurrency, and edge-case handling while you focus
on building features.