Skip to content

Beyond upload_file(): How to Handle S3 Like a True Engineer

Most developers treat Amazon S3 like a magic folder. They call s3.upload_file(), hope for the best, and move on. But when you’re building high-throughput systems, “hoping for the best” leads to OOM (Out of Memory) kills and corrupted data.

After digging deep into the AWS documentation and battle-testing my FastAPI services, I’ve refined a strategy for S3 that prioritizes memory efficiency, data integrity, and non-blocking I/O.

1. The Memory Problem: Streaming in Chunks

Loading a 2GB file into RAM to upload it is a beginner’s mistake. A true engineer streams data. By using a chunked approach, your memory usage stays flat whether the file is 10MB or 10GB.

2. Integrity: Server-Side MD5 Verification

How do you know the bytes that left your NIC are exactly what arrived at Amazon? You don’t-unless you use the Content-MD5 header. By calculating the MD5 hash locally and sending it with the request, S3 will reject the upload if even a single bit is flipped during transit.

3. The Async Bottleneck: Don’t Block the Loop

boto3 is a synchronous library. If you call it directly inside an async def FastAPI route, you stop the entire event loop. To stay asynchronous, you must offload the work to a ThreadPoolExecutor or use a native async library like aioboto3.

4. Latency, Jitter, and Metadata

Network calls fail. Engineers implement Exponential Backoff + Jitter to prevent crashing services during recovery. Additionally, we use Metadata to store context (like uploader IDs), saving us from expensive database lookups later.

The “Engineer’s Choice” Implementation

Here is a robust pseudo code, asynchronous wrapper for the standard boto3 client that ensures your FastAPI server stays responsive while maintaining data integrity.

# --- THE ENGINEER'S BLUEPRINT (PSEUDO-CODE) ---

async def upload_file_engineered(path, bucket, key):
    # 1. SETUP: Configure standard retries with jitter + timeouts
    s3_client = initialize_client(retries=5, backoff="exponential_with_jitter")

    # 2. INTEGRITY: Stream file through MD5 to keep RAM footprint low
    # We do this before uploading to ensure we know exactly what we are sending
    md5_hash = new_hasher("md5")
    with open(path, "rb") as stream:
        while chunk := stream.read(4096):
            md5_hash.update(chunk)

    encoded_md5 = base64_encode(md5_hash.digest())

    # 3. ASYNC WRAPPER: S3 calls are usually blocking (sync)
    # To prevent freezing the main Event Loop, we offload to a thread
    async with open(path, "rb") as file_data:
        upload_task = lambda: s3_client.put_object(
            target=bucket,
            name=key,
            payload=file_data,       # Streams directly from disk
            checksum=encoded_md5,    # S3 will verify this on their end
            metadata={"user": "kfir"} # Meaningful context
        )

        try:
            # 4. EXECUTION: Run the sync task in a non-blocking thread pool
            result = await run_in_background_thread(upload_task)
            return result

        except IntegrityError:
            # 5. ERROR HANDLING: Catch if a single bit flipped during transit
            log_critical("Data corruption detected! Local MD5 != S3 MD5")
            raise

Conclusion

Reading the documentation isn’t just about finding the right function; it’s about understanding the contract between your code and the infrastructure.

By handling chunks, verifying hashes, and managing the event loop correctly, you aren’t just moving files-you’re building a resilient system that won’t fail when the load gets heavy.

Next time you hit S3, ask yourself: Is this just an upload, or is this engineering?


References:

https://docs.aws.amazon.com/boto3/latest/reference/services/s3/bucket/put_object.html
https://docs.aws.amazon.com/AmazonS3/latest/userguide/checking-object-integrity.html
https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimizing-performance.html