AWS S3: Beyond Simple Storage

For years, S3 was where files went to die.

Upload a PDF, store the key in your database, serve it back when someone clicks. That’s it. That’s the whole mental model most teams had in 2016 — and honestly, it’s the mental model that still costs people money today.

Then we started treating S3 as infrastructure instead of a folder, and everything changed. Lifecycle policies cut our storage bill. Event notifications replaced cron jobs that polled for new uploads. Presigned URLs let browsers upload directly without our servers becoming a bandwidth bottleneck. CloudFront made assets load fast enough that we stopped apologizing for them in demos.

After moving petabytes through S3, we kept coming back to the patterns below: the ones that turned “file storage” into a platform.

Fundamentals That Save You Later

Name Buckets Like You’ll Have to Explain Them in an Audit

# Good naming conventions
company-app-production-assets
company-app-staging-logs
company-app-backups-2016

# Avoid
my-bucket
test123
prod

Bucket names are global. “prod” is taken. “my-bucket” is taken by someone who had the same idea in 2012. Include environment, purpose, and something unique to your org.
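A convention only sticks if something enforces it. Here is a minimal sketch of a name builder, assuming an org-app-environment-purpose scheme like the one above; build_bucket_name is a hypothetical helper, and the regex is a simplified version of the S3 naming rules (3 to 63 characters, lowercase letters, digits, and hyphens), not the full specification:

```python
import re

# Simplified S3 naming rules: 3-63 characters, lowercase letters, digits and
# hyphens, starting and ending with a letter or digit. Dots are legal in S3
# but deliberately rejected here to keep names simple.
_BUCKET_NAME = re.compile(r"[a-z0-9][a-z0-9-]{1,61}[a-z0-9]")


def build_bucket_name(org: str, app: str, env: str, purpose: str) -> str:
    """Join the parts as org-app-env-purpose and validate the result."""
    name = "-".join(part.strip().lower() for part in (org, app, env, purpose))
    if not _BUCKET_NAME.fullmatch(name):
        raise ValueError(f"invalid bucket name: {name!r}")
    return name
```

Calling build_bucket_name("company", "app", "production", "assets") yields the first name in the list above; anything with underscores or other stray characters raises before it ever reaches AWS.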

import boto3

# Pin the client region so it matches the LocationConstraint below
s3 = boto3.client('s3', region_name='us-west-2')

# Create bucket with proper configuration
s3.create_bucket(
    Bucket='mycompany-prod-assets',
    CreateBucketConfiguration={'LocationConstraint': 'us-west-2'}
)

# Enable versioning
s3.put_bucket_versioning(
    Bucket='mycompany-prod-assets',
    VersioningConfiguration={'Status': 'Enabled'}
)

# Enable encryption
s3.put_bucket_encryption(
    Bucket='mycompany-prod-assets',
    ServerSideEncryptionConfiguration={
        'Rules': [{
            'ApplyServerSideEncryptionByDefault': {
                'SSEAlgorithm': 'AES256'
            }
        }]
    }
)

Versioning and encryption at creation time. Retrofitting encryption after you’ve stored sensitive data is the kind of project that gets deprioritized forever.

Object Keys: Organization That Scales

S3 doesn’t have folders. It has key prefixes that look like folders. Design keys for query patterns and lifecycle rules:

# Good - enables S3 prefix optimization
uploads/2016/06/15/user-123/avatar.jpg
logs/production/2016-06-15/app-server-01.log
backups/database/daily/2016-06-15-db-snapshot.sql.gz

# Bad - all objects in same prefix
user-123-avatar.jpg
app-server-01-2016-06-15.log

Top-level prefixes let lifecycle policies target logs/ without touching uploads/, and the date segments make it cheap to list or delete a single day. User IDs in paths make per-user cleanup possible. Flat namespaces work until you have millions of objects and need to partition anything.
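Key layouts drift unless one function owns them. A small sketch that produces the upload layout shown above; build_upload_key is a hypothetical helper, and the uploads/YYYY/MM/DD/user-ID/filename scheme is simply the one from the example:

```python
import posixpath
from datetime import date


def build_upload_key(user_id: int, filename: str, day: date) -> str:
    """Return a key such as uploads/2016/06/15/user-123/avatar.jpg."""
    # Keep only the final path component so a client-supplied name
    # cannot inject extra "folders" into the key.
    safe_name = posixpath.basename(filename)
    if not safe_name:
        raise ValueError("filename has no usable name component")
    return f"uploads/{day:%Y/%m/%d}/user-{user_id}/{safe_name}"
```

Because every writer goes through the same function, a later change to the layout is a one-line edit instead of a search across services.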

Lifecycle Policies: Set It and Forget It (Your Finance Team Will Thank You)

Nobody deletes old logs manually. Nobody remembers to clean up failed multipart uploads. Lifecycle policies do both while you sleep:

{
  "Rules": [
    {
      "ID": "Move old logs to Glacier",
      "Status": "Enabled",
      "Prefix": "logs/",
      "Transitions": [
        {
          "Days": 30,
          "StorageClass": "STANDARD_IA"
        },
        {
          "Days": 90,
          "StorageClass": "GLACIER"
        }
      ],
      "Expiration": {
        "Days": 365
      }
    },
    {
      "ID": "Clean up incomplete multipart uploads",
      "Status": "Enabled",
      "Prefix": "",
      "AbortIncompleteMultipartUpload": {
        "DaysAfterInitiation": 7
      }
    },
    {
      "ID": "Delete old versions",
      "Status": "Enabled",
      "Prefix": "",
      "NoncurrentVersionExpiration": {
        "NoncurrentDays": 30
      }
    }
  ]
}

aws s3api put-bucket-lifecycle-configuration \
    --bucket mycompany-prod-assets \
    --lifecycle-configuration file://lifecycle.json

That multipart cleanup rule alone saved us from a slow leak of orphaned upload parts — invisible until the bill arrives.
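Before applying a rule to a production bucket, it can help to model it locally. A rough sketch, assuming the 30/90/365-day thresholds from the logs/ rule above; storage_class_for_age is a hypothetical helper, not an AWS API, and it ignores the fact that S3 runs lifecycle actions asynchronously, so real transitions can lag the thresholds:

```python
from typing import Optional


def storage_class_for_age(age_days: int) -> Optional[str]:
    """Storage class an object under logs/ should be in, or None if expired."""
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    if age_days >= 365:   # Expiration.Days
        return None
    if age_days >= 90:    # second transition
        return "GLACIER"
    if age_days >= 30:    # first transition
        return "STANDARD_IA"
    return "STANDARD"
```

Feeding a sample of real object ages through a function like this gives a quick estimate of how much data each tier would hold once the policy has caught up.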

Event Notifications: S3 as a Trigger, Not a Destination

Upload happens → something else runs. No polling. No “check S3 every five minutes” cron job that misses the window and doubles up.

{
  "LambdaFunctionConfigurations": [
    {
      "LambdaFunctionArn": "arn:aws:lambda:us-west-2:123456789:function:ProcessImage",
      "Events": ["s3:ObjectCreated:*"],
      "Filter": {
        "Key": {
          "FilterRules": [
            {
              "Name": "prefix",
              "Value": "uploads/images/"
            },
            {
              "Name": "suffix",
              "Value": ".jpg"
            }
          ]
        }
      }
    }
  ]
}

import io
from urllib.parse import unquote_plus

import boto3
from PIL import Image

s3 = boto3.client('s3')

def lambda_handler(event, context):
    processed = []

    # One invocation can carry more than one record
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        # Keys arrive URL-encoded (a space becomes '+'), so decode first
        key = unquote_plus(record['s3']['object']['key'])

        # Download image
        response = s3.get_object(Bucket=bucket, Key=key)
        image = Image.open(io.BytesIO(response['Body'].read()))

        # Create thumbnail (resizes in place, keeps aspect ratio)
        image.thumbnail((200, 200))

        # Save thumbnail to an in-memory buffer
        buffer = io.BytesIO()
        image.convert('RGB').save(buffer, 'JPEG')
        buffer.seek(0)

        # Upload under thumbnails/, outside the uploads/images/ trigger
        # prefix, so the new object does not re-invoke this function
        thumbnail_key = key.replace('uploads/', 'thumbnails/', 1)
        s3.put_object(
            Bucket=bucket,
            Key=thumbnail_key,
            Body=buffer,
            ContentType='image/jpeg'
        )
        processed.append(key)

    return {
        'statusCode': 200,
        'body': f'Processed {processed}'
    }

User uploads avatar → Lambda generates thumbnail → thumbnail appears before they refresh. The UX improvement is free once the plumbing exists.
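One gotcha with the filter rules above: as far as we can tell, S3 compares the prefix and suffix as literal, case-sensitive strings, so photo.JPG and photo.jpeg never reach the function. A small sketch for checking keys locally under that assumption; matches_filter is a hypothetical helper that mirrors the rule, not part of boto3:

```python
def matches_filter(key: str, prefix: str = "", suffix: str = "") -> bool:
    """Return True if a key would pass a prefix/suffix notification filter.

    Assumes literal, case-sensitive comparison of both rules.
    """
    return key.startswith(prefix) and key.endswith(suffix)


# Keys to sanity-check against the configuration shown earlier
for candidate in ("uploads/images/cat.jpg", "uploads/images/cat.JPG"):
    print(candidate, matches_filter(candidate, "uploads/images/", ".jpg"))
```

If uploads with other extensions matter, the usual options are one notification configuration per suffix, or dropping the suffix rule and filtering inside the handler.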

Direct Browser Uploads: Get Your Servers Out of the Middle

Routing every upload through your app server works until someone uploads a 200MB video and your autoscaling group wakes up confused.

Presigned URLs let the browser talk directly to S3:

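# Backend API endpoint
import uuid

import boto3
from flask import Flask, g, jsonify, request
from werkzeug.utils import secure_filename

app = Flask(__name__)
s3 = boto3.client('s3')

@app.route('/api/upload-url', methods=['POST'])
def generate_upload_url():
    data = request.get_json()
    # Strip path separators and other unsafe characters from the client's name
    filename = secure_filename(data['filename'])
    content_type = data['contentType']

    # Assumption: your auth middleware has already stored the caller's ID
    # on flask.g; swap in however your app identifies the current user
    user_id = g.user_id

    # Generate unique key
    key = f"uploads/{user_id}/{uuid.uuid4()}/{filename}"

    # Generate presigned URL (valid for 5 minutes); the browser must send
    # the same Content-Type header that is signed here
    presigned_url = s3.generate_presigned_url(
        'put_object',
        Params={
            'Bucket': 'mycompany-prod-assets',
            'Key': key,
            'ContentType': content_type
        },
        ExpiresIn=300
    )

    return jsonify({
        'uploadUrl': presigned_url,
        'key': key
    })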

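async function uploadFile(file) {
    // Get presigned URL from backend
    const response = await fetch('/api/upload-url', {
        method: 'POST',
        headers: {'Content-Type': 'application/json'},
        body: JSON.stringify({
            filename: file.name,
            contentType: file.type
        })
    });
    if (!response.ok) {
        throw new Error(`Could not get upload URL: ${response.status}`);
    }

    const {uploadUrl, key} = await response.json();

    // Upload directly to S3; Content-Type must match what the backend signed
    const upload = await fetch(uploadUrl, {
        method: 'PUT',
        body: file,
        headers: {
            'Content-Type': file.type
        }
    });
    // fetch only rejects on network failure, so check the status explicitly
    if (!upload.ok) {
        throw new Error(`Upload failed: ${upload.status}`);
    }

    return key;
}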

// Usage
document.getElementById('fileInput').addEventListener('change', async (e) => {
    const file = e.target.files[0];
    const key = await uploadFile(file);
    console.log('Uploaded to:', key);
});

Your server generates the URL (authenticated, validated) and stores the key. S3 handles the bytes. Everyone’s happier, especially your bandwidth bill.
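The “validated” part deserves its own function, because the presigned URL will happily sign whatever the client asked for. A minimal sketch, assuming a small allowlist of content types and a conservative filename policy; ALLOWED_CONTENT_TYPES, sanitize_filename, and validate_upload_request are hypothetical names, and the allowlist is an example rather than a recommendation:

```python
import posixpath
import re

# Hypothetical allowlist; adjust to what your product actually accepts
ALLOWED_CONTENT_TYPES = {"image/jpeg", "image/png", "application/pdf"}


def sanitize_filename(filename: str, max_length: int = 100) -> str:
    """Reduce a client-supplied name to a safe final path component."""
    # Treat Windows separators like POSIX ones, then keep the last component
    name = posixpath.basename(filename.replace("\\", "/"))
    # Replace anything outside a conservative character set
    name = re.sub(r"[^A-Za-z0-9._-]", "_", name)
    if not name.strip("._"):
        raise ValueError("filename is empty after sanitizing")
    return name[:max_length]


def validate_upload_request(filename: str, content_type: str) -> str:
    """Return the sanitized filename, or raise ValueError to reject."""
    if content_type not in ALLOWED_CONTENT_TYPES:
        raise ValueError(f"content type not allowed: {content_type!r}")
    return sanitize_filename(filename)
```

Running this before generate_presigned_url means a rejected request never gets a URL at all, which is cheaper than cleaning up unwanted objects afterwards.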

Static Website Hosting: S3 as a CDN Origin

For static sites, S3 website hosting is dead simple:

# Enable website hosting
aws s3 website s3://mycompany-website \
    --index-document index.html \
    --error-document error.html

# Set bucket policy for public read
aws s3api put-bucket-policy \
    --bucket mycompany-website \
    --policy '{
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "PublicReadGetObject",
            "Effect": "Allow",
            "Principal": "*",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::mycompany-website/*"
        }]
    }'

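# Deploy fingerprinted assets with a long cache lifetime
aws s3 sync ./dist s3://mycompany-website \
    --delete \
    --exclude "*.html" \
    --cache-control "max-age=31536000"

# HTML entry points revalidate on every request so deploys show up right away
aws s3 sync ./dist s3://mycompany-website \
    --exclude "*" \
    --include "*.html" \
    --cache-control "no-cache"

aws s3 sync --delete is your deploy command. A year-long max-age is only safe for assets whose filenames change with their content; give HTML a short lifetime or no-cache, or browsers will keep serving the old page until someone hard-refreshes in a demo.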
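If the deploy runs from a script rather than the CLI, the header can be chosen per file. A small sketch, assuming build output where long-lived assets carry a hex fingerprint in the name (app.a1b2c3.js) and everything else should revalidate; cache_control_for and the regex are hypothetical and only a heuristic:

```python
import re

# Heuristic: ".<6+ hex chars>.<extension>" at the end of the path
_FINGERPRINTED = re.compile(r"\.[0-9a-f]{6,}\.[A-Za-z0-9]+$")


def cache_control_for(path: str) -> str:
    """Pick a Cache-Control value for one file in the build output."""
    if _FINGERPRINTED.search(path):
        # The name changes whenever the content does, so cache "forever"
        return "public, max-age=31536000, immutable"
    # HTML and other stable names must revalidate on each request
    return "no-cache"
```

The returned string can be passed as the CacheControl argument when uploading each object.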

CloudFront: Because Virginia Is Far From Everyone

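S3 buckets live in a region. Your users don’t. CloudFront caches at the edge. The config below is abbreviated: the real API also wants a CallerReference and wraps lists such as Origins and Aliases in Quantity/Items objects: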

{
  "DistributionConfig": {
    "Origins": [{
      "Id": "S3-mycompany-prod-assets",
      "DomainName": "mycompany-prod-assets.s3.amazonaws.com",
      "S3OriginConfig": {
        "OriginAccessIdentity": "origin-access-identity/cloudfront/ABCDEFG"
      }
    }],
    "DefaultCacheBehavior": {
      "TargetOriginId": "S3-mycompany-prod-assets",
      "ViewerProtocolPolicy": "redirect-to-https",
      "AllowedMethods": ["GET", "HEAD"],
      "CachedMethods": ["GET", "HEAD"],
      "ForwardedValues": {
        "QueryString": false,
        "Cookies": {"Forward": "none"}
      },
      "MinTTL": 0,
      "DefaultTTL": 86400,
      "MaxTTL": 31536000
    },
    "Enabled": true,
    "Comment": "CDN for S3 assets",
    "Aliases": ["assets.mycompany.com"],
    "ViewerCertificate": {
      "ACMCertificateArn": "arn:aws:acm:us-east-1:123456789:certificate/abc",
      "SSLSupportMethod": "sni-only",
      "MinimumProtocolVersion": "TLSv1.2_2021"
    }
  }
}

When you deploy new assets, invalidate the cache, or accept that some users see old CSS for up to a day (the 86400-second DefaultTTL above):

import time

import boto3

cloudfront = boto3.client('cloudfront')

def invalidate_cache(distribution_id, paths):
    """Invalidate CloudFront cache for specific paths"""
    cloudfront.create_invalidation(
        DistributionId=distribution_id,
        InvalidationBatch={
            'Paths': {
                'Quantity': len(paths),
                'Items': paths
            },
            'CallerReference': str(time.time())
        }
    )

# Usage
invalidate_cache('E1234ABCD', ['/images/*', '/css/*'])

Invalidations cost money at scale. Versioned asset filenames (app.a1b2c3.js) are cheaper than wildcard invalidations on every deploy.
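Generating those versioned names is a few lines if your build tool does not already do it. A minimal sketch, assuming a truncated SHA-256 of the file contents as the fingerprint; fingerprint_name is a hypothetical helper and the eight-character length is an arbitrary choice:

```python
import hashlib
from pathlib import PurePosixPath


def fingerprint_name(filename: str, content: bytes, length: int = 8) -> str:
    """Insert a content hash before the extension: app.js -> app.<hash>.js."""
    digest = hashlib.sha256(content).hexdigest()[:length]
    path = PurePosixPath(filename)
    # with_name keeps any directory part and swaps only the final component
    return str(path.with_name(f"{path.stem}.{digest}{path.suffix}"))
```

Because the name changes only when the bytes change, unchanged assets stay cached across deploys and changed ones get a fresh URL with no invalidation at all.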

Multipart Upload: Large Files Without Large Headaches

import os
import threading

import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client('s3')

MB = 1024 * 1024

# Progress callback (defined before use; boto3 calls it from worker threads)
class ProgressPercentage:
    def __init__(self, filename):
        self._filename = filename
        self._size = float(os.path.getsize(filename))
        self._seen_so_far = 0
        self._lock = threading.Lock()

    def __call__(self, bytes_amount):
        with self._lock:
            self._seen_so_far += bytes_amount
            percentage = (self._seen_so_far / self._size) * 100
            print(f"\r{self._filename} {percentage:.2f}% complete", end='', flush=True)

# Configure multipart threshold and chunk size (both are in bytes)
config = TransferConfig(
    multipart_threshold=25 * MB,
    max_concurrency=10,
    multipart_chunksize=25 * MB,
    use_threads=True
)

# Upload large file
s3.upload_file(
    'large-file.zip',
    'mycompany-prod-assets',
    'uploads/large-file.zip',
    Config=config,
    Callback=ProgressPercentage('large-file.zip')
)

Boto3 handles multipart automatically above the threshold. You just configure chunk size and concurrency. Pair this with the lifecycle rule that aborts stale multipart uploads.
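Chunk size is not a free choice: S3 caps a multipart upload at 10,000 parts and requires every part except the last to be at least 5 MiB. A quick sketch for checking a chunk size against a file size before starting; plan_parts is a hypothetical helper, not part of boto3:

```python
import math

MIB = 1024 * 1024
MIN_PART_SIZE = 5 * MIB   # S3 minimum for every part except the last
MAX_PARTS = 10_000        # S3 maximum number of parts per upload


def plan_parts(file_size: int, chunk_size: int) -> int:
    """Return how many parts an upload needs, or raise if S3 would refuse it."""
    if chunk_size < MIN_PART_SIZE:
        raise ValueError("chunk size is below the 5 MiB minimum")
    parts = max(1, math.ceil(file_size / chunk_size))
    if parts > MAX_PARTS:
        raise ValueError(f"{parts} parts exceeds the {MAX_PARTS}-part limit")
    return parts
```

With 25 MiB chunks the part limit is reached at roughly 244 GiB, so very large files need a larger multipart_chunksize.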

Cross-Region Replication: When One Region Isn’t Enough

{
  "Role": "arn:aws:iam::123456789:role/s3-replication-role",
  "Rules": [{
    "Status": "Enabled",
    "Priority": 1,
    "Filter": {"Prefix": ""},
    "Destination": {
      "Bucket": "arn:aws:s3:::mycompany-backup-eu-west-1",
      "ReplicationTime": {
        "Status": "Enabled",
        "Time": {"Minutes": 15}
      },
      "Metrics": {
        "Status": "Enabled",
        "EventThreshold": {"Minutes": 15}
      }
    },
    "DeleteMarkerReplication": {"Status": "Enabled"}
  }]
}

Replication isn’t backup — deleted objects replicate too if you enable delete marker replication. Understand what you’re protecting against: regional outage, not accidental deletion.

Security: Public Buckets Are a Career Event

Bucket Policies That Actually Restrict

{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Deny",
    "Principal": "*",
    "Action": "s3:*",
    "Resource": [
      "arn:aws:s3:::mycompany-prod-assets",
      "arn:aws:s3:::mycompany-prod-assets/*"
    ],
    "Condition": {
      "NotIpAddress": {
        "aws:SourceIp": [
          "203.0.113.0/24",
          "198.51.100.0/24"
        ]
      }
    }
  }]
}

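This is an explicit Deny for every request from outside the IP allowlist, which suits internal buckets. An explicit Deny overrides any Allow, so it also blocks CloudFront and other AWS services that reach the bucket from different addresses. Public assets go through CloudFront with OAI, not wide-open bucket policies.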
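CIDR mistakes in a Deny policy lock people out, so it is worth testing the allowlist offline first. A small sketch using the two ranges from the policy above; is_denied is a hypothetical helper that mimics only the NotIpAddress condition, not full IAM policy evaluation:

```python
import ipaddress

# The ranges from the aws:SourceIp condition in the policy
ALLOWED_CIDRS = ["203.0.113.0/24", "198.51.100.0/24"]


def is_denied(source_ip: str, allowed_cidrs=ALLOWED_CIDRS) -> bool:
    """True if the Deny statement would apply to a request from source_ip."""
    address = ipaddress.ip_address(source_ip)
    networks = [ipaddress.ip_network(cidr) for cidr in allowed_cidrs]
    # NotIpAddress matches when the address is in none of the listed ranges
    return not any(address in network for network in networks)
```

Running your office, VPN, and CI egress addresses through a check like this before applying the policy is cheaper than recovering access afterwards.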

Least-Privilege IAM

{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": [
      "s3:GetObject",
      "s3:PutObject",
      "s3:DeleteObject"
    ],
    "Resource": "arn:aws:s3:::mycompany-prod-assets/uploads/${aws:username}/*"
  }]
}

Scope uploads to per-user prefixes. The ${aws:username} variable resolves to the caller’s IAM user name, so users can’t overwrite each other’s files even with a valid credential.
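To see what the variable buys you, here is a rough model of the check for the object-key part of the Resource above. key_allowed is a hypothetical helper and a deliberate simplification: it only handles a single trailing wildcard, not general IAM pattern matching:

```python
RESOURCE_TEMPLATE = "uploads/${aws:username}/*"


def key_allowed(username: str, key: str) -> bool:
    """True if the key falls under the caller's own uploads/ prefix."""
    pattern = RESOURCE_TEMPLATE.replace("${aws:username}", username)
    # A trailing * matches any remaining characters, including slashes
    prefix = pattern.rstrip("*")
    return key.startswith(prefix)
```

Note that the trailing slash in the template matters: without it, a user named al would match keys under uploads/alice/.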

KMS Encryption for Sensitive Data

s3.put_object(
    Bucket='mycompany-prod-assets',
    Key='sensitive-data.txt',
    Body=b'secret information',
    ServerSideEncryption='aws:kms',
    SSEKMSKeyId='arn:aws:kms:us-west-2:123456789:key/abc-123'
)

AES-256 default encryption is fine for most assets. KMS adds key rotation, audit trails, and granular access control for data that would ruin your week if it leaked.

Cost Optimization: The Bill Is the Architecture Review

Storage Class Transitions

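from datetime import datetime, timedelta, timezone

import boto3

# Archive old logs to Glacier
def archive_old_logs():
    s3 = boto3.resource('s3')
    bucket = s3.Bucket('mycompany-logs')

    # last_modified is timezone-aware, so the cutoff has to be as well
    cutoff_date = datetime.now(timezone.utc) - timedelta(days=90)

    for obj in bucket.objects.filter(Prefix='logs/'):
        # Skip objects that have already been archived
        if obj.storage_class == 'GLACIER':
            continue
        if obj.last_modified < cutoff_date:
            # Copying an object onto itself rewrites it in the new class
            obj.copy_from(
                CopySource={'Bucket': bucket.name, 'Key': obj.key},
                StorageClass='GLACIER',
                MetadataDirective='COPY'
            )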

Lifecycle policies automate this. Manual scripts are for one-off migrations or buckets you inherited from someone who left.

Intelligent Tiering

aws s3api put-bucket-intelligent-tiering-configuration \
    --bucket mycompany-prod-assets \
    --id EntireBucket \
    --intelligent-tiering-configuration '{
        "Id": "EntireBucket",
        "Status": "Enabled",
        "Tierings": [
            {
                "Days": 90,
                "AccessTier": "ARCHIVE_ACCESS"
            },
            {
                "Days": 180,
                "AccessTier": "DEEP_ARCHIVE_ACCESS"
            }
        ]
    }'

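For data with unpredictable access patterns, intelligent tiering beats guessing which storage class to pick upfront. This configuration only adds the optional archive tiers; objects still have to be stored in the INTELLIGENT_TIERING storage class, at upload or through a lifecycle transition, for it to apply.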

Monitoring: S3 Is Silent Until It Isn’t

Access Logging

s3.put_bucket_logging(
    Bucket='mycompany-prod-assets',
    BucketLoggingStatus={
        'LoggingEnabled': {
            'TargetBucket': 'mycompany-logs',
            'TargetPrefix': 's3-access-logs/'
        }
    }
)

Access logs go to another bucket. Yes, that bucket also costs money. So does not knowing who downloaded what.

CloudWatch Metrics

from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client('cloudwatch')

def get_s3_metrics(bucket_name):
    now = datetime.now(timezone.utc)
    response = cloudwatch.get_metric_statistics(
        Namespace='AWS/S3',
        MetricName='NumberOfObjects',
        Dimensions=[
            {'Name': 'BucketName', 'Value': bucket_name},
            {'Name': 'StorageType', 'Value': 'AllStorageTypes'}
        ],
        # Storage metrics are published once a day, so look back a few days
        StartTime=now - timedelta(days=3),
        EndTime=now,
        Period=86400,
        Statistics=['Average']
    )

    return response['Datapoints']

Object count and total size trends catch the bucket that’s quietly growing because someone enabled debug logging to S3 and forgot.
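Raw datapoints are not a trend until something compares them. A minimal sketch, assuming a list shaped like the Datapoints that get_metric_statistics returns (dicts with Timestamp and Average); growth_percent is a hypothetical helper and the alert threshold is left to you:

```python
from typing import Optional


def growth_percent(datapoints: list) -> Optional[float]:
    """Percent change between the oldest and newest datapoint.

    Returns None when there is not enough data to compare.
    """
    if len(datapoints) < 2:
        return None
    # CloudWatch does not guarantee ordering, so sort by timestamp first
    ordered = sorted(datapoints, key=lambda point: point["Timestamp"])
    first, last = ordered[0]["Average"], ordered[-1]["Average"]
    if first == 0:
        return None
    return (last - first) / first * 100
```

A daily job that flags buckets growing faster than expected is usually enough to catch the forgotten debug-logging case.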

Advanced Patterns: Where S3 Gets Interesting

S3 + SQS: Reliable Event Processing

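Asynchronous Lambda invocations are retried only a couple of times. If the function keeps timing out or failing, the event is dropped unless you configured a dead-letter queue. SQS in the middle adds durability: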

{
  "QueueConfigurations": [{
    "QueueArn": "arn:aws:sqs:us-west-2:123456789:process-uploads",
    "Events": ["s3:ObjectCreated:*"],
    "Filter": {
      "Key": {
        "FilterRules": [{
          "Name": "prefix",
          "Value": "uploads/"
        }]
      }
    }
  }]
}

Upload → S3 event → SQS → worker processes at its own pace. Backpressure handled. Retries built in.
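On the worker side, each SQS message body is the S3 event as a JSON string. A small sketch of the parsing step, assuming the standard notification shape (a Records list with s3.bucket.name and s3.object.key); extract_s3_objects is a hypothetical helper, and the actual queue polling is left out:

```python
import json
from urllib.parse import unquote_plus


def extract_s3_objects(message_body: str) -> list:
    """Return (bucket, key) pairs from one SQS message body."""
    event = json.loads(message_body)
    objects = []
    # S3 sends an s3:TestEvent without Records when the notification is
    # first configured; treating Records as optional skips it cleanly.
    for record in event.get("Records", []):
        s3_info = record["s3"]
        bucket = s3_info["bucket"]["name"]
        # Keys are URL-encoded in the event (a space becomes '+')
        key = unquote_plus(s3_info["object"]["key"])
        objects.append((bucket, key))
    return objects
```

The worker should delete the message only after every pair has been processed; anything that fails becomes visible again after the visibility timeout and is retried.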

Data Lake Layout

Organize for analytics tools that partition by path:

s3://data-lake/
├── raw/
│   ├── year=2016/
│   │   ├── month=06/
│   │   │   ├── day=15/
│   │   │   │   └── data.parquet
├── processed/
│   ├── users/
│   │   └── year=2016/month=06/day=15/
├── analytics/
│   └── reports/
│       └── daily-summary-2016-06-15.csv

Hive-style partitioning (year=2016/month=06/day=15) lets Athena, Spark, and friends skip entire prefixes during queries. Your future data team will assume you did this. Do it now.
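Writers are where partition layouts go wrong (an unpadded month here, a missing day there), so it helps to generate the prefix in one place. A minimal sketch matching the tree above; partition_prefix is a hypothetical helper:

```python
from datetime import date


def partition_prefix(layer: str, day: date, dataset: str = "") -> str:
    """Build a Hive-style prefix such as raw/year=2016/month=06/day=15/."""
    parts = [layer]
    if dataset:
        parts.append(dataset)
    # Zero-padded values keep lexical and chronological order aligned
    parts.append(f"year={day:%Y}/month={day:%m}/day={day:%d}")
    return "/".join(parts) + "/"
```

The same function works for the raw and processed layers, for example partition_prefix("processed", some_day, "users").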

The Real Takeaway

S3 stopped being “where we put files” and became “how we trigger workflows, serve assets globally, and tier storage costs automatically.” That’s the mindset shift.

Start with good bucket naming, encryption, and key structure. Add lifecycle policies before your first real bill. Use presigned URLs for uploads. Put CloudFront in front of anything users download. Wire event notifications when you catch yourself polling S3. Lock down access with IAM and bucket policies before someone makes a bucket public by accident.

Layer complexity as you need it. The patterns here handled billions of objects because each one solved a specific problem we actually had — not because we architected for theoretical scale on day one.

S3 best practices from mid-2016, when Lambda triggers and lifecycle automation were becoming standard patterns. Core concepts unchanged; storage classes and replication options have expanded since.