Scaling to 20M Users: Architecture Lessons Learned
Scaling from a few thousand to 20 million users required significant architectural changes. Here are the lessons learned from that journey.
Initial Architecture
Monolithic Application
┌─────────────────┐
│ Web Server │
│ (Node.js/PHP) │
└────────┬────────┘
│
┌────────▼────────┐
│ PostgreSQL │
│ (Single DB) │
└─────────────────┘
Problems:
- Database became bottleneck
- Single point of failure
- Hard to scale horizontally
- Long deployment cycles
Phase 1: Database Optimization
Read Replicas
┌─────────────┐
│ Primary DB │
└──────┬──────┘
│
├──────────┬──────────┐
│ │ │
┌──────▼──┐ ┌────▼────┐ ┌─▼─────┐
│ Replica │ │ Replica │ │Replica│
│ 1 │ │ 2 │ │ 3 │
└─────────┘ └─────────┘ └───────┘
Results:
- 3x read capacity
- Reduced primary DB load
- Better geographic distribution
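The read-replica split can be sketched as a small routing layer: writes go to the primary, reads rotate round-robin across the replicas. The hostnames here are illustrative placeholders, not the actual production config.

```javascript
// Sketch: route writes to the primary and round-robin reads across
// replicas. Hostnames are illustrative placeholders.
const primary = { host: 'db-primary.internal' };
const replicas = [
  { host: 'db-replica-1.internal' },
  { host: 'db-replica-2.internal' },
  { host: 'db-replica-3.internal' },
];

let next = 0;

// Simple round-robin over the replica list.
function pickReplica() {
  const replica = replicas[next];
  next = (next + 1) % replicas.length;
  return replica;
}

// Writes (and transactions) must hit the primary; replicas lag by
// replication delay, so read-your-own-writes needs care.
function pickConnection(isWrite) {
  return isWrite ? primary : pickReplica();
}
```

In practice this lives behind the driver or a proxy like PgBouncer, but the routing decision is the same.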
Connection Pooling
// Before: Direct connections
const client = new pg.Client();
await client.connect();

// After: Connection pool
const pool = new pg.Pool({
  max: 20,
  idleTimeoutMillis: 30000,
  connectionTimeoutMillis: 2000
});
Results:
- Reduced connection overhead
- Better resource utilization
- 40% reduction in connection time
Phase 2: Caching Strategy
Multi-Layer Caching
┌─────────────┐
│ Client │
└──────┬──────┘
│
┌──────▼──────┐
│ CDN │ (Static assets)
└──────┬──────┘
│
┌──────▼──────┐
│ Redis │ (Application cache)
│ (In-memory)│
└──────┬──────┘
│
┌──────▼──────┐
│ Database │
└─────────────┘
Redis Implementation
// Cache user data with a read-through pattern
async function getUser(userId) {
  const cacheKey = `user:${userId}`;

  // Check cache
  const cached = await redis.get(cacheKey);
  if (cached) {
    return JSON.parse(cached);
  }

  // Fetch from database
  const user = await db.users.findById(userId);

  // Cache for 5 minutes
  await redis.setex(cacheKey, 300, JSON.stringify(user));
  return user;
}
Results:
- 80% cache hit rate
- 10x reduction in database queries
- Sub-millisecond response times
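The read path above needs a matching write path, or the cache serves stale users after an update. A minimal sketch, assuming the same `db` and `redis` clients as the read-through code:

```javascript
// Sketch of write-path invalidation to pair with the read-through
// cache. `db` and `redis` are the same assumed clients as above.
async function updateUser(userId, fields) {
  // Write to the source of truth first.
  const user = await db.users.update(userId, fields);

  // Then drop the stale cache entry; the next read repopulates it.
  // Deleting (rather than re-setting) avoids racing concurrent writers.
  await redis.del(`user:${userId}`);

  return user;
}
```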
Cache Warming
// Pre-populate cache for popular content
async function warmCache() {
  const popularUsers = await db.users.findPopular(1000);
  for (const user of popularUsers) {
    await redis.setex(
      `user:${user.id}`,
      3600,
      JSON.stringify(user)
    );
  }
}
Phase 3: Microservices
Service Decomposition
┌──────────────┐
│ API Gateway │
└──────┬───────┘
│
┌───┴───┬────────┬────────┐
│ │ │ │
┌──▼──┐ ┌──▼──┐ ┌──▼──┐ ┌───▼──┐
│User │ │Order│ │Pay │ │Notif │
│Svc │ │Svc │ │Svc │ │Svc │
└──┬──┘ └──┬──┘ └──┬──┘ └──┬───┘
│ │ │ │
┌──▼──┐ ┌──▼──┐ ┌──▼──┐ ┌──▼──┐
│User │ │Order│ │Pay │ │Msg │
│DB │ │DB │ │DB │ │Queue│
└─────┘ └─────┘ └─────┘ └─────┘
Benefits:
- Independent scaling
- Technology diversity
- Faster deployments
- Fault isolation
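The API Gateway's core job in this diagram is mapping a request path to the owning service. A minimal sketch of that routing table (service names and ports are illustrative, not the real deployment):

```javascript
// Minimal sketch of the gateway's routing table: match the request
// path prefix to a downstream service. Names/ports are illustrative.
const routes = [
  { prefix: '/users',         target: 'http://user-svc:3001' },
  { prefix: '/orders',        target: 'http://order-svc:3002' },
  { prefix: '/payments',      target: 'http://pay-svc:3003' },
  { prefix: '/notifications', target: 'http://notif-svc:3004' },
];

// Returns the downstream URL for a request path, or null if no
// service owns the route.
function resolveRoute(path) {
  const route = routes.find(r => path.startsWith(r.prefix));
  return route ? route.target + path.slice(route.prefix.length) : null;
}
```

In production this is usually a managed gateway or an Envoy/nginx config rather than hand-written code, but the prefix-to-service mapping is the same.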
Phase 4: Database Sharding
User Sharding Strategy
// Shard by user ID
function getShard(userId) {
  const shardId = parseInt(userId, 10) % 10; // 10 shards
  return `user_db_${shardId}`;
}

async function getUser(userId) {
  const shard = getShard(userId);
  const db = getDatabaseConnection(shard);
  return await db.users.findById(userId);
}
Results:
- 10x write capacity
- Reduced database size per shard
- Better query performance
Cross-Shard Queries
// Query across all shards (scatter-gather)
async function searchUsers(query) {
  const shards = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9];
  const results = await Promise.all(
    shards.map(shard => queryShard(shard, query))
  );
  return results.flat();
}
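One caveat with the scatter-gather above: `Promise.all` rejects the whole search if any single shard is down. A `Promise.allSettled` variant returns partial results instead; `queryShard` is the same assumed helper as before.

```javascript
// Fault-tolerant variant of the scatter-gather: Promise.allSettled
// returns partial results instead of failing the whole search when
// one shard is down. `queryShard` is the same assumed helper.
async function searchUsersPartial(query) {
  const shards = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9];
  const settled = await Promise.allSettled(
    shards.map(shard => queryShard(shard, query))
  );
  const failed = settled.filter(r => r.status === 'rejected').length;
  const results = settled
    .filter(r => r.status === 'fulfilled')
    .flatMap(r => r.value);
  // Callers can surface `failed > 0` as a "partial results" flag.
  return { results, failed };
}
```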
Phase 5: Message Queues
Async Processing
┌──────────┐
│ API │
└────┬─────┘
│
▼
┌──────────┐
│ SQS │ (Message Queue)
└────┬─────┘
│
├─────────┬─────────┐
│ │ │
┌────▼──┐ ┌───▼──┐ ┌───▼──┐
│Worker │ │Worker│ │Worker│
│ 1 │ │ 2 │ │ 3 │
└───────┘ └──────┘ └──────┘
Use Cases:
- Email sending
- Image processing
- Analytics
- Notifications
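The worker pattern behind this diagram is the same regardless of broker: receive a message, process it, and acknowledge (in SQS, delete) it only on success, so failed jobs get retried. A sketch against an in-memory queue standing in for SQS:

```javascript
// Worker pattern sketched against an in-memory queue standing in for
// SQS: receive, process, and acknowledge only on success.
function createQueue() {
  const messages = [];
  return {
    send(body) { messages.push(body); },
    receive() { return messages.length ? messages[0] : null; },
    ack() { messages.shift(); },   // "delete" after successful processing
    size() { return messages.length; },
  };
}

async function runWorker(queue, handler) {
  let msg;
  while ((msg = queue.receive()) !== null) {
    try {
      await handler(msg);   // e.g. send an email, resize an image
      queue.ack();          // only acknowledge on success
    } catch (err) {
      break;                // leave the message for retry / DLQ policy
    }
  }
}
```

With real SQS, `receive`/`ack` map to `ReceiveMessage`/`DeleteMessage`, and the visibility timeout handles redelivery of unacknowledged messages.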
Performance Metrics
Before Optimization
- Average response time: 2.5s
- Database CPU: 95%
- Cache hit rate: 0%
- Uptime: 99.5%
After Optimization
- Average response time: 150ms
- Database CPU: 30%
- Cache hit rate: 80%
- Uptime: 99.99%
Key Decisions
1. Database Choice
PostgreSQL for:
- Complex queries
- ACID transactions
- JSON support
MongoDB for:
- High write throughput
- Flexible schema
- Horizontal scaling
2. Caching Strategy
Redis for:
- Hot data
- Session storage
- Real-time features
CDN for:
- Static assets
- Global distribution
3. Message Queue
AWS SQS for:
- Reliability
- Auto-scaling
- Managed service
4. Monitoring
Prometheus + Grafana for:
- Metrics collection
- Alerting
- Visualization
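The kind of metric fed to Prometheus here is typically a latency histogram with cumulative buckets (in real code, a client library like `prom-client` does this). A bare-bones sketch of the bucketing itself, with illustrative boundaries:

```javascript
// Sketch of a Prometheus-style latency histogram: cumulative bucket
// counts plus a running sum and count. Bucket bounds (ms) are
// illustrative; a real setup would use a client library.
const buckets = [50, 100, 250, 500, 1000, Infinity];
const counts = new Array(buckets.length).fill(0);
let observations = 0;
let sum = 0;

function observeLatency(ms) {
  observations += 1;
  sum += ms;
  // Increment every bucket whose upper bound covers this value
  // (Prometheus histograms are cumulative).
  buckets.forEach((le, i) => { if (ms <= le) counts[i] += 1; });
}
```

From `counts`, `sum`, and `observations`, Grafana can derive rates and approximate percentiles (e.g. via `histogram_quantile`).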
Lessons Learned
1. Start Simple
Don’t over-engineer initially. Optimize when you have real problems.
2. Measure Everything
You can’t optimize what you don’t measure. Track:
- Response times
- Error rates
- Cache hit rates
- Database performance
3. Cache Aggressively
Caching provides the biggest performance gains. Cache:
- User data
- Popular content
- Expensive queries
4. The Database Is the Bottleneck
Optimize the database first:
- Indexes
- Query optimization
- Connection pooling
- Read replicas
5. Scale Horizontally
Add more servers instead of bigger servers:
- Easier to scale
- Better fault tolerance
- Cost-effective
6. Async Processing
Move heavy operations to background:
- Email sending
- Image processing
- Data aggregation
7. Monitor and Alert
Set up monitoring and alerts:
- Track key metrics
- Alert on anomalies
- Review regularly
Architecture Principles
- Stateless services - Easier to scale
- Idempotent operations - Safe retries
- Graceful degradation - Handle failures
- Circuit breakers - Prevent cascading failures
- Rate limiting - Protect resources
- Health checks - Monitor service health
- Blue-green deployments - Zero downtime
- Feature flags - Gradual rollouts
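Of the principles above, the circuit breaker is the least obvious to implement. A minimal sketch: after a threshold of consecutive failures the breaker opens and fails fast, then allows a trial call once a reset window passes. Thresholds and timings are illustrative.

```javascript
// Minimal circuit-breaker sketch: open after `threshold` consecutive
// failures, fail fast while open, allow a trial call after `resetMs`.
// Parameters are illustrative.
class CircuitBreaker {
  constructor(threshold = 5, resetMs = 30000) {
    this.threshold = threshold;
    this.resetMs = resetMs;
    this.failures = 0;
    this.openedAt = null;
  }

  async call(fn) {
    if (this.openedAt !== null) {
      if (Date.now() - this.openedAt < this.resetMs) {
        throw new Error('circuit open');   // fail fast, skip downstream
      }
      this.openedAt = null;                // half-open: allow one trial
    }
    try {
      const result = await fn();
      this.failures = 0;                   // success closes the circuit
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.failures >= this.threshold) this.openedAt = Date.now();
      throw err;
    }
  }
}
```

Wrapping each downstream call in a breaker is what prevents one slow dependency from tying up every request thread upstream.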
Current Architecture
┌─────────────┐
│ CDN │
│ (CloudFront)│
└──────┬──────┘
│
┌──────▼──────┐
│ API Gateway │
└──────┬──────┘
│
┌──────────────────┼──────────────────┐
│ │ │
┌────▼────┐ ┌────▼────┐ ┌────▼────┐
│ User │ │ Order │ │ Pay │
│ Service │ │ Service │ │ Service │
└────┬────┘ └────┬────┘ └────┬────┘
│ │ │
┌────▼────┐ ┌────▼────┐ ┌────▼────┐
│ Redis │ │ SQS │ │ SQS │
│ Cache │ │ Queue │ │ Queue │
└────┬────┘ └────┬────┘ └────┬────┘
│ │ │
┌────▼────┐ ┌────▼────┐ ┌────▼────┐
│Postgres │  │ Workers │  │ Workers │
│(Sharded)│  └─────────┘  └─────────┘
└─────────┘
Conclusion
Scaling to 20M users required:
- Database optimization (replicas, sharding)
- Aggressive caching (Redis, CDN)
- Microservices architecture
- Async processing (message queues)
- Continuous monitoring
Start simple, measure everything, and optimize based on data. The architecture evolved over time based on real needs.
Scaling lessons from December 2018, covering the journey to 20 million users.