Building Resilient Systems: Timeout, Retry, and Fallback
The incident report was maddening in its simplicity. Payment service healthy according to Kubernetes. CPU at 12%. Memory fine. Logs clean. But checkout had been failing for eleven minutes because the payment provider’s API was accepting TCP connections and then… nothing. No response. No error. Just silence.
Our HTTP client had no timeout configured. Default behavior: wait forever.
Forever, in production, is about eleven minutes: long enough for connection pools to fill up, thread pools to exhaust, and the failure to cascade from payment service to order service to the entire checkout flow. One hung dependency took down revenue for eleven minutes.
That incident taught me something I’ve never forgotten: resilience isn’t about handling errors. It’s about handling the absence of errors when something is clearly wrong.
Timeouts, retries, and fallbacks are the three patterns I reach for first. Not because they’re fancy—because they prevent silent failures from becoming cascading disasters.
The Failure Modes You’re Actually Fighting
Before diving into patterns, understand what breaks in distributed systems:
| Failure Type | What Happens | What You See |
|---|---|---|
| Hard failure | Service returns 500, connection refused | Error logs, alerts fire |
| Slow failure | Service responds… eventually | Timeouts (if configured) |
| Silent failure | Connection accepted, no response | Nothing. Until cascade. |
| Partial failure | Service returns degraded data | Wrong data, silent corruption |
| Thundering herd | Service recovers, all clients retry at once | Immediate re-failure |
Hard failures are easy. Silent and slow failures are what kill you at 3am.
Pattern 1: Timeouts — Assume Everything Will Hang
Every external call needs a timeout. No exceptions. Not the database. Not Redis. Not the internal service “that never goes down.”
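// Error type used to tell timeouts apart from other failures
class TimeoutError extends Error {
  constructor(message) {
    super(message);
    this.name = 'TimeoutError';
  }
}

function withTimeout(promise, timeoutMs, operation = 'operation') {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => {
      reject(new TimeoutError(`${operation} timed out after ${timeoutMs}ms`));
    }, timeoutMs);
  });
  // Clear the timer once the race settles so it cannot linger after success
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Usage
async function getUser(userId) {
  return withTimeout(
    userService.fetch(userId),
    3000, // 3 seconds max
    `fetchUser(${userId})`
  );
}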
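One caveat with racing a promise against a timer: the caller stops waiting, but the underlying request keeps running. A minimal sketch of a cancellable variant follows, assuming the wrapped operation accepts an AbortSignal (as fetch does). The name withAbortableTimeout is hypothetical and not part of the code above.

```javascript
// Hypothetical helper: runs operation(signal) and aborts it after timeoutMs.
async function withAbortableTimeout(operation, timeoutMs) {
  const controller = new AbortController();
  let timer;
  const timedOut = new Promise((_, reject) => {
    timer = setTimeout(() => {
      const error = new Error(`Timed out after ${timeoutMs}ms`);
      error.name = 'TimeoutError';
      controller.abort(error); // ask the operation to stop its work
      reject(error);           // and stop waiting for it either way
    }, timeoutMs);
  });
  try {
    return await Promise.race([operation(controller.signal), timedOut]);
  } finally {
    clearTimeout(timer); // no stray timer once the race is settled
  }
}

// Possible usage with fetch, which accepts a signal option:
//   await withAbortableTimeout((signal) => fetch(url, { signal }), 3000);
```

The race is kept on purpose: an operation that ignores the signal still cannot hold the caller past the timeout.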
Choosing Timeout Values
This is part science, part art:
const TIMEOUTS = {
  // User-facing requests: tight
  checkout: 5000,
  search: 2000,
  // Internal service calls: moderate
  userService: 3000,
  inventoryService: 3000,
  // Background jobs: generous
  reportGeneration: 60000,
  dataExport: 300000,
};
My rule of thumb: set the timeout at 2-3x your p99 latency, then tune based on incidents. If your user service p99 is 800ms, start with a timeout of roughly 1600-2400ms.
Too tight: false timeouts during normal load spikes. Too loose: cascading failures during actual outages.
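To apply the rule of thumb you need a p99 from real measurements. The sketch below computes a nearest-rank percentile from an array of latency samples and multiplies it. Both function names and the 2.5 default multiplier are illustrative assumptions; in practice the p99 would usually come from your metrics system.

```javascript
// Nearest-rank percentile: the smallest sample such that at least p% of
// all samples are less than or equal to it.
function percentile(samples, p) {
  if (samples.length === 0) {
    throw new RangeError('percentile() needs at least one sample');
  }
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100); // 1-based rank
  return sorted[Math.max(rank, 1) - 1];
}

// Hypothetical helper: apply the 2-3x rule to measured latencies.
function suggestTimeoutMs(latenciesMs, multiplier = 2.5) {
  return Math.round(percentile(latenciesMs, 99) * multiplier);
}
```

With a p99 of 800ms and the default multiplier, this suggests 2000ms as a starting point.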
Timeouts at Every Layer
Don’t just time out the HTTP call. Put a deadline on the entire operation, including retries:
// DeadlineExceededError is assumed to be defined alongside TimeoutError
async function checkoutWithDeadline(cart, deadlineMs = 10000) {
  const deadline = Date.now() + deadlineMs;
  // Run one step with the smaller of its own timeout and the remaining budget.
  // The budget is checked before every step, not only at the start.
  const step = (name, start, maxMs) => {
    const remaining = deadline - Date.now();
    if (remaining <= 0) {
      throw new DeadlineExceededError(`Checkout deadline exceeded before ${name}`);
    }
    return withTimeout(start(), Math.min(remaining, maxMs), name);
  };
  const user = await step('getUser', () => getUser(cart.userId), 3000);
  const inventory = await step('checkInventory', () => checkInventory(cart.items), 3000);
  const payment = await step('processPayment', () => processPayment(cart, user), 5000);
  return { user, inventory, payment };
}
The checkout has a 10-second deadline. Each sub-operation gets the minimum of its normal timeout and whatever time remains. No single slow call consumes the entire budget.
Pattern 2: Retry — But Not Like a Maniac
Retries save you from transient failures: a momentary network blip, a pod restart, a GC pause. Retries hurt you during sustained failures: if every client retries a failing service simultaneously, you create a DDoS against yourself.
Exponential Backoff with Jitter
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// NetworkError is assumed to be defined by your HTTP client wrapper
async function retryWithBackoff(fn, options = {}) {
  const {
    maxRetries = 3,
    initialDelayMs = 100,
    maxDelayMs = 10000,
    backoffMultiplier = 2,
    jitter = true,
    retryableErrors = [TimeoutError, NetworkError],
  } = options;
  let lastError;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;
      const isRetryable = retryableErrors.some(
        ErrorType => error instanceof ErrorType
      );
      // Give up on non-retryable errors and once the retry budget is spent
      if (!isRetryable || attempt === maxRetries) {
        throw error;
      }
      let delay = initialDelayMs * Math.pow(backoffMultiplier, attempt);
      delay = Math.min(delay, maxDelayMs);
      // Jitter: randomize delay to prevent thundering herd
      if (jitter) {
        delay = delay * (0.5 + Math.random() * 0.5);
      }
      logger.warn(`Retry attempt ${attempt + 1}/${maxRetries}`, {
        error: error.message,
        delayMs: Math.round(delay),
      });
      await sleep(delay);
    }
  }
  // Only reached if maxRetries is negative and the loop never ran
  throw lastError;
}

// Usage
const user = await retryWithBackoff(
  () => userService.fetch(userId),
  { maxRetries: 3, initialDelayMs: 200 }
);
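To make the schedule concrete, here is a small sketch that isolates the delay calculation used in retryWithBackoff. The helper name backoffDelayMs is hypothetical, and the random source is injectable only so the output can be made predictable.

```javascript
// Hypothetical helper mirroring the delay math in retryWithBackoff.
// `random` defaults to Math.random and can be replaced for deterministic runs.
function backoffDelayMs(
  attempt,
  { initialDelayMs = 100, maxDelayMs = 10000, multiplier = 2 } = {},
  random = Math.random
) {
  // Exponential growth, capped so late attempts do not wait unboundedly
  const capped = Math.min(initialDelayMs * multiplier ** attempt, maxDelayMs);
  // Jitter: keep between 50% and 100% of the capped delay
  return capped * (0.5 + random() * 0.5);
}

// Upper bound of each delay (random() pinned to 1) for the first four attempts
const ceilings = [0, 1, 2, 3].map((attempt) => backoffDelayMs(attempt, {}, () => 1));
console.log(ceilings); // [ 100, 200, 400, 800 ]
```

Each client lands somewhere between half of the ceiling and the ceiling itself, which is what spreads out a crowd of clients that all failed at the same moment.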
What to Retry (and What Not To)
Retry these:
- Timeouts
- Connection errors
- 429 (Too Many Requests) — with backoff
- 502, 503, 504 — server/gateway errors
Never retry these:
- 400 (Bad Request) — your fault, retrying won’t help
- 401, 403 — auth issues
- 404 — resource doesn’t exist
- 409 (Conflict) — business logic rejection
function isRetryable(error) {
  if (error instanceof TimeoutError) return true;
  if (error instanceof NetworkError) return true;
  if (error.statusCode === 429) return true;
  // Match the list above: only 502, 503, 504 (a 501 will not succeed on retry)
  if ([502, 503, 504].includes(error.statusCode)) return true;
  return false;
}
Retrying a 400 Bad Request twelve times because your retry logic doesn’t check status codes? I’ve done it. It’s embarrassing.
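For 429 responses, servers often say how long to wait through the Retry-After header, which carries either a number of seconds or an HTTP date. A sketch of parsing it follows. The helper name retryAfterMs is hypothetical, and wiring the result into the retry loop is left out.

```javascript
// Hypothetical helper: convert a Retry-After header value to milliseconds.
// Returns null when the header is missing or cannot be parsed.
function retryAfterMs(headerValue, now = Date.now()) {
  if (headerValue == null || headerValue === '') return null;
  const value = String(headerValue).trim();
  // Form 1: delay in whole seconds, e.g. "120"
  if (/^\d+$/.test(value)) return Number(value) * 1000;
  // Form 2: HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT"
  const date = Date.parse(value);
  if (Number.isNaN(date)) return null;
  return Math.max(0, date - now); // a date in the past means "retry now"
}
```

A reasonable policy is to wait for the larger of this value and the computed backoff delay, so the client never retries earlier than the server asked.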
Idempotency: The Retry Prerequisite
Retries are only safe for idempotent operations—operations that produce the same result whether you execute them once or five times.
// NOT idempotent: retrying creates duplicate charges
async function chargeCard(amount, cardToken) {
  return paymentProvider.charge({ amount, cardToken });
}

// Idempotent: uses idempotency key
async function chargeCardIdempotent(amount, cardToken, idempotencyKey) {
  return paymentProvider.charge({
    amount,
    cardToken,
    idempotencyKey, // Provider deduplicates by this key
  });
}

// Usage: generate key once, reuse on retry
const idempotencyKey = `order-${orderId}-payment`;
const payment = await retryWithBackoff(
  () => chargeCardIdempotent(amount, cardToken, idempotencyKey)
);
GET requests are naturally idempotent. POST requests need idempotency keys. PUT and DELETE are idempotent by HTTP semantics, but verify that your implementation actually behaves that way.
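What "the provider deduplicates by this key" means can be sketched from the receiving side. The class below is an illustration only: IdempotencyStore is a hypothetical name, and it keeps results in an in-memory Map, whereas a real service would need a durable store with expiry.

```javascript
// Hypothetical in-memory dedupe: one execution per idempotency key.
class IdempotencyStore {
  constructor() {
    this.results = new Map(); // key -> promise of the operation's result
  }

  run(key, operation) {
    if (!this.results.has(key)) {
      const pending = Promise.resolve().then(operation);
      // Forget failed attempts so a later retry can execute again
      pending.catch(() => this.results.delete(key));
      this.results.set(key, pending);
    }
    // Repeated and concurrent calls share the first call's outcome
    return this.results.get(key);
  }
}
```

Storing the promise rather than the resolved value also covers the case where a retry arrives while the first attempt is still in flight.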
Pattern 3: Fallback — Graceful Degradation
When retries exhaust and timeouts fire, fallback keeps the system usable—degraded, but usable.
async function getUserWithFallback(userId) {
  try {
    return await retryWithBackoff(
      () => withTimeout(userService.fetch(userId), 3000, 'getUser'),
      { maxRetries: 2 }
    );
  } catch (error) {
    logger.warn('User service failed, trying fallbacks', {
      userId,
      error: error.message,
    });
    // Fallback 1: Stale cache
    const cached = await cache.get(`user:${userId}`);
    if (cached) {
      metrics.increment('fallback.cache_hit');
      return { ...cached, _stale: true };
    }
    // Fallback 2: Default/guest user
    metrics.increment('fallback.default_user');
    return {
      id: userId,
      name: 'Guest',
      email: null,
      _fallback: true,
    };
  }
}
Fallback Strategies by Priority
I think of fallbacks as a ladder—try each rung until something works:
- Primary service — the real thing
- Retry with backoff — maybe it was transient
- Stale cache — old data beats no data
- Alternative service — read replica, backup provider
- Static/default response — hardcoded fallback
- Fail gracefully — return error with helpful message
Which rungs you implement depends on the business impact:
// Product recommendations: static fallback is fine
async function getRecommendations(userId) {
  try {
    return await recommendationService.get(userId);
  } catch {
    return STATIC_FALLBACK_RECOMMENDATIONS; // Curated bestsellers
  }
}

// Payment processing: NO fallback. Fail explicitly.
async function processPayment(order) {
  try {
    return await paymentService.charge(order);
  } catch (error) {
    // Never silently substitute a fake payment
    throw new PaymentFailedError('Unable to process payment', { cause: error });
  }
}
Critical rule: Know which operations can degrade and which must fail hard. Recommending generic products when the ML service is down? Fine. Pretending a payment succeeded? Criminal.
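The ladder idea can also be written once as a generic helper instead of nested try/catch blocks. This is a sketch under assumptions: tryInOrder and the rung shape ({ name, run }) are hypothetical, and a rung that returns null or undefined is treated as a miss.

```javascript
// Hypothetical helper: try each rung in order and return the first usable value.
async function tryInOrder(rungs) {
  const errors = [];
  for (const { name, run } of rungs) {
    try {
      const value = await run();
      // null/undefined counts as a miss (for example, nothing in the cache)
      if (value !== undefined && value !== null) {
        return { value, source: name };
      }
    } catch (error) {
      errors.push(error); // remember why this rung failed, then keep going
    }
  }
  // Every rung failed or missed: fail explicitly, with the collected causes
  throw new AggregateError(errors, 'All fallback rungs failed');
}
```

For an operation that must fail hard, such as a payment, the list simply contains a single rung.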
Combining the Patterns
In production, these patterns stack together:
async function getProductDetails(productId) {
  const cacheKey = `product:${productId}`;
  // Layer 1: Cache (fastest fallback)
  const cached = await cache.get(cacheKey);
  // Entries are stored as JSON strings below, so parse them on the way out
  if (cached) return JSON.parse(cached);
  // Layer 2: Primary service with timeout + retry
  try {
    const product = await retryWithBackoff(
      () => withTimeout(
        productService.fetch(productId),
        2000,
        'fetchProduct'
      ),
      { maxRetries: 2, initialDelayMs: 100 }
    );
    // Populate cache on success
    await cache.setex(cacheKey, 300, JSON.stringify(product));
    return product;
  } catch (primaryError) {
    logger.warn('Primary product service failed', {
      productId,
      error: primaryError.message,
    });
    // Layer 3: Read replica fallback
    try {
      const product = await withTimeout(
        productServiceReadReplica.fetch(productId),
        3000,
        'fetchProductReplica'
      );
      await cache.setex(cacheKey, 60, JSON.stringify(product)); // Shorter TTL
      return { ...product, _source: 'replica' };
    } catch (replicaError) {
      // Layer 4: Static fallback for known products
      const staticProduct = STATIC_CATALOG[productId];
      if (staticProduct) {
        return { ...staticProduct, _source: 'static' };
      }
      // Layer 5: Fail with useful error
      throw new ProductUnavailableError(productId, {
        primaryError,
        replicaError,
      });
    }
  }
}
Cache → retry with timeout → read replica → static fallback → explicit failure. Four layers of resilience, then a clear error instead of a hang.
Circuit Breakers: When Retries Become the Problem
Retries help with transient failures. But if a service is genuinely down, retries just waste time and amplify load. Circuit breakers stop calling services that aren’t going to respond:
class CircuitBreaker {
  constructor(options = {}) {
    this.failureThreshold = options.failureThreshold ?? 5;
    this.resetTimeoutMs = options.resetTimeoutMs ?? 30000;
    this.state = 'CLOSED'; // CLOSED = normal, OPEN = failing, HALF_OPEN = testing
    this.failureCount = 0;
    this.lastFailureTime = null;
  }

  async execute(fn, fallback) {
    if (this.state === 'OPEN') {
      if (Date.now() - this.lastFailureTime <= this.resetTimeoutMs) {
        return fallback(); // Skip call entirely
      }
      this.state = 'HALF_OPEN'; // Reset timeout elapsed: let one trial call through
    } else if (this.state === 'HALF_OPEN') {
      return fallback(); // A trial call is already in flight
    }
    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      if (this.state === 'OPEN') {
        return fallback();
      }
      throw error;
    }
  }

  onSuccess() {
    this.failureCount = 0;
    this.state = 'CLOSED';
  }

  onFailure() {
    this.failureCount++;
    this.lastFailureTime = Date.now();
    if (this.failureCount >= this.failureThreshold) {
      this.state = 'OPEN';
      logger.error('Circuit breaker OPEN', { failureCount: this.failureCount });
    }
  }
}

// Usage
const paymentBreaker = new CircuitBreaker({ failureThreshold: 3 });
async function processPayment(order) {
  return paymentBreaker.execute(
    () => paymentService.charge(order),
    () => { throw new PaymentServiceUnavailableError(); }
  );
}
When the circuit is OPEN, calls fail immediately—no waiting for timeouts, no retry storms. After 30 seconds, one test call goes through (HALF_OPEN). If it succeeds, circuit closes. If it fails, back to OPEN.
Libraries like opossum (Node.js) and resilience4j (Java) implement this with metrics and configuration.
Testing Resilience (Not Just Happy Path)
Patterns you don’t test don’t work when you need them. I use chaos engineering lite:
// Fault injection middleware for staging
app.use((req, res, next) => {
  const faultConfig = req.headers['x-fault-injection'];
  if (!faultConfig || process.env.NODE_ENV === 'production') {
    return next();
  }
  // A malformed header should produce a clear 400, not an unhandled exception
  let fault;
  try {
    fault = JSON.parse(faultConfig);
  } catch {
    return res.status(400).json({ error: 'Invalid x-fault-injection header' });
  }
  const { type, probability = 0.5 } = fault;
  if (Math.random() >= probability) return next();
  switch (type) {
    case 'timeout':
      // Never respond: tests client timeouts
      return; // Hang forever
    case 'slow':
      return setTimeout(next, 10000); // 10s delay
    case 'error':
      return res.status(503).json({ error: 'Injected failure' });
    default:
      return next();
  }
});
Regular chaos tests in staging:
- Kill a service pod, verify fallbacks activate
- Inject 5-second latency, verify timeouts fire
- Return 503 for 60 seconds, verify circuit breaker opens
- Restore service, verify circuit breaker closes
If you haven’t tested it, assume it’s broken.
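The same faults can be exercised in unit tests, without a staging environment, by wrapping a dependency function. The sketch below is one way to do that. The name injectFaults and its option names are hypothetical, chosen to mirror the fault types in the middleware above.

```javascript
// Hypothetical test helper: wrap an async function so it misbehaves on demand.
// `random` is injectable so tests can be deterministic.
function injectFaults(fn, { type, probability = 1, delayMs = 10000 }, random = Math.random) {
  return async (...args) => {
    if (random() >= probability) return fn(...args); // no fault this time
    switch (type) {
      case 'timeout':
        return new Promise(() => {}); // never settles, like a silent dependency
      case 'slow':
        await new Promise((resolve) => setTimeout(resolve, delayMs));
        return fn(...args);
      case 'error':
        throw Object.assign(new Error('Injected failure'), { statusCode: 503 });
      default:
        return fn(...args);
    }
  };
}
```

Wrapping a client call with type 'timeout' and asserting that the caller still returns within its deadline is a cheap regression test for the silent-failure case.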
Production Checklist
Before shipping any service that calls external dependencies:
- Every external call has a timeout — HTTP, database, cache, message queue
- Retries use exponential backoff with jitter — not immediate retries
- Retry logic checks error types — don’t retry 400s
- Write operations use idempotency keys — safe to retry
- Fallback strategy defined per operation — know what can degrade
- Circuit breakers on critical dependencies — fail fast when services are down
- Metrics on timeout/retry/fallback rates — you can’t fix what you can’t see
- Chaos tests in staging — verify patterns actually work
Conclusion
That eleven-minute checkout outage? A three-line timeout would have prevented it. The payment provider’s silence would have become a caught TimeoutError, retried twice, then failed gracefully with “Payment temporarily unavailable—please try again.”
Resilience patterns aren’t exciting. Nobody gets promoted for adding timeouts. But they’re the difference between a blip users never notice and an incident that wakes up the on-call engineer at 3am.
Set timeouts on everything. Retry with backoff and jitter. Fall back gracefully where business rules allow. Break circuits when services are genuinely down. Test it all in staging before production teaches you the hard way.
Further Resources:
- Release It! — Michael Nygard’s resilience bible
- Circuit Breaker Pattern — Martin Fowler
- Exponential Backoff and Jitter — AWS Architecture Blog
- resilience4j — Java resilience library
- opossum — Node.js circuit breaker
Building resilient systems from May 2022, covering timeout, retry, and fallback patterns.