Migrating from Monolith to Microservices: A Practical Approach
Our monolith wasn’t a disaster. That’s almost the problem.
In 2016, we shipped fast. One repo, one database, one deployment. Features went from idea to production in days. The codebase was familiar. Onboarding meant cloning one repository and running docker-compose up. Life was good.
By early 2017, “good” had become “comfortable but cramped.” Deployments made everyone hold their breath. Scaling meant scaling everything because one feature got popular. Three teams were editing the same modules and politely passive-aggressively reviewing each other’s PRs.
Microservices promised independence. They delivered… distributed systems. With extra steps.
We spent six months migrating using the Strangler Fig pattern—gradually replacing pieces of the monolith rather than rewriting in a heroic weekend that would have ended in pizza, tears, and a rollback. Here’s the honest account.
Why We Left (And Why We Hesitated)
The pain was real:
- Deployment bottlenecks — One bad migration took down the entire app
- Scaling mismatches — User profile reads were 80% of traffic; we scaled the whole monolith for them
- Team friction — Merge conflicts in shared modules were a weekly ritual
- Technology lock-in — Python for everything, even when Go or Node would have been a better fit for specific jobs
But microservices aren’t free upgrades. They’re trade-offs:
- Function calls become network calls (with latency and failure modes)
- Distributed transactions become sagas, eventual consistency, and long debugging sessions
- “Works on my machine” becomes “works in my service mesh, probably”
- Infrastructure surface area multiplies
We migrated because the monolith’s pain exceeded the distributed systems tax. If your monolith is merely annoying, fix the monolith. If it’s actively limiting growth, read on.
The Strangler Fig Pattern: Kill It Slowly
Named after the strangler fig tree that grows around a host tree until the host dies, this pattern lets you extract services incrementally while the monolith keeps running.
Old Monolith          New Architecture
┌─────────────┐       ┌─────────────┐
│             │       │ API Gateway │
│  Monolith   │───────│             │
│             │       └──────┬──────┘
└─────────────┘              │
                             ├─── User Service (new)
                             ├─── Order Service (new)
                             └─── Monolith (legacy)
No big bang. No “stop the world” rewrite. Traffic routes through a gateway; new services handle what they’re ready for; the monolith handles everything else. Over months, the monolith shrinks.
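The routing decision at the heart of the pattern can be sketched as a prefix table with the monolith as the default. This is a minimal sketch rather than our actual gateway config; the monolith host and the helper name `resolve_upstream` are assumptions made for illustration.

```python
# Hypothetical monolith address; only the two service hosts appear in this post.
MONOLITH_URL = 'http://monolith:8000'

# Prefixes that have been extracted so far. Everything else stays on the monolith.
EXTRACTED_ROUTES = {
    '/users': 'http://user-service:5000',
    '/orders': 'http://order-service:5001',
}


def resolve_upstream(path, routes=EXTRACTED_ROUTES, default=MONOLITH_URL):
    """Return the base URL that should handle `path`."""
    for prefix, upstream in routes.items():
        # Match the prefix itself or anything below it, but not '/users-export'.
        if path == prefix or path.startswith(prefix + '/'):
            return upstream
    return default
```

Extracting a service then means adding one entry to `EXTRACTED_ROUTES`, and backing out means deleting it.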
Step 1: Extract a Read-Heavy Service
We started with user profiles. Why?
- Read-heavy — Mostly GET requests, low risk of data corruption during migration
- Clear boundaries — User data belongs to users; few cross-cutting concerns
- High traffic — Immediate scaling benefit
# Before: In monolith
class UserController:
    def get_profile(self, user_id):
        user = User.objects.get(id=user_id)
        return {
            'id': user.id,
            'name': user.name,
            'email': user.email
        }

# After: Extract to service
# user-service/app.py
from flask import Flask, jsonify

app = Flask(__name__)

@app.route('/users/<user_id>', methods=['GET'])
def get_user(user_id):
    # db is the service's own data-access layer (not shown)
    user = db.get_user(user_id)
    return jsonify({
        'id': user.id,
        'name': user.name,
        'email': user.email
    })

# Monolith: Call service instead
import requests

class UserController:
    def get_profile(self, user_id):
        response = requests.get(
            f'http://user-service/users/{user_id}',
            timeout=5  # never wait on the network indefinitely
        )
        response.raise_for_status()  # surface 4xx/5xx instead of parsing an error body
        return response.json()
The monolith didn’t disappear. It became a client. Users hitting /profile still went through the familiar code path, which now proxied to the new service. We could scale user reads independently, deploy user service changes without touching orders, and roll back by flipping a feature flag to read from the local database again.
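The feature-flag rollback can be sketched roughly as follows. The flag name, the `FLAG_` environment-variable convention, and the injected `fetch_remote` / `fetch_local` callables are all hypothetical; substitute whatever flag system you already run.

```python
import os


def flag_enabled(name, env=os.environ):
    """Treat FLAG_<NAME>=1 as on. Hypothetical convention for this sketch."""
    return env.get(f'FLAG_{name.upper()}', '0') == '1'


class UserController:
    def __init__(self, fetch_remote, fetch_local, env=os.environ):
        # fetch_remote calls the user service; fetch_local reads the monolith's DB.
        self._fetch_remote = fetch_remote
        self._fetch_local = fetch_local
        self._env = env

    def get_profile(self, user_id):
        if flag_enabled('user_service_reads', self._env):
            return self._fetch_remote(user_id)
        # Flag off: the original local read, untouched.
        return self._fetch_local(user_id)
```

Keeping the local read path alive until the new service has proven itself is what makes the rollback a config change instead of a deploy.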
Lesson learned: start with reads. Writes involve transactions, side effects, and the haunting question “what if the network fails mid-commit?”
Step 2: Extract a Write-Heavy Service
Orders were scarier. Money involved. Inventory involved. Side effects everywhere.
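# order-service/app.py
from flask import Flask, jsonify, request
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

# Order (the ORM model) and event_bus are defined elsewhere in the service

app = Flask(__name__)
engine = create_engine('postgresql://...')
Session = sessionmaker(bind=engine)

@app.route('/orders', methods=['POST'])
def create_order():
    data = request.json
    session = Session()
    try:
        # Create order
        order = Order(
            user_id=data['user_id'],
            total=data['total']
        )
        session.add(order)
        session.commit()
        # Copy what we need before the session closes and the instance detaches
        order_id, user_id = order.id, order.user_id
    except Exception as e:
        session.rollback()
        return jsonify({'error': str(e)}), 500
    finally:
        session.close()

    # Publish event only after the commit. A publish failure is logged rather than
    # returned as a 500: the order already exists, and a client retry would duplicate it.
    try:
        event_bus.publish('order.created', {
            'order_id': order_id,
            'user_id': user_id
        })
    except Exception:
        app.logger.exception('order %s committed but order.created was not published', order_id)
    return jsonify({'id': order_id}), 201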
The critical addition: event publishing. The monolith used to handle order creation and send confirmation emails in the same request. Now the order service commits the order and publishes order.created. Other services (notifications, analytics, inventory) subscribe.
We gave up synchronous simplicity for asynchronous resilience. An email failure no longer rolls back an order. That’s the deal you make with microservices.
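To make the trade concrete, here is a toy in-process event bus. It is not our hand-rolled bus (that one is not shown in this post) and it has no persistence or retries; it only illustrates that a failing subscriber does not stop the others or the publisher. The handler names are made up.

```python
import logging
from collections import defaultdict

logger = logging.getLogger('event_bus')


class InMemoryEventBus:
    """Toy stand-in for a real broker: synchronous, in-process, no persistence."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, topic):
        def decorator(handler):
            self._handlers[topic].append(handler)
            return handler
        return decorator

    def publish(self, topic, payload):
        """Deliver to every subscriber; return the names of handlers that failed."""
        failed = []
        for handler in self._handlers[topic]:
            try:
                handler(payload)
            except Exception:
                # One broken subscriber must not block the others or the publisher.
                logger.exception('handler %s failed for %s', handler.__name__, topic)
                failed.append(handler.__name__)
        return failed


event_bus = InMemoryEventBus()
reserved = []


@event_bus.subscribe('order.created')
def send_confirmation_email(event):
    raise RuntimeError('SMTP server unreachable')


@event_bus.subscribe('order.created')
def reserve_inventory(event):
    reserved.append(event['order_id'])


failures = event_bus.publish('order.created', {'order_id': 1, 'user_id': 42})
```

After the publish, `failures` names the email handler and `reserved` still contains the order. A real broker adds the parts that matter in production: durable queues, redelivery, and dead-letter handling.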
Service Boundaries: The Hard Part
Everyone agrees you need “good boundaries.” Nobody agrees where to draw them until six months later when you realize you drew them wrong.
Boundaries That Worked
# User Service - Owns user data
class UserService:
    def create_user(self, data):
        # User creation logic
        pass

    def update_profile(self, user_id, data):
        # Profile updates
        pass

    def get_user(self, user_id):
        # User retrieval
        pass

# Order Service - Owns order data
class OrderService:
    def create_order(self, user_id, items):
        # Order creation
        pass

    def get_order(self, order_id):
        # Order retrieval
        pass
Each service owns its data and its business rules. User service doesn’t reach into order tables. Order service doesn’t update user profiles. They talk via APIs and events.
The test we used: could this team own this service end-to-end? Deployment, monitoring, on-call, schema changes. If two teams would need to coordinate every change, the boundary was wrong.
Boundaries That Didn’t
# Don't do this - too granular
class EmailService:
    def send_email(self, to, subject, body):
        # Too small, should be part of notification service
        pass

# Don't do this - too broad
class BusinessLogicService:
    def do_everything(self):
        # Too large, defeats purpose
        pass
A standalone email microservice sounds clean until you realize every other service needs it. Now you have a critical dependency with no clear owner and latency on every notification.
A “business logic” service is just a monolith with extra network hops and worse grep.
Data: The Part Nobody Puts on Conference Slides
Database Per Service
Each service gets its own database. This is non-negotiable for true service independence:
# user-service/db.py
DATABASE_URL = 'postgresql://user-service-db/...'
# order-service/db.py
DATABASE_URL = 'postgresql://order-service-db/...'
Shared databases are shared coupling. If two services write to the same tables, you haven’t migrated—you’ve distributed a monolith’s problems across more repos.
Sharing Data Without Sharing Tables
When order service needs user info, it doesn’t query the user database. It listens for events:
# User Service publishes event
event_bus.publish('user.created', {
    'user_id': user.id,
    'email': user.email
})

# Order Service subscribes
@event_bus.subscribe('user.created')
def handle_user_created(event):
    # Create order history for new user
    create_order_history(event['user_id'])
Eventual consistency enters your vocabulary. The order service’s copy of user data might be seconds stale. Design for it. Show stale data gracefully. Don’t pretend you’re still in a single ACID transaction.
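One way to design for it is a small local read model in the order service, fed by user events. The sketch below assumes each event carries a monotonically increasing `version` field, which the events shown in this post do not have; without something like it, duplicate or out-of-order deliveries can overwrite newer data with older data.

```python
class UserReadModel:
    """Order service's local, eventually consistent copy of user data."""

    def __init__(self):
        self._users = {}

    def apply(self, event):
        """Upsert from a user event; ignore anything not newer than what we hold."""
        current = self._users.get(event['user_id'])
        if current is not None and current['version'] >= event['version']:
            return False  # duplicate or out-of-order delivery
        self._users[event['user_id']] = {
            'email': event['email'],
            'version': event['version'],
        }
        return True

    def email_for(self, user_id, default=None):
        """Return the last known email, or `default` if no event has arrived yet."""
        record = self._users.get(user_id)
        return record['email'] if record else default
```

The `default` argument is the graceful part: a brand-new user whose event has not arrived yet gets a placeholder instead of an error.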
API Gateway: One Front Door
Clients shouldn’t need to know your internal service topology:
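# api-gateway/app.py
from flask import Flask, Response, request, jsonify
import requests

app = Flask(__name__)

SERVICES = {
    'users': 'http://user-service:5000',
    'orders': 'http://order-service:5001',
}

METHODS = ['GET', 'POST', 'PUT', 'DELETE']

# The first route covers bare collection URLs such as POST /orders
@app.route('/<service>', defaults={'path': ''}, methods=METHODS)
@app.route('/<service>/<path:path>', methods=METHODS)
def proxy(service, path):
    if service not in SERVICES:
        return jsonify({'error': 'Service not found'}), 404
    # Keep the service prefix: /users/1 -> http://user-service:5000/users/1
    url = f'{SERVICES[service]}/{service}'
    if path:
        url = f'{url}/{path}'
    response = requests.request(
        method=request.method,
        url=url,
        headers={k: v for k, v in request.headers if k.lower() != 'host'},
        params=request.args,
        data=request.get_data(),  # forward the raw body unchanged
        timeout=5
    )
    # Pass the upstream body through as-is; it is not guaranteed to be JSON
    return Response(
        response.content,
        status=response.status_code,
        content_type=response.headers.get('Content-Type')
    )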
The gateway handles routing, authentication, rate limiting, and request logging. Services stay internal. When you split order service into order + inventory, clients don’t change—only the gateway routing table does.
Our gateway was embarrassingly simple in March 2017. It worked. Don’t let perfect gateway architecture block extraction.
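The proxy code in this section only covers routing. As one example of the other gateway duties, rate limiting can be sketched as a token bucket per client. This is an illustrative, single-process sketch with made-up class names, not what we ran; a gateway with several replicas needs shared state such as Redis.

```python
import time


class TokenBucket:
    """Allow `rate` requests per second on average, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)
        self._updated = clock()

    def allow(self):
        now = self._clock()
        # Refill for the time elapsed since the last call, capped at capacity.
        elapsed = now - self._updated
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        self._updated = now
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False


class RateLimiter:
    """One bucket per client key (API key, user ID, or IP address)."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self._rate = rate
        self._capacity = capacity
        self._clock = clock
        self._buckets = {}

    def allow(self, client_key):
        bucket = self._buckets.get(client_key)
        if bucket is None:
            bucket = TokenBucket(self._rate, self._capacity, self._clock)
            self._buckets[client_key] = bucket
        return bucket.allow()
```

In the gateway, a rejected call would typically become an HTTP 429 response before any upstream request is made.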
Service Discovery: Finding Each Other
Hardcoded URLs work until they don’t. We used Consul:
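# Using Consul for service discovery
import random

import consul

c = consul.Consul()

def get_service_url(service_name):
    # passing=True returns only instances whose health checks pass
    services = c.health.service(service_name, passing=True)[1]
    if not services:
        raise Exception(f'Service {service_name} not found')
    # Spread load instead of always hitting the first instance
    entry = random.choice(services)
    service = entry['Service']
    # An empty service address means the service uses its node's address
    address = service['Address'] or entry['Node']['Address']
    return f"http://{address}:{service['Port']}"

# Register service
c.agent.service.register(
    'user-service',
    service_id='user-service-1',
    address='user-service',
    port=5000
)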
In Kubernetes-era hindsight, this looks quaint. In 2017, Consul (or Eureka, or etcd) was how services found each other without hardcoding IPs that changed every deploy.
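Whatever the registry, the client-side selection logic is worth testing without a running agent. The sketch below works on plain dictionaries shaped like the entries Consul's health endpoint returns (`Node`, `Service`, `Checks`); the function names are ours, and the sample addresses in any test data are invented.

```python
import random


def healthy_instances(entries):
    """Return (address, port) pairs for entries whose checks are all passing."""
    instances = []
    for entry in entries:
        checks = entry.get('Checks', [])
        if all(check['Status'] == 'passing' for check in checks):
            service = entry['Service']
            # An empty Service.Address means the service uses the node's address.
            address = service.get('Address') or entry['Node']['Address']
            instances.append((address, service['Port']))
    return instances


def pick_instance(entries, rng=random):
    """Choose one healthy instance at random; raise if none are available."""
    instances = healthy_instances(entries)
    if not instances:
        raise LookupError('no healthy instances')
    return rng.choice(instances)
```

Random choice is the simplest form of client-side load balancing; round-robin or least-connections would slot into `pick_instance` without touching the filtering.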
When Things Break (They Will)
Circuit Breakers
When user service goes down, the monolith shouldn’t hang forever waiting:
import requests
from circuitbreaker import circuit, CircuitBreakerError

@circuit(failure_threshold=5, recovery_timeout=60)
def call_user_service(user_id):
    response = requests.get(
        f'http://user-service/users/{user_id}',
        timeout=5
    )
    response.raise_for_status()  # count 5xx responses as failures too
    return response.json()

# Falls back if the circuit is open or the call itself fails
def get_user_with_fallback(user_id):
    try:
        return call_user_service(user_id)
    except (CircuitBreakerError, requests.RequestException):
        # Return cached data or default (get_cached_user is the caller's own cache lookup)
        return get_cached_user(user_id) or {'id': user_id, 'name': 'Unknown'}
Fail fast. Return degraded responses. Don’t cascade failures across your entire system because one service is having a bad day.
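For readers who have not looked inside a circuit breaker, the state machine is small. The sketch below is a simplified illustration of the closed, open, and half-open states, not the implementation of the `circuitbreaker` library; it is not thread-safe, and the class and exception names are made up.

```python
import time


class CircuitOpenError(Exception):
    """Raised instead of calling the dependency while the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=60, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.recovery_timeout:
                raise CircuitOpenError('circuit open; failing fast')
            # Timeout elapsed: half-open, so let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            half_open = self._opened_at is not None
            if half_open or self._failures >= self.failure_threshold:
                # Open the circuit, or re-open it after a failed trial call.
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

While the circuit is open the dependency is never called, which is the whole point: a struggling service gets room to recover instead of a queue of timeouts.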
Retries (With Backoff, Please)
import requests
from tenacity import retry, stop_after_attempt, wait_exponential

# Retrying a POST is only safe if the endpoint is idempotent; otherwise a
# retry after a lost response can create a duplicate order.
@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=4, max=10)
)
def call_order_service(data):
    response = requests.post(
        'http://order-service/orders',
        json=data,
        timeout=5
    )
    response.raise_for_status()
    return response.json()
Retry transient failures. Don’t retry forever. Exponential backoff prevents your recovery attempt from becoming a DDoS against your own order service.
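The shape of a backoff schedule is easy to see in isolation. The function below is our own illustration, not tenacity's formula: each wait doubles up to a cap, and "full jitter" picks a random point below that ceiling so that many clients recovering at once do not retry in lockstep. The parameter names are assumptions.

```python
import random


def backoff_delays(attempts, base=1.0, cap=10.0, rng=random.random):
    """Yield the sleep before each retry: exponential growth, capped, full jitter."""
    for attempt in range(attempts - 1):  # no sleep after the final attempt
        ceiling = min(cap, base * (2 ** attempt))
        # rng() is in [0, 1), so the delay lands somewhere below the ceiling.
        yield rng() * ceiling
```

For example, `list(backoff_delays(4))` gives three delays with ceilings of 1, 2, and 4 seconds. If you are already on tenacity, its `wait_random_exponential` strategy serves the same purpose.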
Testing: Trust But Verify
Contract Tests
Services are independent deployables. Their interface is the contract:
# Contract test for user service
import requests

def test_user_service_contract():
    # Test that service meets contract
    response = requests.get('http://user-service/users/1', timeout=5)
    assert response.status_code == 200
    data = response.json()
    # Verify contract
    assert 'id' in data
    assert 'name' in data
    assert 'email' in data
If user service changes its response shape without telling consumers, contract tests catch it before production does.
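Checking field presence is the minimum. A slightly stricter sketch also checks types and reports every violation at once. The contract dictionary below is an assumption on our part (in particular that `id` is an integer); dedicated tools such as Pact go much further, with consumer-driven contracts.

```python
# Assumed contract for GET /users/<id>; adjust the types to your real payload.
USER_CONTRACT = {'id': int, 'name': str, 'email': str}


def contract_violations(payload, contract):
    """Return a list of human-readable problems; empty means the payload conforms."""
    problems = []
    for field, expected_type in contract.items():
        if field not in payload:
            problems.append(f'missing field: {field}')
        elif not isinstance(payload[field], expected_type):
            actual = type(payload[field]).__name__
            problems.append(f'{field}: expected {expected_type.__name__}, got {actual}')
    return problems
```

Extra fields are deliberately allowed: adding a field is a compatible change, while removing or retyping one is not.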
Integration Tests
# Test service interaction
# (create_user and create_order are test helpers that call the real services)
def test_order_creation_flow():
    # Create user
    user = create_user({'name': 'Test User'})
    # Create order
    order = create_order(user['id'], [{'product_id': 1, 'quantity': 2}])
    # Verify order has user info
    assert order['user_id'] == user['id']
Unit tests prove services work alone. Integration tests prove they work together. You need both, especially when events are involved and timing matters.
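When an assertion depends on an event having been processed, asserting immediately makes the test flaky and sleeping for a fixed time makes it slow. A polling helper is the usual compromise. This is a minimal sketch; the name `wait_until` and its defaults are our own.

```python
import time


def wait_until(condition, timeout=5.0, interval=0.1,
               clock=time.monotonic, sleep=time.sleep):
    """Poll `condition` until it returns a truthy value or `timeout` seconds pass."""
    deadline = clock() + timeout
    while True:
        result = condition()
        if result:
            return result
        if clock() >= deadline:
            raise AssertionError(f'condition not met within {timeout}s')
        sleep(interval)
```

In an integration test this reads as `wait_until(lambda: order_history_exists(user['id']))`, where `order_history_exists` is a hypothetical helper that queries the order service.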
Monitoring: One Dashboard Per Service
You can’t debug what you can’t see:
from flask import Flask
from prometheus_client import Counter, Histogram

app = Flask(__name__)

request_count = Counter('requests_total', 'Total requests', ['service', 'method'])
request_duration = Histogram('request_duration_seconds', 'Request duration', ['service'])

@app.route('/users/<user_id>')
def get_user(user_id):
    # Time the whole handler and count the request
    with request_duration.labels('user-service').time():
        request_count.labels('user-service', 'GET').inc()
        # Process request (get_user_data is the service's own lookup, not shown)
        return get_user_data(user_id)
Each service exports metrics. Each service has dashboards. Each service has alerts. “The app is slow” isn’t actionable. “User service p99 latency doubled” is.
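Why p99 and not the average? A handful of slow requests barely moves a mean but dominates the tail. The sketch below computes a nearest-rank percentile over raw samples purely to illustrate the idea; Prometheus estimates quantiles from histogram buckets instead, and the sample latencies in the usage note are invented.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of values at or below it."""
    if not samples:
        raise ValueError('no samples')
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]
```

For latencies of 10, 20, 30, 40, and 1000 ms, the median is 30 ms and the mean is 220 ms, while the p99 is 1000 ms, which is the number the unlucky user actually experienced.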
What We’d Do Differently
Hindsight is a gift. Here’s ours:
Start smaller. We tried extracting two services in parallel early on. Parallel migrations mean parallel confusion. One service at a time, fully operational, before starting the next.
Spend more time on boundaries. We redrew service lines twice. Upfront domain modeling (even lightweight event storming) would have saved weeks.
Go event-driven from day one. Retrofitting events onto services that initially used synchronous calls meant rewriting integration points. Events first, sync calls only when you genuinely need request-response.
Consider a service mesh earlier. We hand-rolled retries, circuit breakers, and tracing in every service. Istio/Linkerd weren’t production-ready for us in early 2017, but the pain was real.
Contract tests from day one. We added them after a breaking change hit production. Obvious in retrospect.
The Bottom Line
Migrating to microservices is a journey measured in months, not sprints:
- Use the Strangler Fig pattern—no big bang
- Extract read-heavy services first; earn confidence before touching writes
- Draw boundaries around business capabilities, not technical layers
- Database per service; share data via events, not shared tables
- API gateway for clients; service discovery for services
- Circuit breakers, retries, monitoring—distributed systems hygiene
Don’t migrate because microservices are fashionable. Migrate because your monolith’s specific pains outweigh the distributed systems tax.
Our six-month migration wasn’t glamorous. There was no launch day confetti. Just gradually shrinking the monolith, gradually growing confidence, and one day realizing the monolith was mostly gone.
That was worth it. We deploy independently now. We scale what needs scaling. Teams own their services.
We also have more repos, more dashboards, and infinitely more “have you checked the logs in the other service?” conversations.
That’s the deal. We took it. Mostly don’t regret it.
Migration lessons from March 2017, after six months of extracting services from our production monolith. Stack: Python/Flask services, PostgreSQL, Consul, Prometheus, hand-rolled event bus.