# Observability in Microservices: Prometheus and Grafana

Observability is critical for microservices: without it, you are debugging distributed failures blind. After running Prometheus and Grafana in production, here is the setup I've found effective.
## What is Observability?

Observability consists of three pillars:

- **Metrics** - Quantitative measurements
- **Logs** - Event records
- **Traces** - Request flows
## Prometheus Overview

Prometheus is a metrics collection and alerting system:

- Pull-based metrics collection
- Time-series database
- PromQL query language
- Alerting rules
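"Pull-based" means each service exposes a plain-text `/metrics` endpoint that Prometheus scrapes on an interval. As a rough sketch of the text exposition format a scrape returns (the helper function and sample values below are my own, for illustration):

```python
def render_exposition(name: str, help_text: str, metric_type: str,
                      samples: list[tuple[dict, float]]) -> str:
    """Render one metric family in the Prometheus text exposition format."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        if labels:
            label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"

print(render_exposition(
    "http_requests_total", "Total number of HTTP requests", "counter",
    [({"method": "GET", "status": "200"}, 1027.0)]))
```

Each scrape returns the current value of every series; Prometheus stores the samples with timestamps, which is what makes `rate()` queries possible later.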
## Basic Setup

### Prometheus Configuration

```yaml
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'api-service'
    static_configs:
      - targets: ['api-service:8080']
    metrics_path: '/metrics'
```
### Docker Compose

```yaml
version: '3'

services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - grafana-data:/var/lib/grafana

  node-exporter:
    image: prom/node-exporter:latest
    ports:
      - "9100:9100"
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro

volumes:
  prometheus-data:
  grafana-data:
```
## Instrumenting Applications

### Node.js Example

```javascript
const express = require('express');
const client = require('prom-client');

// Create registry
const register = new client.Registry();

// Add default metrics (CPU, memory, event loop lag, ...)
client.collectDefaultMetrics({ register });

// Create custom metrics
const httpRequestDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status'],
  buckets: [0.1, 0.5, 1, 2, 5]
});

const httpRequestTotal = new client.Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'route', 'status']
});

const activeConnections = new client.Gauge({
  name: 'active_connections',
  help: 'Number of active connections'
});

register.registerMetric(httpRequestDuration);
register.registerMetric(httpRequestTotal);
register.registerMetric(activeConnections);

const app = express();

// Middleware: record duration and count for every request
app.use((req, res, next) => {
  const start = Date.now();
  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;
    const labels = {
      method: req.method,
      route: req.route?.path || req.path,
      status: res.statusCode
    };
    httpRequestDuration.observe(labels, duration);
    httpRequestTotal.inc(labels);
  });
  next();
});

// Metrics endpoint for Prometheus to scrape
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

app.listen(8080);
```
### Python Example

```python
from prometheus_client import start_http_server, Counter, Histogram, Gauge

# Metrics
http_requests_total = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

http_request_duration = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration',
    ['method', 'endpoint']
)

active_connections = Gauge(
    'active_connections',
    'Active connections'
)

# Instrument your code
def handle_request(method, endpoint):
    with http_request_duration.labels(method=method, endpoint=endpoint).time():
        # Your code here
        http_requests_total.labels(
            method=method,
            endpoint=endpoint,
            status=200
        ).inc()

# Start metrics server on port 8000
start_http_server(8000)
```
## PromQL Queries

### Basic Queries

```promql
# Rate of HTTP requests per second
rate(http_requests_total[5m])

# Average request duration
rate(http_request_duration_seconds_sum[5m]) /
rate(http_request_duration_seconds_count[5m])

# 95th percentile latency
histogram_quantile(0.95,
  rate(http_request_duration_seconds_bucket[5m])
)

# Error rate
rate(http_requests_total{status=~"5.."}[5m]) /
rate(http_requests_total[5m])
```
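`histogram_quantile` estimates a quantile by finding the cumulative bucket that crosses the target rank and interpolating linearly inside it. A simplified pure-Python version (ignoring the `+Inf` and edge cases Prometheus handles) behaves like this:

```python
def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate quantile q from cumulative histogram buckets.

    buckets: (upper_bound, cumulative_count) pairs sorted by bound,
    mirroring http_request_duration_seconds_bucket{le=...} samples.
    """
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            # Linear interpolation within this bucket, as PromQL does.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# 100 requests: 60 under 0.1s, 90 under 0.5s, all under 1s.
p95 = histogram_quantile(0.95, [(0.1, 60), (0.5, 90), (1.0, 100)])
# → 0.75
```

This is why bucket boundaries matter: the estimate can only be as precise as the buckets you defined at instrumentation time.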
### Advanced Queries

```promql
# CPU usage percentage
100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage
(node_memory_MemTotal_bytes - node_memory_MemFree_bytes) /
node_memory_MemTotal_bytes * 100

# Disk I/O
rate(node_disk_io_time_seconds_total[5m])

# Network traffic
rate(node_network_receive_bytes_total[5m])
```
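`rate` and `irate` both turn a monotonically increasing counter into a per-second rate, but `rate` averages over the whole window while `irate` uses only the last two samples (so it reacts faster and is noisier). A simplified sketch, ignoring counter resets and the extrapolation Prometheus applies at window edges:

```python
def rate(samples: list[tuple[float, float]]) -> float:
    """Average per-second increase over the window; samples are (timestamp, value)."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)

def irate(samples: list[tuple[float, float]]) -> float:
    """Instantaneous rate from the last two samples only."""
    (t0, v0), (t1, v1) = samples[-2], samples[-1]
    return (v1 - v0) / (t1 - t0)

# Counter scraped every 15s over one minute.
samples = [(0, 100), (15, 130), (30, 160), (45, 400), (60, 430)]
rate(samples)   # (430 - 100) / 60 = 5.5 req/s
irate(samples)  # (430 - 400) / 15 = 2.0 req/s
```

The spike between 30s and 45s barely moves `rate` but dominates `irate`; prefer `rate` for alerting and `irate` for fast-moving dashboards.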
## Alerting Rules

```yaml
# alerts.yml
groups:
  - name: api_alerts
    interval: 30s
    rules:
      - alert: HighErrorRate
        expr: |
          rate(http_requests_total{status=~"5.."}[5m]) /
          rate(http_requests_total[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value }}"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            rate(http_request_duration_seconds_bucket[5m])
          ) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High latency detected"
          description: "95th percentile latency is {{ $value }}s"

      - alert: ServiceDown
        expr: up{job="api-service"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service is down"
          description: "{{ $labels.instance }} is down"

  - name: infrastructure_alerts
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage"

      - alert: LowDiskSpace
        expr: |
          (node_filesystem_avail_bytes{mountpoint="/"} /
           node_filesystem_size_bytes{mountpoint="/"}) < 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space"
```
## Alertmanager Configuration

```yaml
# alertmanager.yml
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'cluster']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'slack-notifications'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
    - match:
        severity: warning
      receiver: 'email'

receivers:
  - name: 'slack-notifications'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
        channel: '#alerts'
        title: '{{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'

  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: 'YOUR_PAGERDUTY_KEY'

  - name: 'email'
    email_configs:
      - to: 'team@example.com'
        from: 'alerts@example.com'
        smarthost: 'smtp.example.com:587'
```
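Alertmanager walks the routing tree top to bottom and delivers each alert to the first child route whose `match` labels all agree, falling back to the root receiver. The matching logic above is roughly:

```python
def route_alert(labels: dict, routes: list[dict], default_receiver: str) -> str:
    """Pick a receiver: the first child route whose match labels all agree wins."""
    for route in routes:
        if all(labels.get(k) == v for k, v in route["match"].items()):
            return route["receiver"]
    return default_receiver

# Mirrors the routes in the config above.
routes = [
    {"match": {"severity": "critical"}, "receiver": "pagerduty"},
    {"match": {"severity": "warning"}, "receiver": "email"},
]
route_alert({"alertname": "HighErrorRate", "severity": "critical"},
            routes, "slack-notifications")   # → 'pagerduty'
route_alert({"alertname": "OddAlert", "severity": "info"},
            routes, "slack-notifications")   # → 'slack-notifications'
```

Because matching is first-wins, order your most specific routes first; anything unmatched lands on the root receiver.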
## Grafana Dashboards

### Dashboard JSON

```json
{
  "dashboard": {
    "title": "API Service Dashboard",
    "panels": [
      {
        "title": "Request Rate",
        "targets": [
          {
            "expr": "rate(http_requests_total[5m])",
            "legendFormat": "{{method}} {{route}}"
          }
        ]
      },
      {
        "title": "Error Rate",
        "targets": [
          {
            "expr": "rate(http_requests_total{status=~\"5..\"}[5m]) / rate(http_requests_total[5m])",
            "legendFormat": "Error Rate"
          }
        ]
      },
      {
        "title": "Latency (95th percentile)",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))",
            "legendFormat": "p95"
          }
        ]
      }
    ]
  }
}
```
### Grafana Queries

```promql
# Request rate by endpoint
sum(rate(http_requests_total[5m])) by (endpoint)

# Error rate percentage
sum(rate(http_requests_total{status=~"5.."}[5m])) /
sum(rate(http_requests_total[5m])) * 100

# Active connections
active_connections

# CPU usage
100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
```
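`sum(...) by (endpoint)` collapses series that differ only in other labels into one series per endpoint. A sketch of that grouping (sample values are made up):

```python
from collections import defaultdict

def sum_by(series: list[tuple[dict, float]], group_label: str) -> dict:
    """Sum sample values grouped by one label, like PromQL's sum(...) by (label)."""
    totals = defaultdict(float)
    for labels, value in series:
        totals[labels[group_label]] += value
    return dict(totals)

# Per-(endpoint, method) rates collapse to one value per endpoint.
series = [
    ({"endpoint": "/users", "method": "GET"}, 4.0),
    ({"endpoint": "/users", "method": "POST"}, 1.5),
    ({"endpoint": "/orders", "method": "GET"}, 2.0),
]
sum_by(series, "endpoint")  # → {'/users': 5.5, '/orders': 2.0}
```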
## Service Discovery

### Kubernetes Service Discovery

```yaml
scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
```
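The `__address__` rewrite above is the trickiest part of relabeling: Prometheus joins the `source_labels` values with `;`, applies the regex, and substitutes `$1:$2`. In Python's `re` syntax (where `$1` becomes `\1`) the same replacement looks like this, with hypothetical pod addresses:

```python
import re

def relabel_address(address: str, port_annotation: str) -> str:
    """Mimic the relabel_config: replace host[:port] with host:annotation_port."""
    joined = f"{address};{port_annotation}"  # source_labels joined by ';'
    return re.sub(r"([^:]+)(?::\d+)?;(\d+)", r"\1:\2", joined)

relabel_address("10.0.3.17:3000", "8080")  # → '10.0.3.17:8080'
relabel_address("10.0.3.17", "9100")       # → '10.0.3.17:9100'
```

The optional non-capturing group `(?::\d+)?` is what strips an existing port so the pod's `prometheus.io/port` annotation wins either way.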
## Best Practices

- **Use consistent labels** - Standardize label names across services
- **Avoid high cardinality** - Don't use user IDs or request IDs as label values
- **Set appropriate scrape intervals** - Balance freshness against load
- **Use recording rules** - Pre-compute expensive queries
- **Monitor Prometheus itself** - Track its own performance
- **Set a retention policy** - Configure how long data is kept
- **Alert on alerting** - Monitor alert delivery
- **Document metrics** - Write clear help text and label descriptions
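The recording-rules practice above stores the result of an expensive expression under a new metric name that dashboards can query cheaply. A minimal example (the rule names follow the conventional `level:metric:operations` pattern, but are my own choices):

```yaml
# recording_rules.yml
groups:
  - name: api_recording_rules
    interval: 30s
    rules:
      # Queryable afterwards as job:http_requests:rate5m
      - record: job:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)
      # Pre-computed p95 latency per job
      - record: job:http_request_duration_seconds:p95
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (job, le))
```

Reference the file from `rule_files:` in `prometheus.yml`, the same way as the alert rules.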
## Conclusion

Prometheus and Grafana together provide:

- Comprehensive metrics collection
- A powerful query language
- Effective alerting
- Clear visualizations

Start with basic system and request metrics, then add custom metrics and alerts as you learn what matters for your services. The patterns shown here have held up in production monitoring.

*Originally published in February 2019; covers Prometheus 2.0+ features.*